SGD


SGD is a stochastic approximation of gradient descent optimization. It replaces the actual gradient, which is computed from the entire dataset, with an estimate computed from a randomly selected subset of the data.

Reference: Advantages and Disadvantages of Stochastic Gradient Descent (Asquero): https://www.asquero.com/article/advantages-and-disadvantages-of-stochastic-gradient-descent/

SGD oscillates across the slopes of the ravine while only making hesitant progress along the bottom towards the local optimum. Momentum helps accelerate SGD in the relevant direction and dampens these oscillations.
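
The sketch below illustrates that idea; it is not from the original note, and the gradient function grad and the starting parameters theta are assumed placeholders. A velocity term accumulates past gradients, so consistent directions are amplified while oscillating ones cancel out.

import numpy as np

# Minimal sketch of one SGD-with-momentum step; `grad` is an assumed callable
# returning the gradient of the loss at `theta`.
def sgd_momentum_step(theta, velocity, grad, learning_rate=0.01, momentum=0.9):
    # Blend the previous velocity with the new (negative) gradient step
    velocity = momentum * velocity - learning_rate * grad(theta)
    # Move the parameters along the smoothed direction
    theta = theta + velocity
    return theta, velocity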

An optimization approach that uses randomly sampled data during learning to drastically reduce the computational load.

Stochastic Gradient Descent (SGD)

Stochastic Gradient Descent (SGD) is a fundamental optimization algorithm used in machine learning and deep learning for minimizing the loss function of a model. It is a variant of gradient descent, where instead of using the entire dataset to compute the gradient of the loss function (as in batch gradient descent), it uses a single sample or a mini-batch of samples. This approach makes SGD much faster and more scalable to large datasets compared to batch gradient descent.
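
To make that contrast concrete, the sketch below (not part of the original text; the toy arrays X, y and weights w are assumptions) computes the full-batch gradient of a mean-squared-error linear model next to a mini-batch estimate of the same quantity.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                     # assumed toy features
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=1000)
w = np.zeros(3)                                    # current model weights

# Full-batch gradient of the mean squared error: uses every sample
full_grad = 2 * X.T @ (X @ w - y) / len(X)

# Mini-batch estimate: the same formula evaluated on a random subset
idx = rng.choice(len(X), size=32, replace=False)
mini_grad = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)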

Definition and Operation

The basic idea behind SGD involves taking steps proportional to the negative of the gradient (or approximate gradient) of the objective function with respect to the model parameters. The objective function is typically a loss function that measures the difference between the predicted values by the model and the actual values in the training data. The steps are defined by:

\theta = \theta - \eta \nabla_\theta J(\theta; x^{(i)}, y^{(i)})

Where:

- θ (theta) are the model parameters (weights and biases) being learned.
- η (eta) is the learning rate, which controls the step size.
- J(θ; x^{(i)}, y^{(i)}) is the loss evaluated on a single training example x^{(i)} with label y^{(i)}.
- ∇_θ J is the gradient of that loss with respect to the parameters.
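
As a purely illustrative worked step (the numbers below are made up, not from the note), one such update for a squared-error loss looks like this:

import numpy as np

theta = np.array([0.5, -0.2])   # current parameters (made-up values)
eta = 0.1                       # learning rate
x_i = np.array([1.0, 2.0])      # one training example
y_i = 1.5                       # its label

# Squared-error loss J = (theta . x_i - y_i)**2, so its gradient w.r.t. theta is:
grad = 2 * (theta.dot(x_i) - y_i) * x_i

# The SGD update from the formula above
theta = theta - eta * grad      # -> array([0.78, 0.36])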

Pros of SGD

- Each update uses only one sample (or a small mini-batch), so iterations are cheap and the method scales to very large datasets.
- Parameters start improving immediately, without waiting for a full pass over the data.
- The noise in the gradient estimate can help the optimizer escape shallow local minima.

Cons of SGD

- The noisy updates make the loss oscillate rather than decrease smoothly, especially near the optimum (as in the ravine example above).
- Convergence is sensitive to the choice of learning rate and its schedule.
- Processing one sample at a time makes poor use of vectorized hardware compared with mini-batch or batch gradient descent.

Variants and Improvements

Several variants and improvements of SGD have been developed to address its shortcomings, including:

- Momentum and Nesterov accelerated gradient, which smooth the update direction to dampen oscillations.
- AdaGrad, RMSProp, and Adam, which adapt the learning rate per parameter (Adam also incorporates momentum).
- Learning-rate schedules (such as step decay or cosine annealing) that shrink the step size over training.

Python Code Example

Here's a simple example of implementing SGD for linear regression using Python:

import numpy as np

# Assuming X_train and y_train are the features and labels respectively
def sgd(X_train, y_train, learning_rate=0.01, n_epochs=100):
    n_samples, n_features = X_train.shape
    # Initialize weights and bias to zeros
    weights = np.zeros(n_features)
    bias = 0

    for _ in range(n_epochs):
        # Visit the samples in a random order each epoch (the "stochastic" part)
        for i in np.random.permutation(n_samples):
            xi = X_train[i]
            yi = y_train[i]
            prediction = np.dot(xi, weights) + bias
            error = prediction - yi

            # Compute gradients
            weights_gradient = xi * error
            bias_gradient = error

            # Update parameters
            weights -= learning_rate * weights_gradient
            bias -= learning_rate * bias_gradient

    return weights, bias
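
As a usage sketch (the synthetic data below is an assumption added for illustration, not part of the original snippet):

# Synthetic data for demonstration only
rng = np.random.default_rng(42)
X_train = rng.normal(size=(200, 2))
y_train = 3.0 * X_train[:, 0] - 2.0 * X_train[:, 1] + 1.0

weights, bias = sgd(X_train, y_train, learning_rate=0.01, n_epochs=100)
print(weights, bias)  # should approach [3.0, -2.0] and 1.0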

This code snippet demonstrates the basic implementation of SGD for a linear regression model. In practice, for deep learning tasks, frameworks like TensorFlow and PyTorch provide built-in optimizers, including SGD and its variants, that handle these computations efficiently and offer additional features like momentum and adaptive learning rates.
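
As a rough sketch of how such a built-in optimizer is typically used (the model, data, and hyperparameters below are placeholders, not from the original note), PyTorch's torch.optim.SGD with a momentum term might be wired up like this:

import torch
import torch.nn as nn

model = nn.Linear(2, 1)                      # placeholder model
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

X = torch.randn(200, 2)                      # placeholder data
y = 3.0 * X[:, :1] - 2.0 * X[:, 1:] + 1.0

for epoch in range(100):
    optimizer.zero_grad()                    # clear gradients from the previous step
    loss = loss_fn(model(X), y)              # forward pass and loss
    loss.backward()                          # backpropagate to compute gradients
    optimizer.step()                         # apply the SGD (with momentum) update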