SGD
Created | |
---|---|
Tags | NN |
SGD is a stochastic approximation of gradient descent optimization. It replaces the actual gradient that is calculated from the entire dataset by an estimate that is calculated from a randomly selected subset of the data.
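To make the "estimate from a random subset" idea concrete, here is a minimal NumPy sketch. The linear model, the synthetic data, the batch size of 32, and the helper name `mse_gradient` are all assumptions chosen for illustration and are not part of these notes. It compares the full-dataset gradient with the average of many mini-batch gradients.

```python
import numpy as np

def mse_gradient(w, X, y):
    """Gradient of 0.5 * mean squared error for the linear model X @ w."""
    return X.T @ (X @ w - y) / len(y)

# Synthetic regression data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=1000)
w = np.zeros(3)

# "Actual" gradient: uses every sample
full_grad = mse_gradient(w, X, y)

# Stochastic estimates: each one uses only 32 randomly chosen samples
batch_grads = []
for _ in range(2000):
    idx = rng.choice(len(y), size=32, replace=False)
    batch_grads.append(mse_gradient(w, X[idx], y[idx]))
mean_batch_grad = np.mean(batch_grads, axis=0)

print("full gradient:          ", full_grad)
print("mean mini-batch gradient:", mean_batch_grad)
```

Each individual mini-batch gradient is noisy, but on average the estimates agree with the full gradient, and each one costs a small fraction of a full pass over the data.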


In ravines, areas where the loss surface curves much more steeply in one dimension than in another, SGD oscillates across the slopes of the ravine while making only hesitant progress along the bottom towards the local optimum. Momentum helps accelerate SGD in the relevant direction and dampens oscillations.
In short, SGD is an optimization that uses randomly selected data at each learning step, which drastically reduces the computational load.
Stochastic Gradient Descent (SGD)
Stochastic Gradient Descent (SGD) is a fundamental optimization algorithm used in machine learning and deep learning for minimizing the loss function of a model. It is a variant of gradient descent, where instead of using the entire dataset to compute the gradient of the loss function (as in batch gradient descent), it uses a single sample or a mini-batch of samples. Each update is therefore far cheaper to compute, which typically makes SGD faster in practice and more scalable to large datasets than batch gradient descent.
Definition and Operation
The basic idea behind SGD involves taking steps proportional to the negative of the gradient (or approximate gradient) of the objective function with respect to the model parameters. The objective function is typically a loss function that measures the difference between the model's predictions and the actual values in the training data. The steps are defined by:
\[\theta = \theta - \eta \cdot \nabla_\theta J(\theta; x^{(i)}, y^{(i)})\]
Where:
- \(\theta\) represents the parameters of the model.
- \(\eta\) is the learning rate, a scalar that determines the size of the steps.
- \(J(\theta; x^{(i)}, y^{(i)})\) is the loss function evaluated on a single training example \((x^{(i)}, y^{(i)})\) or a mini-batch of examples.
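As a worked example of a single update, the sketch below assumes a linear model with the squared-error loss \(J = \tfrac{1}{2}(\theta \cdot x - y)^2\), whose gradient is \((\theta \cdot x - y)\,x\). The loss, the numbers, and the helper name `sgd_step` are illustrative assumptions, not something fixed by these notes.

```python
import numpy as np

def sgd_step(theta, x, y, eta):
    """One SGD update on a single example for J = 0.5 * (theta @ x - y)**2."""
    grad = (theta @ x - y) * x   # gradient of J with respect to theta
    return theta - eta * grad    # step against the gradient

theta = np.array([0.0, 0.0])     # current parameters
x_i = np.array([1.0, 2.0])       # one training example
y_i = 3.0                        # its target value

# prediction = 0, error = -3, gradient = [-3, -6]
# theta_new = [0, 0] - 0.1 * [-3, -6] = [0.3, 0.6]
theta_new = sgd_step(theta, x_i, y_i, eta=0.1)
print(theta_new)
```

After this one step the prediction for the example moves from 0 to 1.5, that is, towards the target 3.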
Pros of SGD
- Efficiency: SGD can be significantly faster than batch gradient descent because it updates the parameters more frequently.
- Scalability: It is well-suited for large datasets and can handle online learning scenarios where the data arrives in a stream.
- Escape Local Minima: Due to its stochastic nature, SGD has a better chance of escaping local minima, making it useful for non-convex optimization problems like training neural networks.
Cons of SGD
- Variance: The stochastic nature of the algorithm means that the updates are noisy, which can lead to significant variance in the training process and may require carefully tuning the learning rate.
- Convergence Rate: While it may quickly decrease the loss function initially, it might oscillate around the minimum towards the end, requiring the use of techniques like learning rate schedules to converge efficiently.
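One common way to tame the late-stage oscillation is to shrink the learning rate over time. The sketch below shows inverse-time decay, one of several possible schedules. The function name `inverse_time_decay` and the values of `eta0` and `decay` are assumptions for illustration only.

```python
def inverse_time_decay(eta0, decay, step):
    """Learning rate after `step` updates: eta0 / (1 + decay * step)."""
    return eta0 / (1.0 + decay * step)

# The step size starts at eta0 and shrinks as training progresses
for step in (0, 100, 1000):
    print(step, inverse_time_decay(eta0=0.1, decay=0.01, step=step))
```

Large early steps keep the fast initial progress, while the smaller later steps reduce how far the noisy updates can push the parameters away from the minimum.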
Variants and Improvements
Several variants and improvements of SGD have been developed to address its shortcomings, including:
- Momentum: Helps accelerate SGD in the relevant direction and dampens oscillations by adding a fraction of the previous update vector to the current update.
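- Nesterov Accelerated Gradient (NAG): A modification of the momentum method that evaluates the gradient at the look-ahead position reached by applying the previous update, rather than at the current parameters. It has stronger theoretical convergence guarantees than classical momentum on convex problems.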
- Adaptive Learning Rate Methods: Algorithms like AdaGrad, RMSprop, and Adam adjust the learning rate for each parameter dynamically, improving performance on complex optimization landscapes.
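The momentum update can be sketched on a toy "ravine". Everything here is an assumption made for illustration: the quadratic loss that is 50 times steeper in one direction, the use of exact rather than stochastic gradients, the learning rate of 0.035, and the momentum coefficient of 0.5. The update follows the description above: a fraction of the previous update vector is added to the current one.

```python
import numpy as np

# Toy ravine: the loss is 50x steeper along theta[1] than along theta[0]
CURVATURE = np.array([1.0, 50.0])

def loss(theta):
    return 0.5 * np.sum(CURVATURE * theta ** 2)

def grad(theta):
    return CURVATURE * theta

def run(lr, momentum, n_steps=100):
    """Gradient descent with optional momentum; returns the final loss."""
    theta = np.array([1.0, 1.0])
    velocity = np.zeros_like(theta)
    for _ in range(n_steps):
        # v_t = momentum * v_{t-1} + lr * gradient
        velocity = momentum * velocity + lr * grad(theta)
        # theta_t = theta_{t-1} - v_t
        theta = theta - velocity
    return loss(theta)

loss_plain = run(lr=0.035, momentum=0.0)     # no momentum
loss_momentum = run(lr=0.035, momentum=0.5)  # with momentum
print("final loss without momentum:", loss_plain)
print("final loss with momentum:   ", loss_momentum)
```

With these particular settings the momentum run ends at a much lower loss after the same number of steps, mainly because it moves faster along the shallow direction of the ravine. Other learning rates or momentum values can behave differently, so treat this as a sketch rather than a general result.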
Python Code Example
Here's a simple example of implementing SGD for linear regression using Python:

    import numpy as np

    # Assuming X_train and y_train are the features and labels respectively
    def sgd(X_train, y_train, learning_rate=0.01, n_epochs=100):
        n_samples, n_features = X_train.shape
        # Initialize weights and bias to zeros
        weights = np.zeros(n_features)
        bias = 0.0
        for _ in range(n_epochs):
            # Visit the samples in a fresh random order each epoch
            for i in np.random.permutation(n_samples):
                xi = X_train[i]
                yi = y_train[i]
                prediction = np.dot(xi, weights) + bias
                error = prediction - yi
                # Compute gradients of the squared error 0.5 * error**2
                weights_gradient = xi * error
                bias_gradient = error
                # Update parameters
                weights -= learning_rate * weights_gradient
                bias -= learning_rate * bias_gradient
        return weights, bias

This code snippet demonstrates the basic implementation of SGD for a linear regression model. In practice, for deep learning tasks, frameworks like TensorFlow and PyTorch provide built-in optimizers, including SGD and its variants, that handle these computations efficiently and offer additional features like momentum and adaptive learning rates.
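The definition earlier also allows a mini-batch of examples per update. The sketch below is one possible mini-batch variant of the same linear regression, run on synthetic data. The function name `minibatch_sgd`, the `batch_size` and `seed` parameters, the default hyperparameters, and the data are all assumptions added for illustration.

```python
import numpy as np

def minibatch_sgd(X, y, learning_rate=0.05, n_epochs=50, batch_size=16, seed=0):
    """Mini-batch SGD for linear regression with a mean-squared-error loss."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    weights = np.zeros(n_features)
    bias = 0.0
    for _ in range(n_epochs):
        # Shuffle once per epoch, then walk through the data in batches
        order = rng.permutation(n_samples)
        for start in range(0, n_samples, batch_size):
            idx = order[start:start + batch_size]
            X_batch, y_batch = X[idx], y[idx]
            error = X_batch @ weights + bias - y_batch
            # Average the per-sample gradients over the batch
            weights -= learning_rate * (X_batch.T @ error) / len(idx)
            bias -= learning_rate * error.mean()
    return weights, bias

# Synthetic data generated from known parameters (illustrative only)
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
true_w = np.array([2.0, -1.0])
true_b = 0.5
y = X @ true_w + true_b + 0.01 * rng.normal(size=200)

w, b = minibatch_sgd(X, y)
print("learned weights:", w, "learned bias:", b)
```

Averaging over a batch reduces the noise of each update compared with the single-sample loop, while still avoiding a full pass over the data for every step.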