Gradient Descent


Gradient Descent Overview

Gradient Descent is a first-order iterative optimization algorithm for finding the minimum of a function. In machine learning, it is used to minimize a model's cost function, which measures the difference between the model's predictions and the actual data. By iteratively moving in the direction of the steepest decrease in the cost function, gradient descent seeks the model parameters (weights) that minimize the cost.

How Gradient Descent Works

  1. Initialization: Start with an initial set of parameters (weights). These can be initialized randomly or set to specific values.
  2. Compute the Gradient: Calculate the gradient of the cost function with respect to each parameter. The gradient is a vector pointing in the direction of the steepest increase of the cost function; its magnitude tells how fast the cost function changes if you move in that direction.
  3. Update the Parameters: Move the parameters a small step in the opposite direction of the gradient. The step size is controlled by the learning rate, a hyperparameter that determines how much the parameters are adjusted on each iteration.
  4. Repeat: Repeat steps 2–3 until the gradient is close to zero, indicating that a minimum (possibly a local minimum) of the cost function has been found.
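The four steps above can be sketched as a generic loop. The function names, tolerance, and the quadratic test function below are illustrative assumptions, not part of the original notes:

```python
import numpy as np

def gradient_descent_loop(grad_fn, theta0, learning_rate=0.1, tol=1e-6, max_iters=1000):
    """Generic gradient descent: step until the gradient norm is near zero."""
    theta = np.asarray(theta0, dtype=float)   # 1. Initialization
    for _ in range(max_iters):
        grad = grad_fn(theta)                 # 2. Compute the gradient
        theta = theta - learning_rate * grad  # 3. Update the parameters
        if np.linalg.norm(grad) < tol:        # 4. Stop when gradient is near zero
            break
    return theta

# Minimize J(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3)
theta_min = gradient_descent_loop(lambda t: 2 * (t - 3), theta0=[0.0])
```

Here `theta_min` converges toward 3, the minimizer of the test function.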

Mathematical Formulation

Given a cost function \(J(\theta)\), where \(\theta\) represents the parameters of the model, the update rule for gradient descent is:

\[
\theta_{\text{new}} = \theta_{\text{old}} - \alpha \cdot \nabla J(\theta_{\text{old}})
\]

where \(\alpha\) is the learning rate and \(\nabla J(\theta_{\text{old}})\) is the gradient of the cost function evaluated at the current parameters.
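As a concrete illustration of one application of the update rule, take \(J(\theta) = \theta^2\), so \(\nabla J(\theta) = 2\theta\) (the starting value and learning rate below are chosen arbitrarily):

```python
# One step of: theta_new = theta_old - alpha * grad_J(theta_old)
# for J(theta) = theta**2, whose gradient is 2 * theta.
theta_old = 4.0
alpha = 0.1                       # learning rate (illustrative value)
grad = 2 * theta_old              # gradient of J at theta_old -> 8.0
theta_new = theta_old - alpha * grad  # 4.0 - 0.1 * 8.0 = 3.2
```

One step moves \(\theta\) from 4.0 to 3.2, closer to the minimum at 0.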

Types of Gradient Descent

  1. Batch Gradient Descent: Computes the gradient using the entire training set on every update. Stable, but slow on large datasets.
  2. Stochastic Gradient Descent (SGD): Updates the parameters using one training example at a time. Noisy but fast, and the noise can help escape shallow local minima.
  3. Mini-Batch Gradient Descent: Updates on small random subsets of the data, combining the stability of batch updates with the speed of SGD.

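The common variants — batch, stochastic, and mini-batch — differ only in how many examples the gradient is averaged over. A hedged sketch (the synthetic data, batch size, and helper function are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # 100 examples, 3 features
y = X @ np.array([1.0, -2.0, 0.5])            # synthetic linear targets

def mse_gradient(w, X, y):
    """Gradient of mean squared error (1/N) * sum((X @ w - y)^2) w.r.t. w."""
    return 2 * X.T @ (X @ w - y) / len(y)

w = np.zeros(3)
grad_batch = mse_gradient(w, X, y)                  # batch: all 100 examples
i = rng.integers(len(y))
grad_sgd = mse_gradient(w, X[i:i+1], y[i:i+1])      # stochastic: one example
idx = rng.choice(len(y), size=32, replace=False)
grad_mini = mse_gradient(w, X[idx], y[idx])         # mini-batch: 32 examples
```

All three produce a gradient of the same shape; they trade off cost per update against the noise in each estimate.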
Python Example

Here's a simple Python example implementing gradient descent to find the minimum of the quadratic function \(f(x) = x^2\):

```python
def gradient_descent(starting_point, learning_rate, num_iterations):
    x = starting_point
    for i in range(num_iterations):
        gradient = 2 * x  # derivative of f(x) = x^2
        x = x - learning_rate * gradient
        print(f"Iteration {i+1}: x = {x}, f(x) = {x**2}")
    return x

# Example usage
minimum = gradient_descent(starting_point=10, learning_rate=0.1, num_iterations=20)
```
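The learning rate matters for this example: for \(f(x) = x^2\), each update multiplies \(x\) by \((1 - 2\alpha)\), so the iterates converge only when that factor has magnitude below 1. An illustrative check (the specific rates below are assumptions):

```python
def gd_final_x(starting_point, learning_rate, num_iterations):
    # Same update as the example above, without per-iteration printing.
    x = starting_point
    for _ in range(num_iterations):
        x -= learning_rate * 2 * x  # derivative of f(x) = x^2
    return x

small = gd_final_x(10, 0.1, 20)  # factor 0.8 per step: shrinks toward 0
large = gd_final_x(10, 1.1, 20)  # factor -1.2 per step: |x| grows, diverges
```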

Pros and Cons

Pros:

- Simple to implement and applicable to any differentiable cost function.
- Scales well to high-dimensional parameter spaces such as neural networks.
- Low memory cost: only first-order gradient information is needed.

Cons:

- Sensitive to the learning rate: too small converges slowly, too large can diverge.
- Can get stuck in local minima or saddle points on non-convex cost functions.
- Batch updates become expensive on large datasets.

Applications

Gradient descent is ubiquitous in machine learning for training almost any kind of model, from linear regression to deep neural networks, due to its simplicity and effectiveness in navigating complex, high-dimensional parameter spaces to find optimal model parameters.

Gradient descent is an optimization algorithm that minimizes a function by iteratively moving in the direction of steepest descent. In machine learning, we use gradient descent to update the parameters of our model; parameters refer to the coefficients in linear regression and the weights in neural networks. The update step below fits a line \(y = mx + b\) by least squares:

```python
def gradient_descent_update(m, b, x, y, learning_rate):
    """One gradient descent step for least-squares fitting of y = m*x + b."""
    N = len(x)
    # Partial derivatives of MSE = (1/N) * sum((y_i - (m*x_i + b))^2)
    m_gradient = sum(-2 * x[i] * (y[i] - (m * x[i] + b)) for i in range(N)) / N
    b_gradient = sum(-2 * (y[i] - (m * x[i] + b)) for i in range(N)) / N

    m_updated = m - learning_rate * m_gradient
    b_updated = b - learning_rate * b_gradient

    return m_updated, b_updated
```
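A possible usage sketch: repeatedly applying this update on a toy dataset drawn from \(y = 2x + 1\). The data, learning rate, and iteration count are assumptions, and the function is repeated so the snippet runs on its own:

```python
def gradient_descent_update(m, b, x, y, learning_rate):
    # Repeated from above so this snippet is self-contained.
    N = len(x)
    m_gradient = sum(-2 * x[i] * (y[i] - (m * x[i] + b)) for i in range(N)) / N
    b_gradient = sum(-2 * (y[i] - (m * x[i] + b)) for i in range(N)) / N
    return m - learning_rate * m_gradient, b - learning_rate * b_gradient

x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 5.0, 7.0, 9.0]  # exactly y = 2x + 1

m, b = 0.0, 0.0
for _ in range(2000):
    m, b = gradient_descent_update(m, b, x, y, learning_rate=0.05)

# m and b approach the true slope 2 and intercept 1
```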