
Gradient Descent vs Stochastic GD vs Mini-Batch SGD

Warning: Just in case the terms “partial derivative” or “gradient” sound unfamiliar, I suggest checking out these resources!

https://www.mathsisfun.com/calculus/derivatives-partial.html

https://www.khanacademy.org/math/multivariable-calculus/multivariable-derivatives/gradient-and-directional-derivatives/v/gradient

PS: Batch gradient descent = Gradient descent.

Gradient descent is an iterative algorithm whose purpose is to make changes to a set of parameters (e.g. the weights attached to each feature of a dataset) in hopes of reaching an optimal set of parameters that leads to the lowest loss function value possible.

A loss or cost or objective function (the names are used interchangeably in practice) is the function whose value we seek to minimize. The form of the loss function looks like this:
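For a dataset of n training samples, this is typically written as

min_w f(w), where f(w) = (1/n) Σᵢ₌₁ⁿ L(w; xᵢ, yᵢ)

and L measures how far the model's prediction for the i-th sample is from its target.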

When performing Gradient descent, each time we update the parameters we expect to see the value of f(w) decrease. At each iteration, the gradient of the function with respect to the parameters in w is taken, and the parameters are adjusted so that each change brings us closer to an optimal set of parameters, one that will ultimately lead to the lowest possible loss function value.

The model that we use (e.g. least squares, logistic regression, etc.) needs to learn this optimal parameter set so that its predictions come as close as possible to the targets we aim to predict.

So we take the gradient of the loss function with respect to the parameter vector, w.

w can be a d-dimensional vector (or a matrix) that is updated at each iteration of Gradient descent by stepping in the direction opposite the gradient, moving toward the global minimum. By global minimum, I refer to the point where the value of the function we wish to optimize is smaller than its value at all other feasible points. Below is the general form of how the vector of parameters w is updated:
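w ← w − α ∇f(w)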

The learning rate α determines the size of the steps we take towards reaching this global minimum.

Depending on how the learning rate is set, convergence towards the global optimum can vary.

Picture from: https://www.math.purdue.edu/~nwinovic/deep_learning_optimization.html
I think I have a career in art.

The picture above represents the primary goal of performing Gradient descent: start with an initial vector w, and minimize the distance between w and the optimal vector w*.

Now, there are important properties (e.g. convexity, strict convexity etc.) that certain functions might have that can make descent methods’ lives easier. If you are interested in reading more about this, please check out this link below to lecture notes that I found pretty informative.

https://sboyles.github.io/teaching/ce377k/convexity.pdf

To recap before moving on, these are the steps involved with Gradient descent (a code sketch follows the list):

  1. Start with a random set of parameter values; this can be a weight vector w
  2. Use the entire dataset to compute the partial derivative with respect to each parameter inside w
  3. Update the set of parameters in w
  4. Return the updated w and the value of the loss function
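As a rough sketch of these four steps in NumPy (assuming a least-squares loss for concreteness; any differentiable loss works the same way):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, n_iters=100):
    """Batch gradient descent on a least-squares loss f(w) = (1/n) * ||Xw - y||^2."""
    n, d = X.shape
    w = np.random.randn(d)                     # 1. random initial parameter vector
    losses = []
    for _ in range(n_iters):
        residual = X @ w - y                   # uses the ENTIRE dataset
        losses.append(np.mean(residual ** 2))  # loss value at the current w
        grad = (2.0 / n) * (X.T @ residual)    # 2. partial derivative for each of the d parameters
        w = w - alpha * grad                   # 3. step opposite the gradient, scaled by alpha
    return w, losses                           # 4. return the updated w and the loss values
```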

It’s important to note that when a parameter in the weight vector changes, the resulting change in the loss function should not be larger by a much greater factor. If it is, the model or system is said to be ill-conditioned, and Gradient descent will take many more iterations than warranted to converge to an optimal weight vector. It’s important to track changes in w and in the loss function value when programming Gradient descent, and to specify the number of iterations for which w is updated.

Let’s look at an example. Say we have 40,000 training samples and 20 features, so w holds 20 parameters. Every update uses all 40,000 samples: each of the 20 partial derivatives sums over 40,000 terms, so a single update of w computes 800,000 terms. And that’s just one step on the way to the lowest loss function value!

So, Gradient descent requires calculating the gradient using the entire dataset to perform updates to the model’s parameters. In practice, the computational cost of Gradient descent can be very high and the time taken to reach the optimal weight vector w can be very long.

But what if we didn’t want to use the entire dataset that could contain millions of training samples to update our parameter vector/matrix w?

Stochastic Gradient Descent (SGD) is a variation of Gradient descent that randomly samples one training sample from the dataset to compute the gradient at each iteration. SGD works well because each step is cheap: we use just one data point to calculate the gradient, update the weight vector w, and compute the loss function value. The update equation changes only slightly; the superscript i denotes the single data point used to update w:

w ← w − α ∇f(w; x⁽ⁱ⁾, y⁽ⁱ⁾)

Update rule adapted from: https://www.geeksforgeeks.org/ml-stochastic-gradient-descent-sgd/

At each iteration of the algorithm, Stochastic Gradient Descent takes one sample at random and computes all of the partial derivatives from that single sample. Back to our example from before: one sample at a time is randomly drawn from our n = 40,000 samples, 20 partial derivatives are computed (one per parameter), and this process continues for a specified number of epochs.

Since SGD samples one point at a time from the dataset (when coding this from scratch, make sure to shuffle your dataset at the start of each epoch), it eventually works through the entire dataset. One full pass through the entire dataset is called an epoch. Epochs can be used to compare the rate at which the starting parameters in w approach their optimal values under Gradient Descent versus SGD.

The Stochastic Gradient Descent algorithm (a code sketch follows the list):

  1. Initialize the weight vector
  2. For t = 0,1,…,T, where t denotes an epoch, do the following:

a. Randomly sample a data point from the dataset

b. Use a training sample, and weight to calculate the gradient

c. Update the weight vector

d. Decrease the learning rate/step-size (good in practice)

  3. Return the (near) optimal weight vector
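A minimal sketch of these steps, keeping the least-squares loss from before (the multiplicative step-size decay in step d is just one simple schedule among many):

```python
import numpy as np

def sgd(X, y, alpha=0.01, n_epochs=10):
    """Stochastic gradient descent: one randomly drawn sample per update."""
    n, d = X.shape
    w = np.random.randn(d)                       # 1. initialize the weight vector
    for t in range(n_epochs):                    # 2. loop over epochs t = 0, 1, ..., T
        for i in np.random.permutation(n):       # a. shuffle, then take one sample at a time
            grad = 2 * (X[i] @ w - y[i]) * X[i]  # b. gradient from the single sample (x_i, y_i)
            w = w - alpha * grad                 # c. update the weight vector
        alpha *= 0.9                             # d. decrease the step size each epoch
    return w                                     # 3. return the (near) optimal weight vector
```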

Processing single samples in SGD means performing many tiny matrix-vector (or vector-vector) multiplications, one per sample. These small operations cannot take advantage of fast, vectorized linear algebra the way batched operations can, so the total computation gets expensive.
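You can see this on your own machine with a quick (unscientific) comparison: a Python loop over single samples versus one vectorized product over the same 40,000 x 20 matrix from our running example. Exact timings vary by hardware, but the loop is typically orders of magnitude slower:

```python
import time
import numpy as np

X = np.random.randn(40_000, 20)
w = np.random.randn(20)

start = time.perf_counter()
preds_loop = np.array([X[i] @ w for i in range(X.shape[0])])  # 40,000 tiny vector-vector products
loop_time = time.perf_counter() - start

start = time.perf_counter()
preds_vec = X @ w                                             # one batched matrix-vector product
vec_time = time.perf_counter() - start

print(f"per-sample loop: {loop_time:.4f}s  |  vectorized: {vec_time:.6f}s")
```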

Now, what if we wanted to use not just 1 sample per iteration, but a batch of samples somewhere between a single data point and the entire dataset? Sampling b data points to compute the gradient, with 1 < b < n, is referred to as Mini-Batch Stochastic Gradient Descent (MBSGD). Some may call their algorithmic implementation SGD while simultaneously using mini-batches to solve their optimization problem, so keep an eye out for this!

In MBSGD, the gradient is computed over mini-batches of training samples so that the parameters are updated and the loss function value (hopefully) continues to decrease over each iteration. Advantages to Mini-Batch Stochastic Gradient Descent include:

  • Model parameters are updated more frequently than in full-batch Gradient descent, which often speeds convergence toward optimal parameter values
  • It is computationally more efficient, since MBSGD does not employ the full dataset for every update; using millions of training samples to update potentially thousands of parameters at each step is extremely inefficient

MBSGD is very close to SGD in algorithmic implementation. When using the mini-batch method, randomly sample b data points from the dataset to calculate the gradient. Each epoch in MBSGD consists of n/b iterations (where n is the size of the dataset), each of which uses b samples to compute the gradient and the loss function value.
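Sticking with the same least-squares sketch, the only change from SGD is that each update averages the gradient over a mini-batch of b samples:

```python
import numpy as np

def minibatch_sgd(X, y, alpha=0.01, batch_size=32, n_epochs=10):
    """Mini-batch SGD: b samples per update, so each epoch has roughly n / b iterations."""
    n, d = X.shape
    w = np.random.randn(d)
    for _ in range(n_epochs):
        idx = np.random.permutation(n)                        # reshuffle at the start of each epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]             # the next b sample indices
            Xb, yb = X[batch], y[batch]
            grad = (2.0 / len(batch)) * Xb.T @ (Xb @ w - yb)  # gradient averaged over the mini-batch
            w = w - alpha * grad
    return w
```

With the running example of n = 40,000 and b = 32, each epoch performs 1,250 parameter updates rather than the single update per pass that full-batch Gradient descent gets.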

For deep learning tasks, specify the mini-batch size before training so that batches fit the memory constraints of your CPU and GPU hardware. Mini-batch sizes such as 8, 32, 64, 128, and so forth are good choices when implementing MBSGD. Always keep track of how the loss function is changing.

Remember, in each of these Gradient descent methods we desperately want to minimize the loss function. So if the function’s value hits a wall after a certain number of epochs in the case of MBSGD, it may be necessary to fine-tune the batch size and/or several other values (e.g. learning rate, regularization/weight decay, momentum, etc.).

I hope you enjoyed reading this. Happy learning!

