Machine Learning: Gradient Descent

What gradient descent is and why we need it, plus the different types of gradient descent with their advantages and disadvantages.

Renu Khandelwal
5 min read · Oct 8, 2018


What is Gradient Descent?

Gradient descent is an iterative optimization algorithm used in machine learning to minimize the cost function, so that the model makes accurate predictions.

The cost function (C), also called the loss function, measures the difference between the actual output and the output predicted by the model. For this discussion we treat the cost function as a convex function.
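
To make this concrete, here is a small sketch (not from the article) of one common cost function, the mean squared error; the choice of MSE and the sample values are assumptions for illustration.

```python
import numpy as np

def mse_cost(y_actual, y_predicted):
    # Mean squared error: the average squared difference between the
    # actual outputs and the model's predicted outputs.
    return np.mean((y_actual - y_predicted) ** 2)

# Example: a small cost means the predictions are close to the actual values.
print(mse_cost(np.array([1.0, 2.0, 3.0]), np.array([1.1, 1.9, 3.2])))  # ≈ 0.02
```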

Why do we need Gradient Descent?

In a neural network, our goal is to train the model to find the optimized weights (w) that make the best predictions.

We obtain these optimized weights using gradient descent.

How can we find the optimized weights?

This is explained best with a classic mountain problem.

Gradient descent is similar to the mountaineering problem of finding the lowest valley of a mountain.

In the mountaineering problem, we want to reach the lowest point of the mountain while having zero visibility.

We do not know whether we are at the top of the mountain, somewhere in the middle, or very close to the bottom.

Our best option is to check the terrain near us and identify the direction in which we need to descend to reach the bottom. We do this iteratively until there is no further scope to descend, which is when we have reached the bottom.

Later in the article we will discuss what we can do if we feel we have reached the bottom (a local minimum) but the mountain has an even lower point (the global minimum).

Gradient descent helps us solve the same problem mathematically.

We randomly initialize all the weights of the neural network to values close to zero, but not exactly zero.

We calculate the gradient, ∂C/∂w, which is the partial derivative of the cost with respect to the weight.

Each weight is then updated using the rule w := w − α · ∂C/∂w, where w is the weight of a neuron, α is the learning rate, C is the cost, and ∂C/∂w is the gradient. The learning rate α controls how far the weights are adjusted in the direction opposite to the gradient.

We need to update the weights of all the neurons simultaneously.
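
As a minimal sketch of these update steps: the toy cost C(w) = (w − 3)², the learning rate, and the iteration count below are assumptions chosen for illustration, not part of the article.

```python
def cost(w):
    # Toy convex cost with its minimum at w = 3
    return (w - 3.0) ** 2

def gradient(w):
    # Partial derivative of the cost with respect to the weight: dC/dw = 2(w - 3)
    return 2.0 * (w - 3.0)

w = 0.01       # initialize the weight close to zero, but not zero
alpha = 0.1    # learning rate

for step in range(100):
    w = w - alpha * gradient(w)   # update rule: w := w - α ∂C/∂w

print(f"optimized weight ≈ {w:.4f}, cost ≈ {cost(w):.6f}")  # w approaches 3
```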

Learning Rate

The learning rate controls how much we adjust the weights with respect to the loss gradient. It is a hyperparameter that is set before training and is typically tuned by experimentation.

The lower the value of the learning rate, the slower the convergence to the global minimum.

A learning rate that is too high can cause gradient descent to overshoot the minimum and fail to converge.

Since our goal is to minimize the cost function and find the optimized weights, we run multiple iterations, updating the weights and recalculating the cost each time, until we arrive at the minimum cost, as illustrated in the sketch below.
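
A small sketch comparing learning rates on the same toy cost C(w) = (w − 3)² used above; the specific rates and step count are assumptions for demonstration.

```python
def run_gradient_descent(alpha, steps=50, w=0.01):
    # Repeatedly apply w := w - α ∂C/∂w for the toy cost C(w) = (w - 3)^2
    for _ in range(steps):
        w = w - alpha * 2.0 * (w - 3.0)
    return w

for alpha in (0.001, 0.1, 1.1):
    print(f"alpha={alpha:<6} final weight={run_gradient_descent(alpha):.4f}")

# alpha=0.001 converges very slowly (still far from 3 after 50 steps),
# alpha=0.1 converges to ~3, and alpha=1.1 overshoots and diverges.
```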

Just as a mountain can have more than one valley, the cost function over the weights can have both local and global minimum points.

The global minimum is the lowest point over the entire domain, while a local minimum is a sub-optimal point that is lower than its immediate surroundings but is not the global minimum.

How can we avoid getting stuck in a local minimum and instead find the optimized weights at the global minimum?

To answer that, let's first understand the different types of gradient descent.

The different types of gradient descent are:

  • Batch Gradient Descent
  • Stochastic Gradient Descent
  • Mini-batch Gradient Descent

Batch Gradient Descent

In batch gradient descent, we use the entire dataset to compute the gradient of the cost function in each iteration and then update the weights.

Since we use the entire dataset to compute the gradient, convergence is slow.

If the dataset is huge, containing millions or billions of data points, this is both memory intensive and computationally intensive.

Batch gradient descent uses the entire dataset in every iteration of gradient descent.
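
A minimal sketch of batch gradient descent on a toy linear model y = w·x + b with a mean-squared-error cost; the synthetic data, learning rate, and epoch count are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=100)
y = 4.0 * X + 2.0 + rng.normal(0, 0.1, size=100)   # true w = 4, b = 2, plus noise

w, b = 0.01, 0.01   # initialize close to zero, but not zero
alpha = 0.5

for epoch in range(500):
    error = (w * X + b) - y
    # Gradients of the MSE cost, computed over the ENTIRE dataset each iteration
    grad_w = 2.0 * np.mean(error * X)
    grad_b = 2.0 * np.mean(error)
    w -= alpha * grad_w
    b -= alpha * grad_b

print(f"w ≈ {w:.3f}, b ≈ {b:.3f}")   # should approach 4 and 2
```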

Advantages of Batch Gradient Descent

  • Theoretical analysis of the weights and convergence rates is easy to understand

Disadvantages of Batch Gradient Descent

  • Performs redundant computation, since gradients for similar training examples are recomputed in every iteration on large datasets
  • Can be very slow and intractable, as large datasets may not fit in memory
  • As we take the entire dataset for each computation, we cannot update the model's weights on the fly for new data

Stochastic Gradient Descent

In stochastic gradient descent, we use a single data point (example) to calculate the gradient and update the weights in every iteration.

We first need to shuffle the dataset so that the examples are presented in random order. Because the weights are updated after every single example, the weight updates and the cost function will be noisy, jumping all over the place.

This random sampling helps the algorithm escape local minima and increases the chance of arriving at the global minimum.

Learning is much faster, and convergence is quick for very large datasets.
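
A minimal sketch of stochastic gradient descent for the same toy linear model; the data, learning rate, and epoch count are again assumptions, and the result is noisier than batch gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=100)
y = 4.0 * X + 2.0 + rng.normal(0, 0.1, size=100)

w, b = 0.01, 0.01
alpha = 0.05

for epoch in range(20):
    order = rng.permutation(len(X))          # shuffle the dataset each epoch
    for i in order:
        error = (w * X[i] + b) - y[i]        # gradient from a SINGLE example
        w -= alpha * 2.0 * error * X[i]
        b -= alpha * 2.0 * error

print(f"w ≈ {w:.3f}, b ≈ {b:.3f}")           # close to 4 and 2, but noisy
```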

Advantages of Stochastic Gradient Descent

  • Learning is much faster than batch gradient descent
  • Redundancy in computation is removed, as we take one training sample at a time
  • Weights can be updated on the fly for new data samples, as we take one training sample at a time for computation

Disadvantages of Stochastic Gradient Descent

  • As we update the weights very frequently, the cost function fluctuates heavily

Mini-batch Gradient Descent

Mini-batch gradient descent is a variation of stochastic gradient descent where, instead of a single training example, a mini-batch of samples is used to compute each update.

Mini-batch gradient descent is widely used because it converges faster than batch gradient descent and is more stable than stochastic gradient descent.

Batch size can vary depending on the dataset.

Because each batch contains several different samples, the noise (i.e., the variance of the weight updates) is reduced, which helps convergence become faster and more stable.
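
A minimal sketch of mini-batch gradient descent for the same toy linear model; the batch size, learning rate, and epoch count are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=100)
y = 4.0 * X + 2.0 + rng.normal(0, 0.1, size=100)

w, b = 0.01, 0.01
alpha, batch_size = 0.2, 16

for epoch in range(100):
    order = rng.permutation(len(X))                  # shuffle each epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]        # one mini-batch of samples
        error = (w * X[idx] + b) - y[idx]
        w -= alpha * 2.0 * np.mean(error * X[idx])   # gradient averaged over the batch
        b -= alpha * 2.0 * np.mean(error)

print(f"w ≈ {w:.3f}, b ≈ {b:.3f}")                   # close to 4 and 2
```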

Advantages of Mini-batch Gradient Descent

  • Reduces the variance of the parameter updates, which leads to more stable convergence
  • Speeds up learning
  • Helps estimate the approximate location of the actual minimum

Disadvantages of Mini-batch Gradient Descent

  • The loss is computed for each mini-batch, so the total loss must be accumulated across all mini-batches

This will give you a basic idea about gradient descent.
