Machine Learning: Gradient Descent

What gradient descent is and why we need it, plus the different types of gradient descent and their advantages and disadvantages.

A basic understanding of neural networks will be helpful.

What is Gradient Descent?

The cost function (C), also called the loss function, measures the difference between the model's predicted output and the actual output. For simple models such as linear regression with mean squared error, the cost function is convex, meaning it has a single minimum. Gradient descent is an iterative optimization algorithm that minimizes this cost by repeatedly moving the weights in the direction of the negative gradient.
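As a concrete illustration, here is a minimal sketch of a mean squared error cost in Python; the function and variable names are just for this example and are not from any particular library:

```python
import numpy as np

def mse_cost(y_true, y_pred):
    """Mean squared error: average squared difference between
    the actual outputs and the model's predicted outputs."""
    return np.mean((y_true - y_pred) ** 2)

# Example: three predictions compared against the actual values
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.5, 5.5, 8.0])
print(mse_cost(y_true, y_pred))  # 0.5
```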

Why do we need Gradient Descent?

The model's predictions depend on its weights, and gradient descent is how we obtain the optimized weights that minimize the cost function.

How can we find the optimized weights?

Gradient descent is similar to the mountaineering problem of finding the lowest valley of a mountain.

In the mountaineering problem, we want to reach the lowest point of the mountain, but we have zero visibility.

We do not know whether we are at the top of the mountain, somewhere in the middle, or very close to the bottom.

Our best option is to check the terrain near us and identify the direction in which we should descend. We repeat this iteratively until there is no more scope to descend, and that is when we have reached the bottom.

Later in the article we will discuss what we can do if we feel we have reached the bottom (a local minimum point) while there is another, lower point on the mountain (the global minimum point).

Gradient descent helps us solve the same problem mathematically.

We randomly initialize all the weights of the neural network to small values close to zero, but not exactly zero.

We then calculate the gradient ∂C/∂w, the partial derivative of the cost with respect to each weight.

α is the learning rate; it controls how much the weights are adjusted in the direction of the negative gradient.

The update rule is:

w := w − α · ∂C/∂w

where w is a weight of a neuron, α is the learning rate, C is the cost, and ∂C/∂w is the gradient.

We need to update the weights of all the neurons simultaneously, using the gradients computed from the same (old) weights.
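Here is a minimal sketch of one such update step, assuming we already have the gradient ∂C/∂w at the current weights; the cost C(w) = (w₁² + w₂²)/2 is only a toy example:

```python
import numpy as np

def gradient_step(w, grad, alpha):
    """One gradient descent update: w := w - alpha * dC/dw.
    Updating the whole weight vector in one operation keeps
    the update simultaneous for all weights."""
    return w - alpha * grad

# Toy cost C(w) = (w1^2 + w2^2) / 2, whose gradient is simply w
w = np.array([4.0, -2.0])
alpha = 0.1
grad = w                            # dC/dw at the current weights
w = gradient_step(w, grad, alpha)
print(w)                            # [ 3.6 -1.8]
```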

Learning Rate

The lower the learning rate, the slower the convergence toward the minimum.

A learning rate that is too high can make the updates overshoot the minimum, so gradient descent may fail to converge or even diverge.
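As a rough demonstration of both effects, the sketch below runs gradient descent on the simple one-dimensional cost C(w) = w² with a small, a moderate, and an overly large learning rate; the specific values are arbitrary and chosen only for illustration:

```python
def run_gd(alpha, w=5.0, steps=20):
    """Gradient descent on C(w) = w^2, whose gradient is 2w."""
    for _ in range(steps):
        w = w - alpha * 2 * w
    return w

for alpha in (0.01, 0.1, 1.1):
    print(alpha, run_gd(alpha))
# alpha=0.01 -> w is still far from the minimum at 0 (slow convergence)
# alpha=0.1  -> w is very close to 0 (converged)
# alpha=1.1  -> |w| has exploded (the updates overshoot and diverge)
```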

Since our goal is to minimize the cost function and find the optimized values for the weights, we run multiple iterations, updating the weights and recalculating the cost each time, until the cost reaches a minimum.


Just as a mountain can have two different low points, the cost as a function of the weights can have both local and global minimum points.

The global minimum is the lowest point over the entire domain, while a local minimum is a sub-optimal point that is lower than its surroundings but not the lowest point overall.

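To make this concrete, the sketch below (a toy example, not from the article) runs plain gradient descent on a non-convex one-dimensional cost with one local and one global minimum; the starting point alone decides which minimum we end up in:

```python
def cost(x):
    """Non-convex cost with a local minimum near x = +1
    and a lower, global minimum near x = -1."""
    return (x**2 - 1)**2 + 0.3 * x

def grad(x):
    return 4 * x * (x**2 - 1) + 0.3

def descend(x, alpha=0.05, steps=200):
    for _ in range(steps):
        x = x - alpha * grad(x)
    return x

for start in (2.0, -2.0):
    x = descend(start)
    print(f"start={start:+.1f} -> x={x:+.3f}, cost={cost(x):+.3f}")
# Starting at +2.0 we get stuck in the local minimum (higher cost);
# starting at -2.0 we reach the global minimum (lower cost).
```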

How can we avoid local minima and always obtain the optimized weights based on the global minimum?

To answer that, let's first understand the different types of gradient descent.

The different types of gradient descent are:

  • Batch Gradient Descent
  • Stochastic Gradient Descent
  • Mini batch Gradient Descent

Batch Gradient Descent

Batch gradient descent computes the gradient over the entire training dataset for every single update. Because each update requires a full pass over the data, convergence is slow.

If the dataset is huge, containing millions or billions of data points, this is both memory- and computationally intensive.

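As an illustration, here is a minimal sketch of batch gradient descent for linear regression with a mean squared error cost; the toy data and hyperparameter values are assumptions chosen just for this example:

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.1, epochs=100):
    """Batch gradient descent: every update uses the gradient
    computed over the entire dataset."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        grad = (2.0 / n) * X.T @ (X @ w - y)   # MSE gradient over all n samples
        w = w - alpha * grad
    return w

# Toy data generated from y = 2*x1 + 3*x2, so w should end up near [2, 3]
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([2.0, 3.0])
print(batch_gradient_descent(X, y))   # close to [2. 3.]
```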

Advantages of Batch Gradient Descent

  • Theoretical analysis of the weight updates and convergence rates is easy to understand

Disadvantages of Batch Gradient Descent

  • Performs redundant computations, since gradients for similar training examples are recomputed in every iteration on large datasets
  • Can be very slow and intractable, as large datasets may not fit in memory
  • Because every update uses the entire dataset, the weights cannot be updated on the fly as new data arrives

Stochastic Gradient Descent

In stochastic gradient descent, we first shuffle the dataset so that the training examples arrive in a completely random order. Because the weights are updated after each individual example, the updates and the cost function are noisy, jumping around as they decrease.


This random sampling adds noise that can help the weights escape a local minimum and move toward the global minimum.

Learning is much faster and convergence is quick for a very large dataset.
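Here is a minimal sketch of stochastic gradient descent for the same kind of linear regression problem as above, again with illustrative toy data and hyperparameters:

```python
import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.01, epochs=20):
    """SGD: shuffle the dataset every epoch and update the weights
    after each individual training example."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in np.random.permutation(n):      # random order each epoch
            error = X[i] @ w - y[i]
            w = w - alpha * 2 * error * X[i]    # gradient from one sample
    return w

# Same toy setup: y = 2*x1 + 3*x2, so w should end up near [2, 3]
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = X @ np.array([2.0, 3.0])
print(stochastic_gradient_descent(X, y))   # noisy path, but ends near [2, 3]
```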

Advantages of Stochastic Gradient Descent

  • Learning is much faster than with batch gradient descent
  • Redundancy in computation is removed, as we process one training sample at a time
  • Weights can be updated on the fly as new data samples arrive, since only one sample is needed per update

Disadvantages of Stochastic Gradient Descent

  • Because the weights are updated so frequently, the cost function fluctuates heavily from one update to the next

Mini Batch Gradient Descent

Mini batch gradient descent updates the weights after each small batch of samples rather than after the full dataset or a single example. It is widely used because it converges faster and is more stable.

Batch size can vary depending on the dataset.

Because each update averages the gradient over a batch of different samples, it reduces the noise (the variance of the weight updates), which helps achieve a faster, more stable convergence.

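A minimal sketch of mini batch gradient descent in the same setting; the batch size of 32 and the other values are assumptions for illustration only:

```python
import numpy as np

def minibatch_gradient_descent(X, y, alpha=0.05, epochs=50, batch_size=32):
    """Mini batch gradient descent: each update averages the gradient
    over a small random batch, which smooths out the noise of pure SGD."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        order = np.random.permutation(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            grad = (2.0 / len(idx)) * Xb.T @ (Xb @ w - yb)
            w = w - alpha * grad
    return w

# Same toy setup: y = 2*x1 + 3*x2, so w should end up near [2, 3]
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
y = X @ np.array([2.0, 3.0])
print(minibatch_gradient_descent(X, y))   # close to [2. 3.]
```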

Advantages of Mini Batch Gradient Descent

  • Reduces the variance of the parameter updates, which leads to more stable convergence
  • Speeds up learning
  • Helps estimate the approximate location of the actual minimum

Disadvantages of Mini Batch Gradient Descent

  • Loss is computed for each mini batch separately, so the total loss has to be accumulated across all mini batches; a small sketch of this bookkeeping follows below
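Here is a small sketch of that bookkeeping; the mini_batches iterable and the compute_loss function are hypothetical placeholders, not part of any specific library:

```python
def epoch_loss(mini_batches, compute_loss):
    """Accumulate per-batch losses into a single loss for the whole epoch,
    weighting each batch by its number of samples."""
    total_loss, total_samples = 0.0, 0
    for Xb, yb in mini_batches:
        total_loss += compute_loss(Xb, yb) * len(yb)
        total_samples += len(yb)
    return total_loss / total_samples
```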

This should give you a basic idea of gradient descent and its main variants.

