With mini-batch gradient descent, we don't update the parameters of the network after iterating through every single data point in our training set. Instead, we update the parameters of the network after iterating through some n number of data points.
Say n is 32; this means we update the parameters of the network after iterating through every 32 data points in our training set.
Gradient descent - Update the parameters of the network after iterating through all the data points present in the training set.
Stochastic gradient descent - Update the parameters of the network after iterating through every single data point present in the training set.
Mini-batch gradient descent - Update the parameters of the network after iterating through some n number of data points present in the training set, as sketched below.
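The following is a minimal sketch of the mini-batch update loop, assuming a hypothetical gradient function grad_fn(theta, X_batch, y_batch) that returns the gradient of the loss over the given batch; the names theta, lr, n, and epochs are illustrative and not taken from the text:

```python
import numpy as np

def minibatch_gradient_descent(theta, X, y, grad_fn, lr=0.01, n=32, epochs=10):
    """Update theta once per mini-batch of n data points."""
    num_samples = X.shape[0]
    for _ in range(epochs):
        # Shuffle once per epoch so the batches differ across epochs
        idx = np.random.permutation(num_samples)
        X, y = X[idx], y[idx]
        for start in range(0, num_samples, n):
            X_batch = X[start:start + n]
            y_batch = y[start:start + n]
            # One parameter update after iterating through n data points
            theta = theta - lr * grad_fn(theta, X_batch, y_batch)
    return theta
```

Setting n to 1 recovers stochastic gradient descent, while setting n to the full training set size recovers plain (batch) gradient descent.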
One problem we face with SGD and mini-batch gradient descent is that the gradient steps oscillate too much. This oscillation happens because we update the parameters of the network after every single data point or after every n data points, so each update is based on only a small sample of the data and its direction has high variance, which causes the gradient steps to oscillate.
This oscillation slows down training and makes it hard to reach convergence. To avoid this issue, we use momentum-based gradient descent.
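A minimal sketch of the momentum update follows, assuming a hypothetical grad_fn(theta) that returns the gradient at the given parameters; the names gamma (momentum coefficient), lr, and steps are illustrative assumptions:

```python
import numpy as np

def momentum_gradient_descent(theta, grad_fn, lr=0.01, gamma=0.9, steps=100):
    """Momentum: accumulate an exponentially decaying average of past
    gradients (the velocity v) and step in that smoothed direction."""
    v = np.zeros_like(theta)
    for _ in range(steps):
        g = grad_fn(theta)       # gradient at the current position
        v = gamma * v + lr * g   # velocity: smoothed update direction
        theta = theta - v        # parameter update
    return theta
```

Because consecutive gradients that point in conflicting directions partially cancel out in the velocity term, the oscillation is damped and updates move more steadily toward the minimum.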
One issue we encounter with the momentum-based gradient descent method is that it can cause us to miss the minimum value. Suppose we are close to attaining convergence and the accumulated momentum is high; the momentum then pushes the gradient step too far and we miss the minimum value, that is, we overshoot the minimum.
Nesterov accelerated gradient (also known as Nesterov momentum) is used to solve the issue faced with the momentum-based method. With Nesterov momentum, we calculate the gradient at the lookahead position instead of calculating the gradient at the current position. The lookahead position is the position where the momentum alone would take us.
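The sketch below shows the Nesterov variant under the same assumptions as the momentum example (a hypothetical grad_fn, and illustrative names gamma, lr, and steps); the only change is where the gradient is evaluated:

```python
import numpy as np

def nesterov_gradient_descent(theta, grad_fn, lr=0.01, gamma=0.9, steps=100):
    """Nesterov momentum: evaluate the gradient at the lookahead position
    (where the momentum alone would carry us), then take the step."""
    v = np.zeros_like(theta)
    for _ in range(steps):
        lookahead = theta - gamma * v  # position momentum would take us to
        g = grad_fn(lookahead)         # gradient at the lookahead, not at theta
        v = gamma * v + lr * g
        theta = theta - v
    return theta
```

By looking ahead, the update already "sees" the slope at the point the momentum is carrying it toward, so the step shrinks as it approaches the minimum and overshooting is reduced.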