Analytical gradients are the gradients we compute through backpropagation, and numerical gradients are numerical approximations of those same gradients, typically obtained with finite differences.
In gradient checking, we first compute both the analytical gradients and the approximate numerical gradients, and then we compare them. If the two differ by more than a small tolerance, there is an error in our implementation.
We don’t have to check whether the analytical and numerical gradients are exactly the same, since the numerical gradient is only an approximation. Instead, we compute the difference between the analytical and numerical gradients. If the difference is very small, say on the order of 1e-7, our implementation is most likely correct; otherwise, we have a buggy implementation.
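The check described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a definitive implementation: the toy loss `sum(w**2)`, the central-difference step `eps=1e-5`, the relative-difference formula, and all function names are assumptions chosen for the example. In a real network the analytical gradient would come from backpropagation.

```python
import numpy as np

def numerical_gradient(f, w, eps=1e-5):
    """Approximate the gradient of f at w with central differences."""
    grad = np.zeros_like(w, dtype=float)
    for i in range(w.size):
        step = np.zeros_like(w, dtype=float)
        step.flat[i] = eps  # perturb one parameter at a time
        grad.flat[i] = (f(w + step) - f(w - step)) / (2 * eps)
    return grad

def relative_difference(analytical, numerical):
    """Norm of the difference, scaled by the norms of both gradients."""
    numerator = np.linalg.norm(analytical - numerical)
    denominator = np.linalg.norm(analytical) + np.linalg.norm(numerical)
    return numerator / denominator if denominator > 0 else 0.0

# Toy loss (an assumption for this sketch): f(w) = sum(w^2), gradient 2w.
def loss(w):
    return np.sum(w ** 2)

def analytical_gradient(w):
    return 2 * w

w = np.array([0.5, -1.2, 3.0])
diff = relative_difference(analytical_gradient(w), numerical_gradient(loss, w))
gradients_match = diff < 1e-7  # tolerance taken from the text above
print(f"relative difference: {diff:.2e}, match: {gradients_match}")
```

Using a relative rather than an absolute difference is a design choice here: it keeps the 1e-7 threshold meaningful whether the gradients themselves are large or small.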
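A function is called convex when the line segment joining any two points on its graph never lies below the graph. As a result, every local minimum of a convex function is also a global minimum. A function is called non-convex when this property does not hold, so it can have more than one local minimum.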
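For reference, the standard formal condition for convexity can be written as follows. The symbols f, x, y and λ are notation introduced here for illustration, not taken from the text above.

```latex
f(\lambda x + (1 - \lambda) y) \le \lambda f(x) + (1 - \lambda) f(y)
\quad \text{for all } x, y \text{ and all } \lambda \in [0, 1]
```

For example, f(x) = x^2 satisfies this inequality everywhere, while f(x) = x^4 - x^2 has two separate minima at x = ±1/√2 and violates it: taking those two points with λ = 1/2 gives f(0) = 0 on the left but -1/4 on the right.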
With gradient descent, we update the parameters of the model only after iterating through all the data points in our training set. Let’s say we have 10 million data points. Then, even to perform a single parameter update, we have to iterate through all 10 million data points. This is very time-consuming and makes training slow.
So, to combat this drawback of gradient descent, we can use stochastic gradient descent.
With stochastic gradient descent, we don’t have to wait until we have iterated through all the data points in the training set before updating the parameters of the network. Instead, we update the parameters after every single data point in the training set.
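The difference between the two update schedules can be sketched on a small linear regression problem. Everything specific here is an assumption made for the example: the synthetic noise-free data, the mean squared error loss, the learning rates, the epoch counts, and the per-epoch shuffling (a common practice that the text above does not mention).

```python
import numpy as np

def gradient(w, X, y):
    """Gradient of the mean squared error of the linear model X @ w."""
    return 2 * X.T @ (X @ w - y) / len(y)

def batch_gradient_descent(X, y, lr=0.1, epochs=100):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        # One update per full pass over the training set.
        w -= lr * gradient(w, X, y)
    return w

def stochastic_gradient_descent(X, y, lr=0.01, epochs=100, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        # One update per data point, visited in shuffled order.
        for i in rng.permutation(len(y)):
            w -= lr * gradient(w, X[i:i + 1], y[i:i + 1])
    return w

# Synthetic data (hypothetical): 200 points, true weights [2, -3].
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
true_w = np.array([2.0, -3.0])
y = X @ true_w

w_gd = batch_gradient_descent(X, y)
w_sgd = stochastic_gradient_descent(X, y)
print("gradient descent:", w_gd)
print("stochastic gradient descent:", w_sgd)
```

With 200 data points, one epoch yields a single parameter update under gradient descent but 200 updates under stochastic gradient descent, which is the saving the text describes when the training set grows to millions of points.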