
Gradient descent is guaranteed to reach a local minimum as the number of iterations approaches infinity. In practice, however, the iterations have to be stopped after a reasonable number of steps, and plain gradient descent converges slowly. This is where momentum improves performance considerably: the cost can converge in fewer iterations when a momentum term is added to the weight update formula.

Furthermore, momentum changes the path taken towards the minimum. Standard gradient descent might get stuck in a local minimum that is far from the global minimum, whereas the extra push from the momentum term can help the search escape such a point and move closer to the global minimum.

We have already shown the derivation of the weight update formula. With standard gradient descent, a weight is updated as illustrated below.
//standard gradient descent: move the weight along the derivative, scaled by the learning rate
double derivative = weightToNodeDelta * weightToNodeValue * (1 - weightToNodeValue) * weightFromNodeValue;
weights.get(j).setValue(weights.get(j).getValue() + learningRate * derivative);
Incorporating momentum modifies the weight update formula as shown below.
double derivative = weightToNodeDelta * weightToNodeValue * (1 - weightToNodeValue) * weightFromNodeValue;
//momentum term adds a fraction of the previous iteration's derivative to the update
weights.get(j).setValue(weights.get(j).getValue() + learningRate * derivative + momentum * previousDerivative);
previousDerivative = derivative; //remember the current derivative for the next iteration
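Notice that previousDerivative has to be kept between iterations. A minimal sketch of how this bookkeeping could be wrapped in a small helper class is shown below; the class and method names are illustrative assumptions, not the actual API of the neural networks repository.

public class MomentumUpdater {

    private final double learningRate;
    private final double momentum;
    private double previousDerivative = 0; //no momentum term on the very first update

    public MomentumUpdater(double learningRate, double momentum) {
        this.learningRate = learningRate;
        this.momentum = momentum;
    }

    //applies the rule above: weight <- weight + learningRate * derivative + momentum * previousDerivative
    public double update(double weight, double derivative) {
        double updated = weight + learningRate * derivative + momentum * previousDerivative;
        previousDerivative = derivative; //remember this step for the next call
        return updated;
    }

    public static void main(String[] args) {
        MomentumUpdater updater = new MomentumUpdater(0.1, 0.3); //momentum = 0 recovers plain gradient descent
        double weight = 0.0;
        //two dummy derivative values just to show the bookkeeping across calls
        weight = updater.update(weight, 0.4);
        weight = updater.update(weight, 0.2); //this call also adds momentum * 0.4
        System.out.println("weight after two updates: " + weight);
    }
}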
Convergence
Suppose that back propagation is applied and the neural networks repository is run on the xor dataset with the following variables. We can then monitor the effect of momentum on the cost.
int[] hiddenNodes = {3};
double learningRate = 0.1, momentum = 0.3;
int epoch = 3000;
String activation = "sigmoid";
In this way, we can improve the learning speed as demonstrated below. For instance, standard gradient descent reduces the cost to 0.09 at the 50th iteration, whereas incorporating momentum reaches the same cost at the 26th iteration. In other words, momentum makes the training converge almost twice as fast, cutting the number of iterations nearly in half.
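To make this speed-up tangible outside of the xor run, the following toy sketch compares the two update rules on the simple cost f(w) = (w - 2)^2 and counts how many iterations each needs to push the cost below a threshold. The learning rate, momentum and threshold values here are assumptions chosen for illustration, so the iteration counts will not match the xor experiment above.

public class ConvergenceComparison {

    //runs the update rule until the toy cost (w - 2)^2 drops below the threshold
    static int iterationsUntil(double learningRate, double momentum, double threshold) {
        double w = 0.0; //initial weight
        double previousDerivative = 0.0;
        int iterations = 0;
        while (Math.pow(w - 2, 2) > threshold && iterations < 10000) {
            double derivative = -2 * (w - 2); //descent direction for the toy cost
            w = w + learningRate * derivative + momentum * previousDerivative;
            previousDerivative = derivative;
            iterations++;
        }
        return iterations;
    }

    public static void main(String[] args) {
        double threshold = 1e-4;
        //momentum = 0 reduces the rule to standard gradient descent
        System.out.println("standard gradient descent: " + iterationsUntil(0.05, 0.0, threshold) + " iterations");
        System.out.println("with momentum 0.3: " + iterationsUntil(0.05, 0.3, threshold) + " iterations");
    }
}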
So, we have focused on incorporating momentum into the weight update procedure of neural networks. Although momentum is an optional add-on, it is very common in real-world applications because it improves convergence considerably.