Incorporating Momentum Into Neural Network Learning

Newton’s cradle is the most popular example of momentum conservation. A lifted and released sphere strikes the stationary spheres, the force is transmitted through them, and the last sphere is pushed upward. In other words, the last ball receives the momentum of the first ball. We will apply a similar principle in neural networks to improve learning speed: the idea of including momentum in neural network learning is to incorporate the previous update into the current weight change.

Newton’s cradle

Gradient descent is guaranteed to reach a local minimum as the number of iterations approaches infinity. In practice, however, the iterations have to be stopped after a reasonable number of steps, and plain gradient descent converges slowly. Here, momentum improves the performance of gradient descent considerably: the cost can converge faster, in fewer iterations, when momentum is involved in the weight update formula.
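
To put this in symbols (a shorthand of mine rather than a formula from the original derivation), let d^{(t)} be the derivative term computed for a weight w at iteration t, \eta the learning rate and \alpha the momentum coefficient. The two update rules compared in this post can then be written as

w^{(t+1)} = w^{(t)} + \eta \, d^{(t)}

w^{(t+1)} = w^{(t)} + \eta \, d^{(t)} + \alpha \, d^{(t-1)}

and the second rule is exactly what the momentum-enabled Java snippet below implements.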


Furthermore, momentum changes the path taken to the minimum. Standard gradient descent might get stuck in a local minimum that is far away from the global minimum, whereas incorporating momentum can help the updates push past such a point and reach a better, possibly global, minimum.

Momentum changes the taken path

We’ve already shown the derivation of the weight update formula. A weight is updated as illustrated below when standard gradient descent is applied.

double derivative = weightToNodeDelta // delta of the node this weight feeds into
 * weightToNodeValue * (1 - weightToNodeValue) // derivative of that node's sigmoid output
 * weightFromNodeValue; // output of the node the weight originates from

weights.get(j).setValue(
  weights.get(j).getValue()
  + learningRate * derivative // standard gradient descent step
 );

Incorporating momentum modifies the weight update formula as follows.

double derivative = weightToNodeDelta
 * weightToNodeValue * (1 - weightToNodeValue)
 * weightFromNodeValue;

weights.get(j).setValue(
  weights.get(j).getValue()
  + learningRate * derivative
  + momentum * previousDerivative // contribution of the previous iteration's derivative
 );

previousDerivative = derivative; // remember this derivative for the next update
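
Note that previousDerivative has to be remembered between iterations for each weight separately. To make that bookkeeping concrete, here is a minimal, self-contained sketch of the same update rule. It is not the repository's code; it is a hypothetical single sigmoid neuron trained on a toy AND-gate dataset, with one stored previous derivative per parameter.

public class MomentumSketch {

    public static void main(String[] args) {
        double[][] inputs = {{0, 0}, {0, 1}, {1, 0}, {1, 1}};
        double[] targets = {0, 0, 0, 1}; // AND gate, learnable by a single neuron

        double[] weights = {0.1, -0.2};
        double bias = 0.05;
        double learningRate = 0.1, momentum = 0.3;

        // previous derivatives are kept per parameter, initialized to zero
        double[] previousWeightDerivatives = new double[weights.length];
        double previousBiasDerivative = 0;

        for (int epoch = 0; epoch < 3000; epoch++) {
            for (int i = 0; i < inputs.length; i++) {
                double net = bias;
                for (int j = 0; j < weights.length; j++) {
                    net += weights[j] * inputs[i][j];
                }
                double output = 1.0 / (1.0 + Math.exp(-net)); // sigmoid activation
                double delta = targets[i] - output;           // error signal of the output node

                for (int j = 0; j < weights.length; j++) {
                    // same structure as above: delta * sigmoid derivative * incoming value
                    double derivative = delta * output * (1 - output) * inputs[i][j];
                    weights[j] += learningRate * derivative
                        + momentum * previousWeightDerivatives[j];
                    previousWeightDerivatives[j] = derivative;
                }

                double biasDerivative = delta * output * (1 - output);
                bias += learningRate * biasDerivative + momentum * previousBiasDerivative;
                previousBiasDerivative = biasDerivative;
            }
        }

        // the output for (1, 1) should end up clearly higher than for the other inputs
        for (int i = 0; i < inputs.length; i++) {
            double net = bias;
            for (int j = 0; j < weights.length; j++) {
                net += weights[j] * inputs[i][j];
            }
            System.out.println(java.util.Arrays.toString(inputs[i])
                + " -> " + (1.0 / (1.0 + Math.exp(-net))));
        }
    }
}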

Convergence

Suppose that back propagation is applied and the neural networks repository is run on the XOR dataset with the following configuration. We will then monitor the effect of momentum on the cost.

int[] hiddenNodes = {3};
double learningRate = 0.1, momentum = 0.3;
int epoch = 3000;
String activation = "sigmoid";

In this way, we can improve the learning speed as demonstrated below. For instance, standard gradient descent reduces the cost to 0.09 at the 50th iteration, whereas incorporating momentum reaches the same cost value at the 26th iteration. In other words, momentum makes training converge almost twice as fast, shortening the optimization by almost 50%.

Momentum effect on cost
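
The flavour of this speed-up can be reproduced with a stand-alone toy experiment. Again, this is not the repository's code: it is a hypothetical one-dimensional cost f(w) = w * w / 2, and the snippet counts how many iterations the same update rule needs with and without momentum before the cost becomes negligible. In this toy setup the momentum run reaches the threshold in a small fraction of the iterations.

public class MomentumConvergenceDemo {

    // counts the iterations needed to shrink the toy cost f(w) = w * w / 2
    // below a small threshold, using the same update rule as above
    static int iterationsToConverge(double learningRate, double momentum) {
        double w = 1.0;                  // starting point
        double previousDerivative = 0.0; // no momentum contribution on the first step
        int iterations = 0;

        while (0.5 * w * w > 0.00005) {
            double derivative = -w;      // descent direction, i.e. -f'(w)
            w += learningRate * derivative + momentum * previousDerivative;
            previousDerivative = derivative;
            iterations++;
        }
        return iterations;
    }

    public static void main(String[] args) {
        System.out.println("plain gradient descent: "
            + iterationsToConverge(0.1, 0.0) + " iterations");
        System.out.println("with momentum = 0.3:    "
            + iterationsToConverge(0.1, 0.3) + " iterations");
    }
}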

So, we’ve focused on incorporating momentum into the weight update procedure of neural networks. Although momentum is an optional add-on, it is very common in real-world applications because it improves convergence considerably.





