The Insider’s Guide to Adam Optimization Algorithm for Deep Learning

Adam is the superstar optimization algorithm of deep learning. Optimization algorithms aim to find the optimum weights, minimize the error and maximize the accuracy. (Stochastic) gradient descent is the most common optimization algorithm for neural networks. We find the partial derivative of the total error with respect to each weight and use this value to update the weights.

wi = wi – α.(∂Error/∂wi)



Or, more simply, we can write classical gradient descent as

wi = wi – α.dwi

This is common because it works slowly but surely.
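As a minimal sketch of this update rule (the quadratic error function and the gradient helper below are made-up illustrations, not part of the original post), a few gradient descent steps in NumPy might look like this:

import numpy as np

# hypothetical error surface: Error(w) = sum(w**2), so dError/dw = 2*w
def gradient(w):
    return 2 * w

w = np.array([0.5, -1.2, 2.0])   # initial weights
alpha = 0.1                      # learning rate

for step in range(100):
    w = w - alpha * gradient(w)  # wi = wi – α.dwi

print(w)  # all weights approach the minimum at zero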

The Rise of Adam

In 2015, the Adam optimization algorithm was introduced. The name refers to adaptive moment estimation. It is actually an extension of stochastic gradient descent, but it tends to converge the cost function toward zero faster on many problems.

Adam and Eve by Lucas Cranach

In Adam, we will maintain vdw and sdw variables instead of using the raw partial derivative of the error with respect to each weight directly. You might think of vdw as similar to momentum, and sdw as similar to RMSProp, as the short sketch below shows.
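Here is a side-by-side sketch of those two ingredients, using illustrative variable names and the standard textbook forms of the momentum and RMSProp updates: momentum keeps a moving average of the gradient, RMSProp keeps a moving average of the squared gradient, and Adam simply maintains both.

import numpy as np

w = np.array([0.5, -1.2])
dw = 2 * w                        # hypothetical gradient of the error
vdw = np.zeros_like(w)
sdw = np.zeros_like(w)
alpha, beta1, beta2, epsilon = 0.01, 0.9, 0.999, 1e-8

# momentum-style step: vdw is a moving average of the gradient
vdw = beta1 * vdw + (1 - beta1) * dw
w_momentum = w - alpha * vdw

# RMSProp-style step: sdw is a moving average of the squared gradient
sdw = beta2 * sdw + (1 - beta2) * dw ** 2
w_rmsprop = w - alpha * dw / (np.sqrt(sdw) + epsilon)

print(w_momentum, w_rmsprop)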

Constructing Adam

Firstly, we assign the initial values of vdw and sdw to 0 for every weight.

vdwi = 0; sdwi = 0

Then, vdw and sdw values will be calculated for each weight from its gradient dwi.

vdwi = β1 . vdwi + (1 – β1) . dwi

sdwi = β2 . sdwi + (1 – β2) . (dwi)^2

Then, the calculated vdw and sdw values are bias-corrected by dividing them by one minus the decay rate raised to the power of the current epoch index.

vdwi_corrected = vdwi / (1 – β1^epoch)

sdwi_corrected = sdwi / (1 – β2^epoch)
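To see why this correction matters, take the very first update with a hypothetical gradient value: vdw starts at 0, so the raw moving average is shrunk by a factor of (1 – β1), and dividing by (1 – β1^epoch) undoes that shrinkage.

beta1 = 0.9
dw = 0.4                                   # hypothetical first gradient
vdw = 0.0
vdw = beta1 * vdw + (1 - beta1) * dw       # 0.04, heavily biased toward zero
vdw_corrected = vdw / (1 - pow(beta1, 1))  # epoch = 1, so divide by 0.1
print(vdw, vdw_corrected)                  # 0.04 vs 0.4, the bias is removed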

Finally, each weight is updated by subtracting the corrected vdw, scaled by the learning rate and divided by the square root of the corrected sdw (plus ε for numerical stability).

wi = wi – α.(vdwi_corrected / (√(sdwi_corrected) + ε))

The recommended values for the Adam hyperparameters are shown below. Additionally, the learning rate α should be tuned.

β1 = 0.9; β2 = 0.999; ε = 10^-8
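Putting the pieces together, here is a hedged sketch of a single Adam update for a NumPy weight vector using the recommended defaults. The function adam_update and its arguments are illustrative names, and epoch is assumed to count from 1 so the bias-correction denominators are never zero.

import numpy as np

def adam_update(w, dw, vdw, sdw, epoch, alpha=0.01,
                beta1=0.9, beta2=0.999, epsilon=1e-8):
    # moving averages of the gradient and the squared gradient
    vdw = beta1 * vdw + (1 - beta1) * dw
    sdw = beta2 * sdw + (1 - beta2) * dw ** 2
    # bias corrections (epoch counts from 1)
    vdw_corrected = vdw / (1 - beta1 ** epoch)
    sdw_corrected = sdw / (1 - beta2 ** epoch)
    # weight update
    w = w - alpha * vdw_corrected / (np.sqrt(sdw_corrected) + epsilon)
    return w, vdw, sdw

# usage on a toy error surface Error(w) = sum(w**2), whose gradient is 2*w
w = np.array([0.5, -1.2])
vdw = np.zeros_like(w)
sdw = np.zeros_like(w)
for epoch in range(1, 501):
    dw = 2 * w
    w, vdw, sdw = adam_update(w, dw, vdw, sdw, epoch)
print(w)  # both weights end up near zero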

As seen, the algorithm requires very little memory. That is why it is easy to apply to everyday problems.





Code wins arguments

I have also translated the Adam logic into code to monitor how it works.

Firstly, initialize Adam parameters.

import numpy as np

vdw = [0.0 for i in range(num_of_layers)] #first moment estimate per layer
sdw = [0.0 for i in range(num_of_layers)] #second moment estimate per layer
epsilon = np.array([pow(10, -8)]) #small constant to avoid division by zero
beta1 = 0.9; beta2 = 0.999 #recommended decay rates

Then, apply backpropagation to find dw, the partial derivative of the total error with respect to each weight, and update the weights. The following code stores the weights as matrices, one per layer, so vdw and sdw are indexed by the layer index. You can find the entire code on my GitHub profile.

for epoch in range(epochs):
 for j in range(num_of_instances):
  for k in range(num_of_layers - 1):
   dw = findDelta() #partial derivative of total error with respect to the weight

   if optimization_algorithm == 'gradient-descent':
    w[k] = w[k] - learningRate * dw #descend, so subtract the gradient
   elif optimization_algorithm == 'adam':
    vdw[k] = beta1 * vdw[k] + (1 - beta1) * dw
    sdw[k] = beta2 * sdw[k] + (1 - beta2) * pow(dw, 2)
    vdw_corrected = vdw[k] / (1 - pow(beta1, epoch + 1)) #epoch + 1 because the loop counts from 0
    sdw_corrected = sdw[k] / (1 - pow(beta2, epoch + 1))
    w[k] = w[k] - learningRate * (vdw_corrected / (np.sqrt(sdw_corrected) + epsilon)) #minus sign, as in the formula above

Performance

When I applied both gradient descent and the Adam optimization algorithm to the XOR problem with the same configuration (same learning rate and same initial weights), Adam converged the error toward zero much faster.

Adam vs Classical Gradient Descent Over XOR Problem
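As a rough, hypothetical illustration of the same effect (not the XOR network above), the sketch below compares classical gradient descent and Adam on a deliberately poorly scaled quadratic error surface, where tiny gradients make plain gradient descent crawl while Adam's per-weight scaling keeps the step size useful:

import numpy as np

def error(w):
    return np.sum(0.001 * w ** 2)   # poorly scaled toy error surface

def grad(w):
    return 0.002 * w                # gradients are very small everywhere

alpha, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

w_gd = np.array([3.0, -2.0])
w_adam = w_gd.copy()
vdw = np.zeros_like(w_adam)
sdw = np.zeros_like(w_adam)

for t in range(1, 201):
    # classical gradient descent step
    w_gd = w_gd - alpha * grad(w_gd)
    # Adam step
    dw = grad(w_adam)
    vdw = beta1 * vdw + (1 - beta1) * dw
    sdw = beta2 * sdw + (1 - beta2) * dw ** 2
    v_hat = vdw / (1 - beta1 ** t)
    s_hat = sdw / (1 - beta2 ** t)
    w_adam = w_adam - alpha * v_hat / (np.sqrt(s_hat) + eps)

print("gradient descent error:", error(w_gd))
print("adam error:            ", error(w_adam))

On this surface Adam reaches a much lower error in the same number of updates; on a well-scaled problem with a carefully tuned learning rate, the gap can shrink or even disappear.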

 

Today, Adam is even more valuable for very complex neural networks and deep learning models trained on really big data.

Closing with a metaphor

Imagine that optimization algorithms were car brands. Gradient descent would be Volvo: it moves with slow but sure steps, and it gives consumers confidence. Adam, on the other hand, would be Tesla, because no other algorithm has an insane mode today! Also, Adam usually outperforms the rest of the optimization algorithms.

 








6 Comments

  1. In the code snippet you use ‘epoch+1’, yet in your formula it is simply ‘epoch’.
    Am I missing something?
    Is epoch a simple counter, by the way?
    Thanks for this limpid tutorial!

    1. Thank you for your kind feedback.
      Focus on this: for epoch in range(10). The for loop runs from 0 to 9, but epoch should begin at 1 and end at 10. That’s why I added +1 to epoch in that line.

      Alternatively, you can write the for loop as for epoch in range(1, 10+1). In this case, the loop runs from 1 to 10 and you can use epoch instead of epoch + 1.

      And yes, epoch is a simple counter. Some resources refer to it as learning time.

      1. I found that this works better when counting each time the weights are updated.

  2. When updating the weights, the text describes
    wi = wi – α…
    but the code uses
    w[j] = w[j] + learningRate…
    On GitHub the learningRate is positive; vdw and sdw are implemented as described.

    Is this just a typo, or does the sign hide somewhere else?

    1. You are absolutely right! The GitHub code should be fixed with a minus sign. Thank you for your feedback.

  3. Thank you for the quick response 🙂

    Nevertheless, in my case the change of sign does not make any difference; both versions behave the same way with the same error rates.
