The Insider’s Guide to Adam Optimization Algorithm for Deep Learning

Adam is the superstar optimization algorithm of deep learning. Optimization algorithms aim to find the optimum weights, minimize the error and maximize the accuracy. (Stochastic) gradient descent is the most common optimization algorithm for neural networks. We find the partial derivative of the total error with respect to each weight and use this value to update the weights.

wi = wi – α.(∂Error/∂wi)



Or, more simply, we can write classical gradient descent as

wi = wi – α.dwi

This is common because it works slowly but surely.
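As a minimal sketch of this update rule (the quadratic error function and the gradient helper below are made-up illustrations, not part of the original post), a few gradient descent steps in NumPy might look like this:

import numpy as np

# hypothetical error surface: Error(w) = sum(w**2), so dError/dw = 2*w
def gradient(w):
    return 2 * w

w = np.array([0.5, -1.2, 2.0])   # initial weights
alpha = 0.1                      # learning rate

for step in range(100):
    w = w - alpha * gradient(w)  # wi = wi – α.dwi

print(w)  # all weights approach the minimum at zero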

The Rise of Adam

In 2015, the Adam optimization algorithm was introduced. The name refers to adaptive moment estimation. It is actually an extension of stochastic gradient descent, but it tends to converge the cost function toward zero faster on many problems.

Adam and Eve by Lucas Cranach

In Adam, we will maintain vdw and sdw variables instead of using the raw partial derivative of the error with respect to each weight directly. You might think of vdw as similar to momentum, and sdw as similar to RMSProp, as the short sketch below shows.
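Here is a side-by-side sketch of those two ingredients, using illustrative variable names and the standard textbook forms of the momentum and RMSProp updates: momentum keeps a moving average of the gradient, RMSProp keeps a moving average of the squared gradient, and Adam simply maintains both.

import numpy as np

w = np.array([0.5, -1.2])
dw = 2 * w                        # hypothetical gradient of the error
vdw = np.zeros_like(w)
sdw = np.zeros_like(w)
alpha, beta1, beta2, epsilon = 0.01, 0.9, 0.999, 1e-8

# momentum-style step: vdw is a moving average of the gradient
vdw = beta1 * vdw + (1 - beta1) * dw
w_momentum = w - alpha * vdw

# RMSProp-style step: sdw is a moving average of the squared gradient
sdw = beta2 * sdw + (1 - beta2) * dw ** 2
w_rmsprop = w - alpha * dw / (np.sqrt(sdw) + epsilon)

print(w_momentum, w_rmsprop)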

Constructing Adam

Firstly, we assign the initial values of vdw and sdw to 0 for every weight.

vdwi = 0; sdwi = 0

Then, vdw and sdw values will be calculated for each weight from its gradient dwi.

vdwi = β1 . vdwi + (1 – β1) . dwi

sdwi = β2 . sdwi + (1 – β2) . (dwi)^2

Then, the calculated vdw and sdw values are bias-corrected by dividing them by one minus the decay rate raised to the power of the current epoch index.

vdwi_corrected = vdwi / (1 – β1^epoch)

sdwi_corrected = sdwi / (1 – β2^epoch)
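To see why this correction matters, take the very first update with a hypothetical gradient value: vdw starts at 0, so the raw moving average is shrunk by a factor of (1 – β1), and dividing by (1 – β1^epoch) undoes that shrinkage.

beta1 = 0.9
dw = 0.4                                   # hypothetical first gradient
vdw = 0.0
vdw = beta1 * vdw + (1 - beta1) * dw       # 0.04, heavily biased toward zero
vdw_corrected = vdw / (1 - pow(beta1, 1))  # epoch = 1, so divide by 0.1
print(vdw, vdw_corrected)                  # 0.04 vs 0.4, the bias is removed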

Finally, each weight is updated by subtracting the corrected vdw, scaled by the learning rate and divided by the square root of the corrected sdw (plus ε for numerical stability).

wi = wi – α.(vdwi_corrected / (√(sdwi_corrected) + ε))

The recommended values for the Adam hyperparameters are shown below. Additionally, the learning rate α should be tuned.

β1 = 0.9; β2 = 0.999; ε = 10^-8
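Putting the pieces together, here is a hedged sketch of a single Adam update for a NumPy weight vector using the recommended defaults. The function adam_update and its arguments are illustrative names, and epoch is assumed to count from 1 so the bias-correction denominators are never zero.

import numpy as np

def adam_update(w, dw, vdw, sdw, epoch, alpha=0.01,
                beta1=0.9, beta2=0.999, epsilon=1e-8):
    # moving averages of the gradient and the squared gradient
    vdw = beta1 * vdw + (1 - beta1) * dw
    sdw = beta2 * sdw + (1 - beta2) * dw ** 2
    # bias corrections (epoch counts from 1)
    vdw_corrected = vdw / (1 - beta1 ** epoch)
    sdw_corrected = sdw / (1 - beta2 ** epoch)
    # weight update
    w = w - alpha * vdw_corrected / (np.sqrt(sdw_corrected) + epsilon)
    return w, vdw, sdw

# usage on a toy error surface Error(w) = sum(w**2), whose gradient is 2*w
w = np.array([0.5, -1.2])
vdw = np.zeros_like(w)
sdw = np.zeros_like(w)
for epoch in range(1, 501):
    dw = 2 * w
    w, vdw, sdw = adam_update(w, dw, vdw, sdw, epoch)
print(w)  # both weights end up near zero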

As seen, the algorithm requires very little memory. That is why it is easy to apply to everyday problems.





Code wins arguments

I have also translated the Adam logic into code to monitor how it works.

Firstly, initialize Adam parameters.

import numpy as np

vdw = [0.0 for i in range(num_of_layers)] #first moment estimate per layer
sdw = [0.0 for i in range(num_of_layers)] #second moment estimate per layer
epsilon = np.array([pow(10, -8)]) #small constant to avoid division by zero
beta1 = 0.9; beta2 = 0.999 #recommended decay rates

Then, apply backpropagation to find dw, the partial derivative of the total error with respect to each weight, and update the weights. The following code stores the weights as matrices, one per layer, so vdw and sdw are indexed by the layer index. You can find the entire code on my GitHub profile.

for epoch in range(epochs):
 for j in range(num_of_instances):
  for k in range(num_of_layers - 1):
   dw = findDelta() #partial derivative of total error with respect to the weight

   if optimization_algorithm == 'gradient-descent':
    w[k] = w[k] - learningRate * dw #descend, so subtract the gradient
   elif optimization_algorithm == 'adam':
    vdw[k] = beta1 * vdw[k] + (1 - beta1) * dw
    sdw[k] = beta2 * sdw[k] + (1 - beta2) * pow(dw, 2)
    vdw_corrected = vdw[k] / (1 - pow(beta1, epoch + 1)) #epoch + 1 because the loop counts from 0
    sdw_corrected = sdw[k] / (1 - pow(beta2, epoch + 1))
    w[k] = w[k] - learningRate * (vdw_corrected / (np.sqrt(sdw_corrected) + epsilon)) #minus sign, as in the formula above

Performance

When I applied both gradient descent and the Adam optimization algorithm to the XOR problem with the same configuration (same learning rate and same initial weights), Adam converged the error toward zero much faster.

Adam vs Classical Gradient Descent Over XOR Problem
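As a rough, hypothetical illustration of the same effect (not the XOR network above), the sketch below compares classical gradient descent and Adam on a deliberately poorly scaled quadratic error surface, where tiny gradients make plain gradient descent crawl while Adam's per-weight scaling keeps the step size useful:

import numpy as np

def error(w):
    return np.sum(0.001 * w ** 2)   # poorly scaled toy error surface

def grad(w):
    return 0.002 * w                # gradients are very small everywhere

alpha, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

w_gd = np.array([3.0, -2.0])
w_adam = w_gd.copy()
vdw = np.zeros_like(w_adam)
sdw = np.zeros_like(w_adam)

for t in range(1, 201):
    # classical gradient descent step
    w_gd = w_gd - alpha * grad(w_gd)
    # Adam step
    dw = grad(w_adam)
    vdw = beta1 * vdw + (1 - beta1) * dw
    sdw = beta2 * sdw + (1 - beta2) * dw ** 2
    v_hat = vdw / (1 - beta1 ** t)
    s_hat = sdw / (1 - beta2 ** t)
    w_adam = w_adam - alpha * v_hat / (np.sqrt(s_hat) + eps)

print("gradient descent error:", error(w_gd))
print("adam error:            ", error(w_adam))

On this surface Adam reaches a much lower error in the same number of updates; on a well-scaled problem with a carefully tuned learning rate, the gap can shrink or even disappear.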

 

Today, Adam is even more valuable for very complex neural networks and deep learning models trained on really big data.

Closing with a metaphor

Imagine that optimization algorithms were car brands. Gradient descent would be Volvo: it moves with slow but sure steps, and it gives consumers confidence. Adam, on the other hand, would be Tesla, because no other algorithm has an insane mode today! Also, Adam usually outperforms the rest of the optimization algorithms.

 








6 Comments

  1. In the code snippet you use ‘epoch+1’, yet in your formula it is simply ‘epoch’.
    Am I missing something?
    Is epoch a simple counter, by the way?
    Thanks for this limpid tutorial!

    1. Thank you for your kind feedback.
      Focus on this: for epoch in range(10). The for loop runs from 0 to 9, but epoch should begin at 1 and end at 10. That’s why I added +1 to epoch in that line.

      Alternatively, you can write the for loop as for epoch in range(1, 10+1). In this case, the loop runs from 1 to 10 and you can use epoch instead of epoch + 1.

      And yes, epoch is a simple counter. Some resources refer to it as learning time.

      1. I found that this works better when counting each time the weights are updated.

  2. When updating the weights, the text describes
    wi = wi – α…
    but the code uses
    w[j] = w[j] + learningRate…
    On GitHub the learningRate is positive; vdw and sdw are implemented as described.

    Is this just a typo, or does the sign hide somewhere else?

    1. You are absolutely right! The GitHub code should be fixed with a minus sign. Thank you for your feedback.

  3. Thank you for the quick response 🙂

    Nevertheless, in my case the change of sign does not make any difference; both versions behave the same way with the same error rates.
