wi = wi – α.(∂Error/∂wi)

More simply, we can express classical gradient descent as
wi = wi – α.dwi
It is widely used because it works slowly but surely.
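For instance, a minimal sketch of this update in code (assuming a hypothetical grad function that returns ∂Error/∂wi for weight i) could look like the following.

def gradient_descent_step(w, grad, alpha=0.1):
    # w is a list of weights; grad(w, i) is assumed to return dError/dwi
    return [w[i] - alpha * grad(w, i) for i in range(len(w))]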
The rise of Adam
The Adam optimization algorithm was introduced in 2015. Its name refers to adaptive moment estimation. It is essentially an extension of stochastic gradient descent, but it tends to converge the cost function to zero faster for many problems.

In Adam, we will use vdw and sdw variables instead of the raw partial derivative of the error with respect to each weight. You can think of vdw as similar to momentum and sdw as similar to RMSProp.
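For comparison, here is a rough sketch of the two ideas Adam combines, written for a single weight. The names dw, v and s are placeholders for the current gradient and the two moving averages; they are not from any library.

import numpy as np

def momentum_step(w, v, dw, alpha=0.1, beta1=0.9):
    # momentum: exponentially weighted average of past gradients
    v = beta1 * v + (1 - beta1) * dw
    return w - alpha * v, v

def rmsprop_step(w, s, dw, alpha=0.1, beta2=0.999, epsilon=1e-8):
    # RMSProp: exponentially weighted average of squared gradients
    s = beta2 * s + (1 - beta2) * dw ** 2
    return w - alpha * dw / (np.sqrt(s) + epsilon), s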
Constructing Adam
First, we assign the initial value 0 to vdw and sdw for every weight.
vdwi = 0; sdwi = 0
Then, vdw and sdw values are calculated for each weight from dwi, the partial derivative of the error with respect to that weight.
vdwi = β1 . vdwi + (1 – β1) . dwi
sdwi = β2 . sdwi + (1 – β2) . (dwi)^2
After that, the calculated vdw and sdw values are bias-corrected by the power of the current epoch index. For example, in the very first epoch vdwi equals (1 – β1) . dwi, and dividing by (1 – β1^1) = (1 – β1) recovers dwi, so the early estimates are not biased towards zero.
vdwi_corrected = vdwi / (1 – β1^epoch)
sdwi_corrected = sdwi / (1 – β2^epoch)
Finally, each weight is updated by subtracting the learning rate times the corrected vdw divided by the square root of the corrected sdw (plus ε).
wi = wi – α.(vdw_corrected / (√(sdw_corrected) + ε))
The recommended values for the Adam hyperparameters are shown below. Additionally, the learning rate α should still be tuned.
β1 = 0.9; β2 = 0.999; ε = 10^-8
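Putting the steps above together, a compact sketch of a single Adam update for one weight might look as follows. Here dw stands for the current partial derivative and epoch for the 1-based iteration counter; both names are just placeholders for illustration.

import numpy as np

def adam_step(w, v, s, dw, epoch, alpha=0.1, beta1=0.9, beta2=0.999, epsilon=1e-8):
    # moving averages of the gradient and the squared gradient
    v = beta1 * v + (1 - beta1) * dw
    s = beta2 * s + (1 - beta2) * dw ** 2
    # bias correction by the power of the current epoch index
    v_corrected = v / (1 - pow(beta1, epoch))
    s_corrected = s / (1 - pow(beta2, epoch))
    # final weight update
    w = w - alpha * v_corrected / (np.sqrt(s_corrected) + epsilon)
    return w, v, s

Calling adam_step repeatedly with the running v and s values reproduces the update rule above for each weight.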
As seen, the algorithm requires very little memory. That is why it is easy to apply to everyday problems.
Code wins arguments
I also implemented the Adam logic in code to see how it works.
First, initialize the Adam parameters.
import numpy as np

vdw = [0.0 for i in range(num_of_layers)]
sdw = [0.0 for i in range(num_of_layers)]
epsilon = np.array([pow(10, -8)])
beta1 = 0.9; beta2 = 0.999
Then, apply backpropagation to find dw, the partial derivative of the total error with respect to each weight, and update the weights accordingly. The following code stores the weights as a matrix. You can find the entire code on my GitHub profile.
for epoch in range(epochs):
    for j in range(num_of_instances):
        for k in range(num_of_layers - 1):
            dw = findDelta() # partial derivative of total error with respect to the weight
            if optimization_algorithm == 'gradient-descent':
                w[k] = w[k] - learningRate * dw
            elif optimization_algorithm == 'adam':
                vdw[k] = beta1 * vdw[k] + (1 - beta1) * dw
                sdw[k] = beta2 * sdw[k] + (1 - beta2) * pow(dw, 2)
                vdw_corrected = vdw[k] / (1 - pow(beta1, epoch + 1))
                sdw_corrected = sdw[k] / (1 - pow(beta2, epoch + 1))
                w[k] = w[k] - learningRate * (vdw_corrected / (np.sqrt(sdw_corrected) + epsilon))
Performance
When I applied both gradient descent and the Adam optimization algorithm to the XOR problem with the same configuration (same learning rate and same initial weights), Adam converged the error towards zero much faster.

Today, Adam is even more meaningful for very complex neural networks and deep learning models trained on really big data.
Closing with a metaphor
Can you imagine what it would be like if optimization algorithms were car brands? Gradient descent would be Volvo: it moves with slow but sure steps, and it gives confidence to its customers. On the other hand, the Adam optimization algorithm would be Tesla, because no other algorithm has an insane mode today! Also, Adam usually outperforms the rest of the optimization algorithms.
Support this blog if you like it!