wi = wi – α.(∂Error/∂wi)

More simply, we can express classical gradient descent as
wi = wi – α.dwi
It is widely used because it works slowly but surely.
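For instance, a minimal sketch of this update in code (assuming a hypothetical grad function that returns ∂Error/∂wi for weight i) could look like the following.

def gradient_descent_step(w, grad, alpha=0.1):
    # w is a list of weights; grad(w, i) is assumed to return dError/dwi
    return [w[i] - alpha * grad(w, i) for i in range(len(w))]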
The rise of Adam
The Adam optimization algorithm was introduced in 2015. Its name refers to adaptive moment estimation. It is essentially an extension of stochastic gradient descent, but it tends to converge the cost function to zero faster for many problems.

In Adam, we will use vdw and sdw variables instead of the raw partial derivative of the error with respect to each weight. You can think of vdw as similar to momentum and sdw as similar to RMSProp.
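For comparison, here is a rough sketch of the two ideas Adam combines, written for a single weight. The names dw, v and s are placeholders for the current gradient and the two moving averages; they are not from any library.

import numpy as np

def momentum_step(w, v, dw, alpha=0.1, beta1=0.9):
    # momentum: exponentially weighted average of past gradients
    v = beta1 * v + (1 - beta1) * dw
    return w - alpha * v, v

def rmsprop_step(w, s, dw, alpha=0.1, beta2=0.999, epsilon=1e-8):
    # RMSProp: exponentially weighted average of squared gradients
    s = beta2 * s + (1 - beta2) * dw ** 2
    return w - alpha * dw / (np.sqrt(s) + epsilon), s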
Constructing Adam
First, we assign the initial value 0 to vdw and sdw for every weight.
vdwi = 0; sdwi = 0
Then, vdw and sdw values are calculated for each weight from dwi, the partial derivative of the error with respect to that weight.
vdwi = β1 . vdwi + (1 – β1) . dwi
sdwi = β2 . sdwi + (1 – β2) . (dwi)^2
After that, the calculated vdw and sdw values are bias-corrected by the power of the current epoch index. For example, in the very first epoch vdwi equals (1 – β1) . dwi, and dividing by (1 – β1^1) = (1 – β1) recovers dwi, so the early estimates are not biased towards zero.
vdwi_corrected = vdwi / (1 – β1^epoch)
sdwi_corrected = sdwi / (1 – β2^epoch)
Finally, each weight is updated by subtracting the learning rate times the corrected vdw divided by the square root of the corrected sdw (plus ε).
wi = wi – α.(vdw_corrected / (√(sdw_corrected) + ε))
The recommended values for the Adam hyperparameters are shown below. Additionally, the learning rate α should still be tuned.
β1 = 0.9; β2 = 0.999; ε = 10^-8
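Putting the steps above together, a compact sketch of a single Adam update for one weight might look as follows. Here dw stands for the current partial derivative and epoch for the 1-based iteration counter; both names are just placeholders for illustration.

import numpy as np

def adam_step(w, v, s, dw, epoch, alpha=0.1, beta1=0.9, beta2=0.999, epsilon=1e-8):
    # moving averages of the gradient and the squared gradient
    v = beta1 * v + (1 - beta1) * dw
    s = beta2 * s + (1 - beta2) * dw ** 2
    # bias correction by the power of the current epoch index
    v_corrected = v / (1 - pow(beta1, epoch))
    s_corrected = s / (1 - pow(beta2, epoch))
    # final weight update
    w = w - alpha * v_corrected / (np.sqrt(s_corrected) + epsilon)
    return w, v, s

Calling adam_step repeatedly with the running v and s values reproduces the update rule above for each weight.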
As seen, the algorithm requires very little memory. That is why it is easy to apply to everyday problems.
Code wins arguments
I also implemented the Adam logic in code to see how it works.
First, initialize the Adam parameters.
import numpy as np

vdw = [0.0 for i in range(num_of_layers)]
sdw = [0.0 for i in range(num_of_layers)]
epsilon = np.array([pow(10, -8)])
beta1 = 0.9; beta2 = 0.999
Then, apply backpropagation to find dw, the partial derivative of the total error with respect to each weight, and update the weights accordingly. The following code stores the weights as a matrix. You can find the entire code on my GitHub profile.
for epoch in range(epochs):
    for j in range(num_of_instances):
        for k in range(num_of_layers - 1):
            dw = findDelta() # partial derivative of total error with respect to the weight
            if optimization_algorithm == 'gradient-descent':
                w[k] = w[k] - learningRate * dw
            elif optimization_algorithm == 'adam':
                vdw[k] = beta1 * vdw[k] + (1 - beta1) * dw
                sdw[k] = beta2 * sdw[k] + (1 - beta2) * pow(dw, 2)
                vdw_corrected = vdw[k] / (1 - pow(beta1, epoch + 1))
                sdw_corrected = sdw[k] / (1 - pow(beta2, epoch + 1))
                w[k] = w[k] - learningRate * (vdw_corrected / (np.sqrt(sdw_corrected) + epsilon))
Performance
When I applied both gradient descent and the Adam optimization algorithm to the XOR problem with the same configuration (same learning rate and same initial weights), Adam converged the error towards zero much faster.

Today, Adam is even more meaningful for very complex neural networks and deep learning models trained on really big data.
Closing with a metaphor
Can you imagine what it would be like if optimization algorithms were car brands? Gradient descent would be Volvo: it moves with slow but sure steps, and it gives confidence to its customers. On the other hand, the Adam optimization algorithm would be Tesla, because no other algorithm has an insane mode today! Also, Adam usually outperforms the rest of the optimization algorithms.
Support this blog if you like it!