Neural networks are one of the most powerful machine learning algorithms. However, the math behind them can be confusing because of the chain of calculations involved. In this post, we walk through the math behind the neural network learning algorithm, backpropagation, step by step.
Perceptron
Backpropagation is a way to train multilayer perceptrons, more widely known as neural networks. The earliest form of neural network is the regular perceptron. To understand backpropagation better, it helps to first recall what a perceptron is.
Perceptrons can learn linearly separable problems such as the AND and OR gates. However, they fail on non-linear problems such as the XOR gate. That is why we need multilayer perceptrons and their training method, backpropagation.
Multilayer Perceptron
Backpropagation is a very common algorithm for neural network learning. It basically applies the following steps to all historical instances. Firstly, forward propagation is applied (left-to-right) to compute the network output; that is the forecast value, whereas the actual value is already known. Secondly, the difference between the forecast and the actual value is calculated; this is called the error. Thirdly, the error is reflected to all the weights, and the weights are updated based on the calculated error. Finally, these steps are repeated for a custom epoch count (e.g. epoch = 1000).
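To make these four steps concrete before the derivation, here is a minimal sketch of that loop on a deliberately tiny one-weight model, not the 4-layer network discussed below; the data, initial weight and learning rate are illustrative assumptions.

```python
# Sketch of the training loop on a trivial one-weight model (illustrative only).
instances = [(1.0, 0.5), (2.0, 1.0), (3.0, 1.5)]  # (input, actual) pairs, assumed toy data
w = 0.1       # randomly initialized weight
alpha = 0.1   # learning rate

for epoch in range(1000):                       # repeat for the custom epoch count
    for x, actual in instances:                 # all historical instances
        forecast = w * x                        # forward propagation (left-to-right)
        error = (actual - forecast) ** 2 / 2    # difference between forecast and actual
        gradient = (forecast - actual) * x      # error reflected to the weight
        w = w - alpha * gradient                # weight update

print(w)  # approaches 0.5 for this toy data set
```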
We’ll work on the 4-layer network illustrated below. The input layer and each hidden layer contain 2 nodes, whereas the output layer consists of a single node. In addition, bias units (+1) appear on the input and hidden layers.
Weight initialization
The backpropagation algorithm aims to find the weight values that produce the output with minimum error. It requires forward propagation to be applied first, and for that, the weights must be initialized randomly. In this way, an output can be calculated. The calculated output is compared with the actual output, and the difference is reflected to the weights as error.
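For instance, the 15 weights of the network below could be initialized as follows; the uniform range [-1, 1] is just an assumption for illustration, not a requirement.

```python
import random

# Randomly initialize the 15 weights w1..w15 of the 4-layer network.
weights = {f"w{i}": random.uniform(-1, 1) for i in range(1, 16)}
print(weights)
```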
Forward Propagation
netinput_h1 = 1·w1 + i1·w3 + i2·w5
netoutput_h1 = sigmoid(netinput_h1) = 1 / (1 + e^(-netinput_h1))
Sigmoid is one of the most common activation functions, but it is not a must. You may check other alternatives. In this post, we assume sigmoid is the activation function.
netinput_h2 = 1·w2 + i1·w4 + i2·w6
netoutput_h2 = sigmoid(netinput_h2)
netinput_h4 = 1·w7 + netoutput_h1·w9 + netoutput_h2·w11
netoutput_h4 = sigmoid(netinput_h4)
netinput_h5 = 1·w8 + netoutput_h1·w10 + netoutput_h2·w12
netoutput_h5 = sigmoid(netinput_h5)
netinput_y = 1·w13 + netoutput_h4·w14 + netoutput_h5·w15
netoutput_y = sigmoid(netinput_y)
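The forward pass above translates almost line by line into code. Here is a sketch for the network of this post; the input values and the constant weight values are placeholders, not results from the text.

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

i1, i2 = 0.0, 1.0                          # assumed input instance
w = {f"w{k}": 0.5 for k in range(1, 16)}   # placeholder weights

# 1st hidden layer
netinput_h1 = 1 * w["w1"] + i1 * w["w3"] + i2 * w["w5"]
netoutput_h1 = sigmoid(netinput_h1)
netinput_h2 = 1 * w["w2"] + i1 * w["w4"] + i2 * w["w6"]
netoutput_h2 = sigmoid(netinput_h2)

# 2nd hidden layer
netinput_h4 = 1 * w["w7"] + netoutput_h1 * w["w9"] + netoutput_h2 * w["w11"]
netoutput_h4 = sigmoid(netinput_h4)
netinput_h5 = 1 * w["w8"] + netoutput_h1 * w["w10"] + netoutput_h2 * w["w12"]
netoutput_h5 = sigmoid(netinput_h5)

# output layer
netinput_y = 1 * w["w13"] + netoutput_h4 * w["w14"] + netoutput_h5 * w["w15"]
netoutput_y = sigmoid(netinput_y)
print(netoutput_y)
```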
Error calculation
The error of the network is calculated by the following formula.
Error = (actual_y – netoutput_y)^2 / 2
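In code, the error of a single instance would be computed as below; actual_y and netoutput_y are placeholder values standing in for the known label and the forward pass result.

```python
actual_y = 1.0        # assumed label of the instance
netoutput_y = 0.73    # placeholder forecast from the forward pass sketch
error = (actual_y - netoutput_y) ** 2 / 2
print(error)
```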
The relation with derivative
Remember the definition of the derivative: what happens to the output when you change the input by an arbitrarily small amount? Herein, we initialized all weights randomly, and that causes an error. We need to find the optimum weights. So let’s ask the following question: what happens to the error when we change the i-indexed weight by an arbitrarily small amount?
This is the derivative of the error with respect to w_i, or simply ∂Error / ∂w_i.
Tip: the derivative of the error function takes a very simple form; that is exactly why this function was picked as the error function. Backpropagation aims to drive the error towards its minimum value.
Backpropagation
The weights determine the success of the network. The backpropagation algorithm looks for the optimum weights based on previous experiences. That is why the error calculated in the previous step is reflected to all weights; in this way, the weights can be updated based on previous errors.
For instance, how much of the calculated error should be reflected to w15? In other words, how do we calculate ∂Error / ∂w15? The chain rule helps us evaluate this expression.
∂Error / ∂w15 = (∂Error / ∂netoutput_y) · (∂netoutput_y / ∂netinput_y) · (∂netinput_y / ∂w15)
Let’s calculate the three factors of this product.
• Error = (actual_y – netoutput_y)^2 / 2
∂Error / ∂netoutput_y = [2 · (actual_y – netoutput_y)^(2-1) / 2] · (-1)
∂Error / ∂netoutput_y = netoutput_y – actual_y
• netoutput_y = 1 / (1 + e^(-netinput_y))
The derivative of the sigmoid function is easy to derive, as shown in a previous post.
∂netoutput_y / ∂netinput_y = netoutput_y · (1 – netoutput_y)
If the activation function you picked is different from sigmoid, you should work out its derivative in the same way.
• netinput_y = 1·w13 + netoutput_h4·w14 + netoutput_h5·w15
∂netinput_y / ∂w15 = netoutput_h5
• ∂Error / ∂w15 = (netoutput_y – actual_y) · netoutput_y · (1 – netoutput_y) · netoutput_h5
We can denote the error term of the output node as δ_y instead of (netoutput_y – actual_y) · netoutput_y · (1 – netoutput_y). Then the equation transforms into the following formula.
• ∂Error / ∂w15 = δ_y · netoutput_h5
The other weights (w13, w14) connecting the 2nd hidden layer to the output layer are calculated on the same principle.
• ∂Error / ∂w14 = δ_y · netoutput_h4
• ∂Error / ∂w13 = δ_y · 1
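As a quick sketch, the error term of the output node and the gradients of w13 to w15 could be computed like this; the netoutput_* values are placeholders standing in for the forward pass results.

```python
actual_y, netoutput_y = 1.0, 0.73         # placeholders
netoutput_h4, netoutput_h5 = 0.68, 0.71   # placeholders

delta_y = (netoutput_y - actual_y) * netoutput_y * (1 - netoutput_y)
grad_w15 = delta_y * netoutput_h5
grad_w14 = delta_y * netoutput_h4
grad_w13 = delta_y * 1
print(grad_w13, grad_w14, grad_w15)
```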
Let’s move one layer to the left and make the calculations for w7 to w12 (the weights between the 1st hidden layer and the 2nd hidden layer).
∂Error / ∂w12 = (∂Error / ∂netoutput_y) · (∂netoutput_y / ∂netinput_y) · (∂netinput_y / ∂netoutput_h5) · (∂netoutput_h5 / ∂netinput_h5) · (∂netinput_h5 / ∂w12)
• ∂Error / ∂netoutput_y = netoutput_y – actual_y (already calculated)
• ∂netoutput_y / ∂netinput_y = netoutput_y · (1 – netoutput_y) (already calculated)
• netinput_y = 1·w13 + netoutput_h4·w14 + netoutput_h5·w15
∂netinput_y / ∂netoutput_h5 = w15
• netoutput_h5 = 1 / (1 + e^(-netinput_h5))
∂netoutput_h5 / ∂netinput_h5 = netoutput_h5 · (1 – netoutput_h5)
• netinput_h5 = 1·w8 + netoutput_h1·w10 + netoutput_h2·w12
∂netinput_h5 / ∂w12 = netoutput_h2
To sum up, the reflection of the total error to w12 is shown below:
∂Error / ∂w12 = (netoutput_y – actual_y) · netoutput_y · (1 – netoutput_y) · w15 · netoutput_h5 · (1 – netoutput_h5) · netoutput_h2
∂Error / ∂w12 = δ_y · w15 · netoutput_h5 · (1 – netoutput_h5) · netoutput_h2
Just as we did for the output node, we can define an error term for node h5: δ_h5 = δ_y · w15 · netoutput_h5 · (1 – netoutput_h5). Then the equation transforms into:
∂Error / ∂w12 = δ_h5 · netoutput_h2
After that, the weights w7 to w11 are calculated similarly, using the error term of node h4, δ_h4 = δ_y · w14 · netoutput_h4 · (1 – netoutput_h4):
∂Error / ∂w11 = δ_h4 · netoutput_h2
∂Error / ∂w10 = δ_h5 · netoutput_h1
∂Error / ∂w9 = δ_h4 · netoutput_h1
∂Error / ∂w8 = δ_h5 · 1
∂Error / ∂w7 = δ_h4 · 1
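A corresponding sketch for the 2nd hidden layer; delta_y, the weights and the netoutput_* values are placeholders carried over from the previous sketches.

```python
delta_y = -0.053                           # placeholder output error term
w14, w15 = 0.5, 0.5                        # placeholder weights
netoutput_h1, netoutput_h2 = 0.62, 0.62    # placeholders
netoutput_h4, netoutput_h5 = 0.68, 0.71    # placeholders

delta_h5 = delta_y * w15 * netoutput_h5 * (1 - netoutput_h5)
delta_h4 = delta_y * w14 * netoutput_h4 * (1 - netoutput_h4)

grad_w12 = delta_h5 * netoutput_h2
grad_w11 = delta_h4 * netoutput_h2
grad_w10 = delta_h5 * netoutput_h1
grad_w9 = delta_h4 * netoutput_h1
grad_w8 = delta_h5 * 1
grad_w7 = delta_h4 * 1
print(grad_w7, grad_w8, grad_w9, grad_w10, grad_w11, grad_w12)
```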
Finally, we can generalize the formulas for the weights connecting the input layer to the 1st hidden layer.
∂Error / ∂w6 = (∂Error / ∂netoutput_y) · (∂netoutput_y / ∂netinput_y) · [(∂netinput_y / ∂netoutput_h5) · (∂netoutput_h5 / ∂netinput_h5) · (∂netinput_h5 / ∂netoutput_h2) + (∂netinput_y / ∂netoutput_h4) · (∂netoutput_h4 / ∂netinput_h4) · (∂netinput_h4 / ∂netoutput_h2)] · (∂netoutput_h2 / ∂netinput_h2) · (∂netinput_h2 / ∂w6)
• netinput_h5 = 1·w8 + netoutput_h1·w10 + netoutput_h2·w12
∂netinput_h5 / ∂netoutput_h2 = w12
• netinput_h4 = 1·w7 + netoutput_h1·w9 + netoutput_h2·w11
∂netinput_h4 / ∂netoutput_h2 = w11
• netinput_h2 = 1·w2 + i1·w4 + i2·w6
∂netinput_h2 / ∂w6 = i2
∂Error / ∂w6 = δ_y · [w15 · netoutput_h5 · (1 – netoutput_h5) · w12 + w14 · netoutput_h4 · (1 – netoutput_h4) · w11] · netoutput_h2 · (1 – netoutput_h2) · i2
∂Error / ∂w6 = [δ_y · w15 · netoutput_h5 · (1 – netoutput_h5) · w12 + δ_y · w14 · netoutput_h4 · (1 – netoutput_h4) · w11] · netoutput_h2 · (1 – netoutput_h2) · i2
∂Error / ∂w6 = [δ_h5 · w12 + δ_h4 · w11] · netoutput_h2 · (1 – netoutput_h2) · i2
If we define the error term of node h2 as δ_h2 = [δ_h5 · w12 + δ_h4 · w11] · netoutput_h2 · (1 – netoutput_h2), then:
∂Error / ∂w6 = δ_h2 · i2
Similarly, the weights w1 to w5 are calculated on the same principle, using δ_h1 = [δ_h5 · w10 + δ_h4 · w9] · netoutput_h1 · (1 – netoutput_h1) as the error term of node h1.
∂Error / ∂w5 = [δ_h5 · w10 + δ_h4 · w9] · netoutput_h1 · (1 – netoutput_h1) · i2
∂Error / ∂w5 = δ_h1 · i2
∂Error / ∂w4 = [δ_h5 · w12 + δ_h4 · w11] · netoutput_h2 · (1 – netoutput_h2) · i1
∂Error / ∂w4 = δ_h2 · i1
∂Error / ∂w3 = [δ_h5 · w10 + δ_h4 · w9] · netoutput_h1 · (1 – netoutput_h1) · i1
∂Error / ∂w3 = δ_h1 · i1
∂Error / ∂w2 = [δ_h5 · w12 + δ_h4 · w11] · netoutput_h2 · (1 – netoutput_h2) · 1
∂Error / ∂w2 = δ_h2 · 1
∂Error / ∂w1 = [δ_h5 · w10 + δ_h4 · w9] · netoutput_h1 · (1 – netoutput_h1) · 1
∂Error / ∂w1 = δ_h1 · 1
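And the same sketch for the 1st hidden layer; again, the inputs, weights and error terms are placeholder values carried over from the earlier sketches.

```python
i1, i2 = 0.0, 1.0                          # assumed input instance
w9, w10, w11, w12 = 0.5, 0.5, 0.5, 0.5     # placeholder weights
netoutput_h1, netoutput_h2 = 0.62, 0.62    # placeholders
delta_h4, delta_h5 = -0.0058, -0.0055      # placeholder error terms

delta_h2 = (delta_h5 * w12 + delta_h4 * w11) * netoutput_h2 * (1 - netoutput_h2)
delta_h1 = (delta_h5 * w10 + delta_h4 * w9) * netoutput_h1 * (1 - netoutput_h1)

grad_w6 = delta_h2 * i2
grad_w5 = delta_h1 * i2
grad_w4 = delta_h2 * i1
grad_w3 = delta_h1 * i1
grad_w2 = delta_h2 * 1
grad_w1 = delta_h1 * 1
print(grad_w1, grad_w2, grad_w3, grad_w4, grad_w5, grad_w6)
```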
As you have probably realized, the derivative calculation can be written in a generalized form, where δ_tonode is the error term of the node the weight points to (it already includes the netoutput_tonode · (1 – netoutput_tonode) factor) and netoutput_fromnode is the output of the node the weight comes from (1 for a bias unit):
∂Error / ∂w_i = δ_tonode · netoutput_fromnode
After all, we have formulated the error reflection for every weight. Now we can update the weights with the stochastic gradient descent formula. The following formula is applied for all i values. In this equation, α refers to the learning rate and should be a small value (e.g. α = 0.1).
w_i = w_i – α · (∂Error / ∂w_i)
Caution! All derivative values (∂Error / ∂w_i) must be calculated first, and only then should the weight update formula be applied. If ∂Error / ∂w15 were calculated and w15 updated immediately, gradient descent would fail, because the remaining derivatives would be computed with already-modified weights. The right way is to calculate ∂Error / ∂w15, ∂Error / ∂w14, …, ∂Error / ∂w1 first, and then update w15, w14, …, w1.
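A small sketch of that pattern: all partial derivatives are collected first, and only afterwards are the weights touched. Here, compute_gradient is a hypothetical stub standing in for the per-weight formulas derived above.

```python
alpha = 0.1
weights = {f"w{i}": 0.5 for i in range(1, 16)}   # placeholder weights

def compute_gradient(name, weights):
    # hypothetical stub: in a real implementation this would return
    # dError/dw_name using the delta terms and node outputs computed above
    return 0.0

gradients = {name: compute_gradient(name, weights) for name in weights}  # all derivatives first
for name in weights:
    weights[name] -= alpha * gradients[name]                             # then update every weight
```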
In this post, we focused on the backpropagation algorithm for finding the optimum weight values. We also investigated how errors are reflected to the weights and how the weights are updated based on those errors. I’ve also shared a full implementation of the backpropagation algorithm on my GitHub profile for both Java and Python.