The Math Behind Neural Networks Learning with Backpropagation

Neural networks are among the most powerful machine learning algorithms. However, the mathematics behind them can be intimidating at first glance. In this post, we walk through the math behind the neural network learning algorithm, backpropagation.


Backpropagation is the most common algorithm for implementing neural network learning. For every training instance, the algorithm basically performs the following steps. First, forward propagation is applied (left to right) to compute the network output; this is the forecast value, whereas the actual value is already known. Second, the difference between the forecast and the actual value is calculated; this is called the error. Third, the error is reflected back to all the weights, and the weights are updated based on the calculated error. Finally, these steps are repeated for a fixed number of epochs (e.g. epoch = 1000).
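The four steps above can be sketched in code. The following is a minimal illustration rather than the post's full network: a single sigmoid neuron with one bias weight and one input weight, trained with exactly that loop structure. All names and the toy data are assumptions made for this sketch.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train(instances, epochs=1000, alpha=0.1, seed=42):
    rng = random.Random(seed)
    # random weight initialization (bias weight w0, input weight w1)
    w0, w1 = rng.uniform(-1, 1), rng.uniform(-1, 1)
    for _ in range(epochs):                            # repeat for a fixed epoch count
        for x, actual in instances:                    # all historical instances
            out = sigmoid(w0 + w1 * x)                 # 1. forward propagation -> forecast
            delta = (out - actual) * out * (1 - out)   # 2-3. error reflected through the sigmoid
            w0 -= alpha * delta                        # 3. weight updates
            w1 -= alpha * delta * x
    return w0, w1

# toy data: output should approach 1 for x = +1 and 0 for x = -1
data = [(1.0, 1.0), (-1.0, 0.0)]
w0, w1 = train(data, epochs=5000)
```

After training, the neuron's outputs move toward the two targets, which is the whole point of the weight-update loop.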

We’ll work on the 4-layer network illustrated below. The input layer and both hidden layers contain 2 nodes each, whereas the output layer consists of a single node. In addition, bias units (+1) appear on the input and hidden layers.

[Figure: nn-model.png — the 4-layer network model]

Weight initialization

The backpropagation algorithm aims to find the optimum weight values, i.e. those that produce the output with minimum error. It requires applying forward propagation first, so the weights must be initialized randomly. In this way, an output can be calculated; the calculated output is compared with the actual output, and the difference is reflected back to the weights as error.

Forward Propagation

netinput_h1 = 1·w1 + i1·w3 + i2·w5

netoutput_h1 = sigmoid(netinput_h1) = 1 / (1 + e^(-netinput_h1))

netinput_h2 = 1·w2 + i1·w4 + i2·w6

netoutput_h2 = sigmoid(netinput_h2)

netinput_h4 = 1·w7 + netoutput_h1·w9 + netoutput_h2·w11

netoutput_h4 = sigmoid(netinput_h4)

netinput_h5 = 1·w8 + netoutput_h1·w10 + netoutput_h2·w12

netoutput_h5 = sigmoid(netinput_h5)

netinput_y = 1·w13 + netoutput_h4·w14 + netoutput_h5·w15

netoutput_y = sigmoid(netinput_y)
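The forward-propagation equations translate almost line by line into code. Storing the weights in a dict keyed by their index is an implementation choice for this sketch, not something the post prescribes:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(w, i1, i2):
    """Forward propagation for the 4-layer network above.
    `w` maps index k to weight w_k (k = 1..15)."""
    h1 = sigmoid(w[1] + i1 * w[3] + i2 * w[5])
    h2 = sigmoid(w[2] + i1 * w[4] + i2 * w[6])
    h4 = sigmoid(w[7] + h1 * w[9] + h2 * w[11])
    h5 = sigmoid(w[8] + h1 * w[10] + h2 * w[12])
    y  = sigmoid(w[13] + h4 * w[14] + h5 * w[15])
    return h1, h2, h4, h5, y

# sanity check: with all-zero weights every net input is 0,
# so every sigmoid outputs exactly 0.5 regardless of the inputs
w = {k: 0.0 for k in range(1, 16)}
outputs = forward(w, 0.3, 0.7)
```

In practice the weights would be initialized randomly, as the previous section notes; the all-zero dict here only serves as an easily checkable example.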

Error calculation

The error of the network is calculated by the following formula.

Error = (actual_y - netoutput_y)^2 / 2

Tip: the derivative of this error function takes a very simple form, which is one reason it is chosen as the error function. Backpropagation aims to drive this error value to a minimum.
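The tip can be checked numerically: thanks to the 1/2 factor, the exponent's 2 cancels when differentiating, leaving simply netoutput_y - actual_y. A central finite difference (exact for quadratics up to rounding) confirms this; the concrete numbers below are arbitrary:

```python
def error(actual, out):
    return (actual - out) ** 2 / 2.0

# derivative w.r.t. the network output: d(Error)/d(out) = out - actual
actual, out, eps = 1.0, 0.7, 1e-6
numeric = (error(actual, out + eps) - error(actual, out - eps)) / (2 * eps)
analytic = out - actual
```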

Backpropagation

The weights determine the success of the network. The backpropagation algorithm looks for the optimum weights based on previous experiences; that's why the error calculated in the previous step is reflected back to all weights. In this way, the weights can be updated based on previous errors.

For instance, how much of the calculated error should be reflected to w15? In other words, how do we calculate ∂Error / ∂w15? The chain rule helps us evaluate this derivative.

[Figure: chain rule based error reflection to w15]

∂Error / ∂w15 = (∂Error / ∂netoutput_y) · (∂netoutput_y / ∂netinput_y) · (∂netinput_y / ∂w15)

Let’s calculate the three multipliers of this equation.

• Error = (actual_y - netoutput_y)^2 / 2

∂Error / ∂netoutput_y = [2 · (actual_y - netoutput_y)^(2-1) / 2] · (-1)

∂Error / ∂netoutput_y = netoutput_y - actual_y

• netoutput_y = 1 / (1 + e^(-netinput_y))

The derivative of the sigmoid function is easy to derive, as shown in the previous post.

∂netoutput_y / ∂netinput_y = netoutput_y · (1 - netoutput_y)

• netinput_y = 1·w13 + netoutput_h4·w14 + netoutput_h5·w15

∂netinput_y / ∂w15 = netoutput_h5

• ∂Error / ∂w15 = (netoutput_y - actual_y) · netoutput_y · (1 - netoutput_y) · netoutput_h5

We can denote the error term of the output node as δ_y instead of (netoutput_y - actual_y) · netoutput_y · (1 - netoutput_y). Then the equation transforms into the following formula.

• ∂Error / ∂w15 = δ_y · netoutput_h5

The other weights connected from the 2nd hidden layer to the output layer (w13, w14) are calculated by the same principle.

• ∂Error / ∂w14 = δ_y · netoutput_h4

• ∂Error / ∂w13 = δ_y · 1
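Given the forward-pass values, these three gradients are cheap to compute. A sketch with assumed variable names (`y`, `h4`, `h5` stand for netoutput_y, netoutput_h4, netoutput_h5; the example values are made up):

```python
def output_layer_grads(actual, y, h4, h5):
    # delta_y = (netoutput_y - actual_y) * netoutput_y * (1 - netoutput_y)
    delta_y = (y - actual) * y * (1 - y)
    # dError/dw13 = delta_y * 1, dError/dw14 = delta_y * h4, dError/dw15 = delta_y * h5
    return {13: delta_y, 14: delta_y * h4, 15: delta_y * h5}

# example with made-up forward-pass values
g = output_layer_grads(actual=1.0, y=0.6, h4=0.5, h5=0.8)
```

Note that delta_y is computed once and reused for all three weights, mirroring how the bullet formulas share the δ_y factor.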

Let’s move one layer to the left and make the calculations for w7 to w12 (the weights between the 1st and 2nd hidden layers).

[Figure: reflecting the error to w12]

∂Error / ∂w12 = (∂Error / ∂netoutput_y) · (∂netoutput_y / ∂netinput_y) · (∂netinput_y / ∂netoutput_h5) · (∂netoutput_h5 / ∂netinput_h5) · (∂netinput_h5 / ∂w12)

• ∂Error / ∂netoutput_y = netoutput_y - actual_y (already calculated)

• ∂netoutput_y / ∂netinput_y = netoutput_y · (1 - netoutput_y) (already calculated)

• netinput_y = 1·w13 + netoutput_h4·w14 + netoutput_h5·w15

∂netinput_y / ∂netoutput_h5 = w15

• netoutput_h5 = 1 / (1 + e^(-netinput_h5))

∂netoutput_h5 / ∂netinput_h5 = netoutput_h5 · (1 - netoutput_h5)

• netinput_h5 = 1·w8 + netoutput_h1·w10 + netoutput_h2·w12

∂netinput_h5 / ∂w12 = netoutput_h2

To sum up, the reflection of the total error to w12 is shown below:

∂Error / ∂w12 = (netoutput_y - actual_y) · netoutput_y · (1 - netoutput_y) · w15 · netoutput_h5 · (1 - netoutput_h5) · netoutput_h2

∂Error / ∂w12 = δ_y · w15 · netoutput_h5 · (1 - netoutput_h5) · netoutput_h2

We can use the term δ_h5 instead of δ_y · w15; then the equation transforms into:

∂Error / ∂w12 = δ_h5 · netoutput_h5 · (1 - netoutput_h5) · netoutput_h2

After that, the weights w7 to w11 are calculated similarly (with δ_h4 = δ_y · w14):

∂Error / ∂w11 = δ_h4 · netoutput_h4 · (1 - netoutput_h4) · netoutput_h2

∂Error / ∂w10 = δ_h5 · netoutput_h5 · (1 - netoutput_h5) · netoutput_h1

∂Error / ∂w9 = δ_h4 · netoutput_h4 · (1 - netoutput_h4) · netoutput_h1

∂Error / ∂w8 = δ_h5 · netoutput_h5 · (1 - netoutput_h5) · 1

∂Error / ∂w7 = δ_h4 · netoutput_h4 · (1 - netoutput_h4) · 1
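These six formulas share two common factors, so in code each factor is computed only once. A sketch with assumed variable names (`h1`, `h2`, `h4`, `h5` are the hidden netoutput values; `delta_y` and the weights come from the output-layer step; the example values are made up):

```python
def hidden2_grads(delta_y, w14, w15, h1, h2, h4, h5):
    d4 = delta_y * w14               # delta_h4
    d5 = delta_y * w15               # delta_h5
    g4 = d4 * h4 * (1 - h4)          # common factor of dError/dw7, dw9, dw11
    g5 = d5 * h5 * (1 - h5)          # common factor of dError/dw8, dw10, dw12
    return {7: g4, 8: g5, 9: g4 * h1, 10: g5 * h1, 11: g4 * h2, 12: g5 * h2}

# example with made-up values
g = hidden2_grads(delta_y=0.1, w14=0.5, w15=0.5, h1=0.5, h2=0.5, h4=0.5, h5=0.5)
```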

Finally, we can generalize the formulas for the weights connected from the input layer to the 1st hidden layer.

[Figure: error reflection to w6]

∂Error / ∂w6 = (∂Error / ∂netoutput_y) · (∂netoutput_y / ∂netinput_y) · [(∂netinput_y / ∂netoutput_h5) · (∂netoutput_h5 / ∂netinput_h5) · (∂netinput_h5 / ∂netoutput_h2) + (∂netinput_y / ∂netoutput_h4) · (∂netoutput_h4 / ∂netinput_h4) · (∂netinput_h4 / ∂netoutput_h2)] · (∂netoutput_h2 / ∂netinput_h2) · (∂netinput_h2 / ∂w6)

• netinput_h5 = 1·w8 + netoutput_h1·w10 + netoutput_h2·w12

∂netinput_h5 / ∂netoutput_h2 = w12

• netinput_h4 = 1·w7 + netoutput_h1·w9 + netoutput_h2·w11

∂netinput_h4 / ∂netoutput_h2 = w11

• netinput_h2 = 1·w2 + i1·w4 + i2·w6

∂netinput_h2 / ∂w6 = i2

∂Error / ∂w6 = δ_y · [w15 · netoutput_h5 · (1 - netoutput_h5) · w12 + w14 · netoutput_h4 · (1 - netoutput_h4) · w11] · netoutput_h2 · (1 - netoutput_h2) · i2

∂Error / ∂w6 = [δ_y · w15 · netoutput_h5 · (1 - netoutput_h5) · w12 + δ_y · w14 · netoutput_h4 · (1 - netoutput_h4) · w11] · netoutput_h2 · (1 - netoutput_h2) · i2

∂Error / ∂w6 = [δ_h5 · netoutput_h5 · (1 - netoutput_h5) · w12 + δ_h4 · netoutput_h4 · (1 - netoutput_h4) · w11] · netoutput_h2 · (1 - netoutput_h2) · i2

Note that the netoutput · (1 - netoutput) factors of h4 and h5 must stay inside the brackets, because δ_h5 and δ_h4 were defined without them. Naming the whole bracketed sum δ_h2, the equation becomes:

∂Error / ∂w6 = δ_h2 · netoutput_h2 · (1 - netoutput_h2) · i2

Similarly, the weights w1 to w5 can be calculated by the same principle, where δ_h1 = δ_h5 · netoutput_h5 · (1 - netoutput_h5) · w10 + δ_h4 · netoutput_h4 · (1 - netoutput_h4) · w9.

∂Error / ∂w5 = [δ_h5 · netoutput_h5 · (1 - netoutput_h5) · w10 + δ_h4 · netoutput_h4 · (1 - netoutput_h4) · w9] · netoutput_h1 · (1 - netoutput_h1) · i2

∂Error / ∂w5 = δ_h1 · netoutput_h1 · (1 - netoutput_h1) · i2

∂Error / ∂w4 = [δ_h5 · netoutput_h5 · (1 - netoutput_h5) · w12 + δ_h4 · netoutput_h4 · (1 - netoutput_h4) · w11] · netoutput_h2 · (1 - netoutput_h2) · i1

∂Error / ∂w4 = δ_h2 · netoutput_h2 · (1 - netoutput_h2) · i1

∂Error / ∂w3 = [δ_h5 · netoutput_h5 · (1 - netoutput_h5) · w10 + δ_h4 · netoutput_h4 · (1 - netoutput_h4) · w9] · netoutput_h1 · (1 - netoutput_h1) · i1

∂Error / ∂w3 = δ_h1 · netoutput_h1 · (1 - netoutput_h1) · i1

∂Error / ∂w2 = [δ_h5 · netoutput_h5 · (1 - netoutput_h5) · w12 + δ_h4 · netoutput_h4 · (1 - netoutput_h4) · w11] · netoutput_h2 · (1 - netoutput_h2) · 1

∂Error / ∂w2 = δ_h2 · netoutput_h2 · (1 - netoutput_h2) · 1

∂Error / ∂w1 = [δ_h5 · netoutput_h5 · (1 - netoutput_h5) · w10 + δ_h4 · netoutput_h4 · (1 - netoutput_h4) · w9] · netoutput_h1 · (1 - netoutput_h1) · 1

∂Error / ∂w1 = δ_h1 · netoutput_h1 · (1 - netoutput_h1) · 1
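At the input layer, each delta gathers both backward paths (through h4 and through h5). A sketch, again with the weight dict keyed by index as an assumed representation; note that the netoutput · (1 - netoutput) factors of h4 and h5 belong inside the bracketed sums, since δ_h4 and δ_h5 were defined without them:

```python
def hidden1_grads(delta_y, w, h1, h2, h4, h5, i1, i2):
    d4 = delta_y * w[14]                                 # delta_h4
    d5 = delta_y * w[15]                                 # delta_h5
    # each input-layer delta sums both backward paths (through h4 and h5)
    d1 = d5 * h5 * (1 - h5) * w[10] + d4 * h4 * (1 - h4) * w[9]    # delta_h1
    d2 = d5 * h5 * (1 - h5) * w[12] + d4 * h4 * (1 - h4) * w[11]   # delta_h2
    g1 = d1 * h1 * (1 - h1)          # common factor of dError/dw1, dw3, dw5
    g2 = d2 * h2 * (1 - h2)          # common factor of dError/dw2, dw4, dw6
    return {1: g1, 2: g2, 3: g1 * i1, 4: g2 * i1, 5: g1 * i2, 6: g2 * i2}

# example with made-up values: all weights 1.0
w = {k: 1.0 for k in range(1, 16)}
g = hidden1_grads(delta_y=0.1, w=w, h1=0.5, h2=0.5, h4=0.5, h5=0.5, i1=1.0, i2=2.0)
```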

As you may have realized, the derivative calculation takes the generalized form demonstrated below for weights into hidden nodes (for the output node, δ_y was defined to already contain the netoutput_y · (1 - netoutput_y) factor, so there ∂Error / ∂wi = δ_y · netoutput_fromnode).

∂Error / ∂wi = δ_tonode · netoutput_tonode · (1 - netoutput_tonode) · netoutput_fromnode

With that, we have formulated the error reflections for all weights. Now we can update the weights with the stochastic gradient descent formula. The following formula is applied for all i values; here α refers to the learning rate and should be a small value (e.g. α = 0.1).

wi = wi - α · (∂Error / ∂wi)

Caution! All derivative values (∂Error / ∂wi) must be calculated first, and only then should the weight update formula be applied. If ∂Error / ∂w15 is calculated and w15 is updated immediately, gradient descent will fail, because the later derivatives would be computed against already-modified weights. The right approach is to calculate ∂Error / ∂w15, ∂Error / ∂w14, …, ∂Error / ∂w1 first, and then update w15, w14, …, w1.
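The caution above suggests a two-phase structure: build the full gradient collection first, then apply a single update pass. A minimal sketch (the dict representation and the uniform example values are assumptions):

```python
def apply_updates(w, grads, alpha=0.1):
    # phase 2 only: every dError/dwi in `grads` was computed beforehand,
    # so no update can contaminate a later derivative
    return {k: w[k] - alpha * grads[k] for k in w}

# example: uniform weights and uniform gradients
w = {k: 0.5 for k in range(1, 16)}
grads = {k: 0.1 for k in range(1, 16)}
w = apply_updates(w, grads)
```

Returning a fresh dict (rather than mutating in place) makes it impossible to mix old and new weight values within one update step.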

In this post, we focused on the backpropagation algorithm for finding the optimum weight values. We also investigated how errors are reflected to the weights and how the weights are updated based on those errors. I’ve also shared a full implementation of the backpropagation algorithm on my GitHub profile.
