The Math Behind Neural Networks Learning with Backpropagation

Neural networks are one of the most powerful machine learning algorithms. However, the mathematics behind them can be confusing at first because of the long chains of calculations. In this post, we walk through the math behind the neural network learning algorithm, backpropagation.



Perceptron

Backpropagation is a way to train multilayer perceptrons, more widely known as neural networks. The legacy form of a neural network is the regular perceptron. To understand backpropagation better, first see what perceptrons are.

Perceptrons can learn linear problems such as the AND or OR gate. However, they fail against non-linear problems such as the XOR gate. That’s why we need multilayer perceptrons and their training method, backpropagation.

Multilayer Perceptron

Backpropagation is a very common algorithm for implementing neural network learning. It basically applies the following steps to every historical instance. First, forward propagation is applied (left to right) to compute the network output; that is the forecast value, whereas the actual value is already known. Second, the difference between the forecast and the actual value is calculated and is called the error. Third, the error is reflected to all the weights, and the weights are updated based on the calculated error. Finally, these steps are repeated for a chosen number of epochs (e.g. epoch = 1000), as sketched in the loop below.
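To make these four steps concrete, here is a minimal training-loop sketch in Python. It is only a skeleton: the forward and backward functions are hypothetical placeholders whose details are worked out in the rest of the post, and weights is assumed to be a dictionary keyed by weight name.

```python
def train(instances, weights, forward, backward, alpha=0.1, epochs=1000):
    # instances: list of (inputs, actual) pairs from historical data
    for epoch in range(epochs):
        for inputs, actual in instances:
            prediction = forward(inputs, weights)           # 1. feed forward (left to right)
            error = (actual - prediction) ** 2 / 2          # 2. difference of forecast and actual
            gradients = backward(inputs, actual, weights)   # 3. reflect the error to every weight
            for key in weights:                             # 4. update weights from the error
                weights[key] = weights[key] - alpha * gradients[key]
            # error can be tracked here to monitor convergence
    return weights
```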

We’ll work on the 4-layered network illustrated below. Two nodes exist in the input layer and in each of the two hidden layers, whereas the output layer consists of a single node. In addition, bias units (+1) appear on the input and hidden layers.

[Figure: the example neural network model]

Weight initialization

The backpropagation algorithm aims to find the optimum weight values that produce the output with minimum error. It requires applying forward propagation first. Herein, the weights must be initialized randomly so that an output can be calculated at all. The calculated output is then compared with the actual output, and the difference is reflected back to the weights as error.
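For instance, the 15 weights (w1 to w15) used in the following sections can be initialized with small random values. This is just a sketch; the (-1, 1) range here is an arbitrary choice, not a requirement.

```python
import random

# randomly initialize the 15 weights of the example network (w1 .. w15)
weights = {"w" + str(i): random.uniform(-1.0, 1.0) for i in range(1, 16)}
```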

Forward Propagation

netinput_h1 = 1·w1 + i1·w3 + i2·w5

netoutput_h1 = sigmoid(netinput_h1) = 1 / (1 + e^(-netinput_h1))

Sigmoid is one of the most common activation functions, but it is not a must; you may check other alternatives. Suppose that sigmoid is the activation function throughout this post.

netinput_h2 = 1·w2 + i1·w4 + i2·w6

netoutput_h2 = sigmoid(netinput_h2)

netinput_h4 = 1·w7 + netoutput_h1·w9 + netoutput_h2·w11

netoutput_h4 = sigmoid(netinput_h4)

netinput_h5 = 1·w8 + netoutput_h1·w10 + netoutput_h2·w12

netoutput_h5 = sigmoid(netinput_h5)

netinput_y = 1·w13 + netoutput_h4·w14 + netoutput_h5·w15

netoutput_y = sigmoid(netinput_y)
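The forward pass above translates directly into Python. The snippet below is a self-contained sketch: the input values i1 and i2 are arbitrary example values, and the weights are initialized randomly as described earlier.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# arbitrary example inputs and randomly initialized weights
i1, i2 = 0.0, 1.0
w = {"w" + str(i): random.uniform(-1.0, 1.0) for i in range(1, 16)}

# 1st hidden layer (the leading 1 * ... terms come from the bias unit)
netinput_h1 = 1 * w["w1"] + i1 * w["w3"] + i2 * w["w5"]
netoutput_h1 = sigmoid(netinput_h1)
netinput_h2 = 1 * w["w2"] + i1 * w["w4"] + i2 * w["w6"]
netoutput_h2 = sigmoid(netinput_h2)

# 2nd hidden layer
netinput_h4 = 1 * w["w7"] + netoutput_h1 * w["w9"] + netoutput_h2 * w["w11"]
netoutput_h4 = sigmoid(netinput_h4)
netinput_h5 = 1 * w["w8"] + netoutput_h1 * w["w10"] + netoutput_h2 * w["w12"]
netoutput_h5 = sigmoid(netinput_h5)

# output layer
netinput_y = 1 * w["w13"] + netoutput_h4 * w["w14"] + netoutput_h5 * w["w15"]
netoutput_y = sigmoid(netinput_y)
```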





Error calculation

The error of the network is calculated by the following formula.

Error = (actual_y – netoutput_y)² / 2
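Continuing the forward-propagation sketch above, and assuming a known label actual_y (for example 1 for an XOR-style instance), the error is one line:

```python
actual_y = 1.0  # assumed known label for this instance
error = (actual_y - netoutput_y) ** 2 / 2
```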

The relation with the derivative

Remember the definition of the derivative: what happens to the output when you change the input by as small an amount as possible? Herein, we initialized all weights randomly, and that causes an error. We need to find the optimum weights. So what if we ask the following question: what happens to the error when we change the i-indexed weight by as small an amount as possible?

[Figure: definition of the derivative]

This is the derivative of the error with respect to wi, or simply ∂Error / ∂wi.

Tip: the derivative of this error function has a very simple form, which is one reason it is picked as the error function. Backpropagation aims to drive this error value to its minimum.
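This question can also be checked numerically: nudge a single weight by a tiny amount and watch how the error changes. The helper below is a generic sketch; error_fn stands for the whole network's error as a function of one weight, and the lambda is just a made-up stand-in for illustration.

```python
def numeric_gradient(error_fn, w_i, h=1e-6):
    # central finite difference: (E(w + h) - E(w - h)) / (2h) approximates dE/dw
    return (error_fn(w_i + h) - error_fn(w_i - h)) / (2 * h)

# illustrative stand-in: error as a function of a single weight, everything else held fixed
example_error = lambda w: (1.0 - 0.5 * w) ** 2 / 2
print(numeric_gradient(example_error, 0.8))  # approximately -0.3
```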

Backpropagation

Weights indicate the success of the network. The backpropagation algorithm looks for the optimum weights based on previous experiences. That’s why the error calculated in the previous step is reflected to all the weights; thus, the weights can be updated based on previous errors.

For instance, how much of the calculated error should be reflected to w15? That’s the question. In other words, how do we calculate ∂Error / ∂w15? The chain rule helps us calculate it.

[Figure: chain-rule based error reflection to w15]

∂Error / ∂w15 = (∂Error / ∂netoutput_y) · (∂netoutput_y / ∂netinput_y) · (∂netinput_y / ∂w15)

Let’s calculate the three multipliers of this equation.

• Error = (actual_y – netoutput_y)² / 2

∂Error / ∂netoutput_y = [2 · (actual_y – netoutput_y)^(2-1) / 2] · (-1)

∂Error / ∂netoutput_y = netoutput_y – actual_y

• netoutput_y = 1 / (1 + e^(-netinput_y))

The derivative of the sigmoid function is easy to derive, as demonstrated in a previous post.

∂netoutput_y / ∂netinput_y = netoutput_y · (1 – netoutput_y)

If the activation function you picked is different from sigmoid, you should derive it accordingly.

• netinput_y = 1·w13 + netoutput_h4·w14 + netoutput_h5·w15

∂netinput_y / ∂w15 = netoutput_h5

• ∂Error / ∂w15 = (netoutput_y – actual_y) · netoutput_y · (1 – netoutput_y) · netoutput_h5

We can denote the error of the output node as δ_y instead of (netoutput_y – actual_y) · netoutput_y · (1 – netoutput_y). Then the equation transforms into the following form.





• ∂Error / ∂w15 = δ_y · netoutput_h5

The other weights connected from the 2nd hidden layer to the output layer (w14, w13) are calculated on the same principle.

• ∂Error / ∂w14 = δ_y · netoutput_h4

• ∂Error / ∂w13 = δ_y · 1
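In code, and continuing the earlier sketches (so netoutput_y, netoutput_h4, netoutput_h5 and actual_y are assumed to exist already), the output-layer derivatives are:

```python
# error signal of the output node
delta_y = (netoutput_y - actual_y) * netoutput_y * (1 - netoutput_y)

gradients = {}
gradients["w15"] = delta_y * netoutput_h5
gradients["w14"] = delta_y * netoutput_h4
gradients["w13"] = delta_y * 1  # the bias unit always feeds 1
```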

Let’s move one layer to the left and make the calculations for w7 to w12 (the weights between the 1st and the 2nd hidden layer).

[Figure: reflecting the error to w12]

∂Error / ∂w12 = (∂Error / ∂netoutput_y) · (∂netoutput_y / ∂netinput_y) · (∂netinput_y / ∂netoutput_h5) · (∂netoutput_h5 / ∂netinput_h5) · (∂netinput_h5 / ∂w12)

• ∂Error / ∂netoutput_y = netoutput_y – actual_y (already calculated)

• ∂netoutput_y / ∂netinput_y = netoutput_y · (1 – netoutput_y) (already calculated)

• netinput_y = 1·w13 + netoutput_h4·w14 + netoutput_h5·w15

∂netinput_y / ∂netoutput_h5 = w15





• netoutput_h5 = 1 / (1 + e^(-netinput_h5))

∂netoutput_h5 / ∂netinput_h5 = netoutput_h5 · (1 – netoutput_h5)

• netinput_h5 = 1·w8 + netoutput_h1·w10 + netoutput_h2·w12

∂netinput_h5 / ∂w12 = netoutput_h2

To sum up, the reflection of the total error to w12 is shown below:

∂Error / ∂w12 = (netoutput_y – actual_y) · netoutput_y · (1 – netoutput_y) · w15 · netoutput_h5 · (1 – netoutput_h5) · netoutput_h2

∂Error / ∂w12 = δ_y · w15 · netoutput_h5 · (1 – netoutput_h5) · netoutput_h2

We can use the term δ_h5 instead of δ_y · w15 (and, analogously, δ_h4 instead of δ_y · w14); then the equation transforms into:

∂Error / ∂w12 = δ_h5 · netoutput_h5 · (1 – netoutput_h5) · netoutput_h2

After that, the derivatives for weights w7 to w11 are calculated similarly.





∂Error / ∂w11 = δ_h4 · netoutput_h4 · (1 – netoutput_h4) · netoutput_h2

∂Error / ∂w10 = δ_h5 · netoutput_h5 · (1 – netoutput_h5) · netoutput_h1

∂Error / ∂w9 = δ_h4 · netoutput_h4 · (1 – netoutput_h4) · netoutput_h1

∂Error / ∂w8 = δ_h5 · netoutput_h5 · (1 – netoutput_h5) · 1

∂Error / ∂w7 = δ_h4 · netoutput_h4 · (1 – netoutput_h4) · 1
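Continuing the same sketch (the w dictionary and the netoutput values come from the forward-pass code above), the error signals δ_h5 and δ_h4 and the derivatives for w7 to w12 look like this:

```python
# error signals of the 2nd hidden layer nodes (delta_y propagated back over w15 and w14)
delta_h5 = delta_y * w["w15"]
delta_h4 = delta_y * w["w14"]

gradients["w12"] = delta_h5 * netoutput_h5 * (1 - netoutput_h5) * netoutput_h2
gradients["w11"] = delta_h4 * netoutput_h4 * (1 - netoutput_h4) * netoutput_h2
gradients["w10"] = delta_h5 * netoutput_h5 * (1 - netoutput_h5) * netoutput_h1
gradients["w9"] = delta_h4 * netoutput_h4 * (1 - netoutput_h4) * netoutput_h1
gradients["w8"] = delta_h5 * netoutput_h5 * (1 - netoutput_h5) * 1  # bias -> h5
gradients["w7"] = delta_h4 * netoutput_h4 * (1 - netoutput_h4) * 1  # bias -> h4
```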

Finally, we can generalize the formulas for the weights connected from the input layer to the 1st hidden layer.

[Figure: error reflection to w6]

∂Error / ∂w6 = (∂Error / ∂netoutput_y) · (∂netoutput_y / ∂netinput_y) · [(∂netinput_y / ∂netoutput_h5) · (∂netoutput_h5 / ∂netinput_h5) · (∂netinput_h5 / ∂netoutput_h2) + (∂netinput_y / ∂netoutput_h4) · (∂netoutput_h4 / ∂netinput_h4) · (∂netinput_h4 / ∂netoutput_h2)] · (∂netoutput_h2 / ∂netinput_h2) · (∂netinput_h2 / ∂w6)

• netinput_h5 = 1·w8 + netoutput_h1·w10 + netoutput_h2·w12

∂netinput_h5 / ∂netoutput_h2 = w12

• netinput_h4 = 1·w7 + netoutput_h1·w9 + netoutput_h2·w11

∂netinput_h4 / ∂netoutput_h2 = w11

• netinput_h2 = 1·w2 + i1·w4 + i2·w6

∂netinput_h2 / ∂w6 = i2

∂Error / ∂w6 = δ_y · [w15 · netoutput_h5 · (1 – netoutput_h5) · w12 + w14 · netoutput_h4 · (1 – netoutput_h4) · w11] · netoutput_h2 · (1 – netoutput_h2) · i2

∂Error / ∂w6 = [δ_y · w15 · netoutput_h5 · (1 – netoutput_h5) · w12 + δ_y · w14 · netoutput_h4 · (1 – netoutput_h4) · w11] · netoutput_h2 · (1 – netoutput_h2) · i2

∂Error / ∂w6 = [δ_h5 · netoutput_h5 · (1 – netoutput_h5) · w12 + δ_h4 · netoutput_h4 · (1 – netoutput_h4) · w11] · netoutput_h2 · (1 – netoutput_h2) · i2

If we name the whole bracketed term δ_h2, i.e. δ_h2 = δ_h5 · netoutput_h5 · (1 – netoutput_h5) · w12 + δ_h4 · netoutput_h4 · (1 – netoutput_h4) · w11, the equation becomes:

∂Error / ∂w6 = δ_h2 · netoutput_h2 · (1 – netoutput_h2) · i2

Similarly, the derivatives for weights w1 to w5 are calculated on the same principle, with δ_h1 defined analogously from the paths through w10 and w9.

∂Error / ∂w5 = [δ_h5 · netoutput_h5 · (1 – netoutput_h5) · w10 + δ_h4 · netoutput_h4 · (1 – netoutput_h4) · w9] · netoutput_h1 · (1 – netoutput_h1) · i2

∂Error / ∂w5 = δ_h1 · netoutput_h1 · (1 – netoutput_h1) · i2





∂Error / ∂w4 = [δ_h5 · netoutput_h5 · (1 – netoutput_h5) · w12 + δ_h4 · netoutput_h4 · (1 – netoutput_h4) · w11] · netoutput_h2 · (1 – netoutput_h2) · i1

∂Error / ∂w4 = δ_h2 · netoutput_h2 · (1 – netoutput_h2) · i1

∂Error / ∂w3 = [δ_h5 · netoutput_h5 · (1 – netoutput_h5) · w10 + δ_h4 · netoutput_h4 · (1 – netoutput_h4) · w9] · netoutput_h1 · (1 – netoutput_h1) · i1

∂Error / ∂w3 = δ_h1 · netoutput_h1 · (1 – netoutput_h1) · i1

∂Error / ∂w2 = [δ_h5 · netoutput_h5 · (1 – netoutput_h5) · w12 + δ_h4 · netoutput_h4 · (1 – netoutput_h4) · w11] · netoutput_h2 · (1 – netoutput_h2) · 1

∂Error / ∂w2 = δ_h2 · netoutput_h2 · (1 – netoutput_h2) · 1

∂Error / ∂w1 = [δ_h5 · netoutput_h5 · (1 – netoutput_h5) · w10 + δ_h4 · netoutput_h4 · (1 – netoutput_h4) · w9] · netoutput_h1 · (1 – netoutput_h1) · 1

∂Error / ∂w1 = δ_h1 · netoutput_h1 · (1 – netoutput_h1) · 1
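Continuing the sketch once more, δ_h2 and δ_h1 sum the contributions of both backward paths, and the derivatives for w1 to w6 follow:

```python
# error signals of the 1st hidden layer nodes: each accumulates both backward paths
delta_h2 = (delta_h5 * netoutput_h5 * (1 - netoutput_h5) * w["w12"]
            + delta_h4 * netoutput_h4 * (1 - netoutput_h4) * w["w11"])
delta_h1 = (delta_h5 * netoutput_h5 * (1 - netoutput_h5) * w["w10"]
            + delta_h4 * netoutput_h4 * (1 - netoutput_h4) * w["w9"])

gradients["w6"] = delta_h2 * netoutput_h2 * (1 - netoutput_h2) * i2
gradients["w5"] = delta_h1 * netoutput_h1 * (1 - netoutput_h1) * i2
gradients["w4"] = delta_h2 * netoutput_h2 * (1 - netoutput_h2) * i1
gradients["w3"] = delta_h1 * netoutput_h1 * (1 - netoutput_h1) * i1
gradients["w2"] = delta_h2 * netoutput_h2 * (1 - netoutput_h2) * 1  # bias -> h2
gradients["w1"] = delta_h1 * netoutput_h1 * (1 - netoutput_h1) * 1  # bias -> h1
```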

As you have probably realized, the derivative calculation for the hidden-layer weights can be written in a generalized form, as demonstrated below.

∂Error / ∂wi = δ_tonode · netoutput_tonode · (1 – netoutput_tonode) · netoutput_fromnode

Note that δ_y was defined to already include the netoutput_y · (1 – netoutput_y) term, so for the weights feeding the output node the formula reduces to ∂Error / ∂wi = δ_y · netoutput_fromnode.





After all, we have formulated the error reflections for all the weights. Now, we can update the weights with the stochastic gradient descent formula below, applied for every index i. In this equation, α refers to the learning rate and should be a small value (e.g. α = 0.1).

wi = wi – α · (∂Error / ∂wi)

Caution! All the derivative values (∂Error / ∂wi) must be calculated first; only then should the weight update formula be applied. If ∂Error / ∂w15 is calculated and w15 is updated immediately, gradient descent will fail, because the remaining derivatives would be computed with already-modified weights. The right approach is to calculate ∂Error / ∂w15, ∂Error / ∂w14, …, ∂Error / ∂w1 first, and only then update w15, w14, …, w1.
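In code, and again continuing the sketch above, the update is applied only after the whole gradients dictionary has been filled:

```python
alpha = 0.1  # learning rate

# every partial derivative is already stored in gradients; only now touch the weights
for key in w:
    w[key] = w[key] - alpha * gradients[key]
```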

In this post, we focused on the backpropagation algorithm to find the optimum weight values. We also investigated how errors are reflected to the weights and how the weights are updated based on the reflected errors. I’ve also shared a full implementation of the backpropagation algorithm on my GitHub profile, for both Java and Python.



