The Math Behind Neural Networks Learning with Backpropagation

Neural networks are one of the most powerful machine learning algorithms. However, the mathematics behind them can be confusing at first because of the long chains of calculations. In this post, we walk through the math behind the neural network learning algorithm, backpropagation.



Perceptron

Backpropagation is a way to train multilayer perceptrons, more widely known as neural networks. The legacy form of a neural network is the regular perceptron. To understand backpropagation better, first see what perceptrons are.

Perceptrons can learn linear problems such as the AND or OR gate. However, they fail against non-linear problems such as the XOR gate. That’s why we need multilayer perceptrons and their training method, backpropagation.

Multilayer Perceptron

Backpropagation is a very common algorithm for implementing neural network learning. It basically applies the following steps to every historical instance. First, forward propagation is applied (left to right) to compute the network output; that is the forecast value, whereas the actual value is already known. Second, the difference between the forecast and the actual value is calculated and is called the error. Third, the error is reflected to all the weights, and the weights are updated based on the calculated error. Finally, these steps are repeated for a chosen number of epochs (e.g. epoch = 1000), as sketched in the loop below.
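To make these four steps concrete, here is a minimal training-loop sketch in Python. It is only a skeleton: the forward and backward functions are hypothetical placeholders whose details are worked out in the rest of the post, and weights is assumed to be a dictionary keyed by weight name.

```python
def train(instances, weights, forward, backward, alpha=0.1, epochs=1000):
    # instances: list of (inputs, actual) pairs from historical data
    for epoch in range(epochs):
        for inputs, actual in instances:
            prediction = forward(inputs, weights)           # 1. feed forward (left to right)
            error = (actual - prediction) ** 2 / 2          # 2. difference of forecast and actual
            gradients = backward(inputs, actual, weights)   # 3. reflect the error to every weight
            for key in weights:                             # 4. update weights from the error
                weights[key] = weights[key] - alpha * gradients[key]
            # error can be tracked here to monitor convergence
    return weights
```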

We’ll work on the 4-layered network illustrated below. Two nodes exist in the input layer and in each of the two hidden layers, whereas the output layer consists of a single node. In addition, bias units (+1) appear on the input and hidden layers.

[Figure: the example neural network model]

Weight initialization

The backpropagation algorithm aims to find the optimum weight values that produce the output with minimum error. It requires applying forward propagation first. Herein, the weights must be initialized randomly so that an output can be calculated at all. The calculated output is then compared with the actual output, and the difference is reflected back to the weights as error.
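For instance, the 15 weights (w1 to w15) used in the following sections can be initialized with small random values. This is just a sketch; the (-1, 1) range here is an arbitrary choice, not a requirement.

```python
import random

# randomly initialize the 15 weights of the example network (w1 .. w15)
weights = {"w" + str(i): random.uniform(-1.0, 1.0) for i in range(1, 16)}
```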

Forward Propagation

netinput_h1 = 1·w1 + i1·w3 + i2·w5

netoutput_h1 = sigmoid(netinput_h1) = 1 / (1 + e^(-netinput_h1))

Sigmoid is one of the most common activation functions, but it is not a must; you may check other alternatives. Suppose that sigmoid is the activation function throughout this post.

netinput_h2 = 1·w2 + i1·w4 + i2·w6

netoutput_h2 = sigmoid(netinput_h2)

netinput_h4 = 1·w7 + netoutput_h1·w9 + netoutput_h2·w11

netoutput_h4 = sigmoid(netinput_h4)

netinput_h5 = 1·w8 + netoutput_h1·w10 + netoutput_h2·w12

netoutput_h5 = sigmoid(netinput_h5)

netinput_y = 1·w13 + netoutput_h4·w14 + netoutput_h5·w15

netoutput_y = sigmoid(netinput_y)
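The forward pass above translates directly into Python. The snippet below is a self-contained sketch: the input values i1 and i2 are arbitrary example values, and the weights are initialized randomly as described earlier.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# arbitrary example inputs and randomly initialized weights
i1, i2 = 0.0, 1.0
w = {"w" + str(i): random.uniform(-1.0, 1.0) for i in range(1, 16)}

# 1st hidden layer (the leading 1 * ... terms come from the bias unit)
netinput_h1 = 1 * w["w1"] + i1 * w["w3"] + i2 * w["w5"]
netoutput_h1 = sigmoid(netinput_h1)
netinput_h2 = 1 * w["w2"] + i1 * w["w4"] + i2 * w["w6"]
netoutput_h2 = sigmoid(netinput_h2)

# 2nd hidden layer
netinput_h4 = 1 * w["w7"] + netoutput_h1 * w["w9"] + netoutput_h2 * w["w11"]
netoutput_h4 = sigmoid(netinput_h4)
netinput_h5 = 1 * w["w8"] + netoutput_h1 * w["w10"] + netoutput_h2 * w["w12"]
netoutput_h5 = sigmoid(netinput_h5)

# output layer
netinput_y = 1 * w["w13"] + netoutput_h4 * w["w14"] + netoutput_h5 * w["w15"]
netoutput_y = sigmoid(netinput_y)
```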





Error calculation

The error of the network is calculated by the following formula.

Error = (actual_y – netoutput_y)² / 2
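Continuing the forward-propagation sketch above, and assuming a known label actual_y (for example 1 for an XOR-style instance), the error is one line:

```python
actual_y = 1.0  # assumed known label for this instance
error = (actual_y - netoutput_y) ** 2 / 2
```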

The relation with the derivative

Remember the definition of the derivative: what happens to the output when you change the input by as small an amount as possible? Herein, we initialized all weights randomly, and that causes an error. We need to find the optimum weights. So what if we ask the following question: what happens to the error when we change the i-indexed weight by as small an amount as possible?

[Figure: definition of the derivative]

This is the derivative of the error with respect to wi, or simply ∂Error / ∂wi.

Tip: the derivative of this error function has a very simple form, which is one reason it is picked as the error function. Backpropagation aims to drive this error value to its minimum.
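This question can also be checked numerically: nudge a single weight by a tiny amount and watch how the error changes. The helper below is a generic sketch; error_fn stands for the whole network's error as a function of one weight, and the lambda is just a made-up stand-in for illustration.

```python
def numeric_gradient(error_fn, w_i, h=1e-6):
    # central finite difference: (E(w + h) - E(w - h)) / (2h) approximates dE/dw
    return (error_fn(w_i + h) - error_fn(w_i - h)) / (2 * h)

# illustrative stand-in: error as a function of a single weight, everything else held fixed
example_error = lambda w: (1.0 - 0.5 * w) ** 2 / 2
print(numeric_gradient(example_error, 0.8))  # approximately -0.3
```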

Backpropagation

Weights indicate the success of the network. The backpropagation algorithm looks for the optimum weights based on previous experiences. That’s why the error calculated in the previous step is reflected to all the weights; thus, the weights can be updated based on previous errors.

For instance, how much of the calculated error should be reflected to w15? That’s the question. In other words, how do we calculate ∂Error / ∂w15? The chain rule helps us calculate it.

[Figure: chain-rule based error reflection to w15]

∂Error / ∂w15 = (∂Error / ∂netoutput_y) · (∂netoutput_y / ∂netinput_y) · (∂netinput_y / ∂w15)

Let’s calculate the three multipliers of this equation.

• Error = (actual_y – netoutput_y)² / 2

∂Error / ∂netoutput_y = [2 · (actual_y – netoutput_y)^(2-1) / 2] · (-1)

∂Error / ∂netoutput_y = netoutput_y – actual_y

• netoutput_y = 1 / (1 + e^(-netinput_y))

The derivative of the sigmoid function is easy to derive, as demonstrated in a previous post.

∂netoutput_y / ∂netinput_y = netoutput_y · (1 – netoutput_y)

If the activation function you picked is different from sigmoid, you should derive it accordingly.

• netinput_y = 1·w13 + netoutput_h4·w14 + netoutput_h5·w15

∂netinput_y / ∂w15 = netoutput_h5

• ∂Error / ∂w15 = (netoutput_y – actual_y) · netoutput_y · (1 – netoutput_y) · netoutput_h5

We can denote the error of the output node as δ_y instead of (netoutput_y – actual_y) · netoutput_y · (1 – netoutput_y). Then the equation transforms into the following form.





• ∂Error / ∂w15 = δ_y · netoutput_h5

The other weights connected from the 2nd hidden layer to the output layer (w14, w13) are calculated on the same principle.

• ∂Error / ∂w14 = δ_y · netoutput_h4

• ∂Error / ∂w13 = δ_y · 1
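In code, and continuing the earlier sketches (so netoutput_y, netoutput_h4, netoutput_h5 and actual_y are assumed to exist already), the output-layer derivatives are:

```python
# error signal of the output node
delta_y = (netoutput_y - actual_y) * netoutput_y * (1 - netoutput_y)

gradients = {}
gradients["w15"] = delta_y * netoutput_h5
gradients["w14"] = delta_y * netoutput_h4
gradients["w13"] = delta_y * 1  # the bias unit always feeds 1
```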

Let’s move one layer to the left and make the calculations for w7 to w12 (the weights between the 1st and the 2nd hidden layer).

[Figure: reflecting the error to w12]

∂Error / ∂w12 = (∂Error / ∂netoutput_y) · (∂netoutput_y / ∂netinput_y) · (∂netinput_y / ∂netoutput_h5) · (∂netoutput_h5 / ∂netinput_h5) · (∂netinput_h5 / ∂w12)

• ∂Error / ∂netoutput_y = netoutput_y – actual_y (already calculated)

• ∂netoutput_y / ∂netinput_y = netoutput_y · (1 – netoutput_y) (already calculated)

• netinput_y = 1·w13 + netoutput_h4·w14 + netoutput_h5·w15

∂netinput_y / ∂netoutput_h5 = w15





• netoutput_h5 = 1 / (1 + e^(-netinput_h5))

∂netoutput_h5 / ∂netinput_h5 = netoutput_h5 · (1 – netoutput_h5)

• netinput_h5 = 1·w8 + netoutput_h1·w10 + netoutput_h2·w12

∂netinput_h5 / ∂w12 = netoutput_h2

To sum up, the reflection of the total error to w12 is shown below:

∂Error / ∂w12 = (netoutput_y – actual_y) · netoutput_y · (1 – netoutput_y) · w15 · netoutput_h5 · (1 – netoutput_h5) · netoutput_h2

∂Error / ∂w12 = δ_y · w15 · netoutput_h5 · (1 – netoutput_h5) · netoutput_h2

We can use the term δ_h5 instead of δ_y · w15 (and, analogously, δ_h4 instead of δ_y · w14); then the equation transforms into:

∂Error / ∂w12 = δ_h5 · netoutput_h5 · (1 – netoutput_h5) · netoutput_h2

After that, the derivatives for weights w7 to w11 are calculated similarly.





∂Error / ∂w11 = δ_h4 · netoutput_h4 · (1 – netoutput_h4) · netoutput_h2

∂Error / ∂w10 = δ_h5 · netoutput_h5 · (1 – netoutput_h5) · netoutput_h1

∂Error / ∂w9 = δ_h4 · netoutput_h4 · (1 – netoutput_h4) · netoutput_h1

∂Error / ∂w8 = δ_h5 · netoutput_h5 · (1 – netoutput_h5) · 1

∂Error / ∂w7 = δ_h4 · netoutput_h4 · (1 – netoutput_h4) · 1
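Continuing the same sketch (the w dictionary and the netoutput values come from the forward-pass code above), the error signals δ_h5 and δ_h4 and the derivatives for w7 to w12 look like this:

```python
# error signals of the 2nd hidden layer nodes (delta_y propagated back over w15 and w14)
delta_h5 = delta_y * w["w15"]
delta_h4 = delta_y * w["w14"]

gradients["w12"] = delta_h5 * netoutput_h5 * (1 - netoutput_h5) * netoutput_h2
gradients["w11"] = delta_h4 * netoutput_h4 * (1 - netoutput_h4) * netoutput_h2
gradients["w10"] = delta_h5 * netoutput_h5 * (1 - netoutput_h5) * netoutput_h1
gradients["w9"] = delta_h4 * netoutput_h4 * (1 - netoutput_h4) * netoutput_h1
gradients["w8"] = delta_h5 * netoutput_h5 * (1 - netoutput_h5) * 1  # bias -> h5
gradients["w7"] = delta_h4 * netoutput_h4 * (1 - netoutput_h4) * 1  # bias -> h4
```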

Finally, we can generalize the formulas for the weights connected from the input layer to the 1st hidden layer.

[Figure: error reflection to w6]

∂Error / ∂w6 = (∂Error / ∂netoutput_y) · (∂netoutput_y / ∂netinput_y) · [(∂netinput_y / ∂netoutput_h5) · (∂netoutput_h5 / ∂netinput_h5) · (∂netinput_h5 / ∂netoutput_h2) + (∂netinput_y / ∂netoutput_h4) · (∂netoutput_h4 / ∂netinput_h4) · (∂netinput_h4 / ∂netoutput_h2)] · (∂netoutput_h2 / ∂netinput_h2) · (∂netinput_h2 / ∂w6)

• netinput_h5 = 1·w8 + netoutput_h1·w10 + netoutput_h2·w12

∂netinput_h5 / ∂netoutput_h2 = w12

• netinput_h4 = 1·w7 + netoutput_h1·w9 + netoutput_h2·w11

∂netinput_h4 / ∂netoutput_h2 = w11

• netinput_h2 = 1·w2 + i1·w4 + i2·w6

∂netinput_h2 / ∂w6 = i2

∂Error / ∂w6 = δ_y · [w15 · netoutput_h5 · (1 – netoutput_h5) · w12 + w14 · netoutput_h4 · (1 – netoutput_h4) · w11] · netoutput_h2 · (1 – netoutput_h2) · i2

∂Error / ∂w6 = [δ_y · w15 · netoutput_h5 · (1 – netoutput_h5) · w12 + δ_y · w14 · netoutput_h4 · (1 – netoutput_h4) · w11] · netoutput_h2 · (1 – netoutput_h2) · i2

∂Error / ∂w6 = [δ_h5 · netoutput_h5 · (1 – netoutput_h5) · w12 + δ_h4 · netoutput_h4 · (1 – netoutput_h4) · w11] · netoutput_h2 · (1 – netoutput_h2) · i2

If we name the whole bracketed term δ_h2, i.e. δ_h2 = δ_h5 · netoutput_h5 · (1 – netoutput_h5) · w12 + δ_h4 · netoutput_h4 · (1 – netoutput_h4) · w11, the equation becomes:

∂Error / ∂w6 = δ_h2 · netoutput_h2 · (1 – netoutput_h2) · i2

Similarly, the derivatives for weights w1 to w5 are calculated on the same principle, with δ_h1 defined analogously from the paths through w10 and w9.

∂Error / ∂w5 = [δ_h5 · netoutput_h5 · (1 – netoutput_h5) · w10 + δ_h4 · netoutput_h4 · (1 – netoutput_h4) · w9] · netoutput_h1 · (1 – netoutput_h1) · i2

∂Error / ∂w5 = δ_h1 · netoutput_h1 · (1 – netoutput_h1) · i2





∂Error / ∂w4 = [δ_h5 · netoutput_h5 · (1 – netoutput_h5) · w12 + δ_h4 · netoutput_h4 · (1 – netoutput_h4) · w11] · netoutput_h2 · (1 – netoutput_h2) · i1

∂Error / ∂w4 = δ_h2 · netoutput_h2 · (1 – netoutput_h2) · i1

∂Error / ∂w3 = [δ_h5 · netoutput_h5 · (1 – netoutput_h5) · w10 + δ_h4 · netoutput_h4 · (1 – netoutput_h4) · w9] · netoutput_h1 · (1 – netoutput_h1) · i1

∂Error / ∂w3 = δ_h1 · netoutput_h1 · (1 – netoutput_h1) · i1

∂Error / ∂w2 = [δ_h5 · netoutput_h5 · (1 – netoutput_h5) · w12 + δ_h4 · netoutput_h4 · (1 – netoutput_h4) · w11] · netoutput_h2 · (1 – netoutput_h2) · 1

∂Error / ∂w2 = δ_h2 · netoutput_h2 · (1 – netoutput_h2) · 1

∂Error / ∂w1 = [δ_h5 · netoutput_h5 · (1 – netoutput_h5) · w10 + δ_h4 · netoutput_h4 · (1 – netoutput_h4) · w9] · netoutput_h1 · (1 – netoutput_h1) · 1

∂Error / ∂w1 = δ_h1 · netoutput_h1 · (1 – netoutput_h1) · 1
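Continuing the sketch once more, δ_h2 and δ_h1 sum the contributions of both backward paths, and the derivatives for w1 to w6 follow:

```python
# error signals of the 1st hidden layer nodes: each accumulates both backward paths
delta_h2 = (delta_h5 * netoutput_h5 * (1 - netoutput_h5) * w["w12"]
            + delta_h4 * netoutput_h4 * (1 - netoutput_h4) * w["w11"])
delta_h1 = (delta_h5 * netoutput_h5 * (1 - netoutput_h5) * w["w10"]
            + delta_h4 * netoutput_h4 * (1 - netoutput_h4) * w["w9"])

gradients["w6"] = delta_h2 * netoutput_h2 * (1 - netoutput_h2) * i2
gradients["w5"] = delta_h1 * netoutput_h1 * (1 - netoutput_h1) * i2
gradients["w4"] = delta_h2 * netoutput_h2 * (1 - netoutput_h2) * i1
gradients["w3"] = delta_h1 * netoutput_h1 * (1 - netoutput_h1) * i1
gradients["w2"] = delta_h2 * netoutput_h2 * (1 - netoutput_h2) * 1  # bias -> h2
gradients["w1"] = delta_h1 * netoutput_h1 * (1 - netoutput_h1) * 1  # bias -> h1
```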

As you have probably realized, the derivative calculation for the hidden-layer weights can be written in a generalized form, as demonstrated below.

∂Error / ∂wi = δ_tonode · netoutput_tonode · (1 – netoutput_tonode) · netoutput_fromnode

Note that δ_y was defined to already include the netoutput_y · (1 – netoutput_y) term, so for the weights feeding the output node the formula reduces to ∂Error / ∂wi = δ_y · netoutput_fromnode.





After all, we have formulated the error reflections for all the weights. Now, we can update the weights with the stochastic gradient descent formula below, applied for every index i. In this equation, α refers to the learning rate and should be a small value (e.g. α = 0.1).

wi = wi – α · (∂Error / ∂wi)

Caution! All the derivative values (∂Error / ∂wi) must be calculated first; only then should the weight update formula be applied. If ∂Error / ∂w15 is calculated and w15 is updated immediately, gradient descent will fail, because the remaining derivatives would be computed with already-modified weights. The right approach is to calculate ∂Error / ∂w15, ∂Error / ∂w14, …, ∂Error / ∂w1 first, and only then update w15, w14, …, w1.
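In code, and again continuing the sketch above, the update is applied only after the whole gradients dictionary has been filled:

```python
alpha = 0.1  # learning rate

# every partial derivative is already stored in gradients; only now touch the weights
for key in w:
    w[key] = w[key] - alpha * gradients[key]
```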

In this post, we focused on the backpropagation algorithm to find the optimum weight values. We also investigated how errors are reflected to the weights and how the weights are updated based on the reflected errors. I’ve also shared a full implementation of the backpropagation algorithm on my GitHub profile, for both Java and Python.



