A Gentle Introduction to Cross-Entropy Loss Function

Neural networks produce multiple outputs in multi-class classification problems. However, they do not have the ability to produce exact outputs; they can only produce continuous results. We apply some additional steps to transform these continuous results into exact classification results.


Applying the softmax function normalizes the outputs to the scale of [0, 1]. Also, the sum of the outputs will always be equal to 1 when softmax is applied. After that, applying one hot encoding transforms the outputs into binary form. That's why softmax and one hot encoding are applied to the neural network's output layer respectively, and the one hot encoded output becomes the predicted classification result. Herein, the cross entropy function measures how well the softmax probabilities match the one hot encoded true labels.
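To make these steps concrete, here is a minimal sketch in Python. It assumes NumPy is available, and the raw scores are made-up values rather than the output of any particular network.

```python
import numpy as np

def softmax(scores):
    # subtract the max score for numerical stability
    exps = np.exp(scores - np.max(scores))
    return exps / np.sum(exps)

# made-up raw scores of a 3-class network
scores = np.array([2.0, 1.0, 0.1])

# softmax: every output in [0, 1] and the outputs sum to 1
probs = softmax(scores)

# one hot encoding of the probabilities: 1 at the largest probability, 0 elsewhere
prediction = np.zeros_like(probs)
prediction[np.argmax(probs)] = 1.0

print(probs)       # [0.659 0.242 0.099] (rounded)
print(prediction)  # [1. 0. 0.]
```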



Applying one hot encoding to probabilities

Cross Entropy Error Function

We need to know the derivative of the loss function to back-propagate. If the loss function were MSE, then its derivative would be easy to calculate (the difference between the expected and predicted outputs). Things become more complex when the error function is cross entropy.

E = – ∑ [ci . log(pi) + (1 – ci) . log(1 – pi)]

c refers to the one hot encoded classes (or labels) whereas p refers to the softmax applied probabilities. The base of the log is e in the equation above.

PS: some sources might define the function as E = – ∑ ci . log(pi).
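As a quick numeric illustration, the sketch below evaluates both definitions of the cross entropy for a hypothetical probability vector and a one hot encoded label; the values are arbitrary and only meant for demonstration.

```python
import numpy as np

# hypothetical softmax probabilities and one hot encoded true label
p = np.array([0.659, 0.242, 0.099])
c = np.array([1.0, 0.0, 0.0])

# E = -sum( ci*log(pi) + (1 - ci)*log(1 - pi) ), natural logarithm
E = -np.sum(c * np.log(p) + (1 - c) * np.log(1 - p))

# the simpler definition mentioned in the note: E = -sum( ci*log(pi) )
E_simple = -np.sum(c * np.log(p))

print(E)         # ≈ 0.80
print(E_simple)  # ≈ 0.42
```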

Derivative

Notice that softmax is applied to the calculated neural network scores to produce probabilities first. Then, cross entropy is applied to those softmax probabilities and the one hot encoded classes second. That's why we need to calculate the derivative of the total error with respect to each score.

∂E/∂scorei

Backward error calculation

We can apply the chain rule to calculate this derivative.

Chain rule

∂E/∂scorei = (∂E/∂pi) . (∂pi/∂scorei)





Let’s calculate these derivatives separately.

∂E/∂pi = ∂(– ∑ [ci . log(pi) + (1 – ci) . log(1 – pi)])/∂pi

Expanding the sum term

E = – (c1 . log(p1) + (1 – c1) . log(1 – p1)) – (c2 . log(p2) + (1 – c2) . log(1 – p2)) – … – (ci . log(pi) + (1 – ci) . log(1 – pi)) – … – (cn . log(pn) + (1 – cn) . log(1 – pn))

Now, we can differentiate the expanded term easily. Only the ith term of the equation, (ci . log(pi) + (1 – ci) . log(1 – pi)), has a non-zero derivative with respect to pi.

∂E/∂pi = ∂(– (ci . log(pi) + (1 – ci) . log(1 – pi)))/∂pi = – ∂(ci . log(pi))/∂pi – ∂((1 – ci) . log(1 – pi))/∂pi

Notice that the derivative of ln(x) is equal to 1/x.

– ∂(ci . log(pi))/∂pi – ∂((1 – ci) . log(1 – pi))/∂pi = – ci/pi – [(1 – ci)/(1 – pi)] . ∂(1 – pi)/∂pi = – ci/pi – [(1 – ci)/(1 – pi)] . (–1) = – ci/pi + (1 – ci)/(1 – pi)

∂E/∂pi = – ci/pi + (1 – ci)/(1 – pi)
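This partial derivative can be sanity checked numerically. The sketch below uses arbitrary values and treats the pi terms as independent variables, just like the derivation does, comparing a central finite difference against the formula above.

```python
import numpy as np

def cross_entropy(p, c):
    # E = -sum( ci*log(pi) + (1 - ci)*log(1 - pi) )
    return -np.sum(c * np.log(p) + (1 - c) * np.log(1 - p))

p = np.array([0.659, 0.242, 0.099])
c = np.array([1.0, 0.0, 0.0])

i, eps = 0, 1e-6
p_up, p_down = p.copy(), p.copy()
p_up[i] += eps
p_down[i] -= eps

numeric = (cross_entropy(p_up, c) - cross_entropy(p_down, c)) / (2 * eps)
analytic = -c[i] / p[i] + (1 - c[i]) / (1 - p[i])

print(numeric, analytic)  # both ≈ -1.517
```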

Now, it is time to calculate ∂pi/∂scorei. Fortunately, we’ve already calculated the derivative of the softmax function in a previous post.

∂pi/∂scorei = pi . (1 – pi)
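The same kind of finite difference check, again with made-up scores, confirms this result for the ith probability with respect to the ith score.

```python
import numpy as np

def softmax(scores):
    exps = np.exp(scores - np.max(scores))
    return exps / np.sum(exps)

scores = np.array([2.0, 1.0, 0.1])
i, eps = 0, 1e-6

up, down = scores.copy(), scores.copy()
up[i] += eps
down[i] -= eps

p = softmax(scores)
numeric = (softmax(up)[i] - softmax(down)[i]) / (2 * eps)
analytic = p[i] * (1 - p[i])

print(numeric, analytic)  # both ≈ 0.225
```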





Now, we can combine these equations.

∂E/∂scorei = (∂E/∂pi) . (∂pi/∂scorei)

∂E/∂pi = – ci/pi + (1 – ci)/(1 – pi)

∂pi/∂scorei = pi . (1 – pi)

∂E/∂scorei = [– ci/pi + (1 – ci)/(1 – pi)] . pi . (1 – pi)

∂E/∂scorei = (– ci/pi) . pi . (1 – pi) + [(1 – ci) . pi . (1 – pi)]/(1 – pi)

∂E/∂scorei = – ci + ci . pi + pi – ci . pi = – ci + pi = pi – ci

∂E/∂scorei = pi – ci

As seen, the derivative of the cross entropy error function is pretty simple.
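As a final sanity check, here is a sketch of a numerical gradient check. It uses the simpler definition mentioned in the note above, E = – ∑ ci . log(pi), applied on top of softmax, for which the gradient with respect to the scores is exactly pi – ci; the scores and labels are made up.

```python
import numpy as np

def softmax(scores):
    exps = np.exp(scores - np.max(scores))
    return exps / np.sum(exps)

def loss(scores, c):
    # E = -sum( ci * log(pi) ) with pi = softmax(scores)
    return -np.sum(c * np.log(softmax(scores)))

scores = np.array([2.0, 1.0, 0.1])
c = np.array([0.0, 1.0, 0.0])

# analytic gradient: pi - ci
analytic = softmax(scores) - c

# central finite differences over each score
eps = 1e-6
numeric = np.zeros_like(scores)
for i in range(len(scores)):
    up, down = scores.copy(), scores.copy()
    up[i] += eps
    down[i] -= eps
    numeric[i] = (loss(up, c) - loss(down, c)) / (2 * eps)

print(analytic)  # ≈ [ 0.659 -0.758  0.099]
print(numeric)   # matches the analytic gradient
```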

Let’s dance

These are the dance moves of the most common activation functions in deep learning. Make sure to turn the volume up 🙂






