
Softmax function
σ(x_j) = e^{x_j} / (∑_{i=1}^{n} e^{x_i}) (for j = 1, …, n)
First of all, softmax normalizes the input array to the scale [0, 1]. Also, the sum of the softmax outputs is always equal to 1. So, a neural network model classifies an instance as the class whose index has the maximum output.

For example, the following results are obtained when softmax is applied to the inputs (2, 1, 0.1) above.
1- σ(x_1) = e^{x_1} / (e^{x_1} + e^{x_2} + e^{x_3}) = e^2 / (e^2 + e^1 + e^{0.1}) ≈ 0.7
2- σ(x_2) = e^{x_2} / (e^{x_1} + e^{x_2} + e^{x_3}) = e^1 / (e^2 + e^1 + e^{0.1}) ≈ 0.2
3- σ(x_3) = e^{x_3} / (e^{x_1} + e^{x_2} + e^{x_3}) = e^{0.1} / (e^2 + e^1 + e^{0.1}) ≈ 0.1
Notice that the outputs lie in [0, 1]. Also, the sum of the results is equal to 0.7 + 0.2 + 0.1 = 1.
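A minimal NumPy sketch (the library choice is mine, the input vector is the example above) reproduces these numbers:

import numpy as np

def softmax(x):
    # shift by the maximum for numerical stability; the result is unchanged
    e = np.exp(x - np.max(x))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
outputs = softmax(scores)
print(outputs)        # ≈ [0.659, 0.242, 0.099], i.e. roughly 0.7, 0.2 and 0.1
print(outputs.sum())  # 1.0 (up to floating point)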
Derivative
Here, you might remember that the activation function is consumed in the feed-forward step whereas its derivative is consumed in the backpropagation step. Now, we will find its partial derivative.
The quotient rule states that if a function can be expressed as a division of two differentiable functions, then its derivative can be expressed as illustrated below.
f(x) = g(x) / h(x)
f'(x) = (g'(x)·h(x) − g(x)·h'(x)) / h(x)^2
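As a quick sanity check (purely illustrative, and assuming SymPy is available), the rule can be reproduced symbolically:

import sympy as sp

x = sp.symbols('x')
g = sp.Function('g')(x)
h = sp.Function('h')(x)

# d/dx [ g(x) / h(x) ] simplifies to (g'(x)·h(x) − g(x)·h'(x)) / h(x)^2
print(sp.simplify(sp.diff(g / h, x)))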
We can apply the quotient rule to the softmax function.
∂σ(x_j) / ∂x_j = [(e^{x_j})'·(∑_{i=1}^{n} e^{x_i}) − e^{x_j}·(∑_{i=1}^{n} e^{x_i})'] / (∑_{i=1}^{n} e^{x_i})^2
Firstly, you might remember that the derivative of e^x is e^x again.
Let's differentiate the sum term.
∑_{i=1}^{n} e^{x_i} = e^{x_1} + e^{x_2} + … + e^{x_j} + … + e^{x_n}
∂(∑_{i=1}^{n} e^{x_i}) / ∂x_j = 0 + 0 + … + e^{x_j} + … + 0
Now, we can plug these derivatives into the quotient-rule form of the softmax function.
∂σ(x_j) / ∂x_j = [e^{x_j}·(∑_{i=1}^{n} e^{x_i}) − e^{x_j}·e^{x_j}] / (∑_{i=1}^{n} e^{x_i})^2
Here, we can split the fraction and divide each term of the numerator by the denominator.
∂σ(x_j) / ∂x_j = e^{x_j}·(∑_{i=1}^{n} e^{x_i}) / (∑_{i=1}^{n} e^{x_i})^2 − e^{x_j}·e^{x_j} / (∑_{i=1}^{n} e^{x_i})^2
∂σ(x_j) / ∂x_j = e^{x_j} / (∑_{i=1}^{n} e^{x_i}) − e^{x_j}·e^{x_j} / (∑_{i=1}^{n} e^{x_i})^2
∂σ(x_j) / ∂x_j = e^{x_j} / (∑_{i=1}^{n} e^{x_i}) − [e^{x_j} / (∑_{i=1}^{n} e^{x_i})]^2
Catching a trick
Did you notice that the pure softmax function appears in the equation above? Let's substitute it back in.
σ(x_j) = e^{x_j} / (∑_{i=1}^{n} e^{x_i})
∂σ(x_j) / ∂x_j = σ(x_j) − σ(x_j)^2
∂σ(x_j) / ∂x_j = σ(x_j)·(1 − σ(x_j))
That is the derivative of σ(x_j) with respect to x_j.
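The result can be double-checked numerically with a central-difference approximation (a small sketch on the same example inputs; the helper and the step size are my own choices):

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

x = np.array([2.0, 1.0, 0.1])   # the example inputs from above
j, eps = 0, 1e-6

# analytical form derived above: σ(x_j)·(1 − σ(x_j))
s = softmax(x)
analytical = s[j] * (1 - s[j])

# numerical check: central difference on the j-th input
x_plus, x_minus = x.copy(), x.copy()
x_plus[j] += eps
x_minus[j] -= eps
numerical = (softmax(x_plus)[j] - softmax(x_minus)[j]) / (2 * eps)

print(analytical, numerical)    # both ≈ 0.2247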
So, what would be the derivative of σ(x_j) with respect to x_k when j is not equal to k? Again, we apply the quotient rule to the term.
σ(x_j) = e^{x_j} / (∑_{i=1}^{n} e^{x_i}) (for j = 1, …, n)
∂σ(x_j) / ∂x_k = [(e^{x_j})'·(∑_{i=1}^{n} e^{x_i}) − e^{x_j}·(∑_{i=1}^{n} e^{x_i})'] / (∑_{i=1}^{n} e^{x_i})^2
(e^{x_j})' = ∂e^{x_j} / ∂x_k = 0 (because e^{x_j} is a constant with respect to x_k)
∂σ(x_j) / ∂x_k = [0·(∑_{i=1}^{n} e^{x_i}) − e^{x_j}·(∑_{i=1}^{n} e^{x_i})'] / (∑_{i=1}^{n} e^{x_i})^2
∑_{i=1}^{n} e^{x_i} = e^{x_1} + e^{x_2} + … + e^{x_k} + … + e^{x_n}
(∑_{i=1}^{n} e^{x_i})' = ∂(∑_{i=1}^{n} e^{x_i}) / ∂x_k = 0 + 0 + … + e^{x_k} + … + 0 = e^{x_k}
∂σ(x_j) / ∂x_k = [0·(∑_{i=1}^{n} e^{x_i}) − e^{x_j}·e^{x_k}] / (∑_{i=1}^{n} e^{x_i})^2
∂σ(x_j) / ∂x_k = − e^{x_j}·e^{x_k} / (∑_{i=1}^{n} e^{x_i})^2
∂σ(x_j) / ∂x_k = − [e^{x_j}·e^{x_k}] / [(∑_{i=1}^{n} e^{x_i})·(∑_{i=1}^{n} e^{x_i})]
∂σ(x_j) / ∂x_k = − [e^{x_j} / (∑_{i=1}^{n} e^{x_i})]·[e^{x_k} / (∑_{i=1}^{n} e^{x_i})]
∂σ(x_j) / ∂x_k = − σ(x_j)·σ(x_k)
Putting derivatives in a general form
σ(x_j) = e^{x_j} / (∑_{i=1}^{n} e^{x_i}) (for j = 1, …, n)
∂σ(x_j) / ∂x_k = σ(x_j)·(1 − σ(x_j)), if j = k
∂σ(x_j) / ∂x_k = − σ(x_j)·σ(x_k), if j ≠ k
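In matrix form, these two cases make up the softmax Jacobian, diag(σ(x)) − σ(x)·σ(x)^T. A short sketch (helper names and tolerance are my own) compares it against a numerical Jacobian:

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def softmax_jacobian(x):
    # diagonal entries: σ(x_j)·(1 − σ(x_j)); off-diagonal entries: −σ(x_j)·σ(x_k)
    s = softmax(x)
    return np.diag(s) - np.outer(s, s)

x = np.array([2.0, 1.0, 0.1])
analytic = softmax_jacobian(x)

# numerical Jacobian via central differences; column k holds ∂σ/∂x_k
eps = 1e-6
numeric = np.zeros((len(x), len(x)))
for k in range(len(x)):
    x_plus, x_minus = x.copy(), x.copy()
    x_plus[k] += eps
    x_minus[k] -= eps
    numeric[:, k] = (softmax(x_plus) - softmax(x_minus)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))  # True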
So, surprisingly, the derivative of the softmax function is easy to derive. We mostly consume the softmax function in the final layer of convolutional neural networks, because CNNs are very good at classifying image-based data and classification studies mostly include more than 2 classes.
Let’s dance
These are the dance moves of the most common activation functions in deep learning. Make sure to turn the volume up 🙂
What are i and j? I am having trouble understanding.
In the first figure, there are 3 output classes with the values 2, 1 and 0.1. Here, j refers to one of these outputs at a time. I mean that:
if j == 1 then x[1] = 2
else if j == 2 then x[2] = 1
else if j == 3 then x[3] = 0.1
On the other hand, i runs over all of those outputs to build the sum. We create a for loop and i stores the index.
total = 0.0
for i in range(3):
    total = total + math.exp(x[i])
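For completeness, a small self-contained version of that loop (variable names are just illustrative) shows where i and j live:

import math

x = [2, 1, 0.1]          # the three output values from the figure

# i walks over every output to build the denominator of softmax
total = 0.0
for i in range(len(x)):
    total = total + math.exp(x[i])

# j selects the single output whose softmax value we want
for j in range(len(x)):
    print(j, math.exp(x[j]) / total)   # ≈ 0.66, 0.24 and 0.10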
Is this explanation clear?