
Softmax function
σ(x_j) = e^{x_j} / (∑_{i=1}^{n} e^{x_i}) (for j = 1, …, n)
First of all, softmax normalizes the input array to the scale [0, 1]. Also, the sum of the softmax outputs is always equal to 1. So, a neural network model classifies an instance as the class whose index has the maximum output.

For example, the following results are obtained when softmax is applied to the inputs (2, 1, 0.1) above.
1- σ(x_1) = e^{x_1} / (e^{x_1} + e^{x_2} + e^{x_3}) = e^2 / (e^2 + e^1 + e^{0.1}) ≈ 0.7
2- σ(x_2) = e^{x_2} / (e^{x_1} + e^{x_2} + e^{x_3}) = e^1 / (e^2 + e^1 + e^{0.1}) ≈ 0.2
3- σ(x_3) = e^{x_3} / (e^{x_1} + e^{x_2} + e^{x_3}) = e^{0.1} / (e^2 + e^1 + e^{0.1}) ≈ 0.1
Notice that the outputs lie in [0, 1]. Also, the sum of the results is equal to 0.7 + 0.2 + 0.1 = 1.
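A minimal NumPy sketch (the library choice is mine, the input vector is the example above) reproduces these numbers:

import numpy as np

def softmax(x):
    # shift by the maximum for numerical stability; the result is unchanged
    e = np.exp(x - np.max(x))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
outputs = softmax(scores)
print(outputs)        # ≈ [0.659, 0.242, 0.099], i.e. roughly 0.7, 0.2 and 0.1
print(outputs.sum())  # 1.0 (up to floating point)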
Derivative
Here, you might remember that the activation function is consumed in the feed-forward step whereas its derivative is consumed in the backpropagation step. Now, we will find its partial derivative.
The quotient rule states that if a function can be expressed as a division of two differentiable functions, then its derivative can be expressed as illustrated below.
f(x) = g(x) / h(x)
f'(x) = (g'(x)·h(x) − g(x)·h'(x)) / h(x)^2
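As a quick sanity check (purely illustrative, and assuming SymPy is available), the rule can be reproduced symbolically:

import sympy as sp

x = sp.symbols('x')
g = sp.Function('g')(x)
h = sp.Function('h')(x)

# d/dx [ g(x) / h(x) ] simplifies to (g'(x)·h(x) − g(x)·h'(x)) / h(x)^2
print(sp.simplify(sp.diff(g / h, x)))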
We can apply the quotient rule to the softmax function.
∂σ(x_j) / ∂x_j = [(e^{x_j})'·(∑_{i=1}^{n} e^{x_i}) − e^{x_j}·(∑_{i=1}^{n} e^{x_i})'] / (∑_{i=1}^{n} e^{x_i})^2
Firstly, you might remember that the derivative of e^x is e^x again.
Let's differentiate the sum term.
∑_{i=1}^{n} e^{x_i} = e^{x_1} + e^{x_2} + … + e^{x_j} + … + e^{x_n}
∂(∑_{i=1}^{n} e^{x_i}) / ∂x_j = 0 + 0 + … + e^{x_j} + … + 0
Now, we can plug these derivatives into the quotient-rule form of the softmax function.
∂σ(x_j) / ∂x_j = [e^{x_j}·(∑_{i=1}^{n} e^{x_i}) − e^{x_j}·e^{x_j}] / (∑_{i=1}^{n} e^{x_i})^2
Here, we can split the fraction and divide each term of the numerator by the denominator.
∂σ(x_j) / ∂x_j = e^{x_j}·(∑_{i=1}^{n} e^{x_i}) / (∑_{i=1}^{n} e^{x_i})^2 − e^{x_j}·e^{x_j} / (∑_{i=1}^{n} e^{x_i})^2
∂σ(x_j) / ∂x_j = e^{x_j} / (∑_{i=1}^{n} e^{x_i}) − e^{x_j}·e^{x_j} / (∑_{i=1}^{n} e^{x_i})^2
∂σ(x_j) / ∂x_j = e^{x_j} / (∑_{i=1}^{n} e^{x_i}) − [e^{x_j} / (∑_{i=1}^{n} e^{x_i})]^2
Catching a trick
Did you notice that the pure softmax function appears in the equation above? Let's substitute it back in.
σ(x_j) = e^{x_j} / (∑_{i=1}^{n} e^{x_i})
∂σ(x_j) / ∂x_j = σ(x_j) − σ(x_j)^2
∂σ(x_j) / ∂x_j = σ(x_j)·(1 − σ(x_j))
That is the derivative of σ(x_j) with respect to x_j.
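The result can be double-checked numerically with a central-difference approximation (a small sketch on the same example inputs; the helper and the step size are my own choices):

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

x = np.array([2.0, 1.0, 0.1])   # the example inputs from above
j, eps = 0, 1e-6

# analytical form derived above: σ(x_j)·(1 − σ(x_j))
s = softmax(x)
analytical = s[j] * (1 - s[j])

# numerical check: central difference on the j-th input
x_plus, x_minus = x.copy(), x.copy()
x_plus[j] += eps
x_minus[j] -= eps
numerical = (softmax(x_plus)[j] - softmax(x_minus)[j]) / (2 * eps)

print(analytical, numerical)    # both ≈ 0.2247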
So, what would be the derivative of σ(x_j) with respect to x_k when j is not equal to k? Again, we apply the quotient rule to the term.
σ(x_j) = e^{x_j} / (∑_{i=1}^{n} e^{x_i}) (for j = 1, …, n)
∂σ(x_j) / ∂x_k = [(e^{x_j})'·(∑_{i=1}^{n} e^{x_i}) − e^{x_j}·(∑_{i=1}^{n} e^{x_i})'] / (∑_{i=1}^{n} e^{x_i})^2
(e^{x_j})' = ∂e^{x_j} / ∂x_k = 0 (because e^{x_j} is a constant with respect to x_k)
∂σ(x_j) / ∂x_k = [0·(∑_{i=1}^{n} e^{x_i}) − e^{x_j}·(∑_{i=1}^{n} e^{x_i})'] / (∑_{i=1}^{n} e^{x_i})^2
∑_{i=1}^{n} e^{x_i} = e^{x_1} + e^{x_2} + … + e^{x_k} + … + e^{x_n}
(∑_{i=1}^{n} e^{x_i})' = ∂(∑_{i=1}^{n} e^{x_i}) / ∂x_k = 0 + 0 + … + e^{x_k} + … + 0 = e^{x_k}
∂σ(x_j) / ∂x_k = [0·(∑_{i=1}^{n} e^{x_i}) − e^{x_j}·e^{x_k}] / (∑_{i=1}^{n} e^{x_i})^2
∂σ(x_j) / ∂x_k = − e^{x_j}·e^{x_k} / (∑_{i=1}^{n} e^{x_i})^2
∂σ(x_j) / ∂x_k = − [e^{x_j}·e^{x_k}] / [(∑_{i=1}^{n} e^{x_i})·(∑_{i=1}^{n} e^{x_i})]
∂σ(x_j) / ∂x_k = − [e^{x_j} / (∑_{i=1}^{n} e^{x_i})]·[e^{x_k} / (∑_{i=1}^{n} e^{x_i})]
∂σ(x_j) / ∂x_k = − σ(x_j)·σ(x_k)
Putting derivatives in a general form
σ(x_j) = e^{x_j} / (∑_{i=1}^{n} e^{x_i}) (for j = 1, …, n)
∂σ(x_j) / ∂x_k = σ(x_j)·(1 − σ(x_j)), if j = k
∂σ(x_j) / ∂x_k = − σ(x_j)·σ(x_k), if j ≠ k
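In matrix form, these two cases make up the softmax Jacobian, diag(σ(x)) − σ(x)·σ(x)^T. A short sketch (helper names and tolerance are my own) compares it against a numerical Jacobian:

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def softmax_jacobian(x):
    # diagonal entries: σ(x_j)·(1 − σ(x_j)); off-diagonal entries: −σ(x_j)·σ(x_k)
    s = softmax(x)
    return np.diag(s) - np.outer(s, s)

x = np.array([2.0, 1.0, 0.1])
analytic = softmax_jacobian(x)

# numerical Jacobian via central differences; column k holds ∂σ/∂x_k
eps = 1e-6
numeric = np.zeros((len(x), len(x)))
for k in range(len(x)):
    x_plus, x_minus = x.copy(), x.copy()
    x_plus[k] += eps
    x_minus[k] -= eps
    numeric[:, k] = (softmax(x_plus) - softmax(x_minus)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))  # True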
So, surprisingly, the derivative of the softmax function is easy to derive. We mostly consume the softmax function in the final layer of convolutional neural networks, because CNNs are very good at classifying image-based data and classification studies mostly include more than 2 classes.
Let’s dance
These are the dance moves of the most common activation functions in deep learning. Make sure to turn the volume up 🙂
What are i and j? I am having trouble understanding.
In the first figure, there are 3 output classes with the values 2, 1 and 0.1. Here, j refers to one of these outputs at a time. I mean that:
if j == 1 then x[1] = 2
else if j == 2 then x[2] = 1
else if j == 3 then x[3] = 0.1
On the other hand, i runs over all of those outputs to build the sum. We create a for loop and i stores the index.
total = 0.0
for i in range(3):
    total = total + math.exp(x[i])
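For completeness, a small self-contained version of that loop (variable names are just illustrative) shows where i and j live:

import math

x = [2, 1, 0.1]          # the three output values from the figure

# i walks over every output to build the denominator of softmax
total = 0.0
for i in range(len(x)):
    total = total + math.exp(x[i])

# j selects the single output whose softmax value we want
for j in range(len(x)):
    print(j, math.exp(x[j]) / total)   # ≈ 0.66, 0.24 and 0.10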
Is this explanation clear?