Softmax as a Neural Networks Activation Function

Convolutional neural networks have popularized softmax as an activation function. However, softmax is not a traditional activation function. Other activation functions produce a single output for a single input, whereas softmax produces multiple outputs for an input array. For this reason, we can build neural network models that classify more than 2 classes instead of being limited to a binary solution.



Softmax function

σ(x_j) = e^(x_j) / Σ_{i=1}^{n} e^(x_i)  (for j = 1 to n)

First of all, softmax normalizes the input array to the scale [0, 1]. Also, the sum of the softmax outputs is always equal to 1. So, the neural network model classifies an instance as the class whose index has the maximum output.

[Figure: Softmax function applied to the inputs x = [2, 1, 0.1]]

For example, the following results are obtained when softmax is applied to the inputs above.

1- σ(x_1) = e^(x_1) / (e^(x_1) + e^(x_2) + e^(x_3)) = e^2 / (e^2 + e^1 + e^0.1) ≈ 0.7

2- σ(x_2) = e^(x_2) / (e^(x_1) + e^(x_2) + e^(x_3)) = e^1 / (e^2 + e^1 + e^0.1) ≈ 0.2

3- σ(x_3) = e^(x_3) / (e^(x_1) + e^(x_2) + e^(x_3)) = e^0.1 / (e^2 + e^1 + e^0.1) ≈ 0.1

Notice that the outputs are normalized to the scale [0, 1]. Also, the sum of the results is equal to 0.7 + 0.2 + 0.1 = 1.
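
The same computation can be reproduced with a few lines of NumPy. This is a minimal sketch, not code from the original post; the function name softmax and the input values are illustrative.

import numpy as np

def softmax(x):
    # exponentiate every input and normalize by the sum of exponentials
    exps = np.exp(x)
    return exps / np.sum(exps)

x = np.array([2.0, 1.0, 0.1])
probs = softmax(x)

print(probs)             # approximately [0.659, 0.242, 0.099] -> roughly 0.7, 0.2 and 0.1
print(probs.sum())       # 1.0
print(np.argmax(probs))  # 0 -> the class with the maximum output is predicted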

Derivative

Herein, you might remember that the activation function is consumed in the feed-forward step, whereas its derivative is consumed in the backpropagation step. Now, we will find its partial derivative.

The quotient rule states that if a function can be expressed as the division of two differentiable functions, then its derivative can be expressed as illustrated below.

f(x) = g(x) / h(x)

f'(x) = (g'(x)·h(x) - g(x)·h'(x)) / h(x)^2
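
As a side check, the quotient rule itself can be verified symbolically. The sympy sketch below is only an illustration and assumes generic differentiable functions g and h.

import sympy as sp

x = sp.symbols('x')
g = sp.Function('g')(x)
h = sp.Function('h')(x)

# differentiate the ratio; sympy returns an expression equivalent to
# (g'(x) * h(x) - g(x) * h'(x)) / h(x)^2
print(sp.simplify(sp.diff(g / h, x)))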

We can apply the quotient rule to the softmax function.

∂σ(x_j) / ∂x_j = [(e^(x_j))' · Σ_{i=1}^{n} e^(x_i) - e^(x_j) · (Σ_{i=1}^{n} e^(x_i))'] / (Σ_{i=1}^{n} e^(x_i))^2

Firstly, you might remember that the derivative of e^x is e^x again.

Let's differentiate the sum term.

Σ_{i=1}^{n} e^(x_i) = e^(x_1) + e^(x_2) + … + e^(x_j) + … + e^(x_n)

∂(Σ_{i=1}^{n} e^(x_i)) / ∂x_j = 0 + 0 + … + e^(x_j) + … + 0 = e^(x_j)
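
The sympy sketch below confirms that only the j-th term of the sum survives the differentiation; n = 3 is assumed purely for illustration.

import sympy as sp

# three symbolic inputs (n = 3 is an assumption made only for this illustration)
x1, x2, x3 = sp.symbols('x1 x2 x3')
denominator = sp.exp(x1) + sp.exp(x2) + sp.exp(x3)

# differentiating the sum with respect to x2 leaves only exp(x2)
print(sp.diff(denominator, x2))  # exp(x2)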

Now, we can apply these derivatives to the quotient-rule expansion of the softmax function.

∂σ(x_j) / ∂x_j = [e^(x_j) · Σ_{i=1}^{n} e^(x_i) - e^(x_j) · e^(x_j)] / (Σ_{i=1}^{n} e^(x_i))^2

Here, we can split the fraction by dividing each term of the numerator by the denominator.

∂σ(x_j) / ∂x_j = e^(x_j) · Σ_{i=1}^{n} e^(x_i) / (Σ_{i=1}^{n} e^(x_i))^2 - e^(x_j) · e^(x_j) / (Σ_{i=1}^{n} e^(x_i))^2

∂σ(x_j) / ∂x_j = e^(x_j) / Σ_{i=1}^{n} e^(x_i) - e^(x_j) · e^(x_j) / (Σ_{i=1}^{n} e^(x_i))^2

∂σ(x_j) / ∂x_j = e^(x_j) / Σ_{i=1}^{n} e^(x_i) - [e^(x_j) / Σ_{i=1}^{n} e^(x_i)]^2

Catching a trick

Do you notice that the pure softmax function appears in the equation above? Let's substitute it.

σ(x_j) = e^(x_j) / Σ_{i=1}^{n} e^(x_i)

∂σ(x_j) / ∂x_j = σ(x_j) - σ(x_j)^2

∂σ(x_j) / ∂x_j = σ(x_j) · (1 - σ(x_j))

That is the derivative of σ(x_j) with respect to x_j.
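
As a quick numerical check, the finite-difference sketch below compares this formula against a small perturbation of x_j; the inputs [2, 1, 0.1] and the step size are illustrative assumptions, not values from the derivation itself.

import numpy as np

def softmax(x):
    exps = np.exp(x)
    return exps / np.sum(exps)

x = np.array([2.0, 1.0, 0.1])
j, eps = 0, 1e-6

s = softmax(x)
analytical = s[j] * (1.0 - s[j])               # sigma(x_j) * (1 - sigma(x_j))

x_plus = x.copy()
x_plus[j] += eps                               # nudge x_j and watch how sigma(x_j) changes
numerical = (softmax(x_plus)[j] - s[j]) / eps

print(analytical, numerical)                   # both close to 0.225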

So, what would be the derivative of σ(x_j) with respect to x_k when j is not equal to k? Again, we apply the quotient rule to the term.

σ(x_j) = e^(x_j) / Σ_{i=1}^{n} e^(x_i)  (for j = 1 to n)

∂σ(x_j) / ∂x_k = [(e^(x_j))' · Σ_{i=1}^{n} e^(x_i) - e^(x_j) · (Σ_{i=1}^{n} e^(x_i))'] / (Σ_{i=1}^{n} e^(x_i))^2

(e^(x_j))' = ∂e^(x_j) / ∂x_k = 0 (because e^(x_j) is constant with respect to x_k)

∂σ(x_j) / ∂x_k = [0 · Σ_{i=1}^{n} e^(x_i) - e^(x_j) · (Σ_{i=1}^{n} e^(x_i))'] / (Σ_{i=1}^{n} e^(x_i))^2

Σ_{i=1}^{n} e^(x_i) = e^(x_1) + e^(x_2) + … + e^(x_k) + … + e^(x_n)

(Σ_{i=1}^{n} e^(x_i))' = ∂(Σ_{i=1}^{n} e^(x_i)) / ∂x_k = 0 + 0 + … + e^(x_k) + … + 0 = e^(x_k)

∂σ(x_j) / ∂x_k = [0 · Σ_{i=1}^{n} e^(x_i) - e^(x_j) · e^(x_k)] / (Σ_{i=1}^{n} e^(x_i))^2

∂σ(x_j) / ∂x_k = - e^(x_j) · e^(x_k) / (Σ_{i=1}^{n} e^(x_i))^2

∂σ(x_j) / ∂x_k = - [e^(x_j) · e^(x_k)] / [Σ_{i=1}^{n} e^(x_i) · Σ_{i=1}^{n} e^(x_i)]

∂σ(x_j) / ∂x_k = - [e^(x_j) / Σ_{i=1}^{n} e^(x_i)] · [e^(x_k) / Σ_{i=1}^{n} e^(x_i)]

∂σ(x_j) / ∂x_k = - σ(x_j) · σ(x_k)
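
The same kind of finite-difference check works for this off-diagonal case; again, the inputs, indices and step size below are just illustrative assumptions.

import numpy as np

def softmax(x):
    exps = np.exp(x)
    return exps / np.sum(exps)

x = np.array([2.0, 1.0, 0.1])
j, k, eps = 0, 1, 1e-6

s = softmax(x)
analytical = -s[j] * s[k]                      # -sigma(x_j) * sigma(x_k)

x_plus = x.copy()
x_plus[k] += eps                               # nudge x_k and watch how sigma(x_j) changes
numerical = (softmax(x_plus)[j] - s[j]) / eps

print(analytical, numerical)                   # both close to -0.16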

Putting derivatives in a general form

σ(x_j) = e^(x_j) / Σ_{i=1}^{n} e^(x_i)  (for j = 1 to n)

∂σ(x_j) / ∂x_k = σ(x_j) · (1 - σ(x_j)), if j = k

∂σ(x_j) / ∂x_k = - σ(x_j) · σ(x_k), if j ≠ k
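
Collecting both cases, these derivatives form an n x n Jacobian matrix. The helper below is a minimal NumPy sketch of that idea; softmax_jacobian is a hypothetical name, not a function from the original post.

import numpy as np

def softmax(x):
    exps = np.exp(x)
    return exps / np.sum(exps)

def softmax_jacobian(x):
    # J[j, k] = sigma(x_j) * (1 - sigma(x_j))  if j == k
    # J[j, k] = -sigma(x_j) * sigma(x_k)       if j != k
    s = softmax(x)
    return np.diag(s) - np.outer(s, s)

x = np.array([2.0, 1.0, 0.1])
print(softmax_jacobian(x))

The diagonal entries of this matrix follow the j = k case, and every off-diagonal entry follows the j ≠ k case.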

So, the derivative of the softmax function is surprisingly easy to demonstrate. Moreover, we mostly consume the softmax function in the final layer of convolutional neural networks, because CNNs are very good at classifying images and those classification studies mostly include more than 2 classes.

Let’s dance

These are the dance moves of the most common activation functions in deep learning. Be sure to turn the volume up 🙂




7 Comments

    1. In this first figure, there are 3 output classes labeled 2, 1 and 0.1. Here, j refers to each of these outputs individually. I mean that:

      if j == 1 then x[1] = 2
      else if j == 2 then x[2] = 1
      else if j == 3 then x[3] = 0.1

      On the other hand, i iterates over all of those outputs. We create a for loop and i stores the index:
      for i in range(3):
          sum = sum + pow(e, x[i])

      Is this explanation clear?

  1. If the input of the softmax is a 1D vector, how come we get a matrix of derivatives?
    My understanding is that we only care for:
    ∂σ(x_j) / ∂x_k = σ(x_j) · (1 - σ(x_j)), if j = k

    Thanks in advance

    1. It is actually a 1D array. Please look at the softmax function illustration, y = [2, 1, 0.1].

      You mean that y consists of a single item? If yes, then softmax always returns 1 as its output because e^x / e^x = 1. But this is not meaningful; that's why the softmax function is used for multiclass classification problems where n >= 3.

      If the problem is binary classification (n = 2), then you do not have to use softmax because e.g. sigmoid already produces outputs in the scale [0, 1].
