Convolutional neural networks have made softmax very popular as an activation function. However, softmax is not a traditional activation function. Other activation functions produce a single output for a single input, whereas softmax takes an array of inputs and produces one output for each element. This lets us build neural network models that classify instances into more than 2 classes instead of being limited to binary classification.

## Softmax function

σ(x_{j}) = e^{xj} / (∑ (i=1 to n) e^{xi} ) (for j=1 to n)

Softmax normalizes the input array to the range [0, 1]. Also, the sum of the softmax outputs is always equal to 1. So, a neural network model classifies the instance as the class whose index has the maximum output.

For example, the following results are obtained when softmax is applied to the inputs x_{1} = 2, x_{2} = 1 and x_{3} = 0.1.

σ(x_{1}) = e^{x1} / (e^{x1}+e^{x2}+e^{x3} ) = e^{2} / (e^{2} + e^{1}+ e^{0.1}) ≈ 0.7

σ(x_{2}) = e^{x2} / (e^{x1}+e^{x2}+e^{x3} ) = e^{1} / (e^{2} + e^{1}+ e^{0.1}) ≈ 0.2

σ(x_{3}) = e^{x3} / (e^{x1}+e^{x2}+e^{x3} ) = e^{0.1} / (e^{2} + e^{1}+ e^{0.1}) ≈ 0.1

As seen, the outputs are normalized to the range [0, 1]. Also, the sum of the results is equal to 0.7 + 0.2 + 0.1 = 1.
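The worked example can be reproduced with a few lines of Python. This is only a minimal sketch using the standard library; the function name `softmax` and the variable names are my own choices for illustration.

```python
import math

def softmax(xs):
    """Return the softmax of a list of numbers."""
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

inputs = [2.0, 1.0, 0.1]   # the inputs from the example
outputs = softmax(inputs)

# The predicted class is the index of the maximum output.
predicted_class = outputs.index(max(outputs))

print([round(o, 3) for o in outputs])  # [0.659, 0.242, 0.099]
print(predicted_class)                 # 0
```

Rounded to one decimal place, these are the 0.7, 0.2 and 0.1 values shown in the example.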

## Derivative

You might remember that an activation function is used in the feedforward step, whereas its derivative is used in the backpropagation step. Now, we will find the partial derivatives of softmax.

The quotient rule states that if a function can be expressed as the quotient of two differentiable functions, then its derivative can be expressed as shown below.

f(x) = g(x) / h(x)

f'(x) = (g'(x) h(x) – g(x) h'(x)) / (h(x))^{2}

We can apply the quotient rule to the softmax function.

∂σ(x_{j}) / ∂x_{j} = [(e^{xj})’.(∑ (i=1 to n) e^{xi} ) – (e^{xj}).(∑ (i=1 to n) e^{xi} )’ ] / (∑ (i=1 to n) e^{xi} )^{2}

Firstly, you might remember that the derivative of e^{x} is e^{x} again.

Secondly, we need to differentiate the sum term with respect to x_{j}.

∑ (i=1 to n) e^{xi} = e^{x1}+ e^{x2}+ … + e^{xj} + … + e^{xn}

∂∑ (i=1 to n) e^{xi} / ∂x_{j} = 0 + 0 + … + e^{xj} + … + 0

Now, we substitute these derivatives into the quotient rule form of the softmax derivative.

∂σ(x_{j}) / ∂x_{j} = [e^{xj}.(∑ (i=1 to n) e^{xi} ) – e^{xj}.e^{xj} ] / (∑ (i=1 to n) e^{xi} )^{2}

We can split the fraction by dividing each term of the numerator by the denominator.

∂σ(x_{j}) / ∂x_{j} = e^{xj}.(∑ (i=1 to n) e^{xi} )/(∑ (i=1 to n) e^{xi} )^{2} – (e^{xj}.e^{xj})/(∑ (i=1 to n) e^{xi} )^{2}

∂σ(x_{j}) / ∂x_{j} = e^{xj}/(∑ (i=1 to n) e^{xi} ) – (e^{xj}.e^{xj})/(∑ (i=1 to n) e^{xi} )^{2}

∂σ(x_{j}) / ∂x_{j} = e^{xj}/(∑ (i=1 to n) e^{xi} ) -[e^{xj}/∑ (i=1 to n) e^{xi}]^{2}

Notice that the plain softmax function appears in the equation above. Let’s substitute it.

σ(x_{j}) = e^{xj} / (∑ (i=1 to n) e^{xi})

∂σ(x_{j}) / ∂x_{j} = σ(x_{j}) – σ(x_{j})^{2}

∂σ(x_{j}) / ∂x_{j} = σ(x_{j}).(1 – σ(x_{j}))
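If you want to convince yourself of this result, you can compare it against a numerical derivative. The sketch below assumes a central-difference approximation with a step size of 1e-6; the helper name `numeric_partial` is hypothetical and only used for this check.

```python
import math

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def numeric_partial(xs, j, k, h=1e-6):
    """Central-difference estimate of d softmax(xs)[j] / d xs[k]."""
    plus = list(xs)
    plus[k] += h
    minus = list(xs)
    minus[k] -= h
    return (softmax(plus)[j] - softmax(minus)[j]) / (2 * h)

xs = [2.0, 1.0, 0.1]
s = softmax(xs)

# Compare sigma_j * (1 - sigma_j) with the numerical estimate for each j.
diagonal_checks = []
for j in range(len(xs)):
    analytic = s[j] * (1 - s[j])
    numeric = numeric_partial(xs, j, j)
    diagonal_checks.append((analytic, numeric))
    print(j, analytic, numeric)
```

The two columns should agree to several decimal places, which is what we expect if the derivation is correct.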

That is the derivative of σ(x_{j}) with respect to x_{j}. So, what is the derivative of σ(x_{j}) with respect to x_{k} when j is not equal to k? Again, we apply the quotient rule.

σ(x_{j}) = e^{xj} / (∑ (i=1 to n) e^{xi} ) (for j=1 to n)

∂σ(x_{j}) / ∂x_{k} = [(e^{xj})’.(∑ (i=1 to n) e^{xi}) – (e^{xj}).(∑ (i=1 to n) e^{xi} )’]/(∑ (i=1 to n) e^{xi} )^{2}

(e^{xj})’ = ∂e^{xj}/∂x_{k} = 0 (*because e^{xj} is constant with respect to x_{k}*)

∂σ(x_{j}) / ∂x_{k} = [0.(∑ (i=1 to n) e^{xi}) – (e^{xj}).(∑ (i=1 to n) e^{xi} )’]/(∑ (i=1 to n) e^{xi} )^{2}

∑ (i=1 to n) e^{xi} = e^{x1}+ e^{x2}+ … + e^{xk} + … + e^{xn}

(∑ (i=1 to n) e^{xi} )’ = ∂(∑ (i=1 to n) e^{xi})/∂x_{k} = 0 + 0 + … + e^{xk} + … + 0 = e^{xk}

∂σ(x_{j}) / ∂x_{k} = [0.(∑ (i=1 to n) e^{xi}) – (e^{xj}).(e^{xk})]/(∑ (i=1 to n) e^{xi} )^{2}

∂σ(x_{j}) / ∂x_{k} = – (e^{xj}).(e^{xk})/(∑ (i=1 to n) e^{xi} )^{2}

∂σ(x_{j}) / ∂x_{k} = – [(e^{xj}).(e^{xk})]/[∑ (i=1 to n) e^{xi} ∑ (i=1 to n) e^{xi}]

∂σ(x_{j}) / ∂x_{k} = – [e^{xj}/∑ (i=1 to n) e^{xi}][e^{xk}/∑ (i=1 to n) e^{xi}]

∂σ(x_{j}) / ∂x_{k} = – σ(x_{j}).σ(x_{k})
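The case where j is not equal to k can be checked numerically as well. As a rough sketch, assuming a central-difference approximation and a hypothetical helper named `numeric_partial`, the product – σ(x_{j}).σ(x_{k}) is compared with the estimated derivative for every pair of different indices.

```python
import math

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def numeric_partial(xs, j, k, h=1e-6):
    """Central-difference estimate of d softmax(xs)[j] / d xs[k]."""
    plus = list(xs)
    plus[k] += h
    minus = list(xs)
    minus[k] -= h
    return (softmax(plus)[j] - softmax(minus)[j]) / (2 * h)

xs = [2.0, 1.0, 0.1]
s = softmax(xs)

# Compare -sigma_j * sigma_k with the numerical estimate for every j != k.
off_diagonal_checks = []
for j in range(len(xs)):
    for k in range(len(xs)):
        if j != k:
            analytic = -s[j] * s[k]
            numeric = numeric_partial(xs, j, k)
            off_diagonal_checks.append((j, k, analytic, numeric))
            print(j, k, analytic, numeric)
```

Every off-diagonal derivative is negative: increasing one input lowers the softmax output of all the other classes.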

Let’s put the derivatives in general form.

σ(x_{j}) = e^{xj} / (∑ (i=1 to n) e^{xi}) (for j=1 to n)

∂σ(x_{j}) / ∂x_{k} = σ(x_{j}).(1 – σ(x_{j})), if j = k

∂σ(x_{j}) / ∂x_{k} = – σ(x_{j}).σ(x_{k}), if j != k
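Both cases can be collected into a single matrix of partial derivatives (the Jacobian), which is the form backpropagation would consume. The following is a minimal sketch, not a reference implementation: the function names `softmax_jacobian` and `backprop_through_softmax` are hypothetical, and the upstream gradient is a made-up value for illustration.

```python
import math

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(xs):
    """J[j][k] = d softmax(xs)[j] / d xs[k], covering both cases."""
    s = softmax(xs)
    n = len(s)
    # (1 - s[k]) when j == k, (0 - s[k]) when j != k
    return [[s[j] * ((1.0 if j == k else 0.0) - s[k]) for k in range(n)]
            for j in range(n)]

def backprop_through_softmax(xs, upstream):
    """Chain rule: dL/dx[k] = sum over j of dL/dsigma[j] * J[j][k]."""
    jac = softmax_jacobian(xs)
    n = len(xs)
    return [sum(upstream[j] * jac[j][k] for j in range(n)) for k in range(n)]

xs = [2.0, 1.0, 0.1]
jacobian = softmax_jacobian(xs)

# The outputs always sum to 1, so each column of the Jacobian sums to 0.
n = len(xs)
column_sums = [sum(jacobian[j][k] for j in range(n)) for k in range(n)]

upstream = [1.0, 0.0, 0.0]  # made-up gradient of a loss w.r.t. the outputs
grad_x = backprop_through_softmax(xs, upstream)
print(grad_x)
```

The column sums come out as zero up to floating-point rounding, which is a quick sanity check on the two formulas.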

So, the derivative of the softmax function is surprisingly easy to derive. We mostly use the softmax function in the final layer of convolutional neural networks, because CNNs are very good at classifying images, and classification tasks mostly involve more than 2 classes.
