Mish as a Neural Network Activation Function

Recently, the Mish activation function was announced in the deep learning world. Researchers report that it outperforms both regular ReLU and Swish. The function is actually a combination of popular activation functions.

Figure: Mish Dance Move (inspired by Imaginary)

It is a combination of the identity, hyperbolic tangent and softplus functions. We should recall the tanh and softplus functions at this point.

y = x · tanh(softplus(x))

where tanh(x) = (e^x - e^-x) / (e^x + e^-x) and softplus(x) = ln(1 + e^x)

Combining these two functions gives the single mish function:

mish(x) = x · (e^ln(1 + e^x) - e^-ln(1 + e^x)) / (e^ln(1 + e^x) + e^-ln(1 + e^x))

This becomes a very complex function, but its graph will remind you of the Swish activation function.

Figure: Mish vs Swish

A zoomed-in view of mish and swish shows how different these functions are. I drew these graphs with Desmos.

Figure: Zoomed Mish
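
If you want to reproduce a similar comparison yourself, here is a minimal matplotlib sketch (it is not part of the original post). I assume the common definition swish(x) = x · sigmoid(x); the variable names are just illustrative.

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-5, 5, 500)
mish_y = x * np.tanh(np.log(1 + np.exp(x)))  # x * tanh(softplus(x))
swish_y = x * (1 / (1 + np.exp(-x)))         # x * sigmoid(x)

plt.plot(x, mish_y, label="mish")
plt.plot(x, swish_y, label="swish")
plt.legend()
plt.show()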

Implementing mish in python is an easy task.

import numpy as np

def tanh(x):
	# hyperbolic tangent: (e^x - e^-x) / (e^x + e^-x)
	return (np.exp(x) - np.exp(-x))/(np.exp(x) + np.exp(-x))

def softplus(x):
	# softplus: ln(1 + e^x)
	return np.log(1 + np.exp(x))

def mish(x):
	# mish: x * tanh(softplus(x))
	return x * tanh(softplus(x))
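
As a quick sanity check, we can evaluate the function at a few sample points; the values noted below are approximate.

x = np.array([-2, -1, 0, 1, 2])
print(mish(x))
# roughly [-0.2525, -0.3034, 0.0, 0.8651, 1.9440]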

Derivative

We need the mish function in the feed forward step of neural networks. We will also need its derivative in the backpropagation step.

y = x · (e^ln(1 + e^x) - e^-ln(1 + e^x)) / (e^ln(1 + e^x) + e^-ln(1 + e^x))

Firstly, we can simplify the mish function. This makes the derivative calculation easier. Remember that e^ln(x) is x. In this case, e^ln(1 + e^x) will be (1 + e^x), and e^-ln(1 + e^x) will be 1/(1 + e^x).

y = x · (e^x + 1 - 1/(e^x + 1)) / (e^x + 1 + 1/(e^x + 1)) = [x · ((e^x + 1)^2 - 1)/(e^x + 1)] / [((e^x + 1)^2 + 1)/(e^x + 1)]

Both the numerator and the denominator are divided by (e^x + 1), so we can cancel the (e^x + 1) terms.

y = [x · ((e^x + 1)^2 - 1)] / [(e^x + 1)^2 + 1]
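
Before differentiating, we can double check numerically that this simplified form still matches x · tanh(softplus(x)). This is just a small sanity-check sketch with numpy; it is not part of the original derivation.

import numpy as np

x = np.linspace(-5, 5, 11)
original = x * np.tanh(np.log(1 + np.exp(x)))
simplified = x * ((np.exp(x) + 1)**2 - 1) / ((np.exp(x) + 1)**2 + 1)
print(np.allclose(original, simplified))  # expected: True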

Remember the quotient rule: y’ = (f/g)’ = (f’·g - f·g’) / g^2. So, we can apply the quotient rule to the expression above.

f = x · ((e^x + 1)^2 - 1) and g = (e^x + 1)^2 + 1

Let’s find the derivative of f first

f’ = [x · ((e^x + 1)^2 - 1)]’

Remember the product rule: (a·b)’ = a’·b + a·b’. Here, f is the product of x and ((e^x + 1)^2 - 1).

f’ = x’ · ((e^x + 1)^2 - 1) + x · ((e^x + 1)^2 - 1)’

Finding the derivative of f requires finding the derivative of (e^x + 1)^2.

[(e^x + 1)^2]’ = 2 · (e^x + 1) · (e^x + 1)’ = 2 · (e^x + 1) · e^x

Now, we can find the derivative of f

f’ = ((e^x + 1)^2 - 1) + x · 2e^x(e^x + 1) = (e^x + 1)^2 - 1 + 2xe^x(e^x + 1)

It is time to find the derivative of g

g’ = [(e^x + 1)^2 + 1]’

We’ve already found the derivative of (e^x + 1)^2 above.

g’ = 2e^x · (e^x + 1)

y’ = (f’·g - f·g’) / g^2 = ( [(e^x + 1)^2 - 1 + 2xe^x(e^x + 1)] · [(e^x + 1)^2 + 1] - [x · ((e^x + 1)^2 - 1)] · [2e^x · (e^x + 1)] ) / ((e^x + 1)^2 + 1)^2

Let’s denote the term in the denominator as δ = (e^x + 1)^2 + 1.

y’ = ( [e^2x + 1 + 2e^x - 1 + 2xe^2x + 2xe^x] · [e^2x + 1 + 2e^x + 1] - [e^2x + 1 + 2e^x - 1] · [2xe^2x + 2xe^x] ) / δ^2

y’ = ( [e^2x + 2e^x + 2xe^2x + 2xe^x] · [e^2x + 2e^x + 2] - [e^2x + 2e^x] · [2xe^2x + 2xe^x] ) / δ^2

y’ = ( [e^4x + 2e^3x + 2e^2x + 2e^3x + 4e^2x + 4e^x + 2xe^4x + 4xe^3x + 4xe^2x + 2xe^3x + 4xe^2x + 4xe^x] - [2xe^4x + 2xe^3x + 4xe^3x + 4xe^2x] ) / δ^2

y’ = [e^4x · (1 + 2x - 2x) + e^3x · (2 + 2 + 4x + 2x - 2x - 4x) + e^2x · (2 + 4 + 4x + 4x - 4x) + e^x · (4 + 4x)] / δ^2

y’ = [e^4x + 4e^3x + (6 + 4x) · e^2x + (4 + 4x) · e^x] / δ^2

y’ = e^x · [e^3x + 4e^2x + (6 + 4x) · e^x + 4(1 + x)] / δ^2

Let’s set the term in the square brackets to ω.

This is the final form of the derivative of Mish activation function.

dy/dx = e^x · ω / δ^2

where ω = e^3x + 4e^2x + (6 + 4x) · e^x + 4(1 + x) and δ = (e^x + 1)^2 + 1
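
Because the hand derivation above is long, it is worth cross-checking the closed form against a symbolic derivative. The sketch below uses sympy, which is not mentioned in the original post; the printed differences should be numerically zero.

import sympy as sp

x = sp.Symbol('x')
mish_expr = x * sp.tanh(sp.log(1 + sp.exp(x)))

# hand-derived closed form: e^x * omega / delta^2
omega = sp.exp(3*x) + 4*sp.exp(2*x) + (6 + 4*x)*sp.exp(x) + 4*(1 + x)
delta = (sp.exp(x) + 1)**2 + 1
closed_form = sp.exp(x) * omega / delta**2

difference = sp.diff(mish_expr, x) - closed_form
for point in [-3, -1, 0, 0.5, 2]:
	print(difference.subs(x, point).evalf())  # each value should be ~0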

Implementing the derivative of mish in python is a little more costly.

def mish_derivative(x):
	# dy/dx = e^x * omega / delta^2
	omega = np.exp(3*x) + 4*np.exp(2*x) + (6+4*x)*np.exp(x) + 4*(1 + x)
	delta = 1 + pow((np.exp(x) + 1), 2)
	derivative = np.exp(x) * omega / pow(delta, 2)
	return derivative
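
As a quick sanity check on this implementation, we can compare the analytic derivative against a central finite difference of the mish function defined earlier; this is just a sketch, and the tolerance is arbitrary.

# assumes mish() and numpy from the earlier snippets
x = np.linspace(-4, 4, 9)
h = 1e-6
numerical = (mish(x + h) - mish(x - h)) / (2 * h)  # central difference
print(np.allclose(mish_derivative(x), numerical, atol=1e-4))  # expected: True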

Keras

We would mostly not implement a neural network architecture, including the feed forward and backpropagation steps, from scratch. Building neural networks with high level APIs such as Keras is more common. We can adapt Mish to Keras even though it is not an out-of-the-box function.

import keras

def mish(x):
	return x * keras.backend.tanh(keras.backend.softplus(x))

"""
model = Sequential()
...
model.add(Conv2D(64,(3, 3), activation = mish))
...
"""

Conclusion

So, we’ve covered the novel activation function Mish, which is built from popular activation functions: the identity, the hyperbolic tangent (tanh) and softplus. The original paper skipped the derivative calculation step and gave the derivative directly. In this post, we’ve also walked through the derivative calculation step.

Even though tens of activation functions have been proposed so far, only a few of them have survived in daily use. It seems that we will see several more novel activation functions in the deep learning world in the days to come.

Let’s dance

These are the dance moves of the most common activation functions in deep learning. Be sure to turn the volume up 🙂
