The Google Brain team announced the Swish activation function as an alternative to ReLU in 2017. ReLU was one of the breakthroughs that helped end the second AI winter, and it still plays an important role in deep learning today. However, experiments show that this new activation function outperforms ReLU on deeper networks. By the way, some resources refer to this function as the sigmoid-weighted linear unit, or SiLU, but that name is less common.
The function is formulated as x times sigmoid of x. The sigmoid function was an important activation function historically, but today it is a legacy one because of the vanishing gradient problem. A small modification makes this legacy activation function relevant again.
🙋♂️ You may consider enrolling in my top-rated machine learning course on Udemy
y = x . sigmoid(x)
y = x . (1/(1+e^(-x))) = x / (1+e^(-x))
Notice that ReLU produces 0 for negative inputs and its gradient is 0 there, so nothing can be back-propagated through those units. Swish partially handles this problem because its output, and hence its gradient, stays nonzero for most negative inputs.
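To make this concrete, here is a minimal NumPy sketch of the function (the helper names are my own, not from the paper). Evaluating it at a few points shows that, unlike ReLU, swish returns small nonzero values for negative inputs, so a gradient can still flow there.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swish(x):
    # y = x * sigmoid(x)
    return x * sigmoid(x)

def relu(x):
    return np.maximum(0.0, x)

x = np.array([-5.0, -2.0, -0.5, 0.0, 0.5, 2.0, 5.0])
print(relu(x))   # [0.  0.  0.  0.  0.5 2.  5. ]
print(swish(x))  # e.g. swish(-2) ≈ -0.238, small but nonzero for negative inputs
```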
Derivative
The derivative of the function is needed in the back-propagation step.
y = x . σ(x)
y = x . (1/(1+e^(-x)))
The equation is a product of two differentiable functions, so we can apply the product rule. Recall the product rule first.
(f . g)’ = f’.g + f.g’
y’ = x’ . σ(x) + x . σ(x)’
Recall the derivative of the sigmoid function.
σ(x) = 1/(1+e^(-x))
σ(x)’ = σ(x).(1 – σ(x))
The derivative of swish includes the derivative of sigmoid, too.
y’ = x’ . σ(x) + x . σ(x)’
y’ = σ(x) + x . σ(x) . (1 – σ(x)) = σ(x) + x . σ(x) – x . σ^2(x)
The second term, x . σ(x), is the swish function itself. Shift it to the front.
y’ = x . σ(x) + σ(x) – x . σ^2(x) = y + σ(x) – x . σ^2(x)
Now, the 2nd and 3rd terms both have a sigmoid multiplier. Let's factor the sigmoid out of both.
y’ = y + σ(x) . (1 – x.σ(x))
The term inside the parentheses includes the swish function again. We can express it as y.
y’ = y + σ(x) . (1 – y)
This is the simplest form of the derivative of the swish function.
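As a sanity check, we can compare this closed form against a numerical derivative. The sketch below (with my own function names) uses a central finite difference; the two agree up to floating-point error.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swish(x):
    return x * sigmoid(x)

def swish_grad(x):
    # dy/dx = y + sigmoid(x) * (1 - y), where y = swish(x)
    y = swish(x)
    return y + sigmoid(x) * (1.0 - y)

x = np.linspace(-6.0, 6.0, 25)
h = 1e-5
numeric = (swish(x + h) - swish(x - h)) / (2.0 * h)  # central finite difference
print(np.max(np.abs(numeric - swish_grad(x))))        # tiny value, the formula checks out
```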
Raw format
We can replace σ(x) with its explicit sigmoid expression and produce a raw form of the equation.
y’ = x . (1/(1+e^(-x))) + (1/(1+e^(-x))) . (1 – (x/(1+e^(-x))))
y’ = (x/(1+e^(-x))) + [1/(1+e^(-x))].[(1 + e^(-x) – x)/(1+e^(-x))]
y’ = x/(1+e^(-x)) + (1 + e^(-x) – x)/(1+e^(-x))^2
y’ = x.(1+e^(-x))/(1+e^(-x))^2 + (1 + e^(-x) – x)/(1+e^(-x))^2
y’ = [x.(1+e^(-x)) + (1 + e^(-x) – x)]/(1+e^(-x))^2
y’ = (x + x.e^(-x) + 1 + e^(-x) – x)/(1+e^(-x))^2
y’ = (e^(-x).(x + 1) + 1)/(1+e^(-x))^2
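The raw form and the earlier y + σ(x).(1 – y) form are the same derivative written differently. A quick numerical comparison (a sketch with my own helper names) confirms they match:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swish_grad_basic(x):
    # y' = y + sigmoid(x) * (1 - y)
    y = x * sigmoid(x)
    return y + sigmoid(x) * (1.0 - y)

def swish_grad_raw(x):
    # y' = (e^(-x) * (x + 1) + 1) / (1 + e^(-x))^2
    e = np.exp(-x)
    return (e * (x + 1.0) + 1.0) / (1.0 + e) ** 2

x = np.linspace(-6.0, 6.0, 25)
print(np.max(np.abs(swish_grad_basic(x) - swish_grad_raw(x))))  # ~0, the two forms agree
```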
Modifying Swish
The same authors published a new research paper just a week later. In this paper, they modified the function and added a β multiplier inside the sigmoid. Interestingly, they called this new function swish again.
y = x . sigmoid(β.x)
y = x . (1/(1+e^(-βx))) = x / (1+e^(-βx))
Here, β is a parameter that must be tuned. β must be different from 0, otherwise the function becomes linear. As β approaches ∞, the function looks more and more like ReLU. We had set β to 1 in the previous calculations. In this new research, the authors proposed setting β to 1 for reinforcement learning tasks.
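A short numerical illustration of these limiting cases, assuming the β-swish definition above (the function name swish_beta is mine):

```python
import numpy as np

def swish_beta(x, beta):
    # y = x * sigmoid(beta * x) = x / (1 + e^(-beta * x))
    return x / (1.0 + np.exp(-beta * x))

x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print(swish_beta(x, beta=0.0))    # exactly x / 2, a plain linear function
print(swish_beta(x, beta=1.0))    # the swish we derived above
print(swish_beta(x, beta=50.0))   # practically identical to ReLU: max(0, x)
```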
The derivative of this new form does not change radically because β is a constant.
First, find the derivative of σ(β.x).
σ(β.x) = 1/(1+e^(-βx)) = (1+e^(-βx))^(-1)
σ(β.x)’ = (-1).(1+e^(-βx))^(-2).e^(-βx).(-β) = β . e^(-βx) . (1+e^(-βx))^(-2) = (β . e^(-βx))/(1+e^(-βx))^2
Pull β out of the parentheses.
σ(β.x)’ = β.((e^(-βx))/(1+e^(-βx))^2)
We will apply a little trick to simplify the derivative. Add plus 1 and minus 1 to the numerator; this does not change the result.
σ(β.x)’ = β.((e^(-βx) + 1 – 1)/(1+e^(-βx))^2)
Split the numerator into (1+e^(-βx)) and –1.
σ(β.x)’ = β.[(1+e^(-βx))/(1+e^(-βx))^2 – 1/(1+e^(-βx))^2]
The first term in the brackets has 1+e^(-βx) in both the numerator and the denominator, so we can cancel one of these factors.
σ(β.x)’ = β.[1/(1+e^(-βx)) – 1/(1+e^(-βx))^2]
Express the 2nd term in the brackets as a product of two factors instead of a square.
σ(β.x)’ = β.[1/(1+e^(-βx)) – (1/(1+e^(-βx))).(1/(1+e^(-βx)))]
Notice that σ(β.x) is 1/(1+e^(-βx)). Replace these terms with σ(β.x) in the equation above.
σ(β.x)’ = β . [σ(β.x) – σ(β.x).σ(β.x)] = β . [σ(β.x).(1 – σ(β.x))]
We've found the derivative of σ(β.x). It equals β times the familiar sigmoid derivative formula, evaluated at β.x.
Now, let's turn back to the modified swish function.
y = x . sigmoid(β.x)
Again, we'll apply the product rule to the expression above.
y’ = x’ . σ(β.x) + x . σ(β.x)’
y’ = σ(β.x) + x . β . [σ(β.x).(1 – σ(β.x))]
y’ = 1/(1+e^(-βx)) + x . β . (1/(1+e^(-βx))) . (1 – 1/(1+e^(-βx)))
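Again, we can sanity check the result with a central finite difference. This is my own sketch; β = 2 is just an arbitrary test value.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swish_beta(x, beta):
    return x * sigmoid(beta * x)

def swish_beta_grad(x, beta):
    # y' = sigmoid(beta*x) + beta * x * sigmoid(beta*x) * (1 - sigmoid(beta*x))
    s = sigmoid(beta * x)
    return s + beta * x * s * (1.0 - s)

beta = 2.0  # arbitrary test value
x = np.linspace(-5.0, 5.0, 21)
h = 1e-5
numeric = (swish_beta(x + h, beta) - swish_beta(x - h, beta)) / (2.0 * h)
print(np.max(np.abs(numeric - swish_beta_grad(x, beta))))  # tiny value, the derivative checks out
```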
Summary
I've summarized both the swish function and its derivative below.
y = x . σ(x) where σ(x) = 1/(1+e^(-x))
dy/dx = y + σ(x) . (1 – y)
or dy/dx = (e^(-x).(x + 1) + 1)/(1+e^(-x))^2
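To show where these formulas plug in, here is a minimal from-scratch layer sketch (my own class, not tied to any framework) that uses the summary equations for the forward pass and the back-propagation step:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class Swish:
    """Minimal forward/backward pair for a from-scratch network sketch."""

    def forward(self, x):
        self.x = x                    # cache the input for the backward pass
        self.y = x * sigmoid(x)       # y = x * sigmoid(x)
        return self.y

    def backward(self, grad_out):
        # dy/dx = y + sigmoid(x) * (1 - y)
        local_grad = self.y + sigmoid(self.x) * (1.0 - self.y)
        return grad_out * local_grad  # chain rule with the upstream gradient

layer = Swish()
x = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])
out = layer.forward(x)
grad_in = layer.backward(np.ones_like(x))  # pretend the upstream gradient is all ones
print(out)
print(grad_in)
```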
So, we've covered a new activation function derived from a legacy one. It avoids the vanishing gradient problem that sigmoid suffers from. Moreover, experiments show that swish works better than ReLU, the superstar activation function of deep learning. That said, computing the function costs more than ReLU in both the feed-forward and back-propagation steps.
Let’s dance
These are the dance moves of the most common activation functions in deep learning. Make sure to turn the volume up 🙂
Support this blog if you like it!