Leaky ReLU as a Neural Network Activation Function

Convolutional neural networks have made the ReLU activation function very popular. Common alternatives such as sigmoid or tanh have upper limits and saturate, whereas ReLU does not saturate for positive inputs. However, it still tends to saturate for negative inputs. Herein, we will make a small modification so that the function produces a constant times the input value for negative inputs. In this way, the function does not saturate in either direction.

Parametric ReLU, or PReLU, has a general form: it produces the maximum of x and αx. Leaky ReLU, or LReLU, is a customized version of PReLU in which the constant multiplier α is fixed to 0.1; some sources set α to 0.01 instead. Finally, Randomized ReLU picks a random α value for each session.
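As a minimal sketch of the family (the α = 0.01 default and the sampling range below are illustrative assumptions, not fixed by any standard), the variants differ only in how α is chosen:

import random

def prelu(x, alpha):
    # PReLU: returns the maximum of x and alpha * x
    # (in practice, alpha is a learnable parameter)
    return max(x, alpha * x)

def lrelu(x, alpha=0.01):
    # Leaky ReLU: alpha is a small fixed constant (0.1 or 0.01)
    return prelu(x, alpha)

def rrelu(x, low=0.1, high=0.3):
    # Randomized ReLU: alpha is sampled randomly (range chosen here for illustration)
    return prelu(x, random.uniform(low, high))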



Leaky ReLU Dance Move (Inspired by Imaginary)

Function

We will handle the feed-forward pass of PReLU as coded below.

def leaky_relu(alpha, x):
    # scale negative inputs by alpha, pass positive inputs through unchanged
    if x <= 0:
        return alpha * x
    else:
        return x
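For instance, with α = 0.1, a few sample calls behave as follows:

print(leaky_relu(0.1, 5))   # positive input passes through unchanged: 5
print(leaky_relu(0.1, -5))  # negative input is scaled by alpha: -0.5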

The graph of the function is demonstrated below.

PReLU

Derivative

Similarly, the derivative of the function is α for negative values and 1 for positive inputs. We'll calculate the derivative as coded below. So, the derivative of PReLU is very similar to a step function.

def derive_leaky_relu(alpha, x):
    # slope is 1 on the positive side and alpha on the negative side
    if x >= 0:
        return 1
    else:
        return alpha
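Again with α = 0.1, the derivative is constant on each side of zero:

print(derive_leaky_relu(0.1, 5))   # positive input: 1
print(derive_leaky_relu(0.1, -5))  # negative input: 0.1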

Multiplying small numbers by other small numbers produces even smaller numbers. So, a constant multiplier picked between 0 and 1, as in LReLU, may cause trouble in recurrent neural networks, because a unit in an RNN is connected to itself and its output also serves as its own input.
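As a rough, hypothetical illustration of that effect, repeatedly scaling a gradient by α = 0.1 over 10 time steps of a self-connected unit shrinks it dramatically:

alpha = 0.1
gradient = 1.0
# back-propagating through 10 steps of a unit whose input stayed negative
# multiplies the gradient by alpha at every step
for step in range(10):
    gradient *= alpha
print(gradient)  # roughly 1e-10, effectively a vanished gradient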

Conclusion

Even though its contribution to the field is a matter of debate and some researchers believe it is unstable, recent studies have reported that using PReLU consistently improves convergence. Remember that designing a neural network structure, including the optimal activation function, is still more art than science, and no option can be ruled out without testing.

Let’s dance

These are the dance moves of the most common activation functions in deep learning. Make sure to turn the volume up 🙂

