ReLU as Neural Networks Activation Function

The rectified linear unit, more widely known as ReLU, has become popular over the past several years thanks to its performance and speed. In contrast to other common activation functions, ReLU is a piecewise linear function. In other words, its derivative is either 0 or 1. However, you might remember that the derivatives of activation functions appear in backpropagation. So, what makes ReLU different from a plain linear function?

Figure: ReLU Dance Move (Inspired from Imaginary)

The ReLU function produces 0 when x is less than or equal to 0, and it is equal to x when x is greater than 0. We can express the function compactly as max(0, x).
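For reference, here is a minimal NumPy sketch of ReLU and its derivative (the function names are just for illustration):

```python
import numpy as np

def relu(x):
    # max(0, x): pass positive values through, zero out the rest
    return np.maximum(0, x)

def relu_derivative(x):
    # 1 for x > 0, 0 for x <= 0 (the step function)
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))             # [0.  0.  0.  0.5 2. ]
print(relu_derivative(x))  # [0. 0. 0. 1. 1.]
```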


πŸ™‹β€β™‚οΈ You may consider to enroll my top-rated machine learning course on Udemy

Decision Trees for Machine Learning

Figure: ReLU function

Previously, we mentioned the softplus function. The secret is that the ReLU function is very similar to softplus, except near 0. Moreover, smoothing ReLU gives rise to the softplus function, as illustrated below.

Figure: ReLU and Softplus
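As a quick numerical check (a minimal NumPy sketch; function names are just for illustration), the two functions give almost identical values away from 0 and differ only near it:

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def softplus(x):
    # softplus(x) = ln(1 + e^x), a smooth version of ReLU
    return np.log1p(np.exp(x))

x = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
print(relu(x))      # [0. 0. 0. 1. 4.]
print(softplus(x))  # approx [0.018 0.313 0.693 1.313 4.018]
```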

Pros

The sigmoid function produces outputs in the range [0, +1]. Similarly, the tanh function produces results in the range [-1, +1]. Both functions saturate: they produce nearly identical outputs for very large positive or very large negative inputs, so their gradients shrink toward 0 as x increases or decreases. This is the vanishing gradient problem. ReLU avoids it, because its derivative is 1 when x is greater than 0 and 0 when x is less than or equal to 0. In other words, the derivative of ReLU is the step function. (Strictly speaking, ReLU is not differentiable at x = 0, but in practice the derivative there is simply taken to be 0.)
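To see the difference concretely, here is a small NumPy sketch (the helper names are illustrative) comparing the gradients of sigmoid and ReLU for increasingly large positive inputs:

```python
import numpy as np

def sigmoid_gradient(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

def relu_gradient(x):
    return (x > 0).astype(float)

x = np.array([1.0, 5.0, 10.0, 50.0])
print(sigmoid_gradient(x))  # shrinks toward 0: ~[2.0e-01 6.6e-03 4.5e-05 1.9e-22]
print(relu_gradient(x))     # stays at 1: [1. 1. 1. 1.]
```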

What's more, the dataset normally has to be normalized if the activation function has upper and lower output limits. We can skip this task for ReLU-based systems, because the function produces outputs in the range [0, +∞).

Finally, calculating the function's output and its gradient is easy because neither involves exponentials. Thus, we can process both the feed-forward and backpropagation steps quickly. That is why experiments have shown ReLU to be about six times faster than other well-known activation functions, and why ReLU is commonly used in convolutional neural networks.
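As a rough illustration of the cost difference (not a rigorous benchmark; real speedups depend on hardware, libraries, and the network itself), you could time the two element-wise operations with NumPy:

```python
import timeit
import numpy as np

x = np.random.randn(1_000_000)

# time 100 passes of each activation over the same array
relu_time = timeit.timeit(lambda: np.maximum(0, x), number=100)
sigmoid_time = timeit.timeit(lambda: 1.0 / (1.0 + np.exp(-x)), number=100)

print(f"ReLU:    {relu_time:.3f} s")
print(f"sigmoid: {sigmoid_time:.3f} s")
```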

Let’s dance

These are the dance moves of the most common activation functions in deep learning. Make sure to turn the volume up. 🙂

