The rectified linear unit, more widely known as ReLU, has become popular over the past several years thanks to its performance and speed. In contrast to other common activation functions, ReLU is piecewise linear. In other words, its derivative is either 0 or 1. However, you might remember that the derivatives of activation functions are included in backpropagation, and that a purely linear activation would leave a network no more expressive than a single linear layer. So, what makes ReLU different from ordinary linear functions?
The ReLU function produces 0 when x is less than or equal to 0, and x when x is greater than 0. Its output can be written compactly as max(0, x).
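The definition can be sketched in a few lines of Python. The names `relu` and `relu_derivative` are chosen here for illustration, and setting the derivative to 0 at exactly x = 0 follows the convention used in this post (the function is not differentiable at that point).

```python
def relu(x: float) -> float:
    """Return max(0, x): 0 for x <= 0, and x itself for x > 0."""
    return max(0.0, x)


def relu_derivative(x: float) -> float:
    """Return 1 for x > 0 and 0 otherwise (the value at x = 0 is a convention)."""
    return 1.0 if x > 0 else 0.0


if __name__ == "__main__":
    # Print the input, the ReLU output, and the derivative side by side.
    for x in (-2.0, 0.0, 3.5):
        print(x, relu(x), relu_derivative(x))
```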
Previously, we mentioned the softplus function. The secret is that ReLU is very similar to softplus except near 0. Moreover, smoothing ReLU yields the softplus function, ln(1 + e^x).
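To make the similarity concrete, here is a minimal sketch that compares the two functions numerically. It assumes the standard softplus definition ln(1 + e^x), written in a numerically stable form; the function names are illustrative.

```python
import math


def relu(x: float) -> float:
    """Return max(0, x)."""
    return max(0.0, x)


def softplus(x: float) -> float:
    """Return ln(1 + e^x), rearranged so that exp never overflows."""
    # ln(1 + e^x) = max(x, 0) + ln(1 + e^(-|x|))
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


if __name__ == "__main__":
    # The gap between the two is largest at x = 0 and shrinks as |x| grows.
    for x in (-10.0, -1.0, 0.0, 1.0, 10.0):
        print(x, relu(x), softplus(x), softplus(x) - relu(x))
```

The gap between softplus and ReLU equals ln(1 + e^(-|x|)), which peaks at ln 2 when x = 0 and decays quickly as x moves away from 0 in either direction.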
The sigmoid function produces outputs in the range (0, +1). Similarly, the tanh function produces outputs in the range (-1, +1). Both functions saturate: as x becomes very large or very small, their outputs flatten out toward these limits and barely change. This means that the gradient of these functions approaches 0 for large positive or negative values of x. In other words, the gradient vanishes as x is increased or decreased. ReLU avoids the vanishing gradient problem for positive inputs, because its derivative is 1 when x is greater than 0. Its derivative is 0 when x is less than or equal to 0.
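The saturation argument can be checked numerically. The sketch below uses the standard derivative formulas, sigmoid'(x) = sigmoid(x)(1 - sigmoid(x)) and tanh'(x) = 1 - tanh(x)^2; the helper names are illustrative.

```python
import math


def sigmoid(x: float) -> float:
    """Return 1 / (1 + e^(-x)), computed so that exp never overflows."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def sigmoid_grad(x: float) -> float:
    """Derivative of sigmoid: s * (1 - s)."""
    s = sigmoid(x)
    return s * (1.0 - s)


def tanh_grad(x: float) -> float:
    """Derivative of tanh: 1 - tanh(x)^2."""
    t = math.tanh(x)
    return 1.0 - t * t


def relu_grad(x: float) -> float:
    """Derivative of ReLU: 1 for x > 0, otherwise 0."""
    return 1.0 if x > 0 else 0.0


if __name__ == "__main__":
    # Sigmoid and tanh gradients shrink toward 0 as |x| grows;
    # the ReLU gradient stays at 1 for every positive input.
    for x in (0.0, 2.0, 5.0, 10.0):
        print(x, sigmoid_grad(x), tanh_grad(x), relu_grad(x))
```

The sigmoid gradient never exceeds 0.25 (its value at x = 0), so multiplying many such factors together during backpropagation shrinks the signal further with each layer.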
What’s more, when the output of an activation function has both an upper and a lower limit, the dataset typically has to be normalized to fit that range. ReLU based systems depend less on this step, because the function produces outputs in the range [0, +∞).
Finally, calculating the function result and its gradient is an easy task because neither involves exponential calculations. Thus, both the feed-forward and backpropagation steps can be processed quickly. Some experiments report ReLU networks training about six times faster than equivalent networks using other well-known activation functions. That is one reason why ReLU is commonly used in convolutional neural networks.