An Overview of the Vanishing Gradient Problem

The classical perceptron cannot handle non-linear problems such as the XOR logic gate. This is because its activation unit is the step function, which is non-differentiable, so there is no gradient to learn from. This limitation contributed to the 1st AI winter.

[Figure: Perceptron]
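
For reference, here is a minimal sketch (mine, not from the original post) of the step activation, showing why gradient-based learning gets no signal through it:

import numpy as np

def step(x):
    # Classic perceptron activation: jumps from 0 to 1 at the threshold.
    return np.where(x >= 0, 1.0, 0.0)

xs = np.array([-2.0, -0.5, 0.5, 2.0])
print(step(xs))  # [0. 0. 1. 1.]
# The derivative is 0 everywhere it is defined (and undefined at the jump),
# so back-propagation receives no learning signal through this unit.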

In 1986, it was discovered that multi-layer perceptrons can handle non-linear problems. That discovery made the AI winter pass away. Unfortunately, that was only the 1st AI winter!


[Figure: Multi-layer perceptron]

This approach requires the activation units to be differentiable functions, so that errors can be back-propagated and the weights can be updated. Sigmoid and tanh are among the most common activation functions. However, these functions come with a serious defect.

2nd AI Winter

[Figure: Sigmoid function]

The sigmoid function is only meaningful for inputs roughly in the range (-5, +5). In that range, its derivative is different from 0, which means we can back-propagate errors and apply learning.
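
To see this numerically, here is a small NumPy sketch (my illustration, not code from the post) that evaluates the sigmoid derivative at a few points:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # peaks at 0.25 when x = 0

for x in [0.0, 2.0, 5.0, 10.0]:
    print(x, sigmoid_derivative(x))
# 0.0 -> 0.25, 2.0 -> ~0.105, 5.0 -> ~0.0066, 10.0 -> ~0.000045
# Beyond |x| of about 5, the gradient is practically gone.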

Ian Goodfellow illustrates this meaningfulness with the mobility of Bart Simpson on his skateboard. Gravity can move Bart if he is in the range of [-5, +5].

[Figure: Bart Simpson can move if the input is in the range of [-5, +5]]

On the other hand, gravity will not move Bart if he is at a point greater than 5 or less than -5.

[Figure: Bart Simpson cannot move if he is not in the range of [-5, +5]]

This analogy describes the vanishing gradient problem very well. If the derivative of the activation function always produces (almost) 0, then we cannot update the weights. But that is just the symptom. The question is: what causes this to happen?

Wide and deep networks tend to produce large outputs in every layer, pushing sigmoid units into their flat, saturated regions. Moreover, back-propagation multiplies the derivatives of all layers together, so even small per-layer factors shrink the gradient geometrically with depth. Constructing a wide and deep network with sigmoid activation units therefore reveals the vanishing (or exploding) gradient problem. This pushed the field into an AI winter again.
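
To make the cause concrete, here is a rough, illustrative sketch (mine, ignoring the weights entirely) of how the gradient shrinks when sigmoid derivatives are multiplied across layers during back-propagation:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

grad = 1.0
for layer in range(10):
    z = 0.0  # best case: the sigmoid derivative is at its maximum of 0.25 here
    s = sigmoid(z)
    grad *= s * (1.0 - s)

print(grad)  # 0.25 ** 10 ≈ 9.5e-7: almost nothing reaches the early layers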

[Figure: Deep neural networks]

Sunny Days

The 2nd AI winter did not pass until around 2011. The rise of a simple activation function named ReLU brought sunny days again. This function is the identity for positive inputs, whereas it produces zero for negative inputs.

[Figure: ReLU]
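
In code, ReLU and its derivative are trivial; a minimal sketch (my illustration, not from the post):

import numpy as np

def relu(x):
    return np.maximum(0.0, x)  # identity for positive inputs, zero for negative ones

def relu_derivative(x):
    return (x > 0).astype(float)  # gradient is 1 for positive inputs, 0 otherwise

xs = np.array([-3.0, 0.0, 3.0])
print(relu(xs))             # [0. 0. 3.]
print(relu_derivative(xs))  # [0. 0. 1.]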

Let’s imagine Bart’s mobility on this new function. Gravity moves Bart for any positive input.

[Figure: Bart can move for any positive input]

Wide network structures tend to produce mostly large positive inputs across layers. That’s why most vanishing gradient problems are solved, even though gravity still will not move Bart for negative inputs.

[Figure: Bart cannot move for negative inputs]

You might consider using Leaky ReLU as the activation unit to handle this issue for negative inputs. Bart can move at any point with this new function! Leaky ReLU is a non-linear function, it is differentiable everywhere except at 0, and its derivative is never 0.

[Figure: Bart can move at any point for Leaky ReLU]
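
A minimal Leaky ReLU sketch (my illustration; the 0.01 negative slope is a common default, not something specified in the post):

import numpy as np

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)  # small slope instead of a flat zero

def leaky_relu_derivative(x, alpha=0.01):
    return np.where(x > 0, 1.0, alpha)  # the gradient never dies, even for negative inputs

xs = np.array([-3.0, 3.0])
print(leaky_relu(xs))             # [-0.03  3.  ]
print(leaky_relu_derivative(xs))  # [0.01 1.  ]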

Testing

Let’s construct a wide and deep neural network model. Basically, I’ll create a model for handwritten digit classification. There are 4 hidden layers consisting of 128, 64, 32 and 16 units respectively. Actually, it is not that deep.

import tensorflow as tf  # TensorFlow 1.x; tf.contrib.learn is removed in 2.x

classifier = tf.contrib.learn.DNNClassifier(
    feature_columns=feature_columns
    , n_classes=10  # 0 to 9 - 10 classes
    , hidden_units=[128, 64, 32, 16]
    , optimizer=tf.train.GradientDescentOptimizer(learning_rate=0.1)
    , activation_fn=tf.nn.sigmoid
)
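
Note that tf.contrib.learn has been removed in TensorFlow 2.x. If you want to reproduce this experiment today, a roughly equivalent Keras sketch (my adaptation, assuming 28x28 MNIST-style inputs) could look like this:

import tensorflow as tf

# Roughly equivalent TF 2.x / Keras model: 4 hidden layers (128, 64, 32, 16 units),
# sigmoid activations, plain SGD with learning rate 0.1, 10 output classes.
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="sigmoid"),
    tf.keras.layers.Dense(64, activation="sigmoid"),
    tf.keras.layers.Dense(32, activation="sigmoid"),
    tf.keras.layers.Dense(16, activation="sigmoid"),
    tf.keras.layers.Dense(10, activation="softmax"),  # digits 0 to 9
])
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.1),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
# Swapping activation="sigmoid" for "relu" in the hidden layers reproduces the second experiment.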

As seen, the model is disappointing. Accuracy is very low.

[Figure: Result for sigmoid activation function]

All we need to do is switch the activation function to ReLU.

classifier = tf.contrib.learn.DNNClassifier(
    feature_columns=feature_columns
    , n_classes=10  # 0 to 9 - 10 classes
    , hidden_units=[128, 64, 32, 16]
    , optimizer=tf.train.GradientDescentOptimizer(learning_rate=0.1)
    , activation_fn=tf.nn.relu
)

As seen, accuracy increases dramatically when the activation unit is ReLU.

[Figure: Result for ReLU activation function]

BTW, I’ve pushed the code to GitHub.

So, AI research went through an unproductive period of almost 20 years, between 1986 and 2006, largely because of activation units. Funnily enough, this challenging problem can be solved by simply using a different function. ReLU is one of the reasons why AI studies are so much stronger these days.

 






Comments

  1. Your description of the “AI winter” is too simplistic. The AI winter idea applies mostly to symbolic AI, which had, I think, 3 winters (one very early, due to machine translation); see Wikipedia. The neural network situation is different: most research stopped in 1969 due to the Minsky and Papert book, the boom started again with Hopfield in 1982 and then Rumelhart in 1986, and lasted until about 1993. Deep learning started up around 2006 with Hinton training layers with RBMs. But ReLU was not widely used until about 2011 (LeCun had a paper in 2009).
