The legacy form of the perceptron cannot handle non-linear problems such as the XOR logic gate problem. This is because its activation unit is a step function, which is non-differentiable. This limitation contributed to the 1st AI winter.
In 1986, it was discovered that multi-layer perceptrons can handle non-linear problems, and this discovery brought the AI winter to an end. Unfortunately, that was only the 1st AI winter!
This discovery requires activation units to be differentiable functions. In this way, we can back-propagate errors and apply learning. Herein, sigmoid and tanh are among the most common activation functions. However, these functions come with a huge defect.
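To make the idea concrete, here is a minimal numpy sketch of sigmoid, tanh and their derivatives; the function names are my own and not taken from the original post.

import numpy as np

def sigmoid(x):
    # squashes any real input into the range (0, 1)
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    # derivative of sigmoid, expressed in terms of its own output
    s = sigmoid(x)
    return s * (1 - s)

def tanh_derivative(x):
    # derivative of numpy's built-in tanh
    return 1 - np.tanh(x) ** 2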
2nd AI Winter
The sigmoid function is only meaningful for inputs roughly in the range (-5, +5). In this range, its derivative is noticeably different than 0. This means that we can back-propagate errors and apply learning.
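As a rough illustration, evaluating the sigmoid derivative from the sketch above at a few points shows how quickly it dies off outside that range (the printed values are approximate).

for x in [-10, -5, 0, 5, 10]:
    print(x, sigmoid_derivative(x))
# prints roughly 0.000045, 0.0066, 0.25, 0.0066, 0.000045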
Ian Goodfellow illustrates this meaningfulness as the mobility of Bart Simpson on his skateboard. Gravity helps Bart move if he is in the range [-5, +5].
On the other hand, gravity will not help Bart move if he is at a point greater than 5 or less than -5.
This representation describes the vanishing gradient problem very well. If the derivative of the activation function always produces 0, then we cannot update the weights. But that is just the result. The question is: what causes this result to happen?
Wide and deep networks tend to produce large outputs in every layer. Constructing a wide and deep network with sigmoid activation units reveals the vanishing or exploding gradient problem. This landed us in an AI winter again.
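Intuitively, the back-propagated gradient is (roughly) a product of per-layer derivatives, and with sigmoid each factor is at most 0.25. A toy calculation under that assumption shows how fast the signal shrinks in a deep network:

gradient = 1.0
for layer in range(10):
    gradient *= 0.25  # best-case sigmoid derivative, reached at x = 0
print(gradient)  # roughly 9.5e-07 after only 10 layers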
Sunny Days
The 2nd AI winter did not pass until 2011. The rise of a simple activation function named ReLU brought back sunny days. This function is the identity function for positive inputs, whereas it produces zero for negative inputs.
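In code, ReLU and its derivative are almost trivial; this is a generic numpy sketch rather than any framework's implementation.

def relu(x):
    # identity for positive inputs, zero for negative inputs
    return np.maximum(0, x)

def relu_derivative(x):
    # 1 for positive inputs, 0 for negative inputs
    return (x > 0).astype(float)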
Let’s imagine Bart’s mobility on this new function. Gravity causes Bart to move for any positive input.
Wide network structures tend to produce mostly large positive inputs across layers. That’s why most vanishing gradient problems would be solved, even though gravity would not help Bart move for negative inputs.
You might consider using Leaky ReLU as the activation unit to handle this issue for negative inputs. Bart can move at any point with this new function! Leaky ReLU is a non-linear function, it is differentiable, and its derivative is different than 0 at any point except 0.
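A quick numpy sketch of Leaky ReLU could look like this; the slope alpha = 0.01 is just a common choice, not a value from the original post.

def leaky_relu(x, alpha=0.01):
    # keeps a small slope for negative inputs instead of a hard zero
    return np.where(x > 0, x, alpha * x)

def leaky_relu_derivative(x, alpha=0.01):
    # 1 for positive inputs, alpha for negative inputs
    return np.where(x > 0, 1.0, alpha)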
Testing
Let’s construct a wide and deep neural network model. Basically, I’ll create a model for handwritten digit classification. There are 4 hidden layers consisting of 128, 64, 32 and 16 units respectively. Actually, it is not that deep.
classifier = tf.contrib.learn.DNNClassifier(
    feature_columns=feature_columns
    , n_classes=10 #0 to 9 - 10 classes
    , hidden_units=[128, 64, 32, 16]
    , optimizer=tf.train.GradientDescentOptimizer(learning_rate=0.1)
    , activation_fn = tf.nn.sigmoid
)
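For completeness, the surrounding data preparation, training and evaluation might look roughly like this with the same TensorFlow 1.x contrib.learn API; the MNIST loading, the feature_columns definition and the step count are my assumptions, and the data preparation lines would run before constructing the classifier above.

import tensorflow as tf

# assumed data preparation (runs before building the classifier above)
mnist = tf.contrib.learn.datasets.load_dataset("mnist")
feature_columns = tf.contrib.learn.infer_real_valued_columns_from_input(mnist.train.images)

# train and evaluate the classifier defined above
classifier.fit(x=mnist.train.images, y=mnist.train.labels.astype("int32"), steps=2000)
accuracy = classifier.evaluate(x=mnist.test.images, y=mnist.test.labels.astype("int32"))["accuracy"]
print("test accuracy:", accuracy)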
As seen, the model is a disappointment. Accuracy is very low.
All we need to do is switch the activation function to ReLU.
classifier = tf.contrib.learn.DNNClassifier(
    feature_columns=feature_columns
    , n_classes=10 #0 to 9 - 10 classes
    , hidden_units=[128, 64, 32, 16]
    , optimizer=tf.train.GradientDescentOptimizer(learning_rate=0.1)
    , activation_fn = tf.nn.relu
)
As seen, accuracy increases dramatically when the activation unit is ReLU.
BTW, I’ve pushed the code to GitHub.
So, AI studies had an unproductive period of almost 20 years between 1986 and 2006 because of activation units. Funnily enough, this challenging problem could be solved by using a simple function. ReLU is the reason why we are much stronger in AI studies these days.
Support this blog if you like it!
Your description of the “AI winter” is too simplistic. The AI winter idea applies mostly to symbolic AI, which had I think 3 winters (one very early due to machine translation). See Wikipedia. The neural network situation is more complicated: most research stopped in 1969 due to the Minsky–Papert book, the boom started again with Hopfield in 1982 and then Rumelhart in 1986, and lasted until about 1993. Deep learning started up around 2006 with Hinton training layers with RBMs. But ReLU was not widely used until about 2011 (LeCun had a paper in 2009).
Thank you for raising the subject. You are absolutely right. I fixed the date for the first introduction of ReLU to 2011.
BTW, I took the 2nd AI winter dates from this link: http://cswithjames.com/keras-6-vanishing-gradient-problem-relu/