Almost every day, a new innovation is announced in the machine learning field, to such an extent that the number of research papers published about machine learning is growing faster than Moore's law. For example, the second AI winter ended when the vanishing gradient problem was identified and the ReLU activation function was introduced. However, that state of affairs lasted almost 20 years. In 2017, Google researchers showed that an extended version of the sigmoid function named Swish outperforms ReLU. Later, it was shown that an extended version of Swish named E-Swish outperforms many other activation functions, including both ReLU and Swish.
However, even advanced frameworks cannot keep up with these innovations. For example, you cannot use Swish-based activation functions in Keras today. Support might appear in an upcoming patch, but you would have to fall back to another activation function until that patch is released. So, this post will guide you through consuming a custom activation function, such as Swish or E-Swish, beyond what Keras and TensorFlow ship by default.
🙋♂️ You may consider enrolling in my top-rated machine learning course on Udemy.
Code wins arguments
All you need to do is create your custom activation function. In this case, I'll use Swish, which is x times sigmoid(x), scaled here by a beta coefficient. Then I'll plug it into a convolutional neural network model.
import keras
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

def swish(x):
    beta = 1.5  # common choices are 1, 1.5 or 2
    return beta * x * keras.backend.sigmoid(x)

model = Sequential()

# 1st convolution layer: 32 filters of size (3, 3)
model.add(Conv2D(32, (3, 3), activation=swish, input_shape=(28, 28, 1)))
model.add(MaxPooling2D(pool_size=(2, 2)))

# 2nd convolution layer: 64 filters of size (3, 3)
model.add(Conv2D(64, (3, 3), activation=swish))
model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Flatten())

# fully connected part: one hidden layer of 512 nodes
model.add(Dense(512, activation=swish))
model.add(Dense(num_classes, activation='softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer=keras.optimizers.Adam(),
              metrics=['accuracy'])

model.fit(x_train, y_train,
          epochs=epochs,
          validation_data=(x_test, y_test))
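By the way, passing the Python function directly is not the only option. If you prefer to refer to the activation by its string name (activation='swish'), Keras also lets you register it in the custom objects registry. The lines below are a minimal sketch of that idea; the exact import path of get_custom_objects may differ across Keras versions.

from keras.layers import Activation
from keras.utils.generic_utils import get_custom_objects

# register swish under a string name so layers can use activation='swish'
get_custom_objects().update({'swish': Activation(swish)})

# now the custom activation can be referenced like any built-in one
model.add(Dense(512, activation='swish'))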
Ok, but how?
Remember that we use this activation function in the feed forward step, whereas we need its derivative in backpropagation. We only define the activation function itself; we never provide its derivative. That's the power of TensorFlow: the framework knows how to differentiate it automatically for backpropagation. This works because the function is built from the Keras backend module. If you implement swish without keras.backend (for example, with plain NumPy), fitting will fail.
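To make that concrete, here is a minimal sketch (assuming a TensorFlow 2-style eager setup, which is not required for the model above) showing that the gradient of swish comes entirely from automatic differentiation; we never write the derivative by hand.

import tensorflow as tf

def swish(x, beta=1.5):
    # same formula as above, expressed with TensorFlow ops
    return beta * x * tf.sigmoid(x)

x = tf.constant([-2.0, 0.0, 2.0])
with tf.GradientTape() as tape:
    tape.watch(x)            # track x so gradients can flow back to it
    y = swish(x)
grad = tape.gradient(y, x)   # d(swish)/dx, computed automatically
print(grad.numpy())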
To sum up
So, we've covered how to include a new activation function in the learning process with the Keras / TensorFlow pair. Picking the most suitable activation function is part of the art for practitioners, just like the network structure (number of hidden layers, number of nodes in each hidden layer) and the learning parameters (learning rate, number of epochs). Now, you can design your own activation function, or consume any newly introduced one, just as in the following picture.
My friend and colleague Giray inspired me to write this post. I am grateful to him, as always.
Let’s dance
These are the dance moves of the most common activation functions in deep learning. Make sure to turn the volume up 🙂
Support this blog if you like it!