Using Custom Activation Functions in Keras

Almost every day, a new innovation is announced in the machine learning field, to such an extent that the number of research papers published about machine learning is growing faster than Moore's law. For example, the second AI winter, which lasted almost 20 years, ended when the vanishing gradient problem was identified and the ReLU activation function was introduced. In 2017, Google researchers showed that an extended version of the sigmoid function named Swish outperforms ReLU. Later, it was shown that an extended version of Swish named E-Swish outperforms many other activation functions, including both ReLU and Swish.

Figure: ML versus Moore's law

This post explains how to use such custom activation functions in Keras.

However, advanced frameworks cannot catch up with these innovations immediately. For example, you cannot use Swish-based activation functions in Keras today. Support might appear in an upcoming release, but you may need such an activation function before the related patch is merged. So, this post will guide you through consuming a custom activation function outside of the Keras and TensorFlow built-ins, such as Swish or E-Swish.



Code wins arguments

All you need to do is create your custom activation function. In this case, I'll use swish, which is x times sigmoid. Then, I'll plug it into a convolutional neural network model.

import keras
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

def swish(x):
    beta = 1.5 # 1, 1.5 or 2; beta = 1 recovers plain swish (x * sigmoid(x))
    return beta * x * keras.backend.sigmoid(x)

model = Sequential()

# 1st convolution layer: 32 filters, each of size (3, 3)
model.add(Conv2D(32, (3, 3)
 , activation = swish
 , input_shape=(28,28,1)))
model.add(MaxPooling2D(pool_size=(2,2)))

# 2nd convolution layer: 64 filters, each of size (3, 3)
model.add(Conv2D(64, (3, 3), activation = swish))
model.add(MaxPooling2D(pool_size=(2,2)))

model.add(Flatten())

# Fully connected layer: 1 hidden layer consisting of 512 nodes
model.add(Dense(512, activation = swish))
model.add(Dense(num_classes, activation='softmax'))

model.compile(loss='categorical_crossentropy'
 , optimizer=keras.optimizers.Adam()
 , metrics=['accuracy']
)

# x_train, y_train, x_test, y_test, num_classes and epochs are assumed to be defined beforehand
model.fit(x_train, y_train
 , epochs=epochs
 , validation_data=(x_test, y_test)
)
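
One practical note: if you save this trained model to disk and reload it later, Keras has to be told what swish refers to, because the saved file only stores the function's name. Here is a minimal sketch, assuming the model is saved to a hypothetical file named model.h5:

from keras.models import load_model

model.save('model.h5') # the file name is just an illustration

# map the stored name back to our python function while loading
restored_model = load_model('model.h5', custom_objects={'swish': swish})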

Ok, but how?

Remember that the activation function is used in the feed-forward step, whereas its derivative is needed in backpropagation. We only define the activation function; we do not provide its derivative. That's the power of TensorFlow: the framework knows how to differentiate it for backpropagation. This is why we import the Keras backend module and build swish from its operations. If you implemented swish with plain Python or NumPy instead of keras.backend, fitting would fail because the framework could not trace and differentiate the computation.
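
You can see this automatic differentiation in action outside of training. Here is a minimal sketch, assuming TensorFlow 2.x, that asks the framework for the derivative of swish at a few points without ever writing the derivative by hand:

import tensorflow as tf

def swish(x):
    beta = 1.5
    return beta * x * tf.keras.backend.sigmoid(x)

x = tf.constant([-2.0, 0.0, 2.0])
with tf.GradientTape() as tape:
    tape.watch(x) # track x so gradients with respect to it are recorded
    y = swish(x)

print(tape.gradient(y, x)) # d(swish)/dx computed by automatic differentiation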

To sum up

So, we’ve mentioned how to include a new activation function for learning process in Keras / TensorFlow pair. Picking the most convenient activation function is the state-of-the-art for scientists just like structure (number of hidden layers, number of nodes in the hidden layers) and learning parameters (learning rate, epoch or learning rate). Now, you can design your own activation function or consume any newly introduced activation function just similar to the following picture.

Figure: y is a question mark (imaginary)
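
As a side note, the swish defined above, with beta multiplied outside the sigmoid, actually matches the E-Swish form (beta * x * sigmoid(x)). The original Swish paper places beta inside the sigmoid instead. Here is a minimal sketch of that variant, again built only from keras.backend operations so differentiation keeps working (the beta value is illustrative):

def swish_beta(x):
    # original Swish form: x * sigmoid(beta * x); beta = 1 recovers plain swish
    beta = 1.5
    return x * keras.backend.sigmoid(beta * x)

It can be passed to any layer exactly like swish above, e.g. Dense(512, activation=swish_beta).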

My friend and colleague Giray inspired me to write this post. I am grateful to him, as always.

Let’s dance

These are the dance moves of the most common activation functions in deep learning. Make sure to turn the volume up 🙂

