You might be familiar with the Iris flower data set. The Iris flower has 3 different species: setosa, virginica and versicolor. A British statistician believed that the size of the flower correlates with its species. He measured the top and bottom parts of the flower as length and width for 150 instances.

A researcher has already measured the top of the flower (petal) and the bottom of the flower (sepal). We will construct a neural network model and feed these four measurements as inputs. The flower has 3 different species, and that is what we would like to learn; that's why there are 3 output nodes in our model. Finally, we need to create at least one hidden layer. A common heuristic for the number of hidden nodes is the average of the input and output node counts. I will put 4 nodes in a single hidden layer.

Let's construct this Keras model (mine runs on the TensorFlow backend) in Python.

from keras.models import Sequential
from keras.layers import Dense, Activation

model = Sequential()
model.add(Dense(4, input_shape=(4,))) # 4 hidden units; 4 features in the input layer
model.add(Activation('sigmoid')) # activation from input layer to 1st hidden layer
model.add(Dense(3)) # number of classes in the output layer
model.add(Activation('sigmoid')) # activation from 1st hidden layer to output layer

When we feed the dataset to this model, it quickly learns 147 of the 150 items. We get 98% accuracy, which is very satisfactory. You can find the code of this implementation here and the raw dataset here (iris-attr.data and iris-labels.data).

This means that there really is a correlation between flower size and species in the Iris dataset. If you see an Iris flower and measure its size, you can predict its species. Still, you might sense a huge problem here. First, how do we collect the training set? A researcher has already measured flower sizes for Iris, but what about other flowers? You would have to measure them one by one, and that is a real problem. Second, even with training data in hand, you have to take measurements every time you want to predict the species of a flower. This is a legacy approach. What if you could take a photo of a flower and an application responded with its species? Wouldn't that be more satisfactory?

The key point in the previous sentence is taking a photo. In the classical neural network era, we extracted features by hand. That is a luxury now: we expect deep neural networks to extract the features themselves. Suppose that you collect images for 150 instances of the Iris flower. These images can be expressed as pixels, and you can feed the pixel values of an image as input features to a neural network. Even a 25×25 pixel image yields 625 input features (ignoring RGB channels). The output layer remains the same: it still consists of 3 nodes because there are still 3 species of Iris. Moreover, you must increase the number of hidden layers. We could get away with a single hidden layer in our previous example, but here you must construct tens or hundreds of hidden layers.
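As a tiny illustration of the pixels-as-features idea (shapes only; the random image below is just a stand-in for a real photo):

```python
import numpy as np

# a 25x25 grayscale image (RGB channels ignored, as in the text)
image = np.random.rand(25, 25)

# flatten the 2D pixel grid into a 1D feature vector for the input layer
features = image.flatten()
print(features.shape)  # (625,)
```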

In this way, you can collect samples more easily than in the classical neural network days. Furthermore, once your model is ready, taking a photo of a new Iris flower is enough to predict its species. This is much more practical. You just need to feed data to develop a deep learning model; the data can be anything such as images, voice or raw signals. We are no longer interested in manual feature extraction. This is basically what is called deep learning, or deep neural networks.

You might check this post, where we just feed handwritten digit images and a deep learning model learns these digits and successfully labels unseen ones.

So, we have mentioned the conventional form of deep neural networks and focused on how neural networks were modified over time. Deep learning continues to improve every day, to such an extent that the number of research papers about it grows faster than Moore's law. In some advanced forms such as CNNs, we can reduce the computational cost and increase the accuracy.

The post From Neural Networks To Deep Learning appeared first on Sefik Ilkin Serengil.

The dot product is a way to multiply vectors; it produces a scalar result. Let a and b be vectors.

a = ( a_{1}, a_{2}, …, a_{n})

b = (b_{1}, b_{2}, …, b_{n})

The definition of the dot product states that we add up the products of same-index items of a and b.

a . b = a_{1}b_{1} + a_{2}b_{2} + … + a_{n}b_{n}

If a and b are stored as column vectors, then multiplying the transpose of a by b gives the same result. Notice that matrix operations can be handled much faster than for loops.

a . b = a^{T}b
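A quick numerical check that the index-wise definition and the vectorized form agree (the values here are arbitrary):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

# index-wise definition: a1*b1 + a2*b2 + ... + an*bn
loop_dot = sum(a[i] * b[i] for i in range(len(a)))

# vectorized form: transpose of a times b
vec_dot = np.matmul(np.transpose(a), b)

print(loop_dot, vec_dot)  # both 32.0
```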

Let a and b be vectors and theta be the angle between these vectors.

Let's define a new vector c which is equal to a – b (or -a + b). As seen, the vectors a, b and c form a valid triangle, where vector c can be expressed as (a – b).

Herein, the law of cosines states

||c||^{2} = ||a||^{2} + ||b||^{2} – 2||a|| ||b|| cosθ

where ||a||, ||b|| and ||c|| denote vector length of a, b and c respectively.

Remember that vector c is equal to a – b.

||c||^{2} = c.c = (a-b).(a-b) = a.a – a.b – b.a + b.b = ||a||^{2} + ||b||^{2} – a.b – b.a

Notice that a.b and b.a are equal to each other because the dot product is commutative. Remember that these terms are scalars, not vectors.

We can rearrange the length of vector c squared as

||c||^{2} = ||a||^{2} + ||b||^{2} – 2 a.b

Let's compare the law of cosines with this expression.

||c||^{2} = ||a||^{2} + ||b||^{2} – 2||a|| ||b|| cosθ = ||a||^{2} + ||b||^{2} – 2 a.b

The only difference is that one side is expressed in terms of vector lengths and the angle between them, while the other is expressed as a dot product.

– 2||a|| ||b|| cosθ = – 2 a.b

We can divide both sides of the equation by minus 2.

a.b = ||a|| ||b|| cosθ

Recall the definition of dot product.

a_{1}b_{1} + a_{2}b_{2} + … + a_{n}b_{n} = ||a|| ||b|| cosθ

Let's isolate the cosine theta term.

cosθ = (a_{1}b_{1} + a_{2}b_{2} + … + a_{n}b_{n}) / ||a|| ||b||

Well, how do we calculate the length of a vector?

Finding the length of a vector is an easy task. Let V be a vector in 2D space with (V1 = 3, V2 = 4). As you can guess, the length of this vector is 5. This comes from the Pythagorean theorem.

The logic remains the same for n-dimensional space. The formula for vector length is shown below.

||V|| = √(∑ (i = 1 to n) Vi^{2})
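The same (3, 4) example from above, written with this formula (a minimal sketch):

```python
import numpy as np

v = np.array([3.0, 4.0])
length = np.sqrt(np.sum(v ** 2))  # sqrt(3^2 + 4^2) = sqrt(25)
print(length)  # 5.0
```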

Let a and b be vectors. The similarity formula for these two vectors can be generalized as shown below.

cosine similarity = (a_{1}b_{1} + a_{2}b_{2} + … + a_{n}b_{n}) / (√(∑ (i = 1 to n) a_{i}^{2}) √(∑ (i = 1 to n) b_{i}^{2}))

or we can apply vectorization to find cosine similarity

cosine similarity = (a^{T}b) / (√(a^{T}a) √(b^{T}b))

In this way, similar vectors produce high similarity values.

The distance between similar vectors should be low. We can define the distance as 1 minus the similarity; in this way, similar vectors have a low distance (e.g. < 0.20).

cosine distance = 1 – cosine similarity

We can easily adapt the cosine similarity / distance calculation to Python, as illustrated below.

import numpy as np

def findCosineDistance(vector_1, vector_2):
    a = np.matmul(np.transpose(vector_1), vector_2)
    b = np.matmul(np.transpose(vector_1), vector_1)
    c = np.matmul(np.transpose(vector_2), vector_2)
    return 1 - (a / (np.sqrt(b) * np.sqrt(c)))
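As a sanity check on toy vectors (the function is repeated here so the snippet stands alone): identical vectors should give distance 0, and orthogonal vectors distance 1.

```python
import numpy as np

def findCosineDistance(vector_1, vector_2):
    a = np.matmul(np.transpose(vector_1), vector_2)
    b = np.matmul(np.transpose(vector_1), vector_1)
    c = np.matmul(np.transpose(vector_2), vector_2)
    return 1 - (a / (np.sqrt(b) * np.sqrt(c)))

same_dist = findCosineDistance(np.array([1.0, 0.0]), np.array([1.0, 0.0]))
ortho_dist = findCosineDistance(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
print(same_dist, ortho_dist)  # 0.0 1.0
```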

So, we have covered the theoretical background of cosine similarity in this post. The metric is based mainly on the law of cosines. It is both fast and effective for understanding how similar two vectors are.

The post Cosine Similarity in Machine Learning appeared first on Sefik Ilkin Serengil.

Oxford's Visual Geometry Group announced its deep face recognition architecture. We are already familiar with VGG from the ImageNet challenge; we can recognize hundreds of image classes just by applying transfer learning. We also used the same model for style transfer.

Basically, we will apply transfer learning and use the pre-trained weights of the VGG-Face model. Even though the ImageNet version of VGG is almost the same as the VGG-Face model, the researchers fed a dedicated training set of face images to tune the weights for face recognition.

What's more, we will use the model like an auto-encoder to represent images as vectors.

It is not a must, but I strongly recommend reading these topics before reading this post.

Even though the research paper is named Deep Face Recognition, the researchers named the model VGG-Face. This might be because Facebook researchers had already called their face recognition system DeepFace – without a space. VGG-Face is deeper than Facebook's DeepFace: it has 22 layers and 37 deep units.

The structure of the VGG-Face model is demonstrated below. Only the output layer differs from the ImageNet version – you might compare them.

The research paper describes the layer structure as shown below.

Let’s construct the VGG Face model

from keras.models import Sequential
from keras.layers import ZeroPadding2D, Convolution2D, MaxPooling2D, Dropout, Flatten, Activation

model = Sequential()
model.add(ZeroPadding2D((1,1), input_shape=(224,224, 3)))
model.add(Convolution2D(64, (3, 3), activation='relu'))
model.add(ZeroPadding2D((1,1)))
model.add(Convolution2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D((2,2), strides=(2,2)))

model.add(ZeroPadding2D((1,1)))
model.add(Convolution2D(128, (3, 3), activation='relu'))
model.add(ZeroPadding2D((1,1)))
model.add(Convolution2D(128, (3, 3), activation='relu'))
model.add(MaxPooling2D((2,2), strides=(2,2)))

model.add(ZeroPadding2D((1,1)))
model.add(Convolution2D(256, (3, 3), activation='relu'))
model.add(ZeroPadding2D((1,1)))
model.add(Convolution2D(256, (3, 3), activation='relu'))
model.add(ZeroPadding2D((1,1)))
model.add(Convolution2D(256, (3, 3), activation='relu'))
model.add(MaxPooling2D((2,2), strides=(2,2)))

model.add(ZeroPadding2D((1,1)))
model.add(Convolution2D(512, (3, 3), activation='relu'))
model.add(ZeroPadding2D((1,1)))
model.add(Convolution2D(512, (3, 3), activation='relu'))
model.add(ZeroPadding2D((1,1)))
model.add(Convolution2D(512, (3, 3), activation='relu'))
model.add(MaxPooling2D((2,2), strides=(2,2)))

model.add(ZeroPadding2D((1,1)))
model.add(Convolution2D(512, (3, 3), activation='relu'))
model.add(ZeroPadding2D((1,1)))
model.add(Convolution2D(512, (3, 3), activation='relu'))
model.add(ZeroPadding2D((1,1)))
model.add(Convolution2D(512, (3, 3), activation='relu'))
model.add(MaxPooling2D((2,2), strides=(2,2)))

model.add(Convolution2D(4096, (7, 7), activation='relu'))
model.add(Dropout(0.5))
model.add(Convolution2D(4096, (1, 1), activation='relu'))
model.add(Dropout(0.5))
model.add(Convolution2D(2622, (1, 1)))
model.add(Flatten())
model.add(Activation('softmax'))

The research group shared pre-trained weights on the group page under the path vgg_face_matconvnet/data/vgg_face.mat, but they are Matlab compatible. Here, **your friendly neighborhood blogger** has already transformed the pre-trained weights for Keras. Because the weight file is 500 MB and GitHub enforces a 25 MB limit on uploaded files, I had to upload the pre-trained weights to Google Drive. You can find the pre-trained weights here.

model.load_weights('vgg_face_weights.h5')

Finally, we'll use the layer just before the output layer for representation. The following snippet gives the output of that layer.

vgg_face_descriptor = Model(inputs=model.layers[0].input , outputs=model.layers[-2].output)

In this way, we can represent images as 2622-dimensional vectors, as illustrated below.

img1_representation = vgg_face_descriptor.predict(preprocess_image('1.jpg'))[0,:]
img2_representation = vgg_face_descriptor.predict(preprocess_image('2.jpg'))[0,:]

Notice that the VGG model expects 224x224x3 input images; the 3rd dimension refers to the number of channels (RGB).

def preprocess_image(image_path):
    img = load_img(image_path, target_size=(224, 224))
    img = img_to_array(img)
    img = np.expand_dims(img, axis=0)
    img = preprocess_input(img)
    return img

We've represented the input images as vectors. We will decide whether both pictures show the same person by comparing these vector representations. Now we need to find the distance between the vectors. There are two common measures: cosine distance and Euclidean distance. Cosine distance is equal to 1 minus cosine similarity. No matter which measure we adopt, they all serve to quantify the similarity between vectors.

import numpy as np

def findCosineDistance(source_representation, test_representation):
    a = np.matmul(np.transpose(source_representation), test_representation)
    b = np.sum(np.multiply(source_representation, source_representation))
    c = np.sum(np.multiply(test_representation, test_representation))
    return 1 - (a / (np.sqrt(b) * np.sqrt(c)))

def findEuclideanDistance(source_representation, test_representation):
    euclidean_distance = source_representation - test_representation
    euclidean_distance = np.sum(np.multiply(euclidean_distance, euclidean_distance))
    euclidean_distance = np.sqrt(euclidean_distance)
    return euclidean_distance
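A quick toy check of the Euclidean variant (the vectors are illustrative; the classic 3-4-5 triangle should give a distance of 5):

```python
import numpy as np

def findEuclideanDistance(source_representation, test_representation):
    euclidean_distance = source_representation - test_representation
    euclidean_distance = np.sum(np.multiply(euclidean_distance, euclidean_distance))
    return np.sqrt(euclidean_distance)

dist = findEuclideanDistance(np.array([0.0, 0.0]), np.array([3.0, 4.0]))
print(dist)  # 5.0
```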

We've represented images as vectors and can measure the similarity of two vectors. If both images show the same person, the distance should be small; otherwise it should be large. Here, the epsilon value is the threshold.

epsilon = 0.40 # cosine distance threshold
#epsilon = 120 # euclidean distance threshold

def verifyFace(img1, img2):
    img1_representation = vgg_face_descriptor.predict(preprocess_image(img1))[0,:]
    img2_representation = vgg_face_descriptor.predict(preprocess_image(img2))[0,:]
    cosine_distance = findCosineDistance(img1_representation, img2_representation)
    euclidean_distance = findEuclideanDistance(img1_representation, img2_representation)
    if cosine_distance < epsilon:
        print("verified... they are same person")
    else:
        print("unverified! they are not same person!")

Based on my observations, cosine distance should be less than 0.40, or Euclidean distance less than 120, for a match. The thresholds might be tuned for your problem, and the choice of similarity measure is up to you.

This is a **one-shot learning** process: we do not feed multiple images of a person to the network. Suppose we store a single picture of a person in our database; we then take a photo of that person at the building entrance and verify him. Strictly speaking, this process is **face verification** rather than face recognition.

Some researchers call this setup – computing the distance between two represented images – a **Siamese network**.

I tested the developed model on variations of Angelina Jolie and Jennifer Aniston. Surprisingly, the model verified every instance I fed it. For example, Angelina Jolie is either blonde or brunette in the following test set; she even wears a hat in one image.

The model is very successful for true negative cases, too. The descriptor can reliably tell Angelina Jolie and Jennifer Aniston apart.

The true positive results for Jennifer Aniston fascinate me. I might not have recognized the 3rd one (2nd row, 1st column) myself; Jennifer is at least 10 years younger in that photo.

I think the most striking test was on Katy Perry. The face recognition model can recognize her even through dramatic changes in appearance.

Of course, I can only test the model on a limited number of instances. The model got 98.78% accuracy on the Labeled Faces in the Wild dataset, which contains 13K images of 5K people. BTW, the researchers fed 2.6M images to tune the model weights.

Dyeing your hair or wearing a hat, just like in the movies, does not fool AI systems. Movie producers should find more creative solutions.

We can apply deep face recognition in real time as well. Face pictures in the database are represented as 2622-dimensional vectors once, at program initialization. Luckily, OpenCV can handle face detection; we then represent each detected face and check similarities.

So, we can recognize faces easily by combining transfer learning and auto-encoder concepts. Additionally, some linear algebra ideas such as cosine similarity contribute to the decision. We've fed frontal images to the model directly. Finally, I pushed the source code of the project to my GitHub profile. BTW, I ran the code with the TensorFlow backend.

The post Face Recognition with Keras appeared first on Sefik Ilkin Serengil.

We have known how to train computers for 30 years. The algorithm had been found, but the data volume wasn't enough – it starved for really big data – and it required high computation power. Old hardware could not perform that level of computation, so we could not run the algorithm as we wished. Then we realized that the processing units developed for 3D game lovers can also run complex neural network computations, because these units were designed for large matrix operations. Moreover, everyone now shares photos and information (check-ins, comments) on the internet; in this way, the data shortage was solved. Finally, we had been modeling neural networks with 3 layers; increasing the number of layers created deep learning – the "deep" term comes from here. All of these modifications worked unexpectedly well. Now, we can teach these systems almost anything if we feed them data. – Cem Say, Watson Istanbul Summit 2017

In the same spirit, Barbara's deep learning definition is my favorite: she defines deep learning as matrix multiplication, a lot of matrix multiplication.

Alan Kay means a lot to technology enthusiasts. He pioneered the first GUI concept and the first object-oriented programming language, and he designed the first tablet PC concept in the 70s. Beyond these inventions, he considers hardware as important as software.

*People who are really serious about software should make their own hardware* – Alan Kay, Creative Think Seminar (1982)

You can see that Apple adopts this quote as a principle in its products. Apple cited this quotation at the iPhone launch in 2007.

He was invited to the premiere. It might be because he is the inventor of the tablet computer's prototype – a legacy version of the iPad.

If you define a constant variable, it is a scalar. Matrices store arrays of scalar values in a multi-dimensional space, and a vector is a 1-dimensional matrix. Herein, a tensor can be thought of as a matrix whose items are themselves matrices. You can model a neural network with scalar variables, but this increases the computation time radically; if you model your network with tensors instead, computation time drops dramatically. The name of Google's deep learning framework, TensorFlow, comes from here.
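In NumPy terms (a sketch; TensorFlow tensors behave analogously), these objects differ only in their number of dimensions:

```python
import numpy as np

scalar = np.array(5.0)                        # 0-D: a single value
vector = np.array([1.0, 2.0, 3.0])            # 1-D: a 1-dimensional matrix
matrix = np.array([[1.0, 2.0], [3.0, 4.0]])   # 2-D: rows and columns
tensor = np.zeros((2, 3, 4))                  # 3-D: a stack of matrices

print(scalar.ndim, vector.ndim, matrix.ndim, tensor.ndim)  # 0 1 2 3
```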

We've mentioned that graphics processing units triggered machine learning adoption, but GPUs were not created for machine learning studies. Herein, tensor processing units are hardware designed specifically to compute complex tensor operations. In other words, TPUs are hardware specialized to accelerate machine learning workloads.

Today, Google produces its own TPUs, just as Alan Kay declared and Apple adopted. This technology was first announced at Google I/O 2016. It is hard even to imagine how much this hardware accelerates the process.

*If you are training an image recognition model, Let’s say ResNet 50, it is a sort of standard benchmark right now. It was state of the art, not too long ago. And if you want to train that to 75% accuracy, which is what you expect from publication on the subject, that might previously have taken days a few years ago, when that paper was published. Now, that is down to about 12.5 hours on one of those cloud TPUs. And on the full TPU pod, you can do that in less than 12.5 minutes* – Zak Stone, TensorFlow Dev Summit 2018

BTW, these are not typical hardware: you cannot buy them anywhere. Google offers them on the cloud instead of on-premise.

Respect to Alan Kay. Apparently, he still continues to change the world, even in fields he hasn't been involved in.

The post Driver of Machine Learning Success: Hardware appeared first on Sefik Ilkin Serengil.

Luckily, it was marketing people who named the cloud. If engineers had named this technology, it would most probably have been "remote computer access". Thanks to Erdem for raising this awareness; he is currently a marketing director with an engineering background.

The formal definition of deep learning is wide and deep neural networks, where deep refers to the number of layers. We need to define neural networks, too. Neural networks are mechanisms modeled on the human nervous system – but do we really know exactly how the human nervous system works? They are actually just mathematical models. Barbara describes deep learning realistically:

*My favorite definition of deep learning is matrix multiplication, a lot of matrix multiplication*

Similarly, Francois Chollet defines neural networks in a realistic way.

*Neural networks are a sad misnomer. They’re neither neural nor even networks. They’re chains of differentiable, parameterized geometric functions, trained with gradient descent (with gradients obtained via the chain rule). A small set of highschool-level ideas put together*

Both neural networks and deep learning seem to have been named by marketers. The names might sound much deeper than the things they describe, but they are sympathetic and catchy. Who knows – maybe the naming adopted this motivation first, and the adoption triggered the progress.

The post 5 Facts about Deep Learning and Neural Networks appeared first on Sefik Ilkin Serengil.

Machine learning practitioners have to have coding, math and communication skills. Developers already have strong coding skills; however, they mostly lack math-oriented thinking. Transforming from a code-based perspective to a math-based mindset handles most of the issues.

Imagine a basic neural cell. Inputs are connected through weights: each input is multiplied by its own weight, and the sum of these products is stored in a summation unit. The summation is then passed through an activation function, and in this way the output of the cell is calculated.

Developers tend to implement this calculation with for loops. They might store features in an input array and weights in a weight array, then loop over the arrays, multiplying same-indexed input and weight items in each iteration and accumulating the products in a variable.

import numpy as np

inputs = np.array([1, 0, 1])
weights = np.array([0.3, 0.8, 0.4])

total = 0  # renamed from "sum" to avoid shadowing the Python built-in
for i in range(inputs.shape[0]):
    total = total + inputs[i] * weights[i]
print(total)

This approach is fine and will work. However, what if there are hundreds of thousands of input parameters? The interpreter needs to find each variable in a different memory slot, which causes a real performance problem. Machine learning practitioners tend to solve the same problem differently.

vectorized_sum = np.matmul(np.transpose(weights), inputs)

Inputs and weights are same-sized vectors. Linear algebra says that two matrices can be multiplied only if the column count of the first equals the row count of the second. Here, we can apply matrix multiplication if we transpose the weights vector; the transposed weights can then be multiplied by the inputs. The matrix multiplication produces the same result as the for loop. This approach is called vectorization.
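Putting the two approaches side by side confirms they produce the same number (inputs and weights reuse the values from the loop snippet):

```python
import numpy as np

inputs = np.array([1, 0, 1])
weights = np.array([0.3, 0.8, 0.4])

# loop version
loop_sum = 0.0
for i in range(inputs.shape[0]):
    loop_sum = loop_sum + inputs[i] * weights[i]

# vectorized version: transposed weights times inputs
vectorized_sum = np.matmul(np.transpose(weights), inputs)

print(loop_sum, vectorized_sum)  # both approximately 0.7
```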

Vectorization also makes your code much cleaner. As Bjarne Stroustrup, the creator of C++, put it: "I like my code elegant and efficient". Remember Audrey Hepburn's quote: "elegance is the only beauty that never fades". Of course, this is not all!

Very complex deep neural network structures can be reduced to matrix multiplication operations, both for training (the backward pass) and for prediction (the forward pass). Vectorization can be on the order of 150 times faster than traditional loops when the input and weight pairs are really large.

*My favorite definition of Deep Learning is matrix multiplication, a lot of matrix multiplication… – Barbara Fusinska*

This is one of the reasons the machine learning community adopted Python. Even though matrices exist in high-level languages such as Java or C#, there is no out-of-the-box function for matrix operations; matrix multiplications are typically handled with for loops in these languages. Herein, Python offers high-performance matrix multiplication through the numpy library, and TensorFlow likewise includes powerful matrix multiplication support. However, this does not mean Python itself is much faster than the others – Python is faster only if you know how to apply linear algebra in your program.

Even though Matlab is more powerful than Python for matrix-based computation, it is not good at data transformation, database connections or service integration. On the other hand, high-level languages such as Java and .NET are good at these topics but not at matrix operations. Herein, Python is a strong computation language that also supports object-oriented programming, databases and services. I imagine Python as a programming language between Matlab and Java.

Modeling and training a neural network might last weeks, but the model can respond immediately once training is over. Herein, the well-accepted approach is to separate training and prediction tasks. You should run training asynchronously as a batch process, maybe once a week. High-level languages such as Java can consume TensorFlow-based pre-trained models, so you don't need to do anything more if your production environment is Java-based. We can say that training is the responsibility of the data scientist, whereas developers are in charge of prediction.

Today, unit tests are a very important part of software development and the DevOps process – so much so that test-driven development proposes writing unit tests even before coding. The question is when to write unit tests; the answer is when you know the exact result the program should produce. That's why you should not adapt this approach to the machine learning lifecycle.

Suppose you would like to develop an app that classifies men and women from photos, and you have 100 images for training. You create a model that classifies all training images correctly; however, when you test the model with unseen images, it classifies only 30 of 100 correctly. This means your model didn't learn anything – it only memorized the training instances.

Your model would be more successful if it classified 70% correctly for both seen and unseen instances. This is the most challenging problem in machine learning, called overfitting. It is dangerous because it misguides you: you might think you've got satisfactory accuracy, but only on seen examples. That's why we deliberately let the model fail a little on the training set to handle overfitting.

This is the reason why unit testing is not a good fit for machine learning. Instead, we might separate 80% of the data set for learning and 20% for testing; in this way, we can evaluate the model on unseen examples.
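A minimal sketch of such a split in plain NumPy (the 80/20 ratio follows the text; the random data and the 150x4 shape are just placeholders):

```python
import numpy as np

x = np.random.rand(150, 4)               # e.g. 150 instances, 4 features
y = np.random.randint(0, 3, size=150)    # dummy labels

indices = np.random.permutation(len(x))  # shuffle before splitting
split = int(0.8 * len(x))                # 80% boundary
train_idx, test_idx = indices[:split], indices[split:]

x_train, y_train = x[train_idx], y[train_idx]
x_test, y_test = x[test_idx], y[test_idx]
print(x_train.shape, x_test.shape)  # (120, 4) (30, 4)
```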

Most enterprises appoint a Chief Data Officer who reports directly to the CEO or COO. This role is a must for a data-driven organization, and beyond the CDO, some organizations hire a Chief AI Officer. This is important because you cannot evaluate data science team members as if they were software engineers.

You can expect new lines of code from software engineers; well-defined expectations exist for them. However, data scientists spend most of their time tuning machine learning models: they design the neural network structure, decide the input features, try different numbers of layers and nodes, and make decisions about the activation function and optimization algorithm, because there is no formula to determine these configurations. This is the state-of-the-art part of the work, and they have the luxury to fail because they are working on research projects. Similarly, data engineers are responsible for preparing data so that data scientists can work. A typical software engineering team manager cannot understand and evaluate the responsibilities of data science team members.

Google's research paper, Hidden Technical Debt in Machine Learning Systems, reveals that only a small fraction of a real-world machine learning system is composed of the core machine learning code; the required surrounding infrastructure is vast and complex.

Machine learning practitioners mostly discuss algorithms rather than technology or programming languages.

There are mainly 3 problem types in machine learning. If your problem answers the question "how much" or "how many", it is regression – think of predicting what Apple shares will be worth next week. If your problem answers a question such as "is it this or not", it is classification; recognizing cat photos is a common binary classification task. The number of classes can be greater than 2, which is multi-class classification.

You can evaluate your predictions in both regression and classification because you already know the actual results. If your model recognizes an image as a cat but it is actually a dog, you can say the prediction is incorrect. The actual labels supervise your learning process; that's why both regression and classification are kinds of supervised learning.

What if your data doesn't have actual labels? If there are no supervising labels, the learning type is unsupervised learning. You might, for example, cluster instances based on some of their attributes.

Regression, classification and clustering must be handled by different machine learning algorithms. For example, gradient boosted decision trees are a very powerful algorithm for classification problems; you can spend a year becoming an expert in GBDT, yet you still cannot directly reuse that classifier setup for regression problems. Similarly, you can apply linear regression to regression problems, but you cannot fairly apply it to classification tasks.

Herein, neural networks are like a Swiss army knife: you can apply them to any kind of machine learning task. You just need to feed the data to the algorithm, and they can solve regression, classification and clustering tasks.

*My favorite machine learning is Neural Nets. That’s my favorite. My 2nd favorite machine learning is SVD. Everyone says, oh don’t you prefer gradient boosted trees? I know GBTs are great but I like NN best and I like SVD next best – Francois Chollet*

So, machine learning practitioners have to have math, coding and communication skills. Developers are among the strongest candidates for the machine learning field because they are past masters at coding; however, they need to transform from a code-based mindset to a math-based perspective. Also, data science projects do not work like typical software projects. Developers need to adapt to a new data-driven way of doing business, instead of expecting the business to transform to the conventions of software development teams.

The post A Developer’s Guide to Machine Learning appeared first on Sefik Ilkin Serengil.

The post Artistic Style Transfer with Deep Learning appeared first on Sefik Ilkin Serengil.

Artistic style transfer (aka neural style transfer) enables us to transform ordinary images into masterpieces. It is actually a combination of several deep learning techniques: convolutional neural networks, transfer learning and auto-encoders. Even though implementations can be found everywhere, the theoretical background is genuinely hard, which is why those implementations are complex to understand. In this post, we will cover the background of style transfer and apply it from scratch.

First of all, this technique is not a typical neural network operation. Typical neural networks tune weights based on input and output pairs. Here, we will use a pre-trained network and never update its weights – we will update the input instead.

The original study uses the VGG model as the pre-trained network, and we'll use the same network in this post. This is not a prerequisite; you could use any other pre-trained network. Basically, the VGG network looks like the following illustration.

In this study, we would like to transfer the style of one image to another. The image we would like to transform is called the **content** image, whereas the image whose style we would like to transfer is called the **style** image. The style image's brush strokes will then be reflected onto the content image, and this new image is called the **generated** image.

The content and style images already exist. You might remember that we must initialize weights randomly in neural networks. Here, the generated image will be initialized randomly instead of the weights. Remember that this application is not a typical neural network. Let's construct the code for reading the content and style images, and for creating a random image as the generated image.

from keras.preprocessing.image import load_img, img_to_array
from keras.applications.vgg19 import preprocess_input
import numpy as np

height = 224; weight = 224 #original input size of vgg

def preprocess_image(image_path):
    img = load_img(image_path, target_size=(height, weight))
    img = img_to_array(img)
    img = np.expand_dims(img, axis=0) #add a dummy batch dimension
    img = preprocess_input(img)
    return img

content_image = preprocess_image("content.jpeg")
style_image = preprocess_image("style.jpg")

random_pixels = np.random.randint(256, size=(1, height, weight, 3)).astype('float64')
generated_image = preprocess_input(random_pixels)

Normally, python stores an image in a 3D numpy array (one dimension for the RGB channels). However, the VGG network is designed to work with 4D inputs. If you transfer a 3D numpy array to its input, you'll face the exception "*block1_conv1: expected ndim=4, found ndim=3*". That's why we have added the expand dimensions command in the preprocessing step. This command adds a dummy batch dimension to handle the fault. Additionally, the input size of the VGG network is 224x224x3. That is why the content, style and generated images are all of size 224×224.
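To make the shape handling concrete, here is a small standalone numpy sketch (the 224×224 size comes from VGG; the all-zero image is just a stand-in):

```python
import numpy as np

img = np.zeros((224, 224, 3))        # a single RGB image as a 3D array
batch = np.expand_dims(img, axis=0)  # add a dummy batch dimension
print(batch.shape)  # (1, 224, 224, 3) - what VGG expects
```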

Now, we are going to transfer those images to the VGG network as input features. But we need the outputs of some intermediate layers instead of the output of the network. Remember that auto-encoders can be used to extract a representation of data. Actually, we use VGG to extract representations of those images.

Luckily, Keras offers the winning CNN models as out-of-the-box functions.

from keras.applications import vgg19
from keras import backend as K

content_model = vgg19.VGG19(input_tensor=K.variable(content_image), weights='imagenet')
style_model = vgg19.VGG19(input_tensor=K.variable(style_image), weights='imagenet')
generated_model = vgg19.VGG19(input_tensor=K.variable(generated_image), weights='imagenet')

We will store the loss value twice, once for content and once for style. In typical neural networks, the loss value is calculated by comparing the actual output and the model output (prediction). Here, we will compare the compressed representations of the auto-encoded images. Please remember that these compressed representations are actually the outputs of some middle layers. Let's store the name and output of each layer once the network is built.

content_outputs = dict([(layer.name, layer.output) for layer in content_model.layers])
style_outputs = dict([(layer.name, layer.output) for layer in style_model.layers])
generated_outputs = dict([(layer.name, layer.output) for layer in generated_model.layers])

We'll transfer the randomly generated image and the content image through the same VGG network. The original work uses the 5th block's 2nd convolution layer (block5_conv2) to calculate the content loss. This is not a must; you might use a different layer to compress images in your own work.

We have already transferred both the content and generated images to the VGG network in the previous step. We can calculate the content loss as the squared difference of the outputs of the same layer for the content and the generated image.

def content_loss(content, generated):
    return K.sum(K.square(content - generated))

content_features = content_outputs['block5_conv2']
generated_features = generated_outputs['block5_conv2']
contentloss = content_loss(content_features, generated_features)
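The content loss is just a sum of squared differences; a tiny numpy sketch with made-up feature values illustrates it:

```python
import numpy as np

# made-up feature values standing in for block5_conv2 outputs
content_features = np.array([1.0, 2.0, 3.0])
generated_features = np.array([1.5, 2.0, 2.0])

loss = np.sum(np.square(content_features - generated_features))
print(loss)  # 0.25 + 0.0 + 1.0 = 1.25
```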

Style loss is a little bit harder to calculate. Firstly, we will compare the outputs of the first convolution layer of each of the 5 blocks.

Here, we are expected to find distances between gram matrices. A gram matrix can be calculated by multiplying a matrix with its transpose.

gram = K.dot(features, K.transpose(features))

We need to work on 2D matrices to calculate the gram matrix. Basically, the batch flatten command transforms an n-dimensional matrix into a 2-dimensional one. Notice the structure of the VGG network. For instance, the size of the 3rd convolution layer is (56×56)x256. Here, 256 refers to the number of filters in that layer. If the shape of the layer is transformed to 256x56x56, the 56×56 sized matrices are put alongside each other. The permute dimensions function helps us organize the matrices before flattening.

def gram_matrix(x):
    #put the number of filters into the 1st dimension first
    features = K.batch_flatten(K.permute_dimensions(x, (2, 0, 1)))
    gram = K.dot(features, K.transpose(features))
    return gram

BTW, a visual demonstration of a gram matrix is illustrated below. You might think of nc as the number of filters.
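The gram matrix computation can be checked on a toy example with plain numpy (the 2×3 feature matrix here is made up for illustration):

```python
import numpy as np

# toy feature map: 2 filters (nc = 2), each flattened to 3 activations
features = np.array([[1.0, 2.0, 3.0],
                     [4.0, 5.0, 6.0]])

gram = features.dot(features.T)  # 2x2 matrix of filter-to-filter correlations
print(gram)  # [[14. 32.], [32. 77.]]
```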

Now, we can calculate the style loss.

def style_loss(style, generated):
    style_gram = gram_matrix(style)
    generated_gram = gram_matrix(generated)
    channels = 3
    size = height * weight
    return K.sum(K.square(style_gram - generated_gram)) / (4. * (pow(channels, 2)) * (pow(size, 2)))

#first conv layer of each block; you can check the names by running content_model.summary()
feature_layers = ['block1_conv1', 'block2_conv1', 'block3_conv1', 'block4_conv1', 'block5_conv1']

styleloss = K.variable(0)
for layer_name in feature_layers:
    style_features = style_outputs[layer_name]
    generated_features = generated_outputs[layer_name]
    styleloss = styleloss + style_loss(style_features[0], generated_features[0])

We have calculated both content and style loss. We can calculate total loss right now.

alpha = 0.025; beta = 0.2
total_loss = alpha * contentloss + beta * styleloss

Remember that the total loss is propagated backwards to all weights in the back propagation algorithm. The derivative of the total error with respect to each weight is calculated in the neural networks learning procedure. This calculation is also called gradient calculation. In style transfer, we need the gradients with respect to the input instead of the weights.

#gradients = K.gradients(total_loss, generated_model.trainable_weights) #typical usage
gradients = K.gradients(total_loss, generated_model.input)
print(gradients)

In this way, a (1, 224, 224, 3) shaped tensor will be calculated as the gradients, just like our images. Now, we will update the generated image itself instead of the weights.

learning_rate = np.array([0.1])
generated_image = generated_image - learning_rate * gradients[0]

Actually, we have just applied basic gradient descent to the randomly generated image. You might apply the Adam optimization algorithm to create art faster. BTW, the original work uses the L-BFGS optimization algorithm to update the image content. Sincerely, this was the first time I had heard of that optimization algorithm. The researchers state that L-BFGS outperforms Adam and the others, but plain gradient descent still works. Finally, we need to run all those operations in a for loop (epochs) to get real learning.
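The core idea of descending on the input rather than the weights can be sketched with plain numpy on a one-pixel toy "image" (the quadratic loss here is an illustrative stand-in for the content and style losses):

```python
import numpy as np

# a one-pixel "image" x, descended toward a target value of 3.0
target = np.array([3.0])
x = np.array([0.0])  # randomly initialized stand-in for the generated image
learning_rate = 0.1

for epoch in range(100):
    gradient = 2 * (x - target)  # derivative of the loss (x - target)^2 w.r.t. the input
    x = x - learning_rate * gradient

print(x)  # very close to [3.]
```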

I have applied the style of Van Gogh's Starry Night to a photo of Galatasaray University. The result seems very impressive after 10 epochs.

So, we have covered artistic style transfer in this post. It is a combination of several high level deep learning techniques such as convolutional neural networks, transfer learning and auto-encoders. I strongly recommend understanding these related topics first before applying style transfer.

Even though the motivation's background is neural networks, we did not apply the standard neural networks rules. We updated inputs instead of weights, and we consumed the outputs of some intermediate layers instead of the network output.

As usual, source code of this post is pushed into GitHub.

Enjoy!


The post The Insider's Guide to Adam Optimization Algorithm for Deep Learning appeared first on Sefik Ilkin Serengil.

Remember the weight update rule of classical gradient descent:

w_{i} = w_{i} – α.(∂Error/∂w_{i})

Or more basically we can demonstrate classical gradient descent as

w_{i} = w_{i} – α.dw_{i}

This is common because it works slowly but surely.

In 2015, the Adam optimization algorithm was introduced. The name of the algorithm refers to adaptive moment estimation. Actually, it is an extension of stochastic gradient descent, but it tends to converge the cost function to zero faster for many problems.

In Adam, we will include vdw and sdw variables instead of the raw partial derivatives of the error with respect to the weights. You might think of vdw as similar to momentum, and sdw as similar to RMSProp.

Firstly, we will assign initial values of vdw and sdw for every weight to 0.

vdw_{i} = 0; sdw_{i} = 0

Then, vdw and sdw values will be calculated for each weight.

vdw_{i} = β1 . vdw_{i} + (1 – β1) . dw_{i}

sdw_{i} = β2 . sdw_{i} + (1 – β2) . (dw_{i})^{2}

After that, the calculated vdw and sdw values will be bias-corrected by a power of the current epoch index.

vdw_{i}_corrected = vdw_{i} / (1 – β1^{epoch})

sdw_{i}_corrected = sdw_{i} / (1 – β2^{epoch})

Finally, each weight will be updated with its previous value minus the learning rate times the ratio of the corrected vdw and sdw values.

w_{i} = w_{i} – α.(vdw_{i}_corrected / (√(sdw_{i}_corrected) + ε))

Herein, the recommended values for Adam's parameters are shown below. Additionally, the learning rate α should be tuned.

β1 = 0.9; β2 = 0.999; ε = 10^{-8}
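To see these update rules in action, here is a hedged numpy sketch that runs Adam on the toy objective f(w) = w², using the recommended parameter values (the objective and iteration count are illustrative choices, not from the post):

```python
import numpy as np

# toy objective f(w) = w^2 with gradient dw = 2w; minimum at w = 0
alpha = 0.1
beta1, beta2, epsilon = 0.9, 0.999, 1e-8

w = 5.0
vdw, sdw = 0.0, 0.0

for t in range(1, 1001):
    dw = 2 * w                                 # gradient of the toy objective
    vdw = beta1 * vdw + (1 - beta1) * dw       # momentum-like moving average
    sdw = beta2 * sdw + (1 - beta2) * dw ** 2  # RMSProp-like moving average
    vdw_corrected = vdw / (1 - beta1 ** t)     # bias correction
    sdw_corrected = sdw / (1 - beta2 ** t)
    w = w - alpha * vdw_corrected / (np.sqrt(sdw_corrected) + epsilon)

print(w)  # close to the minimum at 0
```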

As seen, the algorithm requires really little memory. That's why it is easy to adapt to daily life problems.

I have also reflected the Adam logic in code to monitor how it works.

Firstly, initialize Adam parameters.

vdw = [0.0 for i in range(num_of_layers)]

sdw = [0.0 for i in range(num_of_layers)]

epsilon = np.array([pow(10, -8)])

beta1 = 0.9; beta2 = 0.999

Then, apply backpropagation and find dw – the partial derivative of the total error with respect to each weight. Thereafter, update the weights. The following code stores weights as matrices. You can find the entire code on my GitHub profile.

for i in range(epoch):
    for j in range(num_of_instances):
        for k in range(num_of_layers - 1):
            dw = findDelta() #partial derivative of total error with respect to the weight
            if optimization_algorithm == 'gradient-descent':
                w[k] = w[k] + learningRate * dw
            elif optimization_algorithm == 'adam':
                vdw[k] = beta1 * vdw[k] + (1 - beta1) * dw
                sdw[k] = beta2 * sdw[k] + (1 - beta2) * pow(dw, 2)
                vdw_corrected = vdw[k] / (1 - pow(beta1, i + 1)) #bias correction by epoch index
                sdw_corrected = sdw[k] / (1 - pow(beta2, i + 1))
                w[k] = w[k] + learningRate * (vdw_corrected / (np.sqrt(sdw_corrected) + epsilon))

When I applied both gradient descent and the Adam optimization algorithm to the XOR problem with the same configuration (same learning rate and same initial weights), Adam tended to converge the error to zero much faster.

Today, Adam is much more meaningful for very complex neural networks and deep learning models with really big data.

Can you imagine what it would be like if optimization algorithms were car brands? Gradient descent would be Volvo: it moves with slow but sure steps, and it gives consumers confidence. On the other hand, the Adam optimization algorithm would be Tesla, because no other algorithm has an insane mode today! Also, Adam usually outperforms the rest of the optimization algorithms.


The post An Overview to Vanishing Gradient Problem appeared first on Sefik Ilkin Serengil.

It was discovered in 1986 that multi layered perceptrons can handle non-linear problems. The discovery brought an AI winter to an end. Unfortunately, that was only the 1st AI winter!

This discovery requires activation units to be differentiable functions. In this way, we can back-propagate errors and apply learning. Herein, sigmoid and tanh are among the most common activation functions. However, these functions come with a huge defect.

The sigmoid function is meaningful for inputs between (-5, +5). In this range, it has a derivative different from 0. This means that we can back-propagate errors and apply learning.
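A quick numpy sketch shows how the sigmoid derivative vanishes outside that range:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)  # maximum value is 0.25, at x = 0

print(sigmoid_derivative(0.0))   # 0.25 - gradient flows, learning happens
print(sigmoid_derivative(10.0))  # ~0.000045 - gradient has almost vanished
```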

Ian Goodfellow represents this meaningfulness as the mobility of Bart Simpson on his skateboard. Gravity helps Bart move if he is in the range of [-5, +5].

On the other hand, gravity will not help Bart move if he is at a point greater than 5 or less than -5. This representation describes the gradient vanishing problem very well. If the derivative of the activation function always produces 0, then we cannot update the weights. But this is just the consequence. The question is what causes this consequence to happen.

Wide and deep networks tend to produce large outputs in every layer. Constructing a wide and deep network with sigmoid activation units reveals the gradient vanishing or exploding problem. This ends us up in the **AI winter** again.

The 2nd AI winter passed away in 2011. The rise of a simple activation function named ReLU showed us sunny days again. This function is the identity function for positive inputs, whereas it produces zero for negative inputs.

Let's imagine Bart's mobility on this new function. Gravity causes Bart to move for any positive input.

Wide network structures tend to produce mostly large positive outputs among layers. That's why most gradient vanishing problems would be solved even though gravity would not help Bart move for negative inputs.

You might consider using Leaky ReLU as the activation unit to handle this issue for negative inputs. Bart can move at any point on this new function! Leaky ReLU is a non-linear function, it is differentiable, and its derivative is different from 0 at any point except 0.
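A minimal numpy sketch contrasts the two derivatives (the 0.01 leak factor is a common default choice, not a requirement):

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

x = np.array([-2.0, 3.0])

# sub-gradients: ReLU is "dead" for negative inputs, Leaky ReLU is not
relu_grad = np.where(x > 0, 1.0, 0.0)
leaky_grad = np.where(x > 0, 1.0, 0.01)

print(relu_grad)   # the -2 input gets gradient 0 - no learning there
print(leaky_grad)  # the -2 input still gets gradient 0.01 - Bart can move
```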

Let's construct a wide and deep neural networks model. Basically, I'll create a model for handwritten digit classification. There are 4 hidden layers consisting of 128, 64, 32 and 16 units respectively. Actually, it is not that deep.

classifier = tf.contrib.learn.DNNClassifier(
    feature_columns=feature_columns,
    n_classes=10, #0 to 9 - 10 classes
    hidden_units=[128, 64, 32, 16],
    optimizer=tf.train.GradientDescentOptimizer(learning_rate=0.1),
    activation_fn=tf.nn.sigmoid
)

As seen, the model is a disappointment: accuracy is very low.

All we need to do is switch the activation function to ReLU.

classifier = tf.contrib.learn.DNNClassifier(
    feature_columns=feature_columns,
    n_classes=10, #0 to 9 - 10 classes
    hidden_units=[128, 64, 32, 16],
    optimizer=tf.train.GradientDescentOptimizer(learning_rate=0.1),
    activation_fn=tf.nn.relu
)

As seen, accuracy increases dramatically when the activation unit is ReLU.

BTW, I’ve pushed the code into GitHub.

So, AI studies had an unproductive period of almost 20 years between 1986 and 2006 because of activation units. Funnily, this challenging problem can be solved with a simple function. ReLU is the reason why we are much stronger in AI studies these days.


The post Official Guide To Fermat's Little Theorem appeared first on Sefik Ilkin Serengil.

Remember Fermat's Little Theorem:

a^{p} – a = 0 (mod p)

We already know that the statement is true when a = 0. The statement is also valid when a = 1.
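Both base cases, and the statement in general, can be checked numerically with a few lines of Python (p = 7 here is just an arbitrary prime for illustration):

```python
# check a^p - a = 0 (mod p) for a small prime and a range of a values
p = 7
holds = all((a ** p - a) % p == 0 for a in range(0, 50))
print(holds)  # True
```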

Now, we'll jump to a = n and suppose that the statement is true for this case. If we can prove that the statement is true for a = n + 1 based on the previous assumption, then we prove the correctness of the statement for all a. This approach is called **proof by induction**.

(n+1)^{p} – (n+1)

Here, we can use binomial theorem to expand the term.

(x+y)^{n} = C(n, 0)x^{n}y^{0} + C(n, 1)x^{n-1}y^{1} + C(n, 2)x^{n-2}y^{2} + … + C(n, n-1)x^{1}y^{n-1} + C(n, n)x^{0}y^{n}

Let’s apply binomial theorem to expand n+1 to the power of p.

(n+1)^{p} = C(p, 0)n^{p} + C(p, 1)n^{p-1} + C(p, 2)n^{p-2} + … + C(p, p-1)n^{1} + C(p, p)n^{0}

Now, replace the power term in main statement.

(n+1)^{p} – (n+1) = C(p, 0)n^{p} + C(p, 1)n^{p-1} + C(p, 2)n^{p-2} + … + C(p, p-1)n^{1} + C(p, p)n^{0} – (n+1)

The C coefficients refer to combinations, and they can be calculated as

C(i, j) = i! / (j! (i-j)!)

That’s why, both C(p, 0) and C(p, p) terms are equal to 1.

(n+1)^{p} – (n+1) = n^{p} + C(p, 1)n^{p-1} + C(p, 2)n^{p-2} + … + C(p, p-1)n^{1} + n^{0} – n – 1

Notice that (n^{p} – n) exists in the equation above; let's group those terms. Also, n^{0} is equal to 1, and there is a -1 term in the equation; let's cancel them.

(n+1)^{p} – (n+1) = (n^{p} – n) + C(p, 1)n^{p-1} + C(p, 2)n^{p-2} + … + C(p, p-1)n^{1}

Notice that our assumption is that (n^{p} – n) can be divided by p. That's why we can remove it and focus on the rest of the equation. The question is whether the following term can be divided by p or not.

C(p, 1)n^{p-1} + C(p, 2)n^{p-2} + … + C(p, p-1)n^{1}

We should focus on the coefficient terms. Herein, imagine the binomial expansion and Pascal's triangle.

| Pow | Expansion |
| --- | --- |
| 0 | 1 |
| 1 | 1 1 |
| 2 | 1 2 1 |
| 3 | 1 3 3 1 |
| 4 | 1 4 6 4 1 |
| 5 | 1 5 10 10 5 1 |
| 6 | 1 6 15 20 15 6 1 |
| 7 | 1 7 21 35 35 21 7 1 |

We already know that p is a prime. That's why we should focus only on the lines where the pow is prime.

| Pow | Expansion |
| --- | --- |
| 2 | 1 2 1 |
| 3 | 1 3 3 1 |
| 5 | 1 5 10 10 5 1 |
| 7 | 1 7 21 35 35 21 7 1 |

Remember that the C(p, 0) and C(p, p) coefficients are equal to 1, and that we separated their terms because, together with the –(n+1) term, they form (n^{p} – n), which we supposed can be divided by p. That's why I'll remove the 1 terms in the expansion column.

| Pow | Expansion |
| --- | --- |
| 2 | 2 |
| 3 | 3 3 |
| 5 | 5 10 10 5 |
| 7 | 7 21 35 35 21 7 |

Notice that every term in these expansions can be divided by the pow. However, we have to prove this to be convinced.
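Before the formal argument, a quick check in Python using the combination formula above shows the pattern for a prime pow, and shows that it breaks for a composite pow like 6:

```python
from math import factorial

def C(i, j):
    # C(i, j) = i! / (j! (i - j)!)
    return factorial(i) // (factorial(j) * factorial(i - j))

p = 7  # a prime: every inner coefficient is divisible by p
print([C(p, k) % p for k in range(1, p)])  # [0, 0, 0, 0, 0, 0]

q = 6  # a composite: the pattern breaks, e.g. C(6, 2) = 15
print([C(q, k) % q for k in range(1, q)])
```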

Let's focus on a concrete example. I pick p as 7, and my alphabet is {1, 2, 3, 4, 5, 6, 7}. I would like to produce sets of length 4. In other words, I wonder about C(7, 4).

Remember that order doesn't matter in combinations. I mean that (1, 3, 5, 7) and (3, 5, 7, 1) are the same set in my combination space. How can I manipulate such a set? The answer is easy: I can increase all item values modulo 7.

(1, 3, 5, 7); (2, 4, 6, 1); (3, 5, 7, 2); (4, 6, 1, 3); (5, 7, 2, 4); (6, 1, 3, 5); (7, 2, 4, 6)

If the final set (7, 2, 4, 6) were shifted one more time, it would be equal to the first one, (1, 3, 5, 7). As seen, any such set can be shifted 7 times before repeating. This means the sets can be grouped into classes of 7, so the total count C(7, 4) can be divided by 7.
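The shifting argument can be sketched in a few lines of Python for p = 7, starting from the set (1, 3, 5, 7) used above:

```python
p = 7
subset = (1, 3, 5, 7)  # the starting set from the example above

orbit = []
current = subset
for _ in range(p):
    orbit.append(tuple(sorted(current)))
    current = tuple((x % p) + 1 for x in current)  # shift each item by one, wrapping 7 to 1

print(len(set(orbit)))  # 7 distinct sets; shifting once more returns to the start
```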

To sum up, C(p, x) can be divided by p for 0 < x < p if p is a prime. Let's turn back to the main statement.

(n+1)^{p} – (n+1) = (n^{p} – n) + C(p, 1)n^{p-1} + C(p, 2)n^{p-2} + … + C(p, p-1)n^{1}

We supposed that (n^{p} – n) can be divided by p and tried to prove that (n+1)^{p} – (n+1) can be divided by p based on this assumption. We also proved that C(p, x) can be divided by p for 0 < x < p when p is a prime. That's why all terms in the equation illustrated above can be divided by p.

So, we have proven the correctness of Fermat's Little Theorem by induction. We benefited from the binomial theorem and Pascal's triangle (the binomial expansion). This approach requires a more powerful math background than the necklace method.
