The post Large Scale Machine Learning with Pandas appeared first on Sefik Ilkin Serengil.

You might remember the Iris flower data set. It contains 150 instances: sepal and petal length and width measurements, plus the corresponding class. The class can be one of 3 iris flower types: setosa, versicolor and virginica. So, there are 4 input features and 3 output labels. Let’s create a hidden layer consisting of 4 nodes in the neural network. I mostly pick this number as 2/3 of the sum of the number of features and labels. Multi-class classification requires cross-entropy as the loss function. Also, I want to apply the Adam optimization algorithm to converge faster.

import keras
from keras.models import Sequential
from keras.layers import Dense, Activation

num_classes = 3 #setosa, versicolor and virginica

def createNetwork():
	model = Sequential()
	model.add(Dense(4, input_shape=(4,))) #4 hidden units; 4 features in the input layer
	model.add(Activation('sigmoid')) #activation from input layer to 1st hidden layer
	model.add(Dense(num_classes)) #num of classes in the output layer
	model.add(Activation('sigmoid')) #activation from 1st hidden layer to output layer
	return model

model = createNetwork()
model.compile(loss='categorical_crossentropy'
	, optimizer=keras.optimizers.Adam(lr=0.007)
	, metrics=['accuracy'])

Even though this data set is small enough, we will load it as sub data sets instead of loading it all at once. In this way, we save memory. On the other hand, this increases I/O usage, but that is a reasonable trade-off because we cannot fit massive data sets in memory.

The chunk size parameter is set to 30, so we will read 30 lines of the data set on each iteration. Moreover, column information is missing in the data set; that’s why we need to define column names. Otherwise, pandas treats the first row as column names and we would lose that line’s data.

import pandas as pd
import numpy as np

chunk_size = 30

def processDataset():
	for chunk in pd.read_csv("iris.data", chunksize=chunk_size
		, names=["sepal_length", "sepal_width", "petal_length", "petal_width", "class"]):
		current_set = chunk.values #convert the data frame chunk to a numpy array

The chunk variable is a pandas data frame. We can convert it to a numpy array by getting its values. This is important because the fit operation expects features and labels as numpy arrays.

A line of the data set consists of 4 measurements of a flower and the corresponding class, respectively. I can separate features and labels by specifying index values.

features = current_set[:,0:4]
labels = current_set[:,4]

Labels are in a single column and of type string. I will apply one-hot encoding to feed the network.

for i in range(0, labels.shape[0]):
	if labels[i] == 'Iris-setosa':
		labels[i] = 0
	elif labels[i] == 'Iris-versicolor':
		labels[i] = 1
	elif labels[i] == 'Iris-virginica':
		labels[i] = 2

labels = keras.utils.to_categorical(labels, num_classes)

Features and labels are ready; we can feed them to the neural network. Epochs must be set to 1 here. This is important: epochs will be handled in a for loop at the top level.

model.fit(features, labels, epochs=1, verbose=0) #epochs handled in the for loop above

We will be done processing the whole training set when processDataset() returns. Remember the back-propagation and gradient descent algorithms: we need to apply this processing over and over.

epochs = 1000
for epoch in range(0, epochs): #epochs are handled here, not in the fit command!
	processDataset()

If you set verbose to 1, you will see loss values for the current sub data set. You should ignore the loss during training because it does not represent the global loss over the training set.

So, we’ve adapted pandas to read a massive data set as small chunks and feed neural network training. This comes with pros and cons. The main advantage is that we can handle massive data sets while saving memory. The disadvantage is increased I/O usage. Note that the focus of this post is working on massive static data sets; it covers neither big data nor streaming data. I’ve pushed the source code of this post to GitHub.


The post A Beginner’s Guide to TensorFlow.js: Machine Learning in JavaScript appeared first on Sefik Ilkin Serengil.

In this case, we can just run the code; no prerequisite installation is required. I will create a hello.html file and reference the TensorFlow.js library in the head tag. This reference exposes TensorFlow-related objects under the tf variable. There might be a more up-to-date version of the library; you should check the official site.

Also, I need to define another script tag after the TensorFlow.js reference. The neural network will be constructed there.

<html>
	<head>
		<script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs@0.12.5"></script>
		<!-- Place your code in the script tag below -->
		<script>
		</script>
	</head>
	<body>
	</body>
</html>

I will construct a model for the XOR problem. Let’s create the data set first. Here, xtrain stores all possible inputs whereas ytrain stores the XOR gate results, one-hot encoded. I mean that [1, 0] refers to firing 0 whereas [0, 1] refers to firing 1 as the XOR result.

const xtrain = tf.tensor2d([[0, 0], [0, 1], [1, 0], [1, 1]]);
const ytrain = tf.tensor2d([[1, 0], [0, 1], [0, 1], [1, 0]]);

We can now construct a neural network model. I will create a sequential model. The input layer consists of 2 nodes because there are 2 input features in the XOR data set. The first and only hidden layer will have 5 nodes and its activation function will be sigmoid. Finally, the output layer will have 2 nodes because the XOR data set has 2 output classes. The activation function of the output layer should be softmax because this is a classification problem.

const model = tf.sequential();
model.add(tf.layers.dense({units: 5, activation: 'sigmoid', inputShape: [2]}));
model.add(tf.layers.dense({units: 2, activation: 'softmax'}));

Now, we can specify the optimization algorithm and loss function to train the model. You have to use the categorical cross-entropy loss function if you use softmax in the output layer. Moreover, I would like to train the model with the Adam optimization algorithm to converge faster.

var learning_rate = 0.1
model.compile({loss: 'categoricalCrossentropy', optimizer: tf.train.adam(learning_rate)});

Time to train the network. You might remember that we run fitting and prediction sequentially in Python. Here, things are a little different: the fit command runs asynchronously. That’s why you must not run fit and predict on consecutive lines as demonstrated below. Otherwise, the predict command dumps results before training finishes.

//you should not run the prediction in this way
const history = model.fit(xtrain, ytrain, {epochs: 200})
console.log("fit is over")
model.predict(xtrain).print();

Prediction should be chained onto the promise returned by fit, as illustrated below.

const history = model.fit(xtrain, ytrain, {epochs: 200})
.then(() => {
	console.log("fit is over")
	//model.predict(tf.tensor2d([[0, 0], [0, 1], [1, 0], [1, 1]])).print();
	model.predict(xtrain).print();
});

Coding is over for the client-side solution. Now, you can open the hello.html file in a browser. Do not be surprised when you see a blank page. You can see the final predictions by pressing F12 in Chrome, or by going to Settings (3 dots at the top right) > More tools > Developer tools > Console tab.

So, we can successfully run machine learning in the browser as shown above. But TensorFlow.js goes well beyond ML in the browser. Let’s see how.

Node.js recently enabled server-side capabilities for JavaScript, and we can run (almost) the same code on a Node.js server. In this case, you have to install Node.js on your computer. I installed the currently recommended version, 8.11.4. You can run the node command in the command prompt after installation.

You should run the following command if you are running Node.js for the first time. This creates a package.json file in the current directory. Otherwise, the TensorFlow.js installation would not complete successfully. BTW, I ran the command on my desktop.

npm init

You can install the TensorFlow.js package after initialization. Note that it is a double dash before save; it may render as a single dash in the browser.

npm install @tensorflow/tfjs --save

That’s it! Your environment is ready. Please create a hello.js file. The content of the file will look like this.

var tf = require('@tensorflow/tfjs');

const model = tf.sequential();
model.add(tf.layers.dense({units: 5, activation: 'sigmoid', inputShape: [2]}));
model.add(tf.layers.dense({units: 2, activation: 'softmax'}));
model.compile({loss: 'categoricalCrossentropy', optimizer: tf.train.adam(0.1)});

const xtrain = tf.tensor2d([[0, 0], [0, 1], [1, 0], [1, 1]]);
const ytrain = tf.tensor2d([[1, 0], [0, 1], [0, 1], [1, 0]]);

const history = model.fit(xtrain, ytrain, {epochs: 200})
.then(() => {
	console.log("fit is over")
	//model.predict(tf.tensor2d([[0, 0], [0, 1], [1, 0], [1, 1]])).print();
	model.predict(xtrain).print();
});

As seen, we’ve run the same code. The model has learnt the XOR logic gate successfully.

So, we have covered the JavaScript version of TensorFlow in this post. TensorFlow is not just a tool for research. For instance, Facebook developed both the PyTorch and Caffe2 frameworks for deep learning, but it uses PyTorch for research purposes and Caffe2 for production. On the other hand, Google enabled TensorFlow for both research and production. It seems that we will see TensorFlow.js much more commonly in the coming days.


The post 10 Interview Questions Asked in Machine Learning appeared first on Sefik Ilkin Serengil.

Rewarding branches based on profits might not be fair, because some branches have higher profits and some have more customers; this rewards the lucky ones. You might apply **unsupervised learning** and create clusters based on profitability, turnover, transaction volumes, customer counts or region. It is like customer segmentation. Then, you should evaluate each branch based on where it stands within its cluster. In this way, each branch competes against same-weight competitors. Otherwise, it would be like putting a lightweight boxer in front of a heavyweight one. In fact, there can be several champions across different weight classes.

This is a rare event detection problem. Classifiers expect balanced data during training to produce satisfactory results, but we cannot always have balanced data. Firstly, you can feed fewer randomly selected legal instances to decrease the number of non-fraud transactions. This is called **sub sampling**. But this loses important data, so we would not often prefer it. Secondly, we can increase the number of fraud transactions by creating synthetic fraud data. For example, you can pick two random existing fraud instances, calculate the average of their transaction amounts, and assign that average to a new instance. This is called **over sampling**. This approach might be preferable to sub sampling for the fraud case, but it is still dangerous because it feeds non-existing data to the model. It is like having imaginary friends!
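As a minimal sketch of this over sampling idea (pure Python; the transaction amounts and the synthesize_fraud helper are hypothetical, only the amount feature is shown):

```python
import random

# hypothetical fraud transaction amounts (a real feature set would be richer)
fraud_amounts = [120.0, 340.0, 95.0, 410.0]

def synthesize_fraud(instances, n_new, seed=42):
    """Create n_new synthetic instances by averaging two random existing ones."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(instances, 2)  # pick two distinct fraud instances
        synthetic.append((a + b) / 2)    # the averaged amount becomes the new instance
    return synthetic

new_frauds = synthesize_fraud(fraud_amounts, n_new=3)
print(new_frauds)
```

Every synthetic amount lands between the minimum and maximum of the existing fraud amounts, which is exactly why such points are plausible but not real.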

We can ignore the fraud mark and treat the problem as **anomaly detection**. However, we should work on each customer’s transactions individually. Suppose that the transactions of a customer (e.g. named Sefik) have a normal distribution. The mean (µ) and standard deviation (σ) of the transaction amounts will enlighten us. We already know that three standard deviations around the mean (µ ± 3σ) cover 99.7% of the area. We can apply this logic to a customer’s transactions. For example, if a customer spends 100$ on average with a standard deviation of 10$, then 99.7% of expenses must be between 70$ and 130$. You can mark any transaction of that customer greater than 130$ as abnormal. That might not be fraud, but it is still abnormal. In this way, we can have an idea about unmarked transactions. BTW, you can widen the band: µ ± 6σ covers 99.9999998% of the area.

We thought about the problem using only the transaction amount. We can increase the dimensions by adding information such as time and location.
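A minimal sketch of the µ ± 3σ check might look like this (the amounts and the is_abnormal helper are made up for illustration):

```python
import statistics

# hypothetical past transaction amounts for one customer
amounts = [95, 105, 100, 110, 90, 100, 98, 102]

mu = statistics.mean(amounts)
sigma = statistics.pstdev(amounts)  # population standard deviation

def is_abnormal(amount, mu, sigma, k=3):
    """Flag a transaction outside mu ± k*sigma as abnormal."""
    return abs(amount - mu) > k * sigma

print(is_abnormal(100, mu, sigma))  # a typical amount
print(is_abnormal(500, mu, sigma))  # far beyond mu + 3*sigma
```

Raising k from 3 to 6 widens the band, trading fewer false alarms for fewer catches.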

Some machine learning models such as neural networks or support vector machines are opaque: their decisions cannot be read and understood by a human, because everything is handled in a black box. On the other hand, a **decision tree** algorithm produces transparent decisions that a human can read and understand clearly. In other words, you can follow the steps that led to a decision. For example, look at the following decision tree. Your decision would be to accept the offer because the company offers free coffee, the commute does not last more than 1 hour, and the salary is greater than 50K.

That’s why you should build a decision tree for credit decisioning. Herein, the most common decision tree algorithms for classification are ID3, C4.5 and CART; additionally, CART can be adapted to regression problems.
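The transparency can be sketched in code: the decision path of the job-offer tree above reads as plain nested conditions (the accept_offer function and its thresholds just mirror the illustration; they are not from any library):

```python
def accept_offer(free_coffee, commute_hours, salary):
    # each condition below corresponds to one readable split in the example tree
    if not free_coffee:
        return False
    if commute_hours > 1:
        return False
    if salary <= 50_000:
        return False
    return True

print(accept_offer(True, 0.5, 60_000))  # all splits satisfied
print(accept_offer(True, 2.0, 60_000))  # commute split fails
```

A credit officer (or a regulator) can audit every branch, which is impossible with an opaque model.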

You might either be solving a trivial problem, like how many legs a cow has, or you have **overfitted**; most probably the latter. Even the most advanced AI models and intelligent life forms fail. You should never expect 100% accuracy. Just as senior developers do not expect a new program to work bug-free the first time (only junior developers are made happy by that), machine learning practitioners should never expect to get 100%. If you still believe you can solve a problem with 100% accuracy, then it is automation: you can create a rule-based model, and there is no need for AI.

Remember the fraud detection data set. Suppose that there are 1M legal transactions and 100 fraud transactions. This means that 99.99% of the data set is legal whereas 0.01% is fraud. In this case, you get 99.99% accuracy if you return not-fraud by default. Is this a success? Of course not! Here, the important thing is how many of the really fraudulent instances you classify correctly. The confusion matrix and ROC curve become important instead of overall accuracy. If the true positive and true negative rates are both close to 100%, that would be a good job.

Besides, if your problem is based on human health, then 99.99% accuracy means that you could cause the death of 1 person in every 10,000. So, metrics have different meanings depending on the problem.

Funny, but it includes regression, classification and clustering. Predicting the weather temperature in Fahrenheit or Celsius degrees is regression, because continuous outputs are produced. Moreover, classifying the weather as partly sunny, raining or snowing is classification, because there is a limited number of classes. Finally, it includes unsupervised learning: clustering cities or states based on geographic location.

If you run a decision tree algorithm, it tends to overfit on large scale data sets. A basic remedy is to apply random forest. It separates the data set into several sub data sets (their count is mostly a prime number), then builds a different decision tree for each sub data set. The decisions of these trees jointly specify the global decision. Moreover, you can apply pruning to avoid overfitting.

On the other hand, neural networks are based on updating weights over epochs. You should monitor the training set and validation set error over the epochs. Training set error will decrease over iterations; if validation set error starts to increase at some epoch, you should stop training there. Moreover, if you have created a really complex neural network model (input features, number of hidden layers and nodes), you might re-design a less complex one.
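The monitoring idea can be sketched with a synthetic validation-error curve (the numbers and the early_stop_epoch helper are made up for illustration): stop as soon as validation error turns upward.

```python
# synthetic validation errors per epoch: falling at first,
# turning upward after epoch 4 (overfitting starts)
val_errors = [0.90, 0.60, 0.45, 0.38, 0.35, 0.37, 0.41, 0.48]

def early_stop_epoch(val_errors):
    """Return the first epoch whose validation error exceeds
    the previous epoch's error; otherwise train all epochs."""
    for epoch in range(1, len(val_errors)):
        if val_errors[epoch] > val_errors[epoch - 1]:
            return epoch
    return len(val_errors)

print(early_stop_epoch(val_errors))
```

In practice a patience window is usually added so a single noisy epoch does not end training.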

This question might seem very easy but it is a tricky one. Traditional developers tend to design this kind of system with for loops.

import numpy as np

inputs = np.array([1, 0, 1])
weights = np.array([0.3, 0.8, 0.4])

total = 0
for i in range(inputs.shape[0]):
	total = total + inputs[i] * weights[i]
print(total)

However, machine learning practitioners must not use this approach; they should apply matrix multiplication instead, because the vectorized solution speeds processing up by almost 150 times.

import numpy as np

inputs = np.array([1, 0, 1])
weights = np.array([0.3, 0.8, 0.4])

total = np.matmul(np.transpose(weights), inputs)
print(total)

Your data set can have thousands of features, and feeding all of them produces a much more complex model: training lasts longer and tends to overfit. Dropping some features reduces complexity and speeds up training, but we might lose significant information. Autoencoders are a typical way to represent the data with fewer dimensions. You zip the data (lossy), which gives you a less complex model and faster training, and, unlike dropping features, you do not discard any feature entirely.

Besides, face recognition technology and art style transfer techniques are mainly based on dimension reduction and auto-encoders.
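As a rough illustration of the idea (not a full Keras autoencoder; a toy linear autoencoder with made-up sizes, trained by plain gradient descent on synthetic data):

```python
import numpy as np

# compress 4-dimensional samples into a 2-dimensional code, then reconstruct
rng = np.random.default_rng(0)
x = rng.normal(size=(50, 4))                 # 50 samples, 4 features

w_enc = rng.normal(scale=0.1, size=(4, 2))   # encoder: 4 -> 2
w_dec = rng.normal(scale=0.1, size=(2, 4))   # decoder: 2 -> 4

def reconstruction_error(x, w_enc, w_dec):
    recon = x @ w_enc @ w_dec                # encode then decode
    return float(np.mean((recon - x) ** 2))

initial_error = reconstruction_error(x, w_enc, w_dec)
lr = 0.1
for _ in range(500):                         # plain gradient descent on MSE
    code = x @ w_enc
    recon = code @ w_dec
    grad_recon = 2 * (recon - x) / x.size    # d(mse)/d(recon)
    w_dec -= lr * code.T @ grad_recon
    w_enc -= lr * x.T @ (grad_recon @ w_dec.T)

final_error = reconstruction_error(x, w_enc, w_dec)
print(initial_error, final_error)            # error shrinks; the 2-d code is the lossy zip
```

The 2-dimensional code plays the role of the compressed representation: some reconstruction error remains (it is lossy), but no input feature is discarded outright.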

So, I collected some job interview questions asked of data scientists and machine learning practitioners, and I tried to respond. The responses reflect my personal opinions; you might find some answers true or partially false. These questions are asked to test a candidate’s solution approach. In other words, the solution approach is more important than the pure answer.


The post Face Recognition with FaceNet in Keras appeared first on Sefik Ilkin Serengil.

We will apply transfer learning to benefit from previous research. David Sandberg shared pre-trained weights after 30 hours of GPU training. However, that work was in raw TensorFlow. **Your friendly neighborhood blogger** converted the pre-trained weights into Keras format. I put the weights on Google Drive because they exceed GitHub’s upload size limit. You can find the pre-trained weights here. Also, FaceNet has a very complex model structure; you can find it here in json format.

We can create the FaceNet model as illustrated below.

from keras.models import model_from_json

#facenet model structure: https://github.com/serengil/tensorflow-101/blob/master/model/facenet_model.json
model = model_from_json(open("facenet_model.json", "r").read())

#pre-trained weights https://drive.google.com/file/d/1971Xk5RwedbudGgTIrGAL4F7Aifu7id1/view?usp=sharing
model.load_weights('facenet_weights.h5')

model.summary()

The FaceNet model expects 160×160 RGB images and produces 128-dimensional representations. These auto-encoded representations are called embeddings in the research paper. Additionally, the researchers put an extra l2 normalization layer at the end of the network. Remember what l2 normalization is.

l2 = √(∑ x_{i}^{2}), summing i from 1 to n for an n-dimensional vector x

They also constrained the 128-dimensional output embedding to live on the 128-dimensional hypersphere. This means that each output embedding should be divided element-wise by its l2 norm.

import numpy as np

def l2_normalize(x):
	return x / np.sqrt(np.sum(np.multiply(x, x)))

The researchers also mentioned that they used euclidean distance instead of cosine similarity to measure the similarity of two vectors. Euclidean distance is simply the distance between two vectors in euclidean space.

def findEuclideanDistance(source_representation, test_representation):
	euclidean_distance = source_representation - test_representation
	euclidean_distance = np.sum(np.multiply(euclidean_distance, euclidean_distance))
	euclidean_distance = np.sqrt(euclidean_distance)
	return euclidean_distance

Finally, we can find the distance between two different images via FaceNet.

img1_representation = l2_normalize(model.predict(preprocess_image('img1.jpg'))[0,:])
img2_representation = l2_normalize(model.predict(preprocess_image('img2.jpg'))[0,:])

euclidean_distance = findEuclideanDistance(img1_representation, img2_representation)

The distance should be small for images of the same person and large for pictures of different people. The threshold is set to 0.20 in the research paper, but I got successful results when it is set to 0.35.

threshold = 0.35
if euclidean_distance < threshold:
	print("verified... they are same person")
else:
	print("unverified! they are not same person!")

Still, we can check the cosine similarity between the two vectors. In this case, I got the most successful results when I set the threshold to 0.07. Notice that l2 normalization is skipped for this metric.

def findCosineSimilarity(source_representation, test_representation):
	a = np.matmul(np.transpose(source_representation), test_representation)
	b = np.sum(np.multiply(source_representation, source_representation))
	c = np.sum(np.multiply(test_representation, test_representation))
	return 1 - (a / (np.sqrt(b) * np.sqrt(c)))

img1_representation = model.predict(preprocess_image('img1.jpg'))[0,:]
img2_representation = model.predict(preprocess_image('img2.jpg'))[0,:]

cosine_similarity = findCosineSimilarity(img1_representation, img2_representation)
print("cosine similarity: ", cosine_similarity)

threshold = 0.07
if cosine_similarity < threshold:
	print("verified... they are same person")
else:
	print("unverified! they are not same person!")

Well, we designed the model. The important thing is how successful the designed model is. I tested FaceNet with the same instances used in VGG-Face testing.

It succeeded when I tested the model for really different Angelina Jolie images.

Similarly, FaceNet succeeded when tested on different photos of Jennifer Aniston.

We can process true negative cases successfully.

So, we’ve implemented Google’s face recognition model on-premise in this post. We combined autoencoder representations, transfer learning and vector similarity concepts to build FaceNet. The original paper includes face alignment steps, but we skipped them here; instead, I fed already-aligned images as inputs. Moreover, FaceNet has a much more complex model structure than VGG-Face, which might cause it to produce slower results in real time. Still, VGG-Face produced more successful results than FaceNet in my experiments. Finally, I pushed the code of this post to GitHub.


The post Hyperbolic Secant As Neural Networks Activation Function appeared first on Sefik Ilkin Serengil.

Some resources refer to the function as the reciprocal of the hyperbolic cosine, or 1/cosh. Remember the formula of hyperbolic cosine.

y = 1 / cosh(x) where cosh(x) = (e^{x} + e^{-x})/2

So, the pure form of the function is formulated below.

y = 2 / (e^{x} + e^{-x})

The function produces outputs in the range (0, 1]. The output decreases and approaches 0 as x goes to infinity; however, it never actually produces 0, even for very large inputs, except at ±∞.

The hyperbolic secant formula contributes to the feed-forward step in neural networks, whereas its derivative is involved in back propagation.

y = 2.(e^{x} + e^{-x})^{-1}

dy/dx = 2.(-1).(e^{x} + e^{-x})^{-2}.[d(e^{x} + e^{-x})/dx]

dy/dx = 2.(-1).(e^{x} + e^{-x})^{-2}.(e^{x} – e^{-x})

dy/dx = 2.(-e^{x} + e^{-x})/(e^{x} + e^{-x})^{2}

Or we can rearrange the derivative into a simpler form. Adding and subtracting e^{x} in the numerator does not change the result.

dy/dx = 2.(e^{x} – e^{x} – e^{x} + e^{-x})/(e^{x} + e^{-x})^{2} = 2.(e^{x} + e^{-x} – e^{x} – e^{x})/(e^{x} + e^{-x})^{2}

dy/dx = 2.(e^{x} + e^{-x})/[(e^{x} + e^{-x}).(e^{x} + e^{-x})] – 2.(e^{x} + e^{x})/(e^{x} + e^{-x})^{2}

dy/dx = 2/(e^{x} + e^{-x}) – (2.2e^{x})/[(e^{x} + e^{-x}).(e^{x} + e^{-x})]

dy/dx = 2/(e^{x} + e^{-x}) – 2.[2/(e^{x} + e^{-x})].[e^{x}/(e^{x} + e^{-x})]

You might notice that the term above contains the hyperbolic secant function itself. Put y in place of 2/(e^{x} + e^{-x}).

dy/dx = y – 2y.[e^{x}/(e^{x} + e^{-x})]

Notice that both the function and its derivative have a high computation cost.
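As a quick numeric sanity check of the simplified derivative (numpy; the function names are mine), we can compare it against a central finite difference:

```python
import numpy as np

def sech(x):
    # hyperbolic secant: 2 / (e^x + e^-x)
    return 2.0 / (np.exp(x) + np.exp(-x))

def sech_derivative(x):
    # simplified form derived above: dy/dx = y - 2*y * e^x / (e^x + e^-x)
    y = sech(x)
    return y - 2.0 * y * np.exp(x) / (np.exp(x) + np.exp(-x))

x = 0.75
h = 1e-6
numeric = (sech(x + h) - sech(x - h)) / (2 * h)  # central finite difference
print(sech_derivative(x), numeric)               # the two values agree closely
```

The close agreement confirms the rearranged derivative, while the two exponentials per call illustrate the computation cost mentioned above.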


The post A Step By Step Regression Tree Example appeared first on Sefik Ilkin Serengil.

The following data set might look familiar. We used a similar data set in previous experiments, but that one denoted a golf playing decision based on some factors; in other words, the target was nominal, consisting of true or false values. Herein, the target column is the number of golf players and it stores real numbers. When the target was nominal, we counted the number of instances for each class; I mean that we could create branches based on the number of instances for true and false decisions. Here, we cannot count the target values because the target is continuous. Instead of counting, we can handle regression problems by switching the metric to standard deviation.

Day | Outlook | Temp. | Humidity | Wind | Golf Players |
1 | Sunny | Hot | High | Weak | 25 |

2 | Sunny | Hot | High | Strong | 30 |

3 | Overcast | Hot | High | Weak | 46 |

4 | Rain | Mild | High | Weak | 45 |

5 | Rain | Cool | Normal | Weak | 52 |

6 | Rain | Cool | Normal | Strong | 23 |

7 | Overcast | Cool | Normal | Strong | 43 |

8 | Sunny | Mild | High | Weak | 35 |

9 | Sunny | Cool | Normal | Weak | 38 |

10 | Rain | Mild | Normal | Weak | 46 |

11 | Sunny | Mild | Normal | Strong | 48 |

12 | Overcast | Mild | High | Strong | 52 |

13 | Overcast | Hot | Normal | Weak | 44 |

14 | Rain | Mild | High | Strong | 30 |

Golf players = {25, 30, 46, 45, 52, 23, 43, 35, 38, 46, 48, 52, 44, 30}

Average of golf players = (25 + 30 + 46 + 45 + 52 + 23 + 43 + 35 + 38 + 46 + 48 + 52 + 44 + 30)/14 = 39.78

Standard deviation of golf players = √[( (25 – 39.78)^{2} + (30 – 39.78)^{2} + (46 – 39.78)^{2} + … + (30 – 39.78)^{2} )/14] = 9.32

Outlook can be sunny, overcast and rain. We need to calculate standard deviation of golf players for all of these outlook candidates.

Day | Outlook | Temp. | Humidity | Wind | Golf Players |

1 | Sunny | Hot | High | Weak | 25 |

2 | Sunny | Hot | High | Strong | 30 |

8 | Sunny | Mild | High | Weak | 35 |

9 | Sunny | Cool | Normal | Weak | 38 |

11 | Sunny | Mild | Normal | Strong | 48 |

Golf players for sunny outlook = {25, 30, 35, 38, 48}

Average of golf players for sunny outlook = (25+30+35+38+48)/5 = 35.2

Standard deviation of golf players for sunny outlook = √(((25 – 35.2)^{2} + (30 – 35.2)^{2} + … )/5) = 7.78

Day | Outlook | Temp. | Humidity | Wind | Golf Players |

3 | Overcast | Hot | High | Weak | 46 |

7 | Overcast | Cool | Normal | Strong | 43 |

12 | Overcast | Mild | High | Strong | 52 |

13 | Overcast | Hot | Normal | Weak | 44 |

Golf players for overcast outlook = {46, 43, 52, 44}

Average of golf players for overcast outlook = (46 + 43 + 52 + 44)/4 = 46.25

Standard deviation of golf players for overcast outlook = √(((46 – 46.25)^{2} + (43 – 46.25)^{2} + …)/4) = 3.49

Day | Outlook | Temp. | Humidity | Wind | Golf Players |

4 | Rain | Mild | High | Weak | 45 |

5 | Rain | Cool | Normal | Weak | 52 |

6 | Rain | Cool | Normal | Strong | 23 |

10 | Rain | Mild | Normal | Weak | 46 |

14 | Rain | Mild | High | Strong | 30 |

Golf players for rain outlook = {45, 52, 23, 46, 30}

Average of golf players for rain outlook = (45 + 52 + 23 + 46 + 30)/5 = 39.2

Standard deviation of golf players for rain outlook = √(((45 – 39.2)^{2} + (52 – 39.2)^{2} + …)/5) = 10.87

Outlook | Stdev of Golf Players | Instances |

Overcast | 3.49 | 4 |

Rain | 10.87 | 5 |

Sunny | 7.78 | 5 |

Weighted standard deviation for outlook = (4/14)x3.49 + (5/14)x10.87 + (5/14)x7.78 = 7.66

You might remember that we calculated the global standard deviation of golf players as 9.32 in a previous step. Standard deviation reduction is the difference between the global standard deviation and the weighted standard deviation for the current feature. The feature maximizing the standard deviation reduction will be the decision node.

Standard deviation reduction for outlook = 9.32 – 7.66 = 1.66
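The hand calculation above can be reproduced in a few lines of numpy (variable names are mine; numpy's std() defaults to the population standard deviation used throughout this example):

```python
import numpy as np

players = np.array([25, 30, 46, 45, 52, 23, 43, 35, 38, 46, 48, 52, 44, 30])
outlook = np.array(["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast",
                    "Sunny", "Sunny", "Rain", "Sunny", "Overcast", "Overcast", "Rain"])

global_std = players.std()  # global (population) standard deviation

# weighted standard deviation over the outlook branches
weighted = sum(
    (mask.sum() / len(players)) * players[mask].std()
    for value in np.unique(outlook)
    for mask in [outlook == value]
)

reduction = global_std - weighted
print(round(global_std, 2), round(weighted, 2), round(reduction, 2))
```

Swapping the outlook column for temperature, humidity or wind reproduces the other reductions the same way.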

Temperature can be hot, cool or mild. We will calculate standard deviations for those candidates.

Day | Outlook | Temp. | Humidity | Wind | Golf Players |

1 | Sunny | Hot | High | Weak | 25 |

2 | Sunny | Hot | High | Strong | 30 |

3 | Overcast | Hot | High | Weak | 46 |

13 | Overcast | Hot | Normal | Weak | 44 |

Golf players for hot temperature = {25, 30, 46, 44}

Standard deviation of golf players for hot temperature = 8.95

Day | Outlook | Temp. | Humidity | Wind | Golf Players |

5 | Rain | Cool | Normal | Weak | 52 |

6 | Rain | Cool | Normal | Strong | 23 |

7 | Overcast | Cool | Normal | Strong | 43 |

9 | Sunny | Cool | Normal | Weak | 38 |

Golf players for cool temperature = {52, 23, 43, 38}

Standard deviation of golf players for cool temperature = 10.51

Day | Outlook | Temp. | Humidity | Wind | Golf Players |

4 | Rain | Mild | High | Weak | 45 |

8 | Sunny | Mild | High | Weak | 35 |

10 | Rain | Mild | Normal | Weak | 46 |

11 | Sunny | Mild | Normal | Strong | 48 |

12 | Overcast | Mild | High | Strong | 52 |

14 | Rain | Mild | High | Strong | 30 |

Golf players for mild temperature = {45, 35, 46, 48, 52, 30}

Standard deviation of golf players for mild temperature = 7.65

Temperature | Stdev of Golf Players | Instances |

Hot | 8.95 | 4 |

Cool | 10.51 | 4 |

Mild | 7.65 | 6 |

Weighted standard deviation for temperature = (4/14)x8.95 + (4/14)x10.51 + (6/14)x7.65 = 8.84

Standard deviation reduction for temperature = 9.32 – 8.84 = 0.48

Humidity is a binary class. It can either be normal or high.

Day | Outlook | Temp. | Humidity | Wind | Golf Players |

1 | Sunny | Hot | High | Weak | 25 |

2 | Sunny | Hot | High | Strong | 30 |

3 | Overcast | Hot | High | Weak | 46 |

4 | Rain | Mild | High | Weak | 45 |

8 | Sunny | Mild | High | Weak | 35 |

12 | Overcast | Mild | High | Strong | 52 |

14 | Rain | Mild | High | Strong | 30 |

Golf players for high humidity = {25, 30, 46, 45, 35, 52, 30}

Standard deviation for golf players for high humidity = 9.36

Day | Outlook | Temp. | Humidity | Wind | Golf Players |

5 | Rain | Cool | Normal | Weak | 52 |

6 | Rain | Cool | Normal | Strong | 23 |

7 | Overcast | Cool | Normal | Strong | 43 |

9 | Sunny | Cool | Normal | Weak | 38 |

10 | Rain | Mild | Normal | Weak | 46 |

11 | Sunny | Mild | Normal | Strong | 48 |

13 | Overcast | Hot | Normal | Weak | 44 |

Golf players for normal humidity = {52, 23, 43, 38, 46, 48, 44}

Standard deviation for golf players for normal humidity = 8.73

Humidity | Stdev of Golf Player | Instances |

High | 9.36 | 7 |

Normal | 8.73 | 7 |

Weighted standard deviation for humidity = (7/14)x9.36 + (7/14)x8.73 = 9.04

Standard deviation reduction for humidity = 9.32 – 9.04 = 0.27

Wind is a binary class, too. It can either be Strong or Weak.

Day | Outlook | Temp. | Humidity | Wind | Golf Players |

2 | Sunny | Hot | High | Strong | 30 |

6 | Rain | Cool | Normal | Strong | 23 |

7 | Overcast | Cool | Normal | Strong | 43 |

11 | Sunny | Mild | Normal | Strong | 48 |

12 | Overcast | Mild | High | Strong | 52 |

14 | Rain | Mild | High | Strong | 30 |

Golf players for strong wind = {30, 23, 43, 48, 52, 30}

Standard deviation for golf players for strong wind = 10.59

Day | Outlook | Temp. | Humidity | Wind | Golf Players |

1 | Sunny | Hot | High | Weak | 25 |

3 | Overcast | Hot | High | Weak | 46 |

4 | Rain | Mild | High | Weak | 45 |

5 | Rain | Cool | Normal | Weak | 52 |

8 | Sunny | Mild | High | Weak | 35 |

9 | Sunny | Cool | Normal | Weak | 38 |

10 | Rain | Mild | Normal | Weak | 46 |

13 | Overcast | Hot | Normal | Weak | 44 |

Golf players for weak wind = {25, 46, 45, 52, 35, 38, 46, 44}

Standard deviation for golf players for weak wind = 7.87

| Wind | Stdev of Golf Players | Instances |
| --- | --- | --- |
| Strong | 10.59 | 6 |
| Weak | 7.87 | 8 |

Weighted standard deviation for wind = (6/14)x10.59 + (8/14)x7.87 = 9.03

Standard deviation reduction for wind = 9.32 – 9.03 = 0.29
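These hand calculations are easy to verify programmatically. Here is a minimal sketch of mine (not code from the original post) that recomputes the weighted standard deviation and the reduction for the wind feature with Python's statistics module. Note that rounding at the intermediate steps is why the hand calculation reads 0.29 while the unrounded result is closer to 0.28 — the winner does not change.

```python
import statistics

# golf players grouped by the wind feature (values from the tables above)
strong = [30, 23, 43, 48, 52, 30]
weak = [25, 46, 45, 52, 35, 38, 46, 44]

def weighted_stdev(groups):
    total = sum(len(g) for g in groups)
    # population stdev of each branch, weighted by its share of instances
    return sum(len(g) / total * statistics.pstdev(g) for g in groups)

global_stdev = statistics.pstdev(strong + weak)  # stdev over all 14 instances, ~9.32
reduction = global_stdev - weighted_stdev([strong, weak])
print(round(reduction, 2))  # 0.28
```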

So, we’ve calculated standard deviation reduction values for all features. The winner is outlook because it has the highest score.

| Feature | Standard Deviation Reduction |
| --- | --- |
| Outlook | 1.66 |
| Temperature | 0.47 |
| Humidity | 0.27 |
| Wind | 0.29 |

We’ll put outlook decision at the top of decision tree. Let’s monitor the new sub data sets for the candidate branches of outlook feature.

| Day | Outlook | Temp. | Humidity | Wind | Golf Players |
| --- | --- | --- | --- | --- | --- |
| 1 | Sunny | Hot | High | Weak | 25 |
| 2 | Sunny | Hot | High | Strong | 30 |
| 8 | Sunny | Mild | High | Weak | 35 |
| 9 | Sunny | Cool | Normal | Weak | 38 |
| 11 | Sunny | Mild | Normal | Strong | 48 |

Golf players for sunny outlook = {25, 30, 35, 38, 48}

Standard deviation for sunny outlook = 7.78

Notice that we will use this standard deviation value as global standard deviation for this sub data set.

| Day | Outlook | Temp. | Humidity | Wind | Golf Players |
| --- | --- | --- | --- | --- | --- |
| 1 | Sunny | Hot | High | Weak | 25 |
| 2 | Sunny | Hot | High | Strong | 30 |

Standard deviation for sunny outlook and hot temperature = 2.5

| Day | Outlook | Temp. | Humidity | Wind | Golf Players |
| --- | --- | --- | --- | --- | --- |
| 9 | Sunny | Cool | Normal | Weak | 38 |

Standard deviation for sunny outlook and cool temperature = 0

| Day | Outlook | Temp. | Humidity | Wind | Golf Players |
| --- | --- | --- | --- | --- | --- |
| 8 | Sunny | Mild | High | Weak | 35 |
| 11 | Sunny | Mild | Normal | Strong | 48 |

Standard deviation for sunny outlook and mild temperature = 6.5

| Temperature | Stdev for Golf Players | Instances |
| --- | --- | --- |
| Hot | 2.5 | 2 |
| Cool | 0 | 1 |
| Mild | 6.5 | 2 |

Weighted standard deviation for sunny outlook and temperature = (2/5)x2.5 + (1/5)x0 + (2/5)x6.5 = 3.6

Standard deviation reduction for sunny outlook and temperature = 7.78 – 3.6 = 4.18

| Day | Outlook | Temp. | Humidity | Wind | Golf Players |
| --- | --- | --- | --- | --- | --- |
| 1 | Sunny | Hot | High | Weak | 25 |
| 2 | Sunny | Hot | High | Strong | 30 |
| 8 | Sunny | Mild | High | Weak | 35 |

Standard deviation for sunny outlook and high humidity = 4.08

| Day | Outlook | Temp. | Humidity | Wind | Golf Players |
| --- | --- | --- | --- | --- | --- |
| 9 | Sunny | Cool | Normal | Weak | 38 |
| 11 | Sunny | Mild | Normal | Strong | 48 |

Standard deviation for sunny outlook and normal humidity = 5

| Humidity | Stdev for Golf Players | Instances |
| --- | --- | --- |
| High | 4.08 | 3 |
| Normal | 5.00 | 2 |

Weighted standard deviations for sunny outlook and humidity = (3/5)x4.08 + (2/5)x5 = 4.45

Standard deviation reduction for sunny outlook and humidity = 7.78 – 4.45 = 3.33

| Day | Outlook | Temp. | Humidity | Wind | Golf Players |
| --- | --- | --- | --- | --- | --- |
| 2 | Sunny | Hot | High | Strong | 30 |
| 11 | Sunny | Mild | Normal | Strong | 48 |

Standard deviation for sunny outlook and strong wind = 9

| Day | Outlook | Temp. | Humidity | Wind | Golf Players |
| --- | --- | --- | --- | --- | --- |
| 1 | Sunny | Hot | High | Weak | 25 |
| 8 | Sunny | Mild | High | Weak | 35 |
| 9 | Sunny | Cool | Normal | Weak | 38 |

Standard deviation for sunny outlook and weak wind = 5.56

| Wind | Stdev for Golf Players | Instances |
| --- | --- | --- |
| Strong | 9 | 2 |
| Weak | 5.56 | 3 |

Weighted standard deviations for sunny outlook and wind = (2/5)x9 + (3/5)x5.56 = 6.93

Standard deviation reduction for sunny outlook and wind = 7.78 – 6.93 = 0.85

We’ve calculated standard deviation reductions for sunny outlook. The winner is temperature.

| Feature | Standard Deviation Reduction |
| --- | --- |
| Temperature | 4.18 |
| Humidity | 3.33 |
| Wind | 0.85 |

The cool branch has only one instance in its sub data set. We can say that if the outlook is sunny and the temperature is cool, then there would be 38 golf players. But what about the hot branch? There are still 2 instances. Should we add another branch for weak wind and strong wind? No, we should not, because this causes over-fitting. We should stop growing branches when, for example, there are fewer than five instances in the sub data set, or when the standard deviation of the sub data set drops below 5% of that of the entire data set. I prefer the first rule: I will terminate a branch if its sub data set has fewer than 5 instances. When this termination condition is satisfied, I will use the average of the sub data set as the leaf value. This operation is called pruning in decision trees.
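This termination rule can be sketched in a few lines (the function name and the threshold constant are my own, assuming the 5-instance rule above): when a sub data set shrinks below the threshold, the branch becomes a leaf whose prediction is the average of the remaining instances.

```python
import statistics

MIN_INSTANCES = 5  # termination threshold discussed above

def build_branch(golf_players):
    """Return a leaf value when the pruning rule fires, else keep splitting."""
    if len(golf_players) < MIN_INSTANCES:
        return statistics.mean(golf_players)  # leaf: average of the sub data set
    return "split further"  # placeholder for the recursive splitting step

# the sunny-and-hot sub data set above has 2 instances -> it becomes a leaf
print(build_branch([25, 30]))  # 27.5
```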

The overcast outlook branch already has 4 instances in its sub data set, so we can stop growing branches for this leaf. The final decision will be the average of the following table for overcast outlook.

| Day | Outlook | Temp. | Humidity | Wind | Golf Players |
| --- | --- | --- | --- | --- | --- |
| 3 | Overcast | Hot | High | Weak | 46 |
| 7 | Overcast | Cool | Normal | Strong | 43 |
| 12 | Overcast | Mild | High | Strong | 52 |
| 13 | Overcast | Hot | Normal | Weak | 44 |

If outlook is overcast, then there would be (46+43+52+44)/4 = 46.25 golf players.

| Day | Outlook | Temp. | Humidity | Wind | Golf Players |
| --- | --- | --- | --- | --- | --- |
| 4 | Rain | Mild | High | Weak | 45 |
| 5 | Rain | Cool | Normal | Weak | 52 |
| 6 | Rain | Cool | Normal | Strong | 23 |
| 10 | Rain | Mild | Normal | Weak | 46 |
| 14 | Rain | Mild | High | Strong | 30 |

We need to find the standard deviation reduction values for the rest of the features in the same way, using the sub data set above.

Standard deviation for rainy outlook = 10.87

Notice that we will use this value as global standard deviation for this branch in reduction step.

| Temperature | Stdev for Golf Players | Instances |
| --- | --- | --- |
| Cool | 14.50 | 2 |
| Mild | 7.32 | 3 |

Weighted standard deviation for rainy outlook and temperature = (2/5)x14.50 + (3/5)x7.32 = 10.19

Standard deviation reduction for rainy outlook and temperature = 10.87 – 10.19 = 0.67

| Humidity | Stdev for Golf Players | Instances |
| --- | --- | --- |
| High | 7.50 | 2 |
| Normal | 12.50 | 3 |

Weighted standard deviation for rainy outlook and humidity = (2/5)x7.50 + (3/5)x12.50 = 10.50

Standard deviation reduction for rainy outlook and humidity = 10.87 – 10.50 = 0.37

| Wind | Stdev for Golf Players | Instances |
| --- | --- | --- |
| Weak | 3.09 | 3 |
| Strong | 3.5 | 2 |

Weighted standard deviation for rainy outlook and wind = (3/5)x3.09 + (2/5)x3.5 = 3.25

Standard deviation reduction for rainy outlook and wind = 10.87 – 3.25 = 7.62

As illustrated below, the winner is wind feature.

| Feature | Standard Deviation Reduction |
| --- | --- |
| Temperature | 0.67 |
| Humidity | 0.37 |
| Wind | 7.62 |

As seen, both branches have fewer than 5 instances. Now, we can terminate these leaves based on the termination rule.

So, the final form of the decision tree is demonstrated below.

So, we have mentioned how to build decision trees for regression problems. Even though decision trees are a powerful way to solve classification problems, they can be adapted to regression problems as shown. Note that regression trees tend to over-fit much more than classification trees, so the termination rule should be tuned carefully. Finally, the lecture notes of Dr. Saed Sayad (University of Toronto) guided me in creating this content.

The post A Step By Step Regression Tree Example appeared first on Sefik Ilkin Serengil.

]]>The post A Step by Step CART Decision Tree Example appeared first on Sefik Ilkin Serengil.

]]>We will work on the same dataset as in the ID3 post. There are 14 instances of golf playing decisions based on outlook, temperature, humidity and wind factors.

| Day | Outlook | Temp. | Humidity | Wind | Decision |
| --- | --- | --- | --- | --- | --- |
| 1 | Sunny | Hot | High | Weak | No |
| 2 | Sunny | Hot | High | Strong | No |
| 3 | Overcast | Hot | High | Weak | Yes |
| 4 | Rain | Mild | High | Weak | Yes |
| 5 | Rain | Cool | Normal | Weak | Yes |
| 6 | Rain | Cool | Normal | Strong | No |
| 7 | Overcast | Cool | Normal | Strong | Yes |
| 8 | Sunny | Mild | High | Weak | No |
| 9 | Sunny | Cool | Normal | Weak | Yes |
| 10 | Rain | Mild | Normal | Weak | Yes |
| 11 | Sunny | Mild | Normal | Strong | Yes |
| 12 | Overcast | Mild | High | Strong | Yes |
| 13 | Overcast | Hot | Normal | Weak | Yes |
| 14 | Rain | Mild | High | Strong | No |

Gini index is a metric for classification tasks in CART. It is found by subtracting the sum of the squared probability of each class from one. We can formulate it as illustrated below.

Gini = 1 – Σ (Pi)^{2} for i=1 to number of classes
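A direct translation of this formula into Python (a sketch of my own, not code from the post) makes the calculations below reproducible:

```python
def gini(counts):
    """Gini index of a branch, given the per-class instance counts."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

# e.g. a branch with 2 yes and 3 no decisions
print(round(gini([2, 3]), 2))  # 0.48
```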

Outlook is a nominal feature. It can be sunny, overcast or rain. I will summarize the final decisions for outlook feature.

| Outlook | Yes | No | Number of instances |
| --- | --- | --- | --- |
| Sunny | 2 | 3 | 5 |
| Overcast | 4 | 0 | 4 |
| Rain | 3 | 2 | 5 |

Gini(Outlook=Sunny) = 1 – (2/5)^{2} – (3/5)^{2} = 1 – 0.16 – 0.36 = 0.48

Gini(Outlook=Overcast) = 1 – (4/4)^{2} – (0/4)^{2} = 0

Gini(Outlook=Rain) = 1 – (3/5)^{2} – (2/5)^{2} = 1 – 0.36 – 0.16 = 0.48

Then, we will calculate weighted sum of gini indexes for outlook feature.

Gini(Outlook) = (5/14) x 0.48 + (4/14) x 0 + (5/14) x 0.48 = 0.171 + 0 + 0.171 = 0.342
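This weighted sum can be sketched in Python as follows (the helper names are my own). Keeping full precision gives 0.343, while the rounded intermediates in the hand calculation give 0.342 — close enough that the feature ranking is unaffected.

```python
def gini(counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def weighted_gini(branches):
    """Weighted sum of branch gini indexes; branches maps value -> [yes, no] counts."""
    total = sum(sum(counts) for counts in branches.values())
    return sum(sum(counts) / total * gini(counts) for counts in branches.values())

outlook = {"Sunny": [2, 3], "Overcast": [4, 0], "Rain": [3, 2]}
print(round(weighted_gini(outlook), 3))  # 0.343
```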

Similarly, temperature is a nominal feature and it could have 3 different values: Cool, Hot and Mild. Let’s summarize decisions for temperature feature.

| Temperature | Yes | No | Number of instances |
| --- | --- | --- | --- |
| Hot | 2 | 2 | 4 |
| Cool | 3 | 1 | 4 |
| Mild | 4 | 2 | 6 |

Gini(Temp=Hot) = 1 – (2/4)^{2} – (2/4)^{2} = 0.5

Gini(Temp=Cool) = 1 – (3/4)^{2} – (1/4)^{2} = 1 – 0.5625 – 0.0625 = 0.375

Gini(Temp=Mild) = 1 – (4/6)^{2} – (2/6)^{2} = 1 – 0.444 – 0.111 = 0.445

We’ll calculate weighted sum of gini index for temperature feature

Gini(Temp) = (4/14) x 0.5 + (4/14) x 0.375 + (6/14) x 0.445 = 0.142 + 0.107 + 0.190 = 0.439

Humidity is a binary class feature. It can be high or normal.

| Humidity | Yes | No | Number of instances |
| --- | --- | --- | --- |
| High | 3 | 4 | 7 |
| Normal | 6 | 1 | 7 |

Gini(Humidity=High) = 1 – (3/7)^{2} – (4/7)^{2} = 1 – 0.183 – 0.326 = 0.489

Gini(Humidity=Normal) = 1 – (6/7)^{2} – (1/7)^{2} = 1 – 0.734 – 0.02 = 0.244

Weighted sum for humidity feature will be calculated next

Gini(Humidity) = (7/14) x 0.489 + (7/14) x 0.244 = 0.367

Wind is a binary class similar to humidity. It can be weak and strong.

| Wind | Yes | No | Number of instances |
| --- | --- | --- | --- |
| Weak | 6 | 2 | 8 |
| Strong | 3 | 3 | 6 |

Gini(Wind=Weak) = 1 – (6/8)^{2} – (2/8)^{2} = 1 – 0.5625 – 0.062 = 0.375

Gini(Wind=Strong) = 1 – (3/6)^{2} – (3/6)^{2} = 1 – 0.25 – 0.25 = 0.5

Gini(Wind) = (8/14) x 0.375 + (6/14) x 0.5 = 0.428

We’ve calculated gini index values for each feature. The winner will be outlook feature because its cost is the lowest.

| Feature | Gini index |
| --- | --- |
| Outlook | 0.342 |
| Temperature | 0.439 |
| Humidity | 0.367 |
| Wind | 0.428 |

We’ll put outlook decision at the top of the tree.

You might realize that sub dataset in the overcast leaf has only yes decisions. This means that overcast leaf is over.

We will apply same principles to those sub datasets in the following steps.

Focus on the sub dataset for sunny outlook. We need to find the gini index scores for temperature, humidity and wind features respectively.

| Day | Outlook | Temp. | Humidity | Wind | Decision |
| --- | --- | --- | --- | --- | --- |
| 1 | Sunny | Hot | High | Weak | No |
| 2 | Sunny | Hot | High | Strong | No |
| 8 | Sunny | Mild | High | Weak | No |
| 9 | Sunny | Cool | Normal | Weak | Yes |
| 11 | Sunny | Mild | Normal | Strong | Yes |

| Temperature | Yes | No | Number of instances |
| --- | --- | --- | --- |
| Hot | 0 | 2 | 2 |
| Cool | 1 | 0 | 1 |
| Mild | 1 | 1 | 2 |

Gini(Outlook=Sunny and Temp.=Hot) = 1 – (0/2)^{2} – (2/2)^{2} = 0

Gini(Outlook=Sunny and Temp.=Cool) = 1 – (1/1)^{2} – (0/1)^{2} = 0

Gini(Outlook=Sunny and Temp.=Mild) = 1 – (1/2)^{2} – (1/2)^{2} = 1 – 0.25 – 0.25 = 0.5

Gini(Outlook=Sunny and Temp.) = (2/5)x0 + (1/5)x0 + (2/5)x0.5 = 0.2

| Humidity | Yes | No | Number of instances |
| --- | --- | --- | --- |
| High | 0 | 3 | 3 |
| Normal | 2 | 0 | 2 |

Gini(Outlook=Sunny and Humidity=High) = 1 – (0/3)^{2} – (3/3)^{2} = 0

Gini(Outlook=Sunny and Humidity=Normal) = 1 – (2/2)^{2} – (0/2)^{2} = 0

Gini(Outlook=Sunny and Humidity) = (3/5)x0 + (2/5)x0 = 0

| Wind | Yes | No | Number of instances |
| --- | --- | --- | --- |
| Weak | 1 | 2 | 3 |
| Strong | 1 | 1 | 2 |

Gini(Outlook=Sunny and Wind=Weak) = 1 – (1/3)^{2} – (2/3)^{2} = 1 – 0.111 – 0.444 = 0.444

Gini(Outlook=Sunny and Wind=Strong) = 1 – (1/2)^{2} – (1/2)^{2} = 0.5

Gini(Outlook=Sunny and Wind) = (3/5)x0.444 + (2/5)x0.5 = 0.466

We’ve calculated gini index scores for feature when outlook is sunny. The winner is humidity because it has the lowest value.

| Feature | Gini index |
| --- | --- |
| Temperature | 0.2 |
| Humidity | 0 |
| Wind | 0.466 |

We’ll put humidity check at the extension of sunny outlook.

As seen, decision is always no for high humidity and sunny outlook. On the other hand, decision will always be yes for normal humidity and sunny outlook. This branch is over.

Now, we need to focus on rain outlook.

| Day | Outlook | Temp. | Humidity | Wind | Decision |
| --- | --- | --- | --- | --- | --- |
| 4 | Rain | Mild | High | Weak | Yes |
| 5 | Rain | Cool | Normal | Weak | Yes |
| 6 | Rain | Cool | Normal | Strong | No |
| 10 | Rain | Mild | Normal | Weak | Yes |
| 14 | Rain | Mild | High | Strong | No |

We’ll calculate gini index scores for temperature, humidity and wind features when outlook is rain.

| Temperature | Yes | No | Number of instances |
| --- | --- | --- | --- |
| Cool | 1 | 1 | 2 |
| Mild | 2 | 1 | 3 |

Gini(Outlook=Rain and Temp.=Cool) = 1 – (1/2)^{2} – (1/2)^{2} = 0.5

Gini(Outlook=Rain and Temp.=Mild) = 1 – (2/3)^{2} – (1/3)^{2} = 0.444

Gini(Outlook=Rain and Temp.) = (2/5)x0.5 + (3/5)x0.444 = 0.466

| Humidity | Yes | No | Number of instances |
| --- | --- | --- | --- |
| High | 1 | 1 | 2 |
| Normal | 2 | 1 | 3 |

Gini(Outlook=Rain and Humidity=High) = 1 – (1/2)^{2} – (1/2)^{2} = 0.5

Gini(Outlook=Rain and Humidity=Normal) = 1 – (2/3)^{2} – (1/3)^{2} = 0.444

Gini(Outlook=Rain and Humidity) = (2/5)x0.5 + (3/5)x0.444 = 0.466

| Wind | Yes | No | Number of instances |
| --- | --- | --- | --- |
| Weak | 3 | 0 | 3 |
| Strong | 0 | 2 | 2 |

Gini(Outlook=Rain and Wind=Weak) = 1 – (3/3)^{2} – (0/3)^{2} = 0

Gini(Outlook=Rain and Wind=Strong) = 1 – (0/2)^{2} – (2/2)^{2} = 0

Gini(Outlook=Rain and Wind) = (3/5)x0 + (2/5)x0 = 0

The winner is wind feature for rain outlook because it has the minimum gini index score in features.

| Feature | Gini index |
| --- | --- |
| Temperature | 0.466 |
| Humidity | 0.466 |
| Wind | 0 |

Put the wind feature for rain outlook branch and monitor the new sub data sets.

As seen, decision is always yes when wind is weak. On the other hand, decision is always no if wind is strong. This means that this branch is over.

So, decision tree building is over. We have built a decision tree by hand. BTW, you might realize that we've created exactly the same tree as in the ID3 example. This does not mean that ID3 and CART always produce the same tree; we were just lucky. Finally, I believe that CART is easier than ID3 and C4.5, isn't it?

The post A Step by Step CART Decision Tree Example appeared first on Sefik Ilkin Serengil.

]]>The post Indeterminate Forms and L’Hospital’s Rule in Decision Trees appeared first on Sefik Ilkin Serengil.

]]>Decision tree algorithms such as ID3 and C4.5 use entropy and gain calculations for determining the most dominant feature. Typical entropy calculation is demonstrated below for n classes.

Entropy = – Σ (i=1 to n) p(class_{i}) . log_{2}p(class_{i}) = – p(class_{1}) . log_{2}p(class_{1}) – p(class_{2}) . log_{2}p(class_{2}) – … – p(class_{n}) . log_{2}p(class_{n})

For example, if decision class consists of 4 yes and 2 no instances, then there are 6 instances and binary classes. Entropy will be calculated as

Entropy(decision) = – p(no) . log_{2}p(no) – p(yes) . log_{2}p(yes) = – (2/6) . log_{2}(2/6) – (4/6) . log_{2}(4/6) = -0.333.log_{2}(0.333) – 0.667.log_{2}(0.667) = -0.333.(-1.585) – 0.667.(-0.585) = 0.918

What if number of instances for a class is equal to 0? Let’s say decision class consists of 6 yes, and 0 no examples.

Entropy(decision) = – p(no) . log_{2}p(no) – p(yes) . log_{2}p(yes) = – (0/6) . log_{2}(0/6) – (6/6) . log_{2}(6/6) = – 0 . log_{2}(0) – 1 . log_{2}(1)

Here, log_{2}(1) is equal to 0, but the problem is log_{2}(0) is equal to – ∞. Additionally, we need 0 times ∞ in this calculation.

Let’s ask this question to python

```python
import math

a = 0
b = math.log(0, 2)  # log of 0 to the base 2
print(a * b)
```

You will face a **ValueError: math domain error** if you try to run 0 times negative ∞ in Python — even computing log_{2}(0) alone fails. Similarly, Java produces **NaN** and Excel returns a **#NUM!** error.

As seen, this operation cannot be performed, can it? But we are suspicious ones. What if even high level programming languages do not know how to compute?

The term we have trouble with is x . log_{2}x for x equal to 0. We can rearrange it as the limit of x times log_{2}x as x goes to 0. Moving the x multiplier to the denominator as 1 over x would not change the result.

lim (x->0) x . log_{2}x = lim (x->0) log_{2}x / (1/x) = – ∞/∞

Yes, it is transformed to familiar indeterminate form of ∞/∞.

L’Hopital’s rule states that if the limits of f(x) and g(x) are both equal to 0 (or ∞) as x goes to some point c

Condition: lim_{(x->c)} f(x) = lim_{(x->c)} g(x) = 0 (or ∞)

then the limit of the ratio f over g is equal to the limit of the ratio of their derivatives.

lim_{(x->c)} f(x)/g(x) = lim_{(x->c)} f'(x)/g'(x)

Here, f(x) and g(x) must be differentiable at point c.

We have already transformed the x . log_{2}x term into the ∞/∞ indeterminate form. This means that we can apply L’Hopital’s rule.

lim_{(x->0)} x . log_{2}x = lim_{(x->0)} log_{2}x / (1/x) = lim_{(x->0)} (log_{2}x)’/(1/x)’ = lim_{(x->0)} (log_{2}x)’/(x^{-1})’

Notice that derivative of log_{2}x is 1/(x.ln(2))

(1/(x.ln(2))) / (-1 . x^{-2}) = [1 / (x.ln(2))] / [-1 / x^{2}] = – x^{2} / (x.ln(2)) = – x / ln(2)

This is L’Hospital applied version of lim (x->0) x . log_{2}x

lim_{(x->0)} x . log_{2}x = lim_{(x->0)} – x / ln(2) = –0/0.693 = 0
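In practice, entropy implementations encode exactly this limit by convention: a zero-probability class contributes 0 to the sum. A small sketch of my own (not from the post):

```python
from math import log2

def entropy(probabilities):
    # skip zero probabilities: by the limit above, 0 * log2(0) contributes 0
    return -sum(p * log2(p) for p in probabilities if p > 0)

print(round(entropy([4 / 6, 2 / 6]), 3))  # 0.918, matching the earlier example
print(entropy([1.0, 0.0]) == 0)           # True -- no math domain error
```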

Graph of x.log(x) is defined for [0, +∞) as illustrated below. Surprisingly, x = 0 is not undefined.

So, this case appears often when building entropy based decision trees. We can handle this trouble with calculus only. Even high level programming languages could not help to solve this case. To sum up, we can say that programming languages do not know calculus. They are designed to perform linear operations only.

You might rethink about takeover by some kind of evil AI or killer robots. They are not capable of applying a basic calculus. This is the basic answer why AI cannot takeover the human dominance on earth.

As an antithesis: *None of the best predators on the earth – orca, lion, white shark, Siberian tiger, king cobra – even knows how to count* – Alper Ozpinar. But do not forget that these species could not take over human dominance either. Heavy-handed force might not put you at the top of the food chain pyramid.

*PS: Thanks to Valentin Cold for informing me and raising awareness about this subject*

The post Indeterminate Forms and L’Hospital’s Rule in Decision Trees appeared first on Sefik Ilkin Serengil.

]]>The post Swish as Neural Networks Activation Function appeared first on Sefik Ilkin Serengil.

]]>The function is formulated as x times sigmoid x. The sigmoid function was an important activation function in history, but today it is a legacy one because of the vanishing gradient problem. A little modification makes this legacy activation function important again.

y = x . sigmoid(x)

y = x . (1/(1+e^{-x})) = x / (1+e^{-x})

Notice that ReLU produces 0 output for negative inputs, and nothing can be back-propagated through those units. Herein, swish can partially handle this problem.
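The function is short enough to sketch directly (my own implementation of the formula above). Notice the small but non-zero outputs for negative inputs — this is what lets gradients flow where ReLU would output exactly 0:

```python
from math import exp

def sigmoid(x):
    return 1 / (1 + exp(-x))

def swish(x):
    # y = x . sigmoid(x)
    return x * sigmoid(x)

for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(x, "->", round(swish(x), 3))
```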

Derivative of the function will be involved in back propagation step.

y = x . σ(x)

y = x . (1/(1+e^{-x}))

The equation consists of two differentiable functions. We can apply product rule to the function. Remember what product rule is first.

(f . g)’ = f’.g + f.g’

y’ = x’ . σ(x) + x . σ(x)’

Remember the derivative of sigmoid function.

σ(x) = 1/(1+e^{-x})

σ(x)’ = σ(x).(1 – σ(x))

Derivative of the function includes derivative of sigmoid, too.

y’ = x’ . σ(x) + x . σ(x)’

y’ = σ(x) + x . σ(x) . (1 – σ(x)) = σ(x) + x . σ(x) – x. σ^{2}(x)

The second term, x . σ(x), is the swish function itself. Shift it to the front.

y’ = x . σ(x) + σ(x) – x. σ^{2}(x) = y + σ(x) – x. σ^{2}(x)

Now, 2nd and 3rd terms both have sigmoid multiplier. Let’s express them both as sigmoid common parenthesis.

y’ = y + σ(x) . (1 – x.σ(x))

The term in the parenthesis includes swish function again. We can express it as the function y.

y’ = y + σ(x) . (1 – y)

This is the most basic form of the derivative of the swish function.
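We can sanity-check this compact form numerically against a central finite difference (a sketch under my own naming, not code from the post):

```python
from math import exp

def sigmoid(x):
    return 1 / (1 + exp(-x))

def swish(x):
    return x * sigmoid(x)

def swish_derivative(x):
    y = swish(x)
    return y + sigmoid(x) * (1 - y)  # y' = y + sigma(x) . (1 - y)

h = 1e-6
for x in (-2.0, 0.0, 1.5):
    numeric = (swish(x + h) - swish(x - h)) / (2 * h)  # central difference
    assert abs(numeric - swish_derivative(x)) < 1e-6
print("derivative identity verified")
```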

We can replace sigma function to content of sigmoid function and produce a raw equation.

y’ = x . (1/(1+e^{-x})) + 1/(1+e^{-x}) . (1 – (x/(1+e^{-x})))

y’ = (x/(1+e^{-x})) + [1/(1+e^{-x})].[(1 + e^{-x} – x)/(1+e^{-x})]

y’ = x/(1+e^{-x}) + (1 + e^{-x} – x)/(1+e^{-x})^{2}

y’ = x.(1+e^{-x})/(1+e^{-x})^{2} + (1 + e^{-x} – x)/(1+e^{-x})^{2}

y’ = [x.(1+e^{-x}) + (1 + e^{-x} – x)]/(1+e^{-x})^{2}

y’ = (x + x.e^{-x} + 1 + e^{-x} – x)/(1+e^{-x})^{2}

y’ = (e^{-x}(x + 1) + 1)/(1+e^{-x})^{2}

The same authors published a new research paper just a week later. In this paper, they modified the function and added a β multiplier inside the sigmoid. Interestingly, they called this new function swish again.

y = x . sigmoid(β.x)

y = x . (1/(1+e^{-βx})) = x / (1+e^{-βx})

Here, β is a parameter that must be tuned. β must be different from 0, otherwise the function becomes linear. As β goes to ∞, the function looks more and more like ReLU. We took β as 1 in the previous calculations. The authors proposed to assign β as 1 for reinforcement learning tasks in this new research.

Derivative of this new term would not be changed radically because β is constant.

Firstly, find the derivative for σ(β.x).

σ(β.x) = 1/(1+e^{-βx}) = (1+e^{-βx})^{-1}

σ(β.x)’ = (-1).(1+e^{-βx})^{-2}.e^{-βx}.(-β) = β . e^{-βx} .(1+e^{-βx})^{-2} = (β . e^{-βx} )/(1+e^{-βx})^{2}

Put β out of the parenthesis

σ(β.x)’ = β.((e^{-βx} )/(1+e^{-βx})^{2})

We will apply a little trick to form the derivative term simpler. Append plus and minus 1 to the numerator. This would not change the result.

σ(β.x)’ = β.((e^{-βx} +1-1)/(1+e^{-βx})^{2})

Separate terms in the numerator as 1+e^{-βx} and -1.

σ(β.x)’ = β.[(1+e^{-βx} )/(1+e^{-βx})^{2 }– 1/(1+e^{-βx})^{2}]

The first term in the parenthesis includes 1+e^{-βx} in both numerator and denominator. We can remove this term.

σ(β.x)’ = β.[1/(1+e^{-βx} ) – 1/(1+e^{-βx})^{2}]

Express 2nd term in the parenthesis as multiplier instead of squared.

σ(β.x)’ = β. [1/(1+e^{-βx} ) – (1/(1+e^{-βx} ))(1/(1+e^{-βx} )) ]

Notice that σ was 1/(1+e^{-βx}). Replace this terms with σ in the equation above.

σ(β.x)’ = β . [σ – σ.σ] = β. [σ.(1-σ)]

We’ve found the derivative for σ(β.x). Actually, it is equal to β times derivative of pure sigmoid.

Turn back to the modified swish function.

y = x . sigmoid(β.x)

Again, we’ll apply product rule to the term above.

y’ = x’ . σ(β.x) + x . σ(β.x)’

y’ = σ(β.x) + x . β. [σ.(1-σ)]

y’ = 1/(1+e^{-βx}) + x . β . (1/(1+e^{-βx})) . (1 – 1/(1+e^{-βx}))
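The same finite difference check works for the β-parameterized version (again a sketch of my own, with assumed names):

```python
from math import exp

def sigmoid(x):
    return 1 / (1 + exp(-x))

def swish(x, beta):
    return x * sigmoid(beta * x)

def swish_derivative(x, beta):
    s = sigmoid(beta * x)
    return s + x * beta * s * (1 - s)  # y' = sigma(beta.x) + x . beta . sigma . (1 - sigma)

h = 1e-6
for beta in (0.5, 1.0, 2.0):
    for x in (-1.0, 0.3, 2.0):
        numeric = (swish(x + h, beta) - swish(x - h, beta)) / (2 * h)
        assert abs(numeric - swish_derivative(x, beta)) < 1e-5
print("ok")
```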

I summarized both the swish function and its derivative below.

y = x . σ(x) where σ(x) = 1/(1+e^{-x})

dy/dx = y + σ(x) . (1 – y)

or dy/dx = (e^{-x}(x + 1) + 1)/(1+e^{-x})^{2}

So, we've mentioned a new activation function derived from a legacy one. This function can handle the vanishing gradient problem that sigmoid cannot. Moreover, experiments show that swish works better than ReLU – the superstar activation function of deep learning. On the other hand, computing the function costs more than ReLU for both feed forward and back propagation.

The post Swish as Neural Networks Activation Function appeared first on Sefik Ilkin Serengil.

]]>The post Elliptic Curve ElGamal Encryption appeared first on Sefik Ilkin Serengil.

]]>Elliptic curves satisfy the equation y^{2} = x^{3} + ax + b. Here, a and b specify the characteristic feature of the curve. Also, we define elliptic curves over prime fields to produce points including integer coordinates. This definition transforms the equation as y^{2} = x^{3} + ax + b (mod p).

We will use the following configuration in this post.

```python
# curve configuration: y^2 = x^3 + a*x + b -> y^2 = x^3 + 7
a = 0
b = 7
base_point = [55066263022277343669578718895168534326250603453777594175500187360389116729240,
              32670510020758816978083085130507043184471273380659243275938904335757337482424]
# NIST curves
mod = pow(2, 256) - pow(2, 32) - pow(2, 9) - pow(2, 8) - pow(2, 7) - pow(2, 6) - pow(2, 4) - pow(2, 0)
order = 115792089237316195423570985008687907852837564279074904382605163141518161494337
```

We can apply additions and multiplications over elliptic curves. Here, we want to encrypt a message. We need to express the message numerically if we want to talk the same language with elliptic curve cryptography.

```python
def textToInt(text):
    encoded_text = text.encode('utf-8')
    hex_text = encoded_text.hex()
    int_text = int(hex_text, 16)
    return int_text

message = 'hi'
plaintext = textToInt(message)
print("message: ", message, ". its numeric matching is ", plaintext)
```

Now, we can calculate the numeric message times the base point on the elliptic curve. In this way, we map the plaintext to a coordinate pair. These are the plain coordinates that we will actually encrypt. We must keep the message, the numeric plaintext and the plain coordinates all secret.

```python
plain_coordinates = EccCore.applyDoubleAndAddMethod(base_point[0], base_point[1], plaintext, a, b, mod)
```

Now we can produce our public key. The public key is independent of the plaintext. We will pick a secret key and calculate secret key times base point. That will be our public key.

```python
secretKey = 75263518707598184987916378021939673586055614731957507592904438851787542395619
publicKey = EccCore.applyDoubleAndAddMethod(base_point[0], base_point[1], secretKey, a, b, mod)
```

We will create a ciphertext pair. This requires to include a really random key.

```python
import random

randomKey = random.getrandbits(128)
c1 = EccCore.applyDoubleAndAddMethod(base_point[0], base_point[1], randomKey, a, b, mod)
c2 = EccCore.applyDoubleAndAddMethod(publicKey[0], publicKey[1], randomKey, a, b, mod)
c2 = EccCore.pointAddition(c2[0], c2[1], plain_coordinates[0], plain_coordinates[1], a, b, mod)
```

The encryption phase is over. We will send the ciphertext pair (c1, c2). Also, the public key and the public curve configuration are publicly known information.

ElGamal decryption scheme is based on the following equation.

decryption = c2 – secretKey * c1

We will adapt this equation to elliptic curves. We already know how to add points over elliptic curves, but the term above includes a subtraction. Reflecting the sign of the subtracted point handles this.

decryption = c2 + secretKey * (-c1)

Notice that elliptic curves in weierstrass form are symmetric about x-axis. This means that negative of a point (x, y) is equal to (x, -y)
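So negating a point only needs the y coordinate reflected modulo the prime. A small sketch of mine (the helper name is hypothetical; it is not part of the referenced EccCore module):

```python
def negate_point(point, p):
    """Negative of (x, y) on a curve over GF(p): (x, -y mod p) = (x, p - y)."""
    x, y = point
    return (x, (-y) % p)

# toy example over a small prime field
p = 97
point = (3, 6)
print(negate_point(point, p))  # (3, 91)
```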

```python
# secret key times c1
dx, dy = EccCore.applyDoubleAndAddMethod(c1[0], c1[1], secretKey, a, b, mod)
# -secret key times c1
dy = dy * -1
# c2 + secret key * (-c1)
decrypted = EccCore.pointAddition(c2[0], c2[1], dx, dy, a, b, mod)
```

Decryption phase is almost over. You can restore the coordinates for plaintext.

The rest of the operation requires high computational power: we need to solve the elliptic curve discrete logarithm problem to restore the numeric plaintext again. I will apply the brute force method here to be clear, but applying baby-step giant-step would reduce the complexity radically.

```python
new_point = EccCore.pointAddition(base_point[0], base_point[1], base_point[0], base_point[1], a, b, mod)  # 2P

# brute force method
for i in range(3, order):
    new_point = EccCore.pointAddition(new_point[0], new_point[1], base_point[0], base_point[1], a, b, mod)
    if new_point[0] == decrypted[0] and new_point[1] == decrypted[1]:
        print("decrypted message as numeric: ", i)
        print("decrypted message: ", intToText(i))
        break
```

So, we can encrypt a plaintext and restore it successfully by combining the elliptic curve and ElGamal concepts. However, this is a conceptual, theoretical encryption scheme and it is hard to apply in the real world: elliptic curve cryptography is powerful because of the elliptic curve discrete logarithm problem, and the decryption phase of this method itself requires solving ECDLP – which is really difficult. I've already pushed the source code of this post to GitHub.

The post Elliptic Curve ElGamal Encryption appeared first on Sefik Ilkin Serengil.

]]>