The post A Gentle Introduction to LightGBM for Applied Machine Learning appeared first on Sefik Ilkin Serengil.
You can install the LightGBM package by running the pip install lightgbm command. Then, we import the library.
import lightgbm as lgb
The data set that we are going to work on is about a playing golf decision based on some features. You can find the data set here. I chose this data set because it has both numeric and string features. The Decision column is the target from which we would like to extract decision rules. I will load the data set with pandas because it simplifies column-based operations in the following steps.
import pandas as pd
dataset = pd.read_csv('golf2.txt')
dataset.head()
The data frame's head function prints the first 5 rows.
| Outlook | Temp. | Humidity | Wind | Decision
0 | Sunny | 85 | 85 | Weak | No
1 | Sunny | 80 | 90 | Strong | No
2 | Overcast | 83 | 78 | Weak | Yes
3 | Rain | 70 | 96 | Weak | Yes
4 | Rain | 68 | 80 | Weak | Yes
LightGBM expects categorical features to be encoded as integers. Here, the temperature and humidity features are already numeric, but the outlook and wind features are categorical. We need to convert these features. I will use scikit-learn's label encoder.
Even though categorical features will be converted to integers, we will still specify which features are categorical in the following steps. That's why I store all features and the categorical ones in separate variables.
from sklearn import preprocessing

is_regression = False #the Decision column stores classes here

le = preprocessing.LabelEncoder()
features = []; categorical_features = []
num_of_columns = dataset.shape[1]

for i in range(0, num_of_columns):
    column_name = dataset.columns[i]
    column_type = dataset[column_name].dtypes
    if i != num_of_columns - 1: #skip target
        features.append(column_name)
    if column_type == 'object':
        le.fit(dataset[column_name])
        feature_classes = list(le.classes_)
        encoded_feature = le.transform(dataset[column_name])
        dataset[column_name] = pd.DataFrame(encoded_feature)
        if i != num_of_columns - 1: #skip target
            categorical_features.append(column_name)
        if is_regression == False and i == num_of_columns - 1:
            num_of_classes = len(feature_classes)
In this way, we can handle different data sets. Let’s check the encoded data set.
dataset.head()
| Outlook | Temp. | Humidity | Wind | Decision
0 | 2 | 85 | 85 | 1 | 0
1 | 2 | 80 | 90 | 0 | 0
2 | 0 | 83 | 78 | 1 | 1
3 | 1 | 70 | 96 | 1 | 1
4 | 1 | 68 | 80 | 1 | 1
The data set is transformed into its final form. We need to separate input features and output labels to feed LightGBM.
y_train = dataset['Decision'].values
x_train = dataset.drop(columns=['Decision']).values
Remember that we have converted string features to integers. Here, we need to specify the categorical features. Training would still work if we did not mention them, but in that case some node in a decision tree might check whether such a feature is greater than, or less than or equal to, some value. Suppose gender were a feature in our data set and we encoded unknown as 0, male as 1 and female as 2. What if a decision tree checks whether gender is greater than 0, or less than or equal to 0? We would lump male and female together and might miss important gender information. Specifying categorical features lets the tree check male, female and unknown separately.
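To make the pitfall concrete, here is a small sketch with a hypothetical gender feature; the encoding and sample values are made up for illustration:

```python
# hypothetical gender feature, label-encoded (values are made up for illustration)
encoding = {"unknown": 0, "male": 1, "female": 2}
samples = ["male", "female", "unknown", "female"]
codes = [encoding[s] for s in samples]

# a numeric split like "gender > 0" lumps male and female together
numeric_split = [c > 0 for c in codes]

# treating the feature as categorical allows an equality check per class
is_female = [c == encoding["female"] for c in codes]

print(numeric_split)  # male and female fall on the same side of the split
print(is_female)      # only the female instances are isolated
```

A numeric threshold cannot isolate a single category here, while an equality check on a declared categorical feature can.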
lgb_train = lgb.Dataset(x_train, y_train, feature_name=features, categorical_feature=categorical_features)
We can solve this problem as either classification or regression. Typically, the objective and metric parameters differ between the two. Passing the parameter set and LightGBM's data set starts training.
params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'regression' if is_regression == True else 'multiclass',
    'num_class': num_of_classes,
    'metric': 'rmsle' if is_regression == True else 'multi_logloss',
    'min_data': 1,
    'verbose': -1
}

gbm = lgb.train(params, lgb_train, num_boost_round=50)
The trained model is stored in the gbm variable. We can ask gbm to predict the decision for a new instance. Similarly, we can feed the features of training set instances and ask gbm to predict their decisions.
import numpy as np

target_name = 'Decision'
predictions = gbm.predict(x_train)

for index, instance in dataset.iterrows():
    actual = instance[target_name]
    if is_regression == True:
        prediction = round(predictions[index])
    else: #classification
        prediction = np.argmax(predictions[index])
    print((index+1), ". actual=", actual, ", prediction=", prediction)
This code block makes the following predictions for the training data set. As seen, all instances are predicted correctly.
actual= 0 , prediction= 0
actual= 0 , prediction= 0
actual= 1 , prediction= 1
actual= 1 , prediction= 1
actual= 1 , prediction= 1
actual= 0 , prediction= 0
actual= 1 , prediction= 1
actual= 0 , prediction= 0
actual= 1 , prediction= 1
actual= 1 , prediction= 1
actual= 1 , prediction= 1
actual= 1 , prediction= 1
actual= 1 , prediction= 1
actual= 0 , prediction= 0
Luckily, LightGBM lets us visualize the built decision trees and the importance of data set features. This makes decisions understandable. It requires installing the Graph Visualization Software (Graphviz).
Firstly, you need to run the pip install graphviz command to install the python package.

Secondly, please install the graphviz package for your OS here. You can add the installation directory to the path as illustrated below.
import matplotlib.pyplot as plt
import os
os.environ["PATH"] += os.pathsep + 'C:/Program Files (x86)/Graphviz2.38/bin'
Plotting the tree is an easy task now.
ax = lgb.plot_importance(gbm, max_num_features=10)
plt.show()

ax = lgb.plot_tree(gbm)
plt.show()
Decision rules can be extracted from the built tree easily.
Now, we know feature importance for the data set.
So, we have discovered Microsoft's light gradient boosting machine framework, which has been adopted by many applied machine learning studies. Moreover, we've mentioned its pros and cons compared to its alternatives. Besides, we've developed a hello-world model with LightGBM. Finally, I pushed the source code of this blog post to my GitHub profile.
The post A Step by Step Gradient Boosting Decision Tree Example appeared first on Sefik Ilkin Serengil.
Lecture notes of Zico Colter from Carnegie Mellon University and lecture notes of Cheng Li from Northeastern University guided me to understand the concept. Moreover, Tianqi Chen's presentation reinforced it. I also referenced, as links in this post, all the sources that helped me make the subject clear. I strongly recommend you to visit them.
I pushed the core implementation of the gradient boosted regression tree algorithm to GitHub. You might want to clone the repository and run it yourself.
This is very similar to the baby step giant step method. We initially create a decision tree for the raw data set. That would be the giant step. Then, it is time to tune and boost. We create a new decision tree based on the previous tree's errors, and we apply this approach several times. These are the baby steps. Terence Parr described this process wonderfully in a golf playing scenario, as illustrated below.
Herein, remember the random forest algorithm. It separates the data set into n different sub data sets and creates a different decision tree for each of them. In contrast, the data set remains the same in GBM. We create a decision tree, then feed the same data set to the decision tree algorithm again, but update each instance's label to its actual value minus its prediction. You might think of gradient boosting as sequential decision trees.
For instance, the following illustration shows that the first decision tree returns 2 as a result for the boy. Then we build another decision tree based on the errors of the first tree's results. It returns 0.9 for the boy this time. The final decision for the boy would be 2.9, which sums the predictions of the sequential trees.
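The summing of sequential predictions can be sketched with a toy boosting loop. Here each "tree" is just a single leaf returning the mean residual, using the day 1 and day 2 target values from this post:

```python
targets = [25.0, 30.0]      # actual values for day 1 and day 2
predictions = [0.0, 0.0]    # cumulative prediction per instance

for _ in range(3):  # boosting rounds
    residuals = [t - p for t, p in zip(targets, predictions)]
    leaf = sum(residuals) / len(residuals)  # a one-leaf "tree" fit on the residuals
    predictions = [p + leaf for p in predictions]

print(predictions)
```

After the first round both predictions become the mean 27.5, matching the sunny/hot leaf value in the tree built later in the post; a single leaf then gets stuck there, which is exactly why real GBM grows trees that can split on features.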
You might remember that we've mentioned regression trees in previous posts. Reading that post will help you understand GBM clearly.
We’ve worked on the following data set.
Day | Outlook | Temp. | Humidity | Wind | Decision |
1 | Sunny | Hot | High | Weak | 25 |
2 | Sunny | Hot | High | Strong | 30 |
3 | Overcast | Hot | High | Weak | 46 |
4 | Rain | Mild | High | Weak | 45 |
5 | Rain | Cool | Normal | Weak | 52 |
6 | Rain | Cool | Normal | Strong | 23 |
7 | Overcast | Cool | Normal | Strong | 43 |
8 | Sunny | Mild | High | Weak | 35 |
9 | Sunny | Cool | Normal | Weak | 38 |
10 | Rain | Mild | Normal | Weak | 46 |
11 | Sunny | Mild | Normal | Strong | 48 |
12 | Overcast | Mild | High | Strong | 52 |
13 | Overcast | Hot | Normal | Weak | 44 |
14 | Rain | Mild | High | Strong | 30 |
And we’ve built the following decision tree.
This duty is handled by the buildDecisionTree function. We pass the data set, the number of inline tabs (this is important in python; we increase this on every inner call and restore it afterwards) and a file name to store the decision rules.
root = 1
buildDecisionTree(df, root, "rules0.py") #generate rules0.py
Running the decision tree algorithm on the data set generates the following decision rules.
def findDecision(obj): #obj = [Outlook, Temperature, Humidity, Wind]
    if obj[0] == 'Rain':
        if obj[3] == 'Weak':
            return 47.666666666666664
        if obj[3] == 'Strong':
            return 26.5
    if obj[0] == 'Sunny':
        if obj[1] == 'Hot':
            return 27.5
        if obj[1] == 'Mild':
            return 41.5
        if obj[1] == 'Cool':
            return 38
    if obj[0] == 'Overcast':
        return 46.25
Building this decision tree was covered in a previous post. That's why I skipped how the tree is built. If it is hard to understand, I strongly recommend you to read that post.
Let's check the day 1 and day 2 instances. They both have a sunny outlook and hot temperature. The built decision tree says the decision will be 27.5 for sunny outlook and hot temperature. However, day 1 should be 25 and day 2 should be 30. This means that the error (or residual) is 25 − 27.5 = −2.5 for day 1 and 30 − 27.5 = +2.5 for day 2. The following days have similar errors. We will boost these errors.
This is not a must, but we will use squared error as the loss function.

loss = (1/2) × (y − y')²
where y is the actual value and y’ is the prediction.
The gradient in gradient boosting refers to gradient descent. We will update each prediction with the partial derivative of the loss function with respect to that prediction. Let's find this derivative first.
∂loss/∂y' = ∂((1/2) × (y − y')²)/∂y' = 2 × (1/2) × (y − y') × ∂(y − y')/∂y' = (y − y') × (−1) = y' − y
Now, we can update the predictions by applying the following formula, where α is the learning rate.

y' = y' − α × (∂loss/∂y')

Please focus on the update term only. I set α to 1 to keep the formula simple.

−α × (∂loss/∂y') = −α × (y' − y) = α × (y − y') = y − y'
This is the label on which we will build the new decision tree.
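The derivative above can be double-checked numerically with finite differences, using the day 1 values from this post:

```python
# numerically check that d(loss)/dy' = y' - y for loss = (1/2) * (y - y')^2
y, y_pred = 25.0, 27.5       # day 1 actual value and the tree's prediction

def loss(p):
    return 0.5 * (y - p) ** 2

eps = 1e-6
# central difference approximation of the derivative at y_pred
numeric = (loss(y_pred + eps) - loss(y_pred - eps)) / (2 * eps)
analytic = y_pred - y        # 2.5, so the new label y - y' is -2.5

print(numeric, analytic)
```

The analytic derivative is +2.5, so the update term y − y' gives the residual −2.5 used as the new label for day 1.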
Remember that the error was −2.5 for day 1 and +2.5 for day 2. Similarly, we find the errors for the following days from the built decision tree's results and the actual labels.
import rules0 as myrules

for i, instance in df.iterrows():
    params = [] #features for the current line
    for j in range(0, columns-1):
        params.append(instance[j])
    prediction = myrules.findDecision(params) #apply previous round's rules
    actual = instance[columns-1]
    gradient = actual - prediction
    instance[columns-1] = gradient
    df.loc[i] = instance

df.to_csv("data1.py", index=False)
Then, a new data set is created in which each line's decision column is set to its residual.
Day | Outlook | Temp. | Humidity | Wind | Decision |
1 | Sunny | Hot | High | Weak | -2.5 |
2 | Sunny | Hot | High | Strong | 2.5 |
3 | Overcast | Hot | High | Weak | -0.25 |
4 | Rain | Mild | High | Weak | -2.667 |
5 | Rain | Cool | Normal | Weak | 4.333 |
6 | Rain | Cool | Normal | Strong | -3.5 |
7 | Overcast | Cool | Normal | Strong | -3.25 |
8 | Sunny | Mild | High | Weak | -6.5 |
9 | Sunny | Cool | Normal | Weak | 0 |
10 | Rain | Mild | Normal | Weak | -1.667 |
11 | Sunny | Mild | Normal | Strong | 6.5 |
12 | Overcast | Mild | High | Strong | 5.75 |
13 | Overcast | Hot | Normal | Weak | -2.25 |
14 | Rain | Mild | High | Strong | 3.5 |
Now, it is time to build a new decision tree based on the data set above. The following code block generates decision rules for the current data frame.
root = 1
buildDecisionTree(df, root, "rules1.py")
Running the regression tree algorithm creates the following decision rules.
def findDecision(Outlook, Temperature, Humidity, Wind):
    if Wind == 'Weak':
        if Temperature == 'Hot':
            return -1.6666666666666667
        if Temperature == 'Mild':
            return -3.6111111111111094
        if Temperature == 'Cool':
            return 2.166666666666668
    if Wind == 'Strong':
        if Temperature == 'Mild':
            return 5.25
        if Temperature == 'Cool':
            return -3.375
        if Temperature == 'Hot':
            return 2.5
Let's look at the predictions for day 1 and day 2 again. The newly built tree says that day 1 has weak wind and hot temperature, so its prediction is −1.667, but its actual value was −2.5 in the second data set. This means that the error is −2.5 − (−1.667) = −0.833.
Similarly, the tree says that day 2 has strong wind and hot temperature; that's why it is predicted as 2.5, and its actual value is 2.5, too. In this case, the error is 2.5 − 2.5 = 0. In this way, I calculate each instance's prediction and subtract it from its actual value again.
Day | Outlook | Temp. | Humidity | Wind | Decision |
1 | Sunny | Hot | High | Weak | -0.833 |
2 | Sunny | Hot | High | Strong | 0.0 |
3 | Overcast | Hot | High | Weak | 1.416 |
4 | Rain | Mild | High | Weak | 0.944 |
5 | Rain | Cool | Normal | Weak | 2.166 |
6 | Rain | Cool | Normal | Strong | -0.125 |
7 | Overcast | Cool | Normal | Strong | 0.125 |
8 | Sunny | Mild | High | Weak | -2.888 |
9 | Sunny | Cool | Normal | Weak | -2.166 |
10 | Rain | Mild | Normal | Weak | 1.944 |
11 | Sunny | Mild | Normal | Strong | 1.25 |
12 | Overcast | Mild | High | Strong | 0.5 |
13 | Overcast | Hot | Normal | Weak | -0.583 |
14 | Rain | Mild | High | Strong | -1.75 |
This time, the following rules are created for the data set above.
def findDecision(Outlook, Temperature, Humidity, Wind):
    if Outlook == 'Rain':
        if Wind == 'Weak':
            return 1.685185185185186
        if Wind == 'Strong':
            return -0.9375
    if Outlook == 'Sunny':
        if Wind == 'Weak':
            return -1.962962962962964
        if Wind == 'Strong':
            return 0.625
    if Outlook == 'Overcast':
        return 0.3645833333333334
I skipped epochs 3 to 5 because the same procedure is applied in each step.

Thereafter, I summarize each epoch's predictions in the table below. I calculate predictions cumulatively, summing the values from epoch 1 to epoch 5 in each line to find the final prediction.
Day | Actual | epoch 1 | epoch 2 | epoch 3 | epoch 4 | epoch 5 | prediction |
1 | 25 | 27.5 | -1.667 | -1.963 | 0.152 | 5.55E-17 | 24.023 |
2 | 30 | 27.5 | 2.5 | 0.625 | 0.152 | 5.55E-17 | 30.777 |
3 | 46 | 46.25 | -1.667 | 0.365 | 0.152 | 5.55E-17 | 45.1 |
4 | 45 | 47.667 | -3.611 | 1.685 | -0.586 | -1.88E-01 | 44.967 |
5 | 52 | 47.667 | 2.167 | 1.685 | 0.213 | 1.39E-17 | 51.731 |
6 | 23 | 26.5 | -3.375 | -0.938 | 0.213 | 1.39E-17 | 22.4 |
7 | 43 | 46.25 | -3.375 | 0.365 | 0.213 | 1.39E-17 | 43.452 |
8 | 35 | 41.5 | -3.611 | -1.963 | -0.586 | -7.86E-02 | 35.261 |
9 | 38 | 38 | 2.167 | -1.963 | 0.213 | 1.39E-17 | 38.416 |
10 | 46 | 47.667 | -3.611 | 1.685 | 0.442 | -1.88E-01 | 45.995 |
11 | 48 | 41.5 | 5.25 | 0.625 | 0.442 | -7.86E-02 | 47.739 |
12 | 52 | 46.25 | 5.25 | 0.365 | -0.586 | 7.21E-01 | 52 |
13 | 44 | 46.25 | -1.667 | 0.365 | 0.152 | 5.55E-17 | 45.1 |
14 | 30 | 26.5 | 5.25 | -0.938 | -0.586 | -1.88E-01 | 30.038 |
For instance, the prediction for day 1 changes over the epochs as illustrated below.
1st Epoch = 27.5
2nd Epoch = 27.5 – 1.667 = 25.833
3rd Epoch = 27.5 – 1.667 – 1.963 = 23.87
4th Epoch = 27.5 – 1.667 – 1.963 + 0.152 = 24.022
The absolute error was |25 − 27.5| = 2.5 in the 1st round for day 1, but we reduced it to |25 − 24.023| = 0.977 in the 5th round. As seen, each instance's prediction gets closer to its actual value as it is boosted.
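The cumulative prediction for day 1 can be reproduced in a couple of lines (the epoch 5 term is effectively zero and omitted):

```python
# day 1 outputs of epochs 1 to 4, summed cumulatively as in the table above
epoch_outputs = [27.5, -1.667, -1.963, 0.152]
prediction = sum(epoch_outputs)
absolute_error = abs(25 - prediction)
print(prediction, absolute_error)
```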
BTW, the learning rate (α) and the number of iterations (epochs) should be tuned for different problems.

The table below pivots the absolute error for each epoch.
Day | epoch 1 | epoch 2 | epoch 3 | epoch 4 | epoch 5 |
1 | 2.5 | 0.833 | 1.13 | 0.977 | 0.977 |
2 | 2.5 | 0 | 0.625 | 0.777 | 0.777 |
3 | 0.25 | 1.417 | 1.052 | 0.9 | 0.9 |
4 | 2.667 | 0.944 | 0.741 | 0.155 | 0.033 |
5 | 4.333 | 2.167 | 0.481 | 0.269 | 0.269 |
6 | 3.5 | 0.125 | 0.813 | 0.6 | 0.6 |
7 | 3.25 | 0.125 | 0.24 | 0.452 | 0.452 |
8 | 6.5 | 2.889 | 0.926 | 0.34 | 0.261 |
9 | 0 | 2.167 | 0.204 | 0.416 | 0.416 |
10 | 1.667 | 1.944 | 0.259 | 0.183 | 0.005 |
11 | 6.5 | 1.25 | 0.625 | 0.183 | 0.261 |
12 | 5.75 | 0.5 | 0.135 | 0.721 | 0 |
13 | 2.25 | 0.583 | 0.948 | 1.1 | 1.1 |
14 | 3.5 | 1.75 | 0.813 | 0.227 | 0.038 |
MAE | 3.011111 | 1.112963 | 0.599383 | 0.48669 | 0.406115 |
The result looks interesting when I plot the total error over epochs.
We can definitely say that boosting works well.
So, the intuition behind gradient boosting is covered in this post. XGBoost, LightGBM and CatBoost are common variants of gradient boosting. Even though decision trees are very powerful machine learning algorithms, a single tree is not strong enough for applied machine learning studies. However, experiments show that its sequential form, GBM, dominates most applied ML challenges. I pushed the core implementation of the gradient boosted regression tree algorithm to GitHub.
The post Large Scale Machine Learning with Pandas appeared first on Sefik Ilkin Serengil.
You might remember the Iris flower data set. There are 150 instances in the data set, each with sepal and petal length and width measurements and a corresponding class. The class can be one of 3 different iris flower types: setosa, versicolor and virginica. So, there are 4 input features and 3 output labels. Let's create a hidden layer consisting of 4 nodes in the neural network. I usually set this number to 2/3 of the sum of features and labels. Multi-class classification requires cross-entropy as the loss function. Also, I want to apply the Adam optimization algorithm to converge faster.
import keras
from keras.models import Sequential
from keras.layers import Dense, Activation

num_classes = 3 #setosa, versicolor and virginica

def createNetwork():
    model = Sequential()
    model.add(Dense(4 #num of hidden units
        , input_shape=(4,))) #num of features in input layer
    model.add(Activation('sigmoid')) #activation from input layer to 1st hidden layer
    model.add(Dense(num_classes)) #num of classes in output layer
    model.add(Activation('sigmoid')) #activation from 1st hidden layer to output layer
    return model

model = createNetwork()
model.compile(loss='categorical_crossentropy'
    , optimizer=keras.optimizers.Adam(lr=0.007)
    , metrics=['accuracy'])
Even though this data set is small enough to fit in memory, we will load it as sub data sets instead of loading it all at once. In this way, we save on memory. On the other hand, this increases I/O usage, but that is reasonable because we cannot store massive data sets in memory.
The chunk size parameter is set to 30. Thus, we read 30 lines of the data set in each iteration. Moreover, column information is missing in the data set. That's why we need to define column names. Otherwise, pandas treats the first row as column names and we lose that line's information.
import pandas as pd
import numpy as np

chunk_size = 30

def processDataset():
    for chunk in pd.read_csv("iris.data", chunksize=chunk_size
        , names=["sepal_length", "sepal_width", "petal_length", "petal_width", "class"]):
        current_set = chunk.values #convert data frame to numpy array
The chunk variable is a pandas data frame. We convert it to a numpy array by getting its values. This is important because the fit operation expects features and labels as numpy arrays.

A line of the data set consists of 4 measurements of a flower and the corresponding class, respectively. I can separate features and label by specifying index values.
features = current_set[:, 0:4]
labels = current_set[:, 4]
Labels are in a single column and are of type string. I will apply one-hot encoding to feed the network.
for i in range(0, labels.shape[0]):
    if labels[i] == 'Iris-setosa':
        labels[i] = 0
    elif labels[i] == 'Iris-versicolor':
        labels[i] = 1
    elif labels[i] == 'Iris-virginica':
        labels[i] = 2

labels = keras.utils.to_categorical(labels, num_classes)
Features and labels are ready. We can feed them to the neural network. The epochs parameter must be set to 1 here. This is important. I will handle epochs in a for loop at the top level.
model.fit(features, labels, epochs=1, verbose=0) #epochs handled in the for loop above
We are done processing the whole training set when the processDataset() operation is over. Remember the back-propagation and gradient descent algorithms. We need to apply this processing over and over.
epochs = 1000
for epoch in range(0, epochs): #epochs should be handled here, not in the fit command!
    processDataset()
If you set verbose to 1, then you will see loss values for the current sub data set. You should ignore the loss during training because it does not represent the global loss over the training set.
So, we've adapted pandas to read a massive data set as small chunks and feed neural network learning. It comes with pros and cons. The main advantage is that we can handle massive data sets and save on memory. The disadvantage is increased I/O usage. Note that the focus of this post is working on massive data sets; it is neither big data nor streaming data. I've pushed the source code of this post to GitHub.
The post A Beginner's Guide to TensorFlow.js: Machine Learning in JavaScript appeared first on Sefik Ilkin Serengil.
In this case, we can just run the code; no prerequisite installation is required. I will create a hello.html file and reference the tensorflow.js library in the head tag. This reference makes tensorflow-related objects available under the tf variable. There might be a more up-to-date version of the library; you should check the official site.

Also, I need to define another script tag after the tensorflow.js reference. I will construct the neural network there.
<html>
<head>
<script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs@0.12.5"></script>
<!-- Place your code in the script tag below -->
<script>
</script>
</head>
<body>
</body>
</html>
I will construct a model for the XOR problem. Let's create the data set first. Here, xtrain stores all potential inputs whereas ytrain stores the xor logic gate results, one-hot encoded, respectively. I mean that [1, 0] refers to firing 0 whereas [0, 1] refers to firing 1 as the xor result.
const xtrain = tf.tensor2d([[0, 0], [0, 1], [1, 0], [1, 1]]);
const ytrain = tf.tensor2d([[1, 0], [0, 1], [0, 1], [1, 0]]);
We can construct the neural network model now. I will create a sequential model. The input layer consists of 2 nodes because there are 2 input features in the xor data set. The first and only hidden layer will have 5 nodes and its activation function will be sigmoid. Finally, the output layer will have 2 nodes because the xor data set has 2 output classes. The activation function of the output layer should be softmax because this is a classification problem.
const model = tf.sequential();
model.add(tf.layers.dense({units: 5, activation: 'sigmoid', inputShape: [2]}));
model.add(tf.layers.dense({units: 2, activation: 'softmax'}));
Now, we can specify the optimization algorithm and loss function to train the model. You have to use the categorical crossentropy loss function if you use the softmax activation function in the output layer. Moreover, I would like to train the model with the Adam optimization algorithm to learn faster.
var learning_rate = 0.1;
model.compile({loss: 'categoricalCrossentropy', optimizer: tf.train.adam(learning_rate)});
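As a side note, what the softmax and categorical crossentropy pair computes can be sketched in a few lines of python with numpy; the logits below are made-up numbers just to show the mechanics:

```python
import numpy as np

def softmax(z):
    # shift by the max for numerical stability; rows sum to 1
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def categorical_crossentropy(y_true, y_prob):
    # negative log-probability of the true class (y_true is one-hot)
    return -np.sum(y_true * np.log(y_prob), axis=-1)

logits = np.array([[2.0, 0.5]])          # hypothetical raw network outputs
probs = softmax(logits)
loss = categorical_crossentropy(np.array([[1.0, 0.0]]), probs)
print(probs, loss)
```

The loss is small when the probability assigned to the true class is close to 1, which is what training pushes the network toward.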
Time to train the network. You might remember that we run fitting and prediction sequentially in python. Here it is a little different. The fit command is handled asynchronously. That's why you must not run the fit and predict commands on separate lines as demonstrated below. Otherwise, the predict command dumps its results before training is over.
//you should not run the prediction in this way
const history = model.fit(xtrain, ytrain, {epochs: 200});
console.log("fit is over");
model.predict(xtrain).print();
The prediction should run in the fit command's then callback, as illustrated below.
const history = model.fit(xtrain, ytrain, {epochs: 200})
.then(() => {
    console.log("fit is over");
    //model.predict(tf.tensor2d([[0, 0], [0, 1], [1, 0], [1, 1]])).print();
    model.predict(xtrain).print();
});
Coding is over for the client-side solution. Now you can open the hello.html file in the browser. Don't be surprised when you see a blank page. You can see the final predictions by pressing the F12 key in Chrome, or you can reach the same place under Settings (3 dots on the top right) > More tools > Developer tools > Console tab.

So, we can successfully do machine learning in the browser as shown above. But TensorFlow.js goes beyond ML in the browser. Let's see how.
Server-side capabilities were recently enabled for javascript in Node.js. We can run (almost) the same code on a Node.js server. In this case, you have to install Node.js on your computer. I installed the version recommended today, 8.11.4. You can run the node command in the command prompt after installation.

You should run the following command if you are running node.js for the first time. This creates a package.json file in the current directory. Otherwise, the tensorflow.js installation would not complete successfully. BTW, I ran the command on my desktop.
npm init
You can install the TensorFlow.js package after initialization. Note that the flag below is a double dash followed by save, even though it may render like a single dash in the browser.

npm install @tensorflow/tfjs --save
That's it! Your environment is ready. Please create a hello.js file. The content of the file will look like this.
var tf = require('@tensorflow/tfjs');

const model = tf.sequential();
model.add(tf.layers.dense({units: 5, activation: 'sigmoid', inputShape: [2]}));
model.add(tf.layers.dense({units: 2, activation: 'softmax'}));
model.compile({loss: 'categoricalCrossentropy', optimizer: tf.train.adam(0.1)});

const xtrain = tf.tensor2d([[0, 0], [0, 1], [1, 0], [1, 1]]);
const ytrain = tf.tensor2d([[1, 0], [0, 1], [0, 1], [1, 0]]);

const history = model.fit(xtrain, ytrain, {epochs: 200})
.then(() => {
    console.log("fit is over");
    model.predict(xtrain).print();
});
As seen, we've run the same code. The model has learnt the principles of the xor logic gate successfully.

So, we have mentioned the javascript version of TensorFlow in this post. TensorFlow is not just a tool for research. For instance, Facebook developed both the PyTorch and Caffe2 frameworks for deep learning, but it uses PyTorch for research purposes whereas it uses Caffe2 for production. On the other hand, Google enabled TensorFlow for both research and production. It seems that we will see TensorFlow.js much more commonly in the following days.
The post 10 Interview Questions Asked in Machine Learning appeared first on Sefik Ilkin Serengil.
Rewarding branches based on profits might not be fair, because some branches have higher profits and some have more customers. This rewards the lucky ones. You might apply unsupervised learning and create clusters based on profitability, turnover, transaction volume, number of customers or region; it is like customer segmentation. Then you should evaluate each branch based on where it stands within its cluster. In this way, each branch competes against same-weight competitors. Otherwise, it would be like putting a lightweight boxer in front of a heavyweight one. In fact, there might be several champions for different weight groups.
This is a rare event detection problem. Classifiers expect homogeneous data during training to produce satisfactory results, but we cannot always expect balanced data. Firstly, you can feed a smaller number of randomly selected instances to decrease the number of non-fraud transactions. This is called sub-sampling. But this causes important data to be lost, so we would not often prefer it. Secondly, we can increase the number of fraud transactions by creating synthetic fraud data. For example, you can pick two random existing fraud instances, calculate the average of the transaction amount for these two instances, and assign the average amount to a new instance. This is called over-sampling. This increases the number of fraud instances. This approach might be preferable to sub-sampling for the fraud case, but it is still dangerous because it feeds non-existing data to the model. It is like having imaginary friends!
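The over-sampling idea described above can be sketched in a few lines; the fraud amounts below are made-up numbers:

```python
import random

random.seed(0)

# made-up transaction amounts of existing fraud instances
fraud_amounts = [120.0, 300.0, 95.0, 150.0]

synthetic = []
for _ in range(4):
    a, b = random.sample(fraud_amounts, 2)   # pick two random existing fraud instances
    synthetic.append((a + b) / 2)            # new synthetic instance gets the average

print(synthetic)
```

Every synthetic amount necessarily falls between the minimum and maximum of the real fraud amounts, which illustrates both the appeal and the danger: the new points look plausible but were never observed.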
We can ignore the fraud mark and treat the problem as anomaly detection. However, we should work on each customer's transactions individually. Suppose that the transactions of a customer (e.g. named Sefik) have a normal distribution. The mean (µ) and standard deviation (σ) of the transaction amount will enlighten us. We already know that 3 standard deviations around the mean (µ ± 3σ) cover 99.7% of the whole area. We can apply this logic to a customer's transactions. For example, if a customer spends 100$ on average with a standard deviation of 10$, then 99.7% of expenses must be less than 130$ and greater than 70$. You can mark any transaction of that customer as abnormal if it is greater than 130$. It might not be fraud, but it is still abnormal. In this way, we can have an idea about unmarked transactions. BTW, you can widen the band to increase the coverage even further.
We thought about the problem using transaction amount only. We can increase the dimensionality by adding extra information such as time and location.
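The 3-sigma rule above can be sketched with the standard library; the per-customer amounts are made-up numbers:

```python
import statistics

# made-up transaction amounts for one customer
amounts = [95, 102, 98, 110, 90, 105, 100, 99, 101, 100]

mu = statistics.mean(amounts)       # 100
sigma = statistics.pstdev(amounts)  # population standard deviation
lower, upper = mu - 3 * sigma, mu + 3 * sigma

def is_abnormal(amount):
    # flag transactions outside the mu +/- 3*sigma band
    return amount < lower or amount > upper

print(lower, upper, is_abnormal(150), is_abnormal(100))
```

A transaction of 150$ falls well outside the band and gets flagged, while a typical 100$ transaction does not.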
Some machine learning models such as neural networks or support vector machines produce opaque models. This means that their decisions cannot be read and understood by a human; everything is handled in a black box. On the other hand, a decision tree algorithm produces transparent decisions, which can be read and understood by a human clearly. In other words, you can follow the steps leading to a decision. For example, look at the following decision tree. If your decision were to accept the offer, it would be because the company offers free coffee, the commute does not last more than 1 hour, and the salary is greater than 50K.

That's why you should build a decision tree for credit decisioning. Herein, the most common decision tree algorithms for classification are ID3, C4.5 and CART. CART can also be adapted to regression problems.
You might either be solving an insignificant problem, like how many legs a cow has, or you have overfitted; most probably the second one. Even the most advanced AI models or intelligent life forms fail. You should not expect to get 100% accuracy, ever. Just as senior developers do not expect new programs to work bug-free the first time (that only makes junior developers happy), machine learning practitioners should never expect 100%. If you still believe you can solve a problem with 100% accuracy, then it is automation: you can create a rule-based model and there is no need for AI.
Remember the fraud detection data set. Suppose that there are 1M legal transactions and 100 fraud transactions. This means that 99.99% of the data set corresponds to legal transactions whereas 0.01% corresponds to fraud. In this case, you get 99.99% accuracy by returning not-fraud by default. Is this a success? Of course not! Here, the important thing is how many of the truly fraud instances you can classify correctly. The confusion matrix and ROC curve become important instead of overall accuracy. If the true positive and true negative rates are both close to 100%, that would be a good job.
Besides, if your problem is about human health, then 99.99% accuracy means that you may cause the death of 1 person in every 10,000 people. So, metrics might have different meanings for different problems.
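The accuracy trap above can be verified with a short sketch of the lazy default classifier on the fraud numbers:

```python
n_legal, n_fraud = 1_000_000, 100

# a lazy "model" that always predicts not-fraud
true_negatives = n_legal          # every legal transaction is "correct"
false_negatives = n_fraud         # every fraud transaction is missed

accuracy = true_negatives / (n_legal + n_fraud)
recall = 0 / n_fraud              # fraction of fraud instances caught

print(accuracy, recall)
```

Accuracy exceeds 99.99% while recall on the fraud class is exactly zero, which is why the confusion matrix matters more than overall accuracy here.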
Funny, but it includes regression, classification and clustering all at once. It predicts the weather temperature in Fahrenheit or Celsius degrees. This is regression because continuous outputs are produced. Moreover, it classifies the weather as partly sunny, raining or snowing. This is classification because there are a limited number of classes. Finally, it includes unsupervised learning: it clusters some cities/states based on their geographic location.
If you run a decision tree algorithm, it tends to over-fit on large scale data sets. A basic countermeasure is to apply random forest. It separates the data set into several sub data sets (mostly a prime number of them). Then, a different decision tree is created for each of those sub data sets. The final decisions of these trees together specify the global decision. Moreover, you can apply pruning to avoid over-fitting.
On the other hand, a neural network is based on updating weights over epochs. You should monitor the training set and validation set error over epochs. The training set error will decrease over iterations, but if the validation set error starts to increase at some epoch, you should terminate training there. Moreover, you might have created a really complex neural network model (input features, number of hidden layers and nodes). In that case, you should re-design a less complex model.
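The monitoring policy above can be sketched in plain Python. The error curve and the `patience` tolerance below are made-up illustrations, not values from any real training run.

```python
# Early stopping sketch: stop when the validation error has not
# improved for `patience` consecutive epochs.
def early_stop_epoch(val_errors, patience=2):
    best, best_epoch = float("inf"), 0
    for epoch, err in enumerate(val_errors):
        if err < best:
            best, best_epoch = err, epoch
        elif epoch - best_epoch >= patience:
            return epoch  # validation error kept rising: terminate here
    return len(val_errors) - 1  # never triggered: train to the end

# hypothetical validation error curve: improves until epoch 3, then rises
val_errors = [0.90, 0.70, 0.55, 0.50, 0.52, 0.57, 0.61]
print(early_stop_epoch(val_errors))  # 5: two epochs after the best epoch
```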
This question might seem very easy but it is a tricky one. Traditional developers tend to design this kind of system with for loops.
import numpy as np

inputs = np.array([1, 0, 1])
weights = np.array([0.3, 0.8, 0.4])

#scalar loop: one multiply-add per element
total = 0
for i in range(inputs.shape[0]):
    total = total + inputs[i] * weights[i]
print(total)
However, machine learning practitioners must not apply this approach. They should apply matrix multiplication instead, because the vectorized solution speeds up processing by a factor of almost 150.
import numpy as np

inputs = np.array([1, 0, 1])
weights = np.array([0.3, 0.8, 0.4])

#vectorized inner product (transpose is a no-op on 1-D arrays)
total = np.matmul(np.transpose(weights), inputs)
print(total)
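You can measure the gap yourself with a rough benchmark; the exact speed-up depends on hardware and vector size (the ~150x figure above is the author's own measurement), but the vectorized version wins comfortably.

```python
import timeit

import numpy as np

# toy vectors; 10,000 elements is enough to show the difference
rng = np.random.default_rng(0)
inputs = rng.random(10_000)
weights = rng.random(10_000)

def loop_sum():
    # scalar loop: one Python-level multiply-add per element
    total = 0.0
    for i in range(inputs.shape[0]):
        total += inputs[i] * weights[i]
    return total

t_loop = timeit.timeit(loop_sum, number=20)
t_vec = timeit.timeit(lambda: np.dot(inputs, weights), number=20)
print(f"loop: {t_loop:.4f}s, vectorized: {t_vec:.4f}s")
```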
Your data set can have thousands of features. Feeding all of them in produces a much more complex model: training lasts longer and the model tends to over-fit. Dropping some features reduces the complexity and speeds up training, but we might lose significant information. Autoencoders are a typical way to represent data with reduced dimensions. They zip the data (lossy compression), which gives you a less complex model and faster training while losing far less information than simply dropping features.
Besides, face recognition technology and art style transfer techniques are mainly based on dimension reduction and autoencoders.
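As a rough sketch of the compression idea, note that a purely linear autoencoder learns the same subspace as PCA, so the encode/decode round trip can be imitated with an SVD. The toy data and the bottleneck size of 2 below are arbitrary.

```python
import numpy as np

# random 6-dimensional toy data
rng = np.random.default_rng(42)
X = rng.random((100, 6))
Xc = X - X.mean(axis=0)

# "encoder": project onto the top-2 principal directions (the bottleneck)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
codes = Xc @ Vt[:2].T  # 100 x 2 compressed representation

# "decoder": map the 2-D codes back to 6-D (lossy reconstruction)
X_hat = codes @ Vt[:2] + X.mean(axis=0)

print(codes.shape)                # (100, 2)
print(np.mean((X - X_hat) ** 2))  # non-zero: the compression is lossy
```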
So, I collected some job interview questions asked to data scientists and machine learning practitioners, and I tried to respond. Responses reflect my personal opinions; you might find some answers true or partially false. These questions are asked to test the solution approach of a candidate. In other words, the solution approach is more important than the pure answer.
The post 10 Interview Questions Asked in Machine Learning appeared first on Sefik Ilkin Serengil.
The post Face Recognition with FaceNet in Keras appeared first on Sefik Ilkin Serengil.
We will apply transfer learning to benefit from the outcomes of previous research. David Sandberg shared pre-trained weights after 30 hours of training with a GPU. However, that work was on raw TensorFlow. Your friendly neighborhood blogger converted the pre-trained weights into Keras format. I put the weights in Google Drive because they exceed the upload size limit of GitHub. You can find the pre-trained weights here. Also, FaceNet has a very complex model structure. You can find the model structure here in json format.
We can create the FaceNet model as illustrated below.
from keras.models import model_from_json

#facenet model structure: https://github.com/serengil/tensorflow-101/blob/master/model/facenet_model.json
model = model_from_json(open("facenet_model.json", "r").read())

#pre-trained weights: https://drive.google.com/file/d/1971Xk5RwedbudGgTIrGAL4F7Aifu7id1/view?usp=sharing
model.load_weights('facenet_weights.h5')

model.summary()
The FaceNet model expects 160×160 RGB images and produces 128-dimensional representations. These auto-encoded representations are called embeddings in the research paper. Additionally, the researchers put an extra l2 normalization layer at the end of the network. Remember what l2 normalization is.
l2 = √(∑ x_{i}^{2}) for i = 1 to n, for an n-dimensional vector
They also constrained the 128-dimensional output embedding to live on the 128-dimensional hypersphere. This means that the raw output should be divided element-wise by its l2 norm.
import numpy as np

def l2_normalize(x):
    return x / np.sqrt(np.sum(np.multiply(x, x)))
The researchers also mentioned that they used euclidean distance instead of cosine similarity to measure the similarity between two vectors. Euclidean distance is basically the distance between two vectors in euclidean space.
def findEuclideanDistance(source_representation, test_representation):
    euclidean_distance = source_representation - test_representation
    euclidean_distance = np.sum(np.multiply(euclidean_distance, euclidean_distance))
    euclidean_distance = np.sqrt(euclidean_distance)
    return euclidean_distance
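Before feeding real embeddings, the two helpers above can be sanity-checked on a random 128-dimensional vector: normalization should yield unit length, and the distance from a vector to itself should be zero.

```python
import numpy as np

def l2_normalize(x):
    return x / np.sqrt(np.sum(np.multiply(x, x)))

def findEuclideanDistance(source_representation, test_representation):
    diff = source_representation - test_representation
    return np.sqrt(np.sum(np.multiply(diff, diff)))

v = l2_normalize(np.random.default_rng(1).random(128))
print(np.sqrt(np.sum(v * v)))       # ~1.0: unit length after normalization
print(findEuclideanDistance(v, v))  # 0.0: identical vectors, zero distance
```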
Finally, we can find the distance between two different images via FaceNet.
img1_representation = l2_normalize(model.predict(preprocess_image('img1.jpg'))[0,:])
img2_representation = l2_normalize(model.predict(preprocess_image('img2.jpg'))[0,:])

euclidean_distance = findEuclideanDistance(img1_representation, img2_representation)
The distance should be small for images of the same person and large for pictures of different people. The threshold was set to 0.20 in the research paper, but I got more successful results when it is set to 0.35.
threshold = 0.35
if euclidean_distance < threshold:
    print("verified... they are same person")
else:
    print("unverified! they are not same person!")
Still, we can check the cosine similarity between two vectors. In this case, I got the most successful results when I set the threshold to 0.07. Notice that l2 normalization is skipped for this metric.
def findCosineSimilarity(source_representation, test_representation):
    a = np.matmul(np.transpose(source_representation), test_representation)
    b = np.sum(np.multiply(source_representation, source_representation))
    c = np.sum(np.multiply(test_representation, test_representation))
    return 1 - (a / (np.sqrt(b) * np.sqrt(c)))

img1_representation = model.predict(preprocess_image('img1.jpg'))[0,:]
img2_representation = model.predict(preprocess_image('img2.jpg'))[0,:]

cosine_similarity = findCosineSimilarity(img1_representation, img2_representation)
print("cosine similarity: ", cosine_similarity)

threshold = 0.07
if cosine_similarity < threshold:
    print("verified... they are same person")
else:
    print("unverified! they are not same person!")
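A side note on why both metrics work: for l2-normalized vectors, squared euclidean distance equals 2 times the cosine distance, so the two thresholds order image pairs in the same way. A quick check on random vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random(128)
b = rng.random(128)
a = a / np.linalg.norm(a)  # l2-normalize both vectors
b = b / np.linalg.norm(b)

euclidean_sq = np.sum((a - b) ** 2)  # ||a - b||^2
cosine_distance = 1 - np.dot(a, b)   # 1 - cosine similarity
print(np.isclose(euclidean_sq, 2 * cosine_distance))  # True
```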
Well, we designed the model. The important thing is how successful the designed model is. I tested FaceNet with the same instances used in the VGG-Face tests.
It succeeded when I tested the model for really different Angelina Jolie images.
Similarly, FaceNet succeeded when tested on different photos of Jennifer Aniston.
We can process true negative cases successfully.
So, we’ve implemented Google’s face recognition model on-premise in this post. We have combined representations with autoencoders, transfer learning and vector similarity concepts to build FaceNet. The original paper includes face alignment steps but we skipped them in this post; instead, I fed already aligned images as inputs. Moreover, FaceNet has a much more complex model structure than VGG-Face, which might cause it to produce slower results in real time. Still, VGG-Face produces more successful results than FaceNet based on my experiments. Finally, I pushed the code of this post to GitHub.
The post Hyperbolic Secant As Neural Networks Activation Function appeared first on Sefik Ilkin Serengil.
Some resources mention the function as the reciprocal of the hyperbolic cosine, or 1/cosh. Remember the formula of hyperbolic cosine.
y = 1 / cosh(x) where cosh(x) = (e^{x} + e^{-x})/2
So, the pure form of the function is formulated below.
y = 2 / (e^{x} + e^{-x})
The function produces outputs in the scale of (0, 1]. The output decreases and approaches zero as x goes to infinity. However, it will never produce exactly 0 even for very large inputs, except at ±∞.
The hyperbolic secant formula contributes to the feed forward step in neural networks, whereas the derivative of the function is involved in back propagation.
y = 2 . (e^{x} + e^{-x})^{-1}
dy/dx = 2.(-1).(e^{x} + e^{-x})^{-2}.[d(e^{x} + e^{-x})/dx]
dy/dx = 2.(-1).(e^{x} + e^{-x})^{-2}.(e^{x} + (-1).e^{-x}) = 2.(-1).(e^{x} + e^{-x})^{-2}.(e^{x} – e^{-x})
dy/dx = (-2).(e^{x} – e^{-x})/(e^{x} + e^{-x})^{2}
dy/dx = 2.(-e^{x} + e^{-x})/(e^{x} + e^{-x})^{2}
Or we can rearrange the derivative into a simpler form. Adding and subtracting e^{x} in the numerator would not change the result.
dy/dx = 2.(e^{x }– e^{x }– e^{x} + e^{-x})/(e^{x} + e^{-x})^{2} = 2.(e^{x }+ e^{-x} -e^{x }– e^{x} )/(e^{x} + e^{-x})^{2}
dy/dx = 2(e^{x }+ e^{-x})/[(e^{x }+ e^{-x}).(e^{x }+ e^{-x})] -2.(e^{x} + e^{x})/(e^{x} + e^{-x})^{2}
dy/dx = 2/(e^{x }+ e^{-x}) – (2.2e^{x})/(e^{x} + e^{-x})(e^{x} + e^{-x})
dy/dx = 2/(e^{x }+ e^{-x}) – 2[2/(e^{x} + e^{-x})].[e^{x}/(e^{x} + e^{-x})]
You might realize that the term above contains hyperbolic secant function. Put y instead of 2/(e^{x }+ e^{-x}).
dy/dx = y – 2y.[e^{x}/(e^{x} + e^{-x})]
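We can double-check the derivation numerically: the closed form above should agree with a finite-difference estimate of the derivative of y = 2/(e^{x} + e^{-x}).

```python
import numpy as np

def sech(x):
    # hyperbolic secant: 2 / (e^x + e^-x)
    return 2.0 / (np.exp(x) + np.exp(-x))

def sech_derivative(x):
    # the derived form: dy/dx = y - 2y * e^x / (e^x + e^-x)
    y = sech(x)
    return y - 2 * y * np.exp(x) / (np.exp(x) + np.exp(-x))

x, h = 0.7, 1e-6
numeric = (sech(x + h) - sech(x - h)) / (2 * h)  # central difference
print(abs(sech_derivative(x) - numeric) < 1e-8)  # True
```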
Notice that both the function and its derivative have a high computation cost.
The post A Step By Step Regression Tree Example appeared first on Sefik Ilkin Serengil.
The following data set might look familiar. We used a similar data set in previous posts, but that one denoted the golf playing decision based on some factors. In other words, the golf playing decision was a nominal target consisting of true and false values. Herein, the target column is the number of golf players and it stores real numbers. When the target was nominal, we counted the number of instances for each class; I mean that we could create branches based on the number of instances for true decisions and false decisions. Here, we cannot count the target values because the target is continuous. Instead of counting, we can handle regression problems by switching the metric to standard deviation.
Day | Outlook | Temp. | Humidity | Wind | Golf Players |
1 | Sunny | Hot | High | Weak | 25 |
2 | Sunny | Hot | High | Strong | 30 |
3 | Overcast | Hot | High | Weak | 46 |
4 | Rain | Mild | High | Weak | 45 |
5 | Rain | Cool | Normal | Weak | 52 |
6 | Rain | Cool | Normal | Strong | 23 |
7 | Overcast | Cool | Normal | Strong | 43 |
8 | Sunny | Mild | High | Weak | 35 |
9 | Sunny | Cool | Normal | Weak | 38 |
10 | Rain | Mild | Normal | Weak | 46 |
11 | Sunny | Mild | Normal | Strong | 48 |
12 | Overcast | Mild | High | Strong | 52 |
13 | Overcast | Hot | Normal | Weak | 44 |
14 | Rain | Mild | High | Strong | 30 |
Golf players = {25, 30, 46, 45, 52, 23, 43, 35, 38, 46, 48, 52, 44, 30}
Average of golf players = (25 + 30 + 46 + 45 + 52 + 23 + 43 + 35 + 38 + 46 + 48 + 52 + 44 + 30)/14 = 39.78
Standard deviation of golf players = √[( (25 – 39.78)^{2} + (30 – 39.78)^{2} + (46 – 39.78)^{2} + … + (30 – 39.78)^{2} )/14] = 9.32
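Both statistics are easy to reproduce with numpy. Note the population standard deviation (dividing by n = 14, i.e. ddof=0) to match the formula above; numpy's mean rounds to 39.79 where the post truncates to 39.78.

```python
import numpy as np

# target column of the data set above
golf_players = np.array([25, 30, 46, 45, 52, 23, 43, 35, 38, 46, 48, 52, 44, 30])

print(round(float(golf_players.mean()), 2))       # 39.79
print(round(float(golf_players.std(ddof=0)), 2))  # 9.32
```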
Outlook can be sunny, overcast and rain. We need to calculate standard deviation of golf players for all of these outlook candidates.
Day | Outlook | Temp. | Humidity | Wind | Golf Players |
1 | Sunny | Hot | High | Weak | 25 |
2 | Sunny | Hot | High | Strong | 30 |
8 | Sunny | Mild | High | Weak | 35 |
9 | Sunny | Cool | Normal | Weak | 38 |
11 | Sunny | Mild | Normal | Strong | 48 |
Golf players for sunny outlook = {25, 30, 35, 38, 48}
Average of golf players for sunny outlook = (25+30+35+38+48)/5 = 35.2
Standard deviation of golf players for sunny outlook = √(((25 – 35.2)^{2} + (30 – 35.2)^{2} + … )/5) = 7.78
Day | Outlook | Temp. | Humidity | Wind | Golf Players |
3 | Overcast | Hot | High | Weak | 46 |
7 | Overcast | Cool | Normal | Strong | 43 |
12 | Overcast | Mild | High | Strong | 52 |
13 | Overcast | Hot | Normal | Weak | 44 |
Golf players for overcast outlook = {46, 43, 52, 44}
Average of golf players for overcast outlook = (46 + 43 + 52 + 44)/4 = 46.25
Standard deviation of golf players for overcast outlook = √(((46-46.25)^{2}+(43-46.25)^{2}+…)/4) = 3.49
Day | Outlook | Temp. | Humidity | Wind | Golf Players |
4 | Rain | Mild | High | Weak | 45 |
5 | Rain | Cool | Normal | Weak | 52 |
6 | Rain | Cool | Normal | Strong | 23 |
10 | Rain | Mild | Normal | Weak | 46 |
14 | Rain | Mild | High | Strong | 30 |
Golf players for rain outlook = {45, 52, 23, 46, 30}
Average of golf players for rain outlook = (45+52+23+46+30)/5 = 39.2
Standard deviation of golf players for rain outlook = √(((45 – 39.2)^{2}+(52 – 39.2)^{2}+…)/5) = 10.87
Outlook | Stdev of Golf Players | Instances |
Overcast | 3.49 | 4 |
Rain | 10.87 | 5 |
Sunny | 7.78 | 5 |
Weighted standard deviation for outlook = (4/14)x3.49 + (5/14)x10.87 + (5/14)x7.78 = 7.66
You might remember that we calculated the global standard deviation of golf players as 9.32 in the previous step. Standard deviation reduction is the difference between the global standard deviation and the weighted standard deviation for the current feature. The feature that maximizes the standard deviation reduction will be the decision node.
Standard deviation reduction for outlook = 9.32 – 7.66 = 1.66
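The whole calculation for the outlook feature fits in a few lines of numpy; the branch subsets below are copied from the tables above.

```python
import numpy as np

# sub data sets per outlook value, taken from the tables above
subsets = {
    "Sunny":    [25, 30, 35, 38, 48],
    "Overcast": [46, 43, 52, 44],
    "Rain":     [45, 52, 23, 46, 30],
}
all_players = np.concatenate([np.array(v, dtype=float) for v in subsets.values()])
global_std = all_players.std(ddof=0)

# weighted standard deviation over the branches
weighted = sum(len(v) / len(all_players) * np.std(v) for v in subsets.values())
print(round(float(global_std - weighted), 2))  # 1.66
```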
Temperature can be hot, cool or mild. We will calculate standard deviations for those candidates.
Day | Outlook | Temp. | Humidity | Wind | Golf Players |
1 | Sunny | Hot | High | Weak | 25 |
2 | Sunny | Hot | High | Strong | 30 |
3 | Overcast | Hot | High | Weak | 46 |
13 | Overcast | Hot | Normal | Weak | 44 |
Golf players for hot temperature = {25, 30, 46, 44}
Standard deviation of golf players for hot temperature = 8.95
Day | Outlook | Temp. | Humidity | Wind | Golf Players |
5 | Rain | Cool | Normal | Weak | 52 |
6 | Rain | Cool | Normal | Strong | 23 |
7 | Overcast | Cool | Normal | Strong | 43 |
9 | Sunny | Cool | Normal | Weak | 38 |
Golf players for cool temperature = {52, 23, 43, 38}
Standard deviation of golf players for cool temperature = 10.51
Day | Outlook | Temp. | Humidity | Wind | Golf Players |
4 | Rain | Mild | High | Weak | 45 |
8 | Sunny | Mild | High | Weak | 35 |
10 | Rain | Mild | Normal | Weak | 46 |
11 | Sunny | Mild | Normal | Strong | 48 |
12 | Overcast | Mild | High | Strong | 52 |
14 | Rain | Mild | High | Strong | 30 |
Golf players for mild temperature = {45, 35, 46, 48, 52, 30}
Standard deviation of golf players for mild temperature = 7.65
Temperature | Stdev of Golf Players | Instances |
Hot | 8.95 | 4 |
Cool | 10.51 | 4 |
Mild | 7.65 | 6 |
Weighted standard deviation for temperature = (4/14)x8.95 + (4/14)x10.51 + (6/14)x7.65 = 8.84
Standard deviation reduction for temperature = 9.32 – 8.84 = 0.48
Humidity is a binary class. It can either be normal or high.
Day | Outlook | Temp. | Humidity | Wind | Golf Players |
1 | Sunny | Hot | High | Weak | 25 |
2 | Sunny | Hot | High | Strong | 30 |
3 | Overcast | Hot | High | Weak | 46 |
4 | Rain | Mild | High | Weak | 45 |
8 | Sunny | Mild | High | Weak | 35 |
12 | Overcast | Mild | High | Strong | 52 |
14 | Rain | Mild | High | Strong | 30 |
Golf players for high humidity = {25, 30, 46, 45, 35, 52, 30}
Standard deviation for golf players for high humidity = 9.36
Day | Outlook | Temp. | Humidity | Wind | Golf Players |
5 | Rain | Cool | Normal | Weak | 52 |
6 | Rain | Cool | Normal | Strong | 23 |
7 | Overcast | Cool | Normal | Strong | 43 |
9 | Sunny | Cool | Normal | Weak | 38 |
10 | Rain | Mild | Normal | Weak | 46 |
11 | Sunny | Mild | Normal | Strong | 48 |
13 | Overcast | Hot | Normal | Weak | 44 |
Golf players for normal humidity = {52, 23, 43, 38, 46, 48, 44}
Standard deviation for golf players for normal humidity = 8.73
Humidity | Stdev of Golf Player | Instances |
High | 9.36 | 7 |
Normal | 8.73 | 7 |
Weighted standard deviation for humidity = (7/14)x9.36 + (7/14)x8.73 = 9.04
Standard deviation reduction for humidity = 9.32 – 9.04 = 0.27
Wind is a binary class, too. It can either be Strong or Weak.
Day | Outlook | Temp. | Humidity | Wind | Golf Players |
2 | Sunny | Hot | High | Strong | 30 |
6 | Rain | Cool | Normal | Strong | 23 |
7 | Overcast | Cool | Normal | Strong | 43 |
11 | Sunny | Mild | Normal | Strong | 48 |
12 | Overcast | Mild | High | Strong | 52 |
14 | Rain | Mild | High | Strong | 30 |
Golf players for strong wind = {30, 23, 43, 48, 52, 30}
Standard deviation for golf players for strong wind = 10.59
Day | Outlook | Temp. | Humidity | Wind | Golf Players |
1 | Sunny | Hot | High | Weak | 25 |
3 | Overcast | Hot | High | Weak | 46 |
4 | Rain | Mild | High | Weak | 45 |
5 | Rain | Cool | Normal | Weak | 52 |
8 | Sunny | Mild | High | Weak | 35 |
9 | Sunny | Cool | Normal | Weak | 38 |
10 | Rain | Mild | Normal | Weak | 46 |
13 | Overcast | Hot | Normal | Weak | 44 |
Golf players for weak wind = {25, 46, 45, 52, 35, 38, 46, 44}
Standard deviation for golf players for weak wind = 7.87
Wind | Stdev of Golf Player | Instances |
Strong | 10.59 | 6 |
Weak | 7.87 | 8 |
Weighted standard deviation for wind = (6/14)x10.59 + (8/14)x7.87 = 9.03
Standard deviation reduction for wind = 9.32 – 9.03 = 0.29
So, we’ve calculated standard deviation reduction values for all features. The winner is outlook because it has the highest score.
Feature | Standard Deviation Reduction |
Outlook | 1.66 |
Temperature | 0.48 |
Humidity | 0.27 |
Wind | 0.29 |
We’ll put the outlook decision at the top of the decision tree. Let’s monitor the new sub data sets for the candidate branches of the outlook feature.
Day | Outlook | Temp. | Humidity | Wind | Golf Players |
1 | Sunny | Hot | High | Weak | 25 |
2 | Sunny | Hot | High | Strong | 30 |
8 | Sunny | Mild | High | Weak | 35 |
9 | Sunny | Cool | Normal | Weak | 38 |
11 | Sunny | Mild | Normal | Strong | 48 |
Golf players for sunny outlook = {25, 30, 35, 38, 48}
Standard deviation for sunny outlook = 7.78
Notice that we will use this standard deviation value as global standard deviation for this sub data set.
Day | Outlook | Temp. | Humidity | Wind | Golf Players |
1 | Sunny | Hot | High | Weak | 25 |
2 | Sunny | Hot | High | Strong | 30 |
Standard deviation for sunny outlook and hot temperature = 2.5
Day | Outlook | Temp. | Humidity | Wind | Golf Players |
9 | Sunny | Cool | Normal | Weak | 38 |
Standard deviation for sunny outlook and cool temperature = 0
Day | Outlook | Temp. | Humidity | Wind | Golf Players |
8 | Sunny | Mild | High | Weak | 35 |
11 | Sunny | Mild | Normal | Strong | 48 |
Standard deviation for sunny outlook and mild temperature = 6.5
Temperature | Stdev for Golf Players | Instances |
Hot | 2.5 | 2 |
Cool | 0 | 1 |
Mild | 6.5 | 2 |
Weighted standard deviation for sunny outlook and temperature = (2/5)x2.5 + (1/5)x0 + (2/5)x6.5 = 3.6
Standard deviation reduction for sunny outlook and temperature = 7.78 – 3.6 = 4.18
Day | Outlook | Temp. | Humidity | Wind | Golf Players |
1 | Sunny | Hot | High | Weak | 25 |
2 | Sunny | Hot | High | Strong | 30 |
8 | Sunny | Mild | High | Weak | 35 |
Standard deviation for sunny outlook and high humidity = 4.08
Day | Outlook | Temp. | Humidity | Wind | Golf Players |
9 | Sunny | Cool | Normal | Weak | 38 |
11 | Sunny | Mild | Normal | Strong | 48 |
Standard deviation for sunny outlook and normal humidity = 5
Humidity | Stdev for Golf Players | Instances |
High | 4.08 | 3 |
Normal | 5.00 | 2 |
Weighted standard deviations for sunny outlook and humidity = (3/5)x4.08 + (2/5)x5 = 4.45
Standard deviation reduction for sunny outlook and humidity = 7.78 – 4.45 = 3.33
Day | Outlook | Temp. | Humidity | Wind | Golf Players |
2 | Sunny | Hot | High | Strong | 30 |
11 | Sunny | Mild | Normal | Strong | 48 |
Standard deviation for sunny outlook and strong wind = 9
Day | Outlook | Temp. | Humidity | Wind | Golf Players |
1 | Sunny | Hot | High | Weak | 25 |
8 | Sunny | Mild | High | Weak | 35 |
9 | Sunny | Cool | Normal | Weak | 38 |
Standard deviation for sunny outlook and weak wind = 5.56
Wind | Stdev for Golf Players | Instances |
Strong | 9 | 2 |
Weak | 5.56 | 3 |
Weighted standard deviations for sunny outlook and wind = (2/5)x9 + (3/5)x5.56 = 6.93
Standard deviation reduction for sunny outlook and wind = 7.78 – 6.93 = 0.85
We’ve calculated standard deviation reductions for sunny outlook. The winner is temperature.
Feature | Standard Deviation Reduction |
Temperature | 4.18 |
Humidity | 3.33 |
Wind | 0.85 |
The cool branch has just one instance in its sub data set. We can say that if the outlook is sunny and the temperature is cool, then there would be 38 golf players. But what about the hot branch? There are still 2 instances. Should we add another branch for weak wind and strong wind? No, we should not, because this causes over-fitting. We should terminate building branches, for example, if there are fewer than five instances in the sub data set, or if the standard deviation of the sub data set is less than 5% of that of the entire data set. I prefer to apply the first one: I will terminate a branch if there are fewer than 5 instances in the current sub data set. When this termination condition is satisfied, I will calculate the average of the sub data set. This operation is called pruning in decision trees.
The overcast outlook branch already has 4 instances in its sub data set, so we can terminate building branches for this leaf. The final decision will be the average of the following table for overcast outlook.
Day | Outlook | Temp. | Humidity | Wind | Golf Players |
3 | Overcast | Hot | High | Weak | 46 |
7 | Overcast | Cool | Normal | Strong | 43 |
12 | Overcast | Mild | High | Strong | 52 |
13 | Overcast | Hot | Normal | Weak | 44 |
If outlook is overcast, then there would be (46+43+52+44)/4 = 46.25 golf players.
Day | Outlook | Temp. | Humidity | Wind | Golf Players |
4 | Rain | Mild | High | Weak | 45 |
5 | Rain | Cool | Normal | Weak | 52 |
6 | Rain | Cool | Normal | Strong | 23 |
10 | Rain | Mild | Normal | Weak | 46 |
14 | Rain | Mild | High | Strong | 30 |
We need to find standard deviation reduction values for the rest of the features in the same way on the sub data set above.
Standard deviation for rainy outlook = 10.87
Notice that we will use this value as global standard deviation for this branch in reduction step.
Temperature | Standard deviation for golf players | instances |
Cool | 14.50 | 2 |
Mild | 7.32 | 3 |
Weighted standard deviation for rainy outlook and temperature = (2/5)x14.50 + (3/5)x7.32 = 10.19
Standard deviation reduction for rainy outlook and temperature = 10.87 – 10.19 = 0.68
Humidity | Standard deviation for golf players | instances |
High | 7.50 | 2 |
Normal | 12.50 | 3 |
Weighted standard deviation for rainy outlook and humidity = (2/5)x7.50 + (3/5)x12.50 = 10.50
Standard deviation reduction for rainy outlook and humidity = 10.87 – 10.50 = 0.37
Wind | Standard deviation for golf players | instances |
Weak | 3.09 | 3 |
Strong | 3.5 | 2 |
Weighted standard deviation for rainy outlook and wind = (3/5)x3.09 + (2/5)x3.5 = 3.25
Standard deviation reduction for rainy outlook and wind = 10.87 – 3.25 = 7.62
As illustrated below, the winner is wind feature.
Feature | Standard deviation reduction |
Temperature | 0.68 |
Humidity | 0.37 |
Wind | 7.62 |
As seen, both branches have fewer than 5 instances. Now, we can terminate these leaves based on the termination rule.
So, the final form of the decision tree is demonstrated below.
So, we have mentioned how to build decision trees for regression problems. Even though decision trees are mostly used for classification, they can be adapted to regression problems as shown. Regression trees tend to over-fit much more than classification trees, so the termination rule should be tuned carefully to avoid over-fitting. Finally, the lecture notes of Dr. Saed Sayad (University of Toronto) guided me in creating this content.
The post A Step by Step CART Decision Tree Example appeared first on Sefik Ilkin Serengil.
We will work on the same dataset as in the ID3 post. There are 14 instances of golf playing decisions based on outlook, temperature, humidity and wind factors.
Day | Outlook | Temp. | Humidity | Wind | Decision |
1 | Sunny | Hot | High | Weak | No |
2 | Sunny | Hot | High | Strong | No |
3 | Overcast | Hot | High | Weak | Yes |
4 | Rain | Mild | High | Weak | Yes |
5 | Rain | Cool | Normal | Weak | Yes |
6 | Rain | Cool | Normal | Strong | No |
7 | Overcast | Cool | Normal | Strong | Yes |
8 | Sunny | Mild | High | Weak | No |
9 | Sunny | Cool | Normal | Weak | Yes |
10 | Rain | Mild | Normal | Weak | Yes |
11 | Sunny | Mild | Normal | Strong | Yes |
12 | Overcast | Mild | High | Strong | Yes |
13 | Overcast | Hot | Normal | Weak | Yes |
14 | Rain | Mild | High | Strong | No |
The gini index is the metric CART uses for classification tasks. It is one minus the sum of squared class probabilities. We can formulate it as illustrated below.
Gini = 1 – Σ (Pi)^{2} for i=1 to number of classes
Outlook is a nominal feature. It can be sunny, overcast or rain. I will summarize the final decisions for outlook feature.
Outlook | Yes | No | Number of instances |
Sunny | 2 | 3 | 5 |
Overcast | 4 | 0 | 4 |
Rain | 3 | 2 | 5 |
Gini(Outlook=Sunny) = 1 – (2/5)^{2} – (3/5)^{2} = 1 – 0.16 – 0.36 = 0.48
Gini(Outlook=Overcast) = 1 – (4/4)^{2} – (0/4)^{2} = 0
Gini(Outlook=Rain) = 1 – (3/5)^{2} – (2/5)^{2} = 1 – 0.36 – 0.16 = 0.48
Then, we will calculate weighted sum of gini indexes for outlook feature.
Gini(Outlook) = (5/14) x 0.48 + (4/14) x 0 + (5/14) x 0.48 = 0.171 + 0 + 0.171 = 0.342
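The bookkeeping above is easy to reproduce with a small helper; the class counts are taken straight from the outlook table (the post truncates intermediate terms and reports 0.342).

```python
def gini(class_counts):
    # gini index: 1 minus the sum of squared class probabilities
    total = sum(class_counts)
    return 1 - sum((count / total) ** 2 for count in class_counts)

print(round(gini([2, 3]), 2))  # 0.48 for sunny (2 yes, 3 no)
print(gini([4, 0]))            # 0.0 for overcast (a pure node)

# weighted sum over the three outlook branches
weighted = (5/14) * gini([2, 3]) + (4/14) * gini([4, 0]) + (5/14) * gini([3, 2])
print(round(weighted, 3))  # 0.343
```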
Similarly, temperature is a nominal feature and it could have 3 different values: Cool, Hot and Mild. Let’s summarize decisions for temperature feature.
Temperature | Yes | No | Number of instances |
Hot | 2 | 2 | 4 |
Cool | 3 | 1 | 4 |
Mild | 4 | 2 | 6 |
Gini(Temp=Hot) = 1 – (2/4)^{2} – (2/4)^{2} = 0.5
Gini(Temp=Cool) = 1 – (3/4)^{2} – (1/4)^{2} = 1 – 0.5625 – 0.0625 = 0.375
Gini(Temp=Mild) = 1 – (4/6)^{2} – (2/6)^{2} = 1 – 0.444 – 0.111 = 0.445
We’ll calculate weighted sum of gini index for temperature feature
Gini(Temp) = (4/14) x 0.5 + (4/14) x 0.375 + (6/14) x 0.445 = 0.142 + 0.107 + 0.190 = 0.439
Humidity is a binary class feature. It can be high or normal.
Humidity | Yes | No | Number of instances |
High | 3 | 4 | 7 |
Normal | 6 | 1 | 7 |
Gini(Humidity=High) = 1 – (3/7)^{2} – (4/7)^{2} = 1 – 0.183 – 0.326 = 0.489
Gini(Humidity=Normal) = 1 – (6/7)^{2} – (1/7)^{2} = 1 – 0.734 – 0.02 = 0.244
Weighted sum for humidity feature will be calculated next
Gini(Humidity) = (7/14) x 0.489 + (7/14) x 0.244 = 0.367
Wind is a binary class similar to humidity. It can be weak or strong.
Wind | Yes | No | Number of instances |
Weak | 6 | 2 | 8 |
Strong | 3 | 3 | 6 |
Gini(Wind=Weak) = 1 – (6/8)^{2} – (2/8)^{2} = 1 – 0.5625 – 0.0625 = 0.375
Gini(Wind=Strong) = 1 – (3/6)^{2} – (3/6)^{2} = 1 – 0.25 – 0.25 = 0.5
Gini(Wind) = (8/14) x 0.375 + (6/14) x 0.5 = 0.428
We’ve calculated gini index values for each feature. The winner will be outlook feature because its cost is the lowest.
Feature | Gini index |
Outlook | 0.342 |
Temperature | 0.439 |
Humidity | 0.367 |
Wind | 0.428 |
We’ll put outlook decision at the top of the tree.
You might realize that sub dataset in the overcast leaf has only yes decisions. This means that overcast leaf is over.
We will apply same principles to those sub datasets in the following steps.
Focus on the sub dataset for sunny outlook. We need to find the gini index scores for temperature, humidity and wind features respectively.
Day | Outlook | Temp. | Humidity | Wind | Decision |
1 | Sunny | Hot | High | Weak | No |
2 | Sunny | Hot | High | Strong | No |
8 | Sunny | Mild | High | Weak | No |
9 | Sunny | Cool | Normal | Weak | Yes |
11 | Sunny | Mild | Normal | Strong | Yes |
Temperature | Yes | No | Number of instances |
Hot | 0 | 2 | 2 |
Cool | 1 | 0 | 1 |
Mild | 1 | 1 | 2 |
Gini(Outlook=Sunny and Temp.=Hot) = 1 – (0/2)^{2} – (2/2)^{2} = 0
Gini(Outlook=Sunny and Temp.=Cool) = 1 – (1/1)^{2} – (0/1)^{2} = 0
Gini(Outlook=Sunny and Temp.=Mild) = 1 – (1/2)^{2} – (1/2)^{2} = 1 – 0.25 – 0.25 = 0.5
Gini(Outlook=Sunny and Temp.) = (2/5)x0 + (1/5)x0 + (2/5)x0.5 = 0.2
Humidity | Yes | No | Number of instances |
High | 0 | 3 | 3 |
Normal | 2 | 0 | 2 |
Gini(Outlook=Sunny and Humidity=High) = 1 – (0/3)^{2} – (3/3)^{2} = 0
Gini(Outlook=Sunny and Humidity=Normal) = 1 – (2/2)^{2} – (0/2)^{2} = 0
Gini(Outlook=Sunny and Humidity) = (3/5)x0 + (2/5)x0 = 0
Wind | Yes | No | Number of instances |
Weak | 1 | 2 | 3 |
Strong | 1 | 1 | 2 |
Gini(Outlook=Sunny and Wind=Weak) = 1 – (1/3)^{2} – (2/3)^{2} = 0.444
Gini(Outlook=Sunny and Wind=Strong) = 1 – (1/2)^{2} – (1/2)^{2} = 0.5
Gini(Outlook=Sunny and Wind) = (3/5)x0.444 + (2/5)x0.5 = 0.466
We’ve calculated gini index scores for feature when outlook is sunny. The winner is humidity because it has the lowest value.
Feature | Gini index |
Temperature | 0.2 |
Humidity | 0 |
Wind | 0.466 |
We’ll put humidity check at the extension of sunny outlook.
As seen, decision is always no for high humidity and sunny outlook. On the other hand, decision will always be yes for normal humidity and sunny outlook. This branch is over.
Now, we need to focus on rain outlook.
Day | Outlook | Temp. | Humidity | Wind | Decision |
4 | Rain | Mild | High | Weak | Yes |
5 | Rain | Cool | Normal | Weak | Yes |
6 | Rain | Cool | Normal | Strong | No |
10 | Rain | Mild | Normal | Weak | Yes |
14 | Rain | Mild | High | Strong | No |
We’ll calculate gini index scores for temperature, humidity and wind features when outlook is rain.
Temperature | Yes | No | Number of instances |
Cool | 1 | 1 | 2 |
Mild | 2 | 1 | 3 |
Gini(Outlook=Rain and Temp.=Cool) = 1 – (1/2)^{2} – (1/2)^{2} = 0.5
Gini(Outlook=Rain and Temp.=Mild) = 1 – (2/3)^{2} – (1/3)^{2} = 0.444
Gini(Outlook=Rain and Temp.) = (2/5)x0.5 + (3/5)x0.444 = 0.466
Humidity | Yes | No | Number of instances |
High | 1 | 1 | 2 |
Normal | 2 | 1 | 3 |
Gini(Outlook=Rain and Humidity=High) = 1 – (1/2)^{2} – (1/2)^{2} = 0.5
Gini(Outlook=Rain and Humidity=Normal) = 1 – (2/3)^{2} – (1/3)^{2} = 0.444
Gini(Outlook=Rain and Humidity) = (2/5)x0.5 + (3/5)x0.444 = 0.466
Wind | Yes | No | Number of instances |
Weak | 3 | 0 | 3 |
Strong | 0 | 2 | 2 |
Gini(Outlook=Rain and Wind=Weak) = 1 – (3/3)^{2} – (0/3)^{2} = 0
Gini(Outlook=Rain and Wind=Strong) = 1 – (0/2)^{2} – (2/2)^{2} = 0
Gini(Outlook=Rain and Wind) = (3/5)x0 + (2/5)x0 = 0
The winner is wind feature for rain outlook because it has the minimum gini index score in features.
Feature | Gini index |
Temperature | 0.466 |
Humidity | 0.466 |
Wind | 0 |
Put the wind feature for rain outlook branch and monitor the new sub data sets.
As seen, decision is always yes when wind is weak. On the other hand, decision is always no if wind is strong. This means that this branch is over.
So, decision tree building is over. We have built a decision tree by hand. By the way, you might realize that we’ve created exactly the same tree as in the ID3 example. This does not mean that ID3 and CART always produce the same trees; we are just lucky. Finally, I believe that CART is easier than ID3 and C4.5, isn’t it?
The post Indeterminate Forms and L’Hospital’s Rule in Decision Trees appeared first on Sefik Ilkin Serengil.
Decision tree algorithms such as ID3 and C4.5 use entropy and gain calculations to determine the most dominant feature. A typical entropy calculation is demonstrated below for n classes.
Entropy = – Σ (i=1 to n) p(class_{i}) . log_{2}p(class_{i}) = – p(class_{1}) . log_{2}p(class_{1}) – p(class_{2}) . log_{2}p(class_{2}) – … – p(class_{n}) . log_{2}p(class_{n})
For example, if the decision class consists of 4 yes and 2 no instances, then there are 6 instances and binary classes. The entropy will be calculated as follows.
Entropy(decision) = – p(no) . log_{2}p(no) – p(yes) . log_{2}p(yes) = – (2/6) . log_{2}(2/6) – (4/6) . log_{2}(4/6) = -0.333.log_{2}(0.333) – 0.667.log_{2}(0.667) = -0.333.(-1.585) – 0.667.(-0.585) = 0.918
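This calculation is easy to sketch in Python; the function below is an illustrative helper of my own, not code from the post. Note that it blindly evaluates log_{2} of every class probability, which sets up the problem discussed next.

```python
import math

def entropy(counts):
    """Shannon entropy of a class distribution, e.g. [4, 2] for 4 yes / 2 no.
    Naive version: evaluates log2 of every probability, even zero ones."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts)

print(round(entropy([4, 2]), 3))  # 0.918, matching the hand calculation above
```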
What if the number of instances for a class is equal to 0? Let's say the decision column consists of 6 yes and 0 no examples.
Entropy(decision) = – p(no) . log_{2}p(no) – p(yes) . log_{2}p(yes) = – (0/6) . log_{2}(0/6) – (6/6) . log_{2}(6/6) = – 0 . log_{2}(0) – 1 . log_{2}(1)
Here, log_{2}(1) is equal to 0, but the problem is that log_{2}(0) is equal to –∞. So the calculation requires multiplying 0 by –∞.
Let's ask Python this question.
import math
a = 0
b = math.log(0, 2)  # log of 0 to the base 2
print(a * b)
You will face a ValueError: math domain error if you run this in Python. Similarly, Java produces NaN and Excel returns a #NUM! error.
As seen, this operation cannot be performed directly. But let's be suspicious: what if even high-level programming languages simply do not know how to compute it?
The troublesome term is x . log_{2}x as x goes to 0. We can rewrite it as a limit, moving the x multiplier to the denominator as 1 over x, which does not change the value.
lim_{(x->0)} x . log_{2}x = lim_{(x->0)} log_{2}x / (1/x) = –∞/∞
Yes, it is transformed to familiar indeterminate form of ∞/∞.
L'Hospital's rule states that if f(x) and g(x) both go to 0 (or ∞) as x approaches some point c
Condition: lim_{(x->c)} f(x) = lim_{(x->c)} g(x) = 0 (or ∞)
then the limit of f over g is equal to the limit of the derivative of f over the derivative of g.
lim_{(x->c)} f(x)/g(x) = lim_{(x->c)} f'(x)/g'(x)
Here, f(x) and g(x) must be differentiable at point c.
We have already transformed the x . log_{2}x term into the ∞/∞ indeterminate form. This means that we can apply L'Hospital's rule.
lim_{(x->0)} x . log_{2}x = lim_{(x->0)} log_{2}x / (1/x) = lim_{(x->0)} (log_{2}x)’/(1/x)’ = lim_{(x->0)} (log_{2}x)’/(x^{-1})’
Notice that the derivative of log_{2}x is 1/(x.ln(2)) and the derivative of x^{-1} is –x^{-2}.
(1/(x.ln(2))) / (–1 . x^{-2}) = [1 / (x.ln(2))] / [–1 / x^{2}] = – x^{2} / (x.ln(2)) = – x / ln(2)
This is the L'Hospital-applied version of lim_{(x->0)} x . log_{2}x
lim_{(x->0)} x . log_{2}x = lim_{(x->0)} – x / ln(2) = –0/0.693 = 0
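We can also sanity-check this limit numerically. The quick loop below is my own illustration: as x shrinks toward 0 from the right, x . log_{2}x shrinks toward 0 as well.

```python
import math

# x * log2(x) approaches 0 as x approaches 0 from the right
for x in (0.1, 0.01, 0.001, 1e-6):
    print(f"{x}: {x * math.log2(x):.6f}")
```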
The graph of x.log_{2}x over [0, +∞) is illustrated below. Surprisingly, the function is not undefined at x = 0: its limiting value there is 0.
So, this case appears often when building entropy-based decision trees, and we can only resolve it with calculus. Even high-level programming languages could not solve it for us. To sum up, programming languages do not know calculus; they are designed to perform straightforward arithmetic operations only.
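In practice, entropy implementations sidestep the problem by adopting the convention 0 . log_{2}(0) = 0, which the limit above justifies. A minimal sketch of my own, skipping zero counts instead of evaluating the undefined term:

```python
import math

def entropy(counts):
    """Entropy with the convention 0 * log2(0) = 0, justified by the limit above."""
    total = sum(counts)
    # skip zero counts instead of evaluating 0 * log2(0)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

print(entropy([6, 0]))            # pure node: entropy 0
print(round(entropy([4, 2]), 3))  # mixed node: 0.918, as computed earlier
```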
You might rethink the idea of a takeover by some kind of evil AI or killer robots. They are not capable of applying even basic calculus. This is a basic answer to why AI cannot take over human dominance on earth.
As an antithesis, none of the best predators on earth (orca, lion, white shark, Siberian tiger, king cobra) even knows how to count (Alper Ozpinar). But do not forget that these species could not take over human dominance either. Heavy-handed force might not put you at the top of the food chain pyramid.
PS: Thanks to Valentin Cold for informing me and raising awareness about this subject.
]]>