The post A Step by Step Hill Cipher Example appeared first on Sefik Ilkin Serengil.
First, the sender and receiver parties need to agree on a secret key. This key must be a square matrix.
import numpy as np

key = np.array([
    [3, 10, 20],
    [20, 9, 17],
    [9, 4, 17]
])

key_rows = key.shape[0]
key_columns = key.shape[1]

if key_rows != key_columns:
    raise Exception('key must be a square matrix!')
The key matrix must be invertible modulo 26. A non-zero determinant is necessary but not sufficient: the determinant must also be coprime to the alphabet size 26, otherwise the modular inverse does not exist.
if np.linalg.det(key) == 0:
    raise Exception('key matrix must be invertible')
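A stricter check (my own sketch, not from the original post) also verifies that the determinant, reduced modulo 26, is coprime to 26 — which is exactly when the modular inverse used in decryption exists:

```python
import math
import numpy as np

key = np.array([[3, 10, 20], [20, 9, 17], [9, 4, 17]])

# determinant of the key, reduced modulo the alphabet size
det = int(round(np.linalg.det(key))) % 26

# the key is usable for Hill cipher only if gcd(det, 26) == 1
assert math.gcd(det, 26) == 1, 'key has no inverse mod 26'
print(det)  # → 3
```

For this particular key, the determinant is -1635, which reduces to 3 (mod 26), and gcd(3, 26) = 1, so the inverse modulo 26 exists.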
Hill cipher is a language dependent encryption method. That's why all characters will be lowercased and blank characters removed. Then, every letter is replaced with its index in the alphabet.
import string

def letterToNumber(letter):
    return string.ascii_lowercase.index(letter)

raw_message = "attack is to night"
print("raw message: ", raw_message)

message = []
for i in range(0, len(raw_message)):
    current_letter = raw_message[i:i+1].lower()
    if current_letter != ' ': #discard blank characters
        letter_index = letterToNumber(current_letter)
        message.append(letter_index)
Encryption is handled by multiplying the message and the key. This requires the column count of the message to equal the row count of the key; otherwise, the multiplication cannot be performed. We can append letters to the end of the message until its length becomes a multiple of the key size. Hill cipher is a block cipher method, so repetition will not cause a weakness. Still, I prefer to append letters from the beginning of the message instead of repeating a fixed character. BTW, the column count of my message and the row count of my key are already compatible, so the following code block won't run in this case.
if len(message) % key_rows != 0:
    for i in range(0, len(message)):
        message.append(message[i])
        if len(message) % key_rows == 0:
            break
Now, we can transform the message into a matrix.
message = np.array(message)
message_length = message.shape[0]
message.resize(int(message_length / key_rows), key_rows)
Now, my message is stored in a 5×3 sized matrix as illustrated below.
[[ 0 19 19]
 [ 0  2 10]
 [ 8 18 19]
 [14 13  8]
 [ 6  7 19]]
The message is a 5×3 matrix and the key is a 3×3 matrix. The message's column count equals the key matrix's row count, so they can be multiplied. Multiplication might produce values greater than the alphabet size. That's why we apply modular arithmetic. Here, 26 refers to the size of the English alphabet. We can use either the matmul or the dot function.
encryption = np.matmul(message, key)
encryption = np.remainder(encryption, 26)
The encrypted text is stored in a 5×3 matrix as illustrated below.
[[ 5 13 22]
 [ 0  6 22]
 [ 9  6  9]
 [10  3 13]
 [17 17 16]]
Remember that the plaintext was attackistonight. Focus on the 2nd and 3rd letters of the plaintext: both are the letter 't'. However, the 2nd and 3rd values in the ciphertext are 13 and 22 respectively. The same character is substituted with different characters. This is the idea behind block ciphers.
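We can see this on the first block alone: within [a, t, t], each ciphertext value mixes all three plaintext letters with a different key column, so the two t's encrypt differently. A quick sketch:

```python
import numpy as np

key = np.array([[3, 10, 20], [20, 9, 17], [9, 4, 17]])
block = np.array([0, 19, 19])  # indices of 'a', 't', 't'

# row vector times key, reduced modulo the alphabet size
cipher_block = np.remainder(block @ key, 26)
print(cipher_block)  # → [ 5 13 22]
```

The two t's became 13 and 22 even though they are the same plaintext letter.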
Multiplying the ciphertext by the inverse of the key recovers the plaintext, so we need to find the inverse key. Finding a matrix inverse is a complex operation. Even though numpy has a matrix inverse function, we also need to apply modular arithmetic to that decimal matrix. SymPy, on the other hand, handles modular matrix inversion easily.
from sympy import Matrix

inverse_key = Matrix(key).inv_mod(26)
inverse_key = np.array(inverse_key) #sympy to numpy
inverse_key = inverse_key.astype(float)
So, we have found the inverse key.
[[11. 22. 14.]
 [ 7.  9. 21.]
 [17.  0.  3.]]
We can validate the inverse key matrix. The product of the key and its inverse must equal the identity matrix (mod 26).
check = np.matmul(key, inverse_key)
check = np.remainder(check, 26)
This really produces the identity matrix.
[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]
Bob has found the inverse key and he has the ciphertext. He needs to multiply the ciphertext and inverse key matrices.
decryption = np.matmul(encryption, inverse_key)
decryption = np.remainder(decryption, 26).flatten()
As seen, decryption stores the exact message Alice sent.
decryption: [ 0. 19. 19. 0. 2. 10. 8. 18. 19. 14. 13. 8. 6. 7. 19.]
We can restore these values into characters.
def numberToLetter(number):
    return string.ascii_lowercase[number]

decrypted_message = ""
for i in range(0, len(decryption)):
    letter_num = int(decryption[i])
    letter = numberToLetter(letter_num)
    decrypted_message = decrypted_message + letter
This restores the following message.
decrypted message: attackistonight
Inventor Lester S. Hill registered this idea at the patent office. You should have a look at his drawings. He designed an encrypted telegraph machine at the beginning of the 1930s and named it a message protector. Today, we call this Hill's Cipher Machine.
In this post, we've worked on a 3×3 key, whose key space is 26^{9}. The patented mechanism works on 6×6 keys, which increases the key space to 26^{36}. This is very large even for today's computation power. Increasing the size of the key matrix makes the cipher much stronger. We can say that Hill cipher is secure against ciphertext-only attacks.
However, if an attacker can capture a plaintext-ciphertext pair, then he can calculate the key easily. That's why the cipher is weak against known-plaintext attacks, and that's why it has fallen out of date.
The source code of this post is pushed to GitHub.
The post Using Custom Activation Functions in Keras appeared first on Sefik Ilkin Serengil.
Herein, advanced frameworks cannot always keep up with innovations. For example, you cannot use Swish based activation functions in Keras today. Support might appear in a future patch, but you may need such an activation function before that patch is released. So, this post will guide you to use a custom activation function outside of Keras and TensorFlow, such as Swish or E-Swish.
All you need is to create your custom activation function. In this case, I'll use Swish, which is x times sigmoid. Besides, I include it in a convolutional neural network model.
import keras
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

def swish(x):
    beta = 1.5 #1, 1.5 or 2
    return beta * x * keras.backend.sigmoid(x)

model = Sequential()

#1st convolution layer: 32 filters sized (3, 3)
model.add(Conv2D(32, (3, 3), activation = swish, input_shape=(28,28,1)))
model.add(MaxPooling2D(pool_size=(2,2)))

#2nd convolution layer: apply 64 filters sized (3, 3)
model.add(Conv2D(64, (3, 3), activation = swish))
model.add(MaxPooling2D(pool_size=(2,2)))

model.add(Flatten())

#fully connected layer: 1 hidden layer consisting of 512 nodes
model.add(Dense(512, activation = swish))
model.add(Dense(num_classes, activation='softmax'))

model.compile(loss='categorical_crossentropy'
    , optimizer=keras.optimizers.Adam()
    , metrics=['accuracy']
)

model.fit(x_train, y_train
    , epochs=epochs
    , validation_data=(x_test, y_test)
)
Remember that we use this activation function in the feed forward step, whereas we need its derivative in backpropagation. We only define the activation function; we do not provide its derivative. That's the power of TensorFlow: the framework knows how to apply differentiation for backpropagation. This comes from using the keras backend module. If you wrote the swish function without keras.backend, fitting would fail.
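To see what the framework differentiates for us, here is a numpy sketch (my own illustration, not from the original post) comparing the analytic derivative of swish, obtained via the product rule, with a finite-difference approximation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swish(x, beta=1.5):
    return beta * x * sigmoid(x)

def swish_grad(x, beta=1.5):
    # product rule: d/dx [beta * x * sigmoid(x)] = beta * (s + x * s * (1 - s))
    s = sigmoid(x)
    return beta * (s + x * s * (1.0 - s))

x = np.linspace(-3.0, 3.0, 7)

# central finite difference approximation of the derivative
numeric = (swish(x + 1e-6) - swish(x - 1e-6)) / 2e-6

print(np.allclose(swish_grad(x), numeric, atol=1e-5))  # → True
```

This is the gradient that the framework computes automatically during fitting.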
So, we've mentioned how to include a new activation function in the learning process of the Keras / TensorFlow pair. Picking the most convenient activation function is part of the art for practitioners, just like the network structure (number of hidden layers, number of nodes per hidden layer) and the learning parameters (learning rate, number of epochs). Now, you can design your own activation function or adopt any newly introduced one in a similar way.
My friend and colleague Giray inspired me to produce this post. I am grateful to him as always.
The post A Step by Step Adaboost Example appeared first on Sefik Ilkin Serengil.
We are going to work on the following data set. Each instance is represented in 2-dimensional space and has a class value. You can find the raw data set here.
x1 | x2 | Decision |
2 | 3 | true |
2.1 | 2 | true |
4.5 | 6 | true |
4 | 3.5 | false |
3.5 | 1 | false |
5 | 7 | true |
5 | 3 | false |
6 | 5.5 | true |
8 | 6 | false |
8 | 2 | false |
We should plot the features and class values to understand the data clearly.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

df = pd.read_csv("dataset/adaboost.txt")

#map nominal true/false classes to +1 / -1
df['Decision'] = np.where(df['Decision'].astype(str).str.lower() == 'true', 1, -1)

positives = df[df['Decision'] >= 0]
negatives = df[df['Decision'] < 0]

plt.scatter(positives['x1'], positives['x2'], marker='+', s=500*abs(positives['Decision']), c='blue')
plt.scatter(negatives['x1'], negatives['x2'], marker='_', s=500*abs(negatives['Decision']), c='red')
plt.show()
This code block produces the following graph. As seen, true classes are marked with plus characters whereas false classes are marked with minus characters.
We would like to separate true and false classes. This is not a linearly separable problem. Linear classifiers such as perceptrons or decision stumps cannot classify this problem. Herein, adaboost enables linear classifiers to solve this problem.
Decision trees approach problems with a divide and conquer method. They might have lots of nested decision rules, which makes them non-linear classifiers. In contrast, decision stumps are 1-level decision trees. They are linear classifiers, just like (single layer) perceptrons. You might guess that someone taller than 1.70 meters (5.58 feet) is male, and female otherwise. This decision stump would classify gender with at least 50% accuracy. That's why these classifiers are called weak learners.
I've modified my decision tree repository to handle decision stumps. Basically, the buildDecisionTree function calls itself until it reaches a decision. I terminate this recursive call early when adaboost is enabled.
The main principle in adaboost is to increase the weights of misclassified instances and to decrease the weights of correctly classified ones. But we are working on a classification problem, and the target values in the data set are nominal. That's why we transform the problem into a regression task: I set true classes to 1 and false classes to -1.
Initially, we distribute weights uniformly. I set the weight of every instance to 1/n, where n is the total number of instances.
x1 | x2 | actual | weight | weighted_actual |
2 | 3 | 1 | 0.1 | 0.1 |
2 | 2 | 1 | 0.1 | 0.1 |
4 | 6 | 1 | 0.1 | 0.1 |
4 | 3 | -1 | 0.1 | -0.1 |
4 | 1 | -1 | 0.1 | -0.1 |
5 | 7 | 1 | 0.1 | 0.1 |
5 | 3 | -1 | 0.1 | -0.1 |
6 | 5 | 1 | 0.1 | 0.1 |
8 | 6 | -1 | 0.1 | -0.1 |
8 | 2 | -1 | 0.1 | -0.1 |
Weighted actual stores weight times actual value for each line. Now, we are going to use weighted actual as target value whereas x1 and x2 are features to build a decision stump. The following rule set is created when I run the decision stump algorithm.
def findDecision(x1, x2):
    if x1 > 2.1:
        return -0.025
    if x1 <= 2.1:
        return 0.1
We've set the actual values to ±1, but the decision stump returns decimal values. Here, the trick is that applying the sign function handles this issue.
def sign(x):
    if x > 0:
        return 1
    elif x < 0:
        return -1
    else:
        return 0
To sum up, prediction will be sign(-0.025) = -1 when x1 is greater than 2.1, and it will be sign(0.1) = +1 when x1 is less than or equal to 2.1.
I'll put the predictions in a column. I also check the equality of actual and prediction in a loss column: it is 0 if the prediction is correct and 1 if it is incorrect.
x1 | x2 | actual | weight | weighted_actual | prediction | loss | weight * loss |
2 | 3 | 1 | 0.1 | 0.1 | 1 | 0 | 0 |
2 | 2 | 1 | 0.1 | 0.1 | 1 | 0 | 0 |
4 | 6 | 1 | 0.1 | 0.1 | -1 | 1 | 0.1 |
4 | 3 | -1 | 0.1 | -0.1 | -1 | 0 | 0 |
4 | 1 | -1 | 0.1 | -0.1 | -1 | 0 | 0 |
5 | 7 | 1 | 0.1 | 0.1 | -1 | 1 | 0.1 |
5 | 3 | -1 | 0.1 | -0.1 | -1 | 0 | 0 |
6 | 5 | 1 | 0.1 | 0.1 | -1 | 1 | 0.1 |
8 | 6 | -1 | 0.1 | -0.1 | -1 | 0 | 0 |
8 | 2 | -1 | 0.1 | -0.1 | -1 | 0 | 0 |
The sum of the weight times loss column gives the total error: 0.3 in this case. Here, we define a new variable, alpha: half the natural logarithm of (1 – ε)/ε.
α = ln[(1-ε)/ε] / 2 = ln[(1 – 0.3)/0.3] / 2 = 0.42
We’ll use alpha to update weights in the next round.
w_{i+1} = w_{i} * math.exp(-alpha * actual * prediction) where i refers to instance number.
Also, the sum of the weights must equal 1. That's why we normalize the weight values: dividing each weight by the sum of the weights column enables normalization.
x1 | x2 | actual | weight | prediction | w_(i+1) | norm(w_(i+1)) |
2 | 3 | 1 | 0.1 | 1 | 0.065 | 0.071 |
2 | 2 | 1 | 0.1 | 1 | 0.065 | 0.071 |
4 | 6 | 1 | 0.1 | -1 | 0.153 | 0.167 |
4 | 3 | -1 | 0.1 | -1 | 0.065 | 0.071 |
4 | 1 | -1 | 0.1 | -1 | 0.065 | 0.071 |
5 | 7 | 1 | 0.1 | -1 | 0.153 | 0.167 |
5 | 3 | -1 | 0.1 | -1 | 0.065 | 0.071 |
6 | 5 | 1 | 0.1 | -1 | 0.153 | 0.167 |
8 | 6 | -1 | 0.1 | -1 | 0.065 | 0.071 |
8 | 2 | -1 | 0.1 | -1 | 0.065 | 0.071 |
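The updated and normalized weights in the table above can be reproduced numerically. A small sketch (my own illustration, reusing the round-1 numbers):

```python
import math

epsilon = 0.3  # total weighted error of round 1
alpha = math.log((1 - epsilon) / epsilon) / 2  # ≈ 0.42

w = 0.1  # initial weight of every instance

# w_(i+1) = w_i * exp(-alpha * actual * prediction)
w_correct = w * math.exp(-alpha)   # actual == prediction
w_incorrect = w * math.exp(alpha)  # actual != prediction

# 7 instances were classified correctly and 3 incorrectly in round 1
total = 7 * w_correct + 3 * w_incorrect  # normalization constant

print(round(w_correct / total, 3), round(w_incorrect / total, 3))  # → 0.071 0.167
```

These match the norm(w_(i+1)) column: correctly classified instances drop to 0.071 and misclassified ones rise to 0.167.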
This round is over.
I shift the normalized w_(i+1) column into the weight column for this round. Then I build a new decision stump. x1 and x2 are still the features and weighted actual is the target value.
x1 | x2 | actual | weight | weighted_actual |
2 | 3 | 1 | 0.071 | 0.071 |
2 | 2 | 1 | 0.071 | 0.071 |
4 | 6 | 1 | 0.167 | 0.167 |
4 | 3 | -1 | 0.071 | -0.071 |
4 | 1 | -1 | 0.071 | -0.071 |
5 | 7 | 1 | 0.167 | 0.167 |
5 | 3 | -1 | 0.071 | -0.071 |
6 | 5 | 1 | 0.167 | 0.167 |
8 | 6 | -1 | 0.071 | -0.071 |
8 | 2 | -1 | 0.071 | -0.071 |
The graph of the new data set is demonstrated below. Weights of correctly classified instances decreased whereas weights of misclassified ones increased.
The following decision stump will be built for this data set.
def findDecision(x1, x2):
    if x2 <= 3.5:
        return -0.02380952380952381
    if x2 > 3.5:
        return 0.10714285714285714
I've applied the sign function to the predictions. Then, I put the loss and weight times loss values in columns.
x1 | x2 | actual | weight | prediction | loss | weight * loss |
2 | 3 | 1 | 0.071 | -1 | 1 | 0.071 |
2 | 2 | 1 | 0.071 | -1 | 1 | 0.071 |
4 | 6 | 1 | 0.167 | 1 | 0 | 0.000 |
4 | 3 | -1 | 0.071 | -1 | 0 | 0.000 |
4 | 1 | -1 | 0.071 | -1 | 0 | 0.000 |
5 | 7 | 1 | 0.167 | 1 | 0 | 0.000 |
5 | 3 | -1 | 0.071 | -1 | 0 | 0.000 |
6 | 5 | 1 | 0.167 | 1 | 0 | 0.000 |
8 | 6 | -1 | 0.071 | 1 | 1 | 0.071 |
8 | 2 | -1 | 0.071 | -1 | 0 | 0.000 |
I can calculate error and alpha values for round 2.
ε = 0.21, α = 0.65
So, weights for the following round can be found.
x1 | x2 | actual | weight | prediction | w_(i+1) | norm(w_(i+1)) |
2 | 3 | 1 | 0.071 | -1 | 0.137 | 0.167 |
2 | 2 | 1 | 0.071 | -1 | 0.137 | 0.167 |
4 | 6 | 1 | 0.167 | 1 | 0.087 | 0.106 |
4 | 3 | -1 | 0.071 | -1 | 0.037 | 0.045 |
4 | 1 | -1 | 0.071 | -1 | 0.037 | 0.045 |
5 | 7 | 1 | 0.167 | 1 | 0.087 | 0.106 |
5 | 3 | -1 | 0.071 | -1 | 0.037 | 0.045 |
6 | 5 | 1 | 0.167 | 1 | 0.087 | 0.106 |
8 | 6 | -1 | 0.071 | 1 | 0.137 | 0.167 |
8 | 2 | -1 | 0.071 | -1 | 0.037 | 0.045 |
I skipped the calculations for the following rounds.
x1 | x2 | actual | weight | prediction | loss | w * loss | w_(i+1) | norm(w_(i+1)) |
2 | 3 | 1 | 0.167 | 1 | 0 | 0.000 | 0.114 | 0.122 |
2 | 2 | 1 | 0.167 | 1 | 0 | 0.000 | 0.114 | 0.122 |
4 | 6 | 1 | 0.106 | -1 | 1 | 0.106 | 0.155 | 0.167 |
4 | 3 | -1 | 0.045 | -1 | 0 | 0.000 | 0.031 | 0.033 |
4 | 1 | -1 | 0.045 | -1 | 0 | 0.000 | 0.031 | 0.033 |
5 | 7 | 1 | 0.106 | -1 | 1 | 0.106 | 0.155 | 0.167 |
5 | 3 | -1 | 0.045 | -1 | 0 | 0.000 | 0.031 | 0.033 |
6 | 5 | 1 | 0.106 | -1 | 1 | 0.106 | 0.155 | 0.167 |
8 | 6 | -1 | 0.167 | -1 | 0 | 0.000 | 0.114 | 0.122 |
8 | 2 | -1 | 0.045 | -1 | 0 | 0.000 | 0.031 | 0.033 |
ε = 0.31, α = 0.38
def findDecision(x1, x2):
    if x1 > 2.1:
        return -0.003787878787878794
    if x1 <= 2.1:
        return 0.16666666666666666
x1 | x2 | actual | weight | prediction | loss | w * loss | w_(i+1) | norm(w_(i+1)) |
2 | 3 | 1 | 0.122 | 1 | 0 | 0.000 | 0.041 | 0.068 |
2 | 2 | 1 | 0.122 | 1 | 0 | 0.000 | 0.041 | 0.068 |
4 | 6 | 1 | 0.167 | 1 | 0 | 0.000 | 0.056 | 0.093 |
4 | 3 | -1 | 0.033 | 1 | 1 | 0.033 | 0.100 | 0.167 |
4 | 1 | -1 | 0.033 | 1 | 1 | 0.033 | 0.100 | 0.167 |
5 | 7 | 1 | 0.167 | 1 | 0 | 0.000 | 0.056 | 0.093 |
5 | 3 | -1 | 0.033 | 1 | 1 | 0.033 | 0.100 | 0.167 |
6 | 5 | 1 | 0.167 | 1 | 0 | 0.000 | 0.056 | 0.093 |
8 | 6 | -1 | 0.122 | -1 | 0 | 0.000 | 0.041 | 0.068 |
8 | 2 | -1 | 0.033 | -1 | 0 | 0.000 | 0.011 | 0.019 |
ε = 0.10, α = 1.10
def findDecision(x1, x2):
    if x1 <= 6.0:
        return 0.08055555555555555
    if x1 > 6.0:
        return -0.07777777777777778
The sign of the cumulative sum of each round's alpha times prediction gives the final prediction.
round 1 | round 2 | round 3 | round 4 | final | |||||
α | pred | α | pred | α | pred | α | pred | pred | actual |
0.42 | 1 | 0.65 | -1 | 0.38 | 1 | 1.1 | 1 | 1 | 1 |
0.42 | 1 | 0.65 | -1 | 0.38 | 1 | 1.1 | 1 | 1 | 1 |
0.42 | -1 | 0.65 | 1 | 0.38 | -1 | 1.1 | 1 | 1 | 1 |
0.42 | -1 | 0.65 | -1 | 0.38 | -1 | 1.1 | 1 | -1 | -1 |
0.42 | -1 | 0.65 | -1 | 0.38 | -1 | 1.1 | 1 | -1 | -1 |
0.42 | -1 | 0.65 | 1 | 0.38 | -1 | 1.1 | 1 | 1 | 1 |
0.42 | -1 | 0.65 | -1 | 0.38 | -1 | 1.1 | 1 | -1 | -1 |
0.42 | -1 | 0.65 | 1 | 0.38 | -1 | 1.1 | 1 | 1 | 1 |
0.42 | -1 | 0.65 | 1 | 0.38 | -1 | 1.1 | -1 | -1 | -1 |
0.42 | -1 | 0.65 | -1 | 0.38 | -1 | 1.1 | -1 | -1 | -1 |
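The final column of the table above can be reproduced with a short sketch (my own illustration):

```python
import numpy as np

alphas = np.array([0.42, 0.65, 0.38, 1.10])

# per-round predictions for the 10 instances (rows) over 4 rounds (columns)
preds = np.array([
    [ 1, -1,  1,  1],
    [ 1, -1,  1,  1],
    [-1,  1, -1,  1],
    [-1, -1, -1,  1],
    [-1, -1, -1,  1],
    [-1,  1, -1,  1],
    [-1, -1, -1,  1],
    [-1,  1, -1,  1],
    [-1,  1, -1, -1],
    [-1, -1, -1, -1],
])

# weighted vote of the weak classifiers, then the sign function
final = np.sign(preds @ alphas).astype(int)
print(final)  # → [ 1  1  1 -1 -1  1 -1  1 -1 -1]
```

Each final prediction agrees with the actual class, even though individual rounds disagree.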
You might notice that rounds 1 and 3 produce the same predictions. Pruning in adaboost proposes removing similar weak classifiers to improve performance, while adding the alpha multiplier of the removed one to the remaining one. In this case, I remove round 3 and add its coefficient to round 1 (0.42 + 0.38 = 0.8).
round 1 | round 2 | round 4 | final | ||||
α | pred | α | pred | α | pred | pred | actual |
0.8 | 1 | 0.65 | -1 | 1.1 | 1 | 1 | 1 |
0.8 | 1 | 0.65 | -1 | 1.1 | 1 | 1 | 1 |
0.8 | -1 | 0.65 | 1 | 1.1 | 1 | 1 | 1 |
0.8 | -1 | 0.65 | -1 | 1.1 | 1 | -1 | -1 |
0.8 | -1 | 0.65 | -1 | 1.1 | 1 | -1 | -1 |
0.8 | -1 | 0.65 | 1 | 1.1 | 1 | 1 | 1 |
0.8 | -1 | 0.65 | -1 | 1.1 | 1 | -1 | -1 |
0.8 | -1 | 0.65 | 1 | 1.1 | 1 | 1 | 1 |
0.8 | -1 | 0.65 | 1 | 1.1 | -1 | -1 | -1 |
0.8 | -1 | 0.65 | -1 | 1.1 | -1 | -1 | -1 |
Even though we've used linear weak classifiers, all instances are classified correctly.
So, we've covered the adaptive boosting algorithm. In this example, we used decision stumps as the weak classifier. You might use perceptrons for more complex data sets. I've pushed the adaboost logic to my GitHub repository.
Special thanks to Olga Veksler. Her lecture notes helped me to understand this concept.
The post A Step by Step Gradient Boosting Example for Classification appeared first on Sefik Ilkin Serengil.
Notice that gradient boosting is not a decision tree algorithm itself. It proposes running regression trees sequentially.
Here, we are going to work on the Iris data set. There are 150 instances of 3 homogeneous classes: setosa, versicolor and virginica. The class is the target output, whereas the sepal and petal measurements are the input features.
Applying the C4.5 decision tree algorithm to this data set classifies 105 instances correctly and 45 incorrectly. This means 70% accuracy, which is far from satisfactory. We will run the same C4.5 algorithm in the following steps, but boosting enables us to increase the accuracy.
You can find the building decision tree code here.
We are going to apply one-hot encoding to the target output, so it will be represented as a three dimensional vector. However, decision tree algorithms can handle only one output. That's why we will build 3 different regression trees each round. You might think of each tree as a separate binary classification problem.
I've selected sample rows of the data set for illustration. This is the original one.
instance | sepal_length | sepal_width | petal_length | petal_width | label |
1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
2 | 4.9 | 3 | 1.4 | 0.2 | setosa |
51 | 7 | 3.2 | 4.7 | 1.4 | versicolor |
101 | 6.3 | 3.3 | 6 | 2.5 | virginica |
Label consists of 3 classes: setosa, versicolor and virginica.
Firstly, I prepare a data set to check whether instances are setosa or not.
instance | sepal_length | sepal_width | petal_length | petal_width | setosa |
1 | 5.1 | 3.5 | 1.4 | 0.2 | 1 |
2 | 4.9 | 3 | 1.4 | 0.2 | 1 |
51 | 7 | 3.2 | 4.7 | 1.4 | 0 |
101 | 6.3 | 3.3 | 6 | 2.5 | 0 |
Secondly, I prepare a data set to check whether instances are versicolor or not.
instance | sepal_length | sepal_width | petal_length | petal_width | versicolor |
1 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
2 | 4.9 | 3 | 1.4 | 0.2 | 0 |
51 | 7 | 3.2 | 4.7 | 1.4 | 1 |
101 | 6.3 | 3.3 | 6 | 2.5 | 0 |
Finally, I prepare a data set to check whether instances are virginica or not.
instance | sepal_length | sepal_width | petal_length | petal_width | virginica |
1 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
2 | 4.9 | 3 | 1.4 | 0.2 | 0 |
51 | 7 | 3.2 | 4.7 | 1.4 | 0 |
101 | 6.3 | 3.3 | 6 | 2.5 | 1 |
Now, I have 3 different data sets. I can build 3 decision trees for these data sets.
I’m going to put actual labels and predictions in the same table in the following steps. Columns beginning with F_ prefix are predictions.
instance | Y_setosa | Y_versicolor | Y_virginica | F_setosa | F_versicolor | F_virginica |
1 | 1 | 0 | 0 | 1 | 0 | 0 |
2 | 1 | 0 | 0 | 1 | 0 | 0 |
51 | 0 | 1 | 0 | 0 | 1 | 0 |
101 | 0 | 0 | 1 | 0 | 1 | 1 |
Notice that instance 101 is predicted as versicolor and virginica with the same probability. This is an error.
Initially, we need to apply the softmax function to the predictions. This function normalizes all inputs into the range [0, 1], and the sum of the normalized values always equals 1. There is no out-of-the-box softmax function in Python, but we can create one easily, as coded below.
def softmax(w):
    e = np.exp(np.array(w))
    dist = e / np.sum(e)
    return dist
I'm going to add these probabilities as columns. I've also hidden the actual values (Y_ prefix) to fit the table.
ins | F_setosa | F_versicolor | F_virginica | P_setosa | P_versicolor | P_virginica |
1 | 1 | 0 | 0 | 0.576 | 0.212 | 0.212 |
2 | 1 | 0 | 0 | 0.576 | 0.212 | 0.212 |
51 | 0 | 1 | 0 | 0.212 | 0.576 | 0.212 |
101 | 0 | 1 | 1 | 0.155 | 0.422 | 0.422 |
Remember that in the regression case we built the new tree on actual minus prediction as the target value; that difference comes from the derivative of mean squared error. Herein, we've applied the softmax function. The maximum among the prediction probabilities (columns with the P_ prefix) becomes the prediction; in other words, one-hot encoding assigns 1 to the maximum and 0 to the others. Cross entropy then captures the relation between the probabilities and the one-hot encoded results. Applying softmax and cross entropy together has a surprising derivative: it is equal to prediction (the probabilities) minus actual. The negative gradient is therefore actual (Y_ prefix columns) minus prediction (P_ prefix columns). We will derive this value.
instance | Y_setosa – P_setosa | Y_versicolor – P_versicolor | Y_virginica – P_virginica |
1 | 0.424 | -0.212 | -0.212 |
2 | 0.424 | -0.212 | -0.212 |
51 | -0.212 | 0.424 | -0.212 |
101 | -0.155 | -0.422 | 0.578 |
This is round 1. The target values will be replaced with these negative gradients in the following round.
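The negative gradients in the table above can be reproduced directly from the raw predictions; a numpy sketch (my own illustration):

```python
import numpy as np

def softmax(w):
    e = np.exp(np.array(w))
    return e / np.sum(e)

# instance 1: actual setosa, predicted setosa
y1, f1 = np.array([1, 0, 0]), np.array([1, 0, 0])
print(np.round(y1 - softmax(f1), 3))      # → [ 0.424 -0.212 -0.212]

# instance 101: actual virginica, predicted versicolor and virginica equally
y101, f101 = np.array([0, 0, 1]), np.array([0, 1, 1])
print(np.round(y101 - softmax(f101), 3))  # → [-0.155 -0.422  0.578]
```

Both rows match the table: actual minus softmax of the predictions.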
Target column for setosa will be replaced with Y_setosa – P_setosa.
instance | sepal_length | sepal_width | petal_length | petal_width | setosa |
1 | 5.1 | 3.5 | 1.4 | 0.2 | 0.424 |
2 | 4.9 | 3 | 1.4 | 0.2 | 0.424 |
51 | 7 | 3.2 | 4.7 | 1.4 | -0.212 |
101 | 6.3 | 3.3 | 6 | 2.5 | -0.155 |
Target column for versicolor will be replaced with Y_versicolor – P_versicolor.
instance | sepal_length | sepal_width | petal_length | petal_width | versicolor |
1 | 5.1 | 3.5 | 1.4 | 0.2 | -0.212 |
2 | 4.9 | 3 | 1.4 | 0.2 | -0.212 |
51 | 7 | 3.2 | 4.7 | 1.4 | 0.424 |
101 | 6.3 | 3.3 | 6 | 2.5 | -0.422 |
I will apply a similar replacement for virginica, too. These are my new data sets. I'm going to build 3 different decision trees for these 3 data sets. This operation is repeated until I get satisfactory success.
Finally, I sum the predictions (F_ prefix) over all rounds. The index of the maximum value is my prediction.
At round 10, I classify 144 instances correctly and 6 incorrectly. This means 96% accuracy. Remember that I got 70% accuracy before boosting. This is a major improvement!
I've demonstrated gradient boosting for classification on a multi-class problem where the number of classes is greater than 2. Running it on a binary classification problem (true/false) might instead use the sigmoid function. Still, the softmax and cross-entropy pair works for binary classification, too.
So, we've covered a step by step gradient boosting example for classification; I cannot find this in the literature. Basically, we transformed a classification example into multiple regression tasks to boost. I am grateful to Cheng Li; his lecture notes guided me to understand this topic. Finally, running and debugging the code yourself makes the concept much more understandable. That's why I've already pushed the code of gradient boosting for classification to GitHub.
The post How Pruning Works in Decision Trees appeared first on Sefik Ilkin Serengil.
Pruning can be handled as pre-pruning and post-pruning.
We’ve mentioned regression tree in a previous post. We are going to use the same data set in that post as demonstrated below.
Day | Outlook | Temp. | Humidity | Wind | Golf Players |
1 | Sunny | Hot | High | Weak | 25 |
2 | Sunny | Hot | High | Strong | 30 |
3 | Overcast | Hot | High | Weak | 46 |
4 | Rain | Mild | High | Weak | 45 |
5 | Rain | Cool | Normal | Weak | 52 |
6 | Rain | Cool | Normal | Strong | 23 |
7 | Overcast | Cool | Normal | Strong | 43 |
8 | Sunny | Mild | High | Weak | 35 |
9 | Sunny | Cool | Normal | Weak | 38 |
10 | Rain | Mild | Normal | Weak | 46 |
11 | Sunny | Mild | Normal | Strong | 48 |
12 | Overcast | Mild | High | Strong | 52 |
13 | Overcast | Hot | Normal | Weak | 44 |
14 | Rain | Mild | High | Strong | 30 |
Running regression tree algorithm constructs the following decision tree.
def findDecision(Outlook, Temp., Humidity, Wind):
    if Outlook == 'Rain':
        if Wind == 'Weak':
            if Humidity <= 95:
                if Temp. <= 83:
                    return 46
            if Humidity > 95:
                return 45
        if Wind == 'Strong':
            if Temp. <= 83:
                if Humidity <= 95:
                    return 23
    if Outlook == 'Sunny':
        if Temp. <= 83:
            if Wind == 'Weak':
                if Humidity <= 95:
                    return 35
            if Wind == 'Strong':
                if Humidity <= 95:
                    return 30
        if Temp. > 83:
            return 25
    if Outlook == 'Overcast':
        if Wind == 'Weak':
            if Temp. <= 83:
                if Humidity <= 95:
                    return 46
        if Wind == 'Strong':
            if Temp. <= 83:
                if Humidity <= 95:
                    return 43
As seen, a disappointingly huge tree is created. This is a typical problem of regression trees. Decision rules at the bottom cover only a few instances, or even a single instance. This causes overfitting. Here, we can apply early stopping: we might check the number of instances in the current branch, or the ratio of the standard deviation of the current branch to that of the global data set.
if algorithm == 'Regression' and subdataset.shape[0] < 5:
#if algorithm == 'Regression' and subdataset['Decision'].std(ddof=0)/global_stdev < 0.4:
    final_decision = subdataset['Decision'].mean() #get average
    terminateBuilding = True
Enabling early stopping when the sub data set in the current branch has fewer than e.g. 5 instances constructs the following decision tree. As seen, more generalized decision rules are created. This avoids overfitting.
def findDecision(Outlook, Temp., Humidity, Wind):
    if Outlook == 'Rain':
        if Wind == 'Weak':
            return 47.666666666666664
        if Wind == 'Strong':
            return 26.5
    if Outlook == 'Sunny':
        if Temp. <= 83:
            return 37.75
        if Temp. > 83:
            return 25
    if Outlook == 'Overcast':
        return 46.25
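The leaf values here are just the means of the golf players column for the instances that fall into each branch. A quick check (my own sketch; the day groupings are read off the table above):

```python
# golf players grouped by the branches of the pruned tree
rain_weak = [45, 52, 46]       # days 4, 5, 10
rain_strong = [23, 30]         # days 6, 14
sunny_le83 = [30, 35, 38, 48]  # days 2, 8, 9, 11
overcast = [46, 43, 52, 44]    # days 3, 7, 12, 13

def mean(xs):
    return sum(xs) / len(xs)

print(round(mean(rain_weak), 3), mean(rain_strong), mean(sunny_le83), mean(overcast))
# → 47.667 26.5 37.75 46.25
```

These averages match the return values of the pruned decision rules.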
We’ve mentioned C4.5 decision tree algorithm in a previous post. Suppose that we are going to work on the following data set.
Day | Outlook | Temp. | Humidity | Wind | Decision |
1 | Sunny | 85 | 85 | Weak | No |
2 | Sunny | 80 | 90 | Strong | No |
3 | Overcast | 83 | 78 | Weak | Yes |
4 | Rain | 70 | 96 | Weak | Yes |
5 | Rain | 68 | 80 | Weak | Yes |
6 | Rain | 65 | 70 | Strong | No |
7 | Overcast | 64 | 65 | Strong | Yes |
8 | Sunny | 72 | 95 | Weak | No |
9 | Sunny | 69 | 70 | Weak | Yes |
10 | Rain | 75 | 80 | Weak | Yes |
11 | Sunny | 75 | 70 | Strong | Yes |
12 | Overcast | 72 | 90 | Strong | Yes |
13 | Overcast | 81 | 75 | Weak | Yes |
14 | Rain | 71 | 80 | Strong | No |
The C4.5 algorithm constructs the following decision tree. Notice that the tree is in a different form than in the related blog post, because we picked the information gain metric there whereas we pick the gain ratio metric in this post.
def findDecision(Outlook, Temp., Humidity, Wind):
    if Temp. <= 83:
        if Outlook == 'Rain':
            if Wind == 'Weak':
                return 'Yes'
            if Wind == 'Strong':
                return 'No'
        if Outlook == 'Overcast':
            return 'Yes'
        if Outlook == 'Sunny':
            if Humidity > 65:
                if Wind == 'Strong':
                    return 'Yes'
                if Wind == 'Weak':
                    return 'Yes'
    if Temp. > 83:
        return 'No'
Here, please focus on the decisions when temperature is less than or equal to 83 and the outlook is sunny. This branch makes a positive decision no matter what the wind is, yet it still checks the wind feature. We can prune the wind check at that level. Also, you might notice that there is no answer when humidity is less than or equal to 65; that is because 65 is the minimum humidity value in the data set. We can prune this rule, too. The final form of the decision tree is illustrated below.
def findDecision(Outlook, Temp., Humidity, Wind):
    if Temp. <= 83:
        if Outlook == 'Rain':
            if Wind == 'Weak':
                return 'Yes'
            if Wind == 'Strong':
                return 'No'
        if Outlook == 'Overcast':
            return 'Yes'
        if Outlook == 'Sunny':
            return 'Yes'
    if Temp. > 83:
        return 'No'
This modification improves the runtime performance of the decision tree, because it always makes the same decision in that branch; the result would not change.
We have been pruning some decision rules because their upper branch covers both outcomes. But this is not a must. You should also prune branches that derive from only a few instances in the training set. This helps avoid overfitting.
To sum up, post-pruning means building the decision tree first and then pruning decision rules from the end towards the beginning. In contrast, pre-pruning is handled while the tree is being built. In both cases, less complex trees are created, which makes the decision rules run faster and might also help avoid overfitting.
All code and data sets are already pushed to GitHub. You might run them yourself.
The post A Gentle Introduction to LightGBM for Applied Machine Learning appeared first on Sefik Ilkin Serengil.
You might run the pip install lightgbm command to install the LightGBM package. Then, we import the related library.
import lightgbm as lgb
The data set that we are going to work on is about the decision to play golf based on some features. You can find the data set here. I chose this data set because it has both numeric and string features. The Decision column is the target from which we would like to extract decision rules. I will load the data set with pandas because it simplifies column based operations in the following steps.
import pandas as pd

dataset = pd.read_csv('golf2.txt')
dataset.head()
Data frame’s head function prints the first 5 rows.
Outlook | Temp. | Humidity | Wind | Decision | |
0 | Sunny | 85 | 85 | Weak | No |
1 | Sunny | 80 | 90 | Strong | No |
2 | Overcast | 83 | 78 | Weak | Yes |
3 | Rain | 70 | 96 | Weak | Yes |
4 | Rain | 68 | 80 | Weak | Yes |
LightGBM expects categorical features to be encoded as integers. Here, the temperature and humidity features are already numeric, but the outlook and wind features are categorical. We need to convert these features. I will use scikit-learn's label encoder.
Even though categorical features will be converted to integer, we will specify categorical features in the following steps. That’s why, I store both all features and categorical ones in different variables.
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
is_regression = False #this post handles a classification problem

features = []; categorical_features = []
num_of_columns = dataset.shape[1]
for i in range(0, num_of_columns):
    column_name = dataset.columns[i]
    column_type = dataset[column_name].dtypes
    if i != num_of_columns - 1: #skip target
        features.append(column_name)
    if column_type == 'object':
        le.fit(dataset[column_name])
        feature_classes = list(le.classes_)
        encoded_feature = le.transform(dataset[column_name])
        dataset[column_name] = pd.DataFrame(encoded_feature)
        if i != num_of_columns - 1: #skip target
            categorical_features.append(column_name)
        if is_regression == False and i == num_of_columns - 1:
            num_of_classes = len(feature_classes)
In this way, we can handle different data sets. Let’s check the encoded data set.
dataset.head()
 | Outlook | Temp. | Humidity | Wind | Decision |
0 | 2 | 85 | 85 | 1 | 0 |
1 | 2 | 80 | 90 | 0 | 0 |
2 | 0 | 83 | 78 | 1 | 1 |
3 | 1 | 70 | 96 | 1 | 1 |
4 | 1 | 68 | 80 | 1 | 1 |
The data set is transformed into its final form. We need to separate input features and output labels to feed LightGBM.
y_train = dataset['Decision'].values
x_train = dataset.drop(columns=['Decision']).values
Remember that we converted string features to integers. Here, we still need to specify which features are categorical. Training would work even if we did not, but then a node in the decision tree might split on a categorical feature with a comparison such as greater than, or less than or equal to. Suppose gender were a feature in our data set, with unknown encoded as 0, male as 1 and female as 2. What would a split such as gender greater than 0, or gender less than or equal to 0, mean? We might lose important gender information. Specifying categorical features lets the tree check male, female and unknown as separate categories.
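The gender point above can be shown in a couple of lines (the encoding dictionary is a hypothetical stand-in, not a feature of this data set):

```python
# Label encoding imposes an artificial order on categories, so a numeric
# split such as "gender > 0" lumps two unrelated categories together.
encoding = {"unknown": 0, "male": 1, "female": 2}
grouped_by_threshold = sorted(g for g, code in encoding.items() if code > 0)
print(grouped_by_threshold)  # ['female', 'male']
```

A categorical-aware split would instead test each of the three codes separately.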
lgb_train = lgb.Dataset(x_train, y_train
    , feature_name = features
    , categorical_feature = categorical_features
)
LightGBM can solve both classification and regression problems. Typically, only the objective and metric parameters differ. Passing the parameter set and LightGBM's data set to the train function starts training.
params = {
    'task': 'train'
    , 'boosting_type': 'gbdt'
    , 'objective': 'regression' if is_regression == True else 'multiclass'
    , 'num_class': num_of_classes
    , 'metric': 'rmsle' if is_regression == True else 'multi_logloss'
    , 'min_data': 1
    , 'verbose': -1
}

gbm = lgb.train(params, lgb_train, num_boost_round=50)
The trained model is stored in the gbm variable. We can ask gbm to predict the decision for a new instance. Similarly, we can feed the features of the training set instances and ask gbm to predict their decisions.
import numpy as np

target_name = 'Decision'
predictions = gbm.predict(x_train)
for index, instance in dataset.iterrows():
    actual = instance[target_name]
    if is_regression == True:
        prediction = round(predictions[index])
    else: #classification
        prediction = np.argmax(predictions[index])
    print((index+1), ". actual= ", actual, ", prediction= ", prediction)
This code block makes the following predictions for the training data set. As seen, all instances are predicted correctly.
actual= 0 , prediction= 0
actual= 0 , prediction= 0
actual= 1 , prediction= 1
actual= 1 , prediction= 1
actual= 1 , prediction= 1
actual= 0 , prediction= 0
actual= 1 , prediction= 1
actual= 0 , prediction= 0
actual= 1 , prediction= 1
actual= 1 , prediction= 1
actual= 1 , prediction= 1
actual= 1 , prediction= 1
actual= 1 , prediction= 1
actual= 0 , prediction= 0
Luckily, LightGBM can visualize the built decision tree and the importance of data set features. This makes its decisions understandable. Visualization requires installing the Graph Visualization Software (Graphviz).
Firstly, run the pip install graphviz command to install the python package.
Secondly, install the graphviz package for your OS from here. You can add the installed directory to the path as illustrated below.
import matplotlib.pyplot as plt
import os

os.environ["PATH"] += os.pathsep + 'C:/Program Files (x86)/Graphviz2.38/bin'
Plotting tree is an easy task now.
ax = lgb.plot_importance(gbm, max_num_features=10)
plt.show()

ax = lgb.plot_tree(gbm)
plt.show()
Decision rules can be extracted from the built tree easily.
Now, we know feature importance for the data set.
So, we have discovered Microsoft's light gradient boosting machine framework, adopted by many applied machine learning studies. Moreover, we've mentioned its pros and cons compared to its alternatives. Besides, we've developed a hello world model with LightGBM. Finally, I pushed the source code of this blog post to my GitHub profile.
The post A Gentle Introduction to LightGBM for Applied Machine Learning appeared first on Sefik Ilkin Serengil.
Lecture notes of Zico Kolter from Carnegie Mellon University and lecture notes of Cheng Li from Northeastern University guided me in understanding the concept. Moreover, Tianqi Chen's presentation reinforced it. I have also linked, throughout this post, all the sources that helped me clarify the subject. I strongly recommend you visit them.
I pushed the core implementation of gradient boosted regression tree algorithm to GitHub. You might want to clone the repository and run it by yourself.
This is very similar to the baby step, giant step approach. We initially create a decision tree for the raw data set; that is the giant step. Then it is time to tune and boost: we create a new decision tree based on the previous tree's errors, and apply this approach several times. These are the baby steps. Terence Parr described this process wonderfully with a golf-playing scenario, as illustrated below.
Herein, remember the random forest algorithm. It separates the data set into n different sub data sets and creates a different decision tree for each of them. In contrast, the data set remains the same in GBM. We create a decision tree, feed the same data set to the decision tree algorithm again, but update each instance's label to its actual value minus its prediction.
You can think of gradient boosting as sequential decision trees.
For instance, the following illustration shows that the first decision tree returns 2 as a result for the boy. Then we build another decision tree based on the errors of the first tree's results; this time it returns 0.9 for the boy. The final decision for the boy is 2.9, the sum of the sequential trees' predictions.
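The combination rule is literally a sum; a toy sketch with the boy example's numbers (the trees themselves are hypothetical stand-ins):

```python
# First tree's prediction plus the second tree's correction.
tree_outputs = [2.0, 0.9]
final_prediction = sum(tree_outputs)
print(final_prediction)  # 2.9
```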
You might remember that we've mentioned regression trees in a previous post. Reading that post will help you understand GBM clearly.
We’ve worked on the following data set.
Day | Outlook | Temp. | Humidity | Wind | Decision |
1 | Sunny | Hot | High | Weak | 25 |
2 | Sunny | Hot | High | Strong | 30 |
3 | Overcast | Hot | High | Weak | 46 |
4 | Rain | Mild | High | Weak | 45 |
5 | Rain | Cool | Normal | Weak | 52 |
6 | Rain | Cool | Normal | Strong | 23 |
7 | Overcast | Cool | Normal | Strong | 43 |
8 | Sunny | Mild | High | Weak | 35 |
9 | Sunny | Cool | Normal | Weak | 38 |
10 | Rain | Mild | Normal | Weak | 46 |
11 | Sunny | Mild | Normal | Strong | 48 |
12 | Overcast | Mild | High | Strong | 52 |
13 | Overcast | Hot | Normal | Weak | 44 |
14 | Rain | Mild | High | Strong | 30 |
And we’ve built the following decision tree.
This duty is handled by the buildDecisionTree function. We pass the data set, the number of inline tabs (this is important in Python; it is increased on every inner call and restored afterwards) and the file name in which to store decision rules.
root = 1
buildDecisionTree(df, root, "rules0.py") #generate rules0.py
Running this decision tree algorithm for the data set generates the following decision rules.
def findDecision(obj): #obj = [Outlook, Temperature, Humidity, Wind]
    if obj[0] == 'Rain':
        if obj[3] == 'Weak':
            return 47.666666666666664
        if obj[3] == 'Strong':
            return 26.5
    if obj[0] == 'Sunny':
        if obj[1] == 'Hot':
            return 27.5
        if obj[1] == 'Mild':
            return 41.5
        if obj[1] == 'Cool':
            return 38
    if obj[0] == 'Overcast':
        return 46.25
Building this decision tree was covered in a previous post. That’s why, I skipped how the tree is built. If it is hard to understand, I strongly recommend you to read that post.
Let's check the Day 1 and Day 2 instances. Both have sunny outlook and hot temperature. The built decision tree says the decision is 27.5 for sunny outlook and hot temperature. However, day 1 should be 25 and day 2 should be 30. This means the error (or residual) is 25 – 27.5 = -2.5 for day 1 and 30 – 27.5 = +2.5 for day 2. The following days have similar errors. We will boost on these errors.
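The day-1 and day-2 arithmetic can be checked in a couple of lines:

```python
# Residual = actual - prediction; the tree outputs 27.5 for both days.
actuals = [25, 30]
prediction = 27.5
residuals = [y - prediction for y in actuals]
print(residuals)  # [-2.5, 2.5]
```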
This is not a must, but we will use squared error as the loss function.
loss = (1/2) x (y – y’)^{2}
where y is the actual value and y’ is the prediction.
Gradient refers to gradient descent in gradient boosting. We will update each prediction using the partial derivative of the loss function with respect to the prediction. Let's find this derivative first.
∂ loss / ∂ y’ = ∂((1/2) . (y – y’)^{2}) / ∂y’ = 2 . (1/2) . (y – y’) . ∂(y – y’)/∂y’ = (y – y’) . (-1) = y’ – y
Now, we can update predictions by applying the following formula. Here, α is learning rate.
y’ = y’ – α . (∂ loss / ∂ y’)
Please focus on the updating term only. I set α to 1 to make formula simpler.
– α . (∂ loss / ∂ y’) = – α . (y’ – y) = α . (y – y’) = y – y’
This residual is the label on which we will build the new decision tree.
Remember that error was -2.5 for day 1 and +2.5 for day 2. Similarly, we’ll find the errors based on the built decision tree’s results and actual labels for the following days.
import rules0 as myrules

for i, instance in df.iterrows():
    params = [] #features for current line stored in params list
    for j in range(0, columns-1):
        params.append(instance[j])
    prediction = myrules.findDecision(params) #apply rules(i-1) for data(i-1)
    actual = instance[columns-1]
    gradient = actual - prediction
    instance[columns-1] = gradient
    df.loc[i] = instance
#end of for loop

df.to_csv("data1.py", index=False)
Then, a new data set is created, with the residual of each line set as its decision column.
Day | Outlook | Temp. | Humidity | Wind | Decision |
1 | Sunny | Hot | High | Weak | -2.5 |
2 | Sunny | Hot | High | Strong | 2.5 |
3 | Overcast | Hot | High | Weak | -0.25 |
4 | Rain | Mild | High | Weak | -2.66 |
5 | Rain | Cool | Normal | Weak | 4.333 |
6 | Rain | Cool | Normal | Strong | -3.5 |
7 | Overcast | Cool | Normal | Strong | -3.25 |
8 | Sunny | Mild | High | Weak | -6.5 |
9 | Sunny | Cool | Normal | Weak | 0 |
10 | Rain | Mild | Normal | Weak | -1.66 |
11 | Sunny | Mild | Normal | Strong | 6.5 |
12 | Overcast | Mild | High | Strong | 5.75 |
13 | Overcast | Hot | Normal | Weak | -2.25 |
14 | Rain | Mild | High | Strong | 3.55 |
Now, it is time to build a new decision tree based on the data set above. The following code block will generate decision rules for the current data frame.
root = 1
buildDecisionTree(df, root, "rules1.py")
Running the regression tree algorithm creates the following decision rules.
def findDecision(Outlook, Temperature, Humidity, Wind):
    if Wind == 'Weak':
        if Temperature == 'Hot':
            return -1.6666666666666667
        if Temperature == 'Mild':
            return -3.6111111111111094
        if Temperature == 'Cool':
            return 2.166666666666668
    if Wind == 'Strong':
        if Temperature == 'Mild':
            return 5.25
        if Temperature == 'Cool':
            return -3.375
        if Temperature == 'Hot':
            return 2.5
Let's look at the predictions for day 1 and day 2 again. The newly built tree says that day 1 has weak wind and hot temperature, so its prediction is -1.666, but its actual value was -2.5 in the 2nd data set. This means the error is -2.5 – (-1.666) = -0.833.
Similarly, the tree says that day 2 has strong wind and hot temperature, so it is predicted as 2.5, and its actual value is 2.5 as well. In this case the error is 2.5 – 2.5 = 0. In this way, I calculate each instance's prediction and subtract it from the actual value again.
Day | Outlook | Temp. | Humidity | Wind | Decision
1 | Sunny | Hot | High | Weak | -0.833 |
2 | Sunny | Hot | High | Strong | 0.0 |
3 | Overcast | Hot | High | Weak | 1.416 |
4 | Rain | Mild | High | Weak | 0.944 |
5 | Rain | Cool | Normal | Weak | 2.166 |
6 | Rain | Cool | Normal | Strong | -0.125 |
7 | Overcast | Cool | Normal | Strong | 0.125 |
8 | Sunny | Mild | High | Weak | -2.888 |
9 | Sunny | Cool | Normal | Weak | -2.166 |
10 | Rain | Mild | Normal | Weak | 1.944 |
11 | Sunny | Mild | Normal | Strong | 1.25 |
12 | Overcast | Mild | High | Strong | 0.5 |
13 | Overcast | Hot | Normal | Weak | -0.583 |
14 | Rain | Mild | High | Strong | -1.75 |
This time, the following rules will be created for the data set above.
def findDecision(Outlook, Temperature, Humidity, Wind):
    if Outlook == 'Rain':
        if Wind == 'Weak':
            return 1.685185185185186
        if Wind == 'Strong':
            return -0.9375
    if Outlook == 'Sunny':
        if Wind == 'Weak':
            return -1.962962962962964
        if Wind == 'Strong':
            return 0.625
    if Outlook == 'Overcast':
        return 0.3645833333333334
I skipped epochs 3 to 5 because the same procedure is applied in each step.
Thereafter, I summarize each epoch's predictions in the table shown below. Predictions are calculated cumulatively: summing the values from epoch 1 to epoch 5 in each line gives the final prediction.
Day | Actual | epoch 1 | epoch 2 | epoch 3 | epoch 4 | epoch 5 | prediction |
1 | 25 | 27.5 | -1.667 | -1.963 | 0.152 | 5.55E-17 | 24.023 |
2 | 30 | 27.5 | 2.5 | 0.625 | 0.152 | 5.55E-17 | 30.777 |
3 | 46 | 46.25 | -1.667 | 0.365 | 0.152 | 5.55E-17 | 45.1 |
4 | 45 | 47.667 | -3.611 | 1.685 | -0.586 | -1.88E-01 | 44.967 |
5 | 52 | 47.667 | 2.167 | 1.685 | 0.213 | 1.39E-17 | 51.731 |
6 | 23 | 26.5 | -3.375 | -0.938 | 0.213 | 1.39E-17 | 22.4 |
7 | 43 | 46.25 | -3.375 | 0.365 | 0.213 | 1.39E-17 | 43.452 |
8 | 35 | 41.5 | -3.611 | -1.963 | -0.586 | -7.86E-02 | 35.261 |
9 | 38 | 38 | 2.167 | -1.963 | 0.213 | 1.39E-17 | 38.416 |
10 | 46 | 47.667 | -3.611 | 1.685 | 0.442 | -1.88E-01 | 45.995 |
11 | 48 | 41.5 | 5.25 | 0.625 | 0.442 | -7.86E-02 | 47.739 |
12 | 52 | 46.25 | 5.25 | 0.365 | -0.586 | 7.21E-01 | 52 |
13 | 44 | 46.25 | -1.667 | 0.365 | 0.152 | 5.55E-17 | 45.1 |
14 | 30 | 26.5 | 5.25 | -0.938 | -0.586 | -1.88E-01 | 30.038 |
For instance, predictions will be changed over epoch as illustrated below.
1st Epoch = 27.5
2nd Epoch = 27.5 – 1.667 = 25.833
3rd Epoch = 27.5 – 1.667 – 1.963 = 23.87
4th Epoch = 27.5 – 1.667 – 1.963 + 0.152 = 24.022
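The running sum above is just an accumulation over the per-epoch outputs:

```python
# Day 1's per-epoch outputs from the table (epoch 5's contribution is ~0).
epoch_outputs = [27.5, -1.667, -1.963, 0.152]
prediction = 0.0
for out in epoch_outputs:
    prediction += out
print(round(prediction, 3))  # 24.022
```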
The absolute error was |25 – 27.5| = 2.5 in the 1st round for the 1st day, but we reduced it to |25 – 24.023| = 0.97 in the 5th round. As seen, each instance's prediction gets closer to its actual value as it is boosted.
BTW, learning rate (α) and number of iterations (epoch) should be tuned for different problems.
I pivot the mean absolute error value for each epoch.
Day | epoch 1 | epoch 2 | epoch 3 | epoch 4 | epoch 5 |
1 | 2.5 | 0.833 | 1.13 | 0.977 | 0.977 |
2 | 2.5 | 0 | 0.625 | 0.777 | 0.777 |
3 | 0.25 | 1.417 | 1.052 | 0.9 | 0.9 |
4 | 2.667 | 0.944 | 0.741 | 0.155 | 0.033 |
5 | 4.333 | 2.167 | 0.481 | 0.269 | 0.269 |
6 | 3.5 | 0.125 | 0.813 | 0.6 | 0.6 |
7 | 3.25 | 0.125 | 0.24 | 0.452 | 0.452 |
8 | 6.5 | 2.889 | 0.926 | 0.34 | 0.261 |
9 | 0 | 2.167 | 0.204 | 0.416 | 0.416 |
10 | 1.667 | 1.944 | 0.259 | 0.183 | 0.005 |
11 | 6.5 | 1.25 | 0.625 | 0.183 | 0.261 |
12 | 5.75 | 0.5 | 0.135 | 0.721 | 0 |
13 | 2.25 | 0.583 | 0.948 | 1.1 | 1.1 |
14 | 3.5 | 1.75 | 0.813 | 0.227 | 0.038 |
MAE | 3.011111 | 1.112963 | 0.599383 | 0.48669 | 0.406115 |
The result seems interesting when I plot the total error over epochs.
We can definitely say that boosting works well.
So, the intuition behind gradient boosting was covered in this post. XGBoost, LightGBM and CatBoost are common variants of gradient boosting. Even though decision trees are very powerful machine learning algorithms, a single tree is not strong enough for applied machine learning studies. However, experiments show that the sequential form, GBM, dominates most applied ML challenges. I pushed the core implementation of the gradient boosted regression tree algorithm to GitHub.
The post A Step by Step Gradient Boosting Decision Tree Example appeared first on Sefik Ilkin Serengil.
You might remember the Iris flower data set. It has 150 instances, each with length and width measurements of the sepal and petal plus the corresponding class. The class can be one of 3 iris flower types: setosa, versicolor and virginica. So there are 4 input features and 3 output labels. Let's create a hidden layer of 4 nodes in the neural network; I mostly pick this number as 2/3 of the sum of features and labels. Multi-class classification requires cross-entropy as the loss function. Also, I apply the Adam optimization algorithm to converge faster.
import keras
from keras.models import Sequential
from keras.layers import Dense, Activation

num_classes = 3 #setosa, versicolor and virginica

def createNetwork():
    model = Sequential()
    model.add(Dense(4 #num of hidden units
        , input_shape=(4,))) #num of features in input layer
    model.add(Activation('sigmoid')) #activation function from input layer to 1st hidden layer
    model.add(Dense(num_classes)) #num of classes in output layer
    model.add(Activation('sigmoid')) #activation function from 1st hidden layer to output layer
    return model

model = createNetwork()
model.compile(loss='categorical_crossentropy'
    , optimizer=keras.optimizers.Adam(lr=0.007)
    , metrics=['accuracy'])
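The 2/3 sizing heuristic mentioned above can be sanity-checked in a line or two; the int truncation matches the 4 hidden units chosen here:

```python
# 2/3 of (features + labels) for the iris setup: 4 features, 3 labels.
num_features, num_labels = 4, 3
hidden_units = int((num_features + num_labels) * 2 / 3)
print(hidden_units)  # 4
```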
Even though this data set is small enough to fit in memory, we will load it in sub data sets instead of all at once. In this way, we save memory. On the other hand, this increases I/O usage, but that is a reasonable trade-off because massive data sets cannot be stored in memory.
The chunk size parameter is set to 30, so we read 30 lines of the data set in each iteration. Moreover, column information is missing in the data set, so we need to define column names; otherwise pandas treats the first row as column names and we lose that line's information.
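As a quick illustration of chunked reading, here is the same idea on a tiny stand-in CSV (the post itself reads iris.data with chunksize=30):

```python
import io
import pandas as pd

# Five data rows read two at a time -> chunks of size 2, 2 and 1.
csv = io.StringIO("a,b\n1,2\n3,4\n5,6\n7,8\n9,10\n")
chunk_lengths = [len(chunk) for chunk in pd.read_csv(csv, chunksize=2)]
print(chunk_lengths)  # [2, 2, 1]
```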
import pandas as pd
import numpy as np

chunk_size = 30

def processDataset():
    for chunk in pd.read_csv("iris.data", chunksize=chunk_size
        , names = ["sepal_length", "sepal_width", "petal_length", "petal_width", "class"]):
        current_set = chunk.values #convert df to numpy array
The chunk variable is a pandas data frame. We can convert it to a numpy array by taking its values. This is important because the fit operation expects features and labels as numpy arrays.
A line of the data set consists of 4 measurements of a flower and the corresponding class. I can separate features and label by specifying index values.
features = current_set[:, 0:4]
labels = current_set[:, 4]
Labels are in a single column and of type string. I apply one-hot encoding to feed the network.
for i in range(0, labels.shape[0]):
    if labels[i] == 'Iris-setosa':
        labels[i] = 0
    elif labels[i] == 'Iris-versicolor':
        labels[i] = 1
    elif labels[i] == 'Iris-virginica':
        labels[i] = 2

labels = keras.utils.to_categorical(labels, num_classes)
Features and labels are ready; we can feed them to the neural network. The epochs parameter must be set to 1 here. This is important: epochs are handled in a for loop at the top level.
model.fit(features, labels, epochs=1, verbose=0) #epochs handled in the for loop above
Processing the whole training set is done when the processDataset() call finishes. Remember back-propagation and the gradient descent algorithm: we need to repeat this processing over and over.
epochs = 1000
for epoch in range(0, epochs): #epochs should be handled here, not in the fit command!
    processDataset()
If you set verbose to 1, you will see loss values for the current sub data set. You should ignore the loss during training because it does not represent the global loss of the training set.
So, we've adapted pandas to read a massive data set in small chunks and feed neural network training. It comes with pros and cons: the main advantage is that we can handle a massive data set while saving memory; the disadvantage is increased I/O usage. Note that the focus of this post is working on massive data sets; it is neither big data nor streaming data. I've pushed the source code of this post to GitHub.
The post Large Scale Machine Learning with Pandas appeared first on Sefik Ilkin Serengil.
In this case, we can just run the code; no prerequisite installation is required. I will create a hello.html file and reference the TensorFlow.js library in the head tag. This reference exposes TensorFlow-related objects under the tf variable. There might be a more up-to-date version of the library; you should check the official site.
Also, I need to define another script tag after the TensorFlow.js reference. I will construct the neural network there.
<html>
<head>
<script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs@0.12.5"></script>
<!-- Place your code in the script tag below -->
<script>
</script>
</head>
<body>
</body>
</html>
I will construct a model for the XOR problem. Let's create the data set first. Here, xtrain stores all potential inputs whereas ytrain stores the XOR logic gate results, one-hot encoded: [1, 0] refers to the output 0 whereas [0, 1] refers to the output 1.
const xtrain = tf.tensor2d([[0, 0], [0, 1], [1, 0], [1, 1]]);
const ytrain = tf.tensor2d([[1, 0], [0, 1], [0, 1], [1, 0]]);
We can now construct a neural network model. I will create a sequential model. The input layer consists of 2 nodes because there are 2 input features in the xor data set. The first and only hidden layer will have 5 nodes with a sigmoid activation function. Finally, the output layer will have 2 nodes because the xor data set has 2 output classes. The activation function of the output layer should be softmax because this is a classification problem.
const model = tf.sequential();
model.add(tf.layers.dense({units: 5, activation: 'sigmoid', inputShape: [2]}));
model.add(tf.layers.dense({units: 2, activation: 'softmax'}));
Now, we can specify the optimization algorithm and loss function to train the model. You have to use the categorical crossentropy loss function if you use softmax in the output layer. Moreover, I would like to train the model with the Adam optimization algorithm to learn faster.
var learning_rate = 0.1
model.compile({loss: 'categoricalCrossentropy', optimizer: tf.train.adam(learning_rate)});
Time to train the network. You might remember that in Python we run fitting and prediction sequentially. Here, running is a little different: the fit command is handled asynchronously. That's why you must not run the fit and predict commands as separate statements, as demonstrated below; otherwise the predict command dumps results before training finishes.
//you should not run the prediction in this way
const history = model.fit(xtrain, ytrain, {epochs: 200})
console.log("fit is over")
model.predict(xtrain).print();
The prediction should instead run in the fit command's then callback, as illustrated below.
const history = model.fit(xtrain, ytrain, {epochs: 200})
.then(() => {
    console.log("fit is over")
    //model.predict(tf.tensor2d([[0, 0], [0, 1], [1, 0], [1, 1]])).print();
    model.predict(xtrain).print();
});
Coding is over for the client side solution. Now you can open the hello.html file in the browser. Do not be surprised when you see a blank page. You can see the final predictions by pressing F12 in Chrome, or under Settings (3 dots at the top right) > More tools > Developer tools > Console tab.
So, we can successfully run machine learning in the browser as shown above. But TensorFlow.js goes beyond ML in the browser. Let's see how.
Server side capabilities were recently enabled for JavaScript in Node.js. We can run (almost) the same code on a Node.js server. In this case, you have to install Node.js on your computer. I installed version 8.11.4, the recommended one as of today. You can run the node command in the command prompt after installation.
You should run the following command the first time you use Node.js in a directory. It creates a package.json file in the current directory; otherwise the TensorFlow.js installation would not complete successfully. BTW, I ran the command on my desktop.
npm init
You can install the TensorFlow.js package after initialization. Note that the save flag takes a double dash, even though it may render as a single dash in the browser.
npm install @tensorflow/tfjs --save
That's it! Your environment is ready. Please create a hello.js file. The content of the file will look like this.
var tf = require('@tensorflow/tfjs');

const model = tf.sequential();
model.add(tf.layers.dense({units: 5, activation: 'sigmoid', inputShape: [2]}));
model.add(tf.layers.dense({units: 2, activation: 'softmax'}));
model.compile({loss: 'categoricalCrossentropy', optimizer: tf.train.adam(0.1)});

const xtrain = tf.tensor2d([[0, 0], [0, 1], [1, 0], [1, 1]]);
const ytrain = tf.tensor2d([[1, 0], [0, 1], [0, 1], [1, 0]]);

const history = model.fit(xtrain, ytrain, {epochs: 200})
.then(() => {
    console.log("fit is over")
    //model.predict(tf.tensor2d([[0, 0], [0, 1], [1, 0], [1, 1]])).print();
    model.predict(xtrain).print();
});
As seen, we've run the same code. The model has learnt the XOR logic gate successfully.
So, we have covered the JavaScript version of TensorFlow in this post. TensorFlow is not just a tool for research. For instance, Facebook developed both the PyTorch and Caffe2 frameworks for deep learning, but uses PyTorch for research whereas it uses Caffe2 for production. On the other hand, Google enabled TensorFlow for both research and production. It seems that we will see TensorFlow.js much more commonly in the coming days.
The post A Beginner's Guide to TensorFlow.js: Machine Learning in JavaScript appeared first on Sefik Ilkin Serengil.
Rewarding branches based on profits alone might not be fair, because some branches have higher profits while others have more customers; this rewards the lucky ones. You might apply unsupervised learning and create clusters based on profitability, turnover, transaction volumes, customer counts or region. It is like customer segmentation. Then you should evaluate each branch relative to its own cluster. In this way, each branch competes against same-weight competitors; otherwise it would be like putting a lightweight boxer in front of a heavyweight one. In fact, there can be several champions across different weight classes.
This is a rare event detection problem. Classifiers expect balanced data during training to produce satisfactory results, but we cannot always have balanced data. Firstly, you can feed fewer, randomly selected legal instances to decrease the number of non-fraud transactions. This is called sub-sampling, but it loses important data, so we would not often prefer it. Secondly, we can increase the number of fraud transactions by creating synthetic fraud data. For example, you can pick two random existing fraud instances, calculate the average of their transaction amounts, and assign that average to a new instance. This is called over-sampling. It increases the number of fraud instances and might be preferable to sub-sampling for the fraud case, but it is still dangerous because it feeds non-existing data to the model. It is like having imaginary friends!
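The averaging step described above can be sketched in a few lines; the fraud amounts are hypothetical and each instance is reduced to a single transaction amount for illustration:

```python
import random

# Pick two random existing fraud amounts and average them into a synthetic one.
fraud_amounts = [1200.0, 800.0, 1500.0]
random.seed(0)  # seeded only to make the sketch reproducible
a, b = random.sample(fraud_amounts, 2)
synthetic_amount = (a + b) / 2  # label the new instance as fraud
print(min(fraud_amounts) <= synthetic_amount <= max(fraud_amounts))  # True
```

By construction, the synthetic amount always stays inside the range of the real fraud amounts.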
We can ignore the fraud mark and treat the problem as anomaly detection. However, we should work on each customer's transactions individually. Suppose the transactions of a customer (e.g. named Sefik) follow a normal distribution. The mean (µ) and standard deviation (σ) of the transaction amounts will enlighten us. We already know that 3 standard deviations around the mean (µ ± 3σ) covers 99.7% of the area. We can apply this logic to a customer's transactions. For example, if a customer spends 100$ on average with a standard deviation of 10$, then 99.7% of expenses must be between 70$ and 130$. You can mark any transaction of that customer greater than 130$ as abnormal. It might not be fraud, but it is still abnormal. In this way, we can form an opinion about unmarked transactions. BTW, you can widen the band: µ ± 6σ covers nearly the entire distribution.
So far we have considered the problem over transaction amount only. We can increase the dimensions by adding information such as time and location.
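A minimal sketch of the µ ± 3σ rule above, with hypothetical per-customer transaction amounts:

```python
import statistics

# Fit mu and sigma on a customer's past transaction amounts.
history = [95, 100, 105, 98, 102, 100, 99, 101]
mu = statistics.mean(history)
sigma = statistics.stdev(history)

def is_abnormal(amount):
    # Flag anything outside the mu +/- 3 sigma band.
    return abs(amount - mu) > 3 * sigma

print(is_abnormal(102))  # False: an everyday amount
print(is_abnormal(250))  # True: far beyond the 3-sigma band
```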
Some machine learning models, such as neural networks or support vector machines, produce opaque models: their decisions cannot be read and understood by a human, because everything is handled in a black box. On the other hand, a decision tree algorithm produces transparent decisions that a human can read and follow step by step. For example, look at the following decision tree: if your decision were to accept the offer, you could explain why — the company offers free coffee, commutation does not last more than 1 hour, and the salary is greater than 50K.
That's why you have to build a decision tree for credit decisioning. Herein, the most common decision tree algorithms for classification are ID3, C4.5 and CART; CART can also be adapted for regression problems.
Either you solved an insignificant problem, like counting how many legs a cow has, or you overfitted — most probably the latter. Even the most advanced AI models and intelligent life forms fail; you should never expect 100% accuracy. Just as senior developers do not expect a new program to work bug-free on the first run (that only makes junior developers happy), machine learning practitioners should never expect 100%. If you still believe you can solve a problem with 100% accuracy, then it is automation: you can create a rule-based model and there is no need for AI.
Remember the fraud detection data set. Suppose there are 1M legal transactions and 100 fraud transactions: 99.99% of the data set is legal and 0.01% is fraud. In this case, you can get 99.99% accuracy just by returning not-fraud by default. Is this a success? Of course not! What matters is how many of the truly fraudulent instances you classify correctly. The confusion matrix and ROC curve become important instead of overall accuracy. If the true positive and true negative rates are both close to 100%, that is a good job.
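The arithmetic behind that trap is simple enough to check directly:

```python
# A model that always answers "not fraud" on the imbalanced set above
# still scores 99.99% accuracy while catching zero fraud cases.
legal, fraud = 1_000_000, 100
accuracy = legal / (legal + fraud)
true_positives = 0  # the always-not-fraud model never flags fraud
print(round(accuracy * 100, 2))  # 99.99
```

The confusion matrix exposes what accuracy hides: the true positive count on fraud is zero.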
Besides, if your problem concerns human health, then 99.9% accuracy means causing the death of 1 person in every 1000. So metrics have different meanings for different problems.
Funny, but it includes regression, classification and clustering. Predicting the weather temperature in Fahrenheit or Celsius is regression, because continuous outputs are produced. Classifying the weather as partly sunny, raining or snowing is classification, because there are a limited number of classes. Finally, it includes unsupervised learning: clustering cities/states based on geographic location.
If you run a decision tree algorithm, it tends to overfit on large-scale data sets. A basic remedy is random forest: it separates the data set into several sub data sets (the count is often a prime number) and creates a different decision tree for each. The decisions of these trees together specify the global decision. Moreover, you can apply pruning to avoid overfitting.
On the other hand, neural networks are based on updating weights over epochs. You should monitor the training set and validation set error over epochs: training set error decreases over iterations, but if validation set error starts to increase at some epoch, you should terminate training there. Also, if you created a really complex neural network model (input features, number of hidden layers and nodes), you might re-design a less complex one.
This question might seem very easy, but it is a tricky one. Traditional developers tend to design this kind of system with for loops.
import numpy as np

inputs = np.array([1, 0, 1])
weights = np.array([0.3, 0.8, 0.4])

sum = 0
for i in range(inputs.shape[0]):
    sum = sum + inputs[i] * weights[i]
print(sum)
However, machine learning practitioners must not apply this approach; they have to apply matrix multiplication instead, because the vectorized solution speeds up processing by roughly 150 times.
import numpy as np

inputs = np.array([1, 0, 1])
weights = np.array([0.3, 0.8, 0.4])

sum = np.matmul(np.transpose(weights), inputs)
print(sum)
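The speed-up claim can be checked on a larger vector; the exact factor depends on the machine and vector size, so this sketch only asserts that the two versions agree and that the vectorized one wins:

```python
import timeit
import numpy as np

inputs = np.random.rand(100_000)
weights = np.random.rand(100_000)

def loop_version():
    total = 0.0
    for i in range(inputs.shape[0]):
        total += inputs[i] * weights[i]
    return total

def vectorized_version():
    return float(np.matmul(weights, inputs))

# Both compute the same dot product...
assert abs(loop_version() - vectorized_version()) < 1e-6
# ...but the vectorized one finishes far sooner.
print(timeit.timeit(vectorized_version, number=5) < timeit.timeit(loop_version, number=5))  # True
```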
Your data set can have thousands of features. Feeding all of them produces a much more complex model: training lasts longer and it tends to overfit. Dropping some features reduces complexity and speeds up training, but we might lose significant information. Autoencoders are a typical way to represent the data in fewer dimensions. Thus you can compress the data (lossily), getting a less complex model and faster training without throwing information away wholesale as dropping does.
Besides, face recognition technology and art style transfer techniques are mainly based on dimension reduction and auto-encoders.
So, I collected some job interview questions asked of data scientists and machine learning practitioners, and tried to answer them. The responses reflect my personal opinions; you might find some answers true or partially false. These questions are asked to test a candidate's solution approach. In other words, the solution approach matters more than the exact answer.
The post 10 Interview Questions Asked in Machine Learning appeared first on Sefik Ilkin Serengil.