The post Apparent Age and Gender Prediction in Keras appeared first on Sefik Ilkin Serengil.
The original work consumed face pictures collected from IMDB (7 GB) and Wikipedia (1 GB). You can find these data sets here. In this post, I will consume only the wiki data source to develop a solution fast. You should download the faces-only files.
Extracting wiki_crop.tar creates 100 folders and an index file (wiki.mat). The index file is saved in Matlab format. We can read Matlab files in Python with SciPy.
import scipy.io

mat = scipy.io.loadmat('wiki_crop/wiki.mat')
Converting it to a pandas data frame will make transformations easier.
import pandas as pd

instances = mat['wiki'][0][0][0].shape[1]
columns = ["dob", "photo_taken", "full_path", "gender", "name", "face_location", "face_score", "second_face_score"]

df = pd.DataFrame(index = range(0, instances), columns = columns)

for i in mat:
    if i == "wiki":
        current_array = mat[i][0][0]
        for j in range(len(current_array)):
            df[columns[j]] = pd.DataFrame(current_array[j][0])
The data set stores date of birth (dob) in Matlab datenum format. We need to convert this to Python datetime format. We just need the birth year.
from datetime import datetime, timedelta

def datenum_to_datetime(datenum):
    days = datenum % 1
    hours = days % 1 * 24
    minutes = hours % 1 * 60
    seconds = minutes % 1 * 60
    exact_date = datetime.fromordinal(int(datenum)) \
        + timedelta(days=int(days)) + timedelta(hours=int(hours)) \
        + timedelta(minutes=int(minutes)) + timedelta(seconds=round(seconds)) \
        - timedelta(days=366)
    return exact_date.year

df['date_of_birth'] = df['dob'].apply(datenum_to_datetime)
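As a quick sanity check of the 366-day offset (Matlab's datenum counts days from the year 0, while Python ordinals start at year 1), we can convert a known datenum by hand. January 1, 2000 corresponds to Matlab datenum 730486:

```python
from datetime import datetime, timedelta

# Matlab datenum 730486 corresponds to January 1, 2000
matlab_datenum = 730486.0

# drop the fractional part and subtract the 366-day offset
converted = datetime.fromordinal(int(matlab_datenum)) - timedelta(days=366)
print(converted.year)  # 2000
```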
Extracting date of birth from matlab datenum format
Now, we have both date of birth and photo taken time. Subtracting these values will give us the ages.
df['age'] = df['photo_taken'] - df['date_of_birth']
Some pictures in the wiki data set don't include people. For example, a vase picture exists in the data set. Moreover, some pictures include more than one person, and some are taken from a distance. The face score value helps us to understand whether the picture is clear or not. Also, age information is missing for some records. All of these might confuse the model, so we should ignore them. Finally, unnecessary columns should be dropped to occupy less memory.
import numpy as np

#remove pictures that do not include a face
df = df[df['face_score'] != -np.inf]

#some pictures include more than one face, remove them
df = df[df['second_face_score'].isna()]

#discard blurry or unclear faces below the score threshold
df = df[df['face_score'] >= 3]

#some records do not have gender information
df = df[~df['gender'].isna()]

df = df.drop(columns = ['name', 'face_score', 'second_face_score', 'date_of_birth', 'face_location'])
Some records have negative ages; dirty data might cause this. Moreover, some people seem to be older than 100. We should restrict the age prediction problem to the range 0 to 100 years.
#some seem to be older than 100. some of these are paintings, remove them
df = df[df['age'] <= 100]

#some seem to be unborn in the data set
df = df[df['age'] > 0]
The raw data set will look like the following data frame.
We can visualize the target label distribution.
histogram_age = df['age'].hist(bins=df['age'].nunique())
histogram_gender = df['gender'].hist(bins=df['gender'].nunique())
The full path column states the exact location of the picture on the disk. We need its pixel values.
from keras.preprocessing import image

target_size = (224, 224)

def getImagePixels(image_path):
    img = image.load_img("wiki_crop/%s" % image_path[0], grayscale=False, target_size=target_size)
    x = image.img_to_array(img).reshape(1, -1)[0]
    #x = preprocess_input(x)
    return x

df['pixels'] = df['full_path'].apply(getImagePixels)
We can extract the real pixel values of the pictures.
Age prediction is a regression problem, but the researchers defined it as a classification problem. There are 101 classes in the output layer for ages 0 to 100. They applied transfer learning for this duty; their choice was VGG trained on ImageNet.
The pandas data frame includes both input and output information for the age and gender prediction tasks. We should first focus on the age task.
import keras
import numpy as np

classes = 101 #0 to 100
target = df['age'].values
target_classes = keras.utils.to_categorical(target, classes)

features = []
for i in range(0, df.shape[0]):
    features.append(df['pixels'].values[i])

features = np.array(features)
features = features.reshape(features.shape[0], 224, 224, 3)
Also, we need to split the data set into training and testing sets.
from sklearn.model_selection import train_test_split

train_x, test_x, train_y, test_y = train_test_split(features, target_classes, test_size=0.30)
The final data set consists of 22578 instances. It is split into 15905 train instances and 6673 test instances.
As mentioned, the researchers used the VGG ImageNet model and tuned its weights for this data set. Herein, I prefer to use the VGG-Face model instead, because it is tuned for the face recognition task. In this way, the model can exploit patterns it has already learned from human faces.
from keras.models import Sequential
from keras.layers import Convolution2D, ZeroPadding2D, MaxPooling2D, Flatten, Dropout, Activation

#VGG-Face model
model = Sequential()
model.add(ZeroPadding2D((1,1), input_shape=(224, 224, 3)))
model.add(Convolution2D(64, (3, 3), activation='relu'))
model.add(ZeroPadding2D((1,1)))
model.add(Convolution2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D((2,2), strides=(2,2)))

model.add(ZeroPadding2D((1,1)))
model.add(Convolution2D(128, (3, 3), activation='relu'))
model.add(ZeroPadding2D((1,1)))
model.add(Convolution2D(128, (3, 3), activation='relu'))
model.add(MaxPooling2D((2,2), strides=(2,2)))

model.add(ZeroPadding2D((1,1)))
model.add(Convolution2D(256, (3, 3), activation='relu'))
model.add(ZeroPadding2D((1,1)))
model.add(Convolution2D(256, (3, 3), activation='relu'))
model.add(ZeroPadding2D((1,1)))
model.add(Convolution2D(256, (3, 3), activation='relu'))
model.add(MaxPooling2D((2,2), strides=(2,2)))

model.add(ZeroPadding2D((1,1)))
model.add(Convolution2D(512, (3, 3), activation='relu'))
model.add(ZeroPadding2D((1,1)))
model.add(Convolution2D(512, (3, 3), activation='relu'))
model.add(ZeroPadding2D((1,1)))
model.add(Convolution2D(512, (3, 3), activation='relu'))
model.add(MaxPooling2D((2,2), strides=(2,2)))

model.add(ZeroPadding2D((1,1)))
model.add(Convolution2D(512, (3, 3), activation='relu'))
model.add(ZeroPadding2D((1,1)))
model.add(Convolution2D(512, (3, 3), activation='relu'))
model.add(ZeroPadding2D((1,1)))
model.add(Convolution2D(512, (3, 3), activation='relu'))
model.add(MaxPooling2D((2,2), strides=(2,2)))

model.add(Convolution2D(4096, (7, 7), activation='relu'))
model.add(Dropout(0.5))
model.add(Convolution2D(4096, (1, 1), activation='relu'))
model.add(Dropout(0.5))
model.add(Convolution2D(2622, (1, 1)))
model.add(Flatten())
model.add(Activation('softmax'))
Load the pre-trained weights for VGG-Face model. You can find the related blog post here.
#pre-trained weights of vgg-face model.
#you can find it here: https://drive.google.com/file/d/1CPSeum3HpopfomUEK1gybeuIVoeJT_Eo/view?usp=sharing
#related blog post: https://sefiks.com/2018/08/06/deep-face-recognition-with-keras/
model.load_weights('vgg_face_weights.h5')
We should lock the weights of the early layers because they can already detect generic patterns. Fitting the network from scratch might cause this important information to be lost. I prefer to freeze all layers except the last 3 convolution layers (in other words, the last 7 model.add units). Also, I cut the last convolution layer because it has 2622 units; I need just 101 units (ages from 0 to 100) for the age prediction task. Then, I add a custom convolution layer consisting of 101 units.
from keras.models import Model

for layer in model.layers[:-7]:
    layer.trainable = False

base_model_output = Convolution2D(101, (1, 1), name='predictions')(model.layers[-4].output)
base_model_output = Flatten()(base_model_output)
base_model_output = Activation('softmax')(base_model_output)

age_model = Model(inputs=model.input, outputs=base_model_output)
This is a multi-class classification problem, so the loss function must be categorical crossentropy. The optimization algorithm will be Adam to converge faster. I create a checkpoint to monitor the model over iterations and avoid overfitting. The iteration with the minimum validation loss will hold the optimum weights. That's why I'll monitor validation loss and save the best weights only.
To avoid overfitting, I feed random 256 instances for each epoch.
from keras.callbacks import ModelCheckpoint

age_model.compile(loss='categorical_crossentropy', optimizer=keras.optimizers.Adam(), metrics=['accuracy'])

checkpointer = ModelCheckpoint(filepath='age_model.hdf5', monitor="val_loss", verbose=1, save_best_only=True, mode='auto')

scores = []
epochs = 250; batch_size = 256

for i in range(epochs):
    print("epoch ", i)
    ix_train = np.random.choice(train_x.shape[0], size=batch_size)
    score = age_model.fit(train_x[ix_train], train_y[ix_train], epochs=1, validation_data=(test_x, test_y), callbacks=[checkpointer])
    scores.append(score)
It seems that validation loss reaches its minimum. Increasing the number of epochs would cause overfitting.
We can evaluate the final model on the test set.
age_model.evaluate(test_x, test_y, verbose=1)
This gives the loss and accuracy values respectively for the 6673 test instances. It seems that we have the following results.
[2.871919590848929, 0.24298789490543357]
24% accuracy seems very low, right? Actually, it is not. Herein, the researchers developed an apparent age prediction approach that converts the classification output back to a regression value. They propose multiplying each softmax output by its class label; summing these multiplications gives the apparent age prediction.
This is a very easy operation in Python numpy.
predictions = age_model.predict(test_x)

output_indexes = np.array([i for i in range(0, 101)])
apparent_predictions = np.sum(predictions * output_indexes, axis = 1)
Herein, mean absolute error metric might be more meaningful to evaluate the system.
mae = 0

for i in range(0, apparent_predictions.shape[0]):
    prediction = int(apparent_predictions[i])
    actual = np.argmax(test_y[i])

    abs_error = abs(prediction - actual)
    mae = mae + abs_error

mae = mae / apparent_predictions.shape[0]

print("mae: ", mae)
print("instances: ", apparent_predictions.shape[0])
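The same metric can be computed without the loop. The arrays below are hypothetical stand-ins for apparent_predictions and the decoded test labels, just to show the vectorized form:

```python
import numpy as np

# hypothetical stand-ins for apparent_predictions and np.argmax(test_y, axis=1)
apparent_predictions = np.array([25.3, 40.1, 61.7])
actual_ages = np.array([24, 43, 60])

# truncate predictions to integers as in the loop above, then average the absolute errors
mae = np.mean(np.abs(apparent_predictions.astype(int) - actual_ages))
print(mae)
```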
Our apparent age prediction model predicts ages with a mean error of ±4.65 years. This is acceptable.
We can feel the power of the model when we feed custom images into it.
import numpy as np
from keras.preprocessing import image

def loadImage(filepath):
    test_img = image.load_img(filepath, target_size=(224, 224))
    test_img = image.img_to_array(test_img)
    test_img = np.expand_dims(test_img, axis = 0)
    test_img /= 255
    return test_img

picture = "marlon-brando.jpg"
prediction = age_model.predict(loadImage(picture))
The prediction variable stores the distribution over age classes. Monitoring it might be interesting.
import matplotlib.pyplot as plt

y_pos = np.arange(101)
plt.bar(y_pos, prediction[0], align='center', alpha=0.3)
plt.ylabel('percentage')
plt.title('age')
plt.show()
This is the age prediction distribution for Marlon Brando in The Godfather. The most dominant age class is 44, whereas the weighted age is 48, which is his exact age in 1972.
We'll calculate the apparent age from these age distributions.
img = image.load_img(picture)
plt.imshow(img)
plt.show()

print("most dominant age class (not apparent age): ", np.argmax(prediction))

apparent_age = np.round(np.sum(prediction * output_indexes, axis = 1))
print("apparent age: ", int(apparent_age[0]))
Results are very satisfactory even though the picture does not have a good perspective. Marlon Brando was 48 and Al Pacino was 32 in The Godfather Part I.
Apparent age prediction is a challenging problem. Gender prediction, however, is much easier.
We’ll apply binary encoding to target gender class.
target = df['gender'].values
target_classes = keras.utils.to_categorical(target, 2)
We then just need to put 2 classes in the output layer for man and woman.
for layer in model.layers[:-7]:
    layer.trainable = False

base_model_output = Convolution2D(2, (1, 1), name='predictions')(model.layers[-4].output)
base_model_output = Flatten()(base_model_output)
base_model_output = Activation('softmax')(base_model_output)

gender_model = Model(inputs=model.input, outputs=base_model_output)
Now, the model is ready to fit.
gender_model.compile(loss='categorical_crossentropy', optimizer=keras.optimizers.Adam(), metrics=['accuracy'])

#re-split with the gender targets
train_x, test_x, train_y, test_y = train_test_split(features, target_classes, test_size=0.30)

checkpointer = ModelCheckpoint(filepath='gender_model.hdf5', monitor="val_loss", verbose=1, save_best_only=True, mode='auto')

scores = []
epochs = 250; batch_size = 256

for i in range(epochs):
    print("epoch ", i)
    ix_train = np.random.choice(train_x.shape[0], size=batch_size)
    score = gender_model.fit(train_x[ix_train], train_y[ix_train], epochs=1, validation_data=(test_x, test_y), callbacks=[checkpointer])
    scores.append(score)
It seems that the model is saturated; terminating training here is reasonable.
gender_model.evaluate(test_x, test_y, verbose=1)
The model has the following loss and accuracy on the test set. They are really satisfactory.
[0.07324957040103375, 0.9744245524655362]
Gender prediction is a genuine classification problem, in contrast to age prediction. Accuracy should not be the only metric we monitor; precision and recall should also be checked.
from sklearn.metrics import classification_report, confusion_matrix

predictions = gender_model.predict(test_x)

pred_list = []; actual_list = []

for i in predictions:
    pred_list.append(np.argmax(i))

for i in test_y:
    actual_list.append(np.argmax(i))

confusion_matrix(actual_list, pred_list)
The model generates the following confusion matrix.
                 Predicted Female   Predicted Male
Actual Female          1873                98
Actual Male              72              4604
This means that we have 96.30% precision and 95.03% recall for the female class. These metrics are as satisfactory as the accuracy.
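We can recompute these metrics directly from the confusion matrix above, taking the female class as the positive one:

```python
# counts from the confusion matrix above (female taken as the positive class)
tp, fn = 1873, 98    # actual female: predicted female / predicted male
fp, tn = 72, 4604    # actual male: predicted female / predicted male

precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(round(precision, 4), round(recall, 4))
```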
We just need to feed images to the model.
picture = "katy-perry.jpg"
prediction = gender_model.predict(loadImage(picture))

img = image.load_img(picture)
plt.imshow(img)
plt.show()

gender = "Male" if np.argmax(prediction) == 1 else "Female"
print("gender: ", gender)
So, we've built apparent age and gender predictors from scratch based on the research article of the computer vision group of ETH Zurich. In particular, the way they propose to calculate apparent age is a novel, well-performing method. Deep learning really has limitless power for learning.
I pushed the source code for both apparent age prediction and gender prediction to GitHub. You might want to just use the pre-trained weights; I also put pre-trained weights for the age and gender tasks on Google Drive.
The post Twisted Edwards Curves appeared first on Sefik Ilkin Serengil.
Twisted Edwards curves look like a bird's-eye view of a roundabout intersection of a road.
Regular Edwards curves are a special form of twisted Edwards curves where a = 1. We can prove the addition formula of the twisted ones similarly. Besides, the proof for twisted Edwards curves will also cover the regular Edwards form that Bernstein and Lange simplified.
Suppose that (x_{1}, y_{1}) and (x_{2}, y_{2}) are points on the curve ax^{2} + y^{2} = 1 + dx^{2}y^{2}. In this case, (x_{3}, y_{3}) derived from the following formula will be on the same curve.
x_{3} = (x_{1}y_{2} + y_{1}x_{2})/(1 + dx_{1}x_{2}y_{1}y_{2})
y_{3} = (y_{1}y_{2} – ax_{1}x_{2})/(1 – dx_{1}x_{2}y_{1}y_{2})
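Before diving into the algebraic proof, we can check the formula numerically over a small prime field. The parameters p = 13, a = 4 (a square mod 13) and d = 2 (a non-square mod 13) are arbitrary choices for illustration; with a a square and d a non-square, the addition law is complete and the denominators never vanish:

```python
# toy twisted Edwards curve over F_13 for a numeric check of the addition law
p = 13
a, d = 4, 2  # a is a square mod 13, d is not, so the addition law is complete

# enumerate all points satisfying a*x^2 + y^2 = 1 + d*x^2*y^2 (mod p)
points = [(x, y) for x in range(p) for y in range(p)
          if (a*x*x + y*y) % p == (1 + d*x*x*y*y) % p]

def add(P, Q):
    x1, y1 = P; x2, y2 = Q
    x3 = (x1*y2 + y1*x2) * pow(1 + d*x1*x2*y1*y2, -1, p) % p
    y3 = (y1*y2 - a*x1*x2) * pow(1 - d*x1*x2*y1*y2, -1, p) % p
    return x3, y3

# the sum of any two curve points is again a curve point
closed = all(add(P, Q) in points for P in points for Q in points)
print(closed)
```

The point (0, 1) acts as the identity: adding it to any point gives that point back.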
We can validate the addition formula by putting the (x_{3}, y_{3}) values into the twisted Edwards curve equation.
ax_{3}^{2} + y_{3}^{2} = 1 + dx_{3}^{2}y_{3}^{2}
a(x_{1}y_{2} + y_{1}x_{2})^{2}/(1 + dx_{1}x_{2}y_{1}y_{2})^{2} + (y_{1}y_{2} – ax_{1}x_{2})^{2}/(1 – dx_{1}x_{2}y_{1}y_{2})^{2} = 1 + d(x_{1}y_{2} + y_{1}x_{2})^{2}(y_{1}y_{2} – ax_{1}x_{2})^{2}/(1 + dx_{1}x_{2}y_{1}y_{2})^{2}(1 – dx_{1}x_{2}y_{1}y_{2})^{2}
Make denominators same
a(x_{1}y_{2} + y_{1}x_{2})^{2}(1 – dx_{1}x_{2}y_{1}y_{2})^{2}/(1 + dx_{1}x_{2}y_{1}y_{2})^{2}(1 – dx_{1}x_{2}y_{1}y_{2})^{2} + (y_{1}y_{2} – ax_{1}x_{2})^{2}(1 + dx_{1}x_{2}y_{1}y_{2})^{2}/(1 – dx_{1}x_{2}y_{1}y_{2})^{2}(1 + dx_{1}x_{2}y_{1}y_{2})^{2} = (1 – dx_{1}x_{2}y_{1}y_{2})^{2}(1 + dx_{1}x_{2}y_{1}y_{2})^{2}/(1 – dx_{1}x_{2}y_{1}y_{2})^{2}(1 + dx_{1}x_{2}y_{1}y_{2})^{2} + d(x_{1}y_{2} + y_{1}x_{2})^{2}(y_{1}y_{2} – ax_{1}x_{2})^{2}/(1 + dx_{1}x_{2}y_{1}y_{2})^{2}(1 – dx_{1}x_{2}y_{1}y_{2})^{2}
Now that all denominators are the same, we can cancel them. Note that (1 + dx_{1}x_{2}y_{1}y_{2})^{2}(1 – dx_{1}x_{2}y_{1}y_{2})^{2} cannot be 0.
a(x_{1}y_{2} + y_{1}x_{2})^{2}(1 – dx_{1}x_{2}y_{1}y_{2})^{2} + (y_{1}y_{2} – ax_{1}x_{2})^{2}(1 + dx_{1}x_{2}y_{1}y_{2})^{2} = (1 – dx_{1}x_{2}y_{1}y_{2})^{2}(1 + dx_{1}x_{2}y_{1}y_{2})^{2} + d(x_{1}y_{2} + y_{1}x_{2})^{2}(y_{1}y_{2} – ax_{1}x_{2})^{2}
Set P = x_{1}x_{2}y_{1}y_{2} to express this complex equation more simply.
a(x_{1}y_{2} + y_{1}x_{2})^{2}(1 – dP)^{2} + (y_{1}y_{2} – ax_{1}x_{2})^{2}(1 + dP)^{2} = (1 – dP)^{2}(1 + dP)^{2} + d(x_{1}y_{2} + y_{1}x_{2})^{2}(y_{1}y_{2} – ax_{1}x_{2})^{2}
The term (1 – dP)^{2}(1 + dP)^{2} can also be written as [(1 – dP)(1 + dP)]^{2} = (1 – d^{2}P^{2})^{2}
a(x_{1}y_{2} + y_{1}x_{2})^{2}(1 – dP)^{2} + (y_{1}y_{2} – ax_{1}x_{2})^{2}(1 + dP)^{2} = (1 – d^{2}P^{2})^{2} + d(x_{1}y_{2} + y_{1}x_{2})^{2}(y_{1}y_{2} – ax_{1}x_{2})^{2}
Expand the squares except (1 – d^{2}P^{2})^{2}.
a(x_{1}^{2}y_{2}^{2} + y_{1}^{2}x_{2}^{2} + 2P)(1 + d^{2}P^{2} – 2dP) + (y_{1}^{2}y_{2}^{2} + a^{2}x_{1}^{2}x_{2}^{2} – 2aP)(1 + d^{2}P^{2} + 2dP) = (1 – d^{2}P^{2})^{2} + d(x_{1}^{2}y_{2}^{2} + y_{1}^{2}x_{2}^{2} + 2P)(y_{1}^{2}y_{2}^{2} + a^{2}x_{1}^{2}x_{2}^{2} – 2aP)
Focus on the left side. We can separate 1 + d^{2}P^{2} and 2dP multipliers.
(1 + d^{2}P^{2})(a(x_{1}^{2}y_{2}^{2} + y_{1}^{2}x_{2}^{2} + 2P) + y_{1}^{2}y_{2}^{2} + a^{2}x_{1}^{2}x_{2}^{2} – 2aP) + (2dP)(a(- x_{1}^{2}y_{2}^{2} – y_{1}^{2}x_{2}^{2} – 2P) + y_{1}^{2}y_{2}^{2} + a^{2}x_{1}^{2}x_{2}^{2} – 2aP)
Move the a multiplier into the parentheses.
(1 + d^{2}P^{2})(ax_{1}^{2}y_{2}^{2} + ay_{1}^{2}x_{2}^{2} + 2aP + y_{1}^{2}y_{2}^{2} + a^{2}x_{1}^{2}x_{2}^{2} – 2aP) + (2dP)(- ax_{1}^{2}y_{2}^{2} – ay_{1}^{2}x_{2}^{2} – 2aP + y_{1}^{2}y_{2}^{2} + a^{2}x_{1}^{2}x_{2}^{2} – 2aP)
The first parenthesis contains both +2aP and –2aP terms; they cancel.
(1 + d^{2}P^{2})(ax_{1}^{2}y_{2}^{2} + ay_{1}^{2}x_{2}^{2} + y_{1}^{2}y_{2}^{2} + a^{2}x_{1}^{2}x_{2}^{2}) + (2dP)(- ax_{1}^{2}y_{2}^{2} – ay_{1}^{2}x_{2}^{2} – 2aP + y_{1}^{2}y_{2}^{2} + a^{2}x_{1}^{2}x_{2}^{2} – 2aP)
We can rewrite the term (ax_{1}^{2}y_{2}^{2} + ay_{1}^{2}x_{2}^{2} + y_{1}^{2}y_{2}^{2} + a^{2}x_{1}^{2}x_{2}^{2}) as (ax_{1}^{2} + y_{1}^{2})(ax_{2}^{2} + y_{2}^{2}). Similarly, the term (- ax_{1}^{2}y_{2}^{2} – ay_{1}^{2}x_{2}^{2} – 2aP + y_{1}^{2}y_{2}^{2} + a^{2}x_{1}^{2}x_{2}^{2} – 2aP) can be rewritten as ((ax_{1}^{2} – y_{1}^{2})(ax_{2}^{2} – y_{2}^{2}) – 4aP).
(1 + d^{2}P^{2})(ax_{1}^{2} + y_{1}^{2})(ax_{2}^{2} + y_{2}^{2}) + (2dP)[(ax_{1}^{2} – y_{1}^{2})(ax_{2}^{2} – y_{2}^{2}) – 4aP].
Move 2dP into the parentheses.
(1 + d^{2}P^{2})(ax_{1}^{2} + y_{1}^{2})(ax_{2}^{2} + y_{2}^{2}) + (2dP)(ax_{1}^{2} – y_{1}^{2})(ax_{2}^{2} – y_{2}^{2}) – 8adP^{2}
Now, focus on the right side.
(1 – d^{2}P^{2})^{2} + d(x_{1}^{2}y_{2}^{2} + y_{1}^{2}x_{2}^{2} + 2P)(y_{1}^{2}y_{2}^{2} + a^{2}x_{1}^{2}x_{2}^{2} – 2aP)
Take the 2P terms out of the parentheses.
(1 – d^{2}P^{2})^{2} + d[(x_{1}^{2}y_{2}^{2} + y_{1}^{2}x_{2}^{2})(y_{1}^{2}y_{2}^{2} + a^{2}x_{1}^{2}x_{2}^{2}) + 2P(y_{1}^{2}y_{2}^{2} + a^{2}x_{1}^{2}x_{2}^{2} – ax_{1}^{2}y_{2}^{2} – ay_{1}^{2}x_{2}^{2}) – 4aP^{2}]
The term (y_{1}^{2}y_{2}^{2} + a^{2}x_{1}^{2}x_{2}^{2} – ax_{1}^{2}y_{2}^{2} – ay_{1}^{2}x_{2}^{2}) can be rewritten as (ax_{1}^{2} – y_{1}^{2})(ax_{2}^{2} – y_{2}^{2})
(1 – d^{2}P^{2})^{2} + d[(x_{1}^{2}y_{2}^{2} + y_{1}^{2}x_{2}^{2})(y_{1}^{2}y_{2}^{2} + a^{2}x_{1}^{2}x_{2}^{2}) + 2P(ax_{1}^{2} – y_{1}^{2})(ax_{2}^{2} – y_{2}^{2})- 4aP^{2}]
Also, (x_{1}^{2}y_{2}^{2} + y_{1}^{2}x_{2}^{2})(y_{1}^{2}y_{2}^{2} + a^{2}x_{1}^{2}x_{2}^{2}) can be rewritten as (x_{1}^{2}y_{1}^{2}y_{2}^{4} + a^{2}x_{1}^{4}x_{2}^{2}y_{2}^{2} + x_{2}^{2}y_{1}^{4}y_{2}^{2} + a^{2}x_{1}^{2}x_{2}^{4}y_{1}^{2})
(1 – d^{2}P^{2})^{2} + d[(x_{1}^{2}y_{1}^{2}y_{2}^{4} + a^{2}x_{1}^{4}x_{2}^{2}y_{2}^{2} + x_{2}^{2}y_{1}^{4}y_{2}^{2} + a^{2}x_{1}^{2}x_{2}^{4}y_{1}^{2}) + 2P(ax_{1}^{2} – y_{1}^{2})(ax_{2}^{2} – y_{2}^{2})- 4aP^{2}]
The left side of the equation has (1 + d^{2}P^{2}). We can refactor the term (1 – d^{2}P^{2})^{2} on the right side.
(1 – d^{2}P^{2})^{2} = (1 + d^{2}P^{2})^{2} – 4d^{2}P^{2} = (1 + d^{2}P^{2})(1 + d^{2}P^{2}) – 4d^{2}P^{2}
We got the term (1 + d^{2}P^{2}). Now, replace P value with original one in the multiplier.
(1 + d^{2}P^{2})(1 + d^{2}x_{1}^{2}x_{2}^{2}y_{1}^{2}y_{2}^{2}) – 4d^{2}P^{2}
Adding and subtracting the dx_{1}^{2}y_{1}^{2} and dx_{2}^{2}y_{2}^{2} values does not change the content.
(1 + d^{2}P^{2})(1 + d^{2}x_{1}^{2}x_{2}^{2}y_{1}^{2}y_{2}^{2} + dx_{1}^{2}y_{1}^{2} + dx_{2}^{2}y_{2}^{2} – dx_{1}^{2}y_{1}^{2} – dx_{2}^{2}y_{2}^{2}) – 4d^{2}P^{2}
Group the plus-signed and minus-signed terms separately.
(1 + d^{2}P^{2})(1 + d^{2}x_{1}^{2}x_{2}^{2}y_{1}^{2}y_{2}^{2} + dx_{1}^{2}y_{1}^{2} + dx_{2}^{2}y_{2}^{2}) +(1 + d^{2}P^{2})( – dx_{1}^{2}y_{1}^{2} – dx_{2}^{2}y_{2}^{2}) – 4d^{2}P^{2}
We can rewrite the term (1 + d^{2}x_{1}^{2}x_{2}^{2}y_{1}^{2}y_{2}^{2} + dx_{1}^{2}y_{1}^{2} + dx_{2}^{2}y_{2}^{2}) as (1 + dx_{1}^{2}y_{1}^{2})(1 + dx_{2}^{2}y_{2}^{2})
(1 + d^{2}P^{2})(1 + dx_{1}^{2}y_{1}^{2})(1 + dx_{2}^{2}y_{2}^{2}) + (1 + d^{2}P^{2})( – dx_{1}^{2}y_{1}^{2} – dx_{2}^{2}y_{2}^{2}) – 4d^{2}P^{2}
Expand (1 + d^{2}P^{2})( – dx_{1}^{2}y_{1}^{2} – dx_{2}^{2}y_{2}^{2}) and split – 4d^{2}P^{2} into two – 2d^{2}x_{1}^{2}y_{1}^{2}x_{2}^{2}y_{2}^{2} terms.
(1 + d^{2}P^{2})(1 + dx_{1}^{2}y_{1}^{2})(1 + dx_{2}^{2}y_{2}^{2}) – dx_{1}^{2}y_{1}^{2} – dx_{2}^{2}y_{2}^{2} – d^{3}x_{2}^{2}y_{2}^{2}(x_{1}^{4}y_{1}^{4}) – d^{3}x_{1}^{2}y_{1}^{2}(x_{2}^{4}y_{2}^{4}) – 2d^{2}x_{1}^{2}y_{1}^{2}x_{2}^{2}y_{2}^{2} – 2d^{2}x_{1}^{2}y_{1}^{2}x_{2}^{2}y_{2}^{2}
Combine the parts containing dx_{1}^{2}y_{1}^{2} and dx_{2}^{2}y_{2}^{2}
(1 + d^{2}P^{2})(1 + dx_{1}^{2}y_{1}^{2})(1 + dx_{2}^{2}y_{2}^{2}) – dx_{1}^{2}y_{1}^{2}(1 + d^{2}x_{2}^{4}y_{2}^{4} + 2dx_{2}^{2}y_{2}^{2}) – dx_{2}^{2}y_{2}^{2}(1 + d^{2}x_{1}^{4}y_{1}^{4} + 2dx_{1}^{2}y_{1}^{2})
The second and third terms can be expressed as the square of a sum.
(1 + d^{2}P^{2})(1 + dx_{1}^{2}y_{1}^{2})(1 + dx_{2}^{2}y_{2}^{2}) – dx_{1}^{2}y_{1}^{2}(1 + dx_{2}^{2}y_{2}^{2})^{2} – dx_{2}^{2}y_{2}^{2}(1 + dx_{1}^{2}y_{1}^{2})^{2}
Combine left and right sides
(1 + d^{2}P^{2})(ax_{1}^{2} + y_{1}^{2})(ax_{2}^{2} + y_{2}^{2}) + (2dP)(ax_{1}^{2} – y_{1}^{2})(ax_{2}^{2} – y_{2}^{2}) – 8adP^{2} = d[(x_{1}^{2}y_{1}^{2}y_{2}^{4} + a^{2}x_{1}^{4}x_{2}^{2}y_{2}^{2} + x_{2}^{2}y_{1}^{4}y_{2}^{2} + a^{2}x_{1}^{2}x_{2}^{4}y_{1}^{2}) + 2P(ax_{1}^{2} – y_{1}^{2})(ax_{2}^{2} – y_{2}^{2})- 4aP^{2}] + (1 + d^{2}P^{2})(1 + dx_{1}^{2}y_{1}^{2})(1 + dx_{2}^{2}y_{2}^{2}) – dx_{1}^{2}y_{1}^{2}(1 + dx_{2}^{2}y_{2}^{2})^{2} – dx_{2}^{2}y_{2}^{2}(1 + dx_{1}^{2}y_{1}^{2})^{2}
Move the d multiplier into the parentheses.
(1 + d^{2}P^{2})(ax_{1}^{2} + y_{1}^{2})(ax_{2}^{2} + y_{2}^{2}) + (2dP)(ax_{1}^{2} – y_{1}^{2})(ax_{2}^{2} – y_{2}^{2}) – 8adP^{2} = dx_{1}^{2}y_{1}^{2}y_{2}^{4} + da^{2}x_{1}^{4}x_{2}^{2}y_{2}^{2} + dx_{2}^{2}y_{1}^{4}y_{2}^{2} + da^{2}x_{1}^{2}x_{2}^{4}y_{1}^{2} + 2dP(ax_{1}^{2} – y_{1}^{2})(ax_{2}^{2} – y_{2}^{2})- 4adP^{2} + (1 + d^{2}P^{2})(1 + dx_{1}^{2}y_{1}^{2})(1 + dx_{2}^{2}y_{2}^{2}) – dx_{1}^{2}y_{1}^{2}(1 + dx_{2}^{2}y_{2}^{2})^{2} – dx_{2}^{2}y_{2}^{2}(1 + dx_{1}^{2}y_{1}^{2})^{2}
Both the left and right sides have (2dP)(ax_{1}^{2} – y_{1}^{2})(ax_{2}^{2} – y_{2}^{2}). We can remove these terms.
(1 + d^{2}P^{2})(ax_{1}^{2} + y_{1}^{2})(ax_{2}^{2} + y_{2}^{2}) – 8adP^{2} = dx_{1}^{2}y_{1}^{2}y_{2}^{4} + da^{2}x_{1}^{4}x_{2}^{2}y_{2}^{2} + dx_{2}^{2}y_{1}^{4}y_{2}^{2} + da^{2}x_{1}^{2}x_{2}^{4}y_{1}^{2} – 4adP^{2} + (1 + d^{2}P^{2})(1 + dx_{1}^{2}y_{1}^{2})(1 + dx_{2}^{2}y_{2}^{2}) – dx_{1}^{2}y_{1}^{2}(1 + dx_{2}^{2}y_{2}^{2})^{2} – dx_{2}^{2}y_{2}^{2}(1 + dx_{1}^{2}y_{1}^{2})^{2}
Move – 8adP^{2} to the right side.
(1 + d^{2}P^{2})(ax_{1}^{2} + y_{1}^{2})(ax_{2}^{2} + y_{2}^{2}) = dx_{1}^{2}y_{1}^{2}y_{2}^{4} + da^{2}x_{1}^{4}x_{2}^{2}y_{2}^{2} + dx_{2}^{2}y_{1}^{4}y_{2}^{2} + da^{2}x_{1}^{2}x_{2}^{4}y_{1}^{2} + 4adP^{2} + (1 + d^{2}P^{2})(1 + dx_{1}^{2}y_{1}^{2})(1 + dx_{2}^{2}y_{2}^{2}) – dx_{1}^{2}y_{1}^{2}(1 + dx_{2}^{2}y_{2}^{2})^{2} – dx_{2}^{2}y_{2}^{2}(1 + dx_{1}^{2}y_{1}^{2})^{2}
Put the real value of P in 4adP^{2}
(1 + d^{2}P^{2})(ax_{1}^{2} + y_{1}^{2})(ax_{2}^{2} + y_{2}^{2}) = dx_{1}^{2}y_{1}^{2}y_{2}^{4} + da^{2}x_{1}^{4}x_{2}^{2}y_{2}^{2} + dx_{2}^{2}y_{1}^{4}y_{2}^{2} + da^{2}x_{1}^{2}x_{2}^{4}y_{1}^{2} + 2adx_{1}^{2}x_{2}^{2}y_{1}^{2}y_{2}^{2} + 2adx_{1}^{2}x_{2}^{2}y_{1}^{2}y_{2}^{2 }+ (1 + d^{2}P^{2})(1 + dx_{1}^{2}y_{1}^{2})(1 + dx_{2}^{2}y_{2}^{2}) – dx_{1}^{2}y_{1}^{2}(1 + dx_{2}^{2}y_{2}^{2})^{2} – dx_{2}^{2}y_{2}^{2}(1 + dx_{1}^{2}y_{1}^{2})^{2}
Combine the terms containing x_{1}^{2}y_{1}^{2} and x_{2}^{2}y_{2}^{2}
(1 + d^{2}P^{2})(ax_{1}^{2} + y_{1}^{2})(ax_{2}^{2} + y_{2}^{2}) = dx_{1}^{2}y_{1}^{2}(y_{2}^{4} + a^{2}x_{2}^{4} + 2ax_{2}^{2}y_{2}^{2}) + dx_{2}^{2}y_{2}^{2}(y_{1}^{4} + a^{2}x_{1}^{4} + 2ax_{1}^{2}y_{1}^{2}) + (1 + d^{2}P^{2})(1 + dx_{1}^{2}y_{1}^{2})(1 + dx_{2}^{2}y_{2}^{2}) – dx_{1}^{2}y_{1}^{2}(1 + dx_{2}^{2}y_{2}^{2})^{2} – dx_{2}^{2}y_{2}^{2}(1 + dx_{1}^{2}y_{1}^{2})^{2}
We can rewrite the term (y_{2}^{4} + a^{2}x_{2}^{4} + 2ax_{2}^{2}y_{2}^{2}) as (ax_{2}^{2} + y_{2}^{2})^{2} and (y_{1}^{4} + a^{2}x_{1}^{4} + 2ax_{1}^{2}y_{1}^{2}) as (ax_{1}^{2} + y_{1}^{2})^{2}
(1 + d^{2}P^{2})(ax_{1}^{2} + y_{1}^{2})(ax_{2}^{2} + y_{2}^{2}) = dx_{1}^{2}y_{1}^{2}(ax_{2}^{2} + y_{2}^{2})^{2} + dx_{2}^{2}y_{2}^{2}(ax_{1}^{2} + y_{1}^{2})^{2}+ (1 + d^{2}P^{2})(1 + dx_{1}^{2}y_{1}^{2})(1 + dx_{2}^{2}y_{2}^{2}) – dx_{1}^{2}y_{1}^{2}(1 + dx_{2}^{2}y_{2}^{2})^{2} – dx_{2}^{2}y_{2}^{2}(1 + dx_{1}^{2}y_{1}^{2})^{2}
Combine the terms containing (1 + d^{2}P^{2})
(1 + d^{2}P^{2})[(ax_{1}^{2} + y_{1}^{2})(ax_{2}^{2} + y_{2}^{2}) – (1 + dx_{1}^{2}y_{1}^{2})(1 + dx_{2}^{2}y_{2}^{2}) ] = dx_{1}^{2}y_{1}^{2}(ax_{2}^{2} + y_{2}^{2})^{2} – dx_{1}^{2}y_{1}^{2}(1 + dx_{2}^{2}y_{2}^{2})^{2} + dx_{2}^{2}y_{2}^{2}(ax_{1}^{2} + y_{1}^{2})^{2} – dx_{2}^{2}y_{2}^{2}(1 + dx_{1}^{2}y_{1}^{2})^{2}
Still, we can combine terms containing dx_{1}^{2}y_{1}^{2} and dx_{2}^{2}y_{2}^{2}
(1 + d^{2}P^{2})[(ax_{1}^{2} + y_{1}^{2})(ax_{2}^{2} + y_{2}^{2}) – (1 + dx_{1}^{2}y_{1}^{2})(1 + dx_{2}^{2}y_{2}^{2}) ] = dx_{1}^{2}y_{1}^{2}[(ax_{2}^{2} + y_{2}^{2})^{2} – (1 + dx_{2}^{2}y_{2}^{2})^{2}]+ dx_{2}^{2}y_{2}^{2}[(ax_{1}^{2} + y_{1}^{2})^{2} – (1 + dx_{1}^{2}y_{1}^{2})^{2}]
Since (x_{1}, y_{1}) and (x_{2}, y_{2}) are on the curve, ax_{i}^{2} + y_{i}^{2} = 1 + dx_{i}^{2}y_{i}^{2} for both points, so every bracketed term above equals zero and both sides vanish. This proves the theorem as claimed.
By setting the variable a to 1, we have also proven the addition formula for the regular Edwards curves that Bernstein and Lange introduced.
x^{2} + y^{2} = 1 + dx^{2}y^{2}
x_{3} = (x_{1}y_{2} + y_{1}x_{2})/(1 + dx_{1}x_{2}y_{1}y_{2})
y_{3} = (y_{1}y_{2} – ax_{1}x_{2})/(1 – dx_{1}x_{2}y_{1}y_{2}) = (y_{1}y_{2} – x_{1}x_{2})/(1 – dx_{1}x_{2}y_{1}y_{2})
So, we have proven the addition formula for both twisted and regular Edwards curves. In particular, twisted ones are the backbone of Edwards-curve based digital signatures. These signatures offer both high speed and high security. Every security specialist should keep Edwards curves in their toolbox.
The post A Gentle Introduction to Edwards-curve Digital Signature Algorithm (EdDSA) appeared first on Sefik Ilkin Serengil.
The original paper recommends using a twisted Edwards curve. This curve looks like a bird's-eye view of a roundabout intersection of a road.
ax^{2} + y^{2} = 1 + dx^{2}y^{2}
This has an addition formula similar to that of regular Edwards curves. The difference is the a multiplier in the numerator of the y coordinate of the new point.
(x_{1}, y_{1}) + (x_{2}, y_{2}) = (x_{3}, y_{3})
x_{3} = (x_{1}y_{2} + y_{1}x_{2})/(1 + dx_{1}x_{2}y_{1}y_{2})
y_{3} = (y_{1}y_{2} – ax_{1}x_{2})/(1 – dx_{1}x_{2}y_{1}y_{2})
Ed25519 is a special form of this curve where a = -1 and d = -121665/121666. It works over the prime field where p = 2^{255} – 19. The final form of ed25519 is illustrated below.
-x^{2} + y^{2} = 1 – (121665/121666)x^{2}y^{2} (mod 2^{255} – 19)
The base point of the curve has y = (u-1)/(u+1) where u = 9. The integer equivalents are demonstrated below.
p = pow(2, 255) - 19

base = (15112221349535400772501151409588531511454012693041857206046113283949847762202,
        46316835694926478169428394003475163141307993866256225615783033603165251855960)
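We can sanity-check these constants in a few lines: y should equal (u-1)/(u+1) = 4/5 (mod p), and the pair should satisfy the ed25519 equation:

```python
p = 2**255 - 19
u = 9

# y = (u - 1) / (u + 1) mod p, computed via the modular inverse
y = (u - 1) * pow(u + 1, -1, p) % p

base_x = 15112221349535400772501151409588531511454012693041857206046113283949847762202
base_y = 46316835694926478169428394003475163141307993866256225615783033603165251855960

# d = -121665/121666 mod p
d = -121665 * pow(121666, -1, p) % p

# the base point must satisfy -x^2 + y^2 = 1 + d*x^2*y^2 (mod p)
on_curve = (-base_x**2 + base_y**2) % p == (1 + d * base_x**2 * base_y**2) % p
print(y == base_y, on_curve)
```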
Moreover, the variable d is a rational value. We can convert it to an integer by replacing its denominator with the denominator's multiplicative inverse modulo p.
#ax^2 + y^2 = 1 + dx^2y^2
a = -1
d = findPositiveModulus(-121665 * findModInverse(121666, p), p) #ed25519
Regular elliptic curves in Weierstrass form have different formulas for addition and doubling. In contrast, both operations are handled by the same addition formula in Edwards curves. Working over finite fields requires moving denominators into the numerator by replacing them with their multiplicative inverses.
def pointAddition(P, Q, a, d, mod):
    x1 = P[0]; y1 = P[1]
    x2 = Q[0]; y2 = Q[1]

    x3 = (((x1*y2 + y1*x2) % mod) * findModInverse(1 + d*x1*x2*y1*y2, mod)) % mod
    y3 = (((y1*y2 - a*x1*x2) % mod) * findModInverse(1 - d*x1*x2*y1*y2, mod)) % mod

    return x3, y3
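The snippet assumes findModInverse and findPositiveModulus helpers from the author's project. A minimal way to define them (using Python 3.8+'s pow with a -1 exponent) could be:

```python
def findModInverse(a, m):
    # multiplicative inverse of a modulo m (raises ValueError if it does not exist)
    return pow(a, -1, m)

def findPositiveModulus(a, m):
    # Python's % operator already returns a non-negative result for positive m
    return a % m

print(findModInverse(3, 7))        # 5, since 3 * 5 = 15 = 1 (mod 7)
print(findPositiveModulus(-5, 7))  # 2
```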
Alice needs to generate a 32-byte private key. Then, she needs to calculate private key times base point. This will be her public key.
import random

privateKey = random.getrandbits(256) #32 byte secret key
publicKey = applyDoubleAndAddMethod(base, privateKey, a, d, p)
She can use double-and-add method to find her public key fast.
def applyDoubleAndAddMethod(P, k, a, d, mod):
    additionPoint = (P[0], P[1])

    kAsBinary = bin(k) #0b1111111001
    kAsBinary = kAsBinary[2:len(kAsBinary)] #1111111001

    for i in range(1, len(kAsBinary)):
        currentBit = kAsBinary[i: i+1]

        #always apply doubling
        additionPoint = pointAddition(additionPoint, additionPoint, a, d, mod)

        if currentBit == '1':
            #add base point
            additionPoint = pointAddition(additionPoint, P, a, d, mod)

    return additionPoint
Firstly, Alice needs to convert the message to a numeric value.
def textToInt(text):
    encoded_text = text.encode('utf-8')
    hex_text = encoded_text.hex()
    int_text = int(hex_text, 16)
    return int_text

message = textToInt("Hello, world!")
Remember that a random key is involved in the elliptic curve digital signature algorithm (ECDSA). It must be different for each signing; otherwise, it causes an important security issue. This security disaster appeared in the Sony PlayStation game console in 2010. In EdDSA, this is handled by generating the random key from the hash of the message. In this way, every message has a different random key.
import hashlib

def hashing(message):
    return int(hashlib.sha512(str(message).encode("utf-8")).hexdigest(), 16)

r = hashing(hashing(message) + message) % p
Random key times base point gives the random point R, which is a curve point. Extracting the secret random key r from the known random point R is a really hard problem (ECDLP). Besides, a hash of the combination of the random point, the public key and the message is stored in the variable h. This can be calculated by the receiving party, too. Then, the variable s stores the integer (r + h x private key). The signature of the message is the (R, s) pair.
R = applyDoubleAndAddMethod(base, r, a, d, p)
h = hashing(R[0] + publicKey[0] + message) % p
s = (r + h * privateKey)
Bob receives the message and its signature (R, s). Also, he knows Alice's public key and the public curve configuration (base point, a, d, p). He needs to find the following P1 and P2 pair.
h = hashing(R[0] + publicKey[0] + message) % p

P1 = applyDoubleAndAddMethod(base, s, a, d, p)
P2 = pointAddition(R, applyDoubleAndAddMethod(publicKey, h, a, d, p), a, d, p)
P1 is the signature's s value times the base point. P2 is the addition of the signature's R value and h times the public key. Remember that h can be calculated by Bob, too. Herein, the P1 and P2 pair must be equal if the signature is valid.
You might wonder how this works. Focus on the calculation of P1.
P1 = s x basePoint
The signature's s value was computed as (r + h x privateKey). Bob knows the exact value of s, and he also knows h and R, but he does not know Alice's private key or the random key r. Replace s in the P1 calculation.
P1 = (r + h x privateKey) x basePoint
Distribute the base point multiplication into the parentheses.
P1 = r x basePoint + h x privateKey x basePoint
The equation above includes the private key times the base point, which is exactly equal to Alice's public key. Moreover, the random key r times the base point is equal to the random point R.
P1 = R + h x publicKey
Now, P1 is exactly equal to P2. The equality of these two points is exactly what a valid signature guarantees.
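This chain of equalities holds in any group, not only on an Edwards curve. As a toy illustration (my own stand-in, not the curve code above), scalar-times-base-point can be modeled with plain integer arithmetic modulo a prime while preserving the group structure:

```python
p = 1009           # toy modulus (stand-in for the curve group order)
B = 5              # toy base point
k = 123            # Alice's private key
pub = (k * B) % p  # public key = privateKey x basePoint

r = 77           # per-message random key
R = (r * B) % p  # random point
h = 42           # stand-in for hash(R, publicKey, message)
s = r + h * k    # signature scalar

P1 = (s * B) % p        # s x basePoint
P2 = (R + h * pub) % p  # R + h x publicKey
print(P1 == P2)  # True
```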
So, we have covered EdDSA and a simple Python implementation. This scheme is designed to be faster than existing digital signature schemes. Also, signing two different messages with the same random key causes the secret key to be disclosed in ECDSA; EdDSA handles this issue by design. Finally, the code for this post is pushed to GitHub.
The post A Gentle Introduction to Edwards-curve Digital Signature Algorithm (EdDSA) appeared first on Sefik Ilkin Serengil.
Edwards curves show similarity with the unit circle, which satisfies the following equation.
x^{2} + y^{2} = 1
Suppose that (x_{1}, y_{1}) and (x_{2}, y_{2}) are points on the unit circle. The angle between y-axis and (x_{1}, y_{1}) is α and angle between y-axis and (x_{2}, y_{2}) is β.
We can express (x_{1}, y_{1}) as (sinα, cosα) and (x_{2}, y_{2}) as (sinβ, cosβ).
Now, I can add these two points by adding their corresponding angles. Angle sum identities will help me to formulate.
x_{3}= sin(α+β) = sinα.cosβ + cosα.sinβ
y_{3} = cos(α+β) = cosα.cosβ – sinα.sinβ
We know that (x_{3}, y_{3}) will satisfy the unit circle equation. This is satisfactory but not elliptic!
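We can check this claim numerically with arbitrary angles, assuming nothing beyond the angle sum identities above:

```python
import math

a, b = 0.6, 1.1  # arbitrary angles in radians
x1, y1 = math.sin(a), math.cos(a)
x2, y2 = math.sin(b), math.cos(b)

x3 = x1 * y2 + y1 * x2  # sin(a + b)
y3 = y1 * y2 - x1 * x2  # cos(a + b)

print(abs(x3**2 + y3**2 - 1))  # ~0: the sum stays on the unit circle
```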
Edwards curves satisfy the form x^{2} + y^{2} = a^{2} + a^{2}x^{2}y^{2}. This is the form Harold Edwards studied in the original paper. Later, Bernstein and Lange contributed to the study and transformed Edwards curves into the simpler form x^{2} + y^{2} = 1 + dx^{2}y^{2}.
Setting the variable d to 0 creates the unit circle. The curve looks like a starfish as d decreases in the negative direction.
I will work with the form Harold Edwards studied in the following parts of this post. You can find the proof for the simpler form x^{2} + y^{2} = 1 + dx^{2}y^{2} here.
Elliptic curves are based on constructing new points from existing points. Traditional forms such as Weierstrass or Koblitz use chords and tangents to construct a new point.
Edwards curves use neither chords nor tangents. They have their own characteristic construction method, similar to the unit circle's addition law.
The Edwards addition law says that if (x_{1}, y_{1}) and (x_{2}, y_{2}) are points on the Edwards curve, then the following point (x_{3}, y_{3}) derived from the known points must be on the same curve.
x_{3} = (x_{1}y_{2 }+ x_{2}y_{1})/(a.(1+x_{1}y_{1}x_{2}y_{2}))
y_{3} = (y_{1}y_{2} – x_{1}x_{2})/(a.(1 – x_{1}y_{1}x_{2}y_{2}))
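Before the formal proof below, we can spot-check the addition law numerically over the reals. The values of a and the x coordinates below are arbitrary choices; each y is solved from the curve equation itself.

```python
import math

a = 1.2  # arbitrary curve parameter
a2 = a * a

def y_on_curve(x):
    # solve x^2 + y^2 = a^2 + a^2 x^2 y^2 for y (positive branch)
    return math.sqrt((a2 - x * x) / (1 - a2 * x * x))

x1, x2 = 0.5, 0.3
y1, y2 = y_on_curve(x1), y_on_curve(x2)

# Edwards addition law
P = x1 * y1 * x2 * y2
x3 = (x1 * y2 + x2 * y1) / (a * (1 + P))
y3 = (y1 * y2 - x1 * x2) / (a * (1 - P))

# the derived point satisfies the same curve equation
print(abs((x3**2 + y3**2) - (a2 + a2 * x3**2 * y3**2)))  # ~0
```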
Euler and Gauss worked on this kind of elliptic equation and discovered addition formulas in the late 1700s.
Euler's very first study on this appears in Observations on the comparison of arcs of unrectifiable curves (Observationes de Comparatione Arcuum Curvarum Irrectificabilium), published in 1761 (pp. 83 to 103).
Then, Gauss became interested in the same integrals and documented his work in his Werke, published in 1799 (p. 404). He worked on the form x^{2} + y^{2} = 1 + dx^{2}y^{2} where d = -1.
The following illustration demonstrates the difference between unit circle and elliptic form worked by Gauss.
We can say that Harold Edwards brought already discovered theorems back to light.
It is not fully clear how Euler and Gauss found this addition formula; they might simply have observed it. However, we can still validate the formula by direct substitution. The exact values of the new point (x_{3}, y_{3}) derived from the known points x_{1}, y_{1}, x_{2}, y_{2} must satisfy the equation x^{2} + y^{2} = a^{2} + a^{2}x^{2}y^{2} if the addition formula is valid. This is exactly how Harold Edwards proves the addition law in the original paper.
x_{3}^{2} + y_{3}^{2} = a^{2} + a^{2}x_{3}^{2}y_{3}^{2}
x_{3} = (x_{1}y_{2 }+ x_{2}y_{1})/(a.(1+x_{1}y_{1}x_{2}y_{2})) , y_{3} = (y_{1}y_{2} – x_{1}x_{2})/(a.(1 – x_{1}y_{1}x_{2}y_{2}))
Put the exact values into the Edwards form.
(x_{1}y_{2 }+ x_{2}y_{1})^{2}/a^{2}.(1+x_{1}y_{1}x_{2}y_{2})^{2} + (y_{1}y_{2} – x_{1}x_{2})^{2}/a^{2}.(1 – x_{1}y_{1}x_{2}y_{2})^{2} = a^{2} + a^{2}.(x_{1}y_{2 }+ x_{2}y_{1})^{2}(y_{1}y_{2} – x_{1}x_{2})^{2}/a^{2}.(1+x_{1}y_{1}x_{2}y_{2})^{2}.a^{2}.(1 – x_{1}y_{1}x_{2}y_{2})^{2}
The identity will become very complex. That's why we substitute P for x_{1}y_{1}x_{2}y_{2}. We will restore it later.
(x_{1}y_{2 }+ x_{2}y_{1})^{2}/a^{2}.(1+P)^{2} + (y_{1}y_{2} – x_{1}x_{2})^{2}/a^{2}.(1 – P)^{2} = a^{2} + a^{2}.(x_{1}y_{2 }+ x_{2}y_{1})^{2}(y_{1}y_{2} – x_{1}x_{2})^{2}/a^{2}.(1+P)^{2}.a^{2}.(1 – P)^{2}
The denominators must be equal before we can add the fractions.
(x_{1}y_{2 }+ x_{2}y_{1})^{2}.(1 – P)^{2}/a^{2}.(1+P)^{2}.(1 – P)^{2} + (y_{1}y_{2} – x_{1}x_{2})^{2}.(1+P)^{2}/a^{2}.(1 – P)^{2}.(1+P)^{2} = a^{2}.a^{2}.(1+P)^{2}.(1 – P)^{2}/1.a^{2}.(1+P)^{2}.(1 – P)^{2} + a^{2}.(x_{1}y_{2 }+ x_{2}y_{1})^{2}(y_{1}y_{2} – x_{1}x_{2})^{2}/a^{2}.(1+P)^{2}.a^{2}.(1 – P)^{2}
The second term on the right side has an a^{2} multiplier in both the numerator and the denominator. We can simplify the expression.
(x_{1}y_{2 }+ x_{2}y_{1})^{2}.(1 – P)^{2}/a^{2}.(1+P)^{2}.(1 – P)^{2} + (y_{1}y_{2} – x_{1}x_{2})^{2}.(1+P)^{2}/a^{2}.(1 – P)^{2}.(1+P)^{2} = a^{2}.a^{2}.(1+P)^{2}.(1 – P)^{2}/1.a^{2}.(1+P)^{2}.(1 – P)^{2} + (x_{1}y_{2 }+ x_{2}y_{1})^{2}(y_{1}y_{2} – x_{1}x_{2})^{2}/a^{2}.(1+P)^{2}.(1 – P)^{2}
Now, all denominators are the same, so we can cancel them.
(x_{1}y_{2 }+ x_{2}y_{1})^{2}.(1 – P)^{2} + (y_{1}y_{2} – x_{1}x_{2})^{2}.(1+P)^{2} = a^{4}.(1+P)^{2}.(1 – P)^{2} + (x_{1}y_{2 }+ x_{2}y_{1})^{2}(y_{1}y_{2} – x_{1}x_{2})^{2}
Note that the cancelled denominator must not be equal to 0.
a^{2}.(1+x_{1}y_{1}x_{2}y_{2})^{2}.(1 – x_{1}y_{1}x_{2}y_{2})^{2} ≠ 0
Please focus on the term (1+P)^{2}.(1 – P)^{2}. We can rewrite it as [(1+P)(1-P)]^{2} = (1 – P^{2})^{2}
(x_{1}y_{2 }+ x_{2}y_{1})^{2}.(1 – P)^{2} + (y_{1}y_{2} – x_{1}x_{2})^{2}.(1+P)^{2} = a^{4}.(1 – P^{2})^{2} + (x_{1}y_{2 }+ x_{2}y_{1})^{2}(y_{1}y_{2} – x_{1}x_{2})^{2}
Focus on the left side of the equation. Evaluate the powers.
(x_{1}^{2}y_{2}^{2} + x_{2}^{2}y_{1}^{2} + 2P)(1 + P^{2} – 2P) + (y_{1}^{2}y_{2}^{2} + x_{1}^{2}x_{2}^{2} – 2P)(1 + P^{2} + 2P)
Combine the parts that contain 1 + P^{2} and 2P respectively.
(1 + P^{2})(x_{1}^{2}y_{2}^{2} + x_{2}^{2}y_{1}^{2} + 2P + y_{1}^{2}y_{2}^{2} + x_{1}^{2}x_{2}^{2} – 2P) + (2P)(- x_{1}^{2}y_{2}^{2} – x_{2}^{2}y_{1}^{2} – 2P + y_{1}^{2}y_{2}^{2} + x_{1}^{2}x_{2}^{2} – 2P)
(1 + P^{2})(x_{1}^{2}y_{2}^{2} + x_{2}^{2}y_{1}^{2} + y_{1}^{2}y_{2}^{2} + x_{1}^{2}x_{2}^{2}) + (2P)(- x_{1}^{2}y_{2}^{2} – x_{2}^{2}y_{1}^{2} + y_{1}^{2}y_{2}^{2} + x_{1}^{2}x_{2}^{2} – 4P)
(1 + P^{2})(x_{1}^{2} + y_{1}^{2})(x_{2}^{2} + y_{2}^{2}) + (2P)[(x_{1}^{2} – y_{1}^{2})(x_{2}^{2} – y_{2}^{2}) – 4P]
(1 + P^{2})(x_{1}^{2} + y_{1}^{2})(x_{2}^{2} + y_{2}^{2}) + 2P.(x_{1}^{2} – y_{1}^{2})(x_{2}^{2} – y_{2}^{2}) – 8P^{2}
Now, focus on the right side of the equation
(x_{1}y_{2} + x_{2}y_{1})^{2}(y_{1}y_{2} – x_{1}x_{2})^{2} = (x_{1}^{2}y_{2}^{2} + x_{2}^{2}y_{1}^{2} + 2P)(y_{1}^{2}y_{2}^{2} + x_{1}^{2}x_{2}^{2} – 2P)
Put the term 2P to the outside.
(x_{1}^{2}y_{2}^{2} + x_{2}^{2}y_{1}^{2})(y_{1}^{2}y_{2}^{2} + x_{1}^{2}x_{2}^{2}) + 2P(y_{1}^{2}y_{2}^{2} + x_{1}^{2}x_{2}^{2} – x_{1}^{2}y_{2}^{2} – x_{2}^{2}y_{1}^{2}) – 4P^{2}
(x_{1}^{2}y_{1}^{2}y_{2}^{4} + x_{1}^{4}x_{2}^{2}y_{2}^{2} + x_{2}^{2}y_{1}^{4}y_{2}^{2} + x_{1}^{2}x_{2}^{4}y_{1}^{2}) + 2P(x_{1}^{2} – y_{1}^{2})(x_{2}^{2} – y_{2}^{2}) – 4P^{2}
Put left and right side together again
(1 + P^{2})(x_{1}^{2} + y_{1}^{2})(x_{2}^{2} + y_{2}^{2}) + 2P.(x_{1}^{2} – y_{1}^{2})(x_{2}^{2} – y_{2}^{2}) – 8P^{2} = a^{4}.(1 – P^{2})^{2} + (x_{1}^{2}y_{1}^{2}y_{2}^{4} + x_{1}^{4}x_{2}^{2}y_{2}^{2} + x_{2}^{2}y_{1}^{4}y_{2}^{2} + x_{1}^{2}x_{2}^{4}y_{1}^{2}) + 2P(x_{1}^{2} – y_{1}^{2})(x_{2}^{2} – y_{2}^{2}) – 4P^{2}
Both the left and right sides contain the term 2P.(x_{1}^{2} – y_{1}^{2})(x_{2}^{2} – y_{2}^{2}), so we can cancel it. Also, we can add +8P^{2} to both sides.
(1 + P^{2})(x_{1}^{2} + y_{1}^{2})(x_{2}^{2} + y_{2}^{2}) = a^{4}.(1 – P^{2})^{2} + (x_{1}^{2}y_{1}^{2}y_{2}^{4} + x_{1}^{4}x_{2}^{2}y_{2}^{2} + x_{2}^{2}y_{1}^{4}y_{2}^{2} + x_{1}^{2}x_{2}^{4}y_{1}^{2}) + 4P^{2}
Here, we can manipulate the term (1 – P^{2})^{2}. The left side contains the factor (1 + P^{2}), so we should express (1 – P^{2})^{2} in terms of (1 + P^{2}).
(1 – P^{2})^{2} = (1 + P^{2})^{2} – 4P^{2} = (1 + P^{2})(1 + P^{2}) – 4P^{2}
Substitute the real value of P^{2} into the second factor
(1 + P^{2})(1 + x_{1}^{2}y_{1}^{2}x_{2}^{2}y_{2}^{2}) – 4P^{2}
Adding and subtracting the same values does not change the expression
(1 + P^{2})(1 + x_{1}^{2}y_{1}^{2}x_{2}^{2}y_{2}^{2} + x_{1}^{2}y_{1}^{2} + x_{2}^{2}y_{2}^{2} – x_{1}^{2}y_{1}^{2} – x_{2}^{2}y_{2}^{2}) – 4P^{2}
We can separate the second parenthetical factor
(1 + P^{2})(1 + x_{1}^{2}y_{1}^{2}x_{2}^{2}y_{2}^{2} + x_{1}^{2}y_{1}^{2} + x_{2}^{2}y_{2}^{2}) – (1 + P^{2})(x_{1}^{2}y_{1}^{2} + x_{2}^{2}y_{2}^{2}) – 4P^{2}
The second parenthetical factor can be expressed as a product of two terms
(1 + P^{2})(1 + x_{1}^{2}y_{1}^{2})(1 + x_{2}^{2}y_{2}^{2}) – (1 + P^{2})(x_{1}^{2}y_{1}^{2} + x_{2}^{2}y_{2}^{2}) – 4P^{2}
Reflect the minus sign into the parentheses
(1 + P^{2})(1 + x_{1}^{2}y_{1}^{2})(1 + x_{2}^{2}y_{2}^{2}) + (- 1 – P^{2})(x_{1}^{2}y_{1}^{2} + x_{2}^{2}y_{2}^{2}) – 4P^{2}
Expand the product in the second term
(1 + P^{2})(1 + x_{1}^{2}y_{1}^{2})(1 + x_{2}^{2}y_{2}^{2}) – x_{1}^{2}y_{1}^{2} – x_{2}^{2}y_{2}^{2} – P^{2}x_{1}^{2}y_{1}^{2} – P^{2}x_{2}^{2}y_{2}^{2} – 4P^{2}
(1 + P^{2})(1 + x_{1}^{2}y_{1}^{2})(1 + x_{2}^{2}y_{2}^{2}) – x_{1}^{2}y_{1}^{2} – x_{2}^{2}y_{2}^{2} – x_{2}^{2}y_{2}^{2}(x_{1}^{4}y_{1}^{4})- x_{1}^{2}y_{1}^{2}(x_{2}^{4}y_{2}^{4}) – 2x_{1}^{2}y_{1}^{2}x_{2}^{2}y_{2}^{2} – 2x_{1}^{2}y_{1}^{2}x_{2}^{2}y_{2}^{2}
(1 + P^{2})(1 + x_{1}^{2}y_{1}^{2})(1 + x_{2}^{2}y_{2}^{2}) – x_{1}^{2}y_{1}^{2}(1 + x_{2}^{4}y_{2}^{4} + 2x_{2}^{2}y_{2}^{2}) – x_{2}^{2}y_{2}^{2}(1 + x_{1}^{4}y_{1}^{4} + 2x_{1}^{2}y_{1}^{2})
(1 + P^{2})(1 + x_{1}^{2}y_{1}^{2})(1 + x_{2}^{2}y_{2}^{2}) – x_{1}^{2}y_{1}^{2}(1 + x_{2}^{2}y_{2}^{2})^{2} – x_{2}^{2}y_{2}^{2}(1 + x_{1}^{2}y_{1}^{2})^{2}
So, the term (1 – P^{2})^{2} can be expressed as (1 + P^{2})(1 + x_{1}^{2}y_{1}^{2})(1 + x_{2}^{2}y_{2}^{2}) – x_{1}^{2}y_{1}^{2}(1 + x_{2}^{2}y_{2}^{2})^{2} – x_{2}^{2}y_{2}^{2}(1 + x_{1}^{2}y_{1}^{2})^{2}
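Note that this manipulation of (1 – P^{2})^{2} uses no curve equation at all, so it must hold as an algebraic identity for any numbers. A quick check with arbitrary integers (exact arithmetic, no rounding):

```python
x1, y1, x2, y2 = 2, 3, 5, 7  # arbitrary integers, not curve points
P = x1 * y1 * x2 * y2
u, v = x1**2 * y1**2, x2**2 * y2**2  # shorthand for the squared products

lhs = (1 - P**2)**2
rhs = (1 + P**2) * (1 + u) * (1 + v) - u * (1 + v)**2 - v * (1 + u)**2
print(lhs == rhs)  # True
```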
Turn back to the main equation
(1 + P^{2})(x_{1}^{2} + y_{1}^{2})(x_{2}^{2} + y_{2}^{2}) = a^{4}.(1 – P^{2})^{2} + (x_{1}^{2}y_{1}^{2}y_{2}^{4} + x_{1}^{4}x_{2}^{2}y_{2}^{2} + x_{2}^{2}y_{1}^{4}y_{2}^{2} + x_{1}^{2}x_{2}^{4}y_{1}^{2}) + 4P^{2}
Restore the 4P^{2}
(1 + P^{2})(x_{1}^{2} + y_{1}^{2})(x_{2}^{2} + y_{2}^{2}) = a^{4}.(1 – P^{2})^{2} + (x_{1}^{2}y_{1}^{2}y_{2}^{4} + x_{1}^{4}x_{2}^{2}y_{2}^{2} + x_{2}^{2}y_{1}^{4}y_{2}^{2} + x_{1}^{2}x_{2}^{4}y_{1}^{2}) + 2x_{1}^{2}y_{1}^{2}x_{2}^{2}y_{2}^{2} + 2x_{1}^{2}y_{1}^{2}x_{2}^{2}y_{2}^{2}
Combine the parts that contain x_{1}^{2}y_{1}^{2} and x_{2}^{2}y_{2}^{2}
(1 + P^{2})(x_{1}^{2} + y_{1}^{2})(x_{2}^{2} + y_{2}^{2}) = a^{4}.(1 – P^{2})^{2} + x_{1}^{2}y_{1}^{2}(x_{2}^{4} + y_{2}^{4} + 2x_{2}^{2}y_{2}^{2}) + x_{2}^{2}y_{2}^{2}(x_{1}^{4} + y_{1}^{4} + 2x_{1}^{2}y_{1}^{2})
(1 + P^{2})(x_{1}^{2} + y_{1}^{2})(x_{2}^{2} + y_{2}^{2}) = a^{4}.(1 – P^{2})^{2} + x_{1}^{2}y_{1}^{2}(x_{2}^{2} + y_{2}^{2})^{2} + x_{2}^{2}y_{2}^{2}(x_{1}^{2} + y_{1}^{2})^{2}
Now, set (1 – P^{2})^{2} to its manipulated value
(1 + P^{2})(x_{1}^{2} + y_{1}^{2})(x_{2}^{2} + y_{2}^{2}) = a^{4}.[(1 + P^{2})(1 + x_{1}^{2}y_{1}^{2})(1 + x_{2}^{2}y_{2}^{2}) – x_{1}^{2}y_{1}^{2}(1 + x_{2}^{2}y_{2}^{2})^{2} – x_{2}^{2}y_{2}^{2}(1 + x_{1}^{2}y_{1}^{2})^{2}] + x_{1}^{2}y_{1}^{2}(x_{2}^{2} + y_{2}^{2})^{2} + x_{2}^{2}y_{2}^{2}(x_{1}^{2} + y_{1}^{2})^{2}
(1 + P^{2})(x_{1}^{2} + y_{1}^{2})(x_{2}^{2} + y_{2}^{2}) = a^{4}(1 + P^{2})(1 + x_{1}^{2}y_{1}^{2})(1 + x_{2}^{2}y_{2}^{2}) – a^{4}.x_{1}^{2}y_{1}^{2}(1 + x_{2}^{2}y_{2}^{2})^{2} – a^{4}.x_{2}^{2}y_{2}^{2}(1 + x_{1}^{2}y_{1}^{2})^{2} + x_{1}^{2}y_{1}^{2}(x_{2}^{2} + y_{2}^{2})^{2} + x_{2}^{2}y_{2}^{2}(x_{1}^{2} + y_{1}^{2})^{2}
Combine the parts that contain (1 + P^{2})
(1 + P^{2}).[(x_{1}^{2} + y_{1}^{2})(x_{2}^{2} + y_{2}^{2}) – a^{4}(1 + x_{1}^{2}y_{1}^{2})(1 + x_{2}^{2}y_{2}^{2})] + a^{4}.x_{1}^{2}y_{1}^{2}(1 + x_{2}^{2}y_{2}^{2})^{2} + a^{4}.x_{2}^{2}y_{2}^{2}(1 + x_{1}^{2}y_{1}^{2})^{2} – x_{1}^{2}y_{1}^{2}(x_{2}^{2} + y_{2}^{2})^{2} – x_{2}^{2}y_{2}^{2}(x_{1}^{2} + y_{1}^{2})^{2} = 0
Distribute the a^{4} multipliers into the parentheses
(1 + P^{2}).[(x_{1}^{2} + y_{1}^{2})(x_{2}^{2} + y_{2}^{2}) – (a^{2} + a^{2}x_{1}^{2}y_{1}^{2})(a^{2} + a^{2}x_{2}^{2}y_{2}^{2})] + (a^{2})^{2}.x_{1}^{2}y_{1}^{2}(1 + x_{2}^{2}y_{2}^{2})^{2} + (a^{2})^{2}.x_{2}^{2}y_{2}^{2}(1 + x_{1}^{2}y_{1}^{2})^{2} – x_{1}^{2}y_{1}^{2}(x_{2}^{2} + y_{2}^{2})^{2} – x_{2}^{2}y_{2}^{2}(x_{1}^{2} + y_{1}^{2})^{2} = 0
(1 + P^{2}).[(x_{1}^{2} + y_{1}^{2})(x_{2}^{2} + y_{2}^{2}) – (a^{2} + a^{2}x_{1}^{2}y_{1}^{2})(a^{2} + a^{2}x_{2}^{2}y_{2}^{2})] + x_{1}^{2}y_{1}^{2}(a^{2} + a^{2}x_{2}^{2}y_{2}^{2})^{2} + x_{2}^{2}y_{2}^{2}(a^{2} + a^{2}x_{1}^{2}y_{1}^{2})^{2} – x_{1}^{2}y_{1}^{2}(x_{2}^{2} + y_{2}^{2})^{2} – x_{2}^{2}y_{2}^{2}(x_{1}^{2} + y_{1}^{2})^{2} = 0
(1 + P^{2}).[(x_{1}^{2} + y_{1}^{2})(x_{2}^{2} + y_{2}^{2}) – (a^{2} + a^{2}x_{1}^{2}y_{1}^{2})(a^{2} + a^{2}x_{2}^{2}y_{2}^{2})] + (x_{1}^{2}y_{1}^{2})[(a^{2} + a^{2}x_{2}^{2}y_{2}^{2})^{2} – (x_{2}^{2} + y_{2}^{2})^{2}] + (x_{2}^{2}y_{2}^{2})[(a^{2} + a^{2}x_{1}^{2}y_{1}^{2})^{2} – (x_{1}^{2} + y_{1}^{2})^{2}] = 0
Remember the main equation of the Edwards form: x^{2} + y^{2} = a^{2} + a^{2}x^{2}y^{2}. We already know that the points (x_{1}, y_{1}) and (x_{2}, y_{2}) satisfy this equation.
x_{1}^{2} + y_{1}^{2} = a^{2} + a^{2}x_{1}^{2}y_{1}^{2}
x_{2}^{2} + y_{2}^{2} = a^{2} + a^{2}x_{2}^{2}y_{2}^{2}
Multiply these two equations
(x_{1}^{2} + y_{1}^{2}).(x_{2}^{2} + y_{2}^{2}) = (a^{2} + a^{2}x_{1}^{2}y_{1}^{2})(a^{2} + a^{2}x_{2}^{2}y_{2}^{2})
Move the terms on the right side to the left side
(x_{1}^{2} + y_{1}^{2})(x_{2}^{2} + y_{2}^{2}) – (a^{2} + a^{2}x_{1}^{2}y_{1}^{2})(a^{2} + a^{2}x_{2}^{2}y_{2}^{2}) = 0
Also, we can apply the same approach to each individual equation
x_{1}^{2} + y_{1}^{2} – (a^{2} + a^{2}x_{1}^{2}y_{1}^{2}) = 0
x_{2}^{2} + y_{2}^{2} – (a^{2} + a^{2}x_{2}^{2}y_{2}^{2}) = 0
As seen, these all appear in the final form of the equation
(1 + P^{2}).[(x_{1}^{2} + y_{1}^{2})(x_{2}^{2} + y_{2}^{2}) – (a^{2} + a^{2}x_{1}^{2}y_{1}^{2})(a^{2} + a^{2}x_{2}^{2}y_{2}^{2})] + (x_{1}^{2}y_{1}^{2})[(a^{2} + a^{2}x_{2}^{2}y_{2}^{2})^{2} – (x_{2}^{2} + y_{2}^{2})^{2}] + (x_{2}^{2}y_{2}^{2})[(a^{2} + a^{2}x_{1}^{2}y_{1}^{2})^{2} – (x_{1}^{2} + y_{1}^{2})^{2}] = 0
(1 + P^{2}).[0] + (x_{1}^{2}y_{1}^{2})[0] + (x_{2}^{2}y_{2}^{2})[0] = 0
Finally, the equation becomes 0 = 0. This proves the addition law as claimed!
The addition law can also be applied to double a point. Replacing the (x_{2}, y_{2}) pair with (x_{1}, y_{1}) in the addition formula gives the doubling formula.
(x_{1}, y_{1}) + (x_{1}, y_{1}) = (x_{3}, y_{3})
x_{3} = (x_{1}y_{1 }+ x_{1}y_{1})/(a.(1+x_{1}y_{1}x_{1}y_{1})) = (x_{1}y_{1 }+ x_{1}y_{1})/(a.(1+x_{1}^{2}y_{1}^{2}))
y_{3} = (y_{1}y_{1} – x_{1}x_{1})/(a.(1 – x_{1}y_{1}x_{1}y_{1})) = (y_{1}^{2} – x_{1}^{2})/(a.(1 – x_{1}^{2}y_{1}^{2}))
So, we have all the necessary tools to find the coordinates of a point. Point addition and doubling enable us to calculate a target point quickly with the double-and-add method.
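The double-and-add idea itself is independent of the curve: scan the bits of the scalar, doubling at each step and adding the point when the bit is set. A generic sketch of the binary method, using plain integer addition as a stand-in for the point operations above:

```python
def double_and_add(point, scalar, add, double):
    # generic binary method: works for any group given add/double operations
    result = None
    for bit in bin(scalar)[2:]:  # most significant bit first
        if result is not None:
            result = double(result)
        if bit == '1':
            result = point if result is None else add(result, point)
    return result

# stand-in group: integers under addition, so scalar * point is plain multiplication
print(double_and_add(7, 19, add=lambda p, q: p + q, double=lambda p: 2 * p))  # 133
```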
So, we've covered elliptic curves in Edwards form. Even though the addition law was discovered more than two centuries ago by mathematical geniuses, its adaptation to cryptography happened only in the last decade. The proof of the Edwards addition law may seem much harder than for Weierstrass or Koblitz curves, but the calculations are handled much more easily. This is what makes Edwards curves so popular today.
Bernstein shows Weierstrass as a turtle (bird's-eye view) and Edwards as a starfish in his slides. This metaphor reflects the speed of these elliptic curve forms: Weierstrass is old and slow whereas Edwards is new and fast. This is really funny!
The publications of Christiane Peters, in particular her PhD thesis, have been a driver for me to enjoy and understand Edwards curves. Besides, I got much help from the studies of Tanja Lange and Daniel J. Bernstein.
The post A Gentle Introduction to Edwards Curves appeared first on Sefik Ilkin Serengil.
First, the sender and receiver parties need to agree on a secret key. This key must be a square matrix.
import numpy as np

key = np.array([
    [3, 10, 20],
    [20, 9, 17],
    [9, 4, 17]
])

key_rows = key.shape[0]
key_columns = key.shape[1]

if key_rows != key_columns:
    raise Exception('key must be square matrix!')
The key matrix must have an inverse matrix. This means that the determinant of the matrix must not be 0 (and, for the modular inverse we need later, the determinant must also be coprime to the alphabet size 26).
if np.linalg.det(key) == 0:
    raise Exception('matrix must have an inverse matrix')
Hill cipher is a language dependent encryption method. That's why all characters will be lowercased and blank characters removed. Then, every letter will be replaced with its index value in the alphabet.
import string

def letterToNumber(letter):
    return string.ascii_lowercase.index(letter)

raw_message = "attack is to night"
print("raw message: ", raw_message)

message = []
for i in range(0, len(raw_message)):
    current_letter = raw_message[i:i+1].lower()
    if current_letter != ' ':  # discard blank characters
        letter_index = letterToNumber(current_letter)
        message.append(letter_index)
Encryption will be handled by multiplying the message and the key. This requires that the column count of the message equals the row count of the key; otherwise, the multiplication cannot be performed. We can pad the end of the message until the multiplication can be handled. Hill cipher is a block cipher method and repetition won't cause a weakness. Still, I prefer to append the beginning of the message instead of repeating a filler character. BTW, the column count of my message and the row count of my key are already compatible, so the following code block won't run for this case.
if len(message) % key_rows != 0:
    for i in range(0, len(message)):
        message.append(message[i])
        if len(message) % key_rows == 0:
            break
Now, we can transform the message into a matrix.
message = np.array(message)
message_length = message.shape[0]
message.resize(int(message_length / key_rows), key_rows)
Now, my message is stored in a 5×3 sized matrix as illustrated below.
[[ 0 19 19]
 [ 0  2 10]
 [ 8 18 19]
 [14 13  8]
 [ 6  7 19]]
The message is a 5×3 sized matrix and the key is a 3×3 sized matrix. The message's column count is equal to the key matrix's row count, so they can be multiplied. Multiplication might produce values greater than the alphabet size; that's why we apply modular arithmetic. Here, 26 refers to the size of the English alphabet. We can use either the matmul or dot function.
encryption = np.matmul(message, key)
encryption = np.remainder(encryption, 26)
Encrypted text will be stored in 5×3 sized matrix as illustrated below.
[[ 5 13 22]
 [ 0  6 22]
 [ 9  6  9]
 [10  3 13]
 [17 17 16]]
Remember that the plaintext was attackistonight. Focus on the 2nd and 3rd letters of the plaintext: both are the letter t. However, the 2nd and 3rd characters of the ciphertext are 13 and 22 respectively. The same character is substituted with different characters. This is the idea behind block ciphers.
Multiplying the ciphertext and the inverse of the key recreates the plaintext. Here, we need to find the inverse of the key. Finding a matrix inverse is a complex operation. Even though numpy has a matrix inverse function, we also need to apply modular arithmetic to this decimal matrix. On the other hand, SymPy handles modular arithmetic for matrix inverse operations easily.
from sympy import Matrix

inverse_key = Matrix(key).inv_mod(26)
inverse_key = np.array(inverse_key)  # sympy to numpy
inverse_key = inverse_key.astype(float)
This gives us the inverse key.
[[11. 22. 14.]
 [ 7.  9. 21.]
 [17.  0.  3.]]
We can validate the inverse key matrix: the product of the key and its inverse must equal the identity matrix (mod 26).
check = np.matmul(key, inverse_key)
check = np.remainder(check, 26)
This really produces the identity matrix.
[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]
Bob has found the inverse key and he has the ciphertext. He needs to multiply the ciphertext and inverse key matrices.
decryption = np.matmul(encryption, inverse_key)
decryption = np.remainder(decryption, 26).flatten()
As seen, decryption stores the exact message Alice sent.
decryption: [ 0. 19. 19. 0. 2. 10. 8. 18. 19. 14. 13. 8. 6. 7. 19.]
We can restore these values into characters.
def numberToLetter(number):
    return string.ascii_lowercase[int(number)]

decrypted_message = ""
for i in range(0, len(decryption)):
    letter = numberToLetter(decryption[i])
    decrypted_message = decrypted_message + letter
This restores the following message.
decrypted message: attackistonight
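If you want to run all of the steps above end to end without numpy or sympy, the following self-contained sketch replaces inv_mod with an explicit adjugate-based modular inverse. The helper names (mat_mul, inv_mod_3x3) are my own; the key and message are the ones used in this post.

```python
def mat_mul(A, B, m=26):
    # matrix product reduced modulo m
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) % m
             for j in range(len(B[0]))] for i in range(len(A))]

def inv_mod_3x3(K, m=26):
    # modular inverse via the adjugate: inv = det^(-1) * adj(K) (mod m)
    det = (K[0][0] * (K[1][1] * K[2][2] - K[1][2] * K[2][1])
         - K[0][1] * (K[1][0] * K[2][2] - K[1][2] * K[2][0])
         + K[0][2] * (K[1][0] * K[2][1] - K[1][1] * K[2][0])) % m
    det_inv = pow(det, -1, m)  # fails if det is not coprime to m
    inv = [[0] * 3 for _ in range(3)]
    for i in range(3):
        for j in range(3):
            # cofactor of entry (j, i): delete row j and column i
            minor = [[K[r][c] for c in range(3) if c != i]
                     for r in range(3) if r != j]
            cof = minor[0][0] * minor[1][1] - minor[0][1] * minor[1][0]
            inv[i][j] = (det_inv * ((-1) ** (i + j)) * cof) % m
    return inv

key = [[3, 10, 20], [20, 9, 17], [9, 4, 17]]
plain = "attackistonight"

nums = [ord(c) - ord('a') for c in plain]
blocks = [nums[i:i+3] for i in range(0, len(nums), 3)]  # 5x3 message matrix

cipher = mat_mul(blocks, key)
decrypted = mat_mul(cipher, inv_mod_3x3(key))
restored = "".join(chr(v + ord('a')) for row in decrypted for v in row)
print(restored)  # attackistonight
```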
The inventor Lester S. Hill registered this idea with the patent office. You should have a look at his drawings. He designed an encrypted telegraph machine at the beginning of the 1930s and named it the message protector. Today, we call this Hill's cipher machine.
In this post, we've worked with a 3x3 sized key, whose key space is 26^{9}. The patented mechanism works on 6x6 sized keys, which increases the key space to 26^{36}. This is very large even for today's computation power. Increasing the size of the key matrix makes the cipher much stronger. We can say that Hill cipher is secure against ciphertext-only attacks.
However, if an attacker can capture a plaintext-ciphertext pair, then he can calculate the key easily. That's why the cipher is weak against known-plaintext attacks, and that's why it fell out of use.
The source code of this post is pushed into the GitHub.
The post A Step by Step Hill Cipher Example appeared first on Sefik Ilkin Serengil.
Herein, advanced frameworks cannot always keep pace with innovations. For example, you cannot use Swish-based activation functions in Keras today. Support might appear in a future patch, but you may need to use a new activation function before that patch is pushed. So, this post will guide you through consuming a custom activation function outside of stock Keras and TensorFlow, such as Swish or E-Swish.
All you need is to create your custom activation function. In this case, I'll use Swish, which is x times sigmoid. Besides, I'll include it in a convolutional neural network model.
import keras
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

def swish(x):
    beta = 1.5  # 1, 1.5 or 2
    return beta * x * keras.backend.sigmoid(x)

model = Sequential()

# 1st convolution layer: 32 filters sized (3, 3)
model.add(Conv2D(32, (3, 3), activation=swish, input_shape=(28, 28, 1)))
model.add(MaxPooling2D(pool_size=(2, 2)))

# 2nd convolution layer: apply 64 filters sized (3, 3)
model.add(Conv2D(64, (3, 3), activation=swish))
model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Flatten())

# fully connected layer: 1 hidden layer consisting of 512 nodes
model.add(Dense(512, activation=swish))
model.add(Dense(num_classes, activation='softmax'))

model.compile(loss='categorical_crossentropy',
    optimizer=keras.optimizers.Adam(),
    metrics=['accuracy'])

# num_classes, epochs, x_train, y_train, x_test and y_test are defined
# earlier in the post; model.fit is used because the data are plain arrays
model.fit(x_train, y_train, epochs=epochs, validation_data=(x_test, y_test))
Remember that we use this activation function in the feed forward step, whereas we need its derivative in backpropagation. We only define the activation function; we do not offer its derivative. That's the power of TensorFlow: the framework knows how to differentiate it for backpropagation. This works because we built swish from the keras.backend module; if you design the swish function without keras.backend, then fitting will fail.
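To make the "framework differentiates for you" point concrete: the derivative TensorFlow computes behind the scenes is β·(σ(x) + x·σ(x)(1 − σ(x))). A quick pure-Python check against a finite difference, with no Keras required:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def swish(x, beta=1.5):
    return beta * x * sigmoid(x)

def swish_derivative(x, beta=1.5):
    # d/dx [beta * x * sigmoid(x)] by the product rule
    s = sigmoid(x)
    return beta * (s + x * s * (1.0 - s))

x, eps = 0.7, 1e-6
numeric = (swish(x + eps) - swish(x - eps)) / (2 * eps)  # central difference
print(abs(numeric - swish_derivative(x)))  # ~0
```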
So, we've covered how to include a new activation function in the learning process in the Keras / TensorFlow pair. Picking the most convenient activation function is a state-of-the-art problem for scientists, just like the structure (number of hidden layers, number of nodes in the hidden layers) and the learning parameters (learning rate, number of epochs). Now, you can design your own activation function, or consume any newly introduced one, just as in the following picture.
My friend and colleague Giray inspired me to produce this post. I am grateful to him, as usual.
The post Using Custom Activation Functions in Keras appeared first on Sefik Ilkin Serengil.
The post A Step by Step Adaboost Example appeared first on Sefik Ilkin Serengil.
We are going to work on the following data set. Each instance is represented as a point in 2-dimensional space, and we also have its class value. You can find the raw data set here.
x1 | x2 | Decision |
2 | 3 | true |
2.1 | 2 | true |
4.5 | 6 | true |
4 | 3.5 | false |
3.5 | 1 | false |
5 | 7 | true |
5 | 3 | false |
6 | 5.5 | true |
8 | 6 | false |
8 | 2 | false |
We should plot the features and class values to understand the data set clearly.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

df = pd.read_csv("dataset/adaboost.txt")

positives = df[df['Decision'] >= 0]
negatives = df[df['Decision'] < 0]

plt.scatter(positives['x1'], positives['x2'], marker='+', s=500*abs(positives['Decision']), c='blue')
plt.scatter(negatives['x1'], negatives['x2'], marker='_', s=500*abs(negatives['Decision']), c='red')
plt.show()
This code block produces the following graph. As seen, true classes are marked with plus signs whereas false classes are marked with minus signs.
We would like to separate the true and false classes, but this is not a linearly separable problem. Linear classifiers such as perceptrons or decision stumps cannot classify it alone. Herein, adaboost enables linear classifiers to solve this problem.
Decision trees approach problems with a divide and conquer method. They might have lots of nested decision rules, which makes them non-linear classifiers. In contrast, decision stumps are 1-level decision trees. They are linear classifiers, just like (single layer) perceptrons. You might decide that if someone's height is greater than 1.70 meters (5.57 feet), then it is a male; otherwise, a female. This decision stump would classify gender correctly with at least 50% accuracy. That's why these classifiers are called weak learners.
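The height example above, written as code (the threshold and samples are illustrative, not from the data set):

```python
def stump(height):
    # a 1-level decision: a single threshold on a single feature
    return "male" if height > 1.70 else "female"

samples = [(1.80, "male"), (1.60, "female"), (1.75, "male"), (1.68, "male")]
accuracy = sum(stump(h) == label for h, label in samples) / len(samples)
print(accuracy)  # 0.75: better than chance, but still a weak learner
```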
I've modified my decision tree repository to handle decision stumps. Basically, the buildDecisionTree function calls itself until reaching a decision. I terminate this recursive calling when adaboost is enabled.
The main principle in adaboost is to increase the weights of misclassified instances and decrease the weights of correctly classified ones. But we are working on a classification problem, and the target values in the data set are nominal. That's why we transform the problem into a regression task: I will set true classes to 1 and false classes to -1.
Initially, we distribute the weights uniformly. I set the weight of every instance to 1/n, where n is the total number of instances.
x1 | x2 | actual | weight | weighted_actual |
2 | 3 | 1 | 0.1 | 0.1 |
2 | 2 | 1 | 0.1 | 0.1 |
4 | 6 | 1 | 0.1 | 0.1 |
4 | 3 | -1 | 0.1 | -0.1 |
4 | 1 | -1 | 0.1 | -0.1 |
5 | 7 | 1 | 0.1 | 0.1 |
5 | 3 | -1 | 0.1 | -0.1 |
6 | 5 | 1 | 0.1 | 0.1 |
8 | 6 | -1 | 0.1 | -0.1 |
8 | 2 | -1 | 0.1 | -0.1 |
The weighted_actual column stores weight times actual value for each line. Now, we build a decision stump using weighted actual as the target value, with x1 and x2 as the features. The following rule set is created when I run the decision stump algorithm.
def findDecision(x1, x2):
    if x1 > 2.1:
        return -0.025
    if x1 <= 2.1:
        return 0.1
We set the actual values to ±1, but the decision stump returns decimal values. Here, the trick is that applying the sign function handles this issue.
def sign(x):
    if x > 0:
        return 1
    elif x < 0:
        return -1
    else:
        return 0
To sum up, prediction will be sign(-0.025) = -1 when x1 is greater than 2.1, and it will be sign(0.1) = +1 when x1 is less than or equal to 2.1.
I'll add the predictions as a column. Also, I check the equality of actual and prediction in the loss column: it is 0 if the prediction is correct and 1 if it is incorrect.
x1 | x2 | actual | weight | weighted_actual | prediction | loss | weight * loss |
2 | 3 | 1 | 0.1 | 0.1 | 1 | 0 | 0 |
2 | 2 | 1 | 0.1 | 0.1 | 1 | 0 | 0 |
4 | 6 | 1 | 0.1 | 0.1 | -1 | 1 | 0.1 |
4 | 3 | -1 | 0.1 | -0.1 | -1 | 0 | 0 |
4 | 1 | -1 | 0.1 | -0.1 | -1 | 0 | 0 |
5 | 7 | 1 | 0.1 | 0.1 | -1 | 1 | 0.1 |
5 | 3 | -1 | 0.1 | -0.1 | -1 | 0 | 0 |
6 | 5 | 1 | 0.1 | 0.1 | -1 | 1 | 0.1 |
8 | 6 | -1 | 0.1 | -0.1 | -1 | 0 | 0 |
8 | 2 | -1 | 0.1 | -0.1 | -1 | 0 | 0 |
The sum of the weight times loss column gives the total error, which is 0.3 in this case. Here, we'll define a new variable alpha: half the natural logarithm of (1 – ε)/ε.
α = ln[(1-ε)/ε] / 2 = ln[(1 – 0.3)/0.3] / 2 = 0.42
We’ll use alpha to update weights in the next round.
w_{i+1} = w_{i} * exp(-α * actual * prediction), where i refers to the round number and the update is applied to each instance separately.
Also, the sum of the weights must be equal to 1. That's why we have to normalize the weight values: dividing each weight by the sum of the weights column performs the normalization.
x1 | x2 | actual | weight | prediction | w_(i+1) | norm(w_(i+1)) |
2 | 3 | 1 | 0.1 | 1 | 0.065 | 0.071 |
2 | 2 | 1 | 0.1 | 1 | 0.065 | 0.071 |
4 | 6 | 1 | 0.1 | -1 | 0.153 | 0.167 |
4 | 3 | -1 | 0.1 | -1 | 0.065 | 0.071 |
4 | 1 | -1 | 0.1 | -1 | 0.065 | 0.071 |
5 | 7 | 1 | 0.1 | -1 | 0.153 | 0.167 |
5 | 3 | -1 | 0.1 | -1 | 0.065 | 0.071 |
6 | 5 | 1 | 0.1 | -1 | 0.153 | 0.167 |
8 | 6 | -1 | 0.1 | -1 | 0.065 | 0.071 |
8 | 2 | -1 | 0.1 | -1 | 0.065 | 0.071 |
This round is over.
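The round-1 numbers above can be reproduced in a few lines. The correct/incorrect flags come from the loss column of the earlier table, and the printed values match the normalized weights:

```python
import math

# round 1: all 10 instances start with weight 1/10;
# rows 3, 6 and 8 of the table were misclassified (loss = 1)
weights = [0.1] * 10
correct = [1, 1, 0, 1, 1, 0, 1, 0, 1, 1]  # 1 if actual == prediction, row order as in the table

epsilon = sum(w for w, c in zip(weights, correct) if c == 0)  # total error = 0.3
alpha = math.log((1 - epsilon) / epsilon) / 2                 # ~0.42

# shrink weights of correct instances, grow weights of incorrect ones, then normalize
updated = [w * math.exp(-alpha if c else alpha) for w, c in zip(weights, correct)]
normalized = [w / sum(updated) for w in updated]

print(round(normalized[0], 3), round(normalized[2], 3))  # 0.071 0.167
```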
I shift the normalized w_(i+1) column into the weight column in this round. Then, I build a new decision stump. Still, x1 and x2 are the features whereas weighted actual is the target value.
x1 | x2 | actual | weight | weighted_actual |
2 | 3 | 1 | 0.071 | 0.071 |
2 | 2 | 1 | 0.071 | 0.071 |
4 | 6 | 1 | 0.167 | 0.167 |
4 | 3 | -1 | 0.071 | -0.071 |
4 | 1 | -1 | 0.071 | -0.071 |
5 | 7 | 1 | 0.167 | 0.167 |
5 | 3 | -1 | 0.071 | -0.071 |
6 | 5 | 1 | 0.167 | 0.167 |
8 | 6 | -1 | 0.071 | -0.071 |
8 | 2 | -1 | 0.071 | -0.071 |
The graph of the new data set is demonstrated below. The weights of correctly classified instances decreased whereas the weights of misclassified ones increased.
The following decision stump will be built for this data set.
def findDecision(x1, x2):
    if x2 <= 3.5:
        return -0.02380952380952381
    if x2 > 3.5:
        return 0.10714285714285714
I applied the sign function to the predictions. Then, I put the loss and weight times loss values as columns.
x1 | x2 | actual | weight | prediction | loss | weight * loss |
2 | 3 | 1 | 0.071 | -1 | 1 | 0.071 |
2 | 2 | 1 | 0.071 | -1 | 1 | 0.071 |
4 | 6 | 1 | 0.167 | 1 | 0 | 0.000 |
4 | 3 | -1 | 0.071 | -1 | 0 | 0.000 |
4 | 1 | -1 | 0.071 | -1 | 0 | 0.000 |
5 | 7 | 1 | 0.167 | 1 | 0 | 0.000 |
5 | 3 | -1 | 0.071 | -1 | 0 | 0.000 |
6 | 5 | 1 | 0.167 | 1 | 0 | 0.000 |
8 | 6 | -1 | 0.071 | 1 | 1 | 0.071 |
8 | 2 | -1 | 0.071 | -1 | 0 | 0.000 |
I can calculate error and alpha values for round 2.
ε = 0.21, α = 0.65
So, weights for the following round can be found.
x1 | x2 | actual | weight | prediction | w_(i+1) | norm(w_(i+1)) |
2 | 3 | 1 | 0.071 | -1 | 0.137 | 0.167 |
2 | 2 | 1 | 0.071 | -1 | 0.137 | 0.167 |
4 | 6 | 1 | 0.167 | 1 | 0.087 | 0.106 |
4 | 3 | -1 | 0.071 | -1 | 0.037 | 0.045 |
4 | 1 | -1 | 0.071 | -1 | 0.037 | 0.045 |
5 | 7 | 1 | 0.167 | 1 | 0.087 | 0.106 |
5 | 3 | -1 | 0.071 | -1 | 0.037 | 0.045 |
6 | 5 | 1 | 0.167 | 1 | 0.087 | 0.106 |
8 | 6 | -1 | 0.071 | 1 | 0.137 | 0.167 |
8 | 2 | -1 | 0.071 | -1 | 0.037 | 0.045 |
I skipped the detailed calculations for the following rounds; only the summary tables, errors, alphas and stumps are shown.
x1 | x2 | actual | weight | prediction | loss | w * loss | w_(i+1) | norm(w_(i+1)) |
--- | --- | --- | --- | --- | --- | --- | --- | --- |
2 | 3 | 1 | 0.167 | 1 | 0 | 0.000 | 0.114 | 0.122 |
2 | 2 | 1 | 0.167 | 1 | 0 | 0.000 | 0.114 | 0.122 |
4 | 6 | 1 | 0.106 | -1 | 1 | 0.106 | 0.155 | 0.167 |
4 | 3 | -1 | 0.045 | -1 | 0 | 0.000 | 0.031 | 0.033 |
4 | 1 | -1 | 0.045 | -1 | 0 | 0.000 | 0.031 | 0.033 |
5 | 7 | 1 | 0.106 | -1 | 1 | 0.106 | 0.155 | 0.167 |
5 | 3 | -1 | 0.045 | -1 | 0 | 0.000 | 0.031 | 0.033 |
6 | 5 | 1 | 0.106 | -1 | 1 | 0.106 | 0.155 | 0.167 |
8 | 6 | -1 | 0.167 | -1 | 0 | 0.000 | 0.114 | 0.122 |
8 | 2 | -1 | 0.045 | -1 | 0 | 0.000 | 0.031 | 0.033 |
ε = 0.31, α = 0.38
```python
def findDecision(x1, x2):
    if x1 > 2.1:
        return -0.003787878787878794
    if x1 <= 2.1:
        return 0.16666666666666666
```
x1 | x2 | actual | weight | prediction | loss | w * loss | w_(i+1) | norm(w_(i+1)) |
--- | --- | --- | --- | --- | --- | --- | --- | --- |
2 | 3 | 1 | 0.122 | 1 | 0 | 0.000 | 0.041 | 0.068 |
2 | 2 | 1 | 0.122 | 1 | 0 | 0.000 | 0.041 | 0.068 |
4 | 6 | 1 | 0.167 | 1 | 0 | 0.000 | 0.056 | 0.093 |
4 | 3 | -1 | 0.033 | 1 | 1 | 0.033 | 0.100 | 0.167 |
4 | 1 | -1 | 0.033 | 1 | 1 | 0.033 | 0.100 | 0.167 |
5 | 7 | 1 | 0.167 | 1 | 0 | 0.000 | 0.056 | 0.093 |
5 | 3 | -1 | 0.033 | 1 | 1 | 0.033 | 0.100 | 0.167 |
6 | 5 | 1 | 0.167 | 1 | 0 | 0.000 | 0.056 | 0.093 |
8 | 6 | -1 | 0.122 | -1 | 0 | 0.000 | 0.041 | 0.068 |
8 | 2 | -1 | 0.033 | -1 | 0 | 0.000 | 0.011 | 0.019 |
ε = 0.10, α = 1.10
```python
def findDecision(x1, x2):
    if x1 <= 6.0:
        return 0.08055555555555555
    if x1 > 6.0:
        return -0.07777777777777778
```
The cumulative sum of each round's alpha times its prediction gives the final prediction.
α (round 1) | pred (round 1) | α (round 2) | pred (round 2) | α (round 3) | pred (round 3) | α (round 4) | pred (round 4) | final pred | actual |
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
0.42 | 1 | 0.65 | -1 | 0.38 | 1 | 1.1 | 1 | 1 | 1 |
0.42 | 1 | 0.65 | -1 | 0.38 | 1 | 1.1 | 1 | 1 | 1 |
0.42 | -1 | 0.65 | 1 | 0.38 | -1 | 1.1 | 1 | 1 | 1 |
0.42 | -1 | 0.65 | -1 | 0.38 | -1 | 1.1 | 1 | -1 | -1 |
0.42 | -1 | 0.65 | -1 | 0.38 | -1 | 1.1 | 1 | -1 | -1 |
0.42 | -1 | 0.65 | 1 | 0.38 | -1 | 1.1 | 1 | 1 | 1 |
0.42 | -1 | 0.65 | -1 | 0.38 | -1 | 1.1 | 1 | -1 | -1 |
0.42 | -1 | 0.65 | 1 | 0.38 | -1 | 1.1 | 1 | 1 | 1 |
0.42 | -1 | 0.65 | 1 | 0.38 | -1 | 1.1 | -1 | -1 | -1 |
0.42 | -1 | 0.65 | -1 | 0.38 | -1 | 1.1 | -1 | -1 | -1 |
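The final column above is the sign of the alpha-weighted sum of the stump outputs. We can verify the whole table with the standard library only:

```python
alphas = [0.42, 0.65, 0.38, 1.1]
# per-round stump predictions for the 10 instances (columns of the table above)
preds = [
    [ 1, -1,  1,  1],
    [ 1, -1,  1,  1],
    [-1,  1, -1,  1],
    [-1, -1, -1,  1],
    [-1, -1, -1,  1],
    [-1,  1, -1,  1],
    [-1, -1, -1,  1],
    [-1,  1, -1,  1],
    [-1,  1, -1, -1],
    [-1, -1, -1, -1],
]
final = [1 if sum(a * p for a, p in zip(alphas, row)) > 0 else -1 for row in preds]
print(final)  # [1, 1, 1, -1, -1, 1, -1, 1, -1, -1]
```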
You might notice that round 1 and round 3 produce the same predictions. Pruning in AdaBoost removes such duplicate weak classifiers and adds the alpha of the removed one to the alpha of the one that remains. In this case, I remove round 3 and add its alpha (0.38) to round 1's (0.42), giving 0.8.
α (round 1) | pred (round 1) | α (round 2) | pred (round 2) | α (round 4) | pred (round 4) | final pred | actual |
--- | --- | --- | --- | --- | --- | --- | --- |
0.8 | 1 | 0.65 | -1 | 1.1 | 1 | 1 | 1 |
0.8 | 1 | 0.65 | -1 | 1.1 | 1 | 1 | 1 |
0.8 | -1 | 0.65 | 1 | 1.1 | 1 | 1 | 1 |
0.8 | -1 | 0.65 | -1 | 1.1 | 1 | -1 | -1 |
0.8 | -1 | 0.65 | -1 | 1.1 | 1 | -1 | -1 |
0.8 | -1 | 0.65 | 1 | 1.1 | 1 | 1 | 1 |
0.8 | -1 | 0.65 | -1 | 1.1 | 1 | -1 | -1 |
0.8 | -1 | 0.65 | 1 | 1.1 | 1 | 1 | 1 |
0.8 | -1 | 0.65 | 1 | 1.1 | -1 | -1 | -1 |
0.8 | -1 | 0.65 | -1 | 1.1 | -1 | -1 | -1 |
Even though we used simple linear weak classifiers, all instances are classified correctly.
So, we have covered the adaptive boosting algorithm. In this example, we used decision stumps as weak classifiers; you could also use perceptrons for more complex data sets. I have pushed the AdaBoost logic to my GitHub repository.
Special thanks to Olga Veksler; her lecture notes helped me understand this concept.
The post A Step by Step Adaboost Example appeared first on Sefik Ilkin Serengil.
The post A Step by Step Gradient Boosting Example for Classification appeared first on Sefik Ilkin Serengil.
Notice that gradient boosting is not a decision tree algorithm itself. It builds regression trees sequentially.
Here, we are going to work with the Iris data set. It contains 150 instances of 3 balanced classes: setosa, versicolor and virginica. The class is the target output, whereas sepal and petal lengths and widths are the input features.
Applying the C4.5 decision tree algorithm to this data set classifies 105 instances correctly and 45 incorrectly. That is 70% accuracy, which is far from satisfactory. We will run the same C4.5 algorithm in the following steps, but boosting will increase the accuracy.
You can find the building decision tree code here.
We are going to apply one-hot encoding to the target output, so the output will be represented as a three-dimensional vector. However, a decision tree algorithm can handle only one output; that is why we will build 3 different regression trees in each round. You can think of each tree as a separate binary classification problem.
I have selected a few sample rows of the data set for illustration. This is the original form.
instance | sepal_length | sepal_width | petal_length | petal_width | label |
--- | --- | --- | --- | --- | --- |
1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
2 | 4.9 | 3 | 1.4 | 0.2 | setosa |
51 | 7 | 3.2 | 4.7 | 1.4 | versicolor |
101 | 6.3 | 3.3 | 6 | 2.5 | virginica |
Label consists of 3 classes: setosa, versicolor and virginica.
Firstly, I prepare a data set that checks whether an instance is setosa or not.
instance | sepal_length | sepal_width | petal_length | petal_width | setosa |
--- | --- | --- | --- | --- | --- |
1 | 5.1 | 3.5 | 1.4 | 0.2 | 1 |
2 | 4.9 | 3 | 1.4 | 0.2 | 1 |
51 | 7 | 3.2 | 4.7 | 1.4 | 0 |
101 | 6.3 | 3.3 | 6 | 2.5 | 0 |
Secondly, I prepare a data set that checks whether an instance is versicolor or not.
instance | sepal_length | sepal_width | petal_length | petal_width | versicolor |
--- | --- | --- | --- | --- | --- |
1 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
2 | 4.9 | 3 | 1.4 | 0.2 | 0 |
51 | 7 | 3.2 | 4.7 | 1.4 | 1 |
101 | 6.3 | 3.3 | 6 | 2.5 | 0 |
Finally, I prepare a data set that checks whether an instance is virginica or not.
instance | sepal_length | sepal_width | petal_length | petal_width | virginica |
--- | --- | --- | --- | --- | --- |
1 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
2 | 4.9 | 3 | 1.4 | 0.2 | 0 |
51 | 7 | 3.2 | 4.7 | 1.4 | 0 |
101 | 6.3 | 3.3 | 6 | 2.5 | 1 |
Now, I have 3 different data sets. I can build 3 decision trees for these data sets.
I’m going to put actual labels and predictions in the same table in the following steps. Columns beginning with F_ prefix are predictions.
instance | Y_setosa | Y_versicolor | Y_virginica | F_setosa | F_versicolor | F_virginica |
--- | --- | --- | --- | --- | --- | --- |
1 | 1 | 0 | 0 | 1 | 0 | 0 |
2 | 1 | 0 | 0 | 1 | 0 | 0 |
51 | 0 | 1 | 0 | 0 | 1 | 0 |
101 | 0 | 0 | 1 | 0 | 1 | 1 |
Notice that instance 101 is predicted as versicolor and virginica with the same score. This prediction contains an error.
Initially, we need to apply the softmax function to the predictions. This function normalizes all inputs into the [0, 1] range, and the normalized values always sum to 1. There is no out-of-the-box softmax function in Python's standard library, but we can create one easily, as coded below.
```python
import numpy as np

def softmax(w):
    e = np.exp(np.array(w))
    dist = e / np.sum(e)
    return dist
```
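As a sanity check, applying softmax to the one-hot prediction of instance 1 reproduces the probabilities in the next table (a standard-library version of the same function, to keep the example self-contained):

```python
import math

def softmax(w):
    e = [math.exp(v) for v in w]
    return [v / sum(e) for v in e]

# F_setosa, F_versicolor, F_virginica for instance 1
probs = softmax([1, 0, 0])
print([round(p, 3) for p in probs])  # [0.576, 0.212, 0.212]
```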
I am going to add these probabilities as columns. I have also hidden the actual values (Y_ prefix) so that the table fits.
ins | F_setosa | F_versicolor | F_virginica | P_setosa | P_versicolor | P_virginica |
--- | --- | --- | --- | --- | --- | --- |
1 | 1 | 0 | 0 | 0.576 | 0.212 | 0.212 |
2 | 1 | 0 | 0 | 0.576 | 0.212 | 0.212 |
51 | 0 | 1 | 0 | 0.212 | 0.576 | 0.212 |
101 | 0 | 1 | 1 | 0.155 | 0.422 | 0.422 |
Remember that in the regression case we built each new tree on the actual minus prediction target values; that difference comes from the derivative of the mean squared error. Here, we apply the softmax function instead. The class with the maximum probability (among the P_ columns) becomes the prediction; in other words, one-hot encoding assigns 1 to the maximum and 0 to the others. Cross entropy captures the relation between the probabilities and the one-hot-encoded results, and applying softmax followed by cross entropy has a surprisingly simple derivative: prediction (the probabilities here) minus actual. The negative gradient is therefore actual (Y_ columns) minus prediction (P_ columns), and that is the value we will derive.
instance | Y_setosa - P_setosa | Y_versicolor - P_versicolor | Y_virginica - P_virginica |
--- | --- | --- | --- |
1 | 0.424 | -0.212 | -0.212 |
2 | 0.424 | -0.212 | -0.212 |
51 | -0.212 | 0.424 | -0.212 |
101 | -0.155 | -0.422 | 0.578 |
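For example, instance 101 had raw scores F = (0, 1, 1) and one-hot label Y = (0, 0, 1); its negative-gradient row follows directly (standard-library sketch):

```python
import math

def softmax(w):
    e = [math.exp(v) for v in w]
    return [v / sum(e) for v in e]

F = [0, 1, 1]  # raw predictions of instance 101
Y = [0, 0, 1]  # one-hot actual: virginica

P = softmax(F)
negative_gradient = [y - p for y, p in zip(Y, P)]
print([round(g, 3) for g in negative_gradient])  # [-0.155, -0.422, 0.578]
```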
This concludes round 1. The target values will be replaced with these negative gradients in the following round.
Target column for setosa will be replaced with Y_setosa – P_setosa.
instance | sepal_length | sepal_width | petal_length | petal_width | setosa |
--- | --- | --- | --- | --- | --- |
1 | 5.1 | 3.5 | 1.4 | 0.2 | 0.424 |
2 | 4.9 | 3 | 1.4 | 0.2 | 0.424 |
51 | 7 | 3.2 | 4.7 | 1.4 | -0.212 |
101 | 6.3 | 3.3 | 6 | 2.5 | -0.155 |
Target column for versicolor will be replaced with Y_versicolor – P_versicolor.
instance | sepal_length | sepal_width | petal_length | petal_width | versicolor |
--- | --- | --- | --- | --- | --- |
1 | 5.1 | 3.5 | 1.4 | 0.2 | -0.212 |
2 | 4.9 | 3 | 1.4 | 0.2 | -0.212 |
51 | 7 | 3.2 | 4.7 | 1.4 | 0.424 |
101 | 6.3 | 3.3 | 6 | 2.5 | -0.422 |
I will apply a similar replacement for virginica, too. These are my new data sets, and I am going to build 3 different regression trees on them. This procedure is repeated until satisfactory accuracy is reached.
Finally, I sum the predictions (F_ columns) over all rounds. The index with the maximum total is the final prediction.
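This accumulation step can be sketched as follows; the per-round scores here are made-up illustrative numbers, not values from the Iris run:

```python
# hypothetical F_ scores for one instance across three boosting rounds
rounds = [
    [0.1, 0.5, 0.4],   # round 1: (setosa, versicolor, virginica)
    [-0.2, 0.3, 0.6],  # round 2
    [-0.1, 0.2, 0.7],  # round 3
]

totals = [sum(col) for col in zip(*rounds)]  # element-wise sum over rounds
prediction = totals.index(max(totals))       # argmax over classes
print(totals, prediction)
```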
At round 10, I classify 144 instances correctly and 6 incorrectly, which means 96% accuracy. Remember that I got 70% accuracy before boosting; this is a major improvement!
I have demonstrated gradient boosting for classification on a multi-class problem where the number of classes is greater than 2. Running it on a binary (true/false) problem would typically use the sigmoid function instead; still, the softmax and cross-entropy pair works for binary classification as well.
So, we have walked through a step by step gradient boosting example for classification; I could not find such a walkthrough in the literature. Basically, we transformed a classification problem into multiple regression tasks in order to boost. I am grateful to Cheng Li, whose lecture notes guided me through this topic. Finally, running and debugging the code yourself makes the concept much clearer; that is why I have pushed the code of gradient boosting for classification to GitHub.
The post How Pruning Works in Decision Trees appeared first on Sefik Ilkin Serengil.
Pruning can be handled in two ways: pre-pruning and post-pruning.
We mentioned regression trees in a previous post. We are going to use the same data set as in that post, shown below.
Day | Outlook | Temp. | Humidity | Wind | Golf Players |
--- | --- | --- | --- | --- | --- |
1 | Sunny | Hot | High | Weak | 25 |
2 | Sunny | Hot | High | Strong | 30 |
3 | Overcast | Hot | High | Weak | 46 |
4 | Rain | Mild | High | Weak | 45 |
5 | Rain | Cool | Normal | Weak | 52 |
6 | Rain | Cool | Normal | Strong | 23 |
7 | Overcast | Cool | Normal | Strong | 43 |
8 | Sunny | Mild | High | Weak | 35 |
9 | Sunny | Cool | Normal | Weak | 38 |
10 | Rain | Mild | Normal | Weak | 46 |
11 | Sunny | Mild | Normal | Strong | 48 |
12 | Overcast | Mild | High | Strong | 52 |
13 | Overcast | Hot | Normal | Weak | 44 |
14 | Rain | Mild | High | Strong | 30 |
Running regression tree algorithm constructs the following decision tree.
```python
def findDecision(Outlook, Temp, Humidity, Wind):
    if Outlook == 'Rain':
        if Wind == 'Weak':
            if Humidity <= 95:
                if Temp <= 83:
                    return 46
            if Humidity > 95:
                return 45
        if Wind == 'Strong':
            if Temp <= 83:
                if Humidity <= 95:
                    return 23
    if Outlook == 'Sunny':
        if Temp <= 83:
            if Wind == 'Weak':
                if Humidity <= 95:
                    return 35
            if Wind == 'Strong':
                if Humidity <= 95:
                    return 30
        if Temp > 83:
            return 25
    if Outlook == 'Overcast':
        if Wind == 'Weak':
            if Temp <= 83:
                if Humidity <= 95:
                    return 46
        if Wind == 'Strong':
            if Temp <= 83:
                if Humidity <= 95:
                    return 43
```
As seen, a disappointingly huge tree is created. This is a typical problem of regression trees: decision rules at the bottom cover only a few instances, or even a single one, which causes overfitting. Here, we can apply early stopping by checking either the number of instances in the current branch or the ratio of the current branch's standard deviation to that of the whole data set.
```python
if algorithm == 'Regression' and subdataset.shape[0] < 5:
    # alternative: if algorithm == 'Regression' and subdataset['Decision'].std(ddof=0) / global_stdev < 0.4:
    final_decision = subdataset['Decision'].mean()  # get average
    terminateBuilding = True
```
Enabling early stopping when the sub data set in the current branch has fewer than, say, 5 instances constructs the following decision tree. As seen, more generalized decision rules are created, which avoids overfitting.
```python
def findDecision(Outlook, Temp, Humidity, Wind):
    if Outlook == 'Rain':
        if Wind == 'Weak':
            return 47.666666666666664
        if Wind == 'Strong':
            return 26.5
    if Outlook == 'Sunny':
        if Temp <= 83:
            return 37.75
        if Temp > 83:
            return 25
    if Outlook == 'Overcast':
        return 46.25
```
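Each leaf value in this early-stopped tree is simply the mean of the training instances that reach it. A quick check against the Rain rows of the table above:

```python
# golf player counts for the Rain rows of the data set above
rain_weak   = [45, 52, 46]  # days 4, 5, 10: Rain and Weak wind
rain_strong = [23, 30]      # days 6, 14: Rain and Strong wind

print(sum(rain_weak) / len(rain_weak))      # the Rain/Weak leaf, ~47.67
print(sum(rain_strong) / len(rain_strong))  # the Rain/Strong leaf, 26.5
```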
We mentioned the C4.5 decision tree algorithm in a previous post. Suppose that we are going to work on the following data set.
Day | Outlook | Temp. | Humidity | Wind | Decision |
--- | --- | --- | --- | --- | --- |
1 | Sunny | 85 | 85 | Weak | No |
2 | Sunny | 80 | 90 | Strong | No |
3 | Overcast | 83 | 78 | Weak | Yes |
4 | Rain | 70 | 96 | Weak | Yes |
5 | Rain | 68 | 80 | Weak | Yes |
6 | Rain | 65 | 70 | Strong | No |
7 | Overcast | 64 | 65 | Strong | Yes |
8 | Sunny | 72 | 95 | Weak | No |
9 | Sunny | 69 | 70 | Weak | Yes |
10 | Rain | 75 | 80 | Weak | Yes |
11 | Sunny | 75 | 70 | Strong | Yes |
12 | Overcast | 72 | 90 | Strong | Yes |
13 | Overcast | 81 | 75 | Weak | Yes |
14 | Rain | 71 | 80 | Strong | No |
The C4.5 algorithm constructs the following decision tree. Notice that it differs from the tree in the related blog post because we used the information gain metric there, whereas we use the gain ratio metric here.
```python
def findDecision(Outlook, Temp, Humidity, Wind):
    if Temp <= 83:
        if Outlook == 'Rain':
            if Wind == 'Weak':
                return 'Yes'
            if Wind == 'Strong':
                return 'No'
        if Outlook == 'Overcast':
            return 'Yes'
        if Outlook == 'Sunny':
            if Humidity > 65:
                if Wind == 'Strong':
                    return 'Yes'
                if Wind == 'Weak':
                    return 'Yes'
    if Temp > 83:
        return 'No'
```
Here, focus on the decisions for temperature less than or equal to 83 and sunny outlook: this branch returns Yes no matter what the wind is, yet it still checks the wind feature. We can prune the wind check at that level. Also, you might notice there is no answer when humidity is less than or equal to 65; this is because 65 is the minimum humidity value in the data set. We can prune that rule, too. The final form of the decision tree is illustrated below.
```python
def findDecision(Outlook, Temp, Humidity, Wind):
    if Temp <= 83:
        if Outlook == 'Rain':
            if Wind == 'Weak':
                return 'Yes'
            if Wind == 'Strong':
                return 'No'
        if Outlook == 'Overcast':
            return 'Yes'
        if Outlook == 'Sunny':
            return 'Yes'
    if Temp > 83:
        return 'No'
```
This modification improves the runtime performance of the decision tree, because fewer rules are evaluated while the decisions remain unchanged.
We pruned those decision rules because their parent branch covers both outcomes, but that is not the only criterion. You should also prune branches that derive from only a few training instances; this helps to avoid overfitting.
To sum up, post-pruning builds the decision tree first and then prunes decision rules from the leaves back toward the root, whereas pre-pruning happens while the tree is being built. In both cases, less complex trees are created, which makes the decision rules run faster and also helps to avoid overfitting.
All code and data sets have already been pushed to GitHub; you can run them yourself.
The post A Gentle Introduction to LightGBM for Applied Machine Learning appeared first on Sefik Ilkin Serengil.
You can run the pip install lightgbm command to install the LightGBM package. Then, we will import the library.
```python
import lightgbm as lgb
```
The data set we are going to work on concerns the decision to play golf based on some weather features. You can find the data set here. I chose this data set because it has both numeric and string features. The Decision column is the target from which we would like to extract decision rules. I will load the data set with pandas because it simplifies column-based operations in the following steps.
```python
import pandas as pd

dataset = pd.read_csv('golf2.txt')
dataset.head()
```
Data frame’s head function prints the first 5 rows.
 | Outlook | Temp. | Humidity | Wind | Decision |
--- | --- | --- | --- | --- | --- |
0 | Sunny | 85 | 85 | Weak | No |
1 | Sunny | 80 | 90 | Strong | No |
2 | Overcast | 83 | 78 | Weak | Yes |
3 | Rain | 70 | 96 | Weak | Yes |
4 | Rain | 68 | 80 | Weak | Yes |
LightGBM expects categorical features to be encoded as integers. Here, the temperature and humidity features are already numeric, but outlook and wind are categorical, so we need to convert them. I will use scikit-learn's LabelEncoder.
Even though the categorical features are converted to integers, we will still declare them as categorical in the following steps. That is why I store all features and the categorical ones in separate variables.
```python
from sklearn import preprocessing

le = preprocessing.LabelEncoder()

features = []
categorical_features = []

num_of_columns = dataset.shape[1]
for i in range(0, num_of_columns):
    column_name = dataset.columns[i]
    column_type = dataset[column_name].dtypes
    if i != num_of_columns - 1:  # skip target
        features.append(column_name)
    if column_type == 'object':
        le.fit(dataset[column_name])
        feature_classes = list(le.classes_)
        encoded_feature = le.transform(dataset[column_name])
        dataset[column_name] = pd.DataFrame(encoded_feature)
        if i != num_of_columns - 1:  # skip target
            categorical_features.append(column_name)
        if is_regression == False and i == num_of_columns - 1:
            num_of_classes = len(feature_classes)
```
In this way, we can handle different data sets. Let’s check the encoded data set.
dataset.head()
 | Outlook | Temp. | Humidity | Wind | Decision |
--- | --- | --- | --- | --- | --- |
0 | 2 | 85 | 85 | 1 | 0 |
1 | 2 | 80 | 90 | 0 | 0 |
2 | 0 | 83 | 78 | 1 | 1 |
3 | 1 | 70 | 96 | 1 | 1 |
4 | 1 | 68 | 80 | 1 | 1 |
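Notice that the encoder assigned the integers in alphabetical order of the category names: Overcast becomes 0, Rain 1, Sunny 2 (and Strong 0, Weak 1 for wind). This mapping, which scikit-learn's LabelEncoder produces, can be reproduced with plain Python:

```python
# reproduce LabelEncoder's alphabetical integer mapping for the Outlook column
values = ['Sunny', 'Sunny', 'Overcast', 'Rain', 'Rain']
classes = sorted(set(values))  # alphabetical order, as LabelEncoder does
mapping = {c: i for i, c in enumerate(classes)}

encoded = [mapping[v] for v in values]
print(encoded)  # [2, 2, 0, 1, 1], matching the encoded table above
```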
The data set is transformed into its final form. We need to separate the input features and the output labels to feed LightGBM.
```python
y_train = dataset['Decision'].values
x_train = dataset.drop(columns=['Decision']).values
```
Remember that we converted the string features to integers. Here, we need to specify which features are categorical. Training would still work without this, but then a node in the decision tree might check whether such a feature is greater than, or less than or equal to, some value. Suppose gender were a feature in our data set, with unknown set to 0, male to 1, and female to 2. If the tree checked whether gender is greater than 0, it would lump male and female together and lose important information. Specifying the categorical features lets the tree test male, female and unknown as distinct categories.
```python
lgb_train = lgb.Dataset(x_train, y_train,
    feature_name=features,
    categorical_feature=categorical_features)
```
We can solve both classification and regression problems this way; typically, only the objective and metric parameters differ. Passing the parameter set and LightGBM's data set starts the training.
```python
params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'regression' if is_regression == True else 'multiclass',
    'num_class': num_of_classes,
    'metric': 'rmsle' if is_regression == True else 'multi_logloss',
    'min_data': 1,
    'verbose': -1
}

gbm = lgb.train(params, lgb_train, num_boost_round=50)
```
The trained model is stored in the gbm variable. We can ask gbm to predict the decision for a new instance; similarly, we can feed it the features of the training set instances and have it predict their decisions.
```python
predictions = gbm.predict(x_train)

for index, instance in dataset.iterrows():
    actual = instance[target_name]
    if is_regression == True:
        prediction = round(predictions[index])
    else:  # classification
        prediction = np.argmax(predictions[index])
    print((index + 1), ". actual= ", actual, ", prediction= ", prediction)
```
This code block makes the following predictions for the training data set. As seen, all instances are predicted correctly.
```
actual= 0 , prediction= 0
actual= 0 , prediction= 0
actual= 1 , prediction= 1
actual= 1 , prediction= 1
actual= 1 , prediction= 1
actual= 0 , prediction= 0
actual= 1 , prediction= 1
actual= 0 , prediction= 0
actual= 1 , prediction= 1
actual= 1 , prediction= 1
actual= 1 , prediction= 1
actual= 1 , prediction= 1
actual= 1 , prediction= 1
actual= 0 , prediction= 0
```
Luckily, LightGBM can visualize the built decision tree and the importance of the data set's features, which makes the decisions understandable. This requires the Graphviz graph visualization software.
Firstly, run the pip install graphviz command to install the Python package.
Secondly, install the graphviz package for your operating system from here. You can then specify the installation directory as illustrated below.
```python
import matplotlib.pyplot as plt
import os

os.environ["PATH"] += os.pathsep + 'C:/Program Files (x86)/Graphviz2.38/bin'
```
Plotting tree is an easy task now.
```python
ax = lgb.plot_importance(gbm, max_num_features=10)
plt.show()

ax = lgb.plot_tree(gbm)
plt.show()
```
Decision rules can be extracted from the built tree easily.
Now, we know feature importance for the data set.
We can compute the accuracy score as coded below.
```python
predictions_classes = []
for i in predictions:
    if is_regression == True:
        predictions_classes.append(round(i))
    else:
        predictions_classes.append(np.argmax(i))
predictions_classes = np.array(predictions_classes)

from sklearn.metrics import confusion_matrix, accuracy_score, roc_curve, auc

accuracy = accuracy_score(predictions_classes, y_train) * 100
print(accuracy, "%")
Moreover, for a classification problem, precision and recall are more informative metrics than raw accuracy.
```python
if is_regression == False:
    actuals_onehot = pd.get_dummies(y_train).values
    # index class columns (not the first row) when computing the ROC curve
    false_positive_rate, recall, thresholds = roc_curve(actuals_onehot[:, 1], predictions[:, 1])
    roc_auc = auc(false_positive_rate, recall)
    print("AUC score ", roc_auc)
```
So, we have explored Microsoft's Light Gradient Boosting Machine framework, adopted by many applied machine learning studies, mentioned its pros and cons compared to its alternatives, and developed a hello-world model with LightGBM. Finally, I pushed the source code of this blog post to my GitHub profile.