Recently, Kaggle announced a competition aiming to find related face pairs in a random set. This was my first Kaggle experience. The competition type is knowledge, which means it offers no prize, ranking points or tiers. Having a Kaggle title carries prestige in the data world, and this competition would not contribute to any title. It still attracted me to join the challenge, because I believed that my previous face recognition experience might contribute to solving a kinship detection problem. In this post, I summarize the road map I followed in the competition.
Data set
Each family has its own folder (e.g. F0106). In each family folder, family members have their own folders (e.g. F0106/MID2). These individual folders contain pictures of that individual (e.g. F0106/MID2/P01089_face2.jpg).
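As an illustration, this folder structure can be walked with glob to collect all face images and individuals. This is just a sketch; the train/ root directory name is an assumption.

from glob import glob

#e.g. 'train/F0106/MID2/P01089_face2.jpg' belongs to individual 'F0106/MID2'
images = glob("train/*/*/*.jpg")
individuals = set("/".join(img.split("/")[1:3]) for img in images)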
It seems that the mother, father and children of a family exist in the same folder. train_relationships.csv stores the relations of individuals. This file might not seem important, but it actually is. In a family, mother-child and father-child pairs are related whereas the mother-father pair is not. We will feed these concrete relationships to the train set. There are 165K related instances mentioned in the data set.
The problem in this case is that we just have positive examples. train_relationships.csv does not tell us who is not related. We can generate data for negative examples: pairing members of different families yields unrelated pairs. I randomly append 283K unrelated instances to the data set.
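Here is a minimal sketch of this negative sampling step, assuming the positive pairs are loaded into a dataframe with p1 and p2 columns holding paths like F0106/MID2 (the column names are an assumption about the csv layout; the count comes from the text above).

import random
import pandas as pd

def family_of(person):
	#a person is identified by a path like 'F0106/MID2'; the family id is the first part
	return person.split("/")[0]

def generate_negatives(positives, count):
	people = pd.concat([positives['p1'], positives['p2']]).unique().tolist()
	pairs = []
	while len(pairs) < count:
		p1, p2 = random.sample(people, 2)
		if family_of(p1) != family_of(p2): #keep only cross-family pairs
			pairs.append((p1, p2))
	return pd.DataFrame(pairs, columns=['p1', 'p2'])

positives = pd.read_csv("train_relationships.csv") #assumed columns: p1, p2
negatives = generate_negatives(positives, 283000)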
So, 36% of the data set consists of related instances whereas 64% consists of unrelated instances. The number of negative instances is deliberately larger than the number of positive examples, to discourage the model from tending to classify instances as related.
Face recognition models
I’ve mentioned Oxford’s VGG-Face, Google’s FaceNet and CMU’s OpenFace models in this blog. They are all deep convolutional networks that represent faces as vectors. In face recognition, we compare the distances between these representation vectors. Distances less than a custom threshold can be classified as the same person.
The first concern is that this competition is not a face recognition task. Threshold values should be greater than the ones we used in the face recognition task, but the distances of related pairs should still be less than those of unrelated ones. Deciding an optimal threshold is important.
The second concern is that these face recognition models are successful, but they are not perfect. However, combining them should yield a much stronger model: one might fail where another succeeds.
The third concern is that the distance between two vectors can be measured with either cosine or Euclidean distance. Finding the better metric for each case will be important.
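To make these distances concrete, here is a minimal sketch of both metrics and the threshold idea, assuming the embeddings are already extracted as numpy vectors (the vector size and the 0.40 threshold are purely illustrative).

import numpy as np

def cosine_distance(a, b):
	return 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def euclidean_l2_distance(a, b):
	#l2-normalize the embeddings before measuring the euclidean distance
	a = a / np.linalg.norm(a)
	b = b / np.linalg.norm(b)
	return np.linalg.norm(a - b)

emb1 = np.random.rand(128) #stand-ins for real face embeddings
emb2 = np.random.rand(128)

#verification: same person if the distance falls below a model-specific threshold
same_person = cosine_distance(emb1, emb2) < 0.40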
Enriching face recognition model results
Besides feeding the cosine and Euclidean distance values of the 3 different face recognition models as features, I will also feed the relations between these metrics as custom features.
Averages of the cosine and Euclidean distance values across models might be helpful.
df['cosine_avg'] = (df['vgg_cosine'] + df['facenet_cosine'] + df['openface_cosine']) / 3
df['euclidean_l2_avg'] = (df['vgg_euclidean_l2'] + df['facenet_euclidean_l2'] + df['openface_euclidean_l2']) / 3
Each face recognition model has 2 distance metrics. The ratio of these distance metrics might be helpful, too.
df['vgg_ratio'] = df['vgg_euclidean_l2'] / df['vgg_cosine']
df['facenet_ratio'] = df['facenet_euclidean_l2'] / df['facenet_cosine']
df['openface_ratio'] = df['openface_euclidean_l2'] / df['openface_cosine']
Finally, distance metric ratios across different models should be helpful as well.
df['vgg_over_facenet_cosine'] = df['vgg_cosine'] / df['facenet_cosine']
df['vgg_over_facenet_euclidean'] = df['vgg_euclidean_l2'] / df['facenet_euclidean_l2']
df['vgg_over_openface_cosine'] = df['vgg_cosine'] / df['openface_cosine']
df['vgg_over_openface_euclidean'] = df['vgg_euclidean_l2'] / df['openface_euclidean_l2'] #fixed: this line was overwriting the cosine ratio column
df['facenet_over_openface_cosine'] = df['facenet_cosine'] / df['openface_cosine']
df['facenet_over_openface_euclidean'] = df['facenet_euclidean_l2'] / df['openface_euclidean_l2']
I saw a dramatic contribution from the ratio features. Including age and embedding ratios increased the model accuracy from 72% to 77%.
Additional features
Face recognition models will be the strongest link in this approach, but we can still enrich the train set. We’ve mentioned age and gender prediction in this blog before. Related pairs might have a small distance if their ages are close. Besides, related pairs might have a small distance if they are the same gender. Each line stores 2 face photos, and the order of the pair is not important. That is why putting raw age and gender predictions in custom columns would cause overfitting. Instead, I’ll feed the relations between these predictions: the age difference and age ratio of the two people in each line, and whether they are the same gender.
import numpy as np

#age difference
df['age_diff'] = (df['p1_age'] - df['p2_age']).abs()

#age ratio, always >= 1 (older age over younger age)
#fixed: the original chained assignment df[df['age_ratio'] < 1]['age_ratio'] = ... has no effect in pandas
df['age_ratio'] = np.maximum(df['p1_age'], df['p2_age']) / np.minimum(df['p1_age'], df['p2_age'])

#1 if the pair has different genders, 0 otherwise (genders encoded numerically)
df['different_gender'] = (df['p1_gender'] - df['p2_gender']).abs()
Besides, we’ve focused on facial expression recognition before. It predicts the distribution of 7 different facial expressions (angry, disgust, fear, happy, sad, surprise, neutral). I will check whether the most dominant facial expressions of the two people are equal.
df['same_emotion'] = 0
df.loc[df['p1_dominant_emotion'] == df['p2_dominant_emotion'], 'same_emotion'] = 1
Final form of data set
I spent days expanding all of those features in the raw data set. You can skip the expanding steps and download the expanded data set in seconds here.
Train test split
The data set consists of 448K instances and 31 features, besides the image pair names and the is_related label. I will split the data set into a train set, a validation set and a cross validation set. The validation set will be used for early stopping to avoid overfitting. The cross validation set guarantees that the model does not overfit the validation set. The distribution will be 70% for the train set, 15% for the validation set and 15% for the cross validation set.
from sklearn.model_selection import train_test_split

x = df.drop(columns=['is_related'])
y = df['is_related']

#70% train, then split the remaining 30% in half: 15% validation, 15% cross validation
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.30)
x_test, x_cross_val, y_test, y_cross_val = train_test_split(x_test, y_test, test_size=0.50)
Model
I will feed the two distance metrics of the three different face recognition models – 6 features – besides the additional features to a GBM model. I expect GBM to build the ensemble model: in this way, it can find the best model, metric and threshold for each case. Herein, LightGBM was my choice because of its speed, since Kagglers need to model a problem again and again.
import lightgbm as lgb

train_data = lgb.Dataset(x_train, label=y_train)
test_data = lgb.Dataset(x_test, label=y_test)

params = {
	'boosting_type': 'gbdt',
	'objective': 'multiclass',
	'num_class': 2,
	'metric': 'multi_logloss',
	'learning_rate': 0.1,
	'num_leaves': 64,
	'verbose': 2
}

model = lgb.train(params, train_data,
	valid_sets=test_data,
	early_stopping_rounds=50,
	num_boost_round=500)
The validation loss decreased from 0.641635 to 0.555935. Model building stopped early at the 332nd round even though the number of boosting rounds was set to 500. The best loss was at the 282nd round.
Train set accuracy was 72.97%, validation set accuracy was 70.85% and cross validation set accuracy was 70.74%. They all seem consistent.
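As a sketch, these accuracies can be computed from the predicted class probabilities like this, reusing the split variables from above (since the objective is multiclass, predict returns one probability per class).

import numpy as np
from sklearn.metrics import accuracy_score

#take the most likely class for each instance, then compare against the labels
for name, features, labels in [("train", x_train, y_train),
		("validation", x_test, y_test),
		("cross validation", x_cross_val, y_cross_val)]:
	predicted_classes = np.argmax(model.predict(features), axis=1)
	print(name, "accuracy:", accuracy_score(labels, predicted_classes))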
Monitoring feature importance is important to trust that the model does not overfit.
import matplotlib.pyplot as plt

ax = lgb.plot_importance(model, max_num_features=10)
plt.show()
As I guessed, ratio based features stand atop the podium.
Submission set
We expanded the same features in the submission set.
predictions = model.predict(submission_x)

prediction_classes = []
for i in predictions:
	#prediction_classes.append(np.argmax(i)) #exact class of the prediction
	is_related = i[1] #probability of being related
	prediction_classes.append(is_related)

test_df['is_related'] = prediction_classes

result_set = test_df[['img_pair', 'is_related']]
result_set.to_csv("submission.csv", index=False)
Conclusion
So, this approach got 72.97% accuracy on the training set, 70.85% accuracy on the validation set and 70.74% accuracy on the cross validation set. Besides, it got 77.30% accuracy on the public submission set and 77.50% on the private submission set. We can clearly say that it is a robust model.
Even though more successful submissions and kernels exist, this novel approach covers several face recognition models, distances, ratios and some additional features. Besides, it uses GBM to build an ensemble model.
Initially, I just fed the cosine and Euclidean distances of the 3 face recognition models as features and got 60.40% accuracy. I was able to increase the accuracy from 60.40% to 77.30% by applying the road map mentioned in this post. An almost 17-point improvement is really amazing for a data science challenge. That is all based on hacking skills.
The source code and expanded data set of this approach are pushed to Kaggle this time. You can follow my Kaggle contributions here. There are many ways to support a project – starring the kernel is just one. I appreciate your support already.
Support this blog if you like it!