There are lots of face recognition models in the field. Some were designed by top universities and some were developed by tech giants. You might ask which one is the best. Choosing the right model can be confusing because no single model is absolutely better than the rest. In this post, we are going to apply ensemble learning to those popular face recognition models.
Face recognition models
The most popular face recognition models are VGG-Face, Google FaceNet, OpenFace and Facebook DeepFace. Luckily, the deepface framework for Python supports all of these face recognition models.
Remember how face recognition works: a model represents each face image as a multidimensional vector embedding, and the distance between two embeddings decides whether the images belong to the same person.
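The snippet below is a minimal sketch of that verification logic. The embeddings and the threshold here are made up for illustration; they are not taken from any real model.

import numpy as np

def find_cosine_distance(source, target):
    # cosine distance = 1 - cosine similarity of the two embeddings
    source, target = np.asarray(source), np.asarray(target)
    return 1 - np.dot(source, target) / (np.linalg.norm(source) * np.linalg.norm(target))

# hypothetical 128-dimensional embeddings of two face images
img1_embedding = np.random.rand(128)
img2_embedding = np.random.rand(128)

# verification: same person if the distance falls under a tuned threshold
threshold = 0.40  # illustrative value; each model and metric pair has its own tuned threshold
is_same_person = find_cosine_distance(img1_embedding, img2_embedding) <= threshold
print("same person:", is_same_person)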
Data set
This data set is used in the unit tests of the deepface framework. The following dictionary stores the identities.
identities = {
    "Angelina": ["img1.jpg", "img2.jpg", "img4.jpg", "img5.jpg", "img6.jpg", "img7.jpg", "img10.jpg", "img11.jpg"],
    "Scarlett": ["img8.jpg", "img9.jpg", "img47.jpg", "img48.jpg", "img49.jpg", "img50.jpg", "img51.jpg"],
    "Jennifer": ["img3.jpg", "img12.jpg", "img53.jpg", "img54.jpg", "img55.jpg", "img56.jpg"],
    "Mark": ["img13.jpg", "img14.jpg", "img15.jpg", "img57.jpg", "img58.jpg"],
    "Jack": ["img16.jpg", "img17.jpg", "img59.jpg", "img61.jpg", "img62.jpg"],
    "Elon": ["img18.jpg", "img19.jpg", "img67.jpg"],
    "Jeff": ["img20.jpg", "img21.jpg"],
    "Marissa": ["img22.jpg", "img23.jpg"],
    "Sundar": ["img24.jpg", "img25.jpg"],
    "Katy": ["img26.jpg", "img27.jpg", "img28.jpg", "img42.jpg", "img43.jpg", "img44.jpg", "img45.jpg", "img46.jpg"],
    "Matt": ["img29.jpg", "img30.jpg", "img31.jpg", "img32.jpg", "img33.jpg"],
    "Leonardo": ["img34.jpg", "img35.jpg", "img36.jpg", "img37.jpg"],
    "George": ["img38.jpg", "img39.jpg", "img40.jpg", "img41.jpg"]
}
Positive pairs
We can generate 140 image pairs belonging to the same identity.
import pandas as pd

positives = []
for key, values in identities.items():
    for i in range(0, len(values) - 1):
        for j in range(i + 1, len(values)):
            positive = []
            positive.append(values[i])
            positive.append(values[j])
            positives.append(positive)

positives = pd.DataFrame(positives, columns = ["file_x", "file_y"])
positives["decision"] = "Yes"
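As a sanity check, the number of positive pairs should equal the sum of 2-combinations over all identities:

from math import comb

# sum of C(n, 2) over the identities: 28 + 21 + 15 + ... = 140
expected_positives = sum(comb(len(values), 2) for values in identities.values())
assert expected_positives == positives.shape[0]  # 140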
Negative pairs
We can also generate pairs belonging to different identities. There are many more negative combinations than positive ones, so we should down-sample them to the number of positive instances to create a balanced data set.
import itertools

samples_list = list(identities.values())

negatives = []
for i in range(0, len(identities) - 1):
    for j in range(i + 1, len(identities)):
        cross_product = itertools.product(samples_list[i], samples_list[j])
        cross_product = list(cross_product)
        for cross_sample in cross_product:
            negative = []
            negative.append(cross_sample[0])
            negative.append(cross_sample[1])
            negatives.append(negative)

negatives = pd.DataFrame(negatives, columns = ["file_x", "file_y"])
negatives["decision"] = "No"

# down-sample negatives to match the number of positive pairs
negatives = negatives.sample(positives.shape[0])
Merging positive and negative pairs
We will create a single pandas data frame including positive and negative pairs.
df = pd.concat([positives, negatives]).reset_index(drop = True)
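A quick check confirms that the merged data frame is balanced:

# both classes should have 140 instances
print(df.decision.value_counts())
# Yes    140
# No     140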
Finding representations
We can pass a bulk list of image pairs to the verification function of deepface. The function expects the face recognition model name and the similarity metric as arguments. Herein, I would like to find the similarity score of each pair for every model and similarity metric candidate.
from deepface import DeepFace

instances = df[["file_x", "file_y"]].values.tolist()

models = ['VGG-Face', 'Facenet', 'OpenFace', 'DeepFace']
metrics = ['cosine', 'euclidean', 'euclidean_l2']

for model in models:
    for metric in metrics:
        resp_obj = DeepFace.verify(instances, model_name = model, distance_metric = metric)

        distances = []
        for i in range(0, len(instances)):
            distance = resp_obj["pair_%s" % (i + 1)]["distance"]
            distances.append(distance)

        # store the distances of this model/metric pair as a new feature column
        df['%s_%s' % (model, metric)] = distances
This block completed in 1.5 hours on my computer. I stored the final form of the data frame as a csv file in the deepface repo. You can load that data frame directly to skip the representation-finding stage, as shown below.
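Loading the stored data frame is a one-liner. Note that the file name below is an assumption based on where the csv lived at the time of writing; check the deepface repo for its current name and location.

# assumed file name; see the deepface repo for the current location
df = pd.read_csv("face-recognition-pivot.csv")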
Distributions
Each face recognition model and similarity metric candidate builds a binary classification problem. It is interesting to monitor the distance distributions for the positive and negative classes.
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(15, 15))

figure_idx = 1
for model in models:
    for metric in metrics:
        feature = '%s_%s' % (model, metric)

        ax1 = fig.add_subplot(4, 3, figure_idx)

        # kernel density estimates of the distance for both decision classes
        df[df.decision == "Yes"][feature].plot(kind='kde', title = feature, label = 'Yes', legend = True)
        df[df.decision == "No"][feature].plot(kind='kde', title = feature, label = 'No', legend = True)

        figure_idx = figure_idx + 1

plt.show()
We’ve discussed which face recognition model and distance metric pair is the best in this video.
The distribution graphs clearly show which models perform better than others. FaceNet seems to be the most successful model, and VGG-Face comes after it.
Ensemble learning
Even though OpenFace and DeepFace seem to offer lower accuracy than FaceNet and VGG-Face, they might still make better predictions for some specific pairs. The idea behind ensemble learning is to find out which model is better for which features.
I am going to build a LightGBM model. The diagram of the ensemble method is illustrated below.
Pre-processing
The image pair paths will not mean anything anymore, so I will drop those columns. Moreover, LightGBM expects nominal columns to be transformed to numerical ones. The target decision column consists of yes and no classes; I will transform those classes to numeric values as well.
df = df.drop(columns=["file_x", "file_y"])

df.loc[df[df.decision == 'Yes'].index, 'decision'] = 1
df.loc[df[df.decision == 'No'].index, 'decision'] = 0

# cast the target to a numeric type for LightGBM and sklearn
df['decision'] = df['decision'].astype(int)
Train test split
We have 140 positive and 140 negative pairs in the prepared data set. I will split it in half: the first half will be used for training and the second half for testing.
from sklearn.model_selection import train_test_split

df_train, df_test = train_test_split(df, test_size=0.50, random_state=34)

target_name = "decision"

y_train = df_train[target_name].values
x_train = df_train.drop(columns=[target_name]).values

y_test = df_test[target_name].values
x_test = df_test.drop(columns=[target_name]).values
LightGBM
We’ve split the data set while it was still in pandas data frame format. Let’s convert the parts to LightGBM’s data set format.
import lightgbm as lgb

features = df.drop(columns=["decision"]).columns.tolist()

lgb_train = lgb.Dataset(x_train, y_train, feature_name = features)
lgb_test = lgb.Dataset(x_test, y_test, feature_name = features)
Training
Now, it is time to train the model. We will build up to 250 boosted trees and stop early if the validation loss does not improve for 15 rounds.
params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'multiclass',
    'num_class': 2,
    'metric': 'multi_logloss'
}

gbm = lgb.train(params, lgb_train,
                num_boost_round=250, early_stopping_rounds=15,
                valid_sets=[lgb_test])
My experiment terminated at the 56th iteration, and the validation loss decreased satisfactorily.
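With early stopping enabled, the best iteration can be read back from the trained booster:

# the iteration where the validation loss stopped improving
print("best iteration:", gbm.best_iteration)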
Evaluation
This is a binary classification problem, and the built model predicts two classes: no and yes. Each prediction is a pair of class probabilities: if the first item is greater, the prediction is no; if the second item is greater, it is yes. We can store the index of the greater probability as the predicted class, as illustrated below.
import numpy as np

predictions = gbm.predict(x_test)

prediction_classes = []
for prediction in predictions:
    # index of the larger probability: 0 means no, 1 means yes
    prediction_class = np.argmax(prediction)
    prediction_classes.append(prediction_class)
Accuracy alone might mislead us. That’s why we will also calculate additional metrics such as precision, recall and F1 score.
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, prediction_classes)
tn, fp, fn, tp = cm.ravel()

recall = tp / (tp + fn)
precision = tp / (tp + fp)
accuracy = (tp + tn) / (tn + fp + fn + tp)
f1 = 2 * (precision * recall) / (precision + recall)
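Printing those metrics in a readable form:

print("Precision: %.2f%%" % (100 * precision))
print("Recall: %.2f%%" % (100 * recall))
print("F1 score: %.2f%%" % (100 * f1))
print("Accuracy: %.2f%%" % (100 * accuracy))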
The built model has the following accuracy metrics:
Precision: 98.61%
Recall: 98.61%
F1 score: 98.61%
Accuracy: 98.57%
Accuracy, precision, recall and F1 score are all consistent, which shows the robustness of the model. Facebook researchers reported that human beings achieve 97.53% accuracy on face verification. It seems that this ensemble method passed human-level performance!
BTW, Facebook researchers tested the DeepFace model directly on the Labeled Faces in the Wild (LFW) data set, whereas I tested this ensemble method on a different and much smaller set. That’s why comparing the two studies would not be entirely fair. Still, the accuracy level we got is very satisfactory.
Interpretability
Remember that the single OpenFace and DeepFace models offer lower accuracy than FaceNet and VGG-Face. Herein, however, an OpenFace feature ranks at the top of the feature importance chart, and a DeepFace feature appears among the three most significant features as well. This explains how the ensemble method passed human-level performance.
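The feature importance ranking mentioned above can be reproduced with LightGBM’s built-in plotting helper:

# plot the most significant features of the trained booster
ax = lgb.plot_importance(gbm, max_num_features=12)
plt.show()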
ROC Curve
Plotting the ROC curve is a common way to understand how accurate a built model is.
from sklearn import metrics

# probability of the positive (yes) class
y_pred_proba = predictions[:, 1]

fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba)
auc_score = metrics.roc_auc_score(y_test, y_pred_proba)

plt.plot(fpr, tpr, label="data 1, auc=" + str(auc_score))
plt.legend()
plt.show()
The ROC curve looks like the letter L rotated 90 degrees. This indicates an almost perfect model.
Visualizing built trees
As I mentioned, training terminated at the 56th iteration. Since there are two classes in the target column, LightGBM builds one tree per class per round, which makes 2 x 56 = 112 boosted trees in total. We can visualize the built trees as shown below.
for i in range(0, gbm.num_trees()):
    ax = lgb.plot_tree(gbm, tree_index = i)
    plt.show()
Early trees are similar to the following illustration.
Transfer learning
I stored the built model in the deepface repo as well to make it reusable. You can load it and make predictions as illustrated below.
deepface_ensemble = lgb.Booster(model_file = 'face-recognition-ensemble-model.txt')

# bulk predictions
bulk_predictions = deepface_ensemble.predict(x_test)

# single prediction: drop the target column and add a batch dimension
idx = 0
single_prediction = deepface_ensemble.predict(np.expand_dims(df.drop(columns=["decision"]).iloc[idx].values, axis=0))
Python package and out-of-the-box function
Even though we used the deepface framework for Python to find representations and distances for the custom models above, the ensemble method is also adapted into deepface as an out-of-the-box function. Currently, the verification and finding functions support ensemble learning. All you need to do is pass the ensemble value to the model name argument of both verify and find functions, as shown below.
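The following calls reflect the deepface API at the time of writing; the Ensemble option may have changed or been removed in later releases, and my_db is a hypothetical folder of facial images.

from deepface import DeepFace

# verification with the ensemble instead of a single model
resp_obj = DeepFace.verify("img1.jpg", "img2.jpg", model_name = "Ensemble")

# face search in a database folder with the ensemble
df = DeepFace.find(img_path = "img1.jpg", db_path = "my_db", model_name = "Ensemble")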
This offers a huge improvement in accuracy, precision and recall, but it runs much slower than a single model.
Conclusion
So, we’ve covered how to apply ensemble learning to face recognition. It seems that the ensemble method shows a higher accuracy than human beings. This might be a revolution for the AI age!
Finally, I pushed the source code of this study as a notebook to GitHub.