A Gentle Introduction to Face Recognition in Deep Learning

Modern face recognition pipelines consist of four common stages: detection, alignment, representation and verification. These stages might be confusing for beginners. In this post, we take a step back and walk through a face recognition pipeline conceptually. You should follow the links to dive deeper into each concept.

[Figure: Nosedive in Black Mirror]

Vlog

The following video covers a hands-on face recognition workshop from scratch in Python.


🙋‍♂️ You may consider enrolling in my top-rated machine learning course on Udemy

Decision Trees for Machine Learning

DeepFace

We will use the deepface framework for Python in this post. You can install it by running the following command.

!pip install deepface

Stage 1 and 2: Detection and Alignment

There are several face detection solutions. OpenCV offers haar cascade and Single Shot Multibox Detector (SSD); Dlib offers Histogram of Oriented Gradients (HOG) and a CNN-based Max-Margin Object Detection (MMOD); finally, Multi-task Cascaded Convolutional Networks (MTCNN) is another common solution for face detection.
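For instance, here is a minimal sketch of running OpenCV’s haar cascade detector on its own. The image path img1.jpg is just a placeholder, and it assumes the opencv-python package is installed.

import cv2

# load the pre-trained frontal face haar cascade shipped with OpenCV
detector = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

img = cv2.imread("img1.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# returns a list of (x, y, w, h) bounding boxes for detected faces
faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

for (x, y, w, h) in faces:
    detected_face = img[y:y+h, x:x+w] # crop the detected face area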

Here, you can watch how to use different face detectors in Python.

Alignment is easy if the face and eyes have already been detected. Experiments show that applying face alignment increases model accuracy by more than 1%. Unfortunately, neither OpenCV nor dlib offers face alignment as an out-of-the-box function. We have to do some trigonometry here to align faces.
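As an illustration, the following is a rough sketch of that trigonometry, assuming the left_eye and right_eye coordinates come from a landmark-aware detector. It rotates the image so that the line between the eyes becomes horizontal.

import numpy as np
from PIL import Image

def align_face(img, left_eye, right_eye):
    # angle between the eye line and the horizontal axis (image y grows downwards)
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    angle = np.degrees(np.arctan2(dy, dx))
    # rotate counter-clockwise by that angle so the eyes become level
    return np.array(Image.fromarray(img).rotate(angle))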

[Figure: Detect and align]

You can find out the math behind face alignment in the following video.

Here, RetinaFace is the cutting-edge face detection technology. It can even detect faces in a crowd, and it finds facial landmarks including eye coordinates. That’s why its alignment score is very high.

Herein, deepface offers both face detection and face alignment as a single function. It wraps OpenCV’s haar cascade, SSD, dlib HoG, MTCNN and RetinaFace. It also does the trigonometry required to align faces. You just need to pass the path of the image. If you do not set the detector_backend argument, it will use its default configuration, OpenCV’s haar cascade.

import numpy as np
from deepface import DeepFace
from deepface.commons import functions

model_name = "VGG-Face"

# input shape expected by the chosen model, e.g. (224, 224) for VGG-Face
target_size = functions.find_target_size(model_name = model_name)

# detect and align; extract_faces returns a list of dictionaries, one per detected face
img1_objs = DeepFace.extract_faces(img_path = "img1.jpg", target_size=target_size, detector_backend ="mtcnn")
img2_objs = DeepFace.extract_faces(img_path = "img2.jpg", target_size=target_size, detector_backend = "mtcnn")

img1 = img1_objs[0]["face"]
img2 = img2_objs[0]["face"]

img1 = np.expand_dims(img1, axis=0) #(224, 224, 3) to (1, 224, 224, 3)
img2 = np.expand_dims(img2, axis=0) #(224, 224, 3) to (1, 224, 224, 3)

Stage 2.5: Normalization

Face detectors detect faces in a rectangular area, so detected faces come with some noise such as background pixels. Here, dlib can find 68 facial landmarks. We can extract the exact facial area and get rid of that noise in this way. This optional step is called normalization in facial recognition.
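As a sketch of this idea, the following finds the 68 landmarks with dlib and keeps only the pixels inside their convex hull. It assumes the pre-trained shape_predictor_68_face_landmarks.dat file has been downloaded from dlib.

import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

img = cv2.imread("img1.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

faces = detector(gray)
if len(faces) > 0:
    landmarks = predictor(gray, faces[0])
    points = np.array([[p.x, p.y] for p in landmarks.parts()], dtype=np.int32)

    # mask out everything outside the convex hull of the 68 landmarks
    mask = np.zeros(img.shape[:2], dtype=np.uint8)
    cv2.fillConvexPoly(mask, cv2.convexHull(points), 255)
    normalized = cv2.bitwise_and(img, img, mask=mask)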





In addition, MediaPipe can find 468 landmarks. Please see its real time implementation in the following video. Recommended tutorials: Deep Face Detection with MediaPipe, Zoom Style Virtual Background Setup with MediaPipe.

Stage 3: Representation

Deep learning only appears in this representation stage. We will feed face images to a convolutional neural network model, but the task here is not classification. We will use CNN models to find embeddings, similar to autoencoders.

[Figure: VGG-Face model]

The most popular face recognition models are VGG-Face, Google FaceNet, OpenFace and Facebook DeepFace. Luckily, these models are all provided by the deepface framework for Python as well. You can build them as illustrated below.

model_name = "VGG-Face"
model = DeepFace.build_model(model_name = model_name)

These models have different input and output shapes. For example, VGG-Face expects (224, 224, 3) shaped inputs and returns a 2622-dimensional vector as output. On the other hand, Google FaceNet expects (160, 160, 3) shaped inputs and returns a 128-dimensional vector. Notice that we had to pass the input shape to the extract_faces function in the detection and alignment stage. We can get the input shape expected by the built model as shown below, so you must retrieve the target size before calling extract_faces.

from deepface.commons import functions
model_name = "VGG-Face"
target_size = functions.find_target_size(model_name = model_name)

Question: how were those models trained?

These face recognition models were originally built to classify the identities of face images on a large-scale data set. Consider a data set containing 1M images of 1,000 unique people. The output layer of the CNN model would have 1,000 nodes in this case, and the model is trained to find the identities of the fed images. When training is over, the output layer is dropped and the layer just before it becomes the new output layer. Now, the new model will not classify identities but will return representations of faces. We can then feed new images that do not appear in the training data set, and the model will still find representations.

Dashed lines in the final layer mean exactly this in the Facebook DeepFace architecture.

[Figure: Facebook DeepFace architecture]

This concept is called Siamese networks in the literature.
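The following is a conceptual sketch of that trick in Keras, not the actual training code of these models. The classifier file name is hypothetical; the point is that cutting a trained identity classifier just before its softmax layer turns it into an embedding model.

from tensorflow import keras

# hypothetical CNN previously trained to classify 1,000 identities
classifier = keras.models.load_model("identity_classifier.h5")

# drop the 1,000-node output layer; the layer just before it becomes the new output
embedding_model = keras.Model(inputs=classifier.input, outputs=classifier.layers[-2].output)

# any face image, even of an unseen person, can now be represented as a vector
# embedding = embedding_model.predict(face_img)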

Representations

We’ve detected and aligned face images and fed them to a face recognition model in the previous steps. Now, we have a vector representation for each image. This is an abstract concept. To make it concrete, I will visualize it.

I will transform the 1D vectors into 2D matrices by repeating each vector. In this way, every row of the matrix will carry the same information.

# 2622-dimensional vectors for VGG-Face
img1_representation = model.predict(img1)[0].tolist()
img2_representation = model.predict(img2)[0].tolist()

img1_graph = []; img2_graph = []

# repeat each representation 200 times to build a 200 x 2622 matrix for visualization
for i in range(0, 200):
    img1_graph.append(img1_representation)
    img2_graph.append(img2_representation)

img1_graph = np.array(img1_graph)
img2_graph = np.array(img2_graph)

This is similar to legacy barcodes, which just store data horizontally. If you damage a barcode horizontally, you can still read its data; however, vertical damage causes data loss.





[Figure: Barcode]

To visualize the representations, the following code block will help us.

import matplotlib.pyplot as plt

fig = plt.figure()

# first input image and its representation
ax1 = fig.add_subplot(3,2,1)
plt.imshow(img1[0][:,:,::-1])
plt.axis('off')

ax2 = fig.add_subplot(3,2,2)
im = plt.imshow(img1_graph, interpolation='nearest', cmap=plt.cm.ocean)
plt.colorbar()

# second input image and its representation
ax3 = fig.add_subplot(3,2,3)
plt.imshow(img2[0][:,:,::-1])
plt.axis('off')

ax4 = fig.add_subplot(3,2,4)
im = plt.imshow(img2_graph, interpolation='nearest', cmap=plt.cm.ocean)
plt.colorbar()

plt.show()

The VGG-Face representation has 2622 slots horizontally. Each slot is represented with a different color, and the meaning of the colors is explained in the colorbar on the right.

[Figure: VGG-Face representation]

If we set Google FaceNet as the face recognition model, then the representation will have a different shape and content: it has 128 dimensions.

[Figure: Google FaceNet representation]

So, we will decide whether these two images belong to the same person based on those vector representations instead of the face images themselves.

Question: which single face recognition model is the best?

We could use VGG-Face, FaceNet, OpenFace or DeepFace to find representations of faces. They are all state-of-the-art face recognition models. Some are designed by tech giants such as Google and Facebook, whereas others are designed by top universities such as the University of Oxford or Carnegie Mellon University. So, which single model performs better than the others? Let’s have a short discussion about this topic.

Stage 4: Verification

We will compare the vector representations of the images. The easiest way to compare two vectors is to find the euclidean distance between them. We all remember it from the Pythagorean theorem in high school. However, that was a 2-dimensional equation; here we have an n-dimensional vector as a representation.

[Figure: Euclidean distance (from dataaspirant)]

To adapt the Pythagorean theorem to n-dimensional space, we find the squared difference of each slot value in our two representations. This new vector is the distance vector. The square root of the sum of its slots will be the distance.

# representations were stored as python lists; convert back to numpy arrays to subtract
distance_vector = np.square(np.array(img1_representation) - np.array(img2_representation))
distance = np.sqrt(distance_vector.sum())

We can visualize the distance vector as well.

distance_graph = []
for i in range(0, 200):
    distance_graph.append(distance_vector)
distance_graph = np.array(distance_graph)

ax6 = fig.add_subplot(3,2,6)
im = plt.imshow(distance_graph, interpolation='nearest', cmap=plt.cm.ocean)
plt.colorbar()

The distance vector appears in the 3rd row. As seen, its slots are mostly green. Notice that green represents values close to 0.

[Figure: True positive example]

Let’s look at a pair of different people.





[Figure: False positive example]

Decision

We know that the distance would be 0 if we fed the same image twice, because the representations would be identical and the difference at each slot would be 0 as well.

Besides, we see that the distance value is smaller when we feed two images of the same person and larger when we feed images of different people. So, we will check whether the distance value is smaller than a threshold value.

Threshold

However, what threshold value determines that the distance is small enough to classify a pair as the same person?

This is a very deep topic as well. Here, you can find a detailed post about determining the threshold in a face recognition pipeline. Besides, the following vlog covers how to fine-tune the threshold value in a face recognition pipeline.

To sum up, the decision rule based on the euclidean distance value for the VGG-Face model is shown below.

if distance <= 0.55:
    return True
else:
    return False

The threshold should differ for different face recognition models. My experiments show that thresholds should be tuned as demonstrated below.

def findThreshold(model_name):
    threshold = 0
    if model_name == 'VGG-Face':
        threshold = 0.55
    elif model_name == 'OpenFace':
        threshold = 0.55
    elif model_name == 'Facenet':
        threshold = 10
    elif model_name == 'DeepFace':
        threshold = 64
    return threshold

BTW, we can use cosine similarity to compare vectors as well. You can see the threshold values for cosine similarity here.
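As a minimal sketch, cosine similarity and its distance can be found from the same representations as follows.

a = np.array(img1_representation)
b = np.array(img2_representation)

cosine_similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
cosine_distance = 1 - cosine_similarity # smaller distance means more similar faces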

Testing

I’ve applied the Facebook DeepFace model in real time in the following video. The results are satisfactory for both accuracy and speed, aren’t they?

Namesakes

As seen, face recognition is mainly based on comparing two images. We do not train a CNN model with multiple photos of each identity; we just feed a single image. That’s why this concept is also called one-shot learning in the literature. Besides, some sources refer to this technology as face verification instead of face recognition; the name obviously comes from verifying faces.

Deepface itself

DeepFace handles all of the pipeline stages mentioned in this post in the background as well. You can run face recognition tests with a few lines of code. We’ve just focused on the individual pipeline stages to understand how a face recognition system works.
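For instance, a single verify call runs detection, alignment, representation and verification end to end. The exact response fields may differ slightly between deepface versions, but it roughly looks as follows.

from deepface import DeepFace

result = DeepFace.verify(img1_path = "img1.jpg", img2_path = "img2.jpg", model_name = "VGG-Face", detector_backend = "mtcnn")

# returns a dictionary including a boolean verified flag and the found distance
print(result["verified"], result["distance"])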





Large scale face recognition

In this post, we’ve actually described how to apply face verification. Face verification has O(1) complexity in big O notation. Face recognition, on the other hand, requires finding a face in a data set, which becomes O(n) in big O notation where n is the number of instances in your data set.
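For the recognition case, deepface offers a find function that searches a target image in a folder of facial images. The folder and image names below are placeholders; the function returns the closest matches as a pandas data frame (a list of data frames in recent versions).

from deepface import DeepFace

# builds and stores representations of the whole folder on the first call
dfs = DeepFace.find(img_path = "target.jpg", db_path = "my_facial_db", model_name = "VGG-Face")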

We can use a clever trick to speed up large scale face recognition dramatically.

Notice that face recognition has O(n) time complexity, and this might be problematic for data at the millions or billions scale. Herein, approximate nearest neighbor (a-nn) algorithms reduce the time complexity dramatically. Spotify Annoy, Facebook Faiss and NMSLIB are amazing a-nn libraries. Besides, Elasticsearch wraps NMSLIB and comes with high scalability. You should run deepface with one of those a-nn libraries if you have a really large-scale database.
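Here is a rough sketch of the a-nn idea with Spotify Annoy. The embeddings list and target_embedding below are assumed to hold facial vectors already found with deepface; the index then answers nearest neighbor queries in sub-linear time.

from annoy import AnnoyIndex

dimensions = 2622 # e.g. VGG-Face embedding size
index = AnnoyIndex(dimensions, "euclidean")

for i, embedding in enumerate(embeddings):
    index.add_item(i, embedding)

index.build(10) # 10 trees

# indices of the 3 approximate nearest neighbors of a target face
neighbors = index.get_nns_by_vector(target_embedding, 3)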

Real time face recognition

Besides, we can run face recognition tasks in real time as well.
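A minimal sketch with deepface’s stream function is shown below; it is expected to open your webcam and look up detected faces in the given folder, though its arguments may vary between versions. The folder name is a placeholder.

from deepface import DeepFace

DeepFace.stream(db_path = "my_facial_db")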

Ensemble method

We’ve mentioned just a single face recognition model so far. On the other hand, there are several state-of-the-art models: VGG-Face, Google FaceNet, OpenFace, Facebook DeepFace and DeepID. Even though all of those models perform well, no single model is absolutely better than the others. Still, we can apply an ensemble method to build a grandmaster model. In this approach, we feed the predictions of those models to a boosting model. Accuracy metrics including precision, recall and F1 score increase dramatically with the ensemble method, at the cost of longer running time.
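The following is only a conceptual sketch of such an ensemble with scikit-learn. X_train, y_train and X_test are assumed to be prepared beforehand: one row per image pair holding the distances found by VGG-Face, FaceNet, OpenFace and DeepFace, with the label 1 for same-person pairs and 0 for different ones.

from sklearn.ensemble import GradientBoostingClassifier

# boosting model on top of the distances of the base face recognition models
boosted_model = GradientBoostingClassifier()
boosted_model.fit(X_train, y_train)

# the grandmaster model now decides based on all base models at once
predictions = boosted_model.predict(X_test)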

Tech Stack Recommendations

Face recognition is mainly based on representing facial images as vectors. Herein, storing the vector representations is a key factor for building robust facial recognition systems. I summarize the tech stack recommendations in the following video.

Conclusion

So, we have covered how face recognition works and the common stages of a face recognition pipeline. We have used pre-built models provided by the deepface framework. I strongly recommend that you follow the links to understand the concepts well.

I pushed the source code of this blog post to GitHub. You can support this study by starring the GitHub repo as well.



