A Gentle Introduction to Face Recognition in Deep Learning

Modern face recognition pipelines consist of four common stages: detection, alignment, representation and verification. These stages might be confusing for beginners. In this post, we take a step back and walk through a face recognition pipeline conceptually. You should follow the links to dive deeper into each concept.

[Figure: Nosedive in Black Mirror]

Vlog

The following video covers a hands-on face recognition workshop from scratch in Python.


🙋‍♂️ You may consider enrolling in my top-rated machine learning course on Udemy

Decision Trees for Machine Learning

DeepFace

We will use the deepface framework for Python in this post. You can install it by running the following command.

!pip install deepface

Stage 1 and 2: Detection and Alignment

There are several face detection solutions. OpenCV offers haar cascade and Single Shot Multibox Detector (SSD); Dlib offers Histogram of Oriented Gradients (HOG) and a CNN-based Max-Margin Object Detection (MMOD); finally, Multi-task Cascaded Convolutional Networks (MTCNN) is another common solution for face detection.
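For instance, here is a minimal sketch of running OpenCV’s haar cascade detector on its own. The image path img1.jpg is just a placeholder, and it assumes the opencv-python package is installed.

import cv2

# load the pre-trained frontal face haar cascade shipped with OpenCV
detector = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

img = cv2.imread("img1.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# returns a list of (x, y, w, h) bounding boxes for detected faces
faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

for (x, y, w, h) in faces:
    detected_face = img[y:y+h, x:x+w] # crop the detected face area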

Here, you can watch how to use different face detectors in Python.

Alignment is easy if the face and eyes have already been detected. Experiments show that applying face alignment increases model accuracy by more than 1%. Unfortunately, neither OpenCV nor dlib offers face alignment as an out-of-the-box function. We have to do some trigonometry here to align faces.
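As an illustration, the following is a rough sketch of that trigonometry, assuming the left_eye and right_eye coordinates come from a landmark-aware detector. It rotates the image so that the line between the eyes becomes horizontal.

import numpy as np
from PIL import Image

def align_face(img, left_eye, right_eye):
    # angle between the eye line and the horizontal axis (image y grows downwards)
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    angle = np.degrees(np.arctan2(dy, dx))
    # rotate counter-clockwise by that angle so the eyes become level
    return np.array(Image.fromarray(img).rotate(angle))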

[Figure: Detect and align]

You can find out the math behind face alignment in the following video.

Here, RetinaFace is the cutting-edge face detection technology. It can even detect faces in a crowd, and it finds facial landmarks including eye coordinates. That’s why its alignment score is very high.

Herein, deepface offers both face detection and face alignment as a single function. It wraps OpenCV’s haar cascade, SSD, dlib HoG, MTCNN and RetinaFace. It also does the trigonometry required to align faces. You just need to pass the path of the image. If you do not set the detector_backend argument, it will use its default configuration, OpenCV’s haar cascade.

import numpy as np
from deepface import DeepFace
from deepface.commons import functions

model_name = "VGG-Face"

# input shape expected by the chosen model, e.g. (224, 224) for VGG-Face
target_size = functions.find_target_size(model_name = model_name)

# detect and align; extract_faces returns a list of dictionaries, one per detected face
img1_objs = DeepFace.extract_faces(img_path = "img1.jpg", target_size=target_size, detector_backend ="mtcnn")
img2_objs = DeepFace.extract_faces(img_path = "img2.jpg", target_size=target_size, detector_backend = "mtcnn")

img1 = img1_objs[0]["face"]
img2 = img2_objs[0]["face"]

img1 = np.expand_dims(img1, axis=0) #(224, 224, 3) to (1, 224, 224, 3)
img2 = np.expand_dims(img2, axis=0) #(224, 224, 3) to (1, 224, 224, 3)

Stage 2.5: Normalization

Face detectors detect faces in a rectangular area, so detected faces come with some noise such as background pixels. Here, dlib can find 68 facial landmarks. We can extract the exact facial area and get rid of that noise in this way. This optional step is called normalization in facial recognition.
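As a sketch of this idea, the following finds the 68 landmarks with dlib and keeps only the pixels inside their convex hull. It assumes the pre-trained shape_predictor_68_face_landmarks.dat file has been downloaded from dlib.

import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

img = cv2.imread("img1.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

faces = detector(gray)
if len(faces) > 0:
    landmarks = predictor(gray, faces[0])
    points = np.array([[p.x, p.y] for p in landmarks.parts()], dtype=np.int32)

    # mask out everything outside the convex hull of the 68 landmarks
    mask = np.zeros(img.shape[:2], dtype=np.uint8)
    cv2.fillConvexPoly(mask, cv2.convexHull(points), 255)
    normalized = cv2.bitwise_and(img, img, mask=mask)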





In addition, MediaPipe can find 468 landmarks. Please see its real time implementation in the following video. Recommended tutorials: Deep Face Detection with MediaPipe, Zoom Style Virtual Background Setup with MediaPipe.

Stage 3: Representation

Deep learning only appears in this representation stage. We will feed face images to a convolutional neural network model, but the task here is not classification. We will use CNN models to find embeddings, similar to autoencoders.

[Figure: VGG-Face model]

The most popular face recognition models are VGG-Face, Google FaceNet, OpenFace and Facebook DeepFace. Luckily, these models are all provided by the deepface framework for Python as well. You can build them as illustrated below.

model_name = "VGG-Face"
model = DeepFace.build_model(model_name = model_name)

These models have different input and output shapes. For example, VGG-Face expects (224, 224, 3) shaped inputs and returns a 2622-dimensional vector as output. On the other hand, Google FaceNet expects (160, 160, 3) shaped inputs and returns a 128-dimensional vector. Notice that we had to pass the input shape to the extract_faces function in the detection and alignment stage. We can get the input shape expected by the built model as shown below, so you must retrieve the target size before calling extract_faces.

from deepface.commons import functions
model_name = "VGG-Face"
target_size = functions.find_target_size(model_name = model_name)

Question: how were those models trained?

These face recognition models were originally built to classify the identities of face images on a large-scale data set. Consider a data set containing 1M images of 1,000 unique people. The output layer of the CNN model would have 1,000 nodes in this case, and the model is trained to find the identities of the fed images. When training is over, the output layer is dropped and the layer just before it becomes the new output layer. Now, the new model will not classify identities but will return representations of faces. We can then feed new images that do not appear in the training data set, and the model will still find representations.

Dashed lines in the final layer mean exactly this in the Facebook DeepFace architecture.

[Figure: Facebook DeepFace architecture]

This concept is called Siamese networks in the literature.
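The following is a conceptual sketch of that trick in Keras, not the actual training code of these models. The classifier file name is hypothetical; the point is that cutting a trained identity classifier just before its softmax layer turns it into an embedding model.

from tensorflow import keras

# hypothetical CNN previously trained to classify 1,000 identities
classifier = keras.models.load_model("identity_classifier.h5")

# drop the 1,000-node output layer; the layer just before it becomes the new output
embedding_model = keras.Model(inputs=classifier.input, outputs=classifier.layers[-2].output)

# any face image, even of an unseen person, can now be represented as a vector
# embedding = embedding_model.predict(face_img)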

Representations

We’ve detected and aligned face images and fed them to a face recognition model in the previous steps. Now, we have a vector representation for each image. This is an abstract concept. To make it concrete, I will visualize it.

I will transform the 1D vectors into 2D matrices by repeating each vector. In this way, every row of the matrix will carry the same information.

# 2622-dimensional vectors for VGG-Face
img1_representation = model.predict(img1)[0].tolist()
img2_representation = model.predict(img2)[0].tolist()

img1_graph = []; img2_graph = []

# repeat each representation 200 times to build a 200 x 2622 matrix for visualization
for i in range(0, 200):
    img1_graph.append(img1_representation)
    img2_graph.append(img2_representation)

img1_graph = np.array(img1_graph)
img2_graph = np.array(img2_graph)

This is similar to legacy barcodes, which just store data horizontally. If you damage a barcode horizontally, you can still read its data; however, vertical damage causes data loss.





[Figure: Barcode]

To visualize the representations, the following code block will help us.

import matplotlib.pyplot as plt

fig = plt.figure()

# first input image and its representation
ax1 = fig.add_subplot(3,2,1)
plt.imshow(img1[0][:,:,::-1])
plt.axis('off')

ax2 = fig.add_subplot(3,2,2)
im = plt.imshow(img1_graph, interpolation='nearest', cmap=plt.cm.ocean)
plt.colorbar()

# second input image and its representation
ax3 = fig.add_subplot(3,2,3)
plt.imshow(img2[0][:,:,::-1])
plt.axis('off')

ax4 = fig.add_subplot(3,2,4)
im = plt.imshow(img2_graph, interpolation='nearest', cmap=plt.cm.ocean)
plt.colorbar()

plt.show()

The VGG-Face representation has 2622 slots horizontally. Each slot is represented with a different color, and the meaning of the colors is explained in the colorbar on the right.

[Figure: VGG-Face representation]

If we set Google FaceNet as the face recognition model, then the representation will have a different shape and content: it has 128 dimensions.

[Figure: Google FaceNet representation]

So, we will decide whether these two images belong to the same person based on those vector representations instead of the face images themselves.

Question: which single face recognition model is the best?

We could use VGG-Face, FaceNet, OpenFace or DeepFace to find representations of faces. They are all state-of-the-art face recognition models. Some are designed by tech giants such as Google and Facebook, whereas others are designed by top universities such as the University of Oxford or Carnegie Mellon University. So, which single model performs better than the others? Let’s have a short discussion about this topic.

Stage 4: Verification

We will compare the vector representations of the images. The easiest way to compare two vectors is to find the euclidean distance between them. We all remember it from the Pythagorean theorem in high school. However, that was a 2-dimensional equation; here we have an n-dimensional vector as a representation.

[Figure: Euclidean distance (from dataaspirant)]

To adapt the Pythagorean theorem to n-dimensional space, we find the squared difference of each slot value in our two representations. This new vector is the distance vector. The square root of the sum of its slots will be the distance.

# representations were stored as python lists; convert back to numpy arrays to subtract
distance_vector = np.square(np.array(img1_representation) - np.array(img2_representation))
distance = np.sqrt(distance_vector.sum())

We can visualize the distance vector as well.

distance_graph = []
for i in range(0, 200):
    distance_graph.append(distance_vector)
distance_graph = np.array(distance_graph)

ax6 = fig.add_subplot(3,2,6)
im = plt.imshow(distance_graph, interpolation='nearest', cmap=plt.cm.ocean)
plt.colorbar()

The distance vector appears in the 3rd row. As seen, its slots are mostly green. Notice that green represents values close to 0.

[Figure: True positive example]

Let’s look at a pair of different people.





[Figure: False positive example]

Decision

We know that the distance would be 0 if we fed the same image twice, because the representations would be identical and the difference at each slot would be 0 as well.

Besides, we see that the distance value is smaller when we feed two images of the same person and larger when we feed images of different people. So, we will check whether the distance value is smaller than a threshold value.

Threshold

However, what threshold value determines that the distance is small enough to classify a pair as the same person?

This is a very deep topic as well. Here, you can find a detailed post about determining the threshold in a face recognition pipeline. Besides, the following vlog covers how to fine-tune the threshold value in a face recognition pipeline.

To sum up, the decision rule based on the euclidean distance value for the VGG-Face model is shown below.

if distance <= 0.55:
    return True
else:
    return False

The threshold should differ for different face recognition models. My experiments show that thresholds should be tuned as demonstrated below.

def findThreshold(model_name):
    threshold = 0
    if model_name == 'VGG-Face':
        threshold = 0.55
    elif model_name == 'OpenFace':
        threshold = 0.55
    elif model_name == 'Facenet':
        threshold = 10
    elif model_name == 'DeepFace':
        threshold = 64
    return threshold

BTW, we can use cosine similarity to compare vectors as well. You can see the threshold values for cosine similarity here.
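As a minimal sketch, cosine similarity and its distance can be found from the same representations as follows.

a = np.array(img1_representation)
b = np.array(img2_representation)

cosine_similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
cosine_distance = 1 - cosine_similarity # smaller distance means more similar faces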

Testing

I’ve applied the Facebook DeepFace model in real time in the following video. The results are satisfactory for both accuracy and speed, aren’t they?

Namesakes

As seen, face recognition is mainly based on comparing two images. We do not train a CNN model with multiple photos of each identity; we just feed a single image. That’s why this concept is also called one-shot learning in the literature. Besides, some sources refer to this technology as face verification instead of face recognition; the name obviously comes from verifying faces.

Deepface itself

DeepFace handles all of the pipeline stages mentioned in this post in the background as well. You can run face recognition tests with a few lines of code. We’ve just focused on the individual pipeline stages to understand how a face recognition system works.
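For instance, a single verify call runs detection, alignment, representation and verification end to end. The exact response fields may differ slightly between deepface versions, but it roughly looks as follows.

from deepface import DeepFace

result = DeepFace.verify(img1_path = "img1.jpg", img2_path = "img2.jpg", model_name = "VGG-Face", detector_backend = "mtcnn")

# returns a dictionary including a boolean verified flag and the found distance
print(result["verified"], result["distance"])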





Large scale face recognition

In this post, we’ve actually described how to apply face verification. Face verification has O(1) complexity in big O notation. Face recognition, on the other hand, requires finding a face in a data set, which becomes O(n) in big O notation where n is the number of instances in your data set.
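For the recognition case, deepface offers a find function that searches a target image in a folder of facial images. The folder and image names below are placeholders; the function returns the closest matches as a pandas data frame (a list of data frames in recent versions).

from deepface import DeepFace

# builds and stores representations of the whole folder on the first call
dfs = DeepFace.find(img_path = "target.jpg", db_path = "my_facial_db", model_name = "VGG-Face")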

We can use a clever trick to speed up large scale face recognition dramatically.

Notice that face recognition has O(n) time complexity, and this might be problematic for data at the millions or billions scale. Herein, approximate nearest neighbor (a-nn) algorithms reduce the time complexity dramatically. Spotify Annoy, Facebook Faiss and NMSLIB are amazing a-nn libraries. Besides, Elasticsearch wraps NMSLIB and comes with high scalability. You should run deepface with one of those a-nn libraries if you have a really large-scale database.
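Here is a rough sketch of the a-nn idea with Spotify Annoy. The embeddings list and target_embedding below are assumed to hold facial vectors already found with deepface; the index then answers nearest neighbor queries in sub-linear time.

from annoy import AnnoyIndex

dimensions = 2622 # e.g. VGG-Face embedding size
index = AnnoyIndex(dimensions, "euclidean")

for i, embedding in enumerate(embeddings):
    index.add_item(i, embedding)

index.build(10) # 10 trees

# indices of the 3 approximate nearest neighbors of a target face
neighbors = index.get_nns_by_vector(target_embedding, 3)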

Real time face recognition

Besides, we can run face recognition tasks in real time as well.
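A minimal sketch with deepface’s stream function is shown below; it is expected to open your webcam and look up detected faces in the given folder, though its arguments may vary between versions. The folder name is a placeholder.

from deepface import DeepFace

DeepFace.stream(db_path = "my_facial_db")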

Ensemble method

We’ve mentioned just a single face recognition model so far. On the other hand, there are several state-of-the-art models: VGG-Face, Google FaceNet, OpenFace, Facebook DeepFace and DeepID. Even though all of those models perform well, no single model is absolutely better than the others. Still, we can apply an ensemble method to build a grandmaster model. In this approach, we feed the predictions of those models to a boosting model. Accuracy metrics including precision, recall and F1 score increase dramatically with the ensemble method, at the cost of longer running time.
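The following is only a conceptual sketch of such an ensemble with scikit-learn. X_train, y_train and X_test are assumed to be prepared beforehand: one row per image pair holding the distances found by VGG-Face, FaceNet, OpenFace and DeepFace, with the label 1 for same-person pairs and 0 for different ones.

from sklearn.ensemble import GradientBoostingClassifier

# boosting model on top of the distances of the base face recognition models
boosted_model = GradientBoostingClassifier()
boosted_model.fit(X_train, y_train)

# the grandmaster model now decides based on all base models at once
predictions = boosted_model.predict(X_test)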

Tech Stack Recommendations

Face recognition is mainly based on representing facial images as vectors. Herein, storing the vector representations is a key factor for building robust facial recognition systems. I summarize the tech stack recommendations in the following video.

Conclusion

So, we have covered how face recognition works and the common stages of a face recognition pipeline. We have used pre-built models provided by the deepface framework. I strongly recommend that you follow the links to understand the concepts well.

I pushed the source code of this blog post to GitHub. You can support this study by starring the GitHub repo as well.



