Large Scale Face Recognition with Pinecone Vector Database

A production-grade facial recognition pipeline raises a common question: how should the vector embeddings be stored? Vectors are complex data types, and the tech stack offers many options: relational databases, key-value or wide-column stores, document or graph databases. In this post, we are going to build a large scale facial recognition pipeline with the Pinecone vector database.

Pinecone

Pinecone is a cloud solution with no on-premise support. You access all of its services through an API, so you don’t have to host anything in your own environment.

You will need an API key to access its functionalities. You can get one with your name and email on its official web site. It comes with a 14-day free trial; after that, you have to pay for it.

Python client

It has a neat Python client. We are going to use its functionalities through the following package, which is available on PyPI.

!pip install pinecone-client

Once the package is installed, we import it. You should initialize it with your API key.

import pinecone
pinecone.init(api_key = YOUR_API_KEY)
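Rather than pasting the key into the notebook, you may prefer to read it from an environment variable. The following is a minimal sketch using only the standard library. The helper name load_api_key and the variable name PINECONE_API_KEY are my own assumptions for illustration, not something the Pinecone client requires.

```python
import os


def load_api_key(env_var: str = "PINECONE_API_KEY") -> str:
    """Return the API key stored in an environment variable.

    Both the helper and the default variable name are hypothetical
    conventions for this sketch.
    """
    api_key = os.environ.get(env_var, "").strip()
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return api_key
```

You could then pass the result to the initialization call shown above, e.g. pinecone.init(api_key = load_api_key()).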

Index

An index in a vector database is the equivalent of a table in a relational database. In the following code snippet, I first retrieve the existing indexes, then check whether the deepface index exists, and finally create it if it does not. Alternatively, you can drop the existing index first.

#uncomment to drop the existing index and start from scratch
#pinecone.delete_index("deepface")
existing_indexes = pinecone.list_indexes()

engine = 'approximated' #approximated, exact
if "deepface" not in existing_indexes:
    pinecone.create_index(name = "deepface", metric = "euclidean", engine_type = engine)

#connect
index = pinecone.Index("deepface")

Nearest neighbor

You should define the engine type when you create the index for the first time. It can be either approximated or exact. The exact option applies the k-nearest neighbor (k-nn) algorithm, whereas the approximated option applies an approximate nearest neighbor (a-nn) algorithm.

Here, the time complexity of k-nn is O(d × n), where n is the number of instances in your database and d is the number of dimensions of your embeddings. For example, the FaceNet face recognition model creates 128-dimensional embeddings. Because d is fixed by the model while n keeps growing, n is expected to be much greater than d, so we can simplify the time complexity of k-nn to O(n). This becomes problematic for millions of instances if you don’t have a big data solution, but k-nn always guarantees to find the nearest ones.
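To make the O(d × n) cost concrete, here is a minimal brute-force sketch of exact k-nn in plain Python. It is an illustration under my own assumptions (the function name exact_knn and the toy two-dimensional database are made up), not the way Pinecone implements its exact engine.

```python
import heapq
import math


def exact_knn(query, vectors, top_k=5):
    """Return the top_k (distance, key) pairs closest to query.

    vectors maps a key (e.g. an image path) to its embedding.
    Every stored vector is visited once (n iterations) and each
    distance touches every dimension (d operations): O(d x n).
    """
    scored = (
        (math.dist(query, vector), key)
        for key, vector in vectors.items()
    )
    return heapq.nsmallest(top_k, scored)


# toy database with 2-dimensional vectors instead of 128-dimensional ones
database = {
    "a.jpg": [0.0, 0.0],
    "b.jpg": [1.0, 1.0],
    "c.jpg": [5.0, 5.0],
}
nearest = exact_knn([0.9, 0.9], database, top_k=2)
print([key for _, key in nearest])  # ['b.jpg', 'a.jpg']
```

Since every query scans the whole database, doubling the number of stored faces doubles the query time, which is the problem the approximated engine tries to avoid.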

On the other hand, the time complexity of the a-nn algorithm is much lower than that of k-nn, but it approximates: it can miss some of the true nearest neighbors in exchange for being very fast.
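The trade-off can be illustrated with a toy a-nn index based on random hyperplane hashing. This is a generic sketch under my own assumptions (the class name, the number of planes and the random data are all made up); it is not the algorithm behind Pinecone's approximated engine, which this post does not cover.

```python
import math
import random
from collections import defaultdict


class RandomHyperplaneIndex:
    """Toy approximate nearest neighbor index (random hyperplane hashing).

    Vectors are grouped into buckets by the side of a few random
    hyperplanes they fall on. A query scans only its own bucket, so
    it is fast but can miss true neighbors stored in other buckets.
    """

    def __init__(self, dimensions, num_planes=4, seed=0):
        rng = random.Random(seed)
        self.planes = [
            [rng.gauss(0, 1) for _ in range(dimensions)]
            for _ in range(num_planes)
        ]
        self.buckets = defaultdict(list)

    def _hash(self, vector):
        # one bit per hyperplane: which side of the plane the vector is on
        return tuple(
            sum(p * v for p, v in zip(plane, vector)) >= 0
            for plane in self.planes
        )

    def add(self, key, vector):
        self.buckets[self._hash(vector)].append((key, vector))

    def query(self, vector, top_k=5):
        # only the candidates sharing the query's bucket are compared
        candidates = self.buckets.get(self._hash(vector), [])
        scored = sorted((math.dist(vector, v), k) for k, v in candidates)
        return scored[:top_k]


rng = random.Random(42)
data = {f"item_{i}": [rng.uniform(-2, 2) for _ in range(8)] for i in range(1000)}

ann_index = RandomHyperplaneIndex(dimensions=8)
for key, vector in data.items():
    ann_index.add(key, vector)

# a stored vector always lands in its own bucket, so it is found at distance 0
hits = ann_index.query(data["item_7"], top_k=3)
print(hits[0][1])  # item_7
```

With 4 hyperplanes there are at most 16 buckets, so a query compares against only a fraction of the stored vectors. Any true neighbor that fell on the other side of a hyperplane is silently skipped, which is exactly the approximation mentioned above.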

If your task requires speed (e.g. Google image search), then you should use a-nn. If your task requires confidence instead of speed (e.g. finding the identity of a suspect), you should run k-nn.

Notice that you can also use Spotify Annoy, Facebook Faiss, NMSLIB or Elasticsearch to run approximate nearest neighbor algorithms.

Face recognition model

I’m going to use the deepface library for Python to find the vector embeddings of facial images.

from deepface import DeepFace

A modern face recognition pipeline consists of 4 common stages: detect, align, represent and verify. Deepface offers such a pipeline out of the box.

It wraps several state-of-the-art face recognition models, including VGG-Face, FaceNet and ArcFace. They have all passed human-level accuracy. I will build the FaceNet model in this post.

It creates 128-dimensional vector embeddings.

Facial database

I’m going to use the unit test items of deepface as the facial database.

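import os
import tqdm

#the represent calls below reuse a pre-built model
#(assumed here to come from DeepFace.build_model)
model = DeepFace.build_model("Facenet")

#collect the paths of the jpg files in the unit test folder
img_paths = []
for root, dirs, files in os.walk("deepface/tests/dataset"):
    for file in files:
        if '.jpg' in file:
            img_paths.append(root+"/"+file)

#represent each facial image as a vector embedding
embeddings = []
for img_path in tqdm.tqdm(img_paths):
    embedding = DeepFace.represent(img_path = img_path
             , model_name = 'Facenet', model = model)[0]["embedding"]
    embeddings.append(embedding)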

Synthetic data

There are 60 items in the unit test folder of deepface. I’ll create some synthetic data to make the problem more complex. I’ll store 100K embeddings in the database.

import random

#fill the database up to 100K items with random 128-dimensional vectors
for i in tqdm.tqdm(range(60, 100000)):
    embedding = []
    for j in range(0, 128):
        embedding.append(random.uniform(-2, 2))

    embeddings.append(embedding)
    img_paths.append('dummy_%d.jpg' % (i))

Master data

Now, I will merge the embeddings of the unit test items and the synthetic data.

import pandas as pd
df = pd.DataFrame(img_paths, columns = ["img_path"])
df["embedding"] = embeddings

Now, everything is in the pandas data frame.

Inserting data to index

Bulk insertion is handled by the upsert function in its interface.

#index.upsert(items=zip(df.img_path, df.embedding))

I have 100K instances, so I will insert the data in chunks. This way, I can retry inserting a chunk if something goes wrong.

import time

chunk_size = 1000
cycles = int(df.shape[0] / chunk_size) + 1

retry_count = 3

#indexes of the chunks that could not be inserted
missings = []

for i in tqdm.tqdm(range(0, cycles)):
    valid_from = i * chunk_size
    valid_until = min(i * chunk_size + chunk_size, df.shape[0])

    if valid_from >= df.shape[0]:
        break

    valid_frame = df.iloc[valid_from:valid_until]

    is_successful = False

    #try to insert the chunk up to retry_count times
    for j in range(0, retry_count):
        try:
            index.upsert(items=zip(valid_frame.img_path, valid_frame.embedding), disable_progress_bar = True)
            is_successful = True
            break
        except Exception:
            time.sleep(1)

    if not is_successful:
        missings.append(i)

A chunk of 1000 items was inserted in 8.37 seconds on average in my case, so inserting 100K items takes almost 14 minutes. Notice that the insertion stage does not need to be run again. You can of course run it when you have new instances, but once they are stored in the Pinecone vector database, you don’t have to re-insert them.
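The chunk-and-retry pattern above can also be factored into two small helpers that do not depend on pandas or Pinecone. This is only a sketch: iter_chunks and send_with_retry are hypothetical names of my own, and send stands in for whatever upload call you use (for example a small wrapper around index.upsert).

```python
import time


def iter_chunks(items, chunk_size):
    """Yield consecutive slices of items with at most chunk_size elements."""
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


def send_with_retry(send, chunk, retry_count=3, wait_seconds=1.0):
    """Call send(chunk) until it succeeds or retry_count attempts fail.

    send is a hypothetical callable standing in for the real upload.
    """
    for _ in range(retry_count):
        try:
            send(chunk)
            return True
        except Exception:
            time.sleep(wait_seconds)
    return False


# 100K items in chunks of 1000 give exactly 100 chunks
chunks = list(iter_chunks(list(range(100_000)), 1000))

# 8.37 seconds per chunk is the average measured in this post
estimated_minutes = len(chunks) * 8.37 / 60
print(len(chunks), round(estimated_minutes, 2))  # 100 13.95
```

The last lines reproduce the arithmetic behind the 14-minute figure: 100 chunks at 8.37 seconds each is 837 seconds.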

Target

I’m going to look for the identity of the following image in my index of 100K items.

img_path = "hello-world/source.jpg"
target_embedding = DeepFace.represent(img_path = img_path, model_name = 'Facenet', model = model)[0]["embedding"]

Query

Once you have created the embedding of the target image, pass it to the query function to find its nearest neighbors.

results = index.query(queries=[target_embedding], top_k = 5)
for result in results:
    keys = result.ids
    scores = result.scores

It can find the nearest neighbors in 0.3 seconds!

Key value store

You can use Pinecone as a key-value store as well. Suppose that you know that img2 is Angelina Jolie, but you don’t know whether the target image is Angelina. You first fetch the embedding of img2 by its key, then compare it to the target one. If the distance is less than 10, they are the same person, because the fine-tuned threshold for FaceNet with euclidean distance is 10.

from deepface.commons import distance as dst
resp = index.unary_fetch(id = "deepface/tests/dataset/img2.jpg")
distance = dst.findEuclideanDistance(target_embedding, resp.vector.tolist())
distance < 10
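The verification step itself is just a distance check against the threshold. Here is a self-contained sketch that does not need deepface or Pinecone. The function name is_same_person and the toy vectors are my own assumptions; the threshold of 10 is the value quoted above for FaceNet with euclidean distance.

```python
import math

FACENET_EUCLIDEAN_THRESHOLD = 10  # value quoted in this post


def is_same_person(embedding_a, embedding_b, threshold=FACENET_EUCLIDEAN_THRESHOLD):
    """Return True when the euclidean distance falls below the threshold."""
    distance = math.sqrt(
        sum((a - b) ** 2 for a, b in zip(embedding_a, embedding_b))
    )
    return distance < threshold


# toy 128-dimensional vectors, not real face embeddings
reference = [0.0] * 128
close = [0.5] * 128  # distance = sqrt(128 * 0.25), about 5.66
far = [1.0] * 128    # distance = sqrt(128), about 11.31

print(is_same_person(reference, close))  # True
print(is_same_person(reference, far))    # False
```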

Tech stack

A vector database is a best-practice solution, but there are a lot of different tools in the toolbox: relational databases, key-value stores, graph and document databases, big data systems and a-nn libraries.

Conclusion

We covered how to build a large scale face recognition pipeline with the Pinecone vector database. It comes with powerful k-nn and a-nn support.

I pushed the source code of this study to GitHub. You can support this study by starring⭐️ the repo.
