A production-grade facial recognition pipeline raises a common concern: how to store the vector embeddings, since vectors are complex data types. There are a lot of options in the tech stack: relational databases, key-value or wide-column stores, document or graph databases. In this post, we are going to build a large scale facial recognition pipeline with the Pinecone vector database.
By the way, I strongly recommend watching the following video about the math behind approximate nearest neighbor and its Python implementation from scratch.
You may consider enrolling in my top-rated machine learning course on Udemy.
Pinecone
Pinecone is a cloud solution with no on-premise support. In other words, you access all of its services through an API and don't have to install anything in your environment.
You will need an API key to access its functionalities. You can get one with your name and email on its official web site. It comes with a 14-day free trial; after that, you have to pay for it.
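To avoid hard-coding the key, you might read it from an environment variable first. This is just a minimal sketch, and PINECONE_API_KEY is an assumed variable name, not something the service requires.

import os

#read the Pinecone API key from an environment variable (the variable name is an assumption)
YOUR_API_KEY = os.environ["PINECONE_API_KEY"]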
Python client
It has a neat Python client. We are going to use its functionalities through the following package, which is available on PyPI.
!pip install pinecone-client
Once the package is installed, we import it and initialize it with the API key.
import pinecone

pinecone.init(api_key = YOUR_API_KEY)
Index
The equivalent of a table in a relational database is an index in a vector database. In the following code snippet, I first retrieve the existing indexes, then check whether a deepface index exists, and finally create it if it does not. Alternatively, you can drop the existing index first.
#pinecone.delete_index("deepface")

existing_indexes = pinecone.list_indexes()

engine = 'approximated' #approximated, exact

if "deepface" not in existing_indexes:
    pinecone.create_index(name = "deepface", metric = "euclidean", engine_type = engine)

#connect
index = pinecone.Index("deepface")
Nearest neighbor
You should define the engine type when you create the index for the first time. It can be either approximated or exact. The exact option applies the k-nearest neighbor (k-nn) algorithm, whereas the approximated option applies an approximate nearest neighbor (a-nn) algorithm.
Here, the time complexity of k-nn is O(d x n), where n is the number of instances in your database and d is the dimensionality of your embeddings. For example, the FaceNet face recognition model creates 128-dimensional embeddings. Since n is expected to be much greater than d, we can simplify the time complexity of k-nn to O(n). That becomes problematic at the millions scale if you don't have big data solutions, but it is always guaranteed to find the true nearest neighbors.
On the other hand, the time complexity of an a-nn algorithm is much lower than that of k-nn, but it only approximates the result. In other words, it may discard some of the true nearest neighbors, yet it is very fast.
If your task requires speed (e.g. Google image search), then you should use a-nn. If your task requires confidence instead of speed (e.g. finding the identity of a suspect), you should run k-nn.
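To make the O(d x n) cost concrete, here is a minimal brute-force k-nn sketch in NumPy. It is only an illustration of scanning every stored vector, not how Pinecone's exact engine is implemented.

import numpy as np

def brute_force_knn(query, vectors, k = 5):
    #euclidean distance from the query to every stored vector: O(d x n)
    distances = np.linalg.norm(vectors - query, axis = 1)
    #indices of the k closest vectors
    return np.argsort(distances)[:k]

#usage with random data at the scale of this post: 100K vectors of 128 dimensions
vectors = np.random.uniform(-2, 2, (100000, 128))
query = np.random.uniform(-2, 2, 128)
print(brute_force_knn(query, vectors, k = 5))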
Notice that you can also use Spotify Annoy, Facebook Faiss, NMSLIB or Elasticsearch to run an approximate nearest neighbor algorithm.
Face recognition model
I'm going to use the deepface library for Python to find the vector embeddings of facial images.
from deepface import DeepFace
Here, you can watch a video covering the functionalities of deepface.
A modern face recognition pipeline consists of 4 common stages: detect, align, represent and verify. Deepface offers a modern pipeline for your studies.
It wraps several state-of-the-art face recognition models: VGG-Face, FaceNet and ArcFace. They all surpass human-level accuracy. I will use the FaceNet model in this post.
It will create 128-dimensional vector embeddings.
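As a quick sanity check, the snippet below extracts a single embedding and prints its length. The image path points to the first unit test item of deepface and is only illustrative; recent deepface versions return a list of dictionaries from represent, so adjust the indexing if your version returns the raw vector instead.

from deepface import DeepFace

#extract one FaceNet embedding and confirm its dimensionality
embedding = DeepFace.represent(
    img_path = "deepface/tests/dataset/img1.jpg",
    model_name = 'Facenet'
)[0]["embedding"]

print(len(embedding)) #128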
Facial database
I'm going to use the unit test items of deepface as the facial database.
import os
import tqdm

img_paths = []
for root, dirs, files in os.walk("deepface/tests/dataset"):
    for file in files:
        if '.jpg' in file:
            img_paths.append(root+"/"+file)

embeddings = []
for i in tqdm.tqdm(range(0, len(img_paths))):
    img_path = img_paths[i]
    embedding = DeepFace.represent(img_path = img_path, model_name = 'Facenet')[0]["embedding"]
    embeddings.append(embedding)
Synthetic data
There are 60 items in the unit test folder of deepface. I’ll create some synthetic data to make the problem more complex. I’ll store 100K embeddings in the database.
import random

for i in tqdm.tqdm(range(60, 100000)):
    embedding = []
    for j in range(0, 128):
        embedding.append(random.uniform(-2, 2))

    embeddings.append(embedding)
    img_paths.append('dummy_%d.jpg' % (i))
Master data
Now, I will merge the embeddings of the unit test items and the synthetic data.
import pandas as pd

df = pd.DataFrame(img_paths, columns = ["img_path"])
df["embedding"] = embeddings
Now, everything is in a pandas data frame.
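A quick peek at the merged frame, assuming the 60 unit test items plus the 99,940 synthetic rows from above:

#quick sanity check on the merged frame
print(df.shape) #expected: (100000, 2)
print(df.head())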
Inserting data to index
Bulk data insertion is handled with the upsert function in its interface.
#index.upsert(items=zip(df.img_path, df.embedding))
I have 100K instances, so I will insert the data in chunks. In this way, I can retry a chunk if something goes wrong.
import time

chunk_size = 1000
cycles = int(df.shape[0] / chunk_size) + 1
retry_count = 3

missings = []
for i in tqdm.tqdm(range(0, cycles)):
    if i >= 0: #adjust this lower bound to resume from a specific chunk
        valid_from = i * chunk_size
        valid_until = min(i * chunk_size + chunk_size, df.shape[0])

        if valid_from >= df.shape[0]:
            break

        valid_frame = df.iloc[valid_from:valid_until]

        is_successful = False
        for j in range(0, retry_count):
            try:
                index.upsert(items=zip(valid_frame.img_path, valid_frame.embedding), disable_progress_bar = True)
                is_successful = True
                break
            except Exception:
                time.sleep(1)

        if is_successful != True:
            missings.append(i)
A chunk of 1,000 items was inserted in 8.37 seconds on average in my case, so inserting 100K items takes almost 14 minutes. Notice that the insertion stage does not need to run again. You can of course run it when you have new instances, but once they are stored in the Pinecone vector database, you don't have to re-insert them.
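If any chunk failed, its cycle number was collected in the missings list above. A minimal sketch for retrying just those chunks could look like this:

#retry the chunks that failed during the bulk insert
for i in missings:
    valid_from = i * chunk_size
    valid_until = min(valid_from + chunk_size, df.shape[0])
    valid_frame = df.iloc[valid_from:valid_until]
    index.upsert(items=zip(valid_frame.img_path, valid_frame.embedding), disable_progress_bar = True)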
Target
I’m going to look for the identity of the following image in my 100K sized index.
img_path = "hello-world/source.jpg" target_embedding = DeepFace.represent(img_path = img_path, model_name = 'Facenet', model = model)[0]["embedding"]
Query
Once you have created the embedding of the target image, pass it to the query function to find its nearest neighbors.
results = index.query(queries=[target_embedding], top_k = 5)

for result in results:
    keys = result.ids
    scores = result.scores
It can find the nearest neighbors in 0.3 seconds!
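As a small follow-up, you could pair each returned id with its score and skip the synthetic entries. This sketch just reuses the keys and scores variables set in the loop above.

#pair each returned id with its score and ignore the synthetic items
for key, score in zip(keys, scores):
    if not key.startswith("dummy_"): #synthetic items were named dummy_<i>.jpg
        print(key, score)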
Key value store
You can use Pinecone as a key-value store as well. Suppose that you know img2 is Angelina Jolie but you don't know whether the target image is Angelina. You fetch the embedding of Angelina first, then compare it to the target one. If the distance is less than 10, they are the same person, because the fine-tuned threshold for FaceNet with euclidean distance is 10.
from deepface.commons import distance as dst

resp = index.unary_fetch(id = "deepface/tests/dataset/img2.jpg")

distance = dst.findEuclideanDistance(target_embedding, resp.vector.tolist())

distance < 10
Tech stack
A vector database is a best-practice solution, but there are a lot of different tools in the toolbox: relational databases, key-value stores, graph and document databases, big data systems and a-nn libraries.
The Best Single Model
DeepFace has many cutting-edge models in its portfolio. Find out the best configuration of facial recognition model, detector, similarity metric and alignment mode.
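As an illustration, a single verification call exposes those knobs. The image paths below are assumed unit test images, the chosen values are only examples, and the parameter names follow recent deepface versions.

from deepface import DeepFace

#compare two faces with an explicit model, detector, metric and alignment choice
result = DeepFace.verify(
    img1_path = "deepface/tests/dataset/img1.jpg",
    img2_path = "deepface/tests/dataset/img2.jpg",
    model_name = "Facenet",
    detector_backend = "opencv",
    distance_metric = "euclidean",
    align = True
)

print(result["verified"])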
DeepFace API
DeepFace offers a web service for face verification, facial attribute analysis and vector embedding generation through its API. You can watch a tutorial on using the DeepFace API here:
Additionally, DeepFace can be run with Docker to access its API. Learn how in this video:
Conclusion
In this post, we covered how to build a large scale face recognition pipeline with the Pinecone vector database. It comes with powerful k-nn and a-nn support.
I pushed the source code of this study to GitHub. You can support this study by starring ⭐ the repo.
Support this blog if you like it!