In the world of database management, there are often debates about the right tools for specific tasks. In a recent blog post, we explored the capabilities of plain PostgreSQL and SQL for vector similarity search, leveraging PostgreSQL’s support for array datatypes. While that approach allowed us to perform exact nearest neighbor searches, it proved inefficient for large-scale datasets due to its computational intensity. In this post, we explore a more advanced solution: PostgreSQL’s pgvector extension, which transforms PostgreSQL into a powerful vector database capable of approximate nearest neighbor search. This enhancement enables us to find similar vectors among billions within milliseconds. As a practical use case, we will employ the DeepFace library in Python to convert facial images into vector embeddings and generate synthetic data to grow the database to 100,000 identities. In summary, we demonstrate how pgvector revolutionizes vector similarity search at scale.
Vlog
You can either continue reading this tutorial or watch the following video. Both cover PostgreSQL’s pgvector extension, which transforms PostgreSQL into a vector database that can run approximate nearest neighbor search and find the most similar vectors to a given one, even among billions, in just milliseconds.
You may consider enrolling in my top-rated machine learning course on Udemy.
Approximate Nearest Neighbor
I strongly recommend checking out the approximate nearest neighbor algorithm from scratch post to understand the math behind this algorithm.
Just Use Postgres
This popular meme captures the common debate perfectly with a bell curve depicting different perspectives on using PostgreSQL for AI applications. On the left side of the curve, the simplistic approach advocates “just use Postgres”, dismissing the need for specialized tools. In the middle, a detailed explanation suggests the necessity of a specialized vector database for AI applications at scale, emphasizing the benefits of optimized performance and features. Ironically, the right side of the curve, representing the highly knowledgeable, also concludes with “just use Postgres”, implying a deeper understanding that leverages PostgreSQL’s robust capabilities. This humorous take underscores a critical point: PostgreSQL, especially with extensions like pgvector or pgvecto.rs, can indeed serve as a powerful vector database capable of handling billion-scale vector similarity searches efficiently. The meme encapsulates the journey from simplicity to complexity and back to simplicity, reflecting PostgreSQL’s versatile and reliable nature in the realm of vector databases.
Installation
For this experiment, we will rely on PostgreSQL and its pgvector extension. I was able to install both on my MacBook using the following commands, which installed PostgreSQL 14.12 and pgvector 0.7.2 stable versions.
# install postgres and pgvector
$ brew install postgresql
$ brew install pgvector

# start the postgres service
$ brew services start postgresql

# open the postgres interactive terminal (psql)
$ psql postgres
Once you are inside the psql prompt, you need to enable the pgvector extension before first use with the following command:
postgres=# CREATE EXTENSION IF NOT EXISTS vector;
Creating Tables
Once the pgvector extension is activated, we can create tables with the vector datatype. In this experiment, we will use the FaceNet model to represent facial images as vector embeddings. The FaceNet model generates 128-dimensional vectors, so we create the embedding field as a vector column with 128 dimensions. If you plan to use a different facial recognition model, make sure to adjust the number of dimensions accordingly.
postgres=# CREATE TABLE IDENTITIES (ID INT primary key, IMG_NAME VARCHAR(100), embedding vector(128));
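If you are unsure how many dimensions your chosen model produces, a quick way to check is to run a single image through DeepFace.represent and inspect the embedding length. A minimal sketch, assuming a hypothetical local image named sample.jpg:

from deepface import DeepFace

# hypothetical sample image; any local facial image works here
objs = DeepFace.represent(img_path="sample.jpg", model_name="Facenet512")

# Facenet512 produces 512-dimensional embeddings, Facenet produces 128-dimensional ones
print(len(objs[0]["embedding"]))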
Vector Embeddings
In this experiment, we will use DeepFace’s unit test items to create our facial database with vector embeddings.
import os
from deepface import DeepFace

instances = []

for dirpath, dirnames, filenames in os.walk("deepface/tests/dataset/"):
    for filename in filenames:
        img_path = os.path.join(dirpath, filename)
        if not img_path.endswith(".jpg"):
            continue

        # represent the facial image as a 128-dimensional embedding
        objs = DeepFace.represent(
            img_path=img_path,
            model_name="Facenet"
        )
        embedding = objs[0]["embedding"]

        instances.append((img_path, embedding))
This will generate 64 vector embeddings, which is too few to test the limits of the approximate nearest neighbor algorithm. Therefore, let’s generate some synthetic data.
import numpy as np

target_size = 100000

for i in range(len(instances), target_size):
    img_name = f"synthetic_{i}.jpg"
    # generate a 128-dimensional embedding with dimension values in [-5, +5]
    embedding = np.random.uniform(-5, 5, 128).tolist()
    instances.append((img_name, embedding))
Since the FaceNet model generates 128-dimensional vector embeddings, the synthetic embeddings are created with 128 dimensions as well.
Database Connection
To communicate with PostgreSQL from Python, we will use the psycopg2-binary package. By default, Homebrew’s PostgreSQL uses my Mac’s login credentials as the username and password, so set the user and password variables below accordingly.
# !pip install psycopg2-binary
import psycopg2

# set these to your own postgres credentials
user = "..."
password = "..."

conn = psycopg2.connect(
    host="localhost",
    port="5432",
    database="postgres",
    user=user,
    password=password,
)
cursor = conn.cursor()
Table Initialization
Even though we activated the vector extension and initialized our table from the psql prompt, the same steps can be performed from Python. The create extension if not exists statement is idempotent, so re-running it throws no exceptions, and the drop-and-recreate statements simply give us a fresh identities table to fill. If nothing was created in psql beforehand, Python will initialize everything from scratch.
cursor.execute("create extension if not exists vector;") cursor.execute("drop table if exists identities;") cursor.execute(''' create table identities ( ID INT primary key, IMG_NAME VARCHAR(100), embedding vector(128)); ''')
Storing Facial Database Into Postgres
We stored our facial dataset with synthetic data in the instances variable, which consists of 100,000 items. Now, we will insert them into PostgreSQL one by one.
from tqdm import tqdm

for idx, (img_path, embedding) in tqdm(enumerate(instances)):
    statement = f'''
        insert into identities
        (id, img_name, embedding)
        values
        ({idx}, '{img_path}', '{str(embedding)}');
    '''
    cursor.execute(statement)

# commit once all inserts are done
conn.commit()
I was able to store all 100,000 items in PostgreSQL one by one in 17 seconds. While inserting them as a batch would be faster, this performance is acceptable for me.
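For reference, a batch insert could be done with psycopg2’s execute_values helper. This is a minimal sketch of that idea rather than the approach used above:

from psycopg2.extras import execute_values

# build all rows up front; pgvector accepts the '[x, y, ...]' string literal
rows = [
    (idx, img_path, str(embedding))
    for idx, (img_path, embedding) in enumerate(instances)
]

# send the rows in batched statements instead of one insert per row
execute_values(
    cursor,
    "insert into identities (id, img_name, embedding) values %s",
    rows,
)
conn.commit()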
Creating The Index
Once the data with the vector column is stored, you can search for similar vectors given a specific one. However, by default, PostgreSQL applies the exact nearest neighbor algorithm, which is very slow for large-scale datasets. Additionally, we have previously performed exact nearest neighbor searches with PostgreSQL without the vector extension in one of our blog posts. Therefore, to perform approximate nearest neighbor searches efficiently, we need to create an index on the stored vectors.
cursor.execute( "CREATE INDEX ON identities USING hnsw (embedding vector_l2_ops);" )
Creating the index takes 58 seconds for 100K vectors.
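If you need to trade index build time against recall, pgvector also lets you set the HNSW build parameters explicitly. A hedged sketch using pgvector’s documented defaults, instead of the plain statement above:

# m = max connections per node, ef_construction = candidate list size at build time
# (16 and 64 are pgvector's defaults; larger values improve recall but slow the build)
cursor.execute(
    """
    CREATE INDEX ON identities
    USING hnsw (embedding vector_l2_ops)
    WITH (m = 16, ef_construction = 64);
    """
)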
Searching An Identity
I will search for the following identity in my database. Please note that the image provided is not part of my database. I will be looking for similar faces that match the given image.
target_path = "target.jpg"

objs = DeepFace.represent(
    img_path=target_path,
    model_name="Facenet"
)
target_embedding = objs[0]["embedding"]
Querying
Once we have the vector representation of the target image, I can search for it using a basic SQL query. First, I will calculate the Euclidean distance between the target image and all images in the database. Then, I will keep only the images whose distance is less than a pre-tuned threshold value, and finally, I will order the results by distance in ascending order.
# pre-tuned threshold to classify the same person for the facenet - euclidean pair
threshold = 10

cursor.execute(
    f"""
    SELECT *
    FROM (
        SELECT i.*, embedding <-> '{str(target_embedding)}' AS distance
        FROM identities i
    ) a
    WHERE distance < {threshold}
    ORDER BY distance ASC
    LIMIT 100
    """
)
results = cursor.fetchall()
Querying just takes 0.0168 seconds!
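To reproduce that measurement on your own setup, you can wrap the query with a simple timer. A minimal sketch, assuming the select statement above is stored in a query variable:

import time

tic = time.time()
cursor.execute(query)  # the same select statement shown above
results = cursor.fetchall()
toc = time.time()

print(f"query took {toc - tic:.4f} seconds")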
Results
We are able to plot the nearest neighbors with the following code snippet.
import cv2
from matplotlib import pyplot as plt

for idx, img_name, embedding, distance in results:
    print(f"distance between {target_path} and {img_name} is {round(distance, 2)}")
    source_img = cv2.imread(img_name)
    # convert BGR to RGB before plotting
    plt.imshow(source_img[:, :, ::-1])
    plt.show()
All nearest neighbors are images of Angelina Jolie. So, this shows our approach is working fine!
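If you want to double-check an individual match, DeepFace’s verify function can compare the target against a returned neighbor directly. A minimal sketch, assuming the top result row points to a real image on disk:

# compare the target with the closest neighbor returned by the query
result = DeepFace.verify(
    img1_path=target_path,
    img2_path=results[0][1],  # img_name column of the top match
    model_name="Facenet",
    distance_metric="euclidean",
)
print(result["verified"], result["distance"])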
Source Code
I pushed the source code of this study to GitHub as a Jupyter notebook. You can support this study by starring the repository!
Small or Moderate Size Database
In this post, we focused on PostgreSQL’s pgvector extension, which transforms PostgreSQL into a vector database. In that way, we are able to run the approximate nearest neighbor algorithm and find the most similar vectors to a given one in just milliseconds, even among billions. However, what if your dataset is not billion scale? Regular PostgreSQL with plain SQL queries may be enough for your task!
You can check out the vector similarity search with PostgreSQL blog post for moderate-size datasets.
Conclusion
In conclusion, by incorporating the pgvector extension into PostgreSQL, we have significantly enhanced the efficiency and scalability of vector similarity search with the approximate nearest neighbor algorithm. This advancement allows us to handle billion-scale datasets with remarkable speed, achieving results in milliseconds. Through our practical example using the DeepFace library to create facial image embeddings and a synthetic database of 100,000 identities, we have demonstrated the potential and practicality of this approach. Pgvector not only transforms PostgreSQL into a robust vector database but also opens up new possibilities for real-time, large-scale similarity search applications. As the demand for handling vast amounts of vector data continues to grow, leveraging tools like pgvector will be crucial in maintaining performance and accuracy.
Support this blog if you like it!