Postgres As A Vector Database: Billion-Scale Vector Similarity Search With pgvector

In the world of database management, there are often debates about the right tools for specific tasks. In a recent blog post, we explored the capabilities of regular PostgreSQL and SQL for vector similarity search, leveraging PostgreSQL’s support for array data types. While that approach allowed us to perform exact nearest neighbor searches, it proved inefficient for large-scale datasets due to its computational intensity. In this post, we delve into a more advanced solution by utilizing PostgreSQL’s pgvector extension, transforming PostgreSQL into a powerful vector database and running approximate nearest neighbor searches. This enhancement enables us to find similar vectors among billions within milliseconds. As a practical use case, we will employ the DeepFace library in Python to convert facial images into vector embeddings and generate synthetic data to grow the database to 100,000 items. In summary, we demonstrate how pgvector revolutionizes vector similarity search at scale.

Woman in Green Coat With Brown Hair From Pexels

Vlog

You can either continue reading this tutorial or watch the following video. Both cover Postgres’ pgvector extension, which transforms Postgres into a vector database that runs approximate nearest neighbor search and finds the most similar vectors to a given one, even among billions, within milliseconds.


🙋‍♂️ You may consider enrolling in my top-rated machine learning course on Udemy

Decision Trees for Machine Learning

Approximate Nearest Neighbor

I strongly recommend checking out the approximate nearest neighbor algorithm from scratch post to understand the math behind this algorithm.

Just Use Postgres

Why Sometimes ‘Just Use Postgres’ Makes Perfect Sense

This popular meme captures the common debate perfectly with a bell curve depicting different perspectives on using PostgreSQL for AI applications. On the left side of the curve, the simplistic approach advocates “just use Postgres”, dismissing the need for specialized tools. In the middle, a detailed explanation suggests the necessity of a specialized vector database for AI applications at scale, emphasizing the benefits of optimized performance and features. Ironically, the right side of the curve, representing the highly knowledgeable, also concludes with “just use Postgres”, implying a deeper understanding that leverages PostgreSQL’s robust capabilities. This humorous take underscores a critical point: PostgreSQL, especially with extensions like pgvector or pgvecto.rs, can indeed serve as a powerful vector database capable of handling billion-scale vector similarity searches efficiently. The meme encapsulates the journey from simplicity to complexity and back to simplicity, reflecting PostgreSQL’s versatile and reliable nature in the realm of vector databases.

Installation

For this experiment, we will rely on PostgreSQL and its pgvector extension. I was able to install both on my MacBook using the following commands, which installed PostgreSQL 14.12 and pgvector 0.7.2 stable versions.

# install postgres and pgvector
$ brew install postgresql
$ brew install pgvector

# start postgres service
$ brew services start postgresql

# open the postgres interactive terminal (psql)
$ psql postgres

Once the psql shell is open, you need to enable the pgvector extension for first-time use with the following command:

postgres=# CREATE EXTENSION IF NOT EXISTS vector;
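
If you want to double-check that the extension was created and see which version is installed, an optional sanity check is to query the pg_extension catalog:

postgres=# SELECT extname, extversion FROM pg_extension WHERE extname = 'vector';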

Creating Tables

Once the pgvector extension is activated, we can create tables with the vector datatype. In this experiment, we will use the FaceNet model to represent facial images as vector embeddings. The FaceNet model generates 128-dimensional vectors, so we create the embedding field as a vector column with 128 dimensions. If you plan to use a different facial recognition model, make sure to adjust the number of dimensions accordingly.

postgres=# CREATE TABLE IDENTITIES (ID INT primary key, IMG_NAME VARCHAR(100), embedding vector(128));
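
Vector values are written as bracketed literals such as '[1, 2, 3]', and pgvector ships several distance operators: <-> for Euclidean (L2) distance, which we use throughout this post, plus <=> for cosine distance and <#> for negative inner product. A toy query, independent of our table, illustrates the L2 operator:

postgres=# SELECT '[1, 2, 3]'::vector <-> '[1, 2, 4]'::vector AS euclidean_distance;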

Vector Embeddings

In this experiment, we will use DeepFace’s unit test items to create our facial database with vector embeddings.

import os
from deepface import DeepFace

instances = []
for dirpath, dirnames, filenames in os.walk("deepface/tests/dataset/"):
    for filename in filenames:
        if not filename.endswith(".jpg"):
            continue
        img_path = os.path.join(dirpath, filename)
        # represent the facial image as a vector embedding with FaceNet
        objs = DeepFace.represent(
            img_path=img_path,
            model_name="Facenet"
        )
        embedding = objs[0]["embedding"]
        instances.append((img_path, embedding))

This will generate 64 vector embeddings, which is too few to test the limits of the approximate nearest neighbor algorithm. Therefore, let’s generate some synthetic data.

import numpy as np

target_size = 100000
for i in range(len(instances), target_size):
    img_name = f"synthetic_{i}.jpg"
    # generate 128-d embedding with dimension values in [-5, +5]
    embedding = np.random.uniform(-5, 5, 128).tolist()
    instances.append((img_name, embedding))

Since the FaceNet model generates 128-dimensional vector embeddings, the synthetic embeddings are also generated with 128 dimensions.

Database Connection

To communicate with PostgreSQL from Python, we will use the psycopg2-binary package. A Homebrew installation of PostgreSQL uses my Mac’s login user as the default database user, so I pass those credentials to the connection.

# !pip install psycopg2-binary
import os
import psycopg2

# a homebrew postgres installation uses the local mac user by default;
# adjust these credentials if your setup differs
user = os.environ.get("USER")
password = os.environ.get("POSTGRES_PASSWORD", "")

conn = psycopg2.connect(
    host="localhost",
    port="5432",
    database="postgres",
    user=user,
    password=password,
)

cursor = conn.cursor()

Table Initialization

Even though we activated the vector extension and initialized our table from the psql shell, these actions can also be performed from Python. The create extension statement is idempotent: if the extension already exists, running it again neither throws an exception nor takes any action. The table, on the other hand, is dropped and recreated so that we start from a clean state; if it was never created in psql, Python simply initializes it.

cursor.execute("create extension if not exists vector;")

cursor.execute("drop table if exists identities;")

cursor.execute('''
    create table identities (
    ID INT primary key,
    IMG_NAME VARCHAR(100),
    embedding vector(128));
''')

Storing Facial Database Into Postgres

We stored our facial dataset with synthetic data in the instances variable, which consists of 100,000 items. Now, we will insert them into PostgreSQL one by one.

from tqdm import tqdm

for idx, (img_path, embedding) in tqdm(enumerate(instances), total=len(instances)):
    statement = f'''
        insert into
        identities
        (id, img_name, embedding)
        values
        ({idx}, '{img_path}', '{str(embedding)}');
    '''
    cursor.execute(statement)

# commit once all inserts are done
conn.commit()

I was able to store all 100,000 items in PostgreSQL one by one in 17 seconds. While inserting them as a batch would be faster, this performance is acceptable for me.
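
While the loop above is kept for clarity, the same inserts could be batched with psycopg2’s execute_values helper. The sketch below is one illustrative way to do it, casting the embedding text to the vector type; the actual speed-up depends on your environment.

from psycopg2.extras import execute_values

# build (id, img_name, embedding-as-text) tuples once
rows = [
    (idx, img_path, str(embedding))
    for idx, (img_path, embedding) in enumerate(instances)
]

# insert many rows per statement instead of one by one
execute_values(
    cursor,
    "insert into identities (id, img_name, embedding) values %s",
    rows,
    template="(%s, %s, %s::vector)",
    page_size=1000,
)
conn.commit()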

Creating The Index

Once the data with the vector column is stored, you can search for similar vectors given a specific one. However, by default, PostgreSQL applies the exact nearest neighbor algorithm, which is very slow for large-scale datasets. Additionally, we have previously performed exact nearest neighbor searches with PostgreSQL without the vector extension in one of our blog posts. Therefore, to perform approximate nearest neighbor searches efficiently, we need to create an index on the stored vectors.

cursor.execute(
   "CREATE INDEX ON identities USING hnsw (embedding vector_l2_ops);"
)

Creating the index took 58 seconds for 100K vectors.
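
pgvector’s HNSW index also exposes build-time and query-time parameters that trade accuracy for speed. As a purely illustrative alternative to the statement above, the index could be built with explicit m and ef_construction settings (the values below are pgvector’s documented defaults), and hnsw.ef_search could be raised above its default of 40 at query time for better recall:

# build the index with explicit hnsw parameters (illustrative values)
cursor.execute(
    "CREATE INDEX ON identities USING hnsw (embedding vector_l2_ops) "
    "WITH (m = 16, ef_construction = 64);"
)

# a larger candidate list at query time improves recall but slows queries down
cursor.execute("SET hnsw.ef_search = 100;")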

Searching An Identity

I will search for the following identity in my database. Please note that the image provided is not part of my database. I will be looking for similar faces that match the given image.

Target Image

target_path = 'target.jpg'
objs = DeepFace.represent(
    img_path=target_path,
    model_name='Facenet'
)
target_embedding = objs[0]['embedding']

Querying

Once we have the vector representation of the target image, we can search for it with a basic SQL query. First, we calculate the Euclidean distance between the target embedding and every embedding in the database with pgvector’s <-> operator. Then, we keep only the images whose distance is below a pre-tuned threshold and finally order the results by distance in ascending order.

# pre-tuned threshold to classify same person for facenet - euclidean pair
threshold = 10

cursor.execute(
    f"""
        SELECT *
        FROM (
            SELECT i.*, embedding <-> '{str(target_embedding)}' as distance
            FROM identities i
        ) a
        WHERE distance < {threshold}
        ORDER BY distance asc
        LIMIT 100
    """
)
results = cursor.fetchall()

Querying takes just 0.0168 seconds!
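
If you want to confirm that the query is served by the HNSW index rather than a sequential scan, one way to check (sketched below; the exact plan output depends on your PostgreSQL version and planner settings) is to run EXPLAIN ANALYZE on a plain ORDER BY ... LIMIT query, which is the pattern the index can serve:

cursor.execute(
    f"""
        EXPLAIN ANALYZE
        SELECT id, img_name
        FROM identities
        ORDER BY embedding <-> '{str(target_embedding)}'
        LIMIT 100
    """
)
# each fetched row is one line of the execution plan
for row in cursor.fetchall():
    print(row[0])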

Results

We are able to plot the nearest neighbors with the following code snippet.

import cv2
from matplotlib import pyplot as plt

for idx, img_name, embedding, distance in results:
    print(f"distance between {target_path} and {img_name} is {round(distance, 2)}")
    source_img = cv2.imread(img_name)
    plt.imshow(source_img[:, :, ::-1])  # convert from BGR to RGB for display
    plt.show()

All nearest neighbors are images of Angelina Jolie. So, this shows our approach is working fine!

Source Code

I pushed the source code of this study to GitHub as a Jupyter notebook. You can support this study by starring ⭐ it!

Small or Moderate Size Database

In this post, we focused on Postgres’ pgvector extension to transform Postgres into a vector database. In that way, we are able to run the approximate nearest neighbor algorithm and find the most similar vectors to a given one in milliseconds, even among billions. However, what if your dataset is not billion scale? Regular Postgres with plain SQL queries may be enough for your task!

You can check out the vector similarity search with postgres blog post for moderate size datasets.

Conclusion

In conclusion, by incorporating the pgvector extension into PostgreSQL, we have significantly enhanced the efficiency and scalability of vector similarity searches with the approximate nearest neighbor algorithm. This advancement allows us to handle billion-scale datasets with remarkable speed, achieving results in milliseconds. Through our practical example using the DeepFace library to create facial image embeddings and a synthetic 100,000-item database, we’ve demonstrated the potential and practicality of this approach. pgvector not only transforms PostgreSQL into a robust vector database but also opens up new possibilities for real-time, large-scale similarity search applications. As the demand for handling vast amounts of vector data continues to grow, leveraging tools like pgvector will be crucial in maintaining performance and accuracy.


Support this blog if you like it!

Buy me a coffee