Vector Similarity Search with Partially Homomorphic Encryption in Python

Vector embeddings are used in many modern applications, including facial recognition, recommendation systems, reverse image search, and large language models (LLMs). These embeddings help computers understand and compare complex data by converting them into numerical vectors. Even though embeddings do not directly reveal the original data, they can still be sensitive—similar to fingerprints. If someone gains access to embeddings, they might try to reverse-engineer them or compare them with known data to make guesses. This makes security a major concern.


One way to protect embeddings is encryption. However, using standard encryption methods like AES requires decrypting data before performing computations, which is inefficient, especially in cloud environments. This is where homomorphic encryption comes in. It allows computations on encrypted data without decrypting it.



In this blog post, we will implement vector similarity search with partially homomorphic encryption using LightPHE in Python. This method keeps embeddings encrypted while allowing cloud-based computations, making it both secure and efficient.


Why Not Adopt Fully Homomorphic Encryption?

In this blog, we previously implemented encrypted vector similarity search using fully homomorphic encryption with TenSEAL in Python. However, FHE has several limitations. It demands high computational power and is inefficient in memory-constrained environments such as edge devices and IoT. It also produces very long ciphertexts, making storage difficult, and requires long encryption keys, which are challenging to distribute. Most importantly, FHE does not scale efficiently. Finally, it can lose accuracy when performing homomorphic operations: higher security levels introduce more noise into encrypted computations, and that noise can accumulate and cause errors when decrypting the results.

Still, if you decide to use FHE over PHE, you may want to explore the CipherFace repository. It combines DeepFace and TenSEAL, offering a simple interface to encrypt tensors and compute cosine or Euclidean distances on encrypted data.

Two Tower Architecture

Many vector-embedding-based models depend on the two-tower architecture. Weaviate's illustration is my favorite way to describe it.

Two Tower Architecture

The Two-Tower Architecture consists of two branches that encode different inputs into vector embeddings. These embeddings are mapped to a shared space, where a similarity function (e.g., cosine similarity) determines their closeness. One tower encodes the query (e.g., a user’s face recently captured by a mobile phone), while the other encodes candidates (e.g., stored faces of the target person). This approach enables efficient and scalable similarity search, commonly used in face verification, recommendation systems, and information retrieval.

Suppose your facial database contains millions of identities, with each identity having a single facial image. On the on-premises side, you need to generate vector embeddings for all facial images in advance, which only needs to be done once. When a login attempt occurs, the mobile device captures the face of the person attempting to log in and generates a vector embedding using the same facial recognition model (e.g., FaceNet). The system then compares the newly captured face’s vector embedding with the stored facial image’s vector embedding using a distance function (e.g., cosine or Euclidean distance). If the distance is below a certain threshold, the system determines that the two embeddings represent the same person.
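
As a baseline, here is a minimal, unencrypted sketch of that flow with DeepFace. It assumes img1.jpg is the stored identity and target.jpg is the freshly captured face (both file names reappear later in this post), and uses the 0.68 cosine-distance threshold that DeepFace tunes for its default VGG-Face model:

from deepface import DeepFace

# on-prem, in advance: embed the stored identity once
stored_embedding = DeepFace.represent("img1.jpg")[0]["embedding"]

# on device, at login time: embed the freshly captured face
probe_embedding = DeepFace.represent("target.jpg")[0]["embedding"]

# VGG-Face embeddings are already L2-normalized, so cosine
# distance reduces to 1 minus the dot product
distance = 1 - sum(a * b for a, b in zip(stored_embedding, probe_embedding))

if distance <= 0.68:
    print("same person")
else:
    print("different persons")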

In summary, the on-premises side implements the candidate tower – embedding each identity in advance – while the mobile device implements the query tower – embedding the currently attempted input. In this setup, we plan to encrypt the vector embeddings generated on the on-prem side.

Similarity Functions

Let’s recall the formulas for Euclidean and cosine distance. If p and q are n-dimensional vectors, then:





$$d_{euc}(p, q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}$$

$$d_{cos}(p, q) = 1 - \frac{\sum_{i=1}^{n} p_i q_i}{\sqrt{\sum_{i=1}^{n} p_i^2}\,\sqrt{\sum_{i=1}^{n} q_i^2}}$$

Euclidean distance involves square roots, subtraction, and squaring. In the FHE implementation, we discarded the square root and compared the remaining sum of squared differences against the squared threshold. Even this trick does not help under partially homomorphic encryption: squaring the encrypted differences requires multiplying two ciphertexts, which an additively homomorphic scheme cannot do.

The cosine distance formula is also quite complex. However, cosine distance is simply 1 minus cosine similarity. Instead of checking that the distance is less than the threshold, we can check that the similarity is greater than 1 minus the threshold. This leads to a simpler formula and makes the calculation easier to handle.

$$sim_{cos}(p, q) = \frac{\sum_{i=1}^{n} p_i q_i}{\sqrt{\sum_{i=1}^{n} p_i^2}\,\sqrt{\sum_{i=1}^{n} q_i^2}}$$

Additionally, if we L2-normalize the vectors p and q in advance, we can drop the denominator entirely. Dividing each dimension value by the vector’s magnitude is exactly what L2 normalization does, so the magnitudes no longer need to be computed at comparison time.

Let’s denote the L2-normalized form of p as p′ and the L2-normalized form of q as q′.

$$p'_i = \frac{p_i}{\sqrt{\sum_{j=1}^{n} p_j^2}} \qquad q'_i = \frac{q_i}{\sqrt{\sum_{j=1}^{n} q_j^2}}$$

$$sim_{cos}(p', q') = \sum_{i=1}^{n} p'_i \, q'_i$$





In summary, cosine similarity involves multiplying the corresponding dimension values of two normalized vectors and then summing the results.
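
To make this concrete, here is a tiny pure-Python illustration with made-up two-dimensional vectors rather than real embeddings:

import math

p = [3.0, 4.0]
q = [1.0, 2.0]

# L2 normalization: divide each dimension by the vector's magnitude
p_prime = [x / math.sqrt(sum(v * v for v in p)) for x in p]  # [0.6, 0.8]
q_prime = [x / math.sqrt(sum(v * v for v in q)) for x in q]

# cosine similarity of normalized vectors is just a dot product
similarity = sum(a * b for a, b in zip(p_prime, q_prime))
print(similarity)  # ~0.9839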

Is this possible with PHE?

At first glance, computing cosine similarity with partially homomorphic encryption might seem impossible, since the calculation involves both multiplication and addition on encrypted data. However, the key point is that the vector embedding coming from the on-prem side is expected to be encrypted already, while the vector embedding from the edge side is in plain form. Multiplying a ciphertext by a plain value and adding ciphertexts together are exactly the operations an additively homomorphic scheme supports.

The only conditions are that the dimension values must be positive and that the vector embeddings are normalized in advance. Luckily, DeepFace’s default model, VGG-Face, generates vector embeddings with all-positive dimension values and already L2-normalizes them.
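
If you want to convince yourself of these two properties, a quick sanity check along these lines should pass (img1.jpg is a placeholder image):

from deepface import DeepFace

embedding = DeepFace.represent("img1.jpg")[0]["embedding"]

# all dimension values are expected to be non-negative
assert min(embedding) >= 0

# an L2-normalized vector's squared dimensions should sum to ~1
assert abs(sum(x * x for x in embedding) - 1) < 1e-3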

Partially Homomorphic Encryption

Additively homomorphic encryption algorithms support adding two encrypted values and multiplying an encrypted value by a plain scalar. The following snippet shows a proof of work for these two operations: adding two ciphertexts, and multiplying a ciphertext by a plain constant.

from lightphe import LightPHE

# build an additively homomorphic cryptosystem
phe = LightPHE(algorithm_name = "Paillier")

# define plaintexts
m1 = 10000
m2 = 500

# encrypt them
c1 = phe.encrypt(m1)
c2 = phe.encrypt(m2)

# homomorphic addition
c3 = c1 + c2

# homomorphic scalar multiplication
k = 3
c4 = k * c1

# proof of work - addition of 2 ciphertexts
assert phe.decrypt(c3) == m1 + m2

# proof of work - multiplication of ciphertext with a plain constant
assert phe.decrypt(c4) == k * m1

So, you can multiply the corresponding dimension values of the encrypted vector from the on-prem side with the plain vector from the cloud side element-wise. The result is a new vector whose dimensions are all encrypted. Finally, you can sum the encrypted dimensions of this vector – possible because we adopted an additively homomorphic encryption algorithm – which gives you the encrypted cosine similarity. This approach lets you compute cosine similarity securely while maintaining privacy.
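Here is a minimal sketch of that idea with toy integer vectors, using only the two operations demonstrated above. Real embeddings hold floats, which LightPHE handles through its precision argument, as we will see shortly:

from lightphe import LightPHE

phe = LightPHE(algorithm_name = "Paillier")

p = [3, 1, 4]  # on-prem vector, stored encrypted
q = [2, 7, 1]  # plain vector from the edge

# encrypt each dimension of p individually
encrypted_p = [phe.encrypt(p_i) for p_i in p]

# scalar-multiply each encrypted dimension by the matching plain dimension
encrypted_products = [q_i * c_i for c_i, q_i in zip(encrypted_p, q)]

# sum the encrypted products to obtain the encrypted dot product
encrypted_dot = encrypted_products[0]
for c in encrypted_products[1:]:
    encrypted_dot = encrypted_dot + c

# proof of work: matches the plain dot product (3*2 + 1*7 + 4*1 = 17)
assert phe.decrypt(encrypted_dot) == 17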

LightPHE supports several additively homomorphic encryption algorithms, including Paillier, Damgard-Jurik, Benaloh, Naccache-Stern, Okamoto-Uchiyama, Exponential ElGamal and Elliptic Curve ElGamal. However, Benaloh and Naccache-Stern are considered experimental, as they encounter issues when generating keys for longer bit lengths. Additionally, Exponential ElGamal relies on solving the discrete logarithm problem, while Elliptic Curve ElGamal requires solving the elliptic curve discrete logarithm problem during decryption. Due to these computational challenges, both are also classified as experimental and are not suitable for practical applications. In summary, to encrypt vector embeddings for encrypted vector similarity search, you should use Paillier, Damgard-Jurik, or Okamoto-Uchiyama, as they are the most practical choices.
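
Since these schemes share the same interface in LightPHE, switching between them only requires changing the algorithm name. A quick sketch:

from lightphe import LightPHE

# any of the three practical additively homomorphic schemes works here
for algorithm in ["Paillier", "Damgard-Jurik", "Okamoto-Uchiyama"]:
    cs = LightPHE(algorithm_name = algorithm)
    ciphertext = cs.encrypt(17) + cs.encrypt(25)
    assert cs.decrypt(ciphertext) == 42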

Encrypted Vector Similarity Search with PHE

Our experiment will involve two sides: on-prem and cloud. The on-prem side will store both the private and public keys, while the cloud side will only have access to the public key.

The on-prem side will establish an additively homomorphic cryptosystem and encrypt the vector embeddings of identities one by one. Suppose I only hold the identities of Angelina Jolie and Jennifer Aniston.

from lightphe import LightPHE
from deepface import DeepFace

# build an additively homomorphic cryptosystem
onprem_cs = LightPHE(algorithm_name = "Paillier", precision = 19)

# export private and public keys
onprem_cs.export_keys("secret.txt")
onprem_cs.export_keys("public.txt", public=True)

# find vector embeddings of identities as l2 normalized form and all positives
# VGG-Face does this already
angelina_embedding = DeepFace.represent("img1.jpg")[0]["embedding"]
jennifer_embedding = DeepFace.represent("img3.jpg")[0]["embedding"]

# encrypt vector embeddings
angelina_embedding_encrypted = onprem_cs.encrypt(angelina_embedding)
jennifer_embedding_encrypted = onprem_cs.encrypt(jennifer_embedding)

The cloud side will initialize the same cryptosystem using only the public key. It will then capture an image of the target and convert it into a vector embedding. Next, the cloud side will compute the dot product of the encrypted vector embedding and the plaintext embedding of the target image. This operation will yield an encrypted cosine similarity. However, since the cloud only has the public key, it will be unable to decrypt the result.

Target Image

from lightphe import LightPHE
from deepface import DeepFace
import pytest

# build cryptosystem in cloud with only public key
cloud_cs = LightPHE(algorithm_name = "Paillier", precision = 19, key_file = "public.txt")

# confirm that cloud cannot decrypt angelina_embedding_encrypted
with pytest.raises(ValueError, match="You must have private key"):
   cloud_cs.decrypt(angelina_embedding_encrypted)

# find vector embeddings of the target image as l2 normalized form and all positives
# VGG-Face does this already
target_embedding = DeepFace.represent("target.jpg")[0]["embedding"]

# find dot product of encrypted embedding and plain embedding in cloud
encrypted_cosine_similarity = angelina_embedding_encrypted @ target_embedding

# confirm that cloud cannot decrypt it even though it is calculated by cloud
with pytest.raises(ValueError, match="You must have private key"):
   cloud_cs.decrypt(encrypted_cosine_similarity)

Finally, the on-prem side will decrypt the encrypted cosine similarity, as it possesses the private key. The decrypted result should be identical to the cosine similarity computed using plaintext vector embeddings, ensuring the correctness of the homomorphic encryption process.





from lightphe import LightPHE

# restore the cryptosystem on the on-prem side with the private key
onprem_cs = LightPHE(algorithm_name = "Paillier", precision = 19, key_file = "secret.txt")

# restore cosine similarity
cosine_similarity = onprem_cs.decrypt(encrypted_cosine_similarity)[0]

# proof of work
expected_similarity = sum(x * y for x, y in zip(angelina_embedding, target_embedding))
assert abs(cosine_similarity - expected_similarity) < 1e-2

In DeepFace, a pair is considered the same person when the cosine distance is less than or equal to 0.68 for the VGG-Face model with the cosine metric.

thresholds = {
    "VGG-Face": {
        "cosine": 0.68,
        "euclidean": 1.17,
        "euclidean_l2": 1.17,
    },
}

Here, we calculated cosine similarity instead of cosine distance. Since cosine distance is 1 minus cosine similarity, checking that the distance is at most 0.68 is equivalent to checking that the restored similarity is at least 1 – 0.68 = 0.32. This lets us verify whether the image pair represents the same person or different individuals.

if cosine_similarity >= 0.32:
   print("same person")
else:
   print("different persons")

Conclusion

In this post, we explored how partially homomorphic encryption (PHE) can be leveraged to securely perform vector similarity searches on encrypted data, addressing the challenges of privacy while enabling efficient computations in cloud-based environments. By focusing on additively homomorphic encryption algorithms like Paillier, Damgard-Jurik, and Okamoto-Uchiyama, we demonstrated how vector embeddings can be kept private while still allowing similarity computations like cosine similarity to be performed securely.

Although fully homomorphic encryption (FHE) offers similar security guarantees, it comes with significant computational and storage overheads that make it impractical for many use cases, especially in resource-constrained environments. On the other hand, PHE provides a more feasible solution, allowing encrypted operations to be conducted on vectors without compromising security.

The two-tower architecture used in applications like facial recognition and recommendation systems benefits greatly from this method. By encrypting the embeddings generated on the on-prem side and keeping the cloud-side operations secure, we can protect sensitive data while maintaining scalability and efficiency.

In summary, the integration of PHE in vector similarity search processes presents a robust solution for maintaining data privacy and security in modern applications. With practical encryption algorithms and secure processing workflows, organizations can safeguard their data while still performing complex, privacy-preserving computations in distributed systems.


Support this blog if you like it!

Buy me a coffee