Embedding models have become a cornerstone of modern machine learning, powering everything from recommendation engines and document search to verification systems like signature or image matching. At their core, these models transform complex data—whether it’s text, images, audio, or even behavioral patterns—into high-dimensional numerical vectors, known as embeddings. These embeddings capture the essential features of the data, allowing us to compare items mathematically.

To determine similarity between embeddings, we typically use distance or similarity metrics such as cosine similarity or Euclidean distance. Based on these measurements, we often implement hard classification rules: if the distance between two vectors is below a predefined threshold, we classify them as belonging to the same group; otherwise, they are considered different. This binary approach works well for applications like reverse image search, facial recognition, or fraud detection, where a yes/no decision is sufficient.
However, hard classification has a major limitation—it doesn’t provide a measure of how confident we are in that decision. For instance, knowing that two embeddings are “the same” doesn’t tell us whether the similarity is borderline or nearly identical. This lack of interpretability can be a problem when stakeholders or end-users need a more intuitive understanding of similarity.
In this post, we’ll explore a simple yet effective approach to bridge this gap: converting embedding distances and similarity scores into percentage-based confidence scores. Using a straightforward logistic regression model, we’ll demonstrate how to transform raw distance measurements from any embedding model into interpretable percentages. This not only provides more nuanced insights into similarity but also makes embedding-based systems more transparent and user-friendly.
Use Case
We will use DeepFace to obtain distance scores between vector embeddings for pairs of images, both of the same person and of different people. Additionally, we will leverage the unit test data provided by the DeepFace library.
When we run the verify functionality in DeepFace, it returns both a distance score and a hard classification: True if the pair belongs to the same person, and False if they are different.
# !pip install deepface
from deepface import DeepFace
result = DeepFace.verify("img1.jpg", "img2.jpg")
The result payload will look like this:
{
    "verified": True,
    "distance": 0.41,
    "threshold": 0.68,
}
Preparing The Dataset
The unit test folder contains numerous facial images, each accompanied by its identity information.
idendities = {
    "Angelina": ["img1.jpg", "img2.jpg", "img4.jpg", "img5.jpg",
                 "img6.jpg", "img7.jpg", "img10.jpg", "img11.jpg"],
    "Scarlett": ["img8.jpg", "img9.jpg"],
    "Jennifer": ["img3.jpg", "img12.jpg"],
    "Mark": ["img13.jpg", "img14.jpg", "img15.jpg"],
    "Jack": ["img16.jpg", "img17.jpg"],
    "Elon": ["img18.jpg", "img19.jpg"],
    "Jeff": ["img20.jpg", "img21.jpg"],
    "Marissa": ["img22.jpg", "img23.jpg"],
    "Sundar": ["img24.jpg", "img25.jpg"]
}
First, let’s create a Pandas DataFrame containing only same-person labeled instances by cross-matching the images of each identity.
import pandas as pd

positives = []
for key, values in idendities.items():
    for i in range(0, len(values) - 1):
        for j in range(i + 1, len(values)):
            positives.append([values[i], values[j]])

positives = pd.DataFrame(positives, columns=["file_x", "file_y"])
positives["actual"] = "Same Person"
Then, we’ll add different-person labeled instances by performing cross-sampling across identities.
import itertools

samples_list = list(idendities.values())
negatives = []
for i in range(0, len(idendities) - 1):
    for j in range(i + 1, len(idendities)):
        cross_product = itertools.product(samples_list[i], samples_list[j])
        for cross_sample in cross_product:
            negatives.append([cross_sample[0], cross_sample[1]])

negatives = pd.DataFrame(negatives, columns=["file_x", "file_y"])
negatives["actual"] = "Different Persons"
Now, let’s concatenate the same-person and different-person labeled instances into a single Pandas DataFrame.
df = pd.concat([positives, negatives]).reset_index(drop=True)
df.file_x = "../tests/dataset/" + df.file_x
df.file_y = "../tests/dataset/" + df.file_y
Now, we have a Pandas DataFrame containing the image pair names along with their labels.
Generate Embeddings
I prefer to store the vector embeddings of each image in a dictionary, since the same image may appear multiple times in our Pandas DataFrame. This way, we avoid storing duplicate embeddings and reduce unnecessary repetition.
pivot = {}
model_name = "Facenet"
detector_backend = "mtcnn"
def represent(img_name: str):
    if pivot.get(img_name) is None:
        embedding_objs = DeepFace.represent(
            img_path=img_name,
            model_name=model_name,
            detector_backend=detector_backend,
        )
        if len(embedding_objs) > 1:
            raise ValueError(f"{img_name} has more than one face!")
        pivot[img_name] = [embedding_obj["embedding"] for embedding_obj in embedding_objs]
    return pivot[img_name]
Then, we can represent each item in the Pandas DataFrame using its corresponding vector embedding.
from tqdm import tqdm

img1_embeddings = []
img2_embeddings = []
for index, instance in tqdm(df.iterrows(), total=df.shape[0]):
    img1_embeddings += represent(instance["file_x"])
    img2_embeddings += represent(instance["file_y"])

df["img1_embeddings"] = img1_embeddings
df["img2_embeddings"] = img2_embeddings
Now, we have a Pandas DataFrame that contains the image pair names along with their corresponding embeddings.
Distance Calculation
In each row of the DataFrame, we have two vector embeddings. From these, we can compute the distance for each row as follows:
from deepface.modules.verification import find_distance, find_threshold

distance_metrics = [
    "cosine", "euclidean", "euclidean_l2", "angular",
]

for distance_metric in distance_metrics:
    distances = []
    for index, instance in tqdm(df.iterrows(), total=df.shape[0]):
        distance = find_distance(
            alpha_embedding=instance["img1_embeddings"],
            beta_embedding=instance["img2_embeddings"],
            distance_metric=distance_metric,
        )
        distances.append(distance)
    df[distance_metric] = distances
This adds the distances for each metric as new columns in the Pandas DataFrame.
Hard Classification
Once we have the distance scores, we can classify each pair as same-person or different-person by comparing the distance against a pre-tuned threshold.
for distance_metric in distance_metrics:
    threshold = find_threshold(model_name=model_name, distance_metric=distance_metric)
    df[f"{distance_metric}_threshold"] = threshold
    df[f"{distance_metric}_decision"] = 0
    idx = df[df[distance_metric] <= threshold].index
    df.loc[idx, f"{distance_metric}_decision"] = 1
This adds the pre-tuned threshold and a hard prediction column, with 1 indicating a same-person pair and 0 indicating a different-person pair.
Logistic Regression Model
Next, we will build a logistic regression model to convert distance scores into confidence scores. The distance values will serve as the input features, while the hard predictions will be used as the target labels.
We need to normalize the input distances to the [0, 1] range before feeding them into the model. Logistic regression passes a linear combination of its inputs through a sigmoid function, which saturates for values below roughly -4 or above +4, so keeping the inputs on a small, comparable scale keeps the model in the responsive region of the curve.
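A small numerical illustration of that saturation (the sample inputs are arbitrary):

```python
import math

def sigmoid(z: float) -> float:
    return 1 / (1 + math.exp(-z))

# beyond |z| of about 4, the sigmoid is already nearly flat at 0 or 1,
# so unscaled distances (e.g. euclidean values around 15-20) would land
# in the saturated regions and wash out differences between pairs
for z in [-8, -4, 0, 4, 8]:
    print(f"{z:>3}: {sigmoid(z):.4f}")
```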
import math
from sklearn.linear_model import LogisticRegression

confidence_metrics = {}
for distance_metric in distance_metrics:
    max_value = df[distance_metric].max()
    X = df[distance_metric].values.reshape(-1, 1)
    # normalize the distance values before feeding them to the model
    if max_value > 1:
        X = X / max_value
    y = df[f"{distance_metric}_decision"].values

    model = LogisticRegression().fit(X, y)
    w = model.coef_[0][0]
    b = model.intercept_[0]
    confidence_metrics[distance_metric] = {
        "w": w,
        "b": b,
        "normalizer": max_value,
    }

    confidences = []
    for index, instance in df.iterrows():
        distance = instance[distance_metric]
        if max_value > 1:
            distance = distance / max_value
        z = w * distance + b
        confidence = 100 / (1 + math.exp(-z))
        confidences.append(confidence)
    df[distance_metric + "_confidence"] = confidences

    true_scores = df[df[f"{distance_metric}_decision"] == 1][distance_metric + "_confidence"]
    false_scores = df[df[f"{distance_metric}_decision"] == 0][distance_metric + "_confidence"]
    confidence_metrics[distance_metric]["denorm_max_true"] = true_scores.max()
    confidence_metrics[distance_metric]["denorm_min_true"] = true_scores.min()
    confidence_metrics[distance_metric]["denorm_max_false"] = false_scores.max()
    confidence_metrics[distance_metric]["denorm_min_false"] = false_scores.min()
After training, we will obtain the coefficient and intercept of the logistic regression model, which define the slope and position of the sigmoid curve:
{'cosine': {'w': -6.502269165856082,
'b': 1.679048923097668,
'normalizer': 1.206694,
'denorm_max_true': 77.17253153662926,
'denorm_min_true': 41.790002608273234,
'denorm_max_false': 20.618350202170916,
'denorm_min_false': 0.7976712344840693},
'euclidean': {'w': -6.716177467853723,
'b': 2.790978346203265,
'normalizer': 18.735288,
'denorm_max_true': 74.76412617567517,
'denorm_min_true': 40.4423755909089,
'denorm_max_false': 25.840858374979504,
'denorm_min_false': 1.9356150486888306},
'euclidean_l2': {'w': -6.708710331202137,
'b': 2.9094193067398195,
'normalizer': 1.553508,
'denorm_max_true': 75.45756719896039,
'denorm_min_true': 40.4509428022908,
'denorm_max_false': 30.555931000001184,
'denorm_min_false': 2.189644991619842},
'angular': {'w': -6.371147050396505,
'b': 0.6766460615182355,
'normalizer': 0.56627,
'denorm_max_true': 45.802357900723386,
'denorm_min_true': 24.327312950719133,
'denorm_max_false': 16.95267765757785,
'denorm_min_false': 5.063533287198758}}
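With these coefficients stored, a new pair can be scored at inference time without retraining. The sketch below reuses the fitted cosine parameters from the output above; the distances passed in are illustrative:

```python
import math

# fitted parameters for the cosine metric, copied from the output above
w, b, normalizer = -6.502269165856082, 1.679048923097668, 1.206694

def raw_confidence(distance: float) -> float:
    # normalize the distance the same way as during training
    if normalizer > 1:
        distance = distance / normalizer
    z = w * distance + b
    # sigmoid scaled to a 0-100 range
    return 100 / (1 + math.exp(-z))

# a smaller distance yields a higher confidence
print(raw_confidence(0.2))
print(raw_confidence(0.9))
```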
I also stored the per-class minimum and maximum confidence values so that false predictions can later be mapped to the range [0, 49] and true predictions to [51, 100]. This step is optional but makes the confidence scores more intuitive.
Confidence Scores
Now, we can use the trained logistic regression model to convert distance scores into confidence scores for each row, and we can do this separately for each distance metric.
for distance_metric in distance_metrics:
    threshold = find_threshold(model_name=model_name, distance_metric=distance_metric)
    for index, instance in df.iterrows():
        current_distance = instance[distance_metric]
        prediction = "same person" if current_distance <= threshold else "different persons"
        if prediction == "same person":
            # denormalize same-person predictions into [51, 100]
            min_original = confidence_metrics[distance_metric]["denorm_min_true"]
            max_original = confidence_metrics[distance_metric]["denorm_max_true"]
            min_target = max(51, min_original)
            max_target = 100
        else:
            # denormalize different-person predictions into [0, 49]
            min_original = confidence_metrics[distance_metric]["denorm_min_false"]
            max_original = confidence_metrics[distance_metric]["denorm_max_false"]
            min_target = 0
            max_target = min(49, max_original)
        confidence = instance[f"{distance_metric}_confidence"]
        confidence_new = (
            (confidence - min_original) / (max_original - min_original)
        ) * (max_target - min_target) + min_target
        df.loc[index, f"{distance_metric}_confidence"] = float(confidence_new)
We now have confidence scores computed for each pair.
Distributions
Next, let’s plot the distributions of confidence scores for same-person and different-person pairs. Ideally, we should see scores in the range 0–49 for different-person pairs and 51–100 for same-person pairs.
import matplotlib.pyplot as plt

for distance_metric in distance_metrics:
    df[df.actual == "Same Person"][f"{distance_metric}_confidence"].plot.kde(label="Same Person")
    df[df.actual == "Different Persons"][f"{distance_metric}_confidence"].plot.kde(label="Different Persons")
    plt.legend()
    plt.show()
The confidence scores are indeed distributed within the expected ranges.
With this approach, we were able to convert continuous distance values into bounded confidence scores ranging from 0 to 100. Since the conversion was fitted on distances and predictions from a real dataset, it also captures how sensitive the confidence score is to a small change in distance, much like a derivative measures sensitivity. In other words, instead of raw, hard-to-interpret distance values, we now have meaningful, interpretable, and actionable confidence scores on a 0-100 scale. At this point, you can implement additional actions, such as taking automatic action for classifications with confidence above 75, while sending scores in the 51-75 range for human review.
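As one hypothetical way to route decisions on top of these scores (the cut-offs below are examples, not recommendations):

```python
def route(confidence: float) -> str:
    # route each pair based on its 0-100 confidence score
    if confidence > 75:
        return "auto-accept"    # confident same-person match
    elif confidence > 50:
        return "human-review"   # borderline same-person match
    else:
        return "reject"         # different persons

for score in [92.3, 61.0, 12.5]:
    print(score, "->", route(score))
```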
In Summary
Converting embedding distances and similarity scores into percentage-based confidence scores adds a layer of interpretability that hard classification cannot provide. Instead of relying solely on true/false decisions, we can now understand not just whether two items are considered the same, but how strong or fuzzy that classification is. Even a simple approach like logistic regression allows us to transform raw embedding metrics into intuitive, human-understandable percentages. This added nuance makes similarity-based systems more transparent, informative, and user-friendly, bridging the gap between powerful machine learning models and actionable insights.
I pushed the source code of this study into GitHub. You can support this work by starring the repo.
If you enjoyed this post, you can also support this blog financially.






