OpenAI vs SBERT embeddings on similar vs dissimilar texts¶

This notebook is a simple comparison of OpenAI vs SBERT embeddings. I'm using a small subset of "The Pile", which is a diverse open-source language modeling dataset.

The expectation¶

  1. We expect chunks drawn from the same document to have high embedding similarity, while chunks from different documents should have low similarity. We test this by chunking up documents from different slices of The Pile and comparing their self-similarity vs. cross-similarity.
  2. If we collect embeddings for a diverse set of texts, we intuitively expect them to make use of most of the k-dimensional space. Truly testing this is hard due to the curse of dimensionality, but we can at least look at each dimension in isolation and check whether there is reasonable variance within that dimension across our set (both checks are sketched in the snippet right after this list).
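To make the two checks concrete, here is a tiny sketch with toy vectors (not real embeddings; the numbers are made up purely for illustration):

import numpy as np

def cos_sim(a, b):
    # cosine similarity: dot product of the two unit-normalized vectors
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# check 1: chunks of the same document should score higher than chunks of different documents
same_doc_sim = cos_sim([0.9, 0.1, 0.0], [0.8, 0.2, 0.1])   # expect a high value
cross_doc_sim = cos_sim([0.9, 0.1, 0.0], [0.0, 0.1, 0.9])  # expect a low value

# check 2: per-dimension spread across a batch of embeddings
batch = np.random.randn(100, 8)    # stand-in for a (num_texts, num_dims) embedding matrix
per_dim_std = batch.std(axis=0)    # one std. dev. per dimension; near-zero means a "dead" dimension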

The findings¶

  1. OpenAI embeddings yield very high similarity (in the range 0.65 - 0.7) even for very dissimilar texts, while SBERT gives similarities that are close to 0 for such examples.
  2. Compared to SBERT's embeddings, the variance of each dimension in OpenAI's embeddings is much smaller, meaning that no matter what the input text is, most dimensions take values in a narrow range.

Speculation¶

I speculate that this behavior of OpenAI's embeddings is why Andrej Karpathy felt the need to use an SVM for nearest-neighbor lookups instead of simple cosine similarity. I suspect that if one were to use embeddings that fill out the space better, like SBERT's, the improvement from using an SVM (which is costly) for nearest-neighbor lookups would be small.
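For context, my understanding of that SVM trick is: treat the query embedding as the lone positive example, treat every corpus embedding as a negative, fit a linear SVM, and rank the corpus by its decision function. A minimal sketch, assuming scikit-learn is available and the embeddings are NumPy arrays (the function name and hyperparameters here are illustrative, not taken from this notebook):

import numpy as np
from sklearn import svm

def svm_rank(query_emb, corpus_embs, top_k=5):
    # the query is the single positive (label 1); every corpus vector is a negative (label 0)
    x = np.concatenate([query_emb[None, :], corpus_embs])
    y = np.zeros(len(x))
    y[0] = 1
    clf = svm.LinearSVC(class_weight="balanced", C=0.1, max_iter=10000, tol=1e-6)
    clf.fit(x, y)
    # rank corpus rows by how far they sit on the positive side of the separating hyperplane
    scores = clf.decision_function(x)[1:]
    return np.argsort(-scores)[:top_k]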

Imports¶

In [1]:
from datasets import load_dataset
from itertools import islice
import numpy as np
import pandas as pd
import random
import openai
from openai.embeddings_utils import get_embedding, cosine_similarity
import os
from sentence_transformers import SentenceTransformer, util
import seaborn as sns
import matplotlib.pyplot as plt

Load subset of Pile dataset¶

In [2]:
p10k = load_dataset("NeelNanda/pile-10k", split="train")
Using custom data configuration NeelNanda--pile-10k-72f566e9f7c464ab
Found cached dataset parquet (/Users/venu/.cache/huggingface/datasets/NeelNanda___parquet/NeelNanda--pile-10k-72f566e9f7c464ab/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
In [3]:
p10k[0, 1].keys()
Out[3]:
dict_keys(['text', 'meta'])
In [4]:
p10k[0, 0]["meta"][0]
Out[4]:
{'pile_set_name': 'Pile-CC'}
In [5]:
p10k[80, 0]["meta"][0]['pile_set_name']
Out[5]:
'StackExchange'
In [6]:
# get pile_set_name for each example, and count frequencies of each pile_set_name
names = [p10k[i, 0]["meta"][0]['pile_set_name'] for i in range(len(p10k))]
from collections import Counter
Counter(names)
Out[6]:
Counter({'Pile-CC': 2524,
         'Github': 855,
         'OpenWebText2': 1520,
         'StackExchange': 1399,
         'Wikipedia (en)': 779,
         'PubMed Abstracts': 1423,
         'USPTO Backgrounds': 514,
         'FreeLaw': 241,
         'PubMed Central': 259,
         'Enron Emails': 47,
         'HackerNews': 81,
         'NIH ExPorter': 104,
         'Books3': 9,
         'ArXiv': 91,
         'DM Mathematics': 99,
         'OpenSubtitles': 27,
         'BookCorpus2': 2,
         'Ubuntu IRC': 2,
         'YoutubeSubtitles': 11,
         'EuroParl': 6,
         'PhilPapers': 5,
         'Gutenberg (PG-19)': 2})
In [7]:
# make a dictionary with pile_set_name as key, and pick the first example
# from each pile_set_name that's longer than num_chunks * chunk_size characters
samples = {}
num_chunks = 10
chunk_size = 1000
for i in range(len(p10k)):
    # p10k[i, 0] returns rows i and 0 as a dict of column -> list; [0] then picks row i's value
    name = p10k[i, 0]["meta"][0]['pile_set_name']
    text = p10k[i, 0]["text"][0]
    if name not in samples and len(text) > chunk_size * num_chunks:
        # chunk up the text into num_chunks chunks of chunk_size characters each
        chunks = [text[j:j+chunk_size] for j in range(0, chunk_size*num_chunks, chunk_size)]
        samples[name] = chunks
In [8]:
samples.keys()
Out[8]:
dict_keys(['Pile-CC', 'USPTO Backgrounds', 'FreeLaw', 'PubMed Central', 'StackExchange', 'Books3', 'OpenWebText2', 'ArXiv', 'Github', 'OpenSubtitles', 'Wikipedia (en)', 'BookCorpus2', 'HackerNews', 'Ubuntu IRC', 'YoutubeSubtitles', 'PhilPapers', 'Gutenberg (PG-19)', 'EuroParl'])
In [9]:
len(samples.items())
for k, v in islice(samples.items(), 1):
    print(k, len(v), v[0])
Pile-CC 10 It is done, and submitted. You can play “Survival of the Tastiest” on Android, and on the web. Playing on the web works, but you have to simulate multi-touch for table moving and that can be a bit confusing.

There’s a lot I’d like to talk about. I’ll go through every topic, insted of making the typical what went right/wrong list.

Concept

Working over the theme was probably one of the hardest tasks I had to face.

Originally, I had an idea of what kind of game I wanted to develop, gameplay wise – something with lots of enemies/actors, simple graphics, maybe set in space, controlled from a top-down view. I was confident I could fit any theme around it.

In the end, the problem with a theme like “Evolution” in a game is that evolution is unassisted. It happens through several seemingly random mutations over time, with the most apt permutation surviving. This genetic car simulator is, in my opinion, a great example of actual evolution of a species facing a challenge. But is it a game?


In [10]:
# Common function for computing a similarity matrix and visualizing it as a heatmap
def compute_similarities_and_plot(embs: dict, name: str, num_classes=6):
    # note: relies on the module-level `samples` dict for the set of keys
    selected_samples = dict(islice(samples.items(), num_classes))
    sims_df = pd.DataFrame(columns=selected_samples.keys(), index=selected_samples.keys())
    random.seed(42)
    for k1, v1 in selected_samples.items():
        for k2, v2 in selected_samples.items():
            # average the cosine similarity over 10 randomly chosen chunk pairs
            total = 0
            num_pairs = 10
            for i in range(num_pairs):
                rnd1 = random.randint(0, len(v1)-1)
                rnd2 = random.randint(0, len(v2)-1)
                total += cosine_similarity(embs[k1][rnd1], embs[k2][rnd2])
            sims_df.loc[k1, k2] = float(total / num_pairs)
    # the DataFrame was created without a dtype, so its cells are inferred as object;
    # convert to numeric so seaborn can plot it
    sims_df = sims_df.apply(pd.to_numeric, errors='coerce')

    plt.figure(figsize=(6, 4))  # increase figure size for better visibility
    sns.heatmap(sims_df, annot=True, cmap='YlGnBu', vmin=0, vmax=1)
    plt.title(f'Avg. cosine similarities using {name} embeddings')

    # average and std. dev of the diagonal entries (self-similarity)
    text1 = "Average self-similarity {:.2f}, std.dev {:.2f}".format(np.mean(np.diag(sims_df)), np.std(np.diag(sims_df)))

    # average and std. dev of the off-diagonal entries (cross-similarity)
    non_diagonal_ix = np.where(~np.eye(sims_df.values.shape[0], dtype=bool))
    non_diagonal_mean = np.mean(sims_df.values[non_diagonal_ix])
    non_diagonal_std = np.std(sims_df.values[non_diagonal_ix])
    text2 = "Average non-self-similarity {:.2f}, std.dev {:.2f}".format(non_diagonal_mean, non_diagonal_std)

    plt.text(0, -1, text1 + "\n" + text2, fontsize=12)

    plt.show()

    return sims_df

Get OpenAI embeddings of each of the chunks¶

In [11]:
openai.api_key = os.getenv("OPENAI_API_KEY")
embedding_model = "text-embedding-ada-002"
In [12]:
cosine_similarity(get_embedding(samples["Pile-CC"][0], engine=embedding_model), get_embedding(samples["Pile-CC"][1], engine=embedding_model))
Out[12]:
0.8459940907245606
In [13]:
len(get_embedding(samples["Pile-CC"][0], engine=embedding_model))
Out[13]:
1536
In [14]:
cosine_similarity(get_embedding(samples["Pile-CC"][0], engine=embedding_model), get_embedding(samples["FreeLaw"][0], engine=embedding_model))
Out[14]:
0.6870063565524391
In [15]:
openai_embs = {}
for k1, v1 in samples.items():
    openai_embs[k1] = [get_embedding(v1[i], engine=embedding_model) for i in range(len(v1))]
In [16]:
sims_df = compute_similarities_and_plot(openai_embs, "OpenAI")

Commentary on the results¶

No matter how different the two inputs, the cosine similarity of the two resulting embeddings is 0.66 or greater!
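A hedged follow-up I didn't run above: if the high floor comes from every embedding sharing a large common component, then each embedding should be strongly aligned with the dataset's mean embedding. A quick way to check, assuming the openai_embs dict built above:

# hypothetical check: how aligned is each OpenAI embedding with the mean embedding?
all_vecs = np.array([e for chunks in openai_embs.values() for e in chunks])
mean_dir = all_vecs.mean(axis=0)
mean_dir /= np.linalg.norm(mean_dir)
alignment = (all_vecs @ mean_dir) / np.linalg.norm(all_vecs, axis=1)
# if the average alignment is high (say ~0.8), the shared component alone would account for most of the ~0.66 floor
print(alignment.mean(), alignment.std())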

Get SBERT embeddings of each of the chunks¶

In [17]:
model = SentenceTransformer('all-MiniLM-L6-v2')
In [18]:
util.cos_sim(model.encode(samples["Pile-CC"][0]), model.encode(samples["Pile-CC"][1]))
Out[18]:
tensor([[0.6097]])
In [19]:
# Double check that openai's implementation of cosine similarity returns the same results
cosine_similarity(model.encode(samples["Pile-CC"][0]), model.encode(samples["Pile-CC"][1]))
Out[19]:
0.60973257
In [20]:
util.cos_sim(model.encode(samples["Pile-CC"][0]), model.encode(samples["FreeLaw"][0]))
Out[20]:
tensor([[0.0098]])
In [21]:
sbert_embs = {}
for k1, v1 in samples.items():
    sbert_embs[k1] = [model.encode(v1[i]) for i in range(len(v1))]
In [22]:
sbert_sims = compute_similarities_and_plot(sbert_embs, "SBERT")

Commentary on the results¶

There's a much greater separation between the diagonal and the off-diagonal entries here, which makes much more sense and matches my intuition.

Compare the dimensions of OpenAI's vs SBERT's embeddings¶

One hypothesis is that OpenAI's embeddings contain some near-useless dimensions, i.e. dimensions with very little variation across different texts.

In [23]:
len(openai_embs["ArXiv"][0])
Out[23]:
1536
In [24]:
len(sbert_embs["ArXiv"][0])
Out[24]:
384
In [25]:
# allocate a matrix that will hold one OpenAI embedding per row (filled in below)
openai_embs_matrix = np.zeros((len(openai_embs)*len(openai_embs["ArXiv"]), len(openai_embs["ArXiv"][0])))
In [26]:
sbert_embs_matrix = np.zeros((len(sbert_embs)*len(sbert_embs["ArXiv"]), len(sbert_embs["ArXiv"][0])))
In [27]:
for i, (k, v) in enumerate(openai_embs.items()):
    for j in range(len(v)):
        openai_embs_matrix[i*len(v)+j] = openai_embs[k][j]
        sbert_embs_matrix[i*len(v)+j] = sbert_embs[k][j]
In [28]:
# make sure we built the matrices correctly
print(np.sum(openai_embs_matrix[0] - openai_embs["Pile-CC"][0]))
print(np.sum(sbert_embs_matrix[0] - sbert_embs["Pile-CC"][0]))
0.0
0.0
In [29]:
# make sure we built the matrices correctly
print(np.count_nonzero(openai_embs_matrix) - np.shape(openai_embs_matrix)[0] * np.shape(openai_embs_matrix)[1])
print(np.count_nonzero(sbert_embs_matrix) - np.shape(sbert_embs_matrix)[0] * np.shape(sbert_embs_matrix)[1])
0
0
In [30]:
# std. dev. along each column (i.e. per embedding dimension) of each matrix
openai_std = np.std(openai_embs_matrix, axis=0)
sbert_std = np.std(sbert_embs_matrix, axis=0)
# let's plot the distribution of std. dev in the same plot
plt.figure(figsize=(6, 4))  # increase figure size for better visibility
sns.histplot(sbert_std, kde=True, label='SBERT')
sns.histplot(openai_std, kde=True, label='OpenAI')
plt.legend()
plt.title('The distribution of std. dev. within each dimension of OpenAI vs SBERT embeddings')
plt.show()

Commentary on the results¶

It looks like most of the dimensions in OpenAI's embeddings span a very narrow range, compared to SBERT's.
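To put a rough number on "narrow", here's a small follow-up sketch (the 0.02 cutoff is arbitrary, and the two models' embeddings are on different scales, so treat this as a coarse comparison at best). It assumes the openai_std and sbert_std arrays computed above:

# hypothetical follow-up: what fraction of dimensions have a std. dev. below an arbitrary cutoff?
cutoff = 0.02
print("OpenAI fraction of dims below cutoff:", np.mean(openai_std < cutoff))
print("SBERT fraction of dims below cutoff: ", np.mean(sbert_std < cutoff))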

Try second way of getting OpenAI embeddings¶

Apparently, there's a slightly different way of getting OpenAI's embeddings (even for the same model), and the two methods don't return exactly the same result! The two methods are "openai.Embedding.create" vs "openai.embeddings_utils.get_embedding". The latter is what I see used in the OpenAI Cookbook on GitHub, but let's try the former as well to see whether the results differ. (As far as I can tell, "get_embedding" is a thin wrapper around "openai.Embedding.create" that replaces newlines with spaces before calling the API, which would explain small discrepancies.)
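As a quick sanity check on how close the two call paths are for a single chunk (a sketch, not something run above; I'd expect a value close to, but perhaps not exactly, 1.0):

# compare the two call paths on one chunk; any gap likely comes from get_embedding's newline handling
text = samples["Pile-CC"][0]
e1 = get_embedding(text, engine=embedding_model)
e2 = openai.Embedding.create(model=embedding_model, input=text)["data"][0]["embedding"]
print(cosine_similarity(e1, e2))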

In [31]:
oa2_embs = {}
for k1, v1 in samples.items():
    oa2_embs[k1] = [openai.Embedding.create(model=embedding_model,input=v1[i],)["data"][0]["embedding"] for i in range(len(v1))]
In [32]:
_ = compute_similarities_and_plot(oa2_embs, "OpenAI2")

Commentary on the results¶

The resulting plot looks very similar, although not identical. Whatever the exact cause of the small differences between the two call paths, the embeddings appear to share the same overall characteristics.
