Exploring the World of Embeddings

Introduction

Vector embeddings are among the most intriguing and beneficial aspects of machine learning, playing a pivotal role in many natural language processing, recommendation, and search algorithms. If you've interacted with recommendation engines, voice assistants, or language translators, you've engaged with systems that utilize embeddings.

Embeddings are dense vector representations of data that encapsulate semantic information, making them suitable for various machine-learning tasks such as clustering, recommendation, and classification. They transform human-perceived semantic similarity into closeness in vector space and can be generated for different data types, including text, images, and audio.

For text data, models such as those in the GPT family and Llama are employed to create vector embeddings for words, sentences, or paragraphs. In the case of images, convolutional neural networks (CNNs) such as VGG and Inception can generate embeddings. Audio recordings can be converted into vectors by applying image embedding techniques to visual representations of audio frequencies, such as spectrograms. Deep neural networks are commonly employed to train models that convert objects into vectors. The resulting embeddings are typically high-dimensional and dense.

Embeddings are extensively used in similarity search applications, such as k-nearest neighbor (KNN) and approximate nearest neighbor (ANN) search, which require calculating distances between vectors to determine similarity. Nearest neighbor search can be employed for tasks like de-duplication, recommendations, anomaly detection, and reverse image search.
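To make the idea concrete, here is a minimal sketch of a nearest neighbor search over embedding vectors. It assumes scikit-learn (installed as part of the requirements below) and uses made-up three-dimensional vectors in place of real, high-dimensional embeddings:

import numpy as np
from sklearn.neighbors import NearestNeighbors

# toy 3-dimensional "embeddings"; real embeddings typically have hundreds of dimensions
vectors = np.array([
    [0.9, 0.1, 0.0],  # item A
    [0.8, 0.2, 0.1],  # item B, close to A
    [0.0, 0.9, 0.8],  # item C, far from A and B
])

# build a nearest neighbor index over the vectors and query it with a new vector
nn = NearestNeighbors(n_neighbors=2, metric="cosine").fit(vectors)
distances, indices = nn.kneighbors([[0.85, 0.15, 0.05]])
print(indices[0])  # indices of the two most similar items (A and B)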

Similarity search and vector embeddings

OpenAI provides embedding models, such as text-embedding-ada-002, that can be used for tasks like generating embeddings and performing similarity searches. In this example, we'll use the OpenAI API to generate embeddings for a set of documents and then perform a similarity search using cosine similarity.

First, let's install the required packages with the following command: pip install langchain==0.0.208 deeplake openai==0.27.8 tiktoken scikit-learn.

Next, create an API key from the OpenAI website and set it as an environment variable:

export OPENAI_API_KEY="your-api-key"

Let's generate embeddings for our documents and perform a similarity search:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from langchain.embeddings import OpenAIEmbeddings

# Define the documents
documents = [
    "The cat is on the mat.",
    "There is a cat on the mat.",
    "The dog is in the yard.",
    "There is a dog in the yard.",
]

# Initialize the OpenAIEmbeddings instance
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")

# Generate embeddings for the documents
document_embeddings = embeddings.embed_documents(documents)

# Perform a similarity search for a given query
query = "A cat is sitting on a mat."
query_embedding = embeddings.embed_query(query)

# Calculate similarity scores
similarity_scores = cosine_similarity([query_embedding], document_embeddings)[0]

# Find the most similar document
most_similar_index = np.argmax(similarity_scores)
most_similar_document = documents[most_similar_index]

print(f"Most similar document to the query '{query}':")
print(most_similar_document)

# the output:
Most similar document to the query 'A cat is sitting on a mat.':
The cat is on the mat.

Because we set the OpenAI API key as an environment variable earlier, the OpenAIEmbeddings class can authenticate with OpenAI's API, which allows us to use OpenAI's services for generating embeddings.

We then define a list of documents as strings. These documents are the text data we want to analyze for semantic similarity.

In order to perform this analysis, we need to convert our documents into a format that our similarity computation algorithm can understand. This is where the OpenAIEmbeddings class comes in. We use it to generate embeddings for each document, transforming them into vectors that represent their semantic content.

Similarly, we also transform our query string into an embedding. The query string is the text for which we want to find the most similar document.

With our documents and query now in the form of embeddings, we compute the cosine similarity between the query embedding and each document embedding. The cosine similarity is a metric used to determine how similar two vectors are. In our case, it gives us a list of similarity scores for our query against each document.

With our similarity scores in hand, we then identify the document most similar to our query. We do this by finding the index of the highest similarity score and retrieving the corresponding document from our list of documents.
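As a quick refresher, the cosine similarity used above is the dot product of two vectors divided by the product of their norms. A minimal sketch with NumPy (on two made-up vectors) looks like this:

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])

# cosine_similarity(a, b) = (a · b) / (||a|| * ||b||)
score = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(score)  # ~1.0, because b points in the same direction as a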

Embedding vectors positioned near each other are regarded as similar. Sometimes they are used directly, for example to display related items in online shops. In other cases, they are fed into downstream models so that similar items share information rather than being treated as entirely distinct entities. This makes embeddings effective for representing things like web browsing patterns, textual data, and e-commerce transactions in subsequent models.

Embedding Models

Embedding models are a type of machine learning model that convert discrete data into continuous vectors. In the context of natural language processing, these discrete data points can be words, sentences, or even entire documents. The generated vectors, also known as embeddings, are designed to capture the semantic meaning of the original data.

For instance, words that are semantically similar (e.g., 'cat' and 'kitten') would have similar embeddings. These embeddings are dense, which means that they use many dimensions (often hundreds) to capture nuances in meaning.

The primary benefit of embeddings is that they allow us to use mathematical operations to reason about semantic meaning. For example, we can calculate the cosine similarity between two embeddings to assess how semantically similar the corresponding words or documents are.

We initialize our embedding model. For this task, we've chosen the pre-trained "sentence-transformers/all-mpnet-base-v2" model. This model is designed to transform sentences into embeddings - vectors that encapsulate the semantic meaning of the sentences. The model_kwargs parameter is used here to specify that we want our computations to be performed on the CPU.

Before executing the subsequent code, make sure to install the Sentence Transformers library using the command pip install sentence_transformers==2.2.2. This library offers powerful pre-trained models designed to generate embedding representations.

from langchain.embeddings import HuggingFaceEmbeddings

model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {'device': 'cpu'}
hf = HuggingFaceEmbeddings(model_name=model_name, model_kwargs=model_kwargs)

documents = ["Document 1", "Document 2", "Document 3"]
doc_embeddings = hf.embed_documents(documents)

Now that we have our model, we define a list of documents - these are the pieces of text that we want to convert into semantic embeddings.

With our model and documents ready, we move on to generate the embeddings. We do this by calling the embed_documents method on our HuggingFaceEmbeddings instance, passing our list of documents as an argument. This method processes each document and returns a corresponding list of embeddings.

These embeddings are now ready for any downstream tasks such as classification, clustering, or similarity analysis. They represent our original documents in a form that machines can understand and process, enabling us to perform complex semantic tasks.
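For example, tying this back to the earlier 'cat' and 'kitten' observation, a minimal sketch (reusing the hf instance defined above; the exact scores depend on the model) can compare individual word embeddings:

from sklearn.metrics.pairwise import cosine_similarity

cat = hf.embed_query("cat")
kitten = hf.embed_query("kitten")
car = hf.embed_query("car")

# semantically related words should score higher than unrelated ones
print(cosine_similarity([cat], [kitten])[0][0])  # relatively high
print(cosine_similarity([cat], [car])[0][0])     # relatively low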

Cohere embeddings

Cohere is dedicated to making its innovative multilingual language models accessible to all, thereby democratizing advanced NLP technologies worldwide. Their multilingual model maps text into a semantic vector space in which texts with similar meanings end up close together, which significantly enhances multilingual applications such as search. Unlike their English-only model, the multilingual model uses dot product computations for similarity, resulting in superior performance on multilingual tasks.

These multilingual embeddings are represented in a 768-dimensional vector space.

To activate the power of the Cohere API, one needs to acquire an API key. Here's a step-by-step guide to doing so:

  1. Visit the Cohere Dashboard.
  2. If you haven't already, you must either log in or sign up for a Cohere account. Please note that you agree to adhere to the Terms of Use and Privacy Policy by signing up.
  3. When you're logged in, the dashboard provides an intuitive interface to create and manage your API keys.

Once we have the API key, we initialize an instance of the CohereEmbeddings class within LangChain, specifying the "embed-multilingual-v2.0" model.

We then specify a list of texts in various languages. The embed_documents() method is subsequently invoked to generate unique embeddings for each text in the list.

To illustrate the results, we print each text alongside its corresponding embedding. For simplicity, we only display the first 5 dimensions of each embedding. Before running the code, you also need to install the cohere package with the following command: pip install cohere.

from langchain.embeddings import CohereEmbeddings

# Initialize the CohereEmbeddings object
embeddings = CohereEmbeddings(
    model="embed-multilingual-v2.0",
    cohere_api_key="your_cohere_api_key"
)

# Define a list of texts
texts = [
    "Hello from Cohere!", 
    "مرحبًا من كوهير!", 
    "Hallo von Cohere!",  
    "Bonjour de Cohere!", 
    "¡Hola desde Cohere!", 
    "Olá do Cohere!",  
    "Ciao da Cohere!", 
    "您好,来自 Cohere!", 
    "कोहेरे से नमस्ते!"
]

# Generate embeddings for the texts
document_embeddings = embeddings.embed_documents(texts)

# Print the embeddings
for text, embedding in zip(texts, document_embeddings):
    print(f"Text: {text}")
    print(f"Embedding: {embedding[:5]}")  # print first 5 dimensions of each embedding

Your output should be similar to the following.

Text: Hello from Cohere!
Embedding: [0.23439695, 0.50120056, -0.048770234, 0.13988855, -0.1800725]

Text: مرحبًا من كوهير!
Embedding: [0.25350592, 0.29968268, 0.010332941, 0.12572688, -0.18180023]

Text: Hallo von Cohere!
Embedding: [0.10278442, 0.2838264, -0.05107267, 0.23759139, -0.07176493]

Text: Bonjour de Cohere!
Embedding: [0.15180704, 0.28215882, -0.056877363, 0.117460854, -0.044658754]

Text: ¡Hola desde Cohere!
Embedding: [0.2516583, 0.43137372, -0.08623046, 0.24681088, -0.11645193]

Text: Olá do Cohere!
Embedding: [0.18696906, 0.39113742, -0.046254586, 0.14583701, -0.11280365]

Text: Ciao da Cohere!
Embedding: [0.1157251, 0.43330532, -0.025885003, 0.14538017, 0.07029742]

Text: 您好,来自 Cohere!
Embedding: [0.24605744, 0.3085744, -0.11160592, 0.266223, -0.051633865]

Text: कोहेरे से नमस्ते!
Embedding: [0.19287698, 0.6350239, 0.032287907, 0.11751755, -0.2598813]
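Because these greetings share the same meaning, their embeddings end up close together in the shared multilingual space. As a minimal sketch (reusing the document_embeddings list computed above, and using the dot product the multilingual model is trained for), we can compare a few of them; the exact scores will vary:

import numpy as np

# compare the English greeting with its Arabic and German counterparts
english, arabic, german = (np.array(document_embeddings[i]) for i in range(3))

print(len(english))              # 768 dimensions
print(np.dot(english, arabic))   # relatively high: same meaning, different language
print(np.dot(english, german))   # likewise relatively high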

LangChain, a comprehensive library designed for language understanding and processing, serves as an ideal conduit for Cohere's advanced language models. It simplifies the integration of Cohere's multilingual embeddings into a developer's workflow, thus enabling a broader range of applications, from semantic search to customer feedback analysis and content moderation, across a multitude of languages.

When used in tandem with Cohere, LangChain eliminates the need for complex pipelines, making the process of generating and manipulating high-dimensional embeddings straightforward and efficient. Given a list of multilingual texts, the embed_documents() method in LangChain's CohereEmbeddings class, connected to Cohere’s embedding endpoint, can swiftly generate unique semantic embeddings for each text.

Deep Lake Vector Store

Vector stores are data structures or databases designed to store and manage high-dimensional vectors efficiently. They enable efficient similarity search, nearest neighbor search, and other vector-related operations. Vector stores can be built using various data structures such as approximate nearest neighbor (ANN) techniques, KD trees, or Vantage Point trees.

Deep Lake serves as both a data lake for deep learning and a multi-modal vector store. As a multi-modal vector store, it allows users to store images, audio, videos, text, and metadata in a format optimized for deep learning. It enables hybrid search, allowing users to search both embeddings and their attributes.

Users can save data locally, in their cloud, or on Activeloop storage. Deep Lake supports the training of PyTorch and TensorFlow models while streaming data with minimal boilerplate code. It also provides features like version control, dataset queries, and distributed workloads using a simple Python API.

Moreover, as datasets grow, it becomes increasingly difficult to store them in local memory. A local vector store would have sufficed for this particular example, since only a few documents are being uploaded. In a typical production setting, however, where thousands or millions of documents may be involved and accessed by various programs, a centralized cloud dataset becomes necessary.

Let’s see how to use Deep Lake for our example.

Creating Deep Lake Vector Store embeddings example

Deep Lake provides well-written documentation, including Jupyter notebooks for many of its examples; we can follow the one on vector store creation.

This task aims to leverage the power of NLP technologies, particularly OpenAI and Deep Lake, to generate and manipulate high-dimensional embeddings. These embeddings can be used for a variety of purposes, such as searching for relevant documents, moderating content, and answering questions. In this case, we will create a Deep Lake database for a retrieval-based question-answering system.

First, we need to import the required packages and ensure that the Activeloop and OpenAI keys are stored in the ACTIVELOOP_TOKEN and OPENAI_API_KEY environment variables. Getting an ACTIVELOOP_TOKEN is straightforward: you can generate one on the Activeloop page.

Install the deeplake library using pip:

pip install deeplake

Then make sure to specify the right API keys in the “OPENAI_API_KEY” and “ACTIVELOOP_TOKEN” environment variables.

Next, the necessary modules from the langchain package are imported.

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import DeepLake
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

We then create some documents using the RecursiveCharacterTextSplitter class.

# create our documents
texts = [
    "Napoleon Bonaparte was born in 15 August 1769",
    "Louis XIV was born in 5 September 1638",
    "Lady Gaga was born in 28 March 1986",
    "Michael Jeffrey Jordan was born in 17 February 1963"
]
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.create_documents(texts)

The next step is to create a Deep Lake database and load our documents into it.

# initialize embeddings model
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")

# create Deep Lake dataset
# TODO: use your organization id here. (by default, org id is your username)
my_activeloop_org_id = "<YOUR-ACTIVELOOP-ORG-ID>"
my_activeloop_dataset_name = "langchain_course_embeddings"
dataset_path = f"hub://{my_activeloop_org_id}/{my_activeloop_dataset_name}"
db = DeepLake(dataset_path=dataset_path, embedding_function=embeddings)

# add documents to our Deep Lake dataset
db.add_documents(docs)

If everything worked correctly, you should see a printed output like this:

Your Deep Lake dataset has been successfully created!
The dataset is private so make sure you are logged in!
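As noted earlier, for a small example like this the vector store could also live on local disk instead of Activeloop storage. A minimal sketch (the directory name here is arbitrary) would simply pass a local path:

# store the dataset in a local directory instead of the Activeloop cloud
local_db = DeepLake(dataset_path="./local_deeplake_db", embedding_function=embeddings)
local_db.add_documents(docs)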

We now create a retriever from the database.

# create retriever from db
retriever = db.as_retriever()

Finally, we create a RetrievalQA chain in LangChain and run it:

# instantiate the LLM wrapper
model = ChatOpenAI(model='gpt-3.5-turbo')

# create the question-answering chain
qa_chain = RetrievalQA.from_llm(model, retriever=retriever)

# ask a question to the chain
qa_chain.run("When was Michael Jordan born?")

This returns:

'Michael Jordan was born on 17 February 1963.'

This pipeline demonstrates how to leverage the power of the LangChain, OpenAI, and Deep Lake libraries and products to create a conversational AI model capable of retrieving and answering questions based on the content of a given dataset.

Let's break down each step to understand how these technologies work together.

  1. OpenAI and LangChain Integration: LangChain, a library built for chaining NLP models, is designed to work seamlessly with OpenAI's GPT-3.5-turbo model for language understanding and generation. You've initialized OpenAI embeddings using OpenAIEmbeddings(), and these embeddings are later used to transform the text into a high-dimensional vector representation. This vector representation captures the semantic essence of the text and is essential for information retrieval tasks.
  2. Deep Lake: Deep Lake is a Vector Store for creating, storing, and querying vector representations (also known as embeddings) of data.
  3. Text Retrieval: Using the db.as_retriever() function, you've transformed the Deep Lake dataset into a retriever object. This object is designed to fetch the most relevant pieces of text from the dataset based on the semantic similarity of their embeddings (see the short sketch after this list).
  4. Question Answering: The final step involves setting up a RetrievalQA chain from LangChain. This chain is designed to accept a natural language question, transform it into an embedding, retrieve the most relevant document chunks from the Deep Lake dataset, and generate a natural language answer. The ChatOpenAI model, which is the underlying model of this chain, is responsible for both the question embedding and the answer generation.
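As a quick illustration of the retrieval step on its own, a minimal sketch (reusing the retriever created above) queries it directly; get_relevant_documents returns the documents whose embeddings are closest to the query embedding:

# fetch the most relevant documents for a query, without the QA chain
relevant_docs = retriever.get_relevant_documents("When was Napoleon born?")
for doc in relevant_docs:
    print(doc.page_content)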

Conclusion

In conclusion, vector embeddings are a cornerstone in capturing and understanding the rich contextual information in our textual data. This representation becomes increasingly important when dealing with language models like GPT-3.5-turbo, which have a limited token capacity.

In this tutorial, we've used embeddings from OpenAI and incorporated embeddings from Hugging Face and Cohere. Hugging Face provides Transformer-based models that are highly versatile and widely used, while Cohere offers innovative multilingual language models that are a significant asset in a globally interconnected world.

Building upon these technologies, we've walked through the process of creating a conversational AI application, specifically a Q&A system leveraging Deep Lake. This application demonstrates the potential of these combined technologies - LangChain for chaining together complex NLP tasks, Hugging Face, Cohere, and OpenAI for generating high-quality embeddings, and Deep Lake for managing these embeddings in a vector store.

In the next lesson we’ll build a customer support question-answering chatbot leveraging our new knowledge about indexes and retrievers.