Fine-tuning vs RAG; Introduction to Activeloop’s Deep Memory

Introduction

In this lesson, we will explore optimization techniques that maximize large language model performance. We will learn about the appropriate use of prompt engineering, retrieval-augmented generation (RAG), and fine-tuning, distinguishing what each method contributes and the specific challenges each presents.

A significant portion of the lesson will be dedicated to addressing the limitations of RAG systems in real-world applications, chiefly maintaining high retrieval accuracy and ensuring accurate responses from LLMs. Much of our discussion will focus on Activeloop's Deep Memory, a technique designed to improve the retrieval precision of embeddings for user queries.

We will also perform a detailed comparison of empirical data, analyzing the differences in retrieval recall rates between systems employing Deep Memory and those that do not. Before starting this guide, please make sure to install all the requirements in the requirements section.

Overview of RAG Enhancement Techniques

Expanding on the discussion surrounding fine-tuning, retrieval-augmented generation, and prompt engineering, it's essential to understand each approach's distinct strengths, weaknesses, and most suitable applications.

Prompt engineering

Prompt engineering is often the first step in enhancing the performance of an LLM for specific tasks. This approach alone can be sufficient, especially for simpler or well-defined tasks. Techniques like few-shot prompting, which provides a small number of task-specific examples to guide the LLM, can notably improve task performance. Chain of Thought (CoT) prompting can also improve reasoning capabilities and encourage the model to generate more detailed responses.

Combining Few-shot with RAG—using a tailored dataset of examples to retrieve the most relevant information for each query—can be more effective.
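
As a quick illustration, here is a minimal sketch of few-shot prompting with the same OpenAI chat API used later in this lesson; the sentiment-classification examples are made up for demonstration purposes.

from openai import OpenAI

client = OpenAI()

# A handful of task-specific examples guide the model toward the desired output format.
few_shot_messages = [
    {"role": "system", "content": "Classify the sentiment of the review as positive or negative."},
    {"role": "user", "content": "Review: The battery lasts all day and the screen is gorgeous."},
    {"role": "assistant", "content": "positive"},
    {"role": "user", "content": "Review: It stopped working after a week and support never replied."},
    {"role": "assistant", "content": "negative"},
    # The actual query follows the examples.
    {"role": "user", "content": "Review: Setup was painless and it works exactly as advertised."},
]

response = client.chat.completions.create(
    model="gpt-3.5-turbo-1106",
    messages=few_shot_messages,
)
print(response.choices[0].message.content)  # expected: positive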

Fine-tuning

Fine-tuning enhances an LLM's capabilities in the following areas:

  1. Modifying the structure or tone of responses.
  2. Teaching the model to follow complex instructions.

For example, fine-tuning enables models to perform tasks like extracting JSON-formatted data from text, translating natural language into SQL queries, or adopting a specific writing style.

Fine-tuning demands a large, high-quality, task-specific dataset for effective training. You can start with a small dataset and a short training run to see whether the method works for your task.
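
To make the data requirement concrete, here is a hedged sketch of what a few natural-language-to-SQL training examples could look like in OpenAI's chat fine-tuning JSONL format; the table names and queries are hypothetical.

import json

# Each training example is a short chat: instruction, user input, and the desired assistant output.
# The schema follows OpenAI's chat fine-tuning format; the content is illustrative only.
examples = [
    {"messages": [
        {"role": "system", "content": "Translate natural language into SQL."},
        {"role": "user", "content": "List the names of customers who signed up in 2023."},
        {"role": "assistant", "content": "SELECT name FROM customers WHERE signup_year = 2023;"},
    ]},
    {"messages": [
        {"role": "system", "content": "Translate natural language into SQL."},
        {"role": "user", "content": "How many orders are still unshipped?"},
        {"role": "assistant", "content": "SELECT COUNT(*) FROM orders WHERE shipped = FALSE;"},
    ]},
]

# Fine-tuning APIs typically expect one JSON object per line (JSONL).
with open("finetune_train.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")

In practice, such a file would contain many more examples covering the variations you expect at inference time.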

Fine-tuning is less effective in adapting to new, rapidly changing data or unfamiliar queries beyond the training dataset. It's also not the best choice for incorporating new information into the model. Alternative methods, such as Retrieval-Augmented Generation, are more suitable.

Retrieval-Augmented Generation

RAG specializes in incorporating external knowledge, enabling the model to access current and varied information.

Real-Time Updates: It is more adept at dealing with evolving datasets and can provide more up-to-date responses.

Complexity in Integration: Setting up a RAG system is more complex than basic prompting, requiring extra components like a Vector Database and retrieval algorithms.

Data Management: Managing and updating the external data sources is crucial for maintaining the accuracy and relevance of its outputs.

Retrieval Accuracy: Ensuring precise embedding retrieval is crucial in RAG systems to guarantee reliable and comprehensive responses to user queries. To address this, we will demonstrate how Activeloop's Deep Memory can greatly increase the recall of embedding retrieval.

RAG + Fine-tuning

Fine-tuning and RAG are not mutually exclusive techniques. Fine-tuning brings the advantage of customizing models for a specific style or format, which is useful in specialized domains such as medicine, finance, or law that require a highly specialized tone of writing.

When combined with RAG, the model becomes adept in its specialized area and gains access to a vast range of external information. The resulting model provides accurate responses in the niche area.

Implementing the two methods together can demand considerable resources for setup and ongoing upkeep, combining multiple fine-tuning training runs with the data-handling requirements inherent to RAG.


Enhanced RAG with Deep Memory

Deep Memory is a method developed by Activeloop to boost the accuracy of embedding retrieval in RAG systems built on the Deep Lake vector store.

Central to its functionality is an embedding transformation process. Deep Memory trains a model that transforms embeddings into a space optimized for your use case. This reconfiguration significantly improves vector search accuracy.

Deep Memory is effective where query reformulation, query transformation, or document re-ranking might cause latency and increased token usage. It boosts retrieval capabilities without negatively impacting the system's performance.

The figure below shows the recall performance for different algorithms compared to Deep Memory.

Recall@1: This measures whether the top result (i.e., the first result) returned by the retrieval system is relevant to the query.

Recall@10: This metric assesses whether the relevant document is within the top 10 results returned by the retrieval system.
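
As a rough sketch of how these metrics can be computed (assuming, for each query, you know the id of its single relevant document and the ranked list of ids the retriever returned):

def recall_at_k(relevant_ids: list[str], retrieved_ids: list[list[str]], k: int) -> float:
    """Fraction of queries whose relevant document appears in the top-k retrieved results."""
    hits = sum(1 for rel, retrieved in zip(relevant_ids, retrieved_ids) if rel in retrieved[:k])
    return hits / len(relevant_ids)

# Hypothetical example: 2 queries, each with one relevant document id.
relevant_ids = ["node_3", "node_7"]
retrieved_ids = [
    ["node_3", "node_1", "node_9"],   # relevant doc ranked first
    ["node_2", "node_7", "node_5"],   # relevant doc ranked second
]
print(recall_at_k(relevant_ids, retrieved_ids, k=1))   # 0.5
print(recall_at_k(relevant_ids, retrieved_ids, k=10))  # 1.0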

Comparison to Lexical search

BM25 is considered a state-of-the-art approach for "lexical search," which ranks documents based on the explicit presence of the query's terms in the documents. It's particularly effective for applications where the relevance of documents depends heavily on the presence of specific terms, such as in traditional search engines. However, BM25 does not account for the semantic relationships between words, which is where more advanced techniques like vector search with neural embeddings and semantic search come into play.
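
To see the lexical nature of BM25 in action, here is a minimal sketch using the rank_bm25 package, which is not part of this lesson's requirements (pip install rank_bm25); the toy documents are made up.

from rank_bm25 import BM25Okapi

corpus = [
    "Paul Graham wrote essays about startups and programming.",
    "Deep Lake is a vector database for storing embeddings.",
    "BM25 ranks documents by term frequency and inverse document frequency.",
]

# BM25 works on tokens, so we use a simple lowercase whitespace tokenizer here.
tokenized_corpus = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)

query = "vector database embeddings"
print(bm25.get_scores(query.lower().split()))  # highest score for the document sharing the query's exact terms

# A semantically related query with no overlapping terms scores zero for every document,
# which is exactly the gap that embedding-based retrieval addresses.
print(bm25.get_scores("finding semantically similar passages".lower().split()))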

Overview of Deep Memory

In the figure above, we see the Inference and Training workflow:

  1. Embeddings: Vector representation of a text sentence or set of words. We can create them using embedding models such as OpenAI’s text-embedding-ada-002 or open-source models.
  2. Deep Memory Training: A dataset of query and context pairs trains the Deep Memory model. This training process runs in Deep Lake Cloud, which provides the computational resources and infrastructure for handling the training.
  3. Deep Memory Inference: After training, the model enters the inference phase, in which it transforms query embeddings. We can use the Tensor Query Language (TQL) when running inference/queries in the Vector Store.
  4. Transformed Embeddings: The result of the inference process is a set of transformed embeddings optimized for a specific use case. This optimization means that the embeddings are now in a more conducive space for returning accurate results.
  5. Vector Search: These optimized embeddings are used in a vector search with standard similarity techniques (e.g., cosine similarity), leveraging the refined query embeddings to retrieve the most relevant data points for a given query, as sketched below.
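
As a purely conceptual sketch of steps 3-5, the snippet below stands in for the learned transformation with a random matrix and runs a plain NumPy cosine-similarity search; it is not Activeloop's actual Deep Memory implementation, which performs the transformation for you inside Deep Lake.

import numpy as np

rng = np.random.default_rng(0)
dim = 1536  # dimensionality of text-embedding-ada-002 vectors

# Stand-in for the transformation Deep Memory learns from query/context pairs.
# Here it is just a random matrix; the real model is trained in Deep Lake Cloud.
W = rng.normal(size=(dim, dim)) / np.sqrt(dim)

def transform_query(query_embedding: np.ndarray) -> np.ndarray:
    # Only the query embedding is transformed; stored document embeddings stay as they are.
    return query_embedding @ W

doc_embeddings = rng.normal(size=(100, dim))  # pretend these came from the vector store
query_embedding = rng.normal(size=(dim,))

q = transform_query(query_embedding)

# Standard cosine-similarity vector search over the transformed query.
scores = doc_embeddings @ q / (np.linalg.norm(doc_embeddings, axis=1) * np.linalg.norm(q))
top_10 = np.argsort(-scores)[:10]
print(top_10)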

Step by Step - Training a Deep Memory Model

Moving forward in our lesson, let's implement Deep Memory within our workflow to see firsthand how it impacts retrieval recall.

You can follow along with this Colab notebook.

As Step 0, please note that Deep Memory is a premium feature of Activeloop's paid plans, and a free trial is available. As part of the course, all course takers can get a free extended one-month trial of the Activeloop Growth plan by using the GENAI360 promo code at checkout. To redeem the plan, create a Deep Lake account and, on the screen that follows account creation, watch the accompanying video.

  1. Install the required libraries
!pip3 install deeplake langchain openai tiktoken llama-index
%pip install llama-index-vector-stores-deeplake
%pip install llama-index-llms-openai
  2. Set your ACTIVELOOP_TOKEN and OPENAI_API_KEY
import os, getpass
os.environ['ACTIVELOOP_TOKEN'] = getpass.getpass("Enter your ActiveLoop token: ")
os.environ['OPENAI_API_KEY'] = getpass.getpass("Enter your OpenAI API key: ")
  3. Download the data or use your own. Here, we download a text file hosted on GitHub.
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'
  4. Create the Llama-index nodes/chunks
from llama_index.core.node_parser import SimpleNodeParser
from llama_index.core import SimpleDirectoryReader

documents = SimpleDirectoryReader("./data/paul_graham/").load_data()
node_parser = SimpleNodeParser.from_defaults(chunk_size=512)
nodes = node_parser.get_nodes_from_documents(documents)

# By default, node/chunk ids are set to random uuids. To ensure the same ids across runs, we set them manually.
for idx, node in enumerate(nodes):
    node.id_ = f"node_{idx}"

print(f"Number of Documents: {len(documents)}")
print(f"Number of nodes: {len(nodes)} with the current chunk size of {node_parser.chunk_size}")
Number of Documents: 1
Number of nodes: 58 with the current chunk size of 512
The output.
  5. Create a local Deep Lake vector store
from llama_index.core import VectorStoreIndex, ServiceContext, StorageContext
from llama_index.vector_stores.deeplake import DeepLakeVectorStore
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI

# Create a DeepLakeVectorStore locally to store the vectors
dataset_path = "./data/paul_graham/deep_lake_db"
vector_store = DeepLakeVectorStore(dataset_path=dataset_path, overwrite=True)

# LLM that will answer questions with the retrieved context
llm = OpenAI(model="gpt-3.5-turbo-1106")
embed_model = OpenAIEmbedding()

service_context = ServiceContext.from_defaults(embed_model=embed_model, llm=llm,)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

vector_index = VectorStoreIndex(nodes, service_context=service_context, storage_context=storage_context, show_progress=True)
Uploading data to deeplake dataset.
100%|██████████| 58/58 [00:00<00:00, 274.94it/s]Dataset(path='./data/paul_graham/deep_lake_db', tensors=['text', 'metadata', 'embedding', 'id'])

  tensor      htype      shape      dtype  compression
  -------    -------    -------    -------  ------- 
   text       text      (58, 1)      str     None   
 metadata     json      (58, 1)      str     None   
 embedding  embedding  (58, 1536)  float32   None   
    id        text      (58, 1)      str     None
The output.
  6. Now, let's upload the local Vector Store to Activeloop's platform and convert it into a managed database.
import deeplake
local = "./data/paul_graham/deep_lake_db"
your_org = "your_org"
hub_path = f"hub://{your_org}/optimization_paul_graham"
hub_managed_path = f"hub://{your_org}/optimization_paul_graham_managed"

# First upload our local vector store
deeplake.deepcopy(local, hub_path, overwrite=True)
# Create a managed vector store under a different name
deeplake.deepcopy(hub_path, hub_managed_path, overwrite=True, runtime={"tensor_db": True})
  7. Instantiate a Vector Store with the managed dataset that we just created.
db = DeepLakeVectorStore(dataset_path=hub_managed_path, overwrite=False, read_only=True,)

Now, let’s generate a dataset of Queries and Documents

  8. Fetch our docs and ids from the vector store.
# Fetch dataset docs and ids 
docs = db._vectorstore.dataset.text.data(fetch_chunks=True, aslist=True)['value']
ids = db._vectorstore.dataset.id.data(fetch_chunks=True, aslist=True)['value']
print(len(docs))
  9. Generate a synthetic training dataset.

We need labeled data (query and document_id pairs) to train a Deep Memory model. It can be difficult to obtain labeled data when you are starting from scratch, so in this tutorial we generate queries/questions from our existing documents using gpt-3.5-turbo.

from openai import OpenAI
client = OpenAI()

def generate_question(text):
    try:
        response = client.chat.completions.create(
            model="gpt-3.5-turbo-1106",
            messages=[
                {"role": "system", "content": "You are a world class expert for generating questions based on provided context. \
                        You make sure the question can be answered by the text."},
                {
                    "role": "user",
                    "content": text,
                },
            ],
        )
        return response.choices[0].message.content
    except Exception:
        # Fall back to a sentinel string so the caller can skip this chunk.
        return "No question generated"
import random
from tqdm import tqdm

def generate_queries(docs: list[str], ids: list[str], n: int):

    questions = []
    relevances = []
    pbar = tqdm(total=n)
    while len(questions) < n:
        # 1. randomly draw a piece of text and relevance id
        r = random.randint(0, len(docs)-1)
        text, label = docs[r], ids[r]

        # 2. generate a query and assign the corresponding relevance id
        generated_qs = [generate_question(text)]
        if generated_qs == ["No question generated"]:
            print("No question generated")
            continue

        questions.extend(generated_qs)
        relevances.extend([[(label, 1)] for _ in generated_qs])
        pbar.update(len(generated_qs))

    return questions[:n], relevances[:n]

9.1 Launch the query generation process with a desired size of 40 queries/questions.

questions, relevances = generate_queries(docs, ids, n=40)
print(len(questions)) #40
print(questions[0])

By running the two cells above, you will have a list of generated questions and their associated contexts.

  10. Launch Deep Memory Training

Install the langchain-openai requirements

%pip install -qU langchain-openai

Run the deep memory training

from langchain_openai import OpenAIEmbeddings
openai_embeddings = OpenAIEmbeddings()

job_id = db._vectorstore.deep_memory.train(
    queries=questions,
    relevance=relevances,
    embedding_function=openai_embeddings.embed_documents,
)
Starting DeepMemory training job
Your Deep Lake dataset has been successfully created!
Preparing training data for DeepMemory: Creating 20 embeddings in 1 batches of size 20:: 100%|██████████| 1/1 [06:36<00:00, 396.77s/it]
DeepMemory training job started. Job ID: 657b3083d528b0fd224173c6
The output.

# During training you can check the status of the training run
db._vectorstore.deep_memory.status(job_id="657b3083d528b0fd224173c6")
--------------------------------------------------------------
|                  657b3083d528b0fd224173c6                  |
--------------------------------------------------------------
| status                     | completed                     |
--------------------------------------------------------------
| progress                   | eta: 0.9 seconds              |
|                            | recall@10: 60.00% (+25.00%)   |
--------------------------------------------------------------
| results                    | recall@10: 60.00% (+25.00%)   |
--------------------------------------------------------------
Output

We see an increase of 25 percentage points in recall@10 after fine-tuning.

  11. Run a Deep Memory-enabled inference by setting deep_memory=True.
from llama_index.llms.openai import OpenAI
query = "What are the main things Paul worked on before college?"

llm = OpenAI(model="gpt-3.5-turbo-1106")
embed_model = OpenAIEmbedding()

service_context = ServiceContext.from_defaults(embed_model=embed_model, llm=llm)

# Reconnect to the managed vector store in read-only mode.
db = DeepLakeVectorStore(dataset_path=hub_managed_path, overwrite=False, read_only=True)
# from_vector_store builds its own storage context from `db`, so we don't pass one explicitly.
vector_index = VectorStoreIndex.from_vector_store(db, service_context=service_context, show_progress=True)

query_engine = vector_index.as_query_engine(similarity_top_k=3, vector_store_kwargs={"deep_memory": True})
response_vector = query_engine.query(query)
print(response_vector.response)
  12. Now, let's run a quantitative evaluation on another set of synthetically generated test queries.
# Generate validation queries
validation_questions, validation_relevances = generate_queries(docs, ids, n=40)

# Launch the evaluation function
recalls = db._vectorstore.deep_memory.evaluate(
    queries=validation_questions,
    relevance=validation_relevances,
    embedding_function=openai_embeddings.embed_documents,
)
Embedding queries took 0.82 seconds
---- Evaluating without Deep Memory ----
Recall@1:	  27.0%
Recall@3:	  42.0%
Recall@5:	  42.0%
Recall@10:	  50.0%
Recall@50:	  67.0%
Recall@100:	  72.0%
---- Evaluating with Deep Memory ----
Recall@1:	  32.0%
Recall@3:	  45.0%
Recall@5:	  48.0%
Recall@10:	  55.0%
Recall@50:	  69.0%
Recall@100:	  73.0%
Output

Even with our new test dataset, we observe higher recall values when using Deep Memory. Comparing these results with those obtained on the training queries also underlines the importance of a high-quality query-context dataset that is representative of your use case.

Conclusion

In this lesson, we explored optimization techniques for large language models: prompt engineering as a first way to maximize LLM performance, fine-tuning for shaping style and instruction-following, and Retrieval-Augmented Generation (RAG) for integrating external, up-to-date knowledge.

We also discussed combining fine-tuning with RAG for complex, domain-specific applications, an approach that can require considerable resources. A significant focus was Activeloop's Deep Memory, which integrates into RAG systems to enhance embedding retrieval accuracy. Deep Memory outperformed traditional approaches such as BM25-based lexical search and plain vector search with cosine similarity, as we demonstrated through the higher recall values obtained. It also avoids the latency and extra token usage that query reformulation, query transformation, or document re-ranking can introduce.

This approach addresses key embedding retrieval challenges and signals a promising future for increasingly capable and versatile LLMs.

RESOURCES

  • Colab with the lesson code
  • A Survey of Techniques for Maximizing LLM Performance from OpenAI
  • Deep Memory Blog Post
  • Deep Memory Tutorial
  • Llama-index and Deep Memory