Iterative Optimization of LlamaIndex RAG Pipeline: A Step-by-Step Approach

Introduction

In previous lessons, we learned about advanced techniques and evaluation metrics for LlamaIndex Retrieval-Augmented Generation (RAG) pipelines. Building on this knowledge, we now focus on optimizing a LlamaIndex RAG pipeline through a series of iterative evaluations. We aim to enhance the system's ability to retrieve and generate accurate and relevant information. Before starting this guide, make sure you install all the requirements in the requirements section.

Here's our step-by-step plan:

  1. Baseline Evaluation: Construct a standard LlamaIndex RAG pipeline and establish an initial performance baseline.
    1. Adjusting top_k Retrieval Values: Experiment with different values of k (2, 4, 6, 8, 10) to understand their effect on the accuracy of retrieved information and the relevance of generated answers.
  2. Testing Different Embedding Models: Evaluate models such as OpenAI's "text-embedding-ada-002" and Cohere's "embed-english-v3.0" to identify the most effective one for our pipeline.
  3. Incorporating a Reranker: Implement a reranking mechanism to refine the document selection process of the retriever.
  4. Employing a Deep Memory Approach: Investigate the impact of a deep memory component on the accuracy of information retrieval.

Through these steps, we aim to refine our RAG system systematically, enhancing its performance by providing accurate and relevant information.

The code for this lesson is also available through a Colab notebook, where you can follow along.

1. Baseline evaluation

The first step is installing the required Python packages.

!pip3 install deeplake llama_index langchain openai tiktoken cohere pandas torch sentence-transformers

Here, you can set your API keys. You can skip the keys for any services you don't plan to use.

import os
import getpass
os.environ['ACTIVELOOP_TOKEN'] = getpass.getpass('Enter your ActiveLoop API key: ')
os.environ['OPENAI_API_KEY'] = getpass.getpass('Enter your OpenAI API key: ')
os.environ['COHERE_API_KEY'] = getpass.getpass('Enter your Cohere API key: ')

We download the data, which is a single text file. You can use this or replace it with your own data.

!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'

Let’s load the data and build LlamaIndex nodes/chunks.

from llama_index.core.node_parser import SimpleNodeParser
from llama_index.core import SimpleDirectoryReader

# First we create Document LlamaIndex objects from the text data
documents = SimpleDirectoryReader("./data/paul_graham/").load_data()
node_parser = SimpleNodeParser.from_defaults(chunk_size=512)
nodes = node_parser.get_nodes_from_documents(documents)

# By default, node/chunk IDs are random UUIDs. To keep the same IDs across runs, we set them manually.
for idx, node in enumerate(nodes):
    node.id_ = f"node_{idx}"

print(f"Number of Documents: {len(documents)}")
print(f"Number of nodes: {len(nodes)} with the current chunk size of {node_parser.chunk_size}")
Number of Documents: 1
Number of nodes: 58 with the current chunk size of 512
The output.

The next step is to create a LlamaIndex VectorStoreIndex object and use a DeepLakeVectorStore to store the vector embeddings.

We also choose gpt-3.5-turbo-1106 as our LLM and OpenAI’s text-embedding-ada-002 as our embedding model.

%pip install llama-index-vector-stores-deeplake
%pip install llama-index-llms-openai
from llama_index.core import VectorStoreIndex, ServiceContext, StorageContext
from llama_index.vector_stores.deeplake import DeepLakeVectorStore
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI

# Create a local Deep Lake VectorStore
dataset_path = "./data/paul_graham/deep_lake_db"
vector_store = DeepLakeVectorStore(dataset_path=dataset_path, overwrite=True)

# LLM that will answer questions with the retrieved context
llm = OpenAI(model="gpt-3.5-turbo-1106")
# We use OpenAI's embedding model "text-embedding-ada-002"
embed_model = OpenAIEmbedding()

service_context = ServiceContext.from_defaults(embed_model=embed_model, llm=llm,)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

vector_index = VectorStoreIndex(nodes, service_context=service_context, storage_context=storage_context, show_progress=True)
Generating embeddings: 100%
58/58 [00:06<00:00, 8.75it/s]
Uploading data to deeplake dataset.
100%|██████████| 58/58 [00:00<00:00, 169.79it/s]Dataset(path='./data/paul_graham/deep_lake_db', tensors=['text', 'metadata', 'embedding', 'id'])

  tensor      htype      shape      dtype  compression
  -------    -------    -------    -------  ------- 
   text       text      (58, 1)      str     None   
 metadata     json      (58, 1)      str     None   
 embedding  embedding  (58, 1536)  float32   None   
    id        text      (58, 1)      str     None
The output.

With the vector index, we can now build a QueryEngine, which generates answers with the LLM and the retrieved chunks of text.

query_engine = vector_index.as_query_engine(similarity_top_k=10)
response_vector = query_engine.query("What are the main things Paul worked on before college?")
print(response_vector.response)
Before college, Paul worked on writing and programming.
The output.

Now that we have a simple RAG pipeline, we can evaluate it. For that, we need a dataset; since we don’t have one, we will generate one. LlamaIndex offers a generate_question_context_pairs function for generating question and context pairs, which we will use to assess the RAG pipeline's chunk retrieval and response capabilities.

Let’s also save the generated dataset in JSON format for later use. In this case, we only generate 58 question-context pairs, but you can increase the number of samples for a more thorough evaluation.

from llama_index.core.evaluation import generate_question_context_pairs
qc_dataset = generate_question_context_pairs(
    nodes,
    llm=llm,
    num_questions_per_chunk=1
)
# We can save the dataset as a json file for later use.
qc_dataset.save_json("qc_dataset.json")
100%|██████████| 58/58 [01:30<00:00,  1.56s/it]
The output.

You can load the dataset from your local disk if you have already generated it.

from llama_index.core.evaluation import EmbeddingQAFinetuneDataset
qc_dataset = EmbeddingQAFinetuneDataset.from_json(
    "qc_dataset.json"
)
💡
We now have a synthetic dataset, but you should take the time to review it, or even consider building one manually; doing so will increase the accuracy of your RAG pipeline evaluations.
If you want more control over the quality of the generated questions, you can modify the prompt. This is the current default for the generate_question_context_pairs function.
DEFAULT_QA_GENERATE_PROMPT_TMPL = """\
Context information is below.

---------------------
{context_str}
---------------------

Given the context information and not prior knowledge.
generate only questions based on the below query.

You are a Teacher/ Professor. Your task is to setup \
{num_questions_per_chunk} questions for an upcoming \
quiz/examination. The questions should be diverse in nature \
across the document. Restrict the questions to the \
context information provided."
"""

With the generated dataset, we can first start with the retrieval evaluations.

We will use the RetrieverEvaluator class available in LlamaIndex to measure the Hit Rate and Mean Reciprocal Rank (MRR).

Hit Rate:

Think of the Hit Rate as playing a game of guessing. You're given a question and need to guess the correct answer from a list of options. The Hit Rate measures how often you guess the correct answer by only looking at your top few guesses. You have a high Hit Rate if you often find the right answer in your first few guesses.

So, in a retrieval system, it's about how frequently the system finds the correct document within its top 'k' picks (where 'k' is a number you decide, like top 5 or top 10).

Mean Reciprocal Rank (MRR):

MRR is like measuring how quickly you can find a treasure in a list of boxes. Imagine you have a row of boxes, and only one has a treasure. The MRR calculates how close to the start of the row the treasure box is, on average.

If the treasure is always in the first box you open, you're doing great and have an MRR of 1. If it's in the second box, the score is 1/2, since you took two tries to find it. If it's in the third box, your score is 1/3, and so on. MRR averages these scores across all your searches. So, for a retrieval system, MRR looks at where the correct document ranks in the system's guesses. If it's usually near the top, the MRR will be high, indicating good performance.

In summary, Hit Rate tells you how often the system gets it right in its top guesses, and MRR tells you how close to the top the right answer usually is. Both metrics are useful for evaluating the effectiveness of a retrieval system, like how well a search engine or a recommendation system works.
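To make these two metrics concrete before using LlamaIndex's evaluator, here is a small, self-contained sketch that computes them by hand; the retrieved and expected IDs below are made-up toy values, not results from our pipeline.

def hit_rate_and_mrr(retrieved_ids_per_query, expected_id_per_query):
    """Toy Hit Rate / MRR computation over ranked retrieval results."""
    hits = 0
    reciprocal_ranks = []
    for retrieved_ids, expected_id in zip(retrieved_ids_per_query, expected_id_per_query):
        if expected_id in retrieved_ids:
            hits += 1
            # Ranks are 1-based: first position contributes 1.0, second 0.5, and so on.
            reciprocal_ranks.append(1.0 / (retrieved_ids.index(expected_id) + 1))
        else:
            reciprocal_ranks.append(0.0)
    return hits / len(expected_id_per_query), sum(reciprocal_ranks) / len(reciprocal_ranks)

retrieved = [["node_3", "node_7"], ["node_1", "node_4"], ["node_9", "node_2"]]
expected = ["node_7", "node_1", "node_5"]
hit_rate, mrr = hit_rate_and_mrr(retrieved, expected)
print(hit_rate, mrr)  # 0.67 (2 of 3 hits) and 0.5 ((0.5 + 1.0 + 0.0) / 3)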

First, we define a function to display the Retrieval evaluation results in table format.

import pandas as pd

def display_results_retriever(name, eval_results):
    """Display results from evaluate."""

    metric_dicts = []
    for eval_result in eval_results:
        metric_dict = eval_result.metric_vals_dict
        metric_dicts.append(metric_dict)

    full_df = pd.DataFrame(metric_dicts)

    hit_rate = full_df["hit_rate"].mean()
    mrr = full_df["mrr"].mean()

    metric_df = pd.DataFrame(
        {"Retriever Name": [name], "Hit Rate": [hit_rate], "MRR": [mrr]}
    )

    return metric_df

Then, run the evaluation procedure.

from llama_index.core.evaluation import RetrieverEvaluator

# We can evaluate the retrievers with different top_k values.
for i in [2, 4, 6, 8, 10]:
    retriever = vector_index.as_retriever(similarity_top_k=i)
    retriever_evaluator = RetrieverEvaluator.from_metric_names(
        ["mrr", "hit_rate"], retriever=retriever
    )
    eval_results = await retriever_evaluator.aevaluate_dataset(qc_dataset)
    print(display_results_retriever(f"Retriever top_{i}", eval_results))
Retriever Name  Hit Rate       MRR
0  Retriever top_2  0.827586  0.702586
    Retriever Name  Hit Rate       MRR
0  Retriever top_4  0.913793  0.729167
    Retriever Name  Hit Rate       MRR
0  Retriever top_6  0.922414  0.730891
    Retriever Name  Hit Rate       MRR
0  Retriever top_8  0.956897  0.735509
     Retriever Name  Hit Rate       MRR
0  Retriever top_10  0.982759  0.738407
The output.

We notice that the Hit Rate increases as the top_k value increases, which is what we expect: we are increasing the probability of the correct chunk being included in the returned set.

But how does that impact the quality of the generated answers?

Next, we evaluate the Relevancy and Faithfulness metrics.

Relevancy evaluates whether the retrieved context and answer are relevant to the query.

Faithfulness evaluates if the answer is faithful to the retrieved contexts or, in other words, whether there’s a hallucination.

LlamaIndex includes evaluators for both metrics that use an LLM as the judge; here, GPT-4 will act as the judge.

Now, let's see how the top_k value affects these two metrics.

from llama_index.core.evaluation import RelevancyEvaluator, FaithfulnessEvaluator, BatchEvalRunner

for i in [2, 4, 6, 8, 10]:   
    # Set Faithfulness and Relevancy evaluators
    query_engine = vector_index.as_query_engine(similarity_top_k=i)

    # While we use GPT3.5-Turbo to answer questions
    # we can use GPT4 to evaluate the answers.
    llm_gpt4 = OpenAI(temperature=0, model="gpt-4-1106-preview")
    service_context_gpt4 = ServiceContext.from_defaults(llm=llm_gpt4)

    faithfulness_evaluator = FaithfulnessEvaluator(service_context=service_context_gpt4)
    relevancy_evaluator = RelevancyEvaluator(service_context=service_context_gpt4)

    # Run evaluation
    queries = list(qc_dataset.queries.values())
    batch_eval_queries = queries[:20]

    runner = BatchEvalRunner(
        {"faithfulness": faithfulness_evaluator, "relevancy": relevancy_evaluator},
        workers=8,
    )
    eval_results = await runner.aevaluate_queries(
        query_engine, queries=batch_eval_queries
    )
    faithfulness_score = sum(result.passing for result in eval_results['faithfulness']) / len(eval_results['faithfulness'])
    print(f"top_{i} faithfulness_score: {faithfulness_score}")

    relevancy_score = sum(result.passing for result in eval_results['relevancy']) / len(eval_results['relevancy'])
    print(f"top_{i} relevancy_score: {relevancy_score}")
top_2 faithfulness_score: 0.95
top_2 relevancy_score: 0.95
top_4 faithfulness_score: 0.95
top_4 relevancy_score: 0.95
top_6 faithfulness_score: 0.95
top_6 relevancy_score: 0.95
top_8 faithfulness_score: 1.0
top_8 relevancy_score: 1.0
top_10 faithfulness_score: 1.0
top_10 relevancy_score: 1.0
The output.

We notice that the relevancy and faithfulness scores increase as the top_k value increases, and we get a perfect score with eight or more retrieved chunks as context.

💡
Remember that these scores come from another LLM acting as a judge, GPT-4 in this case. Since LLM outputs are not deterministic, you can expect somewhat different results if you rerun the evaluations.

This is the LlamaIndex Relevancy prompt default template.

🛠
DEFAULT_EVAL_TEMPLATE = PromptTemplate(
    "Your task is to evaluate if the response for the query \
    is in line with the context information provided.\n"
    "You have two options to answer. Either YES/ NO.\n"
    "Answer - YES, if the response for the query \
    is in line with context information otherwise NO.\n"
    "Query and Response: \n {query_str}\n"
    "Context: \n {context_str}\n"
    "Answer: "
)

2. Changing the embedding model

Now that we have the baseline evaluation score, we can start changing some modules of our LlamaIndex RAG pipeline.

We can start by changing the embedding model. Here, we will test Cohere's embed-english-v3.0 embedding model instead of OpenAI’s text-embedding-ada-002.

%pip install llama-index-llms-cohere
%pip install llama-index-embeddings-cohere
%pip install llama-index-postprocessor-cohere-rerank
import os
from llama_index.core import VectorStoreIndex, ServiceContext, StorageContext
from llama_index.vector_stores.deeplake import DeepLakeVectorStore
from llama_index.embeddings.cohere import CohereEmbedding

from llama_index.llms.openai import OpenAI

# Create another local DeepLakeVectorStore to store the embeddings
dataset_path = "./data/paul_graham/deep_lake_db_1"
vector_store = DeepLakeVectorStore(dataset_path=dataset_path, overwrite=False)

llm = OpenAI(model="gpt-3.5-turbo-1106")
embed_model = CohereEmbedding(
    cohere_api_key=os.getenv('COHERE_API_KEY'),
    model_name="embed-english-v3.0",
    input_type="search_document",
)

service_context = ServiceContext.from_defaults(embed_model=embed_model, llm=llm,)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
vector_index = VectorStoreIndex(nodes, service_context=service_context, storage_context=storage_context, show_progress=True)
Generating embeddings: 100%
58/58 [00:02<00:00, 23.68it/s]
Uploading data to deeplake dataset.
100%|██████████| 58/58 [00:00<00:00, 315.69it/s]Dataset(path='./data/paul_graham/deep_lake_db_1', tensors=['text', 'metadata', 'embedding', 'id'])

  tensor      htype      shape      dtype  compression
  -------    -------    -------    -------  ------- 
   text       text      (58, 1)      str     None   
 metadata     json      (58, 1)      str     None   
 embedding  embedding  (58, 1024)  float32   None   
    id        text      (58, 1)      str     None
The output.

We run the retrieval evaluation using these new embeddings.

from llama_index.core.evaluation import RetrieverEvaluator

# Cohere v3 embeddings distinguish between document and query inputs,
# so we switch the input type to "search_query" at retrieval time.
embed_model.input_type = "search_query"
retriever = vector_index.as_retriever(similarity_top_k=10, embed_model=embed_model)

retriever_evaluator = RetrieverEvaluator.from_metric_names(
    ["mrr", "hit_rate"], retriever=retriever
)
eval_results = await retriever_evaluator.aevaluate_dataset(qc_dataset)
print(display_results_retriever(f"Retriever_cohere_embeds", eval_results))
Retriever Name  Hit Rate       MRR
0  Retriever_cohere_embeds  0.965517  0.754823
The output.

These embeddings show a lower Hit Rate but a better MRR value.

💡
In this tutorial, we test the Cohere embeddings, but you can also try any embedding model from the Hugging Face hub, as sketched below. The models at the top of the MTEB leaderboard are a good place to start.
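As a sketch, swapping in a model from the Hugging Face hub could look like the following; the llama-index-embeddings-huggingface package and the BAAI/bge-small-en-v1.5 model name are illustrative choices, not something evaluated in this lesson.

%pip install llama-index-embeddings-huggingface
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Any sentence-embedding model from the hub can be plugged into the pipeline here.
hf_embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
hf_service_context = ServiceContext.from_defaults(embed_model=hf_embed_model, llm=llm)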
💡
If pre-trained embedding models do not perform well on your data, consider fine-tuning your own embedding model.

3. Incorporating a Reranker

Here, we will be testing three different Rerankers that we learned about in previous lessons.

from llama_index.postprocessor.cohere_rerank import CohereRerank
from llama_index.core.postprocessor import SentenceTransformerRerank, LLMRerank
st_reranker = SentenceTransformerRerank(
    top_n=5, model="cross-encoder/ms-marco-MiniLM-L-6-v2"
)

llm_reranker = LLMRerank(
    choice_batch_size=4, top_n=5,
)
cohere_rerank = CohereRerank(api_key=os.getenv('COHERE_API_KEY'), top_n=10)
for reranker in [cohere_rerank, st_reranker, llm_reranker]:
    retriever_with_reranker = vector_index.as_retriever(similarity_top_k=10, postprocessor=reranker, embed_model=embed_model)

    retriever_evaluator_1 = RetrieverEvaluator.from_metric_names(
        ["mrr", "hit_rate"], retriever=retriever_with_reranker
    )
    eval_results1 = await retriever_evaluator_1.aevaluate_dataset(qc_dataset)
    print(display_results_retriever("Retriever with added Reranker", eval_results1))
config.json: 100%
794/794 [00:00<00:00, 23.6kB/s]
pytorch_model.bin: 100%
90.9M/90.9M [00:00<00:00, 145MB/s]
tokenizer_config.json: 100%
316/316 [00:00<00:00, 11.0kB/s]
vocab.txt: 100%
232k/232k [00:00<00:00, 3.79MB/s]
special_tokens_map.json: 100%
112/112 [00:00<00:00, 3.88kB/s]
                  Retriever Name  Hit Rate       MRR
0  Retriever with added Reranker  0.965517  0.754823
                  Retriever Name  Hit Rate       MRR
0  Retriever with added Reranker  0.965517  0.754823
                  Retriever Name  Hit Rate       MRR
0  Retriever with added Reranker  0.965517  0.754823
The output.

Here, we unfortunately don't see a significant improvement in the retriever's performance. We suspect it is mainly caused by the evaluation dataset we’ve built. Rerankers can nonetheless offer great benefits depending on your application and are easy to implement.
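Also note that the loop above evaluates the retriever on its own, while rerankers are usually wired into the generation path as node postprocessors on the query engine, so that the LLM sees the reranked chunks. A minimal sketch with the Cohere reranker defined above:

# Retrieve 10 candidates, rerank them, and pass the reordered chunks to the LLM.
query_engine_with_rerank = vector_index.as_query_engine(
    similarity_top_k=10,
    node_postprocessors=[cohere_rerank],
)
response = query_engine_with_rerank.query("What are the main things Paul worked on before college?")
print(response.response)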

4. Employing Deep Memory

Activeloop's Deep Memory is a feature that introduces a tiny neural network layer trained to match user queries with relevant data from a corpus. While this addition incurs minimal latency during search, it can boost retrieval accuracy by up to 27%.

First, let's reuse and convert our generated dataset into a format Deep Memory expects. We need queries and relevant IDs.

def create_query_relevance(qa_dataset):
    """Function for converting LlamaIndex dataset to correct format for deep memory training"""
    queries = [text for _, text in qa_dataset.queries.items()]
    relevant_docs = qa_dataset.relevant_docs
    relevance = []
    for doc in relevant_docs:
        relevance.append([(relevant_docs[doc][0], 1)])
    return queries, relevance

train_queries, train_relevance = create_query_relevance(qc_dataset)
print(len(train_queries))
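Before uploading anything, it can help to eyeball one converted entry; the printed values below are only illustrative of the expected shape, namely a list of query strings plus, for each query, a list of (node_id, relevance_score) tuples.

# Illustrative inspection of the converted data (actual strings will differ).
print(train_queries[0])    # a generated question about the essay
print(train_relevance[0])  # e.g. [('node_0', 1)]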

Now, let's upload our baseline vector store to Activeloop's cloud platform and convert it into a managed database.

import deeplake
local = "./data/paul_graham/deep_lake_db"
your_org = "your_org"
hub_path = f"hub://{your_org}/optimization_paul_graham"
hub_managed_path = f"hub://{your_org}/optimization_paul_graham_managed"

# First upload our local vector store
deeplake.deepcopy(local, hub_path, overwrite=True)
# Create a managed vector store
deeplake.deepcopy(hub_path, hub_managed_path, overwrite=True, runtime={"tensor_db": True})

Replace your_org in the paths with your organization name, and adjust the dataset names as you like.

Let’s create a LlamaIndex RAG pipeline using our new managed vector store.

%pip install llama-index-embeddings-openai
import os
from llama_index.core import VectorStoreIndex, ServiceContext, StorageContext
from llama_index.vector_stores.deeplake import DeepLakeVectorStore
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
embed_model = OpenAIEmbedding()

vector_store = DeepLakeVectorStore(dataset_path=hub_managed_path, overwrite=False, runtime={"tensor_db": True}, read_only=True)
llm = OpenAI(model="gpt-3.5-turbo-1106")

service_context = ServiceContext.from_defaults(embed_model=embed_model, llm=llm,)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

vector_index = VectorStoreIndex.from_vector_store(vector_store,service_context=service_context, storage_context=storage_context, use_async=False, show_progress=True)
Deep Lake Dataset in hub://genai360/optimization_paul_graham_managed already exists, loading from the storage
The output.

And now we can launch the Deep Memory training.

from langchain_openai import OpenAIEmbeddings
openai_embeddings = OpenAIEmbeddings()

job_id = vector_store._vectorstore.deep_memory.train(
    queries=train_queries,
    relevance=train_relevance,
    embedding_function=openai_embeddings.embed_documents,
)
Your Deep Lake dataset has been successfully created!
creating embeddings: 100%|██████████| 1/1 [00:02<00:00,  2.27s/it]
100%|██████████| 100/100 [00:00<00:00, 158.16it/s]
Dataset(path='hub://genai360/optimization_paul_graham_managed_queries', tensors=['text', 'metadata', 'embedding', 'id'])

  tensor      htype       shape      dtype  compression
  -------    -------     -------    -------  ------- 
   text       text      (100, 1)      str     None   
 metadata     json      (100, 1)      str     None   
 embedding  embedding  (100, 1536)  float32   None   
    id        text      (100, 1)      str     None   
DeepMemory training job started. Job ID: 652dceeed7d1579bf6abf3df
The output.

With the job ID, you can keep track of the Deep Memory training progress.

vector_store._vectorstore.deep_memory.status('652dceeed7d1579bf6abf3df')

To evaluate our Deep Memory-enabled vector store, we can generate a test dataset. Here, we only use the first 20 chunks to keep things fast, but a bigger dataset is recommended for a stronger evaluation.

from llama_index.core.evaluation import generate_question_context_pairs
# Generate test dataset
test_dataset = generate_question_context_pairs(
    nodes[:20],
    llm=llm,
    num_questions_per_chunk=1
)
test_dataset.save_json("test_dataset.json")

# We can also load the dataset from a json file if already done previously.
from llama_index.core.evaluation import EmbeddingQAFinetuneDataset
test_dataset = EmbeddingQAFinetuneDataset.from_json(
    "test_dataset.json"
)

test_queries, test_relevance = create_query_relevance(test_dataset)
100%|██████████| 20/20 [00:29<00:00,  1.49s/it]
The output.

Let’s evaluate the recall on the generated test dataset using Deep Lake's evaluation Python function.

Recall measures the proportion of relevant items successfully retrieved by the system from all relevant items available in the dataset.

Formula: Recall is calculated as:

\text{Recall} = \frac{\text{Number of Relevant Items Retrieved}}{\text{Total Number of Relevant Items in the Dataset}}

It focuses on the system's ability to find all relevant items. A high recall means the system is good at not missing relevant items.
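As a concrete illustration, independent of the Deep Memory evaluation call below, recall@k with exactly one relevant chunk per query reduces to checking whether that chunk appears in the top k results; the IDs below are toy values.

# Toy recall@k computation for queries with exactly one relevant item each.
def recall_at_k(retrieved_ids_per_query, relevant_id_per_query, k):
    found = sum(
        relevant_id in retrieved_ids[:k]
        for retrieved_ids, relevant_id in zip(retrieved_ids_per_query, relevant_id_per_query)
    )
    return found / len(relevant_id_per_query)

retrieved = [["node_3", "node_7", "node_2"], ["node_1", "node_4", "node_8"]]
relevant = ["node_7", "node_8"]
print(recall_at_k(retrieved, relevant, k=1))  # 0.0: neither relevant chunk is ranked first
print(recall_at_k(retrieved, relevant, k=3))  # 1.0: both appear within the top 3

Note that with a single relevant chunk per query, as in our generated dataset, recall@k and Hit Rate at the same k coincide numerically.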

💡
Compared to Hit Rate: Recall is about the system's thoroughness in retrieving all relevant items, and Hit Rate is about its effectiveness in ensuring that each query retrieves something relevant.
# Evaluate recall on the generated test dataset
recalls = vector_store._vectorstore.deep_memory.evaluate(
    queries=test_queries,
    relevance=test_relevance,
    embedding_function=openai_embeddings.embed_documents,
)
Embedding queries took 1.24 seconds
---- Evaluating without Deep Memory ---- 
Recall@1:	  55.2%
Recall@3:	  87.1%
Recall@5:	  90.5%
Recall@10:	  97.4%
Recall@50:	  100.0%
Recall@100:	  100.0%
---- Evaluating with Deep Memory ---- 
Recall@1:	  56.0%
Recall@3:	  87.1%
Recall@5:	  92.2%
Recall@10:	  99.1%
Recall@50:	  100.0%
Recall@100:	  100.0%
The output.

Now, let’s get the Hit Rate and MRR scores of our Deep Memory-enabled vector store.

We start by measuring the Hit Rate and MRR of our base vector store:

import os
from llama_index.postprocessor.cohere_rerank import CohereRerank
from llama_index.core.evaluation import (
    RetrieverEvaluator,
)

base_retriever = vector_index.as_retriever(similarity_top_k=10)
deep_memory_retriever = vector_index.as_retriever(
    similarity_top_k=10, vector_store_kwargs={"deep_memory": True}
)

base_retriever_evaluator = RetrieverEvaluator.from_metric_names(
    ["mrr", "hit_rate"], retriever=base_retriever
)
eval_results = await base_retriever_evaluator.aevaluate_dataset(test_dataset)
print(display_results_retriever("Retriever Results", eval_results))
Retriever Name  Hit Rate       MRR
0  Retriever Results  0.974138  0.717809
The output.

Now, we run the same evaluation for the Deep Memory-enabled vector store:

deep_memory_retriever = vector_index.as_retriever(
    similarity_top_k=10, vector_store_kwargs={"deep_memory": True}
)

dm_retriever_evaluator = RetrieverEvaluator.from_metric_names(
    ["mrr", "hit_rate"], retriever=deep_memory_retriever
)
dm_eval_results = await dm_retriever_evaluator.aevaluate_dataset(test_dataset)
print(display_results_retriever("Retriever Results", dm_eval_results))
Retriever Name  Hit Rate       MRR
0  Retriever Results  0.991379  0.72865
The output.

We see a small increase in both the Hit Rate and MRR scores compared to the baseline retriever. Note that the gains are modest mainly because of our evaluation test set and the fact that we only used 20 chunks. You can experiment with more chunks or manually build a different test set for more conclusive results, especially in your own application!

Conclusion

In this lesson, we optimized a LlamaIndex RAG pipeline through a structured, iterative approach to improve both information retrieval and generation quality.

We adjusted retrieval top_k values, evaluated two embedding models, introduced reranking mechanisms, and integrated Activeloop’s Deep Memory, with some of these changes leading to performance gains. The improvements were modest in this short demo, but these more advanced techniques are still worth trying: they can have a much larger impact on a real application paired with a better evaluation dataset. This also highlights the importance of good evaluation tooling, starting with a well-curated and sufficiently large evaluation dataset.

RESOURCES

  • Colab notebook for the lesson:
  • LlamaIndex and Deep Memory integration:

This lesson is based on the Llamaindex AI-engineer-workshop posted by Disiok.