Introduction
This chapter focuses on designing and evaluating experiments to systematically assess RAG systems. Starting with dataset preparation, we guide you through setting up various query engine configurations and running targeted experiments. The primary emphasis is on evaluation—how to effectively measure and compare the performance of different RAG setups using structured scoring methods. By the end of this chapter, you’ll have a clear framework for experimenting with and analyzing RAG systems.
The evaluation methodology builds on ideas from my ARAGOG paper. While more advanced and automated methods for evaluation now exist, this chapter intentionally focuses on building the process from scratch. By starting from the fundamentals, this approach ensures you understand every step of the experimentation and evaluation workflow. In this context, the ARAGOG paper serves as a perfect foundation, as its "from-scratch" design aligns with the goals of this chapter.
Load the Dataset
For this example, we will use the AI-ArXiv dataset from Hugging Face (the same dataset used in the chapter on creating an eval dataset). This dataset contains research papers focused on AI, including their titles, summaries, authors, and full content. It provides a rich source of information suitable for generating diverse and challenging Q&A pairs, which are critical for robust RAG evaluation. The dataset contains 423 papers, which is an ideal size for experiments - enough noise to challenge the system, but not too costly to run.
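If you want to peek at the raw dataset before indexing, here is a minimal sketch using the Hugging Face datasets library (the jamescalam/ai-arxiv dataset ID is an assumption based on the ARAGOG repository; adjust it if your copy lives under a different name):
from datasets import load_dataset
# Sketch: inspect the raw AI-ArXiv dataset (dataset ID assumed, see note above)
papers = load_dataset("jamescalam/ai-arxiv", split="train")
print(f"Number of papers: {len(papers)}")  # the chapter works with 423 papers
print(papers.column_names)  # titles, summaries, authors, full content, ...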
We have already prepared this dataset in Deep Lake - the best vector store in the world (totally unbiased take). Before initializing the index, we define an embedding model that matches the dimensions of our Deep Lake dataset. This dataset is public, so you should be able to access it in read-only mode.
from llama_index.core import VectorStoreIndex
from llama_index.vector_stores.deeplake import DeepLakeVectorStore
from llama_index.embeddings.openai import OpenAIEmbedding
embed_model = OpenAIEmbedding(model="text-embedding-ada-002")
# Create an index over the documents
vector_store = DeepLakeVectorStore(dataset_path="al://matouseibich/ai-arxiv-test", read_only=True)
index = VectorStoreIndex.from_vector_store(
    vector_store=vector_store, embed_model=embed_model
)
Loading QA Pairs
We will use QA pairs from the ARAGOG paper. These questions are based on the AI-ArXiv dataset and are designed to evaluate RAG systems comprehensively, covering both simple and complex queries. If you want to create your own QA pairs, follow the instructions in the chapter on creating an eval dataset [LINK].
Code to Load QA Pairs:
import json
# Specify the path to the local JSON file (download from https://github.com/predlico/ARAGOG/blob/main/eval_questions/benchmark.json)
file_path = "benchmark.json"
# Load the JSON data
with open(file_path, "r") as f:
    data = json.load(f)

# Check the structure of the loaded data
if "questions" in data and "ground_truths" in data:
    questions = data["questions"]
    answers = data["ground_truths"]

    # Combine questions and answers into a list of dictionaries
    qa_pairs = [{"question": q, "answer": a} for q, a in zip(questions, answers)]
    print(f"Successfully loaded {len(qa_pairs)} QA pairs.\n")

    # Display a few QA pairs
    for qa in qa_pairs[:3]:  # Display the first 3 QA pairs
        print(f"Question: {qa['question']}")
        print(f"Answer: {qa['answer']}\n")
else:
    print("The JSON file does not contain 'questions' and 'ground_truths' keys. Please verify the structure.")
Sample QA Pairs
- Question: "What are the two main tasks BERT is pre-trained on?"
  Answer: "Masked LM (MLM) and Next Sentence Prediction (NSP)."
- Question: "What model sizes are reported for BERT, and what are their specifications?"
  Answer: "BERTBASE (L=12, H=768, A=12, Total Parameters=110M) and BERTLARGE (L=24, H=1024, A=16, Total Parameters=340M)."
- Question: "How does BERT's architecture facilitate the use of a unified model across diverse NLP tasks?"
  Answer: "BERT uses a multi-layer bidirectional Transformer encoder architecture, allowing for minimal task-specific architecture modifications in fine-tuning."
Initializing the LLM
In this section, we will initialize the large language models (LLMs) that will be used for our experiments. For this example, we will test two versions of the GPT-4o model: the full-sized version (gpt-4o) and the smaller, more efficient version (gpt-4o-mini). Using both models allows us to compare their performance on the same evaluation dataset, providing insights into how model size impacts RAG tasks.
from llama_index.llms.openai import OpenAI
import os
# Set your OpenAI API key
os.environ["OPENAI_API_KEY"] = ""
# Initialize the LLM with GPT-4o
llm_gpt4o = OpenAI(model="gpt-4o", temperature=0)
llm_gpt4o_mini = OpenAI(model="gpt-4o-mini", temperature=0)
Setting Up the Prompt Template for Answering
A well-designed prompt is critical for guiding the language model to generate accurate, context-based answers. In this step, we define a PromptTemplate that ensures the model adheres to specific rules while answering queries.
The template explicitly instructs the model to rely solely on the provided context for its responses, avoiding the use of any prior knowledge. Additionally, it emphasizes succinctness, restricting answers to a maximum of two sentences and 250 characters. By prohibiting direct references to the context (e.g., "Based on the context..."), the prompt ensures that the answers remain clear and professional. This structured approach to prompt engineering enhances the reliability and factual accuracy of the system.
from llama_index.core import PromptTemplate
text_qa_template = PromptTemplate("""You are an expert Q&A system that is trusted around the world for your factual accuracy.
Always answer the query using the provided context information, and not prior knowledge. Ensure your answers are fact-based and accurately reflect the context provided.
Some rules to follow:
1. Never directly reference the given context in your answer.
2. Avoid statements like 'Based on the context, ...' or 'The context information ...' or anything along those lines.
3. Focus on succinct answers that provide only the facts necessary, do not be verbose. Your answers should be max two sentences, up to 250 characters.
---------------------
{context_str}
---------------------
Given the context information and not prior knowledge, answer the query.
Query: {query_str}
Answer: """)
Setting Up Experiments
In this section, we set up various experiments to evaluate different query engine configurations. These experiments explore a combination of query transformation techniques, post-processing strategies, and language models to assess their impact on retrieval and answering performance.
Experiment Overview
- NAIVE Query Engine: This is the simplest configuration, where the query engine retrieves the top k most relevant chunks and directly generates answers using the selected LLM (e.g., GPT-4o or GPT-4o-mini).
- HyDE Transformation: The Hypothetical Document Embeddings (HyDE) approach expands the query by generating a hypothetical answer using the LLM and then embedding it alongside the original query. This enriched embedding improves the retrieval process.
- LLM Reranker: A post-processing step where the top retrieved documents are reranked based on their relevance to the query. The reranker uses the LLM to assign relevance scores, ensuring that only the most pertinent chunks are used for answering.
- Combination of HyDE and LLM Reranker: This combines the benefits of both techniques—query expansion for better retrieval and reranking for improved selection of relevant information.
Model Configurations
We run these experiments for two LLMs:
- GPT-4o: A high-performance language model with enhanced capabilities.
- GPT-4o-mini: A smaller variant designed for lower computational overhead.
Simplified Setup
While we demonstrate four configurations here, the possibilities for experiments are vast. For example, you could:
- Test multiple language models beyond GPT-4o variants.
- Explore different vector stores like graph-based RAG or hybrid setups.
- Experiment with a variety of prompt templates.
- Incorporate diverse index structures or similarity metrics.
In real-world scenarios, you might have dozens of experiments to optimize every aspect of the system, from retrieval accuracy to computational efficiency. This setup represents a starting point, and as your system grows, so will the complexity of your experimental framework.
from llama_index.core.indices.query.query_transform import HyDEQueryTransform
from llama_index.core.query_engine import TransformQueryEngine
from llama_index.core.postprocessor import LLMRerank
from llama_index.core import Settings
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-large")
# GPT-4o
## NAIVE
query_engine_naive_4o = index.as_query_engine(
    llm=llm_gpt4o,
    text_qa_template=text_qa_template,
    similarity_top_k=3,
    embed_model=embed_model,
)
## HYDE
hyde = HyDEQueryTransform(include_original=True)
query_engine_hyde_4o = TransformQueryEngine(query_engine_naive_4o, hyde)
## LLM rerank
llm_rerank = LLMRerank(choice_batch_size=10, top_n=3)
query_engine_llm_rerank_4o = index.as_query_engine(
similarity_top_k=10,
text_qa_template=text_qa_template,
node_postprocessors=[llm_rerank],
embed_model=embed_model,
llm=llm_gpt4o
)
## HyDE + LLM Rerank
query_engine_hyde_llm_rerank_4o = TransformQueryEngine(query_engine_llm_rerank_4o, hyde)
# GPT-4o-mini
## NAIVE
query_engine_naive_mini = index.as_query_engine(
    llm=llm_gpt4o_mini,
    text_qa_template=text_qa_template,
    similarity_top_k=3,
    embed_model=embed_model,
)
## HYDE
query_engine_hyde_mini = TransformQueryEngine(query_engine_naive_mini, hyde)
## LLM rerank
query_engine_llm_rerank_mini = index.as_query_engine(
similarity_top_k=10,
text_qa_template=text_qa_template,
node_postprocessors=[llm_rerank],
embed_model=embed_model,
llm=llm_gpt4o_mini
)
## HyDE + LLM Rerank
query_engine_hyde_llm_rerank_mini = TransformQueryEngine(query_engine_llm_rerank_mini, hyde)
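Before running the full experiment loop, it is worth a quick smoke test of a single engine on one question to confirm everything is wired up correctly (this is just a sanity check, not part of the evaluation itself):
# Quick sanity check of one configuration before launching all experiments
sample_question = "What are the two main tasks BERT is pre-trained on?"
print(query_engine_naive_4o.query(sample_question))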
Running the Experiments
This section describes how we run the experiments using the query engines set up earlier. The goal is to evaluate how each configuration performs when answering a (sub)set of questions from the QA dataset. This process provides initial insights into the effectiveness of various retrieval and answering strategies.
Workflow
- Query Engines: We define a dictionary of query engine configurations, each representing a unique combination of techniques (e.g., naive retrieval, HyDE query expansion, LLM reranking).
- Subset Evaluation: This run is intentionally executed only on the first two evaluation questions. Running it on the full dataset can cost a couple of dollars - so if you want to do that, just remove [:2]. The results at the end of the chapter are from the full run, so it is not necessary to run it on the full data yourself.
- Execution: Each query engine is used to answer the selected questions. The results are stored in a DataFrame for further analysis.
- Saving Results: The outputs are saved as a CSV file, enabling you to review and analyze the answers generated by each query engine.
Considerations
Running experiments on a subset is a good practice to debug and validate configurations before scaling to the entire dataset. Once validated, the same workflow can be extended to process the complete QA dataset or include additional query engines. This modular approach keeps the system flexible and efficient as the experiments grow in scope and complexity.
import pandas as pd

# Query engines (assuming they are already initialized)
query_engines = {
"naive_4o": query_engine_naive_4o,
"hyde_4o": query_engine_hyde_4o,
"llm_rerank_4o": query_engine_llm_rerank_4o,
"hyde_llm_rerank_4o": query_engine_hyde_llm_rerank_4o,
"naive_mini": query_engine_naive_mini,
"hyde_mini": query_engine_hyde_mini,
"llm_rerank_mini": query_engine_llm_rerank_mini,
"hyde_llm_rerank_mini": query_engine_hyde_llm_rerank_mini,
}
# Create a DataFrame with only the subset of QA pairs
results_df = pd.DataFrame(qa_pairs[:2]) # Limit DataFrame to first 2 pairs
# Iterate through engines and run queries
for engine_name, engine in query_engines.items():
    print(f"Running queries for engine: {engine_name}")
    results = []
    for idx, qa in enumerate(qa_pairs[:2]):  # Limit to the first 2 QA pairs
        print(f"Querying question {idx + 1}: {qa['question']}")
        response = engine.query(qa["question"])
        results.append(response)
    results_df[engine_name] = results  # Append results for the current engine
    print(f"Completed queries for engine: {engine_name}\n")
# Save the subset results to a CSV
results_df.to_csv("experiment_results_subset.csv", index=False)
print("Results saved to 'experiment_results_subset.csv'")
Evaluation Prompt
To systematically evaluate the quality of the answers generated by different query engines, we use a custom evaluation prompt. This prompt guides the LLM in assigning a numerical score to each answer based on its accuracy, relevance, and completeness compared to the ground truth.
Prompt Details
The evaluation prompt:
- Scale: Scores answers on a scale of 1 to 10.
- 1: The answer is completely incorrect or unrelated.
- 10: The answer is entirely accurate, detailed, and matches the ground truth.
- Criteria:
- Correctness: Does the answer align with the ground truth?
- Completeness: Does the answer cover all key aspects of the question?
- Relevance: Is the answer directly related to the question without irrelevant details?
Structure
The prompt is structured to include:
- Question: The original query posed to the engine.
- Truth: The ground truth answer from the dataset.
- Provided Answer: The answer generated by the query engine.
- Instructions: Clear guidelines for the LLM to evaluate the response objectively.
Considerations for Production Systems
While this setup evaluates answers using a single numerical score, real-world production systems often require more granular evaluation metrics. For example, you may want to assess:
- Accuracy: How factually correct the answer is.
- Completeness: Whether the answer includes all necessary details.
- Relevance: Whether the answer avoids unnecessary or unrelated information.
- Toxicity: Whether the answer contains any inappropriate or harmful content.
To achieve this, you could split the evaluation into multiple prompts, each focusing on a specific metric (and/or use a framework like RAGAS). This allows for a more detailed analysis and better insights into the strengths and weaknesses of each query engine. In this demonstration, we simplify by using a single combined score for clarity and efficiency.
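If you go the multi-prompt route, here is a minimal sketch of what it could look like (the template texts and the metric_prompts dictionary are illustrative, not part of this chapter's pipeline):
from llama_index.core import PromptTemplate

# Sketch: one evaluation template per metric instead of a single combined score.
# Each template is scored separately, giving a per-metric view of every answer.
metric_prompts = {
    "accuracy": PromptTemplate(
        "Rate the factual accuracy of the answer against the ground truth (1-10).\n"
        "Question: {question}\nTruth: {truth}\nAnswer: {new_answer}\n"
        "Provide only the numerical score."
    ),
    "completeness": PromptTemplate(
        "Rate how completely the answer covers the key aspects of the question (1-10).\n"
        "Question: {question}\nTruth: {truth}\nAnswer: {new_answer}\n"
        "Provide only the numerical score."
    ),
    "relevance": PromptTemplate(
        "Rate how relevant the answer is to the question, penalizing unrelated details (1-10).\n"
        "Question: {question}\nTruth: {truth}\nAnswer: {new_answer}\n"
        "Provide only the numerical score."
    ),
}
For the rest of this chapter, we stick with the single combined prompt defined below.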
evaluation_prompt = PromptTemplate("""Evaluate the accuracy of the provided answer based on the original question and the ground truth answer. Assign a score on a scale of 1 to 10, where:
- 1 means the answer is completely incorrect and unrelated to the question.
- 10 means the answer is completely accurate, detailed, and matches the ground truth.
### Question:
{question}
### Truth:
{truth}
### Provided Answer:
{new_answer}
### Instructions:
1. Compare the provided answer to the ground truth, considering correctness, completeness, and relevance to the question.
2. Assign a score based on how well the provided answer matches the ground truth.
3. If the provided answer is partially correct or incomplete, reduce the score accordingly.
4. If the provided answer is unrelated to the question or completely incorrect, assign the lowest score (1).
Provide only the numerical score.
""")
Evaluation
With the experiments run and their outputs stored, the next step is to evaluate the quality of the answers generated by each query engine. This process uses the previously defined evaluation prompt to compare each answer against the ground truth and assigns a numerical score on a scale of 1 to 10. The evaluation focuses on measuring accuracy, completeness, and relevance systematically for every engine and question.
Code Overview
- Iteration: The script iterates through all QA pairs and query engines, processing each question-answer pair.
- Prompt Filling: For each pair, the question, ground truth answer, and the generated answer are inserted into the evaluation prompt.
- LLM Evaluation: The prompt is sent to GPT-4o, which serves as the judge, providing a score based on predefined criteria.
- Error Handling: The script incorporates error handling to ensure that individual failures, such as malformed responses, do not disrupt the entire evaluation process.
- Storing Results: Scores are appended as new columns in the results DataFrame, with the complete dataset saved to a CSV file (
experiment_results_with_scores.csv
) for further analysis.
Using GPT-4o as the Judge
GPT-4o evaluates the generated answers, following the logic that the evaluation model (the "judge") should be more advanced than the models being evaluated (the "workers"). This ensures a higher standard of assessment and nuanced scoring. However, this approach may introduce bias, as GPT-4o might favor responses aligned with its reasoning patterns, potentially scoring its own outputs higher than those of GPT-4o-mini or other models.
Considerations for Production Systems
In production, evaluations often require more nuanced and multidimensional metrics. For example:
- Separate Metrics: Accuracy, relevance, completeness, and toxicity could be assessed independently, each with tailored prompts.
- Diverse Evaluators: Different models or even human reviewers could be incorporated to reduce bias and improve robustness; a minimal sketch of the multi-judge idea follows below.
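As a rough sketch of that multi-judge idea (kept separate from the evaluation loop below, which deliberately uses a single GPT-4o judge), you could average the scores of several judge models; the judge list here is illustrative:
# Sketch: average scores from multiple judge LLMs to reduce single-judge bias
judges = [llm_gpt4o, llm_gpt4o_mini]  # illustrative; ideally use diverse model families

def multi_judge_score(question, truth, new_answer):
    scores = []
    for judge in judges:
        prompt = evaluation_prompt.format(
            question=question, truth=truth, new_answer=new_answer
        )
        try:
            scores.append(float(judge.complete(prompt).text.strip()))
        except ValueError:
            continue  # skip judges that return a non-numeric response
    return sum(scores) / len(scores) if scores else None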
# Iterate over the dataset and calculate scores
for engine_name in query_engines.keys():
    scores = []
    for idx, row in results_df.iterrows():
        question = row["question"]
        truth = row["answer"]
        new_answer = row[engine_name]

        # Prepare the prompt with placeholders filled
        formatted_prompt = evaluation_prompt.format(
            question=question,
            truth=truth,
            new_answer=new_answer
        )

        # Call the LLM to evaluate
        try:
            response_obj = llm_gpt4o.complete(formatted_prompt)
            print("Raw response object:", response_obj)  # Debugging: Inspect the response structure

            # Extract the actual text from the CompletionResponse object
            response_text = response_obj.text.strip()  # Assuming 'text' contains the result
            print("Extracted response text:", response_text)  # Debugging: Verify the extracted text

            # Convert the text to a float score
            score = float(response_text)
            print("Parsed score:", score)  # Debugging: Verify the parsed score
            scores.append(score)
        except AttributeError as ae:
            print(f"Attribute error for engine {engine_name}, question {idx + 1}: {ae}")
            scores.append(None)
        except ValueError:
            print(f"Value error parsing score for engine {engine_name}, question {idx + 1}: {response_text}")
            scores.append(None)
        except Exception as e:
            print(f"Error evaluating for engine {engine_name}, question {idx + 1}: {e}")
            scores.append(None)

    # Add the scores as a new column to the DataFrame
    results_df[f"{engine_name}_score"] = scores
# Save the scored dataset
results_df.to_csv("experiment_results_with_scores.csv", index=False)
print("Results with scores saved to 'experiment_results_with_scores.csv'")
Results
IMPORTANT NOTE - the results in this section do not matter and are not the point of this chapter. The point is to show you how to do automatic evaluation, not to demonstrate that technique X is better than technique Y. That is heavily use-case dependent and for making such claims we would need experiments on 30 different datasets.
Having said that, let us create a nice plot for the results:
import matplotlib.pyplot as plt
# Calculate the mean scores for each _score column
score_columns = [col for col in results_df.columns if col.endswith("_score")]
mean_scores = results_df[score_columns].mean()
# Convert mean scores to percentages and sort them
mean_scores_percentage = mean_scores * 10 # Adjust scaling if necessary
sorted_mean_scores = mean_scores_percentage.sort_values()
# Create a scatter plot with sorted experiments
plt.figure(figsize=(8, 6))
plt.scatter(sorted_mean_scores.index, sorted_mean_scores, color='b', s=100) # Single dots for mean scores
plt.title("Mean Scores by Experiment")
plt.ylabel("Mean Score (%)")
plt.xlabel("Experiment")
plt.ylim(0, 100) # Y-axis starts at 0 and goes to 100%
plt.xticks(rotation=45, ha='right') # Rotate x-axis labels for better readability
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
The results are rather interesting. The best technique is definitely LLM reranking, which is described in detail in [LINK]. But the most surprising part is that experiments with gpt-4o-mini are generally rated higher than the ones with gpt-4o. This is quite unexpected, since gpt-4o is supposed to be the stronger model. To investigate deeper, you would need to do some "error analysis" - take individual QA pairs and their eval scores and decide whether the decisions of the judge LLM make sense. If they do not, you adjust the judge prompt or the experiment design. (Note: the results were generated using the LlamaIndex VectorStore; you might get slightly different outcomes with Deep Lake.)
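A minimal sketch of what that error analysis can look like in code (the column names follow the DataFrame built earlier; comparing the two naive configurations is just one example):
# Sketch: pull out questions where gpt-4o scored below gpt-4o-mini and inspect
# the answers plus the judge's scores manually.
suspicious = results_df[results_df["naive_4o_score"] < results_df["naive_mini_score"]]
for _, row in suspicious.iterrows():
    print("Question:     ", row["question"])
    print("Ground truth: ", row["answer"])
    print("gpt-4o:       ", row["naive_4o"], "| score:", row["naive_4o_score"])
    print("gpt-4o-mini:  ", row["naive_mini"], "| score:", row["naive_mini_score"])
    print("-" * 80)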
Important thing to mention here is the statistical validity of these results (do not run away please, it will be short 🤓). The thing with LLMs is that they are stochastic in nature. That means that every time we run an experiment and the evaluation that follows, we might get slightly different results/scores. And the difference can be whole percentage points, not just rounding errors. This can be mitigated by running everything multiple times and then taking an average, but it can get costly, because the total number of LLM calls is at least:

2 × N_QA × N_config × N_runs

where N_QA is the number of QA pairs in your evaluation dataset, N_config is the number of different configurations of your system you want to try (in our case 8, but it can be 100 or more), and N_runs is the number of re-runs you need for statistical validity (the factor of 2 covers one call to answer each question and one to judge it; HyDE and reranking add more). Unfortunately, I do not have a rule of thumb for the right number of re-runs; the variability in your individual use case may differ from mine. In general, doing 5-10 re-runs instead of a single run is a good idea. But hey, doing a single run of systematic evaluation is better than evaluating 10 QA pairs manually in a spreadsheet, so please do not get discouraged by my statistical aside.
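To put rough numbers on that, here is a tiny helper (a sketch; it computes the lower bound discussed above, and HyDE or reranking configurations add extra calls on top):
# Sketch: lower bound on LLM calls for a full study
# (one call to answer + one call to judge, per QA pair, per configuration, per re-run)
def estimate_llm_calls(n_qa_pairs, n_configs, n_reruns):
    return 2 * n_qa_pairs * n_configs * n_reruns

# e.g. the ARAGOG QA pairs, our 8 configurations, and 5 re-runs
print(estimate_llm_calls(len(qa_pairs), len(query_engines), 5))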
Conclusion
In this chapter, we explored a hands-on approach to evaluating RAG systems. Starting with the creation of a dataset and index, we defined experiments using different query engine configurations, including naive retrieval, query expansion with HyDE, and reranking with LLMs. Each configuration was tested on a set of QA pairs, and the results were systematically evaluated using a custom scoring prompt.
This process highlighted the importance of experimentation and provided a practical workflow for comparing model performance. While the evaluation was simplified to a single numerical score, the foundation laid here can be extended with more nuanced metrics and advanced methods.
Jupyter: Google Colab