RAG evaluation tools: Overview of the open-source eval landscape

Introduction

This chapter covers different tools that help with evaluation, from automated optimization with AutoRAG to metrics-based frameworks like RAGAS and observability tools like TruLens. We’ll also briefly mention other tools such as Langfuse and LlamaIndex to give you a full picture of what is available.

AutoRAG

AutoRAG brings the concept of AutoML to RAG. It simplifies and automates the process of building and optimizing RAG pipelines, saving a lot of manual work. I came across AutoRAG on Reddit when I was looking for something similar to AutoML for the area of RAG/LLMs. After exchanging ideas with the team, we collaborated on a research paper, and I have been impressed with their work ever since. Their GitHub repo has 2.7k stars, and the tool is incredibly useful.

Key Features

  1. End-to-End Automation: AutoRAG handles the entire RAG pipeline, from retrieval to generation. Once set up, it automatically runs experiments to find the best configuration for your use case.

  2. Simple YAML Configuration: You set everything up with a YAML file (example below). It supports different retrieval and generation strategies, which makes it very flexible.

  3. Metrics-Driven Approach: AutoRAG optimizes pipelines based on a wide range of metrics, from traditional NLP metrics (F1, METEOR) to LLM-based metrics (consistency, relevance).

  4. Automatic Dataset Creation: AutoRAG can even create custom QA datasets for you, although this can be a bit tricky because it does not include human feedback (as discussed in the evaluation chapter).

How AutoRAG Works

AutoRAG follows a clear optimization loop:

  1. Pipeline Definition: Configure nodes (retrieval, generation) and strategies using a YAML file.
  2. Experimentation: The tool tests various configurations, running benchmarks on retrieval and generation outputs.
  3. Evaluation: AutoRAG uses specified metrics to rank configurations and identify the best-performing setup.

For example, the YAML file below defines a two-stage pipeline (retrieval and generation):

node_lines:
- node_line_name: retrieve_node_line  # Set Node Line (Arbitrary Name)
  nodes:
    - node_type: retrieval  # Set Retrieval Node
      strategy:
        metrics: [retrieval_f1, retrieval_recall, retrieval_ndcg, retrieval_mrr]  # Set Retrieval Metrics
      top_k: 3
      modules:
        - module_type: vectordb
          vectordb: default
        - module_type: bm25
        - module_type: hybrid_rrf
          weight_range: (4,80)
- node_line_name: post_retrieve_node_line  # Set Node Line (Arbitrary Name)
  nodes:
    - node_type: prompt_maker  # Set Prompt Maker Node
      strategy:
        metrics:   # Set Generation Metrics
          - metric_name: meteor
          - metric_name: rouge
          - metric_name: sem_score
            embedding_model: openai
      modules:
        - module_type: fstring
          prompt: "Read the passages and answer the given question. \n Question: {query} \n Passage: {retrieved_contents} \n Answer : "
    - node_type: generator  # Set Generator Node
      strategy:
        metrics:  # Set Generation Metrics
          - metric_name: meteor
          - metric_name: rouge
          - metric_name: sem_score
            embedding_model: openai
      modules:
        - module_type: openai_llm
          llm: gpt-4o-mini
          batch: 16

This configuration defines two stages: retrieval and generation. Each stage includes strategies, metrics, and modules, all specified in a straightforward format.
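
To give a sense of how a trial is launched once the YAML is written, here is a minimal sketch based on the Evaluator interface shown in the AutoRAG repository; the parquet paths are placeholders, and the exact API may differ between versions.

# Minimal sketch, assuming the Evaluator interface from the AutoRAG README;
# file paths are placeholders for your own data.
from autorag.evaluator import Evaluator

evaluator = Evaluator(
    qa_data_path="data/qa.parquet",          # QA dataset (AutoRAG can also generate one)
    corpus_data_path="data/corpus.parquet",  # corpus the retrieval modules search over
)
evaluator.start_trial("config.yaml")  # runs and scores every configured module combination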

Further Resources

  • GitHub Repository - Detailed documentation and examples for setting up pipelines.
  • Research Paper - Discusses the theoretical foundations and applications of AutoRAG.

Why AutoRAG Is Useful

Once you’ve set up the YAML file, AutoRAG runs everything automatically. It iterates over different configurations, evaluates them, and identifies the best-performing setup. This makes it easy to experiment with various strategies and metrics without writing custom scripts. AutoRAG is highly recommended for efficiently optimizing RAG systems at scale.

In the image below, you can see all the different modules AutoRAG can optimize.

[Image: Overview of the modules AutoRAG can optimize]

RAGAS

RAGAS is a lightweight framework designed to evaluate RAG pipelines. Unlike tools that handle retrieval or generation, RAGAS focuses solely on evaluation, making it an essential complement to tools like AutoRAG or LlamaIndex. It provides a structured approach to assess the performance of RAG pipelines across multiple metrics, even when you don’t have labeled datasets.

Key Features

  1. Works with Labeled and Unlabeled Datasets
     RAGAS is flexible enough to work with both labeled and unlabeled datasets:
     • With labeled data, you can compare generated answers to predefined ground truths.
     • Without labeled data, RAGAS uses reference-free metrics to evaluate aspects like faithfulness and relevance.
  2. Comprehensive Metrics Suite
     RAGAS evaluates multiple dimensions of RAG performance, including:
     • Faithfulness: How well the generated answer reflects the retrieved information.
     • Relevance: How closely the generated answer matches the query’s intent.
     • Context Recall and Relevancy: The quality and relevance of retrieved documents.
  3. Scalable Evaluation
     Since it doesn’t rely on human annotations, RAGAS is highly scalable, making it ideal for rapid experimentation and development cycles.

Evaluating with RAGAS: Amnesty QA Dataset Example

RAGAS simplifies evaluation for RAG pipelines by providing a suite of metrics to assess model performance. This example demonstrates how to evaluate a dataset using RAGAS with a sample from a real-world dataset, Amnesty QA, which contains questions and answers related to Amnesty International's topics.

We will use the following evaluation metrics:

  • Context Precision: Measures the proportion of retrieved context that is relevant to the query.
  • Faithfulness: Evaluates how well the generated answer aligns with the retrieved context.
  • Answer Relevancy: Assesses whether the answer is relevant to the query.
  • Context Recall: Examines the coverage of relevant context retrieved for the query.

# Install required libraries
# !pip install ragas
from ragas.metrics import (
    answer_relevancy,
    faithfulness,
    context_recall,
    context_precision,
)
from ragas import evaluate, SingleTurnSample, EvaluationDataset
import getpass
import os

# Prompt the user to enter the OpenAI API key securely
api_key = getpass.getpass(prompt='Enter your OpenAI API key: ')

# Set the OpenAI API key as an environment variable
os.environ["OPENAI_API_KEY"] = api_key

# Only one sample for demonstration, full dataset can be found at https://huggingface.co/datasets/explodinggradients/amnesty_qa
sample1 = SingleTurnSample(
    user_input='Which private companies in the Americas are the largest GHG emitters according to the Carbon Majors database?',
    retrieved_contexts=['The private companies responsible for the most emissions during this period, according to the database, are from the United States: ExxonMobil, Chevron and Peabody.\nThe largest emitter amongst state-owned companies in the Americas is Mexican company Pemex, followed by Venezuelan company Petróleos de Venezuela, S.A.'],
    response='According to the Carbon Majors database, the largest private companies in the Americas that are the largest GHG emitters are:\n\n1. Chevron Corporation (United States)\n2. ExxonMobil Corporation (United States)\n3. ConocoPhillips Company (United States)\n4. BP plc (United Kingdom, but with significant operations in the Americas)\n5. Royal Dutch Shell plc (Netherlands, but with significant operations in the Americas)\n6. Peabody Energy Corporation (United States)\n7. Duke Energy Corporation (United States)\n8. TotalEnergies SE (France, but with significant operations in the Americas)\n9. BHP Group Limited (Australia, but with significant operations in the Americas)\n10. Rio Tinto Group (United Kingdom/Australia, but with significant operations in the Americas)\n\nPlease note that the rankings may change over time as new data becomes available.',
    reference='The largest private companies in the Americas that are the largest GHG emitters according to the Carbon Majors database are ExxonMobil, Chevron, and Peabody.',
)

dataset = EvaluationDataset(samples=[sample1])

# Run the evaluation on the one-sample dataset
result = evaluate(
    dataset,
    metrics=[
        context_precision,
        faithfulness,
        answer_relevancy,
        context_recall,
    ],
)

# Convert results to a pandas DataFrame for better readability
df = result.to_pandas()

# Display the first few rows of the evaluation results
print(df.head())

Example output:

Query:

Which private companies in the Americas are the largest GHG emitters according to the Carbon Majors database?

Retrieved Context:

  1. The private companies responsible for the most emissions during this period, according to the database, are from the United States: ExxonMobil, Chevron and Peabody. The largest emitter amongst state-owned companies in the Americas is Mexican company Pemex, followed by Venezuelan company Petróleos de Venezuela, S.A.

Response:

According to the Carbon Majors database, the largest private companies in the Americas that are the largest GHG emitters are:

  1. Chevron Corporation (United States)
  2. ExxonMobil Corporation (United States)
  3. ConocoPhillips Company (United States)
  4. BP plc (United Kingdom, but with significant operations in the Americas)
  5. Royal Dutch Shell plc (Netherlands, but with significant operations in the Americas)
  6. Peabody Energy Corporation (United States)
  7. Duke Energy Corporation (United States)
  8. TotalEnergies SE (France, but with significant operations in the Americas)
  9. BHP Group Limited (Australia, but with significant operations in the Americas)
  10. Rio Tinto Group (United Kingdom/Australia, but with significant operations in the Americas)

Please note that the rankings may change over time as new data becomes available.

Reference:

The largest private companies in the Americas that are the largest GHG emitters according to the Carbon Majors database are ExxonMobil, Chevron, and Peabody.

Faithfulness Score:

0.0

The faithfulness score of 0.0 reflects the response's significant deviation from the retrieved context, as it introduces unsupported companies not mentioned in the evidence. This highlights the importance of faithfulness as a metric to ensure that generated answers are strictly grounded in the retrieved information.
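
For intuition, RAGAS computes faithfulness by decomposing the answer into atomic claims and checking each claim against the retrieved context; the snippet below is a conceptual sketch of that ratio, not the library’s actual implementation.

# Conceptual sketch of the faithfulness metric (not the RAGAS implementation):
# an LLM decomposes the answer into atomic claims, checks each one against the
# retrieved context, and the score is the fraction of supported claims.
def faithfulness_score(supported_claims: int, total_claims: int) -> float:
    return supported_claims / total_claims if total_claims else 0.0

# If none of the answer's claims can be verified from the context, the score is 0.0,
# as in the example above.
print(faithfulness_score(0, 10))  # 0.0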

TruLens

TruLens is an open-source tool for evaluating and improving RAG pipelines and LLM applications. What sets it apart is its focus on observability - not just providing metrics, but also integrating into your development workflow to help you monitor, debug, and iterate on your system in real time.

Key Features

  • Custom Feedback Functions: Evaluate different stages of your pipeline, such as queries, retrievals, and responses, with flexible feedback functions tailored to your needs (see the sketch after this list).
  • Real-Time Insights: TruLens integrates into your workflow, providing live observability for debugging and performance tracking during experiments or production.
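
To make the feedback-function idea concrete, here is a minimal sketch in the style of the pre-1.0 trulens_eval quickstart; rag_app is a stand-in pipeline, and the imports and class names are assumptions that may differ in current TruLens releases.

# Minimal sketch, assuming the pre-1.0 trulens_eval API; names may have changed.
from trulens_eval import Feedback, Tru, TruBasicApp
from trulens_eval.feedback.provider import OpenAI as OpenAIProvider

def rag_app(query: str) -> str:
    # stand-in for your real retrieve-then-generate pipeline
    return "ExxonMobil, Chevron and Peabody are the largest emitters."

provider = OpenAIProvider()

# Feedback function: score how relevant the response is to the input question.
f_answer_relevance = Feedback(provider.relevance).on_input_output()

recorder = TruBasicApp(rag_app, app_id="rag_v1", feedbacks=[f_answer_relevance])
with recorder as recording:
    recorder.app("Which private companies are the largest GHG emitters?")

Tru().run_dashboard()  # opens the local dashboard with traces and feedback scores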

Video Example

Here’s a video where Anupam Datta, founder of TruLens, explains the core principles:

Other tools

In addition to the previously discussed tools, several other platforms offer valuable features for evaluating and monitoring RAG systems. Here's an overview of these tools:

LlamaIndex Evaluation Modules

LlamaIndex provides comprehensive evaluation modules tailored for RAG systems. These modules assess both retrieval and generation components, offering metrics such as recall, precision, and context relevance. By integrating these evaluations, developers can fine-tune their RAG pipelines to enhance performance.
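
As a rough illustration, the sketch below uses LlamaIndex’s built-in response evaluators; it assumes you already have a query_engine built over your documents and an OpenAI key configured, and the exact import paths may vary by version.

# Minimal sketch of LlamaIndex's response evaluators; query_engine is assumed to exist.
from llama_index.core.evaluation import FaithfulnessEvaluator, RelevancyEvaluator
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4o-mini")
faithfulness_eval = FaithfulnessEvaluator(llm=llm)
relevancy_eval = RelevancyEvaluator(llm=llm)

query = "Which private companies are the largest GHG emitters?"
response = query_engine.query(query)  # query_engine: your existing LlamaIndex engine

print(faithfulness_eval.evaluate_response(response=response).passing)
print(relevancy_eval.evaluate_response(query=query, response=response).passing)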

Tonic Validate

Tonic Validate is a platform designed to streamline the evaluation of RAG applications. It offers a suite of metrics to assess accuracy, context quality, and latency, among other factors. With its user-friendly interface and integration capabilities, Tonic Validate enables developers to monitor and improve their RAG systems effectively.

Langfuse

Langfuse focuses on observability and monitoring for LLM applications. It provides tools to trace and debug workflows, capturing production traces and offering insights into system performance. While not exclusively an evaluation tool, Langfuse's monitoring capabilities are beneficial for maintaining and optimizing RAG systems in production environments.
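
To show what that tracing looks like in code, here is a sketch using Langfuse’s Python decorator API; the import path follows the v2 SDK (newer releases expose observe from the top-level package), and the retrieval and generation steps are stand-ins.

# Minimal sketch, assuming the Langfuse v2 Python SDK decorator API.
# Credentials are read from LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY / LANGFUSE_HOST.
from langfuse.decorators import observe

@observe()
def retrieve(query: str) -> list[str]:
    # stand-in retriever; each decorated call becomes a span in the trace
    return ["...retrieved passage..."]

@observe()
def rag_answer(query: str) -> str:
    contexts = retrieve(query)  # nested call shows up as a child span
    # stand-in generation step
    return f"Answer to '{query}' grounded in {len(contexts)} passage(s)."

rag_answer("Which private companies are the largest GHG emitters?")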

Conclusion

The tools in this chapter show different ways to evaluate RAG systems depending on your needs. AutoRAG focuses on automation, RAGAS provides structured metrics, and TruLens helps with real-time monitoring and debugging. Each tool serves a specific purpose, so you can choose the right one to fit your workflow and improve your system’s performance.

Coming up next: an end-to-end example where everything (hopefully) comes together! 🤩

Jupyter notebook: Google Colab