Mastering Advanced RAG Techniques with LlamaIndex

Introduction

The performance of a Retrieval-Augmented Generation (RAG) pipeline depends heavily on the quality of its retrieval step, which can be improved through a range of techniques and advanced strategies. Methods like query expansion, query transformations, and query construction each play a distinct role in refining the search process, broadening the scope of search queries and improving overall result quality.

In addition to core methods, strategies such as reranking (with the Cohere Reranker), recursive retrieval, and small-to-big retrieval further enhance the retrieval process.

Together, these techniques create a comprehensive and efficient approach to information retrieval, ensuring that searches are wide-ranging, highly relevant, and accurate. Before starting this guide, make sure you install all the requirements in the requirements section.

Querying in LlamaIndex

As mentioned in a previous lesson, the process of querying an index in LlamaIndex is structured around several key components.

  • Retrievers: These classes are designed to retrieve a set of nodes from an index based on a given query. Retrievers source the relevant data from the index.
  • Query Engine: It is the central class that processes a query and returns a response object. Query Engine leverages the retrievers and the response synthesizer modules to curate the final output.
  • Query Transform: It is a class that enhances a raw query string with various transformations to improve the retrieval efficiency. It can be used in conjunction with a Retriever and a Query Engine.

Incorporating the above components can lead to the development of an effective retrieval engine, complementing the functionality of any RAG-based application. However, the relevance of search results can noticeably improve with more advanced techniques like query construction, query expansion, and query transformations.

Query Construction

Query construction in RAG converts user queries to a format that aligns with various data sources. This process involves transforming questions into vector formats for unstructured data, facilitating their comparison with vector representations of source documents to identify the most relevant ones. It also applies to structured data, such as databases where queries are formatted in a compatible language like SQL, enabling effective data retrieval.

The core idea is to answer user queries by leveraging the inherent structure of the data. For instance, a query like "movies about aliens in the year 1980" combines a semantic component like "aliens" (which will get better results if retrieved through vector storage) with a structured component like "year == 1980". The process involves translating a natural language query into the query language of a specific database, such as SQL for relational databases or Cypher for graph databases.

Which query construction approach to use depends on the specific use case. The first is metadata filtering for vector stores: the MetadataFilter classes, combined with an auto-retriever, translate natural language into a structured query. This involves defining the data source, interpreting the user query, extracting logical conditions, and forming the filtered request. The other approach is Text-to-SQL for relational databases. Converting natural language into SQL requests poses challenges such as hallucination (inventing fictitious tables or fields) and user errors (misspellings or other irregularities); these are addressed by providing the LLM with an accurate description of the database schema and using few-shot examples to guide query generation.
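To make the first approach concrete, below is a minimal sketch of the metadata-filtering flavor of query construction. It assumes a hypothetical movie_index (a VectorStoreIndex whose nodes carry a "year" metadata field); the index name, key, and value are illustrative and not part of the dataset used later in this lesson.

# A minimal sketch of query construction via metadata filtering.
# `movie_index` and the "year" metadata field are hypothetical.
from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters

# Structured component of the query: year == 1980
filters = MetadataFilters(filters=[ExactMatchFilter(key="year", value=1980)])

# The semantic component ("aliens") is matched against the vector store,
# while the filter restricts candidates to the requested year.
query_engine = movie_index.as_query_engine(filters=filters, similarity_top_k=5)
print(query_engine.query("movies about aliens"))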

Query construction improves RAG answer quality because logical filter conditions are inferred directly from the user's question, so the text chunks retrieved and passed to the LLM are already narrowed down before final answer synthesis.

💡
Query Construction is a process that translates natural language queries into structured or unstructured database queries, enhancing the accuracy of data retrieval.

Query Expansion

Query expansion works by extending the original query with additional terms or phrases that are related or synonymous.

For instance, if the original query is too narrow or uses specific terminology, query expansion can include broader or more commonly used terms relevant to the topic. Suppose the original query is "climate change effects." Query expansion would involve adding related terms or synonyms to this query, such as "global warming impact," "environmental consequences," or "temperature rise implications."

One approach is to use the synonym_expand_policy from the KnowledgeGraphRAGRetriever class. In the context of LlamaIndex, the effectiveness of query expansion is usually enhanced when combined with the Query Transform class.
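As an illustration of the general idea (not the synonym_expand_policy API itself), the sketch below asks the LLM for related phrasings, retrieves with each variant, and merges the deduplicated nodes. It assumes the vector_index built later in this lesson (or any VectorStoreIndex).

# An illustrative query-expansion sketch: the LLM proposes related phrasings,
# each variant is retrieved separately, and the resulting nodes are merged.
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-3.5-turbo")

original_query = "climate change effects"
expansion_prompt = (
    "List three alternative search phrases for the query "
    f"'{original_query}', one per line, without numbering."
)
variants = [original_query] + [
    line.strip()
    for line in llm.complete(expansion_prompt).text.splitlines()
    if line.strip()
]

# Retrieve with every variant and deduplicate nodes by node_id.
retriever = vector_index.as_retriever(similarity_top_k=3)
seen, expanded_nodes = set(), []
for query in variants:
    for node_with_score in retriever.retrieve(query):
        if node_with_score.node.node_id not in seen:
            seen.add(node_with_score.node.node_id)
            expanded_nodes.append(node_with_score)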

Query Transformation

Query transformations modify the original query to make it more effective in retrieving relevant information. Transformations can include changes in the query's structure, the use of synonyms, or the inclusion of contextual information.

Consider a user query like "What were Microsoft's revenues in 2021?" To enhance this query through transformations, the original query could be modified to be more like “Microsoft revenues 2021”, which is more optimized for search engines and vector DBs.

Query transformations involve changing the structure of a query to improve its performance.
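One concrete transformation built into LlamaIndex is HyDE (Hypothetical Document Embeddings), where the LLM drafts a hypothetical answer and its embedding drives retrieval instead of the raw question. Below is a minimal sketch, assuming the vector_index created later in this lesson (or any VectorStoreIndex).

from llama_index.core.indices.query.query_transform import HyDEQueryTransform
from llama_index.core.query_engine import TransformQueryEngine

# Wrap a plain query engine with the HyDE transformation.
hyde = HyDEQueryTransform(include_original=True)
base_engine = vector_index.as_query_engine(similarity_top_k=5)
hyde_engine = TransformQueryEngine(base_engine, query_transform=hyde)

print(hyde_engine.query("What were Microsoft's revenues in 2021?"))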

Query Engine

A Query engine is a sophisticated interface designed to interact with data through natural language queries. It's a system that processes queries and delivers responses. As mentioned in previous lessons, multiple query engines can be combined for enhanced functionality, catering to complex data interrogation needs.

For a more interactive experience resembling a back-and-forth conversation, a Chat Engine can be used in scenarios requiring multiple queries and responses, providing a more dynamic and engaging interaction with data.

A basic usage of query engines is to call the .as_query_engine() method on the created Index. This section will include a step-by-step example of creating indexes from text files and utilizing query engines to interact with the dataset.

The first step is installing the required packages using the Python package manager (pip), followed by setting the API key environment variables.

!pip3 install deeplake==3.9.27 langchain openai tiktoken llama-index cohere
%pip install llama-index-vector-stores-deeplake
%pip install llama-index-llms-openai
The sample code.
import os
import getpass
os.environ['ACTIVELOOP_TOKEN'] = getpass.getpass('Enter your ActiveLoop API key: ')
os.environ['OPENAI_API_KEY'] = getpass.getpass('Enter your OpenAI API key: ')
The sample code.

The next step is downloading the text file that serves as our source document. This file is a compilation of all the essays Paul Graham wrote on his blog, merged into a single text file. You have the option to download the file from the provided URL, or you can execute these commands in your terminal to create a directory and store the file.

!mkdir -p 'paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'paul_graham/paul_graham_essay.txt'
The sample code.

Now, use the SimpleDirectoryReader within the LlamaIndex framework to read all files from a specified directory. This class will automatically cycle through the files, reading them as Document objects.

from llama_index.core import SimpleDirectoryReader

# load documents
documents = SimpleDirectoryReader("./paul_graham").load_data()
The sample code.

We can now use the global Settings object to configure how the lengthy document is divided into smaller chunks with some overlap. Following this, we use the configured node parser to create the nodes from the loaded documents.

from llama_index.core import Settings
Settings.chunk_size = 512
Settings.chunk_overlap = 64

node_parser = Settings.node_parser

nodes = node_parser.get_nodes_from_documents(documents)
The sample code.

The nodes must be stored in a vector store database to enable easy access. The DeepLakeVectorStore class creates an empty dataset when given a new path. You can set the organization ID to genai360 to access the already processed dataset, or change it to your own Activeloop username to store the data in your workspace.

from llama_index.vector_stores.deeplake import DeepLakeVectorStore

my_activeloop_org_id = "your_org_id"
my_activeloop_dataset_name = "LlamaIndex_paulgraham_essays"
dataset_path = f"hub://{my_activeloop_org_id}/{my_activeloop_dataset_name}"

# Create the vector store (an empty Deep Lake dataset at the given path)
vector_store = DeepLakeVectorStore(dataset_path=dataset_path, overwrite=False)
The sample code.
Your Deep Lake dataset has been successfully created!
The output.

The new database will be wrapped as a StorageContext object, which accepts nodes to provide the necessary context for establishing relationships if needed. Finally, the VectorStoreIndex takes in the nodes along with links to the database and uploads the data to the cloud. Essentially, it constructs the index and generates embeddings for each segment.

from llama_index.core.storage.storage_context import StorageContext
from llama_index.core import VectorStoreIndex

storage_context = StorageContext.from_defaults(vector_store=vector_store)
storage_context.docstore.add_documents(nodes)
vector_index = VectorStoreIndex(nodes, storage_context=storage_context)
The sample code.
Uploading data to deeplake dataset.
100%|██████████| 40/40 [00:00<00:00, 40.60it/s]
|Dataset(path='hub://genai360/LlamaIndex_paulgraham_essays', tensors=['text', 'metadata', 'embedding', 'id'])

  tensor      htype      shape      dtype  compression
  -------    -------    -------    -------  ------- 
   text       text      (40, 1)      str     None   
 metadata     json      (40, 1)      str     None   
 embedding  embedding  (40, 1536)  float32   None   
    id        text      (40, 1)      str     None
The output.

The created index serves as the basis for defining the query engine. We initiate a query engine by using the vector index object and executing the .as_query_engine() method. The following code sets the streaming flag to True, which reduces idle waiting time for the end user (more details on this will follow). Additionally, it employs the similarity_top_k flag to specify the number of source documents it can consult to respond to each query.

query_engine = vector_index.as_query_engine(streaming=True, similarity_top_k=10)
The sample code.

The final step involves utilizing the .query() method to engage with the source data. We can pose questions and receive answers. As mentioned, the query engine employs retrievers and a response synthesizer to formulate an answer.

streaming_response = query_engine.query(
    "What does Paul Graham do?",
)
streaming_response.print_response_stream()
The sample code.
Paul Graham is an artist and entrepreneur. He is passionate about creating paintings that can stand the test of time. He has also co-founded Y Combinator, a startup accelerator, and is actively involved in the startup ecosystem. While he has a background in computer science and has worked on software development projects, his primary focus is on his artistic pursuits and supporting startups.
The output.

The query engine can be configured in streaming mode, providing a real-time response stream that enhances continuity and interactivity. This feature reduces idle time for end users: they see each word as it is generated instead of waiting for the model to produce the entire text. To observe the effect, call the print_response_stream method on the response object returned by the query engine.

Sub Question Query Engine

Sub Question Query Engine, a more sophisticated querying method, can be employed to address the challenge of responding to complex queries. This engine can generate several sub-questions from the user's main question, answer each separately, and then compile the responses to construct the final answer. First, we must modify the previous query engine by removing the streaming flag, which conflicts with this technique.

query_engine = vector_index.as_query_engine(similarity_top_k=10)
The sample code.

We register the created query_engine as a tool by employing the QueryEngineTool class and compose metadata (a name and description) for it. This informs the framework about the tool's function and enables it to select the most suitable tool for a given task, especially when multiple tools are available. Then, the list of tools we declared can be used to initialize the SubQuestionQueryEngine object.

from llama_index.core.tools import QueryEngineTool, ToolMetadata
from llama_index.core.query_engine import SubQuestionQueryEngine

query_engine_tools = [
    QueryEngineTool(
        query_engine=query_engine,
        metadata=ToolMetadata(
            name="pg_essay",
            description="Paul Graham essay on What I Worked On",
        ),
    ),
]

query_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=query_engine_tools,
    use_async=True,
)
The sample code.

The setup is ready to ask a question using the same query method. As observed, it formulates three questions, each responding to a part of the query, and attempts to find their answers individually. A response synthesizer then processes these answers to create the final output.

response = query_engine.query(
    "How was Paul Graham's life different before, during, and after YC?"
)
print( ">>> The final response:\n", response )
The sample code.
Generated 3 sub questions.
[pg_essay] Q: What did Paul Graham work on before YC?
[pg_essay] Q: What did Paul Graham work on during YC?
[pg_essay] Q: What did Paul Graham work on after YC?
[pg_essay] A: During YC, Paul Graham worked on writing essays and working on YC itself.
[pg_essay] A: Before YC, Paul Graham worked on a variety of projects. He wrote essays, worked on YC's internal software in Arc, and also worked on a new version of Arc. Additionally, he started Hacker News, which was originally meant to be a news aggregator for startup founders.
[pg_essay] A: After Y Combinator (YC), Paul Graham worked on various projects. He focused on writing essays and also worked on a programming language called Arc. However, he gradually reduced his work on Arc due to time constraints and the infrastructure dependency on it. Additionally, he engaged in painting for a period of time. Later, he worked on a new version of Arc called Bel, which he worked on intensively and found satisfying. He also continued writing essays and exploring other potential projects.

>>> The final response:
 Paul Graham's life was different before, during, and after YC. Before YC, he worked on a variety of projects including writing essays, developing YC's internal software in Arc, and creating Hacker News. During YC, his focus shifted to writing essays and working on YC itself. After YC, he continued writing essays but also worked on various projects such as developing the programming language Arc and later its new version called Bel. He also explored other potential projects and engaged in painting for a period of time. Overall, his work and interests evolved throughout these different phases of his life.
The output.

Custom Retriever Engine

As you might have noticed, the choice of retriever and its parameters (e.g., the number of returned documents) influences the quality and relevance of the results generated by the QueryEngine. LlamaIndex supports the creation of custom retrievers: combinations of different retriever styles that form more nuanced retrieval strategies adapted to individual queries. The RetrieverQueryEngine operates with a designated retriever that is specified at initialization time, and the choice of that retriever significantly impacts the quality of the query results.

There are two main retrievers that can power a RetrieverQueryEngine (a minimal wiring example follows the list):

  1. VectorIndexRetriever fetches the top-k nodes most similar to the query. It focuses on relevance and similarity, ensuring the results closely align with the query's intent. This is the approach we used in the previous subsections.
     Use Case: Ideal for situations where precision and relevance to the specific query are paramount, such as detailed research or topic-specific inquiries.
  2. SummaryIndexRetriever retrieves all nodes related to the query without prioritizing their relevance. This approach is less concerned with aligning closely to the specific context of the question and more about providing a broad overview.
     Use Case: Useful in scenarios where a comprehensive sweep of information is needed, regardless of direct relevance to the specific terms of the query, such as exploratory searches or general overviews.
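As a minimal wiring sketch (parameter values are illustrative), the first retriever type can be plugged into a RetrieverQueryEngine as follows, reusing the vector_index created earlier.

from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine

# Precision-oriented retriever: only the top-k most similar nodes.
retriever = VectorIndexRetriever(index=vector_index, similarity_top_k=5)

# The query engine delegates retrieval to the retriever it is built with.
custom_query_engine = RetrieverQueryEngine.from_args(retriever)
print(custom_query_engine.query("What does Paul Graham do?"))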

💡
You can read the following tutorial for a usage example: Building an Advanced Fusion Retriever from Scratch.

Reranking

While any retrieval mechanism capable of extracting multiple chunks from a large document can be reasonably efficient, there is always a likelihood that some irrelevant candidates end up among the results. Reranking is the process of re-evaluating and re-ordering search results to surface the most relevant options. By eliminating chunks with lower scores, the final context given to the LLM becomes more concentrated, which boosts overall efficiency.

The Cohere Reranker improves the performance of retrieving close content. While the semantic search component is already highly capable of retrieving relevant documents, the Rerank endpoint boosts the quality of the search results, especially for complex and domain-specific queries. It sorts the search results according to their relevance to the query. It is important to note that Rerank is not a replacement for a search engine but a supplementary tool for sorting search results in the most effective way possible for the user.

The process begins by grouping documents into batches; the reranking model then evaluates each batch and assigns a relevance score to every document. The final step aggregates the most relevant documents from all batches to form the final retrieval response. This method ensures that the most pertinent information is highlighted and becomes the focal point of the search outcomes.

The necessary dependencies have already been installed; the only remaining step is to obtain your API key from Cohere.com and substitute it for the placeholder provided.

import cohere
import getpass
import os

os.environ['COHERE_API_KEY'] = getpass.getpass('Enter your Cohere API key: ')

# Get your cohere API key on: www.cohere.com
co = cohere.Client(os.environ['COHERE_API_KEY'])

# Example query and passages
query = "What is the capital of the United States?"
documents = [
   "Carson City is the capital city of the American state of Nevada. At the  2010 United States Census, Carson City had a population of 55,274.",
   "The Commonwealth of the Northern Mariana Islands is a group of islands in the Pacific Ocean that are a political division controlled by the United States. Its capital is Saipan.",
   "Charlotte Amalie is the capital and largest city of the United States Virgin Islands. It has about 20,000 people. The city is on the island of Saint Thomas.",
   "Washington, D.C. (also known as simply Washington or D.C., and officially as the District of Columbia) is the capital of the United States. It is a federal district. ",
   "Capital punishment (the death penalty) has existed in the United States since before the United States was a country. As of 2017, capital punishment is legal in 30 of the 50 states.",
   "North Dakota is a state in the United States. 672,591 people lived in North Dakota in the year 2010. The capital and seat of government is Bismarck."
   ]
The sample code.

We call the rerank endpoint, passing both the query and the documents. We also set the top_n argument to 3, instructing the endpoint to return only the three highest-scored candidates. In this case, the model used for reranking is rerank-english-v3.0.

results = co.rerank(query=query, documents=documents, top_n=3, model='rerank-english-v3.0')  # Change top_n to change the number of results returned. If top_n is not passed, all results will be returned.

# Print rank, original document index, text, and relevance score.
# (Assumes a recent Cohere SDK where the response exposes a `results` list.)
for rank, result in enumerate(results.results, start=1):
    print(f"Document Rank: {rank}, Document Index: {result.index}")
    print(f"Document: {documents[result.index]}")
    print(f"Relevance Score: {result.relevance_score:.2f}\n")
The sample code.
Document Rank: 1, Document Index: 3
Document: Washington, D.C. (also known as simply Washington or D.C., and officially as the District of Columbia) is the capital of the United States. It is a federal district. The President of the USA and many major national government offices are in the territory. This makes it the political center of the United States of America.
Relevance Score: 0.99

Document Rank: 2, Document Index: 1
Document: The Commonwealth of the Northern Mariana Islands is a group of islands in the Pacific Ocean that are a political division controlled by the United States. Its capital is Saipan.
Relevance Score: 0.30

Document Rank: 3, Document Index: 5
Document: Capital punishment (the death penalty) has existed in the United States since before the United States was a country. As of 2017, capital punishment is legal in 30 of the 50 states. The federal government (including the United States military) also uses capital punishment.
Relevance Score: 0.27
The output.

This can be accomplished using LlamaIndex in conjunction with Cohere Rerank. The rerank object can be integrated into a query engine, allowing it to manage the reranking process seamlessly in the background. We will reuse the vector index defined earlier to avoid repetitive code and integrate the rerank object with it. The CohereRerank class initializes a rerank object by taking in the API key and the number of documents to be returned after scoring.

%pip install llama-index-postprocessor-cohere-rerank
!pip install llama-index cohere pypdf
import os
from llama_index.postprocessor.cohere_rerank import CohereRerank

cohere_rerank = CohereRerank(api_key=os.environ['COHERE_API_KEY'], top_n=2)
The sample code.

Now, we can employ the same as_query_engine method and use the node_postprocessors argument to incorporate the reranker object. The retriever initially selects the top 10 documents based on semantic similarity, and the reranker then reduces this number to 2.

query_engine = vector_index.as_query_engine(
    similarity_top_k=10,
    node_postprocessors=[cohere_rerank],
)

response = query_engine.query(
    "What did Sam Altman do in this essay?",
)
print(response)
The sample code.
Sam Altman was asked if he wanted to be the president of Y Combinator (YC) and initially said no. However, after persistent persuasion, he eventually agreed to take over as president starting with the winter 2014 batch.
The output.
💡
Rerank computes a relevance score for the query and each document and returns a sorted list from the most to the least relevant document.

The reranking process in search systems offers numerous advantages, including practicality, enhanced performance, simplicity, and integration capabilities. It allows for augmenting existing systems without requiring complete overhauls, making it a cost-effective solution for improving search functionality. Reranking elevates search systems, which is particularly useful for complex, domain-specific queries in embedding-based systems.

The Cohere Rerank has proven to be effective in improving search quality across various embeddings, making it a reliable option for enhancing search results.

Advanced Retrievals

An alternative method for retrieving relevant documents involves using document summaries instead of extracting fragmented snippets or brief text chunks to respond to queries. This technique ensures that the answers reflect the entire context or topic being examined, offering a more thorough grasp of the subject.
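A brief sketch of this idea using LlamaIndex's DocumentSummaryIndex, assuming the documents list loaded earlier in this lesson; building the index generates summaries with the configured LLM, so the call incurs LLM usage.

from llama_index.core import DocumentSummaryIndex

# An LLM-written summary is stored per document; at query time the summaries
# decide which documents are relevant, and their content backs the answer.
summary_index = DocumentSummaryIndex.from_documents(documents)
summary_query_engine = summary_index.as_query_engine()
print(summary_query_engine.query("What does Paul Graham do?"))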

Recursive Retrieval

The recursive retrieval method is particularly effective for documents with a hierarchical structure, where nodes form relationships and connections with one another. According to Jerry Liu, founder of LlamaIndex, this is evident in cases like a PDF, which may contain "sub-data" such as tables and diagrams alongside references to other documents. The technique navigates the graph of connected nodes precisely to locate information and is versatile enough to be applied in various scenarios, such as with node references, document agents, or even the query engine. For practical applications, including processing a PDF file and utilizing data from tables, you can refer to the tutorials in the LlamaIndex documentation here.
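A condensed, hypothetical sketch of the node-reference flavor: an IndexNode points to a sub-query engine (here a made-up table_query_engine over a table extracted from a PDF), and RecursiveRetriever follows that reference whenever the node is retrieved. The node text and engine are illustrative assumptions; `nodes` is reused from earlier in this lesson.

from llama_index.core import VectorStoreIndex
from llama_index.core.retrievers import RecursiveRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.schema import IndexNode

# Hypothetical reference node: its index_id points to a sub-query engine.
table_node = IndexNode(
    text="Table with Y Combinator batch statistics.",  # illustrative caption
    index_id="yc_table",
)
top_index = VectorStoreIndex([table_node] + nodes)  # `nodes` from earlier

recursive_retriever = RecursiveRetriever(
    "vector",  # id of the root retriever below
    retriever_dict={"vector": top_index.as_retriever(similarity_top_k=3)},
    query_engine_dict={"yc_table": table_query_engine},  # hypothetical engine
)
query_engine = RetrieverQueryEngine.from_args(recursive_retriever)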

Small-to-Big Retrieval

The small-to-big retrieval approach is a strategic method for information search: it starts by matching the question against concise, focused sentences to pinpoint the most relevant section of content, and then passes a longer surrounding text to the model, allowing for a broader understanding of the context preceding and following the targeted area. This technique is particularly useful when the initial query does not encompass all of the relevant information or when the relationships within the data are intricate and multi-layered.

The LlamaIndex framework implements this with the Sentence Window Retrieval technique, which uses the SentenceWindowNodeParser class to break documents down into individual sentences, one per node. Each node also stores a "window" of the sentences surrounding the main node sentence (a configurable number of sentences on each side). During retrieval, the single sentences initially retrieved are substituted with their respective windows, including the adjacent sentences, through the MetadataReplacementPostProcessor. This substitution ensures that the Large Language Model receives a comprehensive view of the context surrounding each sentence.
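A minimal sketch of this pipeline, assuming the documents loaded earlier in this lesson; the window_size and similarity_top_k values are illustrative.

from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import SentenceWindowNodeParser
from llama_index.core.postprocessor import MetadataReplacementPostProcessor

# One sentence per node; the surrounding sentences are kept in metadata.
sentence_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)
sentence_nodes = sentence_parser.get_nodes_from_documents(documents)
sentence_index = VectorStoreIndex(sentence_nodes)

# At query time, each retrieved sentence is replaced by its wider window.
window_query_engine = sentence_index.as_query_engine(
    similarity_top_k=5,
    node_postprocessors=[
        MetadataReplacementPostProcessor(target_metadata_key="window")
    ],
)
print(window_query_engine.query("What does Paul Graham do?"))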

💡
The small-to-big retrieval method begins by extracting small, targeted text segments and then presents the larger text chunks from which these segments were derived to the Large Language Model, thereby offering a more complete scope of information.

You can follow a hands-on tutorial to implement this technique from the documentation here.

Conclusion

Effective information retrieval involves mastering techniques such as query expansion, query transformations, and query construction, coupled with advanced strategies like reranking, recursive retrieval, and small-to-big retrieval. Together, these techniques enhance the search process by increasing accuracy and broadening the range of results. By incorporating these methods, information retrieval systems become more proficient in providing precise results, essential for improving the performance of RAG-based applications.

>> Notebook.

RESOURCES:

  • Cohere Rerank notebook
  • Recursive retrieval
  • LlamaIndex notebook