Build a RAG System with LLM and Deep Memory. Use three different datasets and test the quality of the responses with and without the Deep Memory feature to see how it improves retrieval!
Retrieval pipelines have traditionally relied on standard techniques such as vanilla RAG or query-based document retrieval. However, the results are often not fully satisfying: the retrieved chunks of text are not exactly what we expect. Many techniques are being explored to solve this problem.
In this guide, we will explore how to make the index used in the retrieval pipeline more effective through finetuning.
Unlike other techniques, Deep Memory automatically and conveniently finetunes the retrieval step on the chunks of data you provide, improving results compared to a classic, domain-agnostic RAG application.
Deep Memory addresses the critical need for accurate retrieval when generating high-quality results. It can increase the accuracy of Deep Lake's vector search by up to 22% by learning an index from labeled queries tailored to a specific application. Importantly, this improvement is achieved without compromising search time, demonstrating the efficacy of Deep Memory in fine-tuning the retrieval process.
The enterprise application landscape, particularly the development of “chat with your data” solutions, highlights the importance of accurate retrieval. Current practices involve the integration of Retrieval Augmented Generation (RAG) systems with Large Language Models (LLMs) like GPT-4.
In this context, Deep Memory's ability to significantly improve vector search accuracy becomes fundamental, as it increases the reliability of these applications. With accurate retrieval as the priority, integrating technologies like Deep Memory becomes a focal point for achieving consistent and precise results.
Let’s build a Deep Memory RAG application!
We need three main components: a dataset containing the text chunks we want to retrieve, a model to generate the text embeddings, and the Deep Memory feature.
To try the project, you can go directly to the GitHub repository and follow the instructions there. Now, let's take a detailed look at some parts of the code.
Get Deep Memory Access
As step 0, please note that Deep Memory is a premium feature available in Activeloop's paid plans, but you can redeem a free trial. As part of the course, all course takers can redeem a free extended one-month trial of the Activeloop Growth plan by using the GENAI360 promo code at checkout. To redeem the plan, please create a Deep Lake account and, on the screen that follows account creation, watch the following video for the remaining steps.
Preparing the Dataset
In this guide, we have prepared three different datasets that can be downloaded and tested with and without the Deep Memory feature.
To find out more about them, you can follow the links:
- Finance: we chose the FinQA Dataset, which contains text about the economy, acquisitions, and similar topics. It is also a QA dataset, which makes it easier for us to embed since the questions and related answers already exist and do not have to be generated! FinQA focuses on answering deep questions over financial data, aiming to automate the analysis of a large corpus of financial documents. In contrast to existing tasks in the general domain, the finance domain includes complex numerical reasoning and requires an understanding of heterogeneous representations. Source: https://github.com/czyssrs/FinQA
- Legal: the LegalBench Dataset contains questions and answers about legal subjects such as a company's legal rights and policies. These are tedious, highly specific topics that are not easily readable by everyone, so retrieving the right information for this task is very welcome! LegalBench tasks span multiple types (binary classification, multi-class classification, extraction, generation, entailment), multiple types of text (statutes, judicial opinions, contracts, etc.), and multiple areas of law (evidence, contracts, civil procedure, etc.). It is a benchmark consisting of different legal reasoning tasks. Source: https://huggingface.co/datasets/nguha/legalbench?clone=true
- Biomedical: to address a biomedical topic we chose the CORD-19 Dataset, which is about COVID-19. As it is a widely discussed topic, retrieving all the relevant information is crucial, so we wanted to test it. CORD-19 is a corpus of academic papers about COVID-19 and related coronavirus research. It is curated and maintained by the Semantic Scholar team at the Allen Institute for AI to support text mining and NLP research.
These three datasets are hosted in the Activeloop organization space, so you need to load them in the Tensor Database format to be able to take advantage of the Deep Memory functionality. In the code below, the variable user_hub is set to the organization name, in our case "activeloop", and name_db to the <dataset_name>.
def load_vector_store(user_hub, name_db):
    vector_store_db = DeepLakeVectorStore(
        f"hub://{user_hub}/{name_db}",
        embedding_function=embeddings_function.embed_documents,
        runtime={"tensor_db": True},
    )
    return vector_store_db
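For completeness, embeddings_function is not defined in the snippet above. Below is a minimal sketch of how it could be set up and the loader called, assuming LangChain's OpenAIEmbeddings as the embedding wrapper; the dataset name is illustrative.
from langchain.embeddings import OpenAIEmbeddings

# Assumed embedding wrapper; any model exposing embed_documents/embed_query would work
embeddings_function = OpenAIEmbeddings()

# "legal_dataset" is an illustrative name; use the actual dataset name from the Activeloop space
vector_store_db = load_vector_store("activeloop", "legal_dataset")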
The datasets were created with a preprocessing pipeline consisting of three steps:
- Gather the data
- Divide the data into chunks
- Create sample questions
The last step must be applied to every chunk. Specifically, it also creates a relevance score that represents how relevant the generated question is to the chunk of text (this is necessary for the most critical part: the Deep Memory finetuning).
Gather the data
The simplest solution is to download a QA dataset on our topic of interest, since it already contains everything we need. We can easily do that with the following command:
wget <source_link>
But what if our data is just a long text?
Chunk Generation
To generate the chunks, we can use libraries like Langchain, which provide methods to divide our text into chunks automatically.
Examples of the generated chunks for the Legal, Biomedical, and Finance datasets are shown later in this guide.
There are multiple strategies to do this step efficiently. For instance, we can use "." as a separator character, fix a standard length for each chunk, or use a combination of both!
We suggest creating chunks that are not too short and overlapping them to keep relevant information intact.
The main disadvantage of not overlapping chunks is the potential loss of information; depending on the nature of the data and the requirements of the analysis or modeling task, it's often beneficial to experiment with overlapping chunks.
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Divide the text into chunks
def create_chunks(context, chunk_size=300, chunk_overlap=50):
    # Initialize the text splitter with custom parameters
    custom_text_splitter = RecursiveCharacterTextSplitter(
        # Set custom chunk size and overlap
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        # Use the length of the text as the size measure
        length_function=len,
    )
    chunks = custom_text_splitter.split_text(context)
    return chunks
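As a quick usage sketch (the sample text here is illustrative):
sample_context = "Confidential Information means all confidential information relating to the Purpose which the Disclosing Party discloses or makes available to the Receiving Party."
chunks = create_chunks(sample_context, chunk_size=300, chunk_overlap=50)
print(f"Generated {len(chunks)} chunks")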
Questions and Relevance Generation
This is the most subtle step: how do we generate information like questions or scores without a properly trained model?
The good (not so old) LLMs come to the rescue to accomplish this task! We can use prompt engineering on an LLM to generate a question for each chunk and, at the same time, generate the relevance score as a classification task. We just need to call the model and then parse the text output to obtain the necessary data. To do this, we construct a dataset of questions and relevance. Relevance is a set of pairs (corpus.id: str, significance: str) that indicates where the answer is located inside the corpus. Sometimes an answer can be found in multiple locations or have different significance. Relevance enables Deep Memory training to optimize the embedding space for higher accuracy.
An example of how to generate questions and relevance scores is the following:
# Sample prompt to generate a Question and Relevance score for the provided context
system_message = """
Generate a question related to the context and provide a relevance score on a scale of 0 to 1, where 0 is not relevant at all and 1 is highly relevant.
The input is provided in the following format:
Context: [The context for the generated question]
The output is in the following format:
#Question#: [Text of the question]
#Relevance#: [score number between 0 and 1]
The context is: {context}
"""
from langchain.llms import OpenAI
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

def get_chunk_qa_data(context):
    # Generate the question and relevance text with the LLM
    llm = OpenAI(temperature=0)
    llm_chain = LLMChain(llm=llm, prompt=PromptTemplate.from_template(system_message))
    output = llm_chain(context)

    # Check for the relevance marker in the output
    check_relevance = None
    relevance_strings = ["#Relevance#: ", "Relevance#: ", "Relevance: ", "Relevance"]
    for rel_str in relevance_strings:
        if rel_str in output["text"]:
            check_relevance = rel_str
            break
    if check_relevance is None:
        raise ValueError("Relevance not found in the output")
    messages = output["text"].split(check_relevance)
    relevance = messages[1]

    # Check for the question marker in the output
    question = None
    question_strings = ["#Question#: ", "Question#: ", "Question: ", "Question"]
    for qst_str in question_strings:
        if qst_str in messages[0]:
            question = messages[0].split(qst_str)[1]
            break
    if question is None:
        raise ValueError("Question not found in the output")

    return question, relevance
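Putting the pieces together, the generated questions and relevance values can be collected into the two parallel lists that Deep Memory training expects: the query texts and, for each query, a list of (chunk id, significance) pairs. Below is a minimal sketch, assuming the chunk ids are the ids of the chunks inside the vector store and that we keep only questions the LLM judged sufficiently relevant; the helper name and threshold are illustrative.
def build_queries_and_relevance(chunks, chunk_ids, min_relevance=0.5):
    # chunk_ids: ids of the chunks inside the vector store (e.g. returned when adding them)
    queries, relevance = [], []
    for chunk, chunk_id in zip(chunks, chunk_ids):
        question, relevance_score = get_chunk_qa_data(chunk)
        # Skip pairs the LLM judged weakly relevant (threshold is an assumption)
        if float(relevance_score.strip()) < min_relevance:
            continue
        queries.append(question.strip())
        # Each query maps to a list of (corpus id, significance) pairs
        relevance.append([(chunk_id, 1)])
    return queries, relevance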
Now we are all set for the fun part!
Deep Memory Tool
Now we dive into the core of this tutorial: Deep Memory.
Deep Memory is one of the tools included in the High Performance Features of Deep Lake. It is very effective at improving the retrieval accuracy of an LLM application by optimizing your vector store for your use case, enhancing the performance of your overall LLM app.
This is done by fine-tuning the retrieval index over the embeddings produced by your embedding model, using your own dataset enriched with additional QA information, namely the questions and relevance scores (how related each question is to the text).
Creating the Deep Memory Vector Store
Deep Memory learns an index from labeled queries tailored to your dataset, without impacting search time. These results can be achieved with only a few hundred example pairs of prompt embeddings and the most relevant answers from the vector store.
As we can see, Deep Memory uses the dataset text (corpus) along with the questions (queries) and relevance scores we generated to train an enhanced retrieval model that can be used without any further modification, adding no latency while boosting retrieval quality.
Deep Memory increases retrieval accuracy without altering your existing workflow.
def load_vector_store(user_hub, name_db):
    vector_store_db = DeepLakeVectorStore(
        f"hub://{user_hub}/{name_db}",
        embedding_function=embeddings_function.embed_documents,
        runtime={"tensor_db": True},
    )
    return vector_store_db
In order to create a Deep Memory dataset, we just need three things:
- The chunks of text
- An embedding function to generate the text embeddings
- The metadata we prepared (questions and relevance scores)
As for the embedding function, embeddings can be computed using a model of your choice, such as OpenAI's ada-002 or open-source models like BGE by BAAI. Furthermore, search results from Deep Memory can be further improved by combining them with lexical search or a reranker. As for the metadata, we explained how to generate questions and relevance scores in the previous paragraphs.
To go deeper into the details, here are some examples of the chunks and questions produced during the training phase:
Legal Dataset:
- Chunk: "Confidential Information means all confidential information relating to the Purpose which the Disclosing Party or any of its Affiliates, discloses or makes available, to the Receiving Party or any of its Affiliates, before, on or after the Effective Date. This includes the fact that discussions and negotiations are taking place concerning the Purpose and the status of those discussions and negotiations."
- Question: What is the definition of Confidential Information?
Biomedical Dataset:
- Chunk: "The P2 64 and P3 regions encode the non-structural proteins 2B and 2C and 3A, 3B (1-3) (VPg), 3C pro and 4 structural protein-coding regions is replaced by reporter genes, allow the study of genome 68 replication without the requirement for high containment."
- Question: What are the non-structural proteins encoded by the P2 64 and P3 regions?
Finance Dataset:
- Chunk: "the deferred fuel cost revisions variance resulted from a revised unbilled sales pricing estimate made in december 2002 and a further revision made in the first quarter of 2003 to more closely align the fuel component of that pricing with expected recoverable fuel costs . the asset retirement obligation variance was due to the implementation of sfas 143 , "accounting for asset retirement obligations" adopted in january 2003 . see "critical accounting estimates" for more details on sfas 143 . the increase was offset by decommissioning expense and had no effect on net income . the volume variance was due to a decrease in electricity usage in the service territory . billed usage decreased 1868 gwh in the industrial sector including the loss of a large industrial customer to cogeneration."
- Question: What was the impact of the asset retirement obligation variance on net income?
By providing those inputs to the Vector Store, we can upload the dataset we created to the Activeloop dataset repository. After this step, the Deep Memory feature will do two things automatically, as sketched below:
- Generate the embeddings using the embedding model we defined
- Finetune the indices using the Deep Memory feature
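A minimal sketch of these two steps with the Deep Lake VectorStore API, assuming chunks, queries, and relevance were prepared as described above (argument names follow the Deep Lake documentation and may vary between versions):
# 1) Upload the chunks: the embeddings are generated with the embedding model we defined
chunk_ids = vector_store_db.add(
    text=chunks,
    embedding_function=embeddings_function.embed_documents,
    embedding_data=chunks,
    return_ids=True,
)

# 2) Finetune the retrieval indices with Deep Memory using the queries and relevance pairs
vector_store_db.deep_memory.train(
    queries=queries,
    relevance=relevance,  # one list of (chunk_id, significance) pairs per query
    embedding_function=embeddings_function.embed_documents,
)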
And we are all set to try our improved RAG applications!
But how do we try it out now?
Deep Memory Search
After creating the Deep Memory Dataset, we can search for the right piece of text for our question using the following code:
def get_answer(vector_store_db, user_question, deep_memory):
    # Deep Memory inside the vector store ==> deep_memory=True
    answer = vector_store_db.search(
        embedding_data=user_question,
        embedding_function=embeddings_function.embed_query,
        deep_memory=deep_memory,
        return_view=False,
    )
    return answer
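For example, the same question can be run twice, with and without Deep Memory, to compare the retrieved chunks side by side (using one of the sample questions suggested later in this guide):
question = "What are the provisions of this Agreement regarding the disclosure of Confidential Information to third parties?"
answer_dm = get_answer(vector_store_db, question, deep_memory=True)
answer_plain = get_answer(vector_store_db, question, deep_memory=False)
# search returns a dictionary of tensors; "text" holds the retrieved chunks
print(answer_dm["text"][0])
print(answer_plain["text"][0])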
Developing a Deep Memory Search with Gradio
We created a Gradio application to test the system more easily.
The interface allows us to select the dataset we want to test, write a question, and instantly generate the answer. We can also compare the response returned with Deep Memory enabled to the response returned without it.
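A minimal sketch of such an interface, reusing the load_vector_store and get_answer helpers defined earlier (the dataset names are illustrative):
import gradio as gr

DATASET_NAMES = ["legal_dataset", "biomedical_dataset", "finance_dataset"]  # illustrative names

def compare_answers(dataset_name, question):
    vector_store_db = load_vector_store("activeloop", dataset_name)
    with_dm = get_answer(vector_store_db, question, deep_memory=True)
    without_dm = get_answer(vector_store_db, question, deep_memory=False)
    # Return the top retrieved chunk for each setting
    return with_dm["text"][0], without_dm["text"][0]

demo = gr.Interface(
    fn=compare_answers,
    inputs=[gr.Dropdown(choices=DATASET_NAMES, label="Dataset"), gr.Textbox(label="Question")],
    outputs=[gr.Textbox(label="With Deep Memory"), gr.Textbox(label="Without Deep Memory")],
)
demo.launch()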
Classic RAG vs Deep Memory
To test out the improvements brought by the Deep Memory step, we prepared and shared the three datasets mentioned earlier: Legal, Biomedical, and Finance. In the output windows, you can see the benefits of this tool when compared to more classical approaches.
If you want to try these models, we suggest trying one of these questions:
- Legal Dataset:
- What are the provisions of this Agreement regarding the disclosure of Confidential Information to third parties?
- Biomedical Dataset:
- What are the advantages of using the new package to visualize data?
- Finance Dataset:
- What were the primary factors that contributed to the improvement in net cash provided by operating activities during 2015?
The following example, taken from the Legal dataset, shows how the model with Deep Memory generates a more complete response:
Deep Memory model:
The provisions of this Agreement state that disclosure of Confidential Information to third party consultants and professional advisors is allowed, as long as those third parties agree to be bound by this Agreement. Additionally, both parties are required to keep any confidential information they may have access to confidential, unless required by law or necessary to perform their obligations under this Agreement. This includes not only the information itself, but also the terms of the Agreement and the fact that the parties are considering a business arrangement.
Non Deep Memory model:
The provisions of this Agreement state that disclosure of Confidential Information to third party consultants and professional advisors is allowed, as long as those third parties agree to be bound by this Agreement. Additionally, the Confidential Information includes the terms of this agreement, the fact that the information is being made available, and the possibility of a business arrangement between the parties.
Evaluation Metrics
After testing our datasets, we can see Deep Memory's contribution in the retrieval of information that is more suitable to the query the user provides as a question. The following metrics show how the Deep Memory feature improves performance (a sketch of how they can be computed follows the numbers):
Legal Dataset:
---- Evaluating without Deep Memory ----
Recall@1: 12.0%
Recall@3: 37.0%
Recall@5: 47.0%
Recall@10: 57.0%
Recall@50: 87.0%
Recall@100: 94.0%
---- Evaluating with Deep Memory ----
Recall@1: 19.0%
Recall@3: 56.0%
Recall@5: 66.0%
Recall@10: 79.0%
Recall@50: 88.0%
Recall@100: 95.0%
Biomedical Dataset:
---- Evaluating without Deep Memory ----
Recall@1: 59.0%
Recall@3: 75.0%
Recall@5: 78.0%
Recall@10: 81.0%
Recall@50: 91.0%
Recall@100: 94.0%
---- Evaluating with Deep Memory ----
Recall@1: 69.0%
Recall@3: 81.0%
Recall@5: 83.0%
Recall@10: 86.0%
Recall@50: 97.0%
Recall@100: 98.0%
Finance Dataset:
---- Evaluating without Deep Memory ----
Recall@1: 18.0%
Recall@3: 51.0%
Recall@5: 65.0%
Recall@10: 71.0%
Recall@50: 98.0%
Recall@100: 99.0%
---- Evaluating with Deep Memory ----
Recall@1: 26.0%
Recall@3: 66.0%
Recall@5: 75.0%
Recall@10: 81.0%
Recall@50: 99.0%
Recall@100: 99.0%
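For reference, recall numbers like the ones above can be obtained with Deep Lake's built-in evaluation helper, run on a held-out set of query/relevance pairs that were not used for training (a sketch; the variable names are illustrative and the exact signature may differ between deeplake versions):
# Evaluate retrieval recall with and without Deep Memory on held-out queries
recalls = vector_store_db.deep_memory.evaluate(
    queries=test_questions,
    relevance=test_relevance,
    embedding_function=embeddings_function.embed_documents,
)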
In conclusion, it is crucial to recognize that in NLP, success is determined not only by the richness of the data but also by the effectiveness of the retrieval strategy. Although collecting large and diverse datasets is undeniably valuable, how the information is retrieved and presented plays a key role in optimizing model performance. As we have seen in this brief guide, tools such as Deep Memory allow us to be more accurate and efficient and thus generate more relevant answers.