Building a Custom Document Retrieval Tool with Deep Lake and LangChain: A Step-by-Step Workflow

Introduction

This lesson walks through building an efficient document retrieval system that extracts useful information from a service's FAQs. The goal is to answer users' questions quickly by fetching the documents most relevant to how the company operates.

Sifting through multiple sources or FAQ pages is tedious for users. Our retrieval system steps in here, providing concise, accurate answers quickly and saving users time and effort.

Workflow

  1. Setting up Deep Lake: Deep Lake is a vector store, a database designed for storing and querying high-dimensional vectors efficiently. In our case, we use Deep Lake to store document embeddings and their corresponding text.
  2. Storing documents in Deep Lake: Once Deep Lake is set up, we create embeddings for our documents using an OpenAI embedding model. Each document's text is fed into the model, and the output is a high-dimensional vector representing the text's semantic content. The embeddings and their corresponding documents are then stored in Deep Lake, giving us a vector database ready to be queried.
  3. Creating the retrieval tool: Next, we use LangChain to create a custom tool that interacts with Deep Lake. This tool is essentially a function that takes a query as input and returns the most similar documents from Deep Lake as output. To find them, the tool first computes the embedding of the query with the same model used for the documents, then queries Deep Lake, which returns the documents whose embeddings are most similar to the query embedding.
  4. Using the tool with an agent: Finally, we plug this custom tool into a LangChain agent. When the agent receives a question, it uses the tool to retrieve relevant documents from Deep Lake and then uses its language model to generate a response based on them.

Let’s start! First, we set the OpenAI API key and the Activeloop token as environment variables:

import os
os.environ["OPENAI_API_KEY"] = "<INSERT_YOUR_OPENAI_API_KEY>"
os.environ["ACTIVELOOP_TOKEN"] = "<INSERT_YOUR_ACTIVELOOP_TOKEN>"

Setting up Deep Lake

Next, we'll set up a Deep Lake vector database and add some documents to it. The hub path for a Deep Lake dataset is in the format hub://<org_id>/<dataset_name>. Remember to install the required packages with the following command: pip install langchain==0.0.208 deeplake==3.9.27 openai==0.27.8 tiktoken.

# We'll use an embedding model to compute the embeddings of our documents
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import DeepLake

# instantiate embedding model
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")

# create Deep Lake dataset
# We'll store the documents and their embeddings in the deep lake vector db
# TODO: use your organization id here. (by default, org id is your username)
my_activeloop_org_id = "<YOUR-ACTIVELOOP-ORG-ID>"
my_activeloop_dataset_name = "langchain_course_custom_tool"
dataset_path = f"hub://{my_activeloop_org_id}/{my_activeloop_dataset_name}"
db = DeepLake(dataset_path=dataset_path, embedding_function=embeddings)

You should now be able to visualize your dataset on the Activeloop website.

Storing documents in Deep Lake

We can then add some FAQs related to PayPal as our knowledge base.

# add faqs to the dataset
faqs = [
    "What is PayPal?\nPayPal is a digital wallet that follows you wherever you go. Pay any way you want. Link your credit cards to your PayPal Digital wallet, and when you want to pay, simply log in with your username and password and pick which one you want to use.",
    "Why should I use PayPal?\nIt's Fast! We will help you pay in just a few clicks. Enter your email address and password, and you're pretty much done! It's Simple! There's no need to run around searching for your wallet. Better yet, you don't need to type in your financial details again and again when making a purchase online. We make it simple for you to pay with just your email address and password.",
    "Is it secure?\nPayPal is the safer way to pay because we keep your financial information private. It isn't shared with anyone else when you shop, so you don't have to worry about paying businesses and people you don't know. On top of that, we've got your back. If your eligible purchase doesn't arrive or doesn't match its description, we will refund you the full purchase price plus shipping costs with PayPal's Buyer Protection program.",
    "Where can I use PayPal?\nThere are millions of places you can use PayPal worldwide. In addition to online stores, there are many charities that use PayPal to raise money. Find a list of charities you can donate to here. Additionally, you can send funds internationally to anyone almost anywhere in the world with PayPal. All you need is their email address. Sending payments abroad has never been easier.",
    "Do I need a balance in my account to use it?\nYou do not need to have any balance in your account to use PayPal. Similar to a physical wallet, when you are making a purchase, you can choose to pay for your items with any of the credit cards that are attached to your account. There is no need to pre-fund your account."
]
db.add_texts(faqs)

# Get the retriever object from the deep lake db object
retriever = db.as_retriever()
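
At this point, you can optionally sanity-check the retriever by querying it directly. The snippet below is a minimal sketch (the query string is just an illustrative example); it should return Document objects whose page_content matches the most similar FAQs:

# Optional sanity check: query the retriever directly
docs = retriever.get_relevant_documents("Is it safe to pay with PayPal?")
for doc in docs:
    print(doc.page_content[:100])  # preview the start of each retrieved FAQ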

Creating the retrieval tool

Now, we’ll construct the custom tool function that will retrieve the relevant documents from the Deep Lake database:

from langchain.agents import tool

# We define some variables that will be used inside our custom tool
# We're creating a custom tool that looks for relevant documents in our deep lake db
CUSTOM_TOOL_N_DOCS = 3 # number of retrieved docs from deep lake to consider
CUSTOM_TOOL_DOCS_SEPARATOR = "\n\n" # how to join together the retrieved docs to form a single string

# We use the tool decorator to wrap a function that will become our custom tool
# Note that the tool has a single string as input and returns a single string as output
# The name of the function will be the name of our custom tool
# The docstring of the function will be the description of our custom tool
# The description is used by the agent to decide whether to use the tool for a specific query
@tool
def retrieve_n_docs_tool(query: str) -> str:
    """ Searches for relevant documents that may contain the answer to the query."""
    docs = retriever.get_relevant_documents(query)[:CUSTOM_TOOL_N_DOCS]
    texts = [doc.page_content for doc in docs]
    texts_merged = CUSTOM_TOOL_DOCS_SEPARATOR.join(texts)
    return texts_merged

Our function, retrieve_n_docs_tool, has a single purpose: to search for and retrieve relevant documents based on a given query. It accepts one string as input, the user's question, and returns one string as output.

To find the relevant documents, the function calls the retriever object's get_relevant_documents method, which searches for and returns a list of the most relevant documents for the query. Since we only want the top few, the [:CUSTOM_TOOL_N_DOCS] slice selects the first CUSTOM_TOOL_N_DOCS documents from that list, where CUSTOM_TOOL_N_DOCS is a predefined constant telling us how many documents to consider; in this case, that's 3 documents, as we specified.

Now that we have our top documents, we extract the text from each of them with a list comprehension that iterates over docs and pulls out each document's page_content. The result is a list of the top 3 relevant document texts. We then join these texts into a single string with CUSTOM_TOOL_DOCS_SEPARATOR.join(texts), and the function returns texts_merged, a single string containing the joined texts of the relevant documents. The @tool decorator wraps the function, turning it into a custom tool.
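
If you'd like to test the tool in isolation before handing it to an agent, you can call it directly; tools created with the @tool decorator expose a run method that takes the input string. The query below is just an example:

# Optional: invoke the custom tool directly and inspect its merged output
print(retrieve_n_docs_tool.run("Is it secure to shop with PayPal?"))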

Using the tool with an agent

We can now initialize the agent that uses our custom tool.

# Load a LLM to create an agent using our custom tool
from langchain.llms import OpenAI
# Classes for initializing the agent that will use the custom tool
from langchain.agents import initialize_agent, AgentType

# Let's create an agent that uses our custom tool
# We set verbose=True to check if the agent is using the tool for generating the final answer
llm = OpenAI(model="gpt-3.5-turbo-instruct", temperature=0)
agent = initialize_agent([retrieve_n_docs_tool], llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True)

The initialize_agent function takes three main arguments: the list of custom tools, the large language model, and the agent type. We're using the OpenAI LLM and specifying the agent type as ZERO_SHOT_REACT_DESCRIPTION. With verbose=True, we can check whether the agent actually uses the tool when generating the final answer.
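
As an optional check, you can also confirm what the agent sees: the AgentExecutor returned by initialize_agent keeps the list of tools it was given, and each tool carries the name and description the agent reasons over when deciding whether to use it:

# Optional: inspect the tool name and description exposed to the agent
for t in agent.tools:
    print(t.name)
    print(t.description)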

Once the agent has been set up, it can be queried:

response = agent.run("Are my info kept private when I shop with Paypal?")
print(response)

You should see something like the following printed output.

> Entering new AgentExecutor chain...
 I need to find out what Paypal does to protect my information
Action: retrieve_n_docs_tool
Action Input: "Paypal privacy policy"
Observation: Is it secure?
PayPal is the safer way to pay because we keep your financial information private. It isn't shared with anyone else when you shop, so you don't have to worry about paying businesses and people you don't know. On top of that, we've got your back. If your eligible purchase doesn't arrive or doesn't match its description, we will refund you the full purchase price plus shipping costs with PayPal's Buyer Protection program.

Why should I use PayPal?
It's Fast! We will help you pay in just a few clicks. Enter your email address and password, and you're pretty much done! It's Simple! There's no need to run around searching for your wallet. Better yet, you don't need to type in your financial details again and again when making a purchase online. We make it simple for you to pay with just your email address and password.

What is PayPal?
PayPal is a digital wallet that follows you wherever you go. Pay any way you want. Link your credit cards to your PayPal Digital wallet, and when you want to pay, simply log in with your username and password and pick which one you want to use.
Thought: I now understand how Paypal keeps my information secure
Final Answer: Yes, your information is kept private when you shop with Paypal. PayPal is a digital wallet that follows you wherever you go and keeps your financial information private. It is not shared with anyone else when you shop, and PayPal also offers Buyer Protection to refund you the full purchase price plus shipping costs if your eligible purchase doesn't arrive or doesn't match its description.

> Finished chain.

… along with this printed response.

Yes, your information is kept private when you shop with Paypal. PayPal is a digital wallet that follows you wherever you go and keeps your financial information private. It is not shared with anyone else when you shop, and PayPal also offers Buyer Protection to refund you the full purchase price plus shipping costs if your eligible purchase doesn't arrive or doesn't match its description.

Reading the agent's printed output, we see that it decided to use the retrieve_n_docs_tool tool to fetch documents relevant to the PayPal privacy question. The final answer is then generated using the information contained in the retrieved documents.

Conclusion

The experiment showcases the power of AI in information retrieval and comprehension, using a custom tool to provide accurate, contextual responses to user queries.

This experiment solves the problem of efficient and relevant information retrieval. Instead of manually reading through a large number of documents or frequently asked questions to find the information they need, the user can simply ask a question and get a relevant response. This can greatly enhance user experience, especially for customer support services or any platform that relies on providing accurate information swiftly.

Congratulations on finishing this module on tools! Now you can test your new knowledge with the module quizzes. The next (and last!) module will be about LLM agents.