Multimodal Financial Document Analysis and Recall: Tesla Investor Presentations

In this lesson, we will explore how Retrieval-Augmented Generation (RAG) can be applied to process a company's financial information contained in a PDF document. The process includes extracting critical data (text, tables, graphs, etc.) from the PDF and saving it in a vector store database such as Deep Lake for quick and efficient retrieval. A RAG-enabled bot can then access the stored information to respond to end-user queries.

This task requires a diverse set of tools, including Unstructured.io for text/table extraction, OpenAI's GPT-4V for extracting information from graphs, and LlamaIndex for developing a bot with retrieval capabilities. As previously mentioned, data preprocessing plays a significant role in the RAG pipeline, so we start by pulling data from a PDF document. For ease of understanding, this lesson demonstrates how to extract data from a single PDF document. Nevertheless, the accompanying notebook provided after the lesson analyzes three separate reports, offering a broader scope of information. Before starting this guide, make sure you install all the requirements listed in the requirements section.

Extracting Data

Extracting textual data is relatively straightforward, but processing graphical elements such as line or bar charts is more challenging. The latest OpenAI model equipped with vision processing, GPT-4V, is valuable for these visual elements. We can feed the slides to the model and ask it to describe them in detail; the descriptions are then used to complement the textual information. This lesson uses Tesla's Q3 2023 financial report as the source document, which can be downloaded using the wget command.

!wget https://digitalassets.tesla.com/tesla-contents/image/upload/IR/TSLA-Q3-2023-Update-3.pdf
The sample code.
💡
The preprocessing tasks outlined in the next section might be time-consuming and necessitate API calls to OpenAI endpoints, which come with associated costs. To mitigate this, we have made the preprocessed dataset and the checkpoints of the output of each section available at the end of this lesson, allowing you to utilize them with the provided notebook.

1. Text/Tables

The unstructured package is an effective tool for extracting information from PDF files. It requires two supporting tools, poppler and tesseract, that help render PDF documents. We suggest setting up these packages on Google Colab, which is freely available for students to execute and experiment with the code. We will briefly mention how to install these packages on other operating systems. Let's install the utilities and their dependencies using the following commands.

If you are running the code with Google Colab, restart the session after installing the packages.

!apt-get -qq install poppler-utils
!apt-get -qq install tesseract-ocr
!pip install tensorflow-probability imageio pillow

!pip install unstructured[all-docs] fastapi kaleido uvicorn typing-extensions pydantic
The commands to install required packages.
💡
These packages are easy to install on Linux and macOS using apt-get and brew. However, they are more complex to install on Windows. If you use Windows, you can follow the step-by-step guides below. [Installing Poppler on Windows] [Installing Tesseract on Windows]

The process is simple after installing all the necessary packages and dependencies. We simply use the partition_pdf function, which extracts text and table data from the PDF and divides it into multiple chunks. We can customize the size of these chunks based on the number of characters.

from unstructured.partition.pdf import partition_pdf

raw_pdf_elements = partition_pdf(
    filename="./TSLA-Q3-2023-Update-3.pdf",
    # Use a layout model (YOLOX) to get bounding boxes (for tables) and find titles
    # Titles are any sub-section of the document
    infer_table_structure=True,
    # Post-processing to aggregate text once we have the titles
    chunking_strategy="by_title",
    # Chunking params to aggregate text blocks:
    # hard maximum of 4000 characters per chunk
    max_characters=4000,
    # start a new chunk after roughly 3800 characters
    new_after_n_chars=3800,
    # combine small sections under 2000 characters into larger chunks
    combine_text_under_n_chars=2000
)
The sample code.

The previous code identifies and extracts various elements from the PDF, which can be classified into CompositeElement (textual content) and Table elements. We then use the Pydantic package to define a new data structure that stores information about each element, namely its type and text. The code below iterates through all extracted elements, keeping them in a list where each item is an instance of the Element type.

from pydantic import BaseModel
from typing import Any

# Define data structure
class Element(BaseModel):
    type: str
    text: Any

# Categorize by type
categorized_elements = []
for element in raw_pdf_elements:
    if "unstructured.documents.elements.Table" in str(type(element)):
        categorized_elements.append(Element(type="table", text=str(element)))
    elif "unstructured.documents.elements.CompositeElement" in str(type(element)):
        categorized_elements.append(Element(type="text", text=str(element)))
The sample code.

Creating the Element data structure enables convenient storage of additional information, which can be helpful for identifying the source of each answer, whether it is derived from texts, tables, or figures.
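For instance, here is a minimal sketch of how you might inspect the categorized list, assuming the categorized_elements variable from the code above:

from collections import Counter

# Count how many chunks of each type were extracted
print(Counter(el.type for el in categorized_elements))

# Keep tables and text separate in case we want to treat them differently later
table_elements = [el for el in categorized_elements if el.type == "table"]
text_elements = [el for el in categorized_elements if el.type == "text"]
print(f"{len(table_elements)} tables, {len(text_elements)} text chunks")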

2. Graphs

The next step is gathering information from the charts to add context. The primary challenge is extracting images from the pages to feed to OpenAI's endpoint. A practical approach is to convert each PDF page to an image and pass it to the model, asking whether it detects any graphs. If it identifies one or more charts, the model describes the data and the trends they represent; if no graphs are detected, it returns an empty array as an indication.

💡
A drawback of this approach is that it increases the number of requests to the model and consequently the cost, since every page must be processed regardless of whether it contains graphs. It is possible to reduce the cost by manually flagging the pages that contain charts.
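For example, if you already know which pages of the report contain charts, you could restrict the image-processing loop introduced below to only those pages. A minimal sketch, with hypothetical page numbers:

# Hypothetical page numbers flagged manually after skimming the report
pages_with_graphs = {3, 4, 7, 9}

# Later, filter the page images before sending them to the model, e.g.:
# pages_png = [f"page-{idx}.png" for idx in sorted(pages_with_graphs)]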

The initial step involves installing the pdf2image package to convert the PDF into images. This also requires the poppler tool, which we have already installed.

!pip install -q pdf2image
The commands to install required packages.

The code below uses the convert_from_path function, which takes the path of a PDF file and returns one image per page. We iterate over the pages and save each one as a PNG file using the .save() method. These images are saved in the ./pages directory. Additionally, we define the pages_png variable, which holds the file name of each image.

import os
from pdf2image import convert_from_path

# Create the output directory if it does not already exist
os.makedirs("./pages", exist_ok=True)

# Convert each page of the PDF to an image and save it as a PNG file
pages = convert_from_path('./TSLA-Q3-2023-Update-3.pdf')

for idx, image in enumerate(pages):
    image.save(f"./pages/page-{idx}.png")

pages_png = [file for file in os.listdir("./pages") if file.endswith('.png')]
The sample code.

Before sending the image files to the OpenAI API, we define a few helper functions and variables. The headers variable contains the OpenAI API key, enabling the server to authenticate our requests. The payload carries configurations such as the model name, the maximum token limit, and the prompts. It instructs the model to describe the graphs and generate responses in JSON format, addressing scenarios such as encountering multiple graphs on a single page or finding no graphs at all. We will add the images to the payload before sending the requests. Finally, the encode_image() function encodes the images in base64 format, which is how the OpenAI API expects image data.

Before using the libraries, it's essential to configure the OPENAI_API_KEY and ACTIVELOOP_TOKEN variables in the environment. Remember to substitute the placeholder values with your actual keys from the respective platforms.

import os
import base64
import getpass

os.environ["ACTIVELOOP_TOKEN"] = getpass.getpass("Enter your ActiveLoop Token: ")
os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API Key: ")

headers = {
  "Content-Type": "application/json",
  "Authorization": "Bearer " + str( os.environ["OPENAI_API_KEY"] )
}

payload = {
  "model": "gpt-4-vision-preview",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "You are an assistant that find charts, graphs, or diagrams from an image and summarize their information. There could be multiple diagrams in one image, so explain each one of them separately. ignore tables."
        },
        {
          "type": "text",
          "text": 'The response must be a JSON in following format {"graphs": [<chart_1>, <chart_2>, <chart_3>]} where <chart_1>, <chart_2>, and <chart_3> placeholders that describe each graph found in the image. Do not append or add anything other than the JSON format response.'
        },
        {
          "type": "text",
          "text": 'If could not find a graph in the image, return an empty list JSON as follows: {"graphs": []}. Do not append or add anything other than the JSON format response. Dont use coding "```" marks or the word json.'
        },
        {
          "type": "text",
          "text": "Look at the attached image and describe all the graphs inside it in JSON format. ignore tables and be concise."
        }
      ]
    }
  ],
  "max_tokens": 1000
}

# Function to encode the image to base64 format
def encode_image(image_path):
  with open(image_path, "rb") as image_file:
    return base64.b64encode(image_file.read()).decode('utf-8')
The sample code.

The remaining steps include: 1) utilizing the pages_png variable to loop through the images, 2) encoding the image into base64 format, 3) adding the image into the payload, and finally, 4) sending the request to OpenAI and handling its responses. We will use the same Element data structure to store each image's type (graph) and the text (descriptions of the graphs).

import requests
import tqdm
import copy
import json
graphs_description = []
for idx, page in tqdm.tqdm( enumerate( pages_png ) ):
  # Getting the base64 string
  base64_image = encode_image(f"./pages/{page}")

  # Adjust Payload
  tmp_payload = copy.deepcopy(payload)
  tmp_payload['messages'][0]['content'].append({
    "type": "image_url",
    "image_url": {
      "url": f"data:image/png;base64,{base64_image}"
    }
  })

  try:
    response = requests.post("https://api.openai.com/v1/chat/completions", headers=headers, json=tmp_payload)
    response = response.json()
    graph_data = json.loads( response['choices'][0]['message']['content'] )['graphs']

    desc = [f"{page}\n" + '\n'.join(f"{key}: {item[key]}" for key in item.keys()) for item in graph_data]

    graphs_description.extend( desc )

  except Exception as e:
    # Skip the page if there is an error or the response cannot be parsed.
    print(f"skipping {page}... error in decoding: {e}")
    continue

graphs_description = [Element(type="graph", text=str(item)) for item in graphs_description]
The sample code.

Store on Deep Lake

This section utilizes the Deep Lake vector database to store the collected information along with its embeddings. Embedding vectors are numerical representations of pieces of text that capture their meaning, enabling similarity metrics such as cosine similarity to identify closely related documents. For instance, a prompt asking about a company's total revenue would have high cosine similarity with a document in the database stating that the revenue amount is X dollars.
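As a quick illustration of this idea, the sketch below computes cosine similarity between two short, hypothetical vectors; the embeddings produced by the actual embedding model have 1,536 dimensions, but the calculation is the same.

import numpy as np

def cosine_similarity(a, b):
    # 1.0 means the vectors point in the same direction; values near 0 mean unrelated
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query_vec = np.array([0.12, 0.85, 0.31])  # e.g., embedding of "What was the total revenue?"
doc_vec = np.array([0.10, 0.80, 0.35])    # e.g., embedding of a chunk stating the revenue figure

print(cosine_similarity(query_vec, doc_vec))  # close to 1.0 -> likely a relevant document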

Data preparation is complete now that all the crucial information has been extracted from the PDF. The next step is to combine the outputs of the previous sections, resulting in a list containing 41 entries.

all_docs = categorized_elements + graphs_description

print( len( all_docs ) )
The sample code.
41
The output.

Given that we are using LlamaIndex, we can use its Deep Lake integration to create and store the dataset. Begin by installing the llama_index and deeplake packages along with their dependencies.

!pip install -q llama_index deeplake cohere
%pip install llama-index-vector-stores-deeplake
The commands to install required packages.

The LlamaIndex integration provides the DeepLakeVectorStore class, which can create a new dataset. Simply enter your organization ID, which by default is your Activeloop username, in the code provided below. This code generates an empty dataset, ready to store documents.

from llama_index.vector_stores.deeplake import DeepLakeVectorStore

# TODO: use your organization id here. (by default, org id is your username)
my_activeloop_org_id = "<YOUR-ACTIVELOOP-ORG-ID>"
my_activeloop_dataset_name = "tsla_q3"
dataset_path = f"hub://{my_activeloop_org_id}/{my_activeloop_dataset_name}"

vector_store = DeepLakeVectorStore(
    dataset_path=dataset_path,
    runtime={"tensor_db": True},
    overwrite=False
)
The sample code.
Your Deep Lake dataset has been successfully created!
The output.

Next, we pass the created vector store to a StorageContext class. This class serves as a wrapper for creating storage from various data sources. In our case, we generate the storage from a vector database, which is accomplished simply by passing the database instance to the .from_defaults() method.

from llama_index.core.storage.storage_context import StorageContext

storage_context = StorageContext.from_defaults(vector_store=vector_store)
The sample code.

To store our preprocessed data, we must transform it into LlamaIndex Documents for compatibility with the library. The LlamaIndex Document is an abstract class that acts as a wrapper for various data types, including text files, PDFs, and database outputs. This wrapper facilitates storing valuable information with each sample. In our case, we include a metadata tag to hold extra details such as the data type (text, table, or graph) or to denote document relationships. This approach simplifies retrieving these details later.

As shown in the code below, you can employ built-in classes like SimpleDirectoryReader to automatically read files from a specified path, or proceed manually, as we do here: we loop through the list containing all the extracted information and assign text and a category to each document.

from llama_index.core import Document

documents = [Document(text=t.text, metadata={"category": t.type}) for t in all_docs]
The sample code.

Lastly, we can utilize the VectorStoreIndex class to generate embeddings for the documents and employ the database instance to store these values. By default, it uses OpenAI's Ada model to create the embeddings.

from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)
The sample code.
Uploading data to deeplake dataset.
100%|██████████| 29/29 [00:00<00:00, 46.26it/s]
Dataset(path='hub://alafalaki/tsla_q3-nograph', tensors=['text', 'metadata', 'embedding', 'id'])

  tensor      htype      shape      dtype  compression
  -------    -------    -------    -------  ------- 
   text       text      (29, 1)      str     None   
 metadata     json      (29, 1)      str     None   
 embedding  embedding  (29, 1536)  float32   None   
    id        text      (29, 1)      str     None
The output.
💡
The dataset has already been created and is hosted under the GenAI360 organization on the Activeloop hub. If you prefer not to use the OpenAI API for generating embeddings, you can run the remaining code using this publicly accessible dataset. Just substitute the dataset_path variable with the following: hub://genai360/tsla_q3.
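For instance, a minimal sketch of pointing the vector store at the public dataset instead of your own, assuming the DeepLakeVectorStore import from the previous code block:

# Load the publicly hosted dataset in read-only mode instead of creating your own
dataset_path = "hub://genai360/tsla_q3"
vector_store = DeepLakeVectorStore(dataset_path=dataset_path, read_only=True)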

Get Deep Memory Access

As Step 0, please note that Deep Memory is a premium feature available on Activeloop's paid plans; as a reminder, you can redeem a free trial. As part of the course, all course takers can redeem a free extended one-month trial of the Activeloop Growth plan by using the GENAI360 promo code at checkout. To redeem the plan, create a Deep Lake account and follow the steps shown on the screen after account creation.

Activate Deep Memory

The Deep Memory feature from Activeloop enhances the retriever's accuracy. This improvement allows the model to access higher-quality context, leading to more detailed and informative responses. We already covered the basics of Deep Memory in earlier lessons, so we will not dive into further detail here. The process begins by fetching chunks of data from the cloud and using GPT-3.5 to generate a question for each chunk. These question-chunk pairs are then used in the Deep Memory training procedure to improve the embedding quality. In our experience, this approach led to a 25% improvement in retrieval performance.

💡
Activeloop recommends using a dataset containing a minimum of 100 chunks, ensuring sufficient context for the model to enhance the embedding space effectively. Therefore, the code in this section is based on three PDF documents. For the complete code and execution steps to process three documents instead of one, please refer to the accompanying notebook. The processed dataset is available in the cloud under the GenAI360 organization; you can access it with the following path: hub://genai360/tesla_quarterly_2023.

The initial phase involves loading the pre-existing dataset and reading the text of each chunk along with its corresponding ID.

from llama_index.vector_stores.deeplake import DeepLakeVectorStore

dataset_path = f"hub://{my_activeloop_org_id}/{my_activeloop_dataset_name}"

db = DeepLakeVectorStore(
    dataset_path=dataset_path,
    runtime={"tensor_db": True},
    read_only=True
)

# fetch dataset docs and ids if they exist (optional you can also ingest)
docs = db._vectorstore.dataset.text.data(fetch_chunks=True, aslist=True)['value']
ids = db._vectorstore.dataset.id.data(fetch_chunks=True, aslist=True)['value']
print(len(docs))
The sample code.
Deep Lake Dataset in hub://genai360/tesla_quarterly_2023 already exists, loading from the storage

127
The output.

The following code segment defines a function that uses GPT-3.5 to generate a question for each data chunk. This involves crafting a specialized tool (a function schema) for OpenAI's function-calling feature. The code configures suitable prompts for the API requests to produce the questions and compiles them, together with their associated chunk IDs, into a list.

import json
import random
from tqdm import tqdm
from openai import OpenAI

client = OpenAI()
# Set the function JSON Schema for openai function calling feature
tools = [
    {
        "type": "function",
        "function": {
            "name": "create_question_from_text",
            "parameters": {
                "type": "object",
                "properties": {
                    "question": {
                        "type": "string",
                        "description": "Question created from the given text",
                    },
                },
                "required": ["question"],
            },
            "description": "Create question from a given text.",
        },
    }
]

def generate_question(tools, text):
    try:
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            tools=tools,
            tool_choice={
                "type": "function",
                "function": {"name": "create_question_from_text"},
            },
            messages=[
                {"role": "system", "content": "You are a world class expert for generating questions based on provided context. You make sure the question can be answered by the text."},
                {
                    "role": "user",
                    "content": text,
                },
            ],
        )

        json_response = response.choices[0].message.tool_calls[0].function.arguments
        parsed_response = json.loads(json_response)
        question_string = parsed_response["question"]
        return question_string
    except Exception:
        question_string = "No question generated"
        return question_string

def generate_queries(docs: list[str], ids: list[str], n: int):

    questions = []
    relevances = []
    pbar = tqdm(total=n)
    while len(questions) < n:
        # 1. randomly draw a piece of text and relevance id
        r = random.randint(0, len(docs)-1)
        text, label = docs[r], ids[r]

        # 2. generate a query and assign a relevance id
        generated_qs = [generate_question(tools, text)]
        if generated_qs == ["No question generated"]:
            continue

        questions.extend(generated_qs)
        relevances.extend([[(label, 1)] for _ in generated_qs])
        pbar.update(len(generated_qs))

    return questions[:n], relevances[:n]

questions, relevances = generate_queries(docs, ids, n=70)
The sample code.
100%|██████████| 20/20 [00:19<00:00,  1.02it/s]
The output.

Now, we can use the questions and the reference IDs to activate Deep Memory via the .deep_memory.train() method, which improves the embedding representations. You can check the status of the training process using the .status() method.

%pip install -qU langchain-openai
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()

job_id = db._vectorstore.deep_memory.train(
    queries=questions,
    relevance=relevances,
    embedding_function=embeddings.embed_documents,
)
The sample code.
Starting DeepMemory training job
Your Deep Lake dataset has been successfully created!
Preparing training data for deepmemory:
Creating 20 embeddings in 1 batches of size 20:: 100%|██████████| 1/1 [00:03<00:00,  3.23s/it]
DeepMemory training job started. Job ID: 6581e3056a1162b64061a9a4
The output.

# During training, you can check the status of the training run
db._vectorstore.deep_memory.status(job_id="6581e3056a1162b64061a9a4")
The sample code.
--------------------------------------------------------------
|                  6581e3056a1162b64061a9a4                  |
--------------------------------------------------------------
| status                     | completed                     |
--------------------------------------------------------------
| progress                   | eta: 0.2 seconds              |
|                            | recall@10: 100.00% (+7.14%)   |
--------------------------------------------------------------
| results                    | recall@10: 100.00% (+7.14%)   |
--------------------------------------------------------------
The output.

The dataset is now prepared and compatible with the Deep Memory feature. It is crucial to note that the deep_memory option must be explicitly set to True when using the dataset for inference.

Chatbot In Action

In this section, we will use the created dataset as the retrieval source, providing the necessary context for the GPT-3.5-turbo model (the default choice in LlamaIndex) to answer questions. Keep in mind that the inference results presented below are derived from processing three PDF files, consistent with the sample code provided in the notebook. To access the processed dataset containing all the PDF documents, use hub://genai360/tesla_quarterly_2023 as the dataset path in the code below.

The DeepLakeVectorStore class also handles loading an existing dataset from the hub. The key distinction in the code below, compared to the previous sections, is the use of the .from_vector_store() method, which builds the index directly from the database rather than from in-memory documents.

from llama_index.core.storage.storage_context import StorageContext
from llama_index.core import VectorStoreIndex

vector_store = DeepLakeVectorStore(dataset_path=dataset_path, overwrite=False)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_vector_store(
    vector_store, storage_context=storage_context
)
The sample code.

We can now use the .as_query_engine() method of the index variable to establish a query engine, which allows us to ask questions across the different data sources. Notice the vector_store_kwargs argument, which activates the deep_memory feature by setting it to True; this step is essential for enabling the feature on the retriever. The .query() method takes a prompt and searches for the most relevant data points within the database to construct an answer.

query_engine = index.as_query_engine(vector_store_kwargs={"deep_memory": True})
response = query_engine.query(
    "What are the trends in vehicle deliveries?",
)
response
The sample code.
The trends in vehicle deliveries on the Quarter 3 report show an increasing trend over the quarters.
The output.
Screenshot of the referenced graph.

As observed, the chatbot effectively utilized the graph descriptions we generated from the report. Above is a screenshot of the bar chart the chatbot referenced to generate its response.

Additionally, we conducted an experiment in which we compiled the same dataset but excluded the graph descriptions. This dataset can be accessed via the hub://genai360/tesla_quarterly_2023-nograph path. The purpose was to determine whether including the descriptions aids the chatbot's performance.
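To reproduce this comparison yourself, a minimal sketch is shown below; it assumes the imports and query from the previous sections and only swaps the dataset path.

vector_store_nograph = DeepLakeVectorStore(
    dataset_path="hub://genai360/tesla_quarterly_2023-nograph",
    read_only=True
)
index_nograph = VectorStoreIndex.from_vector_store(vector_store_nograph)

response = index_nograph.as_query_engine().query(
    "What are the trends in vehicle deliveries?"
)
print(response)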

In quarter 3, there was a decrease in Model S/X deliveries compared to the previous quarter, with a 14% decline. However, there was an increase in Model 3/Y deliveries, with a 29% growth. Overall, total deliveries in quarter 3 increased by 27% compared to the previous quarter.
The output of the chatbot without the graph data.

You'll observe that the chatbot points to incorrect text segments; although the answer is contextually similar, it does not provide the correct one. The graph shows an upward trend, a detail that might not have been mentioned in the report's text.

Conclusion

In this lesson, we explored the steps of developing a chatbot capable of using PDF files as a knowledge base to answer questions. Additionally, we employed the vision capability of GPT-4V to identify and describe the graphs on each page. Describing the charts and the trends they illustrate improves the chatbot's accuracy and provides additional context.

>> Notebook.

>> Preprocessed Text/Label:

>> Preprocessed Graphs: