Regardless of the chosen model or prompt formulation, language models have inherent limitations that cannot be resolved with the prompting techniques covered so far. These models have a training cut-off date, which means they typically lack access to recent news and the latest developments. As a result, their responses may be factually inaccurate, and they may hallucinate information outright.
In this module, we delve into techniques for supplying accurate context to language models, enhancing their ability to answer questions effectively. Additional context can be sourced from various channels such as databases, URLs, or different file types. Several preprocessing steps make this possible: splitters ensure the content fits within the model's input window, and embedding vectors, computed from the text, help identify contextually similar resources. A minimal sketch of the full pipeline follows.
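Before the lessons themselves, here is a minimal sketch of the pipeline the module builds up, assuming the classic `langchain` package with `deeplake` installed, valid `OPENAI_API_KEY` and `ACTIVELOOP_TOKEN` environment variables, and placeholder names for the text file and dataset path:

```python
# A minimal retrieval pipeline sketch: load -> split -> embed -> store -> retrieve.
# The file name and the Deep Lake dataset path below are placeholders.
from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import DeepLake

# 1. Load the raw text as LangChain Document objects.
docs = TextLoader("my_document.txt").load()

# 2. Split long documents into chunks that fit the model's input window.
splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(docs)

# 3. Embed the chunks and store them in a Deep Lake dataset.
db = DeepLake.from_documents(
    chunks, OpenAIEmbeddings(), dataset_path="hub://<org>/example"
)

# 4. Retrieve the chunks most relevant to a question.
retriever = db.as_retriever()
relevant = retriever.get_relevant_documents("What is this document about?")
```

Each step in this sketch corresponds to one of the lessons below.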
Now, let's explore each lesson with a short description to give you a glimpse of what lies ahead.
- Exploring The Role of LangChain's Indexes and Retrievers: To kick off the module, we introduce the Deep Lake vector database and its seamless integration with the LangChain library. This lesson highlights the benefits of using Deep Lake, including the ability to retrieve pertinent documents for use as context (the retrieval step in the pipeline sketch above), and also covers the limitations of this approach along with solutions to overcome them.
- Streamlined Data Ingestion: Text, PyPDF, Selenium URL Loaders, and Google Drive Sync: The LangChain library offers a variety of helper classes designed to facilitate loading and extracting data from diverse sources. Whether the information originates from a PDF file or a website, these classes streamline the handling of different data formats; see the loader sketch after this list.
- What are Text Splitters and Why They are Useful: Content length varies with the source. A PDF of an entire book, for instance, exceeds the model's input window and cannot be processed directly. Splitting the text into smaller segments lets us pass only the most relevant chunk as context, instead of expecting the model to digest the whole book before answering a question. This lesson thoroughly explores the different splitting approaches (a short sketch follows this list).
- Exploring the World of Embeddings: Embeddings are high-dimensional vectors that capture semantic information. Models can map textual data into this embedding space, yielding versatile representations that work across languages. Points that lie close together in the space tend to share meaning, so measuring the distance between vectors is a practical way to identify relevant information. LangChain's integrations provide the functions needed both to compute embeddings and to calculate similarities; see the embedding sketch after this list.
- Build a Customer Support Question Answering Chatbot: This practical example demonstrates using a website's content as supplementary context so a chatbot can respond to user queries effectively. The implementation employs the data loaders mentioned above, stores the resulting embeddings in a Deep Lake dataset, and finally retrieves the documents most pertinent to the user's question (a question-answering sketch follows this list).
- Conversation Intelligence: Gong.io Open-Source Alternative AI Sales Assistant: In this lesson, we explore how LangChain, Deep Lake, and GPT-4 can be used to develop a sales assistant that advises salespeople while taking internal guidelines into consideration.
- FableForge: Creating Picture Books with OpenAI, Replicate, and Deep Lake: In this final lesson, we delve into a creative use case of AI: generating children's picture books in a project called "FableForge," using OpenAI's GPT-3.5 to write the story and Stable Diffusion (via Replicate) to generate its images.
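As promised in the data-ingestion lesson description, here is a brief sketch of the loaders' shared interface. The file names and URL are placeholders, and `PyPDFLoader` and `SeleniumURLLoader` assume the `pypdf`, `selenium`, and `unstructured` packages are installed:

```python
# Different sources, one interface: every loader returns a list of Documents.
from langchain.document_loaders import TextLoader, PyPDFLoader, SeleniumURLLoader

text_docs = TextLoader("notes.txt").load()                          # plain-text file
pdf_docs = PyPDFLoader("book.pdf").load()                           # one Document per page
web_docs = SeleniumURLLoader(urls=["https://example.com"]).load()   # rendered page text

# Each Document carries the extracted text plus source metadata.
print(pdf_docs[0].metadata)  # e.g. {'source': 'book.pdf', 'page': 0}
```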
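The splitting lesson is easiest to picture with a concrete example. A minimal sketch, assuming the classic `langchain` package; the generated text stands in for a real document:

```python
# Splitting a long text into overlapping chunks so each fits the input window.
# chunk_size counts characters here; chunk_overlap repeats a little text across
# neighbouring chunks so sentences keep their surrounding context.
from langchain.text_splitter import RecursiveCharacterTextSplitter

# A placeholder long text; in practice this comes from one of the loaders above.
long_text = " ".join(f"Sentence number {i}." for i in range(1000))

splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_text(long_text)

print(len(chunks), "chunks, each at most 500 characters")
```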
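To make the distance intuition from the embeddings lesson concrete, the sketch below embeds three sentences and compares them with cosine similarity. It assumes an `OPENAI_API_KEY` is set; the example sentences are illustrative:

```python
# Embedding texts and comparing them: closer vectors imply closer meaning.
import numpy as np
from langchain.embeddings import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()
vec_a = np.array(embeddings.embed_query("How do I reset my password?"))
vec_b = np.array(embeddings.embed_query("I forgot my login credentials."))
vec_c = np.array(embeddings.embed_query("Today's weather forecast is sunny."))

def cosine(u, v):
    # Cosine similarity: 1.0 means identical direction, near 0 means unrelated.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(vec_a, vec_b))  # relatively high: both concern account access
print(cosine(vec_a, vec_c))  # lower: unrelated topics
```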
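Finally, the question-answering pattern from the chatbot lesson condenses into a few lines. A sketch, assuming a Deep Lake dataset already populated as in the pipeline at the top of this module; the dataset path is a placeholder:

```python
# Wiring retrieved context into an LLM answer with a RetrievalQA chain.
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import DeepLake

# Reopen the previously populated dataset in read-only mode.
db = DeepLake(dataset_path="hub://<org>/support_docs",
              embedding_function=OpenAIEmbeddings(), read_only=True)

qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0),
    chain_type="stuff",  # "stuff" inserts retrieved chunks directly into the prompt
    retriever=db.as_retriever(),
)
print(qa.run("How do I cancel my subscription?"))
```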
To summarize, this module teaches you how to enrich language models with additional context to improve the quality of their responses and mitigate issues like hallucination. Here we focus on utilizing external documents and retrieving information from databases; in future modules, we will explore incorporating internet search results so the models can answer questions about trending topics.