Streamlined Data Ingestion: Text, PyPDF, Selenium URL Loaders, and Google Drive Sync

Introduction

The TextLoader handles plain text files, while the PyPDFLoader specializes in PDF files, offering easy access to content and metadata. SeleniumURLLoader is designed for loading HTML documents from URLs that require JavaScript rendering. Lastly, the Google Drive Loader provides seamless integration with Google Drive, allowing for the import of data from Google Docs or folders.


TextLoader

Import the loaders you need from langchain.document_loaders. Remember to install the required packages first with the following command: pip install langchain==0.0.208 deeplake==3.9.27 openai==0.27.8 tiktoken.

from langchain.document_loaders import TextLoader

loader = TextLoader('file_path.txt')
documents = loader.load()

The sample code.

[Document(page_content='<FILE_CONTENT>', metadata={'source': 'file_path.txt'})]

The output.

If the file is not stored in your platform's default encoding, pass the encoding argument to specify how it should be decoded (for example, encoding="ISO-8859-1").
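To see why the encoding argument matters, here is a minimal sketch that uses only the standard library. The file name latin1_sample.txt and the helper read_text are made up for illustration. The same encoding value is what you would pass to TextLoader.

```python
import tempfile
from pathlib import Path


def read_text(path, encoding):
    """Return the file's text, or None if it cannot be decoded with `encoding`."""
    try:
        return Path(path).read_text(encoding=encoding)
    except UnicodeDecodeError:
        return None


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical sample file, written as ISO-8859-1 (Latin-1) bytes.
    sample = Path(tmp) / "latin1_sample.txt"
    sample.write_bytes("café".encode("ISO-8859-1"))

    as_utf8 = read_text(sample, "utf-8")         # wrong codec: decoding fails
    as_latin1 = read_text(sample, "ISO-8859-1")  # matching codec: decoding works

    # With LangChain installed, the equivalent call would be:
    # TextLoader(str(sample), encoding="ISO-8859-1").load()

print(as_utf8, as_latin1)
```

Reading with the wrong codec fails, while the matching codec recovers the text, which is the situation the encoding argument is meant to handle.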

PyPDFLoader (PDF)

The LangChain library provides two classes for loading and processing PDF files: PyPDFLoader and PDFMinerLoader. We mainly focus on the former, which loads a PDF file into a list of documents, where each document contains the page content and metadata with the page number. First, install the package using pip.

!pip install -q pypdf

Here's a code snippet to load and split a PDF file using PyPDFLoader:


from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader("example_data/layout-parser-paper.pdf")
pages = loader.load_and_split()

print(pages[0])

The sample code.

Document(page_content='<PDF_CONTENT>', metadata={'source': '/home/cloudsuperadmin/scrape-chain/langchain/deep_learning_for_nlp.pdf', 'page': 0})

The output.

Using PyPDFLoader offers advantages such as simple, straightforward usage and easy access to page content and metadata, like page numbers, in a structured format. However, it has disadvantages, including limited text extraction capabilities compared to PDFMinerLoader.
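As a rough illustration of how the page metadata can be used, the sketch below builds human-readable page labels and filters documents by page range. It assumes the page value is zero-based, as the sample output above suggests for the first page. The helper names and the stand-in objects are hypothetical and are used here so the snippet runs without a PDF. In practice you would pass the list returned by the loader.

```python
from types import SimpleNamespace


def page_label(doc):
    """Build a label such as 'example.pdf, page 2' from a document's metadata."""
    # Assumption: metadata["page"] is zero-based, so add 1 for display.
    return f"{doc.metadata['source']}, page {doc.metadata['page'] + 1}"


def pages_in_range(docs, first, last):
    """Keep documents whose one-based page number lies in [first, last]."""
    return [d for d in docs if first <= d.metadata["page"] + 1 <= last]


# Stand-ins with the same attributes as the documents shown above.
stand_in_pages = [
    SimpleNamespace(
        page_content=f"text of page {i + 1}",
        metadata={"source": "example.pdf", "page": i},
    )
    for i in range(3)
]

labels = [page_label(d) for d in pages_in_range(stand_in_pages, 2, 3)]
print(labels)
```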

SeleniumURLLoader (URL)

The SeleniumURLLoader class offers a robust yet user-friendly way to load HTML documents from a list of URLs that require JavaScript rendering. The following guide and example start by installing the required packages with pip. The code has been tested with version 0.7.7 of unstructured and version 4.10.0 of selenium, but feel free to install the latest versions.

!pip install -q unstructured selenium

Instantiate the SeleniumURLLoader class by providing a list of URLs to load, for example:


from langchain.document_loaders import SeleniumURLLoader

urls = [
    "https://www.youtube.com/watch?v=TFa539R09EQ&t=139s",
    "https://www.youtube.com/watch?v=6Zv6A_9urh4&t=112s"
]

loader = SeleniumURLLoader(urls=urls)
data = loader.load()

print(data[0])

Document(page_content="OPENASSISTANT TAKES ON CHATGPT!\n\nInfo\n\nShopping\n\nWatch later\n\nShare\n\nCopy link\n\nTap to unmute\n\nIf playback doesn't begin shortly, try restarting your device.\n\nYou're signed out\n\nVideos you watch may be added to the TV's watch history and influence TV recommendations. To avoid this, cancel and sign in to YouTube on your computer.\n\nUp next\n\nLiveUpcoming\n\nPlay Now\n\nMachine Learning Street Talk\n\nSubscribe\n\nSubscribed\n\nSwitch camera\n\nShare\n\nAn error occurred while retrieving sharing information. Please try again later.\n\n2:19\n\n2:19 / 59:51\n\nWatch full video\n\n•\n\nScroll for details\n\nNew!\n\nWatch ads now so you can enjoy fewer interruptions\n\nGot it\n\nAbout\n\nPress\n\nCopyright\n\nContact us\n\nCreators\n\nAdvertise\n\nDevelopers\n\nTerms\n\nPrivacy\n\nPolicy & Safety\n\nHow YouTube works\n\nTest new features\n\nNFL Sunday Ticket\n\n© 2023 Google LLC", metadata={'source': 'https://www.youtube.com/watch?v=TFa539R09EQ&t=139s'})

The SeleniumURLLoader class includes the following attributes:

urls (List[str]): List of URLs to load.
continue_on_failure (bool, default=True): Continues loading other URLs on failure if True.
browser (str, default="chrome"): Browser selection, either "chrome" or "firefox".
executable_path (Optional[str], default=None): Browser executable path.
headless (bool, default=True): Browser runs in headless mode if True.

Set these attributes when you instantiate SeleniumURLLoader. For example, to use Firefox instead of Chrome, set browser to "firefox":

loader = SeleniumURLLoader(urls=urls, browser="firefox")

Upon invoking the load() method, a list of Document instances containing the loaded content is returned. Each Document instance includes a page_content attribute with the extracted text from the HTML and a metadata attribute containing the source URL.

Bear in mind that SeleniumURLLoader may be slower than other loaders since it launches a real browser and renders every page before extracting the text. Nevertheless, it is the right choice for pages that require JavaScript rendering.
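The effect of continue_on_failure can be sketched without a browser. The snippet below is an illustration of the behavior described above, not the library's implementation. load_all and fake_fetch are hypothetical names, and plain dictionaries stand in for Document instances.

```python
def load_all(urls, fetch, continue_on_failure=True):
    """Fetch each URL with `fetch`; skip or re-raise failures depending on the flag."""
    docs = []
    for url in urls:
        try:
            text = fetch(url)
        except Exception as exc:
            if not continue_on_failure:
                raise
            print(f"Skipping {url}: {exc}")
            continue
        # Same shape as the loader's result: page content plus the source URL.
        docs.append({"page_content": text, "metadata": {"source": url}})
    return docs


def fake_fetch(url):
    """Hypothetical stand-in for rendering a page in a browser."""
    if "broken" in url:
        raise RuntimeError("page failed to load")
    return f"rendered text of {url}"


example_urls = ["https://example.com/ok", "https://example.com/broken"]
docs = load_all(example_urls, fake_fetch)
print(len(docs), docs[0]["metadata"])
```

With the flag left at True, the failing URL is skipped and the remaining pages are still returned. With continue_on_failure=False, the first error stops the whole run.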

💡 This approach will not work in a Google Colab environment without further configuration, which is outside the scope of this course. Try running the code directly with the Python interpreter.

Google Drive loader

The LangChain Google Drive Loader efficiently imports data from Google Drive by using the GoogleDriveLoader class. It can fetch data from a list of Google Docs document IDs or a single folder ID.

Prepare necessary credentials and tokens:

By default, the GoogleDriveLoader looks for the credentials.json file at ~/.credentials/credentials.json. Use the credentials_path keyword argument to change this path.
The token.json file follows the same principle (token_path) and is created automatically the first time the loader is used.

To set up the credentials file, follow these steps:

Create a new Google Cloud Platform project or use an existing one by visiting the Google Cloud Console. Ensure that billing is enabled for your project.
Enable the Google Drive API by navigating to its dashboard in the Google Cloud Console and clicking "Enable."
Create a service account by going to the Service Accounts page in the Google Cloud Console. Follow the prompts to set up a new service account.
Assign necessary roles to the service account, such as "Google Drive API - Drive File Access" and "Google Drive API - Drive Metadata Read/Write Access," depending on your needs.
After creating the service account, access the "Actions" menu next to it, select "Manage keys," click "Add Key," and choose "JSON" as the key type. This generates a JSON key file and downloads it to your computer. This file serves as your credentials file.

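Retrieve the folder or document ID from the URL. The ID is the long string that follows /folders/ in a folder URL or /document/d/ in a Google Docs URL.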
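If you prefer to extract the ID programmatically, a small regular expression is enough. This is a minimal sketch: extract_drive_id is a hypothetical helper, and the IDs in the example URLs are placeholders rather than real resources.

```python
import re

# The ID is the path segment right after /folders/ or /document/d/.
_FOLDER_RE = re.compile(r"/folders/([A-Za-z0-9_-]+)")
_DOCUMENT_RE = re.compile(r"/document/d/([A-Za-z0-9_-]+)")


def extract_drive_id(url):
    """Return the folder or document ID found in a Drive URL, or None."""
    for pattern in (_FOLDER_RE, _DOCUMENT_RE):
        match = pattern.search(url)
        if match:
            return match.group(1)
    return None


folder_id = extract_drive_id("https://drive.google.com/drive/folders/FOLDER_ID_123")
document_id = extract_drive_id("https://docs.google.com/document/d/DOC-ID_456/edit")
print(folder_id, document_id)
```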

Import the GoogleDriveLoader class:

from langchain.document_loaders import GoogleDriveLoader

Instantiate GoogleDriveLoader:

loader = GoogleDriveLoader(
    folder_id="your_folder_id",
    recursive=False  # Optional: Fetch files from subfolders recursively. Defaults to False.
)

Load the documents:

docs = loader.load()

Note that currently, only Google Docs are supported.
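The loader can also be given a list of document IDs instead of a folder ID. The sketch below builds the keyword arguments for either case and rejects ambiguous input. drive_loader_kwargs and load_from_drive are hypothetical helpers, the IDs are placeholders, and the actual load is left commented out because it needs LangChain, valid credentials, and network access.

```python
def drive_loader_kwargs(folder_id=None, document_ids=None):
    """Build keyword arguments from exactly one of: a folder ID or a list of document IDs."""
    if (folder_id is None) == (document_ids is None):
        raise ValueError("Pass either folder_id or document_ids, but not both.")
    if folder_id is not None:
        return {"folder_id": folder_id}
    return {"document_ids": list(document_ids)}


def load_from_drive(**kwargs):
    """Create the loader and load the documents."""
    # Imported lazily so the helper above works without LangChain installed.
    from langchain.document_loaders import GoogleDriveLoader

    return GoogleDriveLoader(**kwargs).load()


kwargs = drive_loader_kwargs(document_ids=["your_document_id_1", "your_document_id_2"])
print(kwargs)
# docs = load_from_drive(**kwargs)
```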

Conclusion

In this lesson, we covered four loaders that simplify data ingestion: TextLoader, PyPDFLoader, SeleniumURLLoader, and the Google Drive Loader. Each one targets a specific file type or data source: plain text files, PDFs, JavaScript-rendered web pages, and Google Docs, respectively.

In the next lesson, we’ll learn about common ways of splitting texts into smaller chunks so that they can easily be inserted into prompts with a limited token size.