What are Text Splitters and Why They are Useful

Introduction

Large Language Models, while recognized for creating human-like text, can also "hallucinate" and produce seemingly plausible yet incorrect or nonsensical information. Interestingly, this tendency can be advantageous in creative tasks, as it generates a range of unique and imaginative ideas, sparking new perspectives and driving the creative process. However, this poses a challenge in situations where accuracy is critical, such as code reviews, insurance-related tasks, or research question responses.

One approach to mitigating hallucination is to provide documents as sources of information to the LLM and ask it to generate an answer based on the knowledge extracted from the document. This can help reduce the likelihood of hallucination, and users can verify the information with the source document.

Let's discuss the pros and cons of this approach:

Pros:

Reduced hallucination: By providing a source document, the LLM is more likely to generate content based on the given information, reducing the chances of creating false or irrelevant information.
Increased accuracy: With a reliable source document, the LLM can generate more accurate answers, especially in use cases where accuracy is crucial.
Verifiable information: Users can cross-check the generated content with the source document to ensure the information is accurate and reliable.

Cons:

Limited scope: Relying on a single document may limit the scope of the generated content, as the LLM will only have access to the information provided in the document.
Dependence on document quality: The accuracy of the generated content heavily depends on the quality and reliability of the source document. The LLM will likely generate incorrect or misleading content if the document contains inaccurate or biased information.
Inability to eliminate hallucination completely: Although providing a document as a base reduces the chances of hallucination, it does not guarantee that the LLM will never generate false or irrelevant information.

A second challenge is that LLMs have a maximum prompt size, so an entire document often cannot be passed to the model at once. This makes it necessary to divide documents into smaller parts, and Text Splitters are designed for exactly this. They break large text documents into smaller pieces that language models can process more effectively.

Using a Text Splitter can also improve vector store search results, as smaller segments might be more likely to match a query. Experimenting with different chunk sizes and overlaps can be beneficial in tailoring results to suit your specific needs.

Customizing Text Splitter

When handling lengthy pieces of text, it's crucial to break them down into manageable chunks. This seemingly simple task can quickly become complex, as keeping semantically related text segments intact is essential. The definition of "semantically related" may vary depending on the type of text. In this article, we'll explore various strategies to achieve this.

At a high level, text splitters follow these steps:

Divide the text into small, semantically meaningful chunks (often sentences).
Combine these small chunks into a larger one until a specific size is reached (determined by a particular function).
Once the desired size is attained, separate that chunk as an individual piece of text, then start forming a new chunk with some overlap to maintain context between segments.

Consequently, there are two primary dimensions to consider when customizing your text splitter:

The method used to split the text
The approach for measuring chunk size
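
The three steps above can be sketched in plain Python. This is only an illustrative sketch, not LangChain's implementation: the helper names split_sentences and merge_pieces are hypothetical, the sentence split is a naive regular expression, and a single piece longer than chunk_size is kept whole.

```python
import re
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Step 1: naive sentence split after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def merge_pieces(
    pieces: List[str],
    chunk_size: int,
    chunk_overlap: int,
    length_function: Callable[[str], int] = len,
) -> List[str]:
    """Steps 2 and 3: grow a chunk until it would exceed chunk_size, emit it,
    then start the next chunk with trailing pieces kept as overlap."""
    chunks: List[str] = []
    current: List[str] = []
    for piece in pieces:
        if current and length_function(" ".join(current + [piece])) > chunk_size:
            chunks.append(" ".join(current))
            # Drop leading pieces until the carried-over text fits within
            # chunk_overlap and still leaves room for the new piece.
            while current and (
                length_function(" ".join(current)) > chunk_overlap
                or length_function(" ".join(current + [piece])) > chunk_size
            ):
                current.pop(0)
        current.append(piece)
    if current:
        chunks.append(" ".join(current))
    return chunks


text = "One. Two two. Three three three. Four."
chunks = merge_pieces(split_sentences(text), chunk_size=30, chunk_overlap=10)
print(chunks)
# ['One. Two two.', 'Two two. Three three three.', 'Four.']
```

The two customization dimensions map directly onto this sketch: split_sentences is the splitting method, and length_function is the way chunk size is measured.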

Character Text Splitter

This type of splitter can be used in various scenarios where you must split long text pieces into smaller, semantically meaningful chunks. For example, you might use it to split a long article into smaller chunks for easier processing or analysis. The splitter allows you to customize the chunking process along two axes - chunk size and chunk overlap - to balance the trade-offs between splitting the text into manageable pieces and preserving semantic context between chunks.

Load the documents using the PyPDFLoader class. It requires the pypdf package, which you can install with pip (pip install -q pypdf). Also install the other required packages with the following command: pip install langchain==0.0.208 deeplake==3.9.27 openai==0.27.8 tiktoken. You can download a sample PDF file from the following link or use any PDF file that you have.

The One Page Linux Manual.pdf (94.3 KB)

from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader("The One Page Linux Manual.pdf")
pages = loader.load_and_split()

With the PDF loaded, we can ask more specific questions about its subject, which helps reduce the likelihood of LLM hallucinations and keeps responses grounded in the document. Next, split the loaded pages into chunks:

from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=20)
texts = text_splitter.split_documents(pages)

print(texts[0])

print(f"You have {len(texts)} documents")
print("Preview:")
print(texts[0].page_content)

The sample code.

page_content='THE ONE     PAGE LINUX MANUALA summary of useful Linux commands\nVersion 3.0 May 1999 squadron@powerup.com.au\nStarting & Stopping\nshutdown -h now Shutdown the system now and do not\nreboot\nhalt Stop all processes - same as above\nshutdown -r 5 Shutdown the system in 5 minutes and\nreboot\nshutdown -r now Shutdown the system now and reboot\nreboot Stop all processes and then reboot - same\nas above\nstartx Start the X system\nAccessing & mounting file systems\nmount -t iso9660 /dev/cdrom\n/mnt/cdromMount the device cdrom\nand call it cdrom under the\n/mnt directory\nmount -t msdos /dev/hdd\n/mnt/ddriveMount hard disk “d” as a\nmsdos ...' metadata={'source': 'The One Page Linux Manual.pdf', 'page': 0}
You have 2 documents

Preview:
THE ONE     PAGE LINUX MANUALA summary of useful Linux commands
Version 3.0 May 1999 squadron@powerup.com.au
Starting & Stopping
shutdown -h now Shutdown the system now and do not
reboot
halt Stop all processes - same as above
shutdown -r 5 Shutdown the system in 5 minutes and
reboot
shutdown -r now Shutdown the system now and reboot
reboot Stop all processes and then reboot - same
as above
startx Start the X system
Accessing & mounting file systems
mount -t iso9660 /dev/cdrom
...

The output.

No universal approach for chunking text will fit all scenarios - what's effective for one case might not be suitable for another. Finding the best chunk size for your project means going through a few steps. First, clean up your data by getting rid of anything that's not needed, like HTML tags from websites. Then, pick a few different chunk sizes to test. The best size will depend on what kind of data you're working with and the model you're using. Finally, test out how well each size works by running some queries and comparing the results. You might need to try a few different sizes before finding the best one. This process might take some time, but getting the best results from your project is worth it.
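
One way to run such a comparison is sketched below. Everything here is an assumption made for illustration: fixed_chunks and best_chunk are hypothetical helpers, the sample text and query are invented, and word overlap is a crude stand-in for the similarity search a vector store would perform.

```python
from typing import List


def fixed_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into character windows of chunk_size that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


def best_chunk(chunks: List[str], query: str) -> str:
    """Return the chunk sharing the most lowercase words with the query."""
    query_words = set(query.lower().split())
    return max(chunks, key=lambda c: len(query_words & set(c.lower().split())))


# Invented sample data, standing in for your cleaned documents and test queries.
sample = (
    "shutdown -h now stops the system. "
    "reboot restarts the system. "
    "mount attaches a file system to a directory."
)
query = "how do I mount a file system"

# Try a few candidate sizes and inspect which chunk each one retrieves.
for size in (20, 40, 80):
    chunks = fixed_chunks(sample, chunk_size=size, chunk_overlap=size // 4)
    print(size, len(chunks), repr(best_chunk(chunks, query)))
```

In a real project you would swap the word-overlap score for your actual retrieval pipeline and judge each chunk size over a set of representative queries.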

Recursive Character Text Splitter

The Recursive Character Text Splitter splits text into chunks based on a list of separator characters. It tries each separator in the list in order until the resulting chunks are small enough. By default, the list is ["\n\n", "\n", " ", ""], which tries to keep paragraphs, sentences, and words together as long as possible, as they are generally the most semantically related pieces of text. This means that the class first tries to split the text on two consecutive new-line characters. If the resulting chunks are still larger than the desired chunk size, it then splits them on a single new-line character, then on a space character, and finally on the empty string (that is, between individual characters), until the desired chunk size is achieved.
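
The fallback through the separator list can be illustrated with a small recursive function. This is a simplified sketch rather than LangChain's code: recursive_split is a hypothetical name, overlap is ignored, and the step that merges small neighboring pieces back together up to the chunk size is left out, so the pieces it returns are often smaller than what the real class produces.

```python
from typing import List


def recursive_split(text: str, chunk_size: int, separators: List[str]) -> List[str]:
    """Split on the first separator; recurse with the remaining separators
    into any piece that is still longer than chunk_size."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut between characters at fixed positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    if sep not in text:
        # This separator does not occur; fall through to the next one.
        return recursive_split(text, chunk_size, rest)
    chunks: List[str] = []
    for piece in text.split(sep):
        chunks.extend(recursive_split(piece, chunk_size, rest))
    return chunks


text = "alpha beta\ngamma\n\ndelta epsilon zeta"
print(recursive_split(text, chunk_size=12, separators=["\n\n", "\n", " ", ""]))
# ['alpha beta', 'gamma', 'delta', 'epsilon', 'zeta']
```

Here the first paragraph only needs the single new-line separator, while the second paragraph contains no new-line and falls through to the space separator.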

To use the RecursiveCharacterTextSplitter, you can create an instance of it and provide the following parameters:

chunk_size: The maximum size of the chunks, as measured by the length_function (default is 4000).

chunk_overlap: The maximum overlap between chunks to maintain continuity between them (default is 200).

length_function: The function used to calculate the length of a chunk. By default, it is set to len, which counts the number of characters in a chunk. However, you can also pass a token counter or any other function that calculates the length of a chunk based on your specific requirements.

Using a token counter instead of the default len function is useful when working with language models that have token limits. For example, OpenAI's gpt-3.5-turbo model has a context limit of 4,096 tokens per request, so you might want to count tokens instead of characters to better manage and optimize your requests.

Here's an example of how to use RecursiveCharacterTextSplitter.

from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

loader = PyPDFLoader("The One Page Linux Manual.pdf")
pages = loader.load_and_split()

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=50,
    chunk_overlap=10,
    length_function=len,
)

docs = text_splitter.split_documents(pages)
for doc in docs:
    print(doc)

page_content='THE ONE     PAGE LINUX MANUALA summary of useful' metadata={'source': 'The One Page Linux Manual.pdf', 'page': 0}
page_content='of useful Linux commands' metadata={'source': 'The One Page Linux Manual.pdf', 'page': 0}
page_content='Version 3.0 May 1999 squadron@powerup.com.au' metadata={'source': 'The One Page Linux Manual.pdf', 'page': 0}
page_content='Starting & Stopping' metadata={'source': 'The One Page Linux Manual.pdf', 'page': 0}
...
page_content='- includes' metadata={'source': 'The One Page Linux Manual.pdf', 'page': 1}
page_content='handy command summary. Visit:' metadata={'source': 'The One Page Linux Manual.pdf', 'page': 1}
page_content='www.powerup.com.au/~squadron' metadata={'source': 'The One Page Linux Manual.pdf', 'page': 1}

The output.

We created an instance of the RecursiveCharacterTextSplitter class with the desired parameters. The default list of characters to split by is ["\n\n", "\n", " ", ""].

The text is first split by two new-line characters (\n\n). Then, since the chunks are still larger than the desired chunk size (50), the class tries to split the output by a single new-line character (\n).

In this example, the text is loaded from a file, and the RecursiveCharacterTextSplitter is used to split it into chunks with a maximum size of 50 characters and an overlap of 10 characters. The output will be a list of documents containing the split text.

To use a token counter, you can create a custom function that calculates the number of tokens in a given text and pass it as the length_function parameter. The text splitter will then measure chunk length in tokens instead of characters. We will explore this concept in upcoming lessons.
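
To illustrate how the choice of length function changes where chunks end, here is a minimal sketch. Both helpers are hypothetical: approx_token_count is a rough regex-based stand-in (a real project would get the count from an actual tokenizer such as the one provided by the tiktoken package), and pack_words is a toy greedy splitter, not a LangChain class.

```python
import re
from typing import Callable, List


def approx_token_count(text: str) -> int:
    """Rough stand-in for a tokenizer: count words and punctuation marks."""
    return len(re.findall(r"\w+|[^\w\s]", text))


def pack_words(
    text: str, chunk_size: int, length_function: Callable[[str], int] = len
) -> List[str]:
    """Greedily pack whitespace-separated words into chunks whose measured
    length stays within chunk_size."""
    chunks: List[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if current and length_function(candidate) > chunk_size:
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


text = "Shutdown the system now, then reboot."

# Measured in characters: at most 20 characters per chunk.
print(pack_words(text, chunk_size=20))
# ['Shutdown the system', 'now, then reboot.']

# Measured in approximate tokens: at most 4 tokens per chunk.
print(pack_words(text, chunk_size=4, length_function=approx_token_count))
# ['Shutdown the system', 'now, then', 'reboot.']
```

The same text and the same splitter produce different boundaries once chunk_size is interpreted as a token budget rather than a character budget.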

NLTK Text Splitter

The NLTKTextSplitter in LangChain is an implementation of a text splitter that uses the Natural Language Toolkit (NLTK) library to split text based on tokenizers. The goal is to split long texts into smaller chunks without breaking the structure of sentences and paragraphs.

If this is your first time using this package, install the NLTK library with pip install -q nltk and run the following Python code to download the data that LangChain needs: import nltk; nltk.download('punkt')

from langchain.text_splitter import NLTKTextSplitter

# Load a long document
with open('/home/cloudsuperadmin/scrape-chain/langchain/LLM.txt', encoding= 'unicode_escape') as f:
    sample_text = f.read()

text_splitter = NLTKTextSplitter(chunk_size=500)
texts = text_splitter.split_text(sample_text)
print(texts)

['Building LLM applications for production\nApr 11, 2023 \x95 Chip Huyen text \n\nA question that I\x92ve has been asked a lot recently is how large language models (LLMs) will change machine learning workflows.\n\nAfter working with several companies who are working with LLM applications and personally going down a rabbit hole building my applications, I realized two things:\n\nIt\x92s easy to make something cool with LLMs, but very hard to make something production-ready with them.', 'LLM limitations are exacerbated by a lack of engineering rigor in prompt engineering, partially due to the ambiguous nature of natural languages, and partially due to the nascent nature of the field.\n\nThis post consists of three parts .\n\nPart 1 discusses the key challenges of productionizing LLM applications and the solutions that I\x92ve seen.\n\nPart 2[…]

Note, however, that the NLTKTextSplitter is not designed to segment English text that lacks spaces between words. For that purpose, you can use alternative libraries such as pyenchant or wordsegment.

SpacyTextSplitter

The SpacyTextSplitter splits large text documents into smaller chunks of a specified size, using spaCy's sentence segmentation to decide where to split. It is an alternative to NLTK-based sentence splitting, and it requires the spacy package together with a spaCy language model (en_core_web_sm by default). You create a SpacyTextSplitter object by specifying the chunk_size parameter, which is measured by the length function passed to it and defaults to counting characters.

from langchain.text_splitter import SpacyTextSplitter

# Load a long document
with open('/home/cloudsuperadmin/scrape-chain/langchain/LLM.txt', encoding= 'unicode_escape') as f:
    sample_text = f.read()

# Instantiate the SpacyTextSplitter with the desired chunk size
text_splitter = SpacyTextSplitter(chunk_size=500, chunk_overlap=20)

# Split the text using SpacyTextSplitter
texts = text_splitter.split_text(sample_text)

# Print the first chunk
print(texts[0])

The sample code.

Building LLM applications for production
Apr 11, 2023  Chip Huyen text

A question that I've been asked a lot recently is how large language models (LLMs) will change machine learning workflows.

After working with several companies who are working with LLM applications and personally going down a rabbit hole building my applications, I realized two things:

Its easy to make something cool with LLMs, but very hard to make something production-ready with them.

The output.

MarkdownTextSplitter

The MarkdownTextSplitter is designed to split Markdown text along structural elements such as headers, code blocks, or dividers. It is implemented as a simple subclass of RecursiveCharacterTextSplitter with Markdown-specific separators. By default, these separators are determined by the Markdown syntax, but they can be customized by providing a list of characters during the initialization of the MarkdownTextSplitter instance. The chunk size is measured by the length function passed in, which defaults to counting characters. To customize the chunk size, provide an integer value when initializing an instance.

from langchain.text_splitter import MarkdownTextSplitter

markdown_text = """
# 

# Welcome to My Blog!

## Introduction
Hello everyone! My name is **John Doe** and I am a _software developer_. I specialize in Python, Java, and JavaScript.

Here's a list of my favorite programming languages:

1. Python
2. JavaScript
3. Java

You can check out some of my projects on [GitHub](https://github.com).

## About this Blog
In this blog, I will share my journey as a software developer. I'll post tutorials, my thoughts on the latest technology trends, and occasional book reviews.

Here's a small piece of Python code to say hello:

\``` python
def say_hello(name):
    print(f"Hello, {name}!")

say_hello("John")
\```

Stay tuned for more updates!

## Contact Me
Feel free to reach out to me on [Twitter](https://twitter.com) or send me an email at johndoe@email.com.

"""

markdown_splitter = MarkdownTextSplitter(chunk_size=100, chunk_overlap=0)
docs = markdown_splitter.create_documents([markdown_text])

print(docs)

The sample code.

[Document(page_content='# \n\n# Welcome to My Blog!', metadata={}), Document(page_content='Introduction', metadata={}), 
Document(page_content='Hello everyone! My name is **John Doe** and I am a _software developer_. I specialize in Python,', metadata={}), 
Document(page_content='Java, and JavaScript.', metadata={}), Document(page_content="Here's a list of my favorite programming languages:\n\n1. Python\n2. JavaScript\n3. Java", metadata={}), 
Document(page_content='You can check out some of my projects on [GitHub](https://github.com).', metadata={}), 
Document(page_content='About this Blog', metadata={}), 
Document(page_content="In this blog, I will share my journey as a software developer. I'll post tutorials, my thoughts on", metadata={}), 
Document(page_content='the latest technology trends, and occasional book reviews.', metadata={}), 
Document(page_content="Here's a small piece of Python code to say hello:", metadata={}), Document(page_content='\\```python\ndef say_hello(name):\n    print(f"Hello, {name}!")\n\nsay_hello("John")\n\\', metadata={}), 
Document(page_content='Stay tuned for more updates!', metadata={}), Document(page_content='Contact Me', metadata={}), 
Document(page_content='Feel free to reach out to me on [Twitter](https://twitter.com) or send me an email at', metadata={}), 
Document(page_content='johndoe@email.com.', metadata={})]

The output.

The MarkdownTextSplitter offers a practical solution for dividing text while preserving the structure and meaning provided by Markdown formatting. By recognizing the Markdown syntax (e.g., headings, lists, and code blocks), you can intelligently divide the content based on its structure and hierarchy, resulting in more semantically coherent chunks. This splitter is especially valuable when managing extensive Markdown documents.
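
The idea of splitting on structure rather than on raw character counts can be shown with a very small sketch. It is not how MarkdownTextSplitter is implemented: split_on_headers is a hypothetical helper that handles only headers, ignores chunk size, and would also split on lines starting with "# " inside code blocks.

```python
import re
from typing import List


def split_on_headers(markdown: str) -> List[str]:
    """Split before every line that starts with one to six '#' and a space."""
    parts = re.split(r"(?m)^(?=#{1,6} )", markdown)
    return [part.strip() for part in parts if part.strip()]


markdown = (
    "# Title\n"
    "Intro text.\n"
    "\n"
    "## Section A\n"
    "Body A.\n"
    "\n"
    "## Section B\n"
    "Body B.\n"
)

for section in split_on_headers(markdown):
    print(repr(section))
# '# Title\nIntro text.'
# '## Section A\nBody A.'
# '## Section B\nBody B.'
```

Each resulting piece keeps a header together with the text beneath it, which is the kind of semantic grouping a character-count split cannot guarantee.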

TokenTextSplitter

The main advantage of using TokenTextSplitter over other text splitters, like CharacterTextSplitter, is that it respects the token boundaries, ensuring that the chunks do not split tokens in the middle. This can be particularly helpful in maintaining the semantic integrity of the text when working with language models and embeddings.

This type of splitter breaks raw text strings into smaller pieces by first converting the text into BPE (Byte Pair Encoding) tokens and then dividing these tokens into chunks. It then reassembles the tokens within each chunk back into text. The tiktoken Python package is required for using this class (pip install -q tiktoken).

from langchain.text_splitter import TokenTextSplitter

# Load a long document
with open('/home/cloudsuperadmin/scrape-chain/langchain/LLM.txt', encoding= 'unicode_escape') as f:
    sample_text = f.read()

# Initialize the TokenTextSplitter with desired chunk size and overlap
text_splitter = TokenTextSplitter(chunk_size=100, chunk_overlap=50)

# Split into smaller chunks
texts = text_splitter.split_text(sample_text)
print(texts[0])

The sample code.

Building LLM applications for production
Apr 11, 2023  Chip Huyen text

A question that I've been asked a lot recently is how large language models (LLMs) will change machine learning workflows. After working with several companies who are working with LLM applications and personally going down a rabbit hole building my applications, I realized two things:

It’s easy to make something cool with LLMs, but very hard to make something with production.

The output.

The chunk_size parameter sets the maximum number of BPE tokens in each chunk, while chunk_overlap defines the number of overlapping tokens between adjacent chunks. By modifying these parameters, you can fine-tune the granularity of the text chunks.
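
The interaction between the two parameters can be seen in a sliding-window sketch over a list of token IDs. The function chunk_tokens is hypothetical, and plain integers stand in for the BPE token IDs that a tokenizer would produce; the real splitter additionally encodes the text to tokens first and decodes each window back to text.

```python
from typing import List, Sequence


def chunk_tokens(
    tokens: Sequence[int], chunk_size: int, chunk_overlap: int
) -> List[List[int]]:
    """Slide a window of chunk_size tokens, advancing by
    chunk_size - chunk_overlap tokens each time."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks: List[List[int]] = []
    start = 0
    while start < len(tokens):
        chunks.append(list(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the window already reached the end of the input
        start += step
    return chunks


token_ids = list(range(10))  # placeholder token IDs 0..9
print(chunk_tokens(token_ids, chunk_size=4, chunk_overlap=2))
# [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```

With an overlap of half the chunk size, as in the example above, every token except those at the very start and end appears in two chunks, which increases the number of chunks to embed and store.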

One potential drawback of using TokenTextSplitter is that it may require additional computation when converting text to BPE tokens and back. If you need a faster and simpler text-splitting method, you might consider using CharacterTextSplitter, which directly splits the text based on character count, offering a more straightforward approach to text segmentation.

RECAP:

Text splitters are essential for managing long text, improving language model processing efficiency, and enhancing vector store search results. Customizing text splitters involves selecting the splitting method and measuring chunk size.

CharacterTextSplitter is a simple example that balances manageable chunk sizes against the preservation of semantic context. Experimenting with different chunk sizes and overlaps tailors the results to specific use cases.

RecursiveCharacterTextSplitter focuses on preserving semantic relationships while offering customizable chunk sizes and overlaps.

NLTKTextSplitter uses the Natural Language Toolkit library to split text along sentence boundaries. SpacyTextSplitter uses the popular spaCy library to split text based on linguistic features. MarkdownTextSplitter is tailored for Markdown-formatted text, ensuring content is split meaningfully according to the syntax. Lastly, TokenTextSplitter splits on BPE tokens, offering a fine-grained approach to text segmentation.

Conclusion

Selecting the appropriate text splitter depends on the specific requirements and nature of the text you are working with, ensuring optimal results for your text processing tasks.

In the next lesson, we’ll learn more about how word embeddings work and how embedding models are used with indexers in LangChain.

You can find the code of this lesson in this online Notebook.