Mastering Text Splitting for Optimal Language Model Performance

In the fast-evolving field of natural language processing, efficient text splitting is crucial to maximizing language model performance. At our software factory, we excel in employing sophisticated text splitting techniques to enhance retrieval processes and optimize application outcomes. Here’s an in-depth look at the five levels of text splitting and how they contribute to superior language model performance. 

The Five Levels of Text Splitting 

1. Character Splitting 

Character splitting is the most basic level: text is cut into chunks of a fixed number of characters, without regard for sentence or paragraph boundaries. This method is straightforward but often inefficient for understanding and retrieval, since chunks can break words and ideas apart. It's primarily useful for quick prototypes or very fine-grained tasks that require detailed analysis at the character level.
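
To make this concrete, here is a minimal sketch in plain Python; the chunk size and sample text are arbitrary choices for illustration:

```python
def character_split(text: str, chunk_size: int = 100) -> list[str]:
    """Cut text into fixed-size character chunks, ignoring structure."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

sample = "Character splitting ignores sentence and paragraph boundaries entirely."
print(character_split(sample, chunk_size=20))
# Chunks can cut words in half, which is why this level is rarely used on its own.
```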

  

2. Recursive Character Splitting 

Recursive character splitting creates chunks of a target character count, with overlap between consecutive chunks so context carries across boundaries. Rather than cutting blindly, it applies a hierarchy of separators recursively, leveraging natural text structures like paragraphs and sentences to group related information together and preserve semantic coherence.
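
A rough sketch using LangChain's RecursiveCharacterTextSplitter is shown below, assuming the langchain-text-splitters package is installed; the chunk size, overlap, separator list, and input file are illustrative assumptions:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,       # target maximum characters per chunk
    chunk_overlap=50,     # overlap between consecutive chunks to preserve context
    separators=["\n\n", "\n", ". ", " ", ""],  # tried in order, coarsest first
)

with open("article.txt", encoding="utf-8") as f:
    text = f.read()

chunks = splitter.split_text(text)
print(f"{len(chunks)} chunks; first chunk:\n{chunks[0]}")
```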

  

3. Document-Specific Splitting 

Document-specific techniques utilize separators that are native to the format, such as headers in Markdown or function and class definitions in Python code, enhancing chunking accuracy and relevance. This method adapts to different document types, improving data processing and preparation for language model tasks. By tailoring the splitting process to the document's format, it ensures that each chunk is contextually relevant and specific to the document type. For example, in Markdown documents, headers can serve as natural breakpoints, while in code documents, logical segments like functions or classes can be used.
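
For Markdown, one common option is LangChain's MarkdownHeaderTextSplitter, sketched below; the header mapping and the sample document are assumptions made for illustration:

```python
from langchain_text_splitters import MarkdownHeaderTextSplitter

markdown_text = """# Guide
## Installation
Run the install command first.
## Usage
Call split_text on your document.
"""

# Split on level-1 and level-2 headers, keeping the header text as chunk metadata.
splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "Header 1"), ("##", "Header 2")]
)
docs = splitter.split_text(markdown_text)
for doc in docs:
    print(doc.metadata, "->", doc.page_content)
```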

  

4. Semantic Splitting 

Semantic chunking involves advanced techniques like hierarchical clustering with positional reward and finding break points between sequential sentences using embeddings. This method focuses on understanding the context and relationships between sentences, ensuring more effective chunking based on the actual content and meaning of the text. Hierarchical clustering groups sentences that are semantically related, creating coherent chunks that maintain the narrative flow. Embeddings comparison involves analyzing the semantic similarity between sentences to identify natural breakpoints, ensuring that chunks are meaningful and contextually appropriate. 

LlamaIndex provides an out-of-the-box SemanticSplitterNodeParser class, so there is no need to implement this from scratch. Documentation reference: https://docs.llamaindex.ai/en/stable/api_reference/node_parsers/semantic_splitter/?h=semantic#llama_index.core.node_parser.SemanticSplitterNodeParser
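
A minimal usage sketch follows, assuming the llama-index package with OpenAI embeddings is installed, an API key is configured, and ./report.pdf stands in for your own document; the buffer size and percentile threshold are illustrative:

```python
from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.openai import OpenAIEmbedding

documents = SimpleDirectoryReader(input_files=["./report.pdf"]).load_data()

splitter = SemanticSplitterNodeParser(
    buffer_size=1,                       # sentences grouped together when embedding
    breakpoint_percentile_threshold=95,  # distance percentile at which a new chunk starts
    embed_model=OpenAIEmbedding(),
)
nodes = splitter.get_nodes_from_documents(documents)
print(f"Produced {len(nodes)} semantically coherent nodes")
```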

  

5. Agentic Chunking 

Agentic chunking uses a language model to decide how text should be chunked into manageable parts. It first extracts standalone propositions from sentences, which improves readability and downstream processing. Building a custom 'agentic chunker' is a practical way to apply the concept, efficiently chunking text and improving overall text processing.

The idea is to emulate how a human would chunk a document by following these steps: 

1. Get a scratch piece of paper or notepad. 

2. Start at the top of the document, assuming the first part will be a chunk. 

3. Continuously evaluate each new sentence or piece to determine if it should be part of the current chunk. If not, start a new chunk. 

4. Repeat this process until the end of the document. 

The agentic chunker dynamically adjusts text chunks based on the context and relevance, improving comprehension and organization. This approach involves using advanced language models to understand the text deeply, making intelligent decisions about where to split the text to maximize clarity and coherence. 
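
Below is a highly simplified sketch of that loop, assuming the OpenAI Python client is installed and an API key is configured; the model name, prompt wording, and sentence splitting are illustrative, and a full agentic chunker would also extract propositions and maintain chunk summaries:

```python
import re
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def belongs_to_current_chunk(chunk: str, sentence: str) -> bool:
    """Ask the model whether the new sentence continues the topic of the current chunk."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[
            {"role": "system",
             "content": "Answer only YES or NO: does the new sentence belong to "
                        "the same topic as the current chunk?"},
            {"role": "user",
             "content": f"Current chunk:\n{chunk}\n\nNew sentence:\n{sentence}"},
        ],
    )
    return response.choices[0].message.content.strip().upper().startswith("YES")

def agentic_chunk(text: str) -> list[str]:
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], sentences[0]
    for sentence in sentences[1:]:
        if belongs_to_current_chunk(current, sentence):
            current += " " + sentence   # the sentence continues the current chunk
        else:
            chunks.append(current)      # close the chunk and start a new one
            current = sentence
    chunks.append(current)
    return chunks
```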

  

Enhancing Model Performance with Text Splitting 

Advanced Techniques to Enhance Semantic Chunking 

Hierarchical Clustering 

Groups semantically related sentences using positional rewards. This method creates chunks that maintain the semantic coherence of the text, ensuring that related ideas are grouped together. 
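
One way to realise this is to add a position-based penalty to the semantic distance matrix before clustering, as in the sketch below; the sentence-transformers model, the 0.5 weight, the distance threshold, and the use of recent scikit-learn are all illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics.pairwise import cosine_distances
from sentence_transformers import SentenceTransformer

sentences = [
    "The model was trained on financial reports.",
    "Training took three days on a single GPU.",
    "Quarterly revenue grew by 12 percent.",
    "Net income also improved year over year.",
]

embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(sentences)
semantic = cosine_distances(embeddings)  # pairwise semantic distances

# Positional reward: sentences that are far apart in the document are penalised,
# so the resulting clusters tend to stay contiguous.
n = len(sentences)
positions = np.arange(n)
positional_penalty = np.abs(positions[:, None] - positions[None, :]) / n
distance = semantic + 0.5 * positional_penalty

clustering = AgglomerativeClustering(
    n_clusters=None, distance_threshold=0.8,
    metric="precomputed", linkage="average",
)
labels = clustering.fit_predict(distance)
for label, sentence in zip(labels, sentences):
    print(label, sentence)
```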

Experimental Chunking 

This technique measures the embedding distance between each pair of consecutive units, for example sentences, and treats outliers in those distances as chunk breakpoints. By analyzing the distances between sentences, this method identifies natural breakpoints, creating chunks that are both meaningful and easy to process. 
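
A sketch of this idea, assuming the sentence-transformers library is installed; the embedding model and the 90th-percentile threshold are illustrative choices:

```python
import re
import numpy as np
from sentence_transformers import SentenceTransformer

def breakpoint_chunks(text: str, percentile: float = 90) -> list[str]:
    """Start a new chunk wherever the distance between consecutive sentences is an outlier."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    emb = SentenceTransformer("all-MiniLM-L6-v2").encode(sentences)
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)

    # Cosine distance between each sentence and the one that follows it.
    distances = 1 - np.sum(emb[:-1] * emb[1:], axis=1)
    threshold = np.percentile(distances, percentile)

    chunks, current = [], [sentences[0]]
    for sentence, dist in zip(sentences[1:], distances):
        if dist > threshold:            # outlier distance, so start a new chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks
```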

Chunk Summarization 

Analyzes propositions to generate summaries and update chunk metadata, maintaining content relevance. This process involves summarizing the key propositions in each chunk, ensuring that the summary accurately reflects the content of the chunk. 
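
As a rough sketch, assuming the OpenAI Python client; the model name and prompt are illustrative, and in practice the metadata would be refreshed whenever a new proposition is added to the chunk:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def summarize_chunk(propositions: list[str]) -> dict:
    """Generate a title and short summary for a chunk, stored as its metadata."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[
            {"role": "system",
             "content": "Given the propositions of a chunk, reply with a one-line "
                        "title, then a newline, then a two-sentence summary."},
            {"role": "user", "content": "\n".join(propositions)},
        ],
    )
    title, _, summary = response.choices[0].message.content.partition("\n")
    return {"title": title.strip(), "summary": summary.strip()}
```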

Hierarchical Node Parser 

A hierarchical node parser enhances semantic search by extracting structured information from text. It breaks down a document into a hierarchy of nodes, each representing a section, subsection, or individual point within the text. This structure allows for more precise search and retrieval by focusing on the specific context of each node. The parser ensures that the hierarchical relationships within the document are maintained, enabling more accurate matches during searches. 

LlamaIndex also helps here by providing an implementation of this type of node parser: https://docs.llamaindex.ai/en/stable/api_reference/node_parsers/hierarchical/#llama_index.core.node_parser.HierarchicalNodeParser 
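
A minimal usage sketch, assuming the llama-index package is installed and ./handbook.pdf stands in for your own document; the three chunk sizes are illustrative defaults:

```python
from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import HierarchicalNodeParser, get_leaf_nodes

documents = SimpleDirectoryReader(input_files=["./handbook.pdf"]).load_data()

# Build three levels of nodes: large sections, medium subsections, and small leaf chunks.
parser = HierarchicalNodeParser.from_defaults(chunk_sizes=[2048, 512, 128])
nodes = parser.get_nodes_from_documents(documents)
leaf_nodes = get_leaf_nodes(nodes)
print(f"{len(nodes)} nodes in total, {len(leaf_nodes)} leaf nodes for retrieval")
```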

Graph Structure Extraction 

Graph structure extraction involves creating a graph representation of a document to better understand and visualize the relationships between different entities and concepts within the text. This method uses tools like the Diffbot graph transformer to identify entities and their connections, forming a network of nodes and edges. The graph structure aids in answering complex questions and analyzing the interplay between different parts of the text, providing a richer, more interconnected view of the information. 
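
A rough sketch using LangChain's experimental Diffbot graph transformer, assuming the langchain-experimental package is installed and a valid Diffbot API token is available; the sample sentence is illustrative:

```python
from langchain_core.documents import Document
from langchain_experimental.graph_transformers.diffbot import DiffbotGraphTransformer

transformer = DiffbotGraphTransformer(diffbot_api_key="YOUR_DIFFBOT_API_KEY")

docs = [Document(page_content=(
    "Marie Curie discovered polonium and radium and was awarded "
    "the Nobel Prize in Physics in 1903."
))]

# Each graph document holds the extracted entities (nodes) and relations (edges).
graph_documents = transformer.convert_to_graph_documents(docs)
for graph_doc in graph_documents:
    print("Entities:", [node.id for node in graph_doc.nodes])
    print("Relations:", [(rel.source.id, rel.type, rel.target.id)
                         for rel in graph_doc.relationships])
```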

  

Conclusion 

Effective text splitting is essential for building robust retrieval systems. By understanding and implementing the appropriate level of text splitting, we can optimize our search systems for both accuracy and efficiency.  

In our experience, there is no splitting technique that outperforms the others in all scenarios: which technique gives the best results depends on the prompt. This is why it is a good idea to split your documents using several different splitters, resolve the prompt against each of them, and finally let the LLM (or a human) decide which answer is best.
