PdfChunker Documentation
1. Introduction
The PdfChunker module is a specialized tool designed to split PDF text content into smaller, more manageable chunks. It is a valuable asset for processing PDF documents in natural language processing and text analysis tasks.
This documentation provides a comprehensive guide on how to use the PdfChunker module. It covers its purpose, parameters, and usage, ensuring that you can effectively process PDF text content.
2. Overview
The PdfChunker module serves a critical role in handling PDF text content, which is often lengthy and complex. Key features and parameters of the PdfChunker module include:
- separators: Specifies a list of ChunkSeparator objects used to split the PDF text content into chunks.
- tokenizer: Defines the tokenizer used for counting tokens in the text.
- max_tokens: Sets the maximum token limit for each chunk.
By using the PdfChunker, you can efficiently prepare PDF text content for further analysis and processing.
3. Installation
Before using the PdfChunker module, ensure you have the required dependencies installed. The module relies on the swarms library. You can install this dependency using pip:
```bash
pip install swarms
```
4. Usage
This section explains how to use the PdfChunker module effectively. The module centers on the PdfChunker class, and the examples below demonstrate its usage.
4.1. PdfChunker Class
The PdfChunker class is the core component of the PdfChunker module. It is used to create a PdfChunker instance, which can split PDF text content into manageable chunks.
Parameters:
- separators (list[ChunkSeparator]): Specifies a list of ChunkSeparator objects used to split the PDF text content into chunks.
- tokenizer (OpenAITokenizer): Defines the tokenizer used for counting tokens in the text.
- max_tokens (int): Sets the maximum token limit for each chunk.
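To make the interaction between tokenizer and max_tokens concrete, here is a self-contained, stand-in sketch (not the swarms API; count_tokens and pack_sentences are hypothetical simplifications): a chunker keeps appending pieces of text to the current chunk until the tokenizer's count would exceed max_tokens.

```python
def count_tokens(text: str) -> int:
    # Stand-in for an OpenAITokenizer-style counter: one token per word.
    return len(text.split())

def pack_sentences(sentences: list[str], max_tokens: int) -> list[str]:
    """Greedily pack sentences into chunks of at most max_tokens tokens."""
    chunks, current = [], ""
    for sentence in sentences:
        candidate = (current + " " + sentence).strip()
        if current and count_tokens(candidate) > max_tokens:
            # Adding this sentence would exceed the budget; start a new chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

print(pack_sentences(["one two", "three four", "five six"], max_tokens=4))
# → ['one two three four', 'five six']
```

The real PdfChunker delegates counting to the configured tokenizer, so token limits reflect the model's actual vocabulary rather than word counts.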
4.2. Examples
Let's explore how to use the PdfChunker class with different scenarios and applications.
Example 1: Basic Chunking
```python
from swarms.chunkers.pdf_chunker import PdfChunker

# Initialize the PdfChunker with default settings
pdf_chunker = PdfChunker()

# PDF text content to be chunked
pdf_text = "This is a PDF document with multiple paragraphs and sentences. It should be split into smaller chunks for analysis."

# Chunk the PDF text content
chunks = pdf_chunker.chunk(pdf_text)

# Print the generated chunks
for idx, chunk in enumerate(chunks, start=1):
    print(f"Chunk {idx}:\n{chunk.value}")
```
Example 2: Custom Separators
```python
from swarms.chunkers.pdf_chunker import PdfChunker
from swarms.chunkers.chunk_seperator import ChunkSeparator

# Define custom separators for PDF chunking
custom_separators = [ChunkSeparator("\n\n"), ChunkSeparator(". ")]

# Initialize the PdfChunker with custom separators
pdf_chunker = PdfChunker(separators=custom_separators)

# PDF text content with custom separators
pdf_text = "This PDF document has custom paragraph separators.\n\nIt also uses period-based sentence separators. Split accordingly."

# Chunk the PDF text content
chunks = pdf_chunker.chunk(pdf_text)

# Print the generated chunks
for idx, chunk in enumerate(chunks, start=1):
    print(f"Chunk {idx}:\n{chunk.value}")
```
Example 3: Adjusting Maximum Tokens
```python
from swarms.chunkers.pdf_chunker import PdfChunker

# Initialize the PdfChunker with a custom maximum token limit
pdf_chunker = PdfChunker(max_tokens=50)

# Lengthy PDF text content
pdf_text = "This is an exceptionally long PDF document that should be broken into smaller chunks based on token count."

# Chunk the PDF text content
chunks = pdf_chunker.chunk(pdf_text)

# Print the generated chunks
for idx, chunk in enumerate(chunks, start=1):
    print(f"Chunk {idx}:\n{chunk.value}")
```
4.3. Additional Features
The PdfChunker class also provides additional features:
Recursive Chunking
The _chunk_recursively method handles the recursive chunking of PDF text content, ensuring that each chunk stays within the token limit.
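The recursive idea can be sketched in a few lines of self-contained code. This is a hypothetical simplification, not the swarms implementation of _chunk_recursively: split on the highest-priority separator first, then recurse on any piece that still exceeds the token limit.

```python
def count_tokens(text: str) -> int:
    # Stand-in tokenizer: approximate tokens by whitespace-separated words.
    return len(text.split())

def chunk_recursively(text: str, separators: list[str], max_tokens: int) -> list[str]:
    """Split text on the first separator; recurse on pieces still too large."""
    if count_tokens(text) <= max_tokens:
        return [text]
    if not separators:
        # No separators left to try; return the oversized text as-is.
        return [text]
    head, *rest = separators
    pieces = [p for p in text.split(head) if p.strip()]
    if len(pieces) <= 1:
        # This separator did not split anything; fall through to the next one.
        return chunk_recursively(text, rest, max_tokens)
    chunks = []
    for piece in pieces:
        chunks.extend(chunk_recursively(piece, separators, max_tokens))
    return chunks

print(chunk_recursively("One two three. Four five six. Seven eight nine.", [". "], 4))
# → ['One two three', 'Four five six', 'Seven eight nine.']
```

Because separators are tried in order, coarse separators (paragraph breaks) produce chunks first, and finer ones (sentence breaks) are only applied to pieces that are still over the limit.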
5. Additional Information
- PDF Text Chunking: The PdfChunker module is a specialized tool for splitting PDF text content into manageable chunks, making it suitable for natural language processing tasks involving PDF documents.
- Custom Separators: You can customize separators to adapt the chunking process to specific document structures.
- Token Count: The module accurately counts tokens using the specified tokenizer, ensuring that chunks do not exceed token limits.
6. Conclusion
The PdfChunker module is a valuable asset for processing PDF text content in various natural language processing and text analysis tasks. This documentation has provided a comprehensive guide on its usage, parameters, and examples, ensuring that you can effectively prepare PDF documents for further analysis and processing.
By using the PdfChunker, you can efficiently break down lengthy and complex PDF text content into manageable chunks, making it ready for in-depth analysis.
Please check the official swarms repository and documentation for the latest updates to the PdfChunker module.