PdfChunker Documentation
1. Introduction
The PdfChunker module splits PDF text content into smaller chunks that stay within a token limit. It is intended for natural language processing and text analysis tasks that work with PDF documents.
This documentation describes the purpose, parameters, and usage of the PdfChunker module.
2. Overview
PDF text content is often lengthy and complex, and the PdfChunker module splits it into pieces that are easier to process. Key parameters of the PdfChunker module include:
- separators: Specifies a list of ChunkSeparator objects used to split the PDF text content into chunks.
- tokenizer: Defines the tokenizer used for counting tokens in the text.
- max_tokens: Sets the maximum token limit for each chunk.
By using the PdfChunker, you can efficiently prepare PDF text content for further analysis and processing.
3. Installation
Before using the PdfChunker module, ensure you have the required dependencies installed. The module relies on the swarms library. You can install this dependency using pip:
pip install swarms
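After installing, it can be useful to confirm that the package is visible to the Python interpreter you intend to use. The following is a minimal sketch using only the standard library; the helper name is_installed is hypothetical and not part of swarms.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package can be found on the import path."""
    # find_spec returns None when no top-level module with this name exists.
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    # Reports whether the swarms package is importable in this environment.
    print("swarms installed:", is_installed("swarms"))
```

If this reports False, the package was most likely installed into a different environment than the one running the script.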
4. Usage
This section describes the PdfChunker class and provides examples that demonstrate its usage.
4.1. PdfChunker Class
The PdfChunker class is the core component of the PdfChunker module. A PdfChunker instance splits PDF text content into manageable chunks.
Parameters:
- separators (list[ChunkSeparator]): Specifies a list of ChunkSeparator objects used to split the PDF text content into chunks.
- tokenizer (OpenAITokenizer): Defines the tokenizer used for counting tokens in the text.
- max_tokens (int): Sets the maximum token limit for each chunk.
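To illustrate how the tokenizer and max_tokens parameters relate, here is a small self-contained sketch. It does not use the swarms library: WhitespaceTokenizer, its count_tokens method, and the fits helper are hypothetical stand-ins, and counting whitespace-separated words is only a rough approximation of what a real tokenizer such as OpenAITokenizer would report.

```python
from typing import Protocol


class Tokenizer(Protocol):
    """Assumed minimal interface: anything that can count tokens in a string."""

    def count_tokens(self, text: str) -> int: ...


class WhitespaceTokenizer:
    """Hypothetical stand-in that treats each whitespace-separated word as one token."""

    def count_tokens(self, text: str) -> int:
        return len(text.split())


def fits(text: str, tokenizer: Tokenizer, max_tokens: int) -> bool:
    """Return True if the text is within the token limit for a single chunk."""
    return tokenizer.count_tokens(text) <= max_tokens


tokenizer = WhitespaceTokenizer()
sample = "This is a PDF document with multiple paragraphs and sentences."

print(tokenizer.count_tokens(sample))
print(fits(sample, tokenizer, max_tokens=50))
print(fits(sample, tokenizer, max_tokens=5))
```

A chunk is acceptable only when its token count is at most max_tokens; text that fails this check has to be split further using the separators.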
4.2. Examples
Let's explore how to use the PdfChunker class with different scenarios and applications.
Example 1: Basic Chunking
from swarms.chunkers.pdf_chunker import PdfChunker
# Initialize the PdfChunker
pdf_chunker = PdfChunker()
# PDF text content to be chunked
pdf_text = "This is a PDF document with multiple paragraphs and sentences. It should be split into smaller chunks for analysis."
# Chunk the PDF text content
chunks = pdf_chunker.chunk(pdf_text)
# Print the generated chunks
for idx, chunk in enumerate(chunks, start=1):
    print(f"Chunk {idx}:\n{chunk.value}")
Example 2: Custom Separators
from swarms.chunkers.pdf_chunker import PdfChunker
from swarms.chunkers.chunk_seperator import ChunkSeparator
# Define custom separators for PDF chunking
custom_separators = [ChunkSeparator("\n\n"), ChunkSeparator(". ")]
# Initialize the PdfChunker with custom separators
pdf_chunker = PdfChunker(separators=custom_separators)
# PDF text content with custom separators
pdf_text = "This PDF document has custom paragraph separators.\n\nIt also uses period-based sentence separators. Split accordingly."
# Chunk the PDF text content
chunks = pdf_chunker.chunk(pdf_text)
# Print the generated chunks
for idx, chunk in enumerate(chunks, start=1):
    print(f"Chunk {idx}:\n{chunk.value}")
Example 3: Adjusting Maximum Tokens
from swarms.chunkers.pdf_chunker import PdfChunker
# Initialize the PdfChunker with a custom maximum token limit
pdf_chunker = PdfChunker(max_tokens=50)
# Lengthy PDF text content (the sentence is repeated so the text exceeds the 50-token limit)
pdf_text = " ".join(["This is an exceptionally long PDF document that should be broken into smaller chunks based on token count."] * 10)
# Chunk the PDF text content
chunks = pdf_chunker.chunk(pdf_text)
# Print the generated chunks
for idx, chunk in enumerate(chunks, start=1):
    print(f"Chunk {idx}:\n{chunk.value}")
4.3. Additional Features
The PdfChunker class also provides additional features:
Recursive Chunking
The _chunk_recursively method handles the recursive chunking of PDF text content, ensuring that each chunk stays within the token limit.
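The general idea of recursive chunking can be sketched as follows. This is not the library's _chunk_recursively implementation; it is a simplified, self-contained illustration under stated assumptions: separators are plain strings tried in priority order, tokens are approximated by whitespace-separated words, and the function names are hypothetical. As a simplification, a separator is dropped at the boundary between two chunks.

```python
from typing import List, Sequence


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words stand in for real tokenizer tokens.
    return len(text.split())


def chunk_recursively(text: str, separators: Sequence[str], max_tokens: int) -> List[str]:
    """Split text on the first separator, falling back to later ones for oversized pieces."""
    text = text.strip()
    if not text:
        return []
    # Stop when the text fits, or when no separators remain (the chunk may then be oversized).
    if count_tokens(text) <= max_tokens or not separators:
        return [text]

    head, rest = separators[0], separators[1:]
    pieces = [p for p in text.split(head) if p.strip()]
    if len(pieces) <= 1:
        # This separator does not divide the text; try the next one.
        return chunk_recursively(text, rest, max_tokens)

    chunks: List[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + head + piece
        if count_tokens(candidate) <= max_tokens:
            # Greedily merge adjacent pieces while they still fit.
            current = candidate
            continue
        if current:
            chunks.append(current.strip())
        if count_tokens(piece) <= max_tokens:
            current = piece
        else:
            # The piece alone is too large: split it with the finer separators.
            current = ""
            chunks.extend(chunk_recursively(piece, rest, max_tokens))
    if current:
        chunks.append(current.strip())
    return chunks


text = "Alpha beta gamma. Delta epsilon.\n\nZeta eta theta iota kappa lambda."
result = chunk_recursively(text, ["\n\n", ". ", " "], max_tokens=3)
for idx, chunk in enumerate(result, start=1):
    print(f"Chunk {idx}: {chunk}")
```

Ordering the separators from coarse (paragraph breaks) to fine (single spaces) keeps chunks aligned with the document's structure where possible and only cuts mid-sentence when nothing coarser fits.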
5. Additional Information
- PDF Text Chunking: The PdfChunker module is a specialized tool for splitting PDF text content into manageable chunks, making it suitable for natural language processing tasks involving PDF documents.
- Custom Separators: You can customize separators to adapt the PDF text content chunking process to specific document structures.
- Token Count: The module counts tokens using the specified tokenizer, ensuring that chunks do not exceed token limits.
6. Conclusion
The PdfChunker module supports processing PDF text content in natural language processing and text analysis tasks. This documentation has covered its parameters, usage, and examples.
By using the PdfChunker, you can break down lengthy and complex PDF text content into manageable chunks that are ready for further analysis.
Please check the official PdfChunker repository and documentation for any updates.