You can not select more than 25 topics
Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
147 lines
5.7 KiB
147 lines
5.7 KiB
# PdfChunker Documentation
|
|
|
|
## Table of Contents
|
|
1. [Introduction](#introduction)
|
|
2. [Overview](#overview)
|
|
3. [Installation](#installation)
|
|
4. [Usage](#usage)
|
|
1. [PdfChunker Class](#pdfchunker-class)
|
|
2. [Examples](#examples)
|
|
5. [Additional Information](#additional-information)
|
|
6. [Conclusion](#conclusion)
|
|
|
|
---
|
|
|
|
## 1. Introduction <a name="introduction"></a>
|
|
|
|
The `PdfChunker` module is a specialized tool designed to split PDF text content into smaller, more manageable chunks. It is a valuable asset for processing PDF documents in natural language processing and text analysis tasks.
|
|
|
|
This documentation provides a comprehensive guide on how to use the `PdfChunker` module. It covers its purpose, parameters, and usage, ensuring that you can effectively process PDF text content.
|
|
|
|
---
|
|
|
|
## 2. Overview <a name="overview"></a>
|
|
|
|
The `PdfChunker` module serves a critical role in handling PDF text content, which is often lengthy and complex. Key features and parameters of the `PdfChunker` module include:
|
|
|
|
- `separators`: Specifies a list of `ChunkSeparator` objects used to split the PDF text content into chunks.
|
|
- `tokenizer`: Defines the tokenizer used for counting tokens in the text.
|
|
- `max_tokens`: Sets the maximum token limit for each chunk.
|
|
|
|
By using the `PdfChunker`, you can efficiently prepare PDF text content for further analysis and processing.
|
|
|
|
---
|
|
|
|
## 3. Installation <a name="installation"></a>
|
|
|
|
Before using the `PdfChunker` module, ensure you have the required dependencies installed. The module relies on the `swarms` library. You can install this dependency using pip:
|
|
|
|
```bash
|
|
pip install swarms
|
|
```
|
|
|
|
---
|
|
|
|
## 4. Usage <a name="usage"></a>
|
|
|
|
In this section, we'll explore how to use the `PdfChunker` module effectively. It consists of the `PdfChunker` class and provides examples to demonstrate its usage.
|
|
|
|
### 4.1. `PdfChunker` Class <a name="pdfchunker-class"></a>
|
|
|
|
The `PdfChunker` class is the core component of the `PdfChunker` module. It is used to create a `PdfChunker` instance, which can split PDF text content into manageable chunks.
|
|
|
|
#### Parameters:
|
|
- `separators` (list[ChunkSeparator]): Specifies a list of `ChunkSeparator` objects used to split the PDF text content into chunks.
|
|
- `tokenizer` (OpenAITokenizer): Defines the tokenizer used for counting tokens in the text.
|
|
- `max_tokens` (int): Sets the maximum token limit for each chunk.
|
|
|
|
### 4.2. Examples <a name="examples"></a>
|
|
|
|
Let's explore how to use the `PdfChunker` class with different scenarios and applications.
|
|
|
|
#### Example 1: Basic Chunking
|
|
|
|
```python
|
|
from swarms.chunkers.pdf_chunker import PdfChunker
|
|
from swarms.chunkers.chunk_seperator import ChunkSeparator
|
|
|
|
# Initialize the PdfChunker
|
|
pdf_chunker = PdfChunker()
|
|
|
|
# PDF text content to be chunked
|
|
pdf_text = "This is a PDF document with multiple paragraphs and sentences. It should be split into smaller chunks for analysis."
|
|
|
|
# Chunk the PDF text content
|
|
chunks = pdf_chunker.chunk(pdf_text)
|
|
|
|
# Print the generated chunks
|
|
for idx, chunk in enumerate(chunks, start=1):
|
|
print(f"Chunk {idx}:\n{chunk.value}")
|
|
```
|
|
|
|
#### Example 2: Custom Separators
|
|
|
|
```python
|
|
from swarms.chunkers.pdf_chunker import PdfChunker
|
|
from swarms.chunkers.chunk_seperator import ChunkSeparator
|
|
|
|
# Define custom separators for PDF chunking
|
|
custom_separators = [ChunkSeparator("\n\n"), ChunkSeparator(". ")]
|
|
|
|
# Initialize the PdfChunker with custom separators
|
|
pdf_chunker = PdfChunker(separators=custom_separators)
|
|
|
|
# PDF text content with custom separators
|
|
pdf_text = "This PDF document has custom paragraph separators.\n\nIt also uses period-based sentence separators. Split accordingly."
|
|
|
|
# Chunk the PDF text content
|
|
chunks = pdf_chunker.chunk(pdf_text)
|
|
|
|
# Print the generated chunks
|
|
for idx, chunk in enumerate(chunks, start=1):
|
|
print(f"Chunk {idx}:\n{chunk.value}")
|
|
```
|
|
|
|
#### Example 3: Adjusting Maximum Tokens
|
|
|
|
```python
|
|
from swarms.chunkers.pdf_chunker import PdfChunker
|
|
|
|
# Initialize the PdfChunker with a custom maximum token limit
|
|
pdf_chunker = PdfChunker(max_tokens=50)
|
|
|
|
# Lengthy PDF text content
|
|
pdf_text = "This is an exceptionally long PDF document that should be broken into smaller chunks based on token count."
|
|
|
|
# Chunk the PDF text content
|
|
chunks = pdf_chunker.chunk(pdf_text)
|
|
|
|
# Print the generated chunks
|
|
for idx, chunk in enumerate(chunks, start=1):
|
|
print(f"Chunk {idx}:\n{chunk.value}")
|
|
```
|
|
|
|
### 4.3. Additional Features
|
|
|
|
The `PdfChunker` class also provides additional features:
|
|
|
|
#### Recursive Chunking
|
|
The `_chunk_recursively` method handles the recursive chunking of PDF text content, ensuring that each chunk stays within the token limit.
|
|
|
|
---
|
|
|
|
## 5. Additional Information <a name="additional-information"></a>
|
|
|
|
- **PDF Text Chunking**: The `PdfChunker` module is a specialized tool for splitting PDF text content into manageable chunks, making it suitable for natural language processing tasks involving PDF documents.
|
|
- **Custom Separators**: You can customize separators to adapt the PDF text content chunking process to specific document structures.
|
|
- **Token Count**: The module accurately counts tokens using the specified tokenizer, ensuring that chunks do not exceed token limits.
|
|
|
|
---
|
|
|
|
## 6. Conclusion <a name="conclusion"></a>
|
|
|
|
The `PdfChunker` module is a valuable asset for processing PDF text content in various natural language processing and text analysis tasks. This documentation has provided a comprehensive guide on its usage, parameters, and examples, ensuring that you can effectively prepare PDF documents for further analysis and processing.
|
|
|
|
By using the `PdfChunker`, you can efficiently break down lengthy and complex PDF text content into manageable chunks, making it ready for in-depth analysis.
|
|
|
|
*Please check the official `PdfChunker` repository and documentation for any updates beyond the knowledge cutoff date.* |