PdfChunker Documentation

Table of Contents

  1. Introduction
  2. Overview
  3. Installation
  4. Usage
    1. PdfChunker Class
    2. Examples
    3. Additional Features
  5. Additional Information
  6. Conclusion

1. Introduction

The PdfChunker module is a specialized tool designed to split PDF text content into smaller, more manageable chunks. It is a valuable asset for processing PDF documents in natural language processing and text analysis tasks.

This documentation provides a comprehensive guide on how to use the PdfChunker module. It covers its purpose, parameters, and usage, ensuring that you can effectively process PDF text content.


2. Overview

The PdfChunker module serves a critical role in handling PDF text content, which is often lengthy and complex. Key features and parameters of the PdfChunker module include:

  • separators: Specifies a list of ChunkSeparator objects used to split the PDF text content into chunks.
  • tokenizer: Defines the tokenizer used for counting tokens in the text.
  • max_tokens: Sets the maximum token limit for each chunk.

By using the PdfChunker, you can efficiently prepare PDF text content for further analysis and processing.


3. Installation

The PdfChunker module is distributed as part of the swarms library, so installing swarms is the only setup step. You can install it using pip:

pip install swarms

4. Usage

This section describes the PdfChunker class and then walks through examples that demonstrate its usage.

4.1. PdfChunker Class

The PdfChunker class is the core component of the PdfChunker module. An instance holds the separators, tokenizer, and token limit, and its chunk method splits PDF text content into manageable chunks.

Parameters:

  • separators (list[ChunkSeparator]): Specifies a list of ChunkSeparator objects used to split the PDF text content into chunks.
  • tokenizer (OpenAITokenizer): Defines the tokenizer used for counting tokens in the text.
  • max_tokens (int): Sets the maximum token limit for each chunk.

4.2. Examples

Let's explore how to use the PdfChunker class with different scenarios and applications.

Example 1: Basic Chunking

from swarms.chunkers.pdf_chunker import PdfChunker

# Initialize the PdfChunker
pdf_chunker = PdfChunker()

# PDF text content to be chunked
pdf_text = "This is a PDF document with multiple paragraphs and sentences. It should be split into smaller chunks for analysis."

# Chunk the PDF text content
chunks = pdf_chunker.chunk(pdf_text)

# Print the generated chunks
for idx, chunk in enumerate(chunks, start=1):
    print(f"Chunk {idx}:\n{chunk.value}")

Example 2: Custom Separators

from swarms.chunkers.pdf_chunker import PdfChunker
from swarms.chunkers.chunk_seperator import ChunkSeparator

# Define custom separators for PDF chunking
custom_separators = [ChunkSeparator("\n\n"), ChunkSeparator(". ")]

# Initialize the PdfChunker with custom separators
pdf_chunker = PdfChunker(separators=custom_separators)

# PDF text content with custom separators
pdf_text = "This PDF document has custom paragraph separators.\n\nIt also uses period-based sentence separators. Split accordingly."

# Chunk the PDF text content
chunks = pdf_chunker.chunk(pdf_text)

# Print the generated chunks
for idx, chunk in enumerate(chunks, start=1):
    print(f"Chunk {idx}:\n{chunk.value}")

Example 3: Adjusting Maximum Tokens

from swarms.chunkers.pdf_chunker import PdfChunker

# Initialize the PdfChunker with a custom maximum token limit
pdf_chunker = PdfChunker(max_tokens=50)

# Lengthy PDF text content
pdf_text = "This is an exceptionally long PDF document that should be broken into smaller chunks based on token count."

# Chunk the PDF text content
chunks = pdf_chunker.chunk(pdf_text)

# Print the generated chunks
for idx, chunk in enumerate(chunks, start=1):
    print(f"Chunk {idx}:\n{chunk.value}")

4.3. Additional Features

The PdfChunker class also provides additional features:

Recursive Chunking

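The _chunk_recursively method splits PDF text content recursively until each chunk stays within the token limit. Its leading underscore marks it as internal by Python convention, so typical code calls chunk, as in the examples above, rather than invoking it directly.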
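
The idea behind recursive chunking can be sketched in plain Python. This is a minimal illustration and not the library's implementation: the function name chunk_recursively, the whitespace-based count_tokens stand-in, and the split-near-the-middle strategy are all assumptions made for this example.

```python
from typing import List


def count_tokens(text: str) -> int:
    # Assumption: a stand-in tokenizer that treats each whitespace-separated
    # word as one token. A real tokenizer would count differently.
    return len(text.split())


def chunk_recursively(text: str, separators: List[str], max_tokens: int) -> List[str]:
    """Split text until every chunk has at most max_tokens tokens (sketch)."""
    text = text.strip()
    if not text:
        return []
    if count_tokens(text) <= max_tokens:
        return [text]

    # Try separators from coarsest to finest.
    for i, sep in enumerate(separators):
        parts = [p for p in text.split(sep) if p.strip()]
        if len(parts) < 2:
            continue  # this separator does not divide the text
        # Split near the middle and recurse on each half. The separator at
        # the split point is dropped; separators inside each half are kept.
        mid = len(parts) // 2
        left = sep.join(parts[:mid])
        right = sep.join(parts[mid:])
        remaining = separators[i:]
        return (chunk_recursively(left, remaining, max_tokens)
                + chunk_recursively(right, remaining, max_tokens))

    # No separator applies: fall back to a hard split on word boundaries.
    words = text.split()
    return [" ".join(words[j:j + max_tokens])
            for j in range(0, len(words), max_tokens)]


sample = "One two three. Four five six.\n\nSeven eight nine ten. Eleven twelve."
for idx, chunk in enumerate(chunk_recursively(sample, ["\n\n", ". "], max_tokens=4), start=1):
    print(f"Chunk {idx}: {chunk}")
```

Ordering the separators from coarse (paragraph breaks) to fine (sentence breaks) keeps related text together for as long as the token limit allows.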


5. Additional Information

  • PDF Text Chunking: The PdfChunker module is a specialized tool for splitting PDF text content into manageable chunks, making it suitable for natural language processing tasks involving PDF documents.
  • Custom Separators: You can customize separators to adapt the PDF text content chunking process to specific document structures.
  • Token Count: The module counts tokens with the specified tokenizer and uses that count to keep each chunk within the max_tokens limit.
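
The token-limit check from the Token Count point can be illustrated with a short, self-contained sketch. The regex-based regex_token_count function and the oversized_chunks helper are hypothetical names introduced for this example. They are not part of the swarms API, and a real tokenizer will generally report different counts.

```python
import re
from typing import Callable, List


def regex_token_count(text: str) -> int:
    # Assumption: each word and each punctuation mark counts as one token.
    return len(re.findall(r"\w+|[^\w\s]", text))


def oversized_chunks(
    chunks: List[str],
    max_tokens: int,
    count_tokens: Callable[[str], int] = regex_token_count,
) -> List[int]:
    """Return the indices of chunks whose token count exceeds max_tokens."""
    return [i for i, chunk in enumerate(chunks) if count_tokens(chunk) > max_tokens]


chunks = ["Short chunk.", "This chunk is noticeably longer than the first one."]
print(oversized_chunks(chunks, max_tokens=5))
```

Because the limit is expressed in tokens rather than characters, the same text can pass or fail the check depending on which tokenizer is supplied.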

6. Conclusion

The PdfChunker module is a valuable asset for processing PDF text content in various natural language processing and text analysis tasks. This documentation has provided a comprehensive guide on its usage, parameters, and examples, ensuring that you can effectively prepare PDF documents for further analysis and processing.

By using the PdfChunker, you can efficiently break down lengthy and complex PDF text content into manageable chunks, making it ready for in-depth analysis.

Please check the official repository and documentation for the latest updates to the PdfChunker module.