
# Documentation for swarms.tokenizers.BaseTokenizer

## Overview and Introduction

The swarms.tokenizers library is designed to provide flexible and efficient tokenization utilities for natural language processing (NLP) tasks. The BaseTokenizer class serves as a foundational abstract class from which specific tokenizer implementations can be derived. This class outlines essential functions and properties all tokenizers should have, ensuring consistency and capturing common behaviors required for processing textual data.

## Class Definition: BaseTokenizer

### Attributes and Methods

| Name                | Type                             | Description                                                            |
|---------------------|----------------------------------|------------------------------------------------------------------------|
| `max_tokens`        | `int`                            | Maximum number of tokens the tokenizer can process.                   |
| `stop_token`        | `str`                            | Token used to denote the end of processing.                           |
| `stop_sequences`    | `List[str]` (read-only)          | List of stop sequences initialized post-instantiation.                |
| `count_tokens_left` | Method: `(text) -> int`          | Computes the number of tokens that can still be added given the text. |
| `count_tokens`      | Abstract method: `(text) -> int` | Returns the number of tokens in the given text.                       |

## Functionality and Usage

The `BaseTokenizer` class provides the structure for creating tokenizers. It includes methods for counting the tokens in a given text and for determining how many more tokens can be added without exceeding the `max_tokens` limit. The class is meant to be subclassed, and each subclass must implement `count_tokens` with its own token-counting logic.
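
For orientation, the interface implied by the table above can be sketched as follows. This is a minimal sketch consistent with the documented attributes, not the library's exact source: the dataclass layout, the `__post_init__` initialization of `stop_sequences`, and the default `stop_token` value are assumptions.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List, Union


@dataclass
class BaseTokenizer(ABC):
    # Maximum number of tokens the tokenizer can process.
    max_tokens: int
    # Token used to denote the end of processing (placeholder default).
    stop_token: str = "<|end|>"

    def __post_init__(self):
        # Read-only list of stop sequences, initialized post-instantiation.
        self.stop_sequences: List[str] = [self.stop_token]

    def count_tokens_left(self, text: Union[str, List[dict]]) -> int:
        # Tokens that can still be added without exceeding max_tokens.
        return max(0, self.max_tokens - self.count_tokens(text))

    @abstractmethod
    def count_tokens(self, text: Union[str, List[dict]]) -> int:
        # Subclasses supply the concrete token-counting logic.
        ...
```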

### Example: Subclassing BaseTokenizer

```python
from typing import List, Union

from swarms.tokenizers import BaseTokenizer


class SimpleTokenizer(BaseTokenizer):
    def count_tokens(self, text: Union[str, List[dict]]) -> int:
        if isinstance(text, str):
            # Split text by spaces as a simple tokenization approach
            return len(text.split())
        elif isinstance(text, list):
            # Assume a list of dictionaries with a 'token' key
            return sum(len(item["token"].split()) for item in text)
        else:
            raise TypeError("Unsupported type for text")


# Usage example
tokenizer = SimpleTokenizer(max_tokens=100)
text = "This is an example sentence to tokenize."
print(tokenizer.count_tokens(text))  # Outputs: 7 (assuming space tokenization)
remaining_tokens = tokenizer.count_tokens_left(text)
print(remaining_tokens)  # Outputs: 93
```
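
The `List[dict]` branch of `count_tokens` can be exercised the same way. Note that the `"token"` key is specific to this illustrative `SimpleTokenizer`, not a requirement of `BaseTokenizer` itself.

```python
# Counting tokens over a list of dictionaries, handled by the second
# branch of SimpleTokenizer.count_tokens above.
messages = [
    {"token": "Hello world"},
    {"token": "How are you today"},
]
print(tokenizer.count_tokens(messages))       # Outputs: 6
print(tokenizer.count_tokens_left(messages))  # Outputs: 94
```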

Note:

The `stop_sequences` and `stop_token` attributes in this implementation are placeholders that illustrate the pattern; the actual logic may differ depending on the requirements of a specific tokenizer.
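
As one possible illustration of how these attributes could be used, the hypothetical helper below truncates generated text at the first stop sequence it finds. The function `truncate_at_stop` is not part of `BaseTokenizer`; it is included only to show the pattern.

```python
def truncate_at_stop(tokenizer: "BaseTokenizer", text: str) -> str:
    # Hypothetical helper: cut the text at the earliest occurrence of
    # any stop sequence known to the tokenizer.
    cut = len(text)
    for seq in getattr(tokenizer, "stop_sequences", [tokenizer.stop_token]):
        idx = text.find(seq)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]
```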

## Additional Information and Tips

- Tokenization is a vital step in text processing for NLP and should be tailored to the requirements of the application.
- Ensure that tokenizer definitions are in sync with the models and datasets being used; one way to do this is to delegate token counting to the model's own tokenizer, as sketched below.
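
For example, a tokenizer aligned with an OpenAI model could delegate counting to `tiktoken`. This is a sketch under the assumption that `tiktoken` is installed and that the target model name is known; the `TiktokenTokenizer` class and its `"content"`-key handling are illustrative, not part of `swarms.tokenizers`.

```python
from typing import List, Union

import tiktoken

from swarms.tokenizers import BaseTokenizer


class TiktokenTokenizer(BaseTokenizer):
    """Counts tokens using the encoding of a specific OpenAI model."""

    model_name: str = "gpt-4"  # assumption: set to the model actually in use

    def count_tokens(self, text: Union[str, List[dict]]) -> int:
        encoding = tiktoken.encoding_for_model(self.model_name)
        if isinstance(text, str):
            return len(encoding.encode(text))
        # Assume chat-style dictionaries with a 'content' key.
        return sum(len(encoding.encode(item.get("content", ""))) for item in text)


tokenizer = TiktokenTokenizer(max_tokens=4096)
print(tokenizer.count_tokens("Counting with the model's own encoding."))
```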

## References and Resources

For a deeper understanding of tokenization and its role in NLP, refer to: