[FEATS][swarms.tokenizers][swarms.artifacts][swarms.loaders][swarms.chunkers]

pull/348/head
Kye 11 months ago
parent 7c4d663edd
commit ff5406dd50

@ -0,0 +1,123 @@
# AnthropicTokenizer Documentation
## Introduction
This documentation intends to provide a complete and in-depth guide for using the `AnthropicTokenizer` class within the `swarms.tokenizers` library. The `AnthropicTokenizer` is designed specifically to interface with Anthropic's AI models, primarily used for text tokenization and metadata handling.
Understanding how to use this tokenizer effectively is crucial for developers and researchers working with natural language processing, machine learning, and text analysis using Anthropic AI models.
The purpose of the `AnthropicTokenizer` is to convert raw text into a sequence of tokens that can be fed into Anthropic AI models for various tasks. Tokenization is a fundamental step in text processing pipelines and affects the performance of AI models.
## Class Definition: AnthropicTokenizer
`AnthropicTokenizer` extends the functionality of a base tokenizer to provide features specifically needed for Anthropic AI models. The class is designed to manage tokenization processes such as counting tokens and ensuring that the token count is under a specified limit, which is essential for effective and efficient model performance.
**Class Signature:**
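`AnthropicTokenizer` is implemented as a dataclass extending `BaseTokenizer`; the signature below mirrors the definition introduced later in this commit.
```python
from dataclasses import dataclass

from anthropic import Anthropic

from swarms.tokenizers.base_tokenizer import BaseTokenizer


@dataclass
class AnthropicTokenizer(BaseTokenizer):
    max_tokens: int = 500
    client: Anthropic = None
    model: str = "claude-2.1"
```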
**Parameters:**
| Parameter Name | Type | Description | Default Value |
|---------------------------|-----------|-----------------------------------------------------------------|---------------|
| `max_tokens` | `int` | Maximum number of tokens permitted. | `500` |
| `client` | `Anthropic` | Instance of an `Anthropic` client for tokenization services. | `None` |
| `model` | `str` | Identifier for the Anthropic model in use. | `"claude-2.1"`|
**Methods and their descriptions:**
| Method Name | Return Type | Description |
|----------------------|-------------|--------------------------------------------------------------|
| `__post_init__` | `None` | Initializes default parameters and client instance. |
| `default_max_tokens` | `int` | Returns the default maximum number of tokens. |
| `count_tokens` | `int` | Counts tokens in the input text. Raises a ValueError if the input is not a string. |
## Architecture and Mechanics
Upon instantiation, `AnthropicTokenizer` initializes its `max_tokens` limit and sets up a client to interact with the Anthropic services. The client is responsible for providing tokenization functions critical for processing the text inputs.
The tokenizer employs a dictionary to map specific model prefixes to their maximum token counts. This allows users to adapt the tokenizer's behavior to different models with varying token limits. The `default_max_tokens()` method dynamically retrieves the token limit based on the provided model name, ensuring compatibility and flexibility.
`count_tokens()` is a critical function that calculates the number of tokens in a given text. This functionality is essential for respecting the model's token limit and ensuring accurate processing by the Anthropic AI.
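To make the prefix lookup concrete, the sketch below reproduces the `MODEL_PREFIXES_TO_MAX_TOKENS` mapping and the matching logic used by `default_max_tokens()` in this commit's implementation.
```python
MODEL_PREFIXES_TO_MAX_TOKENS = {
    "claude-2.1": 200000,
    "claude": 100000,
}


def default_max_tokens(model: str) -> int:
    # Return the limit of the first prefix that the model name starts with.
    return next(
        limit
        for prefix, limit in MODEL_PREFIXES_TO_MAX_TOKENS.items()
        if model.startswith(prefix)
    )


print(default_max_tokens("claude-2.1"))        # 200000
print(default_max_tokens("claude-instant-1"))  # 100000
```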
## Usage Examples
Before delving into detailed examples, make sure you have `swarms.tokenizers` installed and ready. Because `anthropic` is an optional dependency, ensure it is installed as well.
### 1. Tokenizing with Default Settings
```python
from swarms.tokenizers import AnthropicTokenizer
# Initialize the tokenizer with default settings
tokenizer = AnthropicTokenizer()
# Tokenize a sample text
text = "Hello world! This is an example text to tokenize."
token_count = tokenizer.count_tokens(text)
print(f"Number of tokens: {token_count}")
```
In this example, we use the `AnthropicTokenizer` to count the number of tokens in a simple text. The token count can be crucial for managing inputs to the AI model.
### 2. Tokenizing with Custom Model
```python
from swarms.tokenizers import AnthropicTokenizer
# Define a custom model
custom_model = "claude"
# Initialize the tokenizer with a custom model and max_tokens
tokenizer = AnthropicTokenizer(model=custom_model, max_tokens=1000)
# Process a larger text
large_text = "..." # Assume large_text is a string with meaningful content
token_count = tokenizer.count_tokens(large_text)
if token_count > tokenizer.max_tokens:
print("Text exceeds the maximum token limit.")
else:
print(f"Token count within limit: {token_count}")
```
This snippet demonstrates setting up the tokenizer for a custom model and a higher maximum token limit. It is helpful when dealing with texts larger than the default token limit.
### 3. Handling Error in Token Count Function
```python
from swarms.tokenizers import AnthropicTokenizer
# Initialize the tokenizer
tokenizer = AnthropicTokenizer()
# Attempt to tokenize a non-string input (which will raise an error)
non_string_input = ["This", "is", "a", "list", "not", "a", "string"]
try:
tokenizer.count_tokens(non_string_input)
except ValueError as e:
print(f"Error: {e}")
```
This example illustrates the error management within the `count_tokens` method. It is important to handle exceptions gracefully, particularly when a non-string input is provided.
## Additional Tips and Considerations
- Always ensure the input text is a string before calling `count_tokens` to avoid unnecessary errors.
- Be aware of the `max_tokens` limit since larger models might have significantly higher limits than defaults.
- When tokenizing large datasets, batch processing with a loop or parallelization can improve throughput; a minimal sketch follows this list.
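The snippet below is a minimal illustration of that batching tip; the document list and the printed summary are assumptions made for the example.
```python
from swarms.tokenizers import AnthropicTokenizer

tokenizer = AnthropicTokenizer()
documents = [
    "First document to analyze...",
    "Second document to analyze...",
    "Third document to analyze...",
]

# Count tokens per document and flag anything over the configured limit.
counts = [tokenizer.count_tokens(doc) for doc in documents]
for doc, count in zip(documents, counts):
    if count > tokenizer.max_tokens:
        print(f"Over limit ({count} tokens): {doc[:30]}...")

print(f"Total tokens across the batch: {sum(counts)}")
```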
## Resources and References
Given that `AnthropicTokenizer` interacts with an AI model and optional dependencies, it is beneficial to refer to the official documentation and guides specific to those components:
- [Anthropic Model Documentation](#) (Link would be replaced with actual URL)
- [swarms.tokenizers Installation Guide](#)
- [Python `dataclasses` Documentation](https://docs.python.org/3/library/dataclasses.html)
By following the provided documentation and recommended practices, developers and researchers can harness the power of `AnthropicTokenizer` to its full potential, facilitating optimal use of Anthropic's AI models for varied text processing tasks.

@ -0,0 +1,62 @@
# Documentation for `swarms.tokenizers.BaseTokenizer`
## Overview and Introduction
The `swarms.tokenizers` library is designed to provide flexible and efficient tokenization utilities for natural language processing (NLP) tasks. The `BaseTokenizer` class serves as a foundational abstract class from which specific tokenizer implementations can be derived. This class outlines essential functions and properties all tokenizers should have, ensuring consistency and capturing common behaviors required for processing textual data.
## Class Definition: `BaseTokenizer`
### Attributes and Methods
| Name | Type | Description |
| ---------------------- | ------------------------------- | ------------------------------------------------------------------------- |
| `max_tokens` | `int` | Maximum number of tokens the tokenizer can process. |
| `stop_token` | `str` | Token used to denote the end of processing. |
| `stop_sequences` | `List[str]` (read-only) | List of stop sequences initialized post-instantiation. |
| `count_tokens_left` | Method: `(text) -> int` | Computes the number of tokens that can still be added given the text. |
| `count_tokens` | Abstract Method: `(text) -> int`| Returns the number of tokens in the given text. |
## Functionality and Usage
The `BaseTokenizer` class provides the structure for creating tokenizers. It includes methods for counting the tokens in a given text and determining how many more tokens can be added without exceeding the `max_tokens` limit. This class should be subclassed, and the `count_tokens` method must be implemented in subclasses to provide the specific token counting logic.
### Example: Subclassing `BaseTokenizer`
```python
from typing import List, Union

from swarms.tokenizers import BaseTokenizer
class SimpleTokenizer(BaseTokenizer):
def count_tokens(self, text: Union[str, List[dict]]) -> int:
if isinstance(text, str):
# Split text by spaces as a simple tokenization approach
return len(text.split())
elif isinstance(text, list):
# Assume list of dictionaries with 'token' key
return sum(len(item['token'].split()) for item in text)
else:
raise TypeError("Unsupported type for text")
# Usage example
tokenizer = SimpleTokenizer(max_tokens=100)
text = "This is an example sentence to tokenize."
print(tokenizer.count_tokens(text)) # Outputs: 7 (assuming space tokenization)
remaining_tokens = tokenizer.count_tokens_left(text)
print(remaining_tokens) # Outputs: 93
```
### Note:
Understand that the `stop_sequences` and `stop_token` in this particular implementation are placeholders to illustrate the pattern. The actual logic may differ based on specific tokenizer requirements.
## Additional Information and Tips
- Tokenization is a vital step in text processing for NLP. It should be tailored to the requirements of the application.
- Ensure that tokenizer definitions are in sync with the models and datasets being used.
## References and Resources
For a deeper understanding of tokenization and its role in NLP, refer to:
- [Natural Language Processing (NLP) in Python — Tokenization](https://nlp.stanford.edu/IR-book/html/htmledition/tokenization-1.html)
- [Hugging Face Tokenizers](https://huggingface.co/docs/tokenizers/python/latest/) - a popular library for tokenization, particularly in the context of transformer models.

@ -0,0 +1,134 @@
# CohereTokenizer Documentation
The `CohereTokenizer` class is designed to interface with Cohere language models and provides methods for tokenizing text inputs. This tokenizer plays a crucial role in preparing data for a Cohere model, which operates on tokens rather than raw text.
---
### Class Name: `CohereTokenizer`
## Overview
The `CohereTokenizer` class is essential for interacting with Cohere models that require tokenized input. As models often operate on tokens, having an intuitive and efficient tokenizer directly linked to the model simplifies preprocessing tasks. This tokenizer counts the tokens in the given text, helping users to manage and understand the tokens they can work with, given limitations like the model's maximum token count.
## Architecture and How the Class Works
The `CohereTokenizer` is built as a data class, ensuring that it is lightweight and focused solely on its data attributes and methods related to tokenization. The class relies on an instance of a Cohere `Client`, which needs to be instantiated with an API key from Cohere before use.
Upon instantiation, the `CohereTokenizer` holds a reference to a specific Cohere model and interfaces with the `Client` to tokenize text accordingly. It provides a simple utility (`count_tokens`) to count the number of tokens that a string, or a list of strings, would be broken down into by the Cohere API.
## Purpose and Usage
The `CohereTokenizer` is specifically made for users who are working with Cohere language models. It's designed to help them in preprocessing steps by converting text into tokens and determining how many tokens their text segments contain. This is crucial for ensuring that inputs do not exceed the model's maximum token count, as exceeding this limit can result in errors or truncated text.
---
# Class Definition
```python
@dataclass
class CohereTokenizer:
model: str
client: Client
DEFAULT_MODEL: str = "command"
DEFAULT_MAX_TOKENS: int = 2048
max_tokens: int = DEFAULT_MAX_TOKENS
```
## Parameters
| Parameter | Type | Description | Default Value |
| ------------------ | -------------- | ------------------------------------------------------------- | ------------- |
| `model`            | `str`          | Specifies the Cohere model to be used for tokenization.       | Required      |
| `client`           | `Client`       | An instance of the Cohere client, initialized with an API key.| Required      |
| `DEFAULT_MODEL` | `str` | The default model to use if none is specified. | "command" |
| `DEFAULT_MAX_TOKENS`| `int` | Default maximum number of tokens the model accepts. | 2048 |
| `max_tokens` | `int` | Maximum number of tokens; it can be altered to fit the model. | `DEFAULT_MAX_TOKENS`|
### Methods
The `CohereTokenizer` class contains the following method:
#### `count_tokens`
```python
def count_tokens(self, text: str | list) -> int:
"""
Count the number of tokens in the given text.
Args:
text (str | list): The input text to tokenize.
Returns:
int: The number of tokens in the text.
Raises:
ValueError: If the input text is not a string.
"""
```
---
# Functionality and Usage Example
Below are examples demonstrating how to use `CohereTokenizer`.
---
## Counting Tokens
### Initialization
First, the Cohere client must be initialized and passed in to create an instance of `CohereTokenizer`.
```python
from cohere import Client
from swarms.tokenizers import CohereTokenizer
# Initialize Cohere client with your API key
cohere_client = Client('your-api-key')
# Instantiate the tokenizer
tokenizer = CohereTokenizer(model='your-model-name', client=cohere_client)
```
### Count Tokens Example 1
Counting tokens for a single string.
```python
text_to_tokenize = "Hello, World!"
token_count = tokenizer.count_tokens(text_to_tokenize)
print(f"Number of tokens: {token_count}")
```
### Count Tokens Example 2
Trying to pass a list instead of a single string, which would raise an error.
```python
texts_to_tokenize = ["Hello, World!", "Another piece of text."]
try:
token_count = tokenizer.count_tokens(texts_to_tokenize)
except ValueError as e:
print(f"Error: {e}")
```
The above code would print `Error: Text must be a string.` as the `count_tokens` function expects a string, not a list.
---
# Additional Information and Tips
When working with the `CohereTokenizer`, here are some key points to keep in mind:
- The token count is important to know because Cohere models have a maximum token limit for input. If your text exceeds this limit, it must be split or truncated before being passed to the model; a minimal truncation sketch follows this list.
- It is always a good practice to catch exceptions when using methods like `count_tokens` to handle unexpected inputs gracefully.
- Remember to replace `'your-api-key'` and `'your-model-name'` with your actual Cohere API key and desired model name.
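The sketch below shows one way to trim an input so it fits within `tokenizer.max_tokens` before calling the model; the word-by-word trimming strategy is an illustrative assumption, not a Cohere API feature.
```python
def truncate_to_limit(text: str, tokenizer) -> str:
    """Drop trailing words until the text fits within tokenizer.max_tokens."""
    words = text.split()
    while words and tokenizer.count_tokens(" ".join(words)) > tokenizer.max_tokens:
        words.pop()
    return " ".join(words)


long_document = "..."  # assume a long string loaded elsewhere
safe_text = truncate_to_limit(long_document, tokenizer)
print(tokenizer.count_tokens(safe_text))
```
Popping one word at a time keeps the sketch simple; for very long inputs, a binary search over the word count would reduce the number of tokenization calls.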
# References and Resources
For more detailed information, refer to the following resources:
- [Cohere API documentation](https://docs.cohere.ai/)
- [Data Classes in Python](https://docs.python.org/3/library/dataclasses.html)

@ -0,0 +1,127 @@
# HuggingFaceTokenizer Documentation
`HuggingFaceTokenizer` is a comprehensive Python class that leverages the Hugging Face `transformers` library to tokenize text using the SentencePiece tokenization mechanism. This class serves as a convenient wrapper for initializing and using tokenizer models from Hugging Face's transformer models, enabling easy integration of tokenizer functionality in various NLP tasks.
**Purpose and Architecture:**
Tokenization is a critical step in processing natural language wherein text is broken down into smaller elements (tokens), which can be further used for text analysis, language modeling, and other computational linguistics tasks. The `HuggingFaceTokenizer` provides methods to encode text (turning strings into lists of token IDs) and decode lists of token IDs back into human-readable text.
**Table of Contents:**
- [Overview](#overview)
- [Initialization](#initialization)
- [Properties](#properties)
- [Methods](#methods)
- [Usage Examples](#usage-examples)
- [References and Resources](#references-and-resources)
## Overview
The `HuggingFaceTokenizer` class is designed to streamline the process of tokenizing text for natural language processing (NLP). It encapsulates various functionalities, such as encoding text into tokens, decoding tokens into text, and identifying token IDs for special tokens.
## Initialization
`HuggingFaceTokenizer` is initialized by providing the directory containing the pretrained tokenizer model files. During its initialization, it configures its internal state for tokenization processes, prepares access to vocabulary, and establishes necessary properties for subsequent tokenization tasks.
### Constructor Parameters
| Parameter | Data Type | Description | Default |
|------------|-----------|--------------------------------------------|---------|
| model_dir  | `str`     | The directory containing the tokenizer model files. | Required |
### Attributes
| Attribute | Data Type | Description |
|-------------------|---------------------|--------------------------------------------------------|
| vocab_size | `int` | The size of the vocabulary used by the tokenizer. |
| bos_token_id | `int` | The token ID representing the beginning of sequence token. |
| eos_token_id | `int` | The token ID representing the end of sequence token. |
| prefix_space_tokens | `Set[int]`          | A set of token IDs whose tokens do not begin with a prefix space.       |
## Methods
### Vocabulary Related Methods
#### `vocab_size`
Returns the size of the tokenizer's vocabulary.
#### `bos_token_id`
Returns the token ID used for the beginning of a sentence.
#### `eos_token_id`
Returns the token ID used for the end of a sentence.
#### `prefix_space_tokens`
Returns a set of token IDs that start without prefix spaces.
### Tokenization Methods
#### `encode`
Encodes a given text into a sequence of token IDs.
#### `decode`
Decodes a given sequence of token IDs into human-readable text.
#### `indexes_containing_token`
Returns a list of token IDs that potentially could be decoded into the given token.
#### `__call__`
Tokenizes given text when the object is called like a function.
## Usage Examples
### 1. Initializing the Tokenizer
```python
from swarms.tokenizers import HuggingFaceTokenizer
# Initialize the tokenizer with the path to your tokenizer model.
tokenizer = HuggingFaceTokenizer('/path/to/your/model_dir')
```
### 2. Encoding Text
```python
# Tokenize a single sentence.
sentence = "The quick brown fox jumps over the lazy dog."
token_ids = tokenizer.encode(sentence)
print(token_ids)
```
### 3. Decoding Tokens
```python
# Assuming 'token_ids' contains a list of token IDs
decoded_text = tokenizer.decode(token_ids)
print(decoded_text)
```
### 4. Getting Special Token IDs
```python
# Get the beginning of sequence token ID
bos_id = tokenizer.bos_token_id
print(f"BOS token ID: {bos_id}")
# Get the end of sequence token ID
eos_id = tokenizer.eos_token_id
print(f"EOS token ID: {eos_id}")
```
### 5. Using the Tokenizer
```python
# Tokenize a prompt directly by calling the object with a string.
text = "Hello, world!"
token_ids = tokenizer(text)
print(token_ids)
```
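### 6. Finding Candidate Token IDs
The `indexes_containing_token` method listed above can be queried directly; the token string used here is only an example.
```python
# Find vocabulary IDs that may decode to the given piece of text.
indexes = tokenizer.indexes_containing_token("world")
print(f"Candidate token IDs for 'world': {indexes}")
```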
## References and Resources
For more in-depth information on the Hugging Face `transformers` library and SentencePiece, refer to the following resources:
- Hugging Face `transformers` library documentation: https://huggingface.co/docs/transformers/index
- SentencePiece repository and documentation: https://github.com/google/sentencepiece
This documentation provides an introductory overview of the `HuggingFaceTokenizer` class. For a more extensive guide on the various parameters, functionalities, and advanced usage scenarios, users should refer to the detailed library documentation and external resources provided above.

@ -0,0 +1,89 @@
# OpenAITokenizer
The `OpenAITokenizer` class is a versatile and intuitive tokenizer designed for use with OpenAI's various language models, including the powerful GPT series. This class addresses the need to efficiently tokenize text for submission to OpenAI's API endpoints, managing different models and their unique tokenization schemes with ease.
Utility of `OpenAITokenizer` centers around its key features:
- Support for multiple OpenAI models including GPT-3 and GPT-4.
- Dynamic token counting that considers model-specific details.
- Straightforward API intended for easy integration with larger systems.
## Architecture and Design
The class adheres to a simple yet effective design, offering methods for calculating token lengths and embedded properties that manage model-specific characteristics such as maximum tokens and encodings. A data class structure is used for clean initializations and better management of class data.
The `OpenAITokenizer` uses a property-based approach and a method-based approach to provide users with a variety of functionalities essential for preparing text input for OpenAI models.
## Attributes
The class contains several key constants and properties that define defaults and settings for use with different models:
| Attribute | Type | Description |
|-----------------------------------------------------|-----------------|-------------------------------------------------------------|
| `DEFAULT_OPENAI_GPT_3_COMPLETION_MODEL` | `str` | Default completion model for OpenAI GPT-3. |
| `DEFAULT_OPENAI_GPT_3_CHAT_MODEL` | `str` | Default chat model for OpenAI GPT-3. |
| `DEFAULT_OPENAI_GPT_4_MODEL` | `str` | Default model for OpenAI GPT-4. |
| `DEFAULT_ENCODING` | `str` | Default encoding for text. |
| `DEFAULT_MAX_TOKENS` | `int` | Default maximum number of tokens based on the model. |
| `TOKEN_OFFSET` | `int` | Token offset applicable to some models. |
| `MODEL_PREFIXES_TO_MAX_TOKENS` | `dict` | Mapping of model prefixes to their respective max tokens. |
| `EMBEDDING_MODELS` | `list` | List of embedding models supported. |
| `model` | `str` | Name of the model currently being used. |
## Methods
The `OpenAITokenizer` class offers a variety of methods:
| Method | Arguments | Return Type | Description |
|-----------------------|---------------------------------------------|-------------|------------------------------------------------------------------------------------------------|
| `__post_init__` | None | `None` | Method called after the class has been initialized to set up default values. |
| `encoding` | None | `Encoding` | Getter method that retrieves the encoding based on the specified model. |
| `default_max_tokens` | None | `int` | Calculates the default max tokens based on the current model or defaults if not model-specific.|
| `count_tokens` | `text: str \| list[dict]`, `model: str` | `int` | Counts the number of tokens within a given text or a list of messages. |
| `len` | `text: str \| list[dict]`, `model: str` | `int` | Wrapper for `count_tokens`, providing a more intuitive naming convention. |
### Usage Examples
Given the extensive nature of this class, several examples are provided for each method, detailing how to use the `OpenAITokenizer` in different contexts.
#### Example 1: Initializing the Tokenizer
```python
from swarms.tokenizers import OpenAITokenizer
tokenizer = OpenAITokenizer(model='gpt-4')
```
This example creates a new instance of `OpenAITokenizer` set to work with the GPT-4 model.
#### Example 2: Counting Tokens
```python
text = "Hello, this is an example text to tokenize."
# Initialize the tokenizer
tokenizer = OpenAITokenizer(model='gpt-4')
# Count tokens
num_tokens = tokenizer.count_tokens(text)
print(f"Number of tokens: {num_tokens}")
```
This code snippet demonstrates how to count the number of tokens in a string of text using the specified model's encoding.
#### Example 3: Custom Model Token Counting
```python
messages = [
{"name": "Alice", "message": "Hello! How are you?"},
{"name": "Bob", "message": "I'm good! Just working on some code."},
]
tokenizer = OpenAITokenizer(model='gpt-3.5-turbo')
# Count tokens for a list of messages
num_tokens = tokenizer.len(messages, model="gpt-3.5-turbo-0613")
print(f"Total tokens for messages: {num_tokens}")
```
In this example, we're invoking the `len` method to count the tokens in a conversation thread. Each message is represented as a dictionary with a `name` and `message` field.
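#### Example 4: Inspecting the Encoding and Default Limit
This example reads the `encoding` property and the `default_max_tokens` method from the table above; the printed values depend on the installed `tiktoken` version and the chosen model.
```python
tokenizer = OpenAITokenizer(model="gpt-4")

# The tiktoken Encoding resolved for the configured model.
print(tokenizer.encoding)

# The model-specific default limit (the token offset is already subtracted).
print(tokenizer.default_max_tokens())
```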

@ -0,0 +1,163 @@
# swarms.tokenizers Documentation
`swarms.tokenizers` is a PyTorch-like tokenization library designed to facilitate natural language processing (NLP) tasks by converting text inputs into a form that machine learning models can interpret. This documentation outlines how to use the `SentencePieceTokenizer` class from the `swarms.tokenizers` library, which wraps SentencePiece, a language-independent subword tokenizer and detokenizer.
## Purpose and Architecture of `SentencePieceTokenizer`
The `SentencePieceTokenizer` class uses a pre-trained sentencepiece model to tokenize and detokenize texts. SentencePiece is an unsupervised text tokenizer and detokenizer that allows the generation of a subword vocabulary from raw data. By breaking text down into subword units (like wordpieces or byte-pair-encodings), SentencePiece handles languages without a clear word boundary and can improve the performance of text processing in neural network models.
In `SentencePieceTokenizer`, the tokenization process is language-agnostic and encompasses a range of tokenization strategies, such as byte pair encoding (BPE), unigram, or a combination of both. The class is designed with ease of use in mind, allowing seamless integration with other components of the NLP pipeline.
## Class Definition
```python
class SentencePieceTokenizer:
"""
Tokenizer of sentencepiece.
Args:
model_file (str): the path of the tokenizer model
"""
```
## Initialization Parameters
Property/Method | Type | Description
----------------|------|-------------
`model_file` | `str` | The path to the pretrained sentencepiece model file.
## Methods and Usage
Below, we detail the methods available in `SentencePieceTokenizer`, including their parameters, their functionality, and usage examples.
### Method: `__init__`
Instantiates an instance of the `SentencePieceTokenizer` with the specified sentencepiece model.
#### Parameters
Parameter | Type | Description
----------|------|-------------
`model_file` | `str` | The path to the pretrained sentencepiece model file.
#### Example
```python
from swarms.tokenizers import SentencePieceTokenizer
tokenizer = SentencePieceTokenizer(model_file='your_model.model')
```
### Properties: Vocabulary Information
These properties provide access to various vocabulary-specific information.
#### `vocab_size`
Returns the size of the tokenizer's vocabulary.
#### `bos_token_id`
Returns the ID of the beginning-of-sentence (BOS) token.
#### `eos_token_id`
Returns the ID of the end-of-sentence (EOS) token.
##### Example
```python
vocab_size = tokenizer.vocab_size
print(f"Vocabulary size: {vocab_size}")
bos_id = tokenizer.bos_token_id
eos_id = tokenizer.eos_token_id
print(f"BOS token ID: {bos_id}, EOS token ID: {eos_id}")
```
### Method: `indexes_containing_token`
Finds possible tokenizer indexes that, when decoded, may contain the input token.
#### Parameters
Parameter | Type | Description
----------|------|-------------
`token` | `str` | The token for which possible indexes are to be found.
#### Returns
- `List[int]`: List of tokenizer indexes that might contain the token.
#### Example
```python
indexes = tokenizer.indexes_containing_token("▁the")
print(f"Indexes containing '▁the': {indexes}")
```
### Method: `encode`
Tokenizes a text prompt into a list of token IDs.
#### Parameters
Parameter | Type | Description
----------|------|-------------
`s` | `str` | The text prompt to tokenize.
`add_bos` | `bool` | If `True`, it adds the beginning-of-sentence token. (default: `True`)
#### Returns
- `List[int]`: List of token IDs representing the text prompt.
#### Example
```python
encoded_ids = tokenizer.encode("Hello, world!", add_bos=True)
print(f"Encoded token IDs: {encoded_ids}")
```
### Method: `decode`
Detokenizes a list of token IDs into text.
#### Parameters
Parameter | Type | Description
----------|------|-------------
`t` | `List[int]` | A list of token IDs to detokenize.
`offset` | `Optional[int]` | For incremental decoding. Defaults to `None`, which means it is not applied.
#### Returns
- `str`: Text representation of the decoded token IDs.
#### Example
```python
decoded_text = tokenizer.decode([bos_id] + encoded_ids)
print(f"Decoded text: {decoded_text}")
```
### Method: `__call__`
Tokenizes prompts when the class instance is used as a callable.
#### Parameters
Parameter | Type | Description
----------|------|-------------
`s` | `Union[str, Sequence[str]]` | Text prompts to tokenize.
`add_bos` | `bool` | If `True`, it adds the beginning-of-sentence token. (default: `False`)
`add_eos` | `bool` | If `True`, it adds the end-of-sentence token. (default: `False`)
#### Returns
- `addict.Addict`: Object with `input_ids` containing the list of token IDs.
#### Example
```python
input_data = tokenizer("Let's tokenize this sentence.")
print(f"Tokenized input IDs: {input_data.input_ids}")
```
## Additional Information and Tips
The library has efficient internals that cache information for performance benefits. For example, `indexes_containing_token` uses a deque to store the most recent lookups, which saves computation time by avoiding re-traversing the vocabulary.
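A simplified sketch of that caching idea is shown below; it is illustrative only and does not reproduce the library's actual internals.
```python
from collections import deque


class TokenIndexCache:
    """Keep the most recent (token, indexes) lookups in a bounded deque."""

    def __init__(self, maxlen: int = 10):
        self._cache = deque(maxlen=maxlen)

    def get(self, token: str):
        for cached_token, indexes in self._cache:
            if cached_token == token:
                return indexes
        return None  # cache miss: caller scans the vocabulary and calls put()

    def put(self, token: str, indexes: list[int]) -> None:
        self._cache.append((token, indexes))
```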
## Conclusion
This documentation provides an in-depth explanation of `swarms.tokenizers` with a focus on the `SentencePieceTokenizer` class. By following the examples and guidance detailed above, users should be able to effectively use the tokenizers for their NLP tasks. Users are also encouraged to refer to further resources and the official SentencePiece documentation for more advanced use cases and configurations.

@ -0,0 +1,78 @@
# `Tokenizer` Class Documentation
The `Tokenizer` class is a flexible and robust tokenization tool designed to efficiently tokenize prompts into a sequence of token IDs or convert token IDs back into readable text. The class works by initializing with a path to a pretrained tokenization model and supports different tokenization backends based on the availability of configs and pretrained models.
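One plausible dispatch, assuming the backend is chosen by whether `model_file` points to a standalone SentencePiece `.model` file or to a Hugging Face tokenizer directory (the exact rule used by the library may differ):
```python
import os.path as osp

from swarms.tokenizers import HuggingFaceTokenizer, SentencePieceTokenizer


def build_backend(model_file: str):
    # A standalone .model file suggests a SentencePiece model; anything else
    # is treated as a Hugging Face tokenizer directory.
    if osp.isfile(model_file) and model_file.endswith(".model"):
        return SentencePieceTokenizer(model_file)
    return HuggingFaceTokenizer(model_file)
```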
## Initialization & Configuration
### Parameters:
| Parameter | Type | Description | Required |
|------------|------|------------------------------------------|----------|
| model_file | str | Path to the tokenizer model or directory | Yes |
### Attributes:
| Attribute | Type | Description |
|------------------|------|------------------------------------|
| vocab_size | int | Size of the tokenizer's vocabulary |
| bos_token_id | int | ID of the beginning-of-sequence token |
| eos_token_id | int | ID of the end-of-sequence token |
### Methods:
| Method | Returns | Description |
|--------------------------------|---------|--------------------------------------------------------------|
| encode(s, add_bos=True, **kwargs) | list[int] | Tokenizes a prompt and returns token IDs. |
| decode(t, offset=None) | str | Decodes a list of token IDs to a string. |
| __call__(s) | list[int] | Tokenize prompts when the instance is called directly. |
| indexes_containing_token(token) | list[int] | Returns indexes in the vocabulary that may contain the token. |
## Usage Examples
### Tokenizing a Prompt
```python
from swarms.tokenizers import Tokenizer
tokenizer = Tokenizer("/path/to/tokenizer.model")
# Tokenize a single prompt string
prompt = "Hello, world!"
token_ids = tokenizer.encode(prompt)
print(token_ids)
```
### Decoding Token IDs
```python
# Decode token IDs back into text
decoded_text = tokenizer.decode(token_ids)
print(decoded_text)
```
### Incremental Decoding
```python
# Incremental decoding with offset (useful for streaming applications)
partial_tokens = [token_ids[0]] # simulate partially received tokens
decoded_partial = tokenizer.decode(partial_tokens, offset=0)
print(decoded_partial)
```
### Properties Access
```python
# Access vocabulary size and special token IDs
print("Vocabulary Size:", tokenizer.vocab_size)
print("BOS Token ID:", tokenizer.bos_token_id)
print("EOS Token ID:", tokenizer.eos_token_id)
```
### Indexes Containing Token
```python
# Find indexes that may output a specific token during decoding
token = "world"
indexes = tokenizer.indexes_containing_token(token)
print("Token Indexes:", indexes)
```

@ -3,8 +3,7 @@ import os
from dotenv import load_dotenv
# Import the OpenAIChat model and the Agent struct
from swarms.models import OpenAIChat
from swarms.structs import Agent
from swarms import OpenAIChat, Agent
# Load the environment variables
load_dotenv()

@ -116,6 +116,15 @@ nav:
- groupchat: "swarms/structs/groupchat.md"
- swarmnetwork: "swarms/structs/swarmnetwork.md"
- groupchatmanager: "swarms/structs/groupchatmanager.md"
- swarms.tokenizers:
- Language:
- Tokenizer: "swarms/tokenizers/tokenizer.md"
- CohereTokenizer: "swarms/tokenizers/coheretokenizer.md"
- BaseTokenizer: "swarms/tokenizers/basetokenizer.md"
- HuggingfaceTokenizer: "swarms/tokenizers/huggingfacetokenizer.md"
- SentencepieceTokenizer: "swarms/tokenizers/sentencepiecetokenizer.md"
- AnthropicTokenizer: "swarms/tokenizers/anthropictokenizer.md"
- OpenaiTokenizer: "swarms/tokenizers/openaitokenizer.md"
- swarms.memory:
- Vector Databases:
- Weaviate: "swarms/memory/weaviate.md"

@ -4,7 +4,7 @@ build-backend = "poetry.core.masonry.api"
[tool.poetry]
name = "swarms"
version = "3.8.8"
version = "3.9.1"
description = "Swarms - Pytorch"
license = "MIT"
authors = ["Kye Gomez <kye@apac.ai>"]

@ -9,11 +9,18 @@ from scripts.auto_tests_docs.docs import DOCUMENTATION_WRITER_SOP
from swarms import OpenAIChat
##########
from swarms.agents.base import AbstractAgent
from swarms.structs.message import Message
from swarms.agents.omni_modal_agent import OmniModalAgent
from swarms.agents.tool_agent import ToolAgent
from swarms.agents.worker_agent import WorkerAgent
from swarms.tokenizers.r_tokenizers import (
SentencePieceTokenizer,
HuggingFaceTokenizer,
Tokenizer,
)
from swarms.tokenizers.base_tokenizer import BaseTokenizer
from swarms.tokenizers.openai_tokenizers import OpenAITokenizer
from swarms.tokenizers.anthropic_tokenizer import (
AnthropicTokenizer,
)
from swarms.tokenizers.cohere_tokenizer import CohereTokenizer
####################
load_dotenv()
@ -40,14 +47,14 @@ def process_documentation(cls):
# Process with OpenAI model (assuming the model's __call__ method takes this input and returns processed content)
processed_content = model(
DOCUMENTATION_WRITER_SOP(input_content, "swarms.agents")
DOCUMENTATION_WRITER_SOP(input_content, "swarms.tokenizers")
)
# doc_content = f"# {cls.__name__}\n\n{processed_content}\n"
doc_content = f"{processed_content}\n"
# Create the directory if it doesn't exist
dir_path = "docs/swarms/agents"
dir_path = "docs/swarms/tokenizers"
os.makedirs(dir_path, exist_ok=True)
# Write the processed documentation to a Markdown file
@ -60,11 +67,13 @@ def process_documentation(cls):
def main():
classes = [
AbstractAgent,
Message,
OmniModalAgent,
ToolAgent,
WorkerAgent,
SentencePieceTokenizer,
HuggingFaceTokenizer,
Tokenizer,
BaseTokenizer,
OpenAITokenizer,
AnthropicTokenizer,
CohereTokenizer,
]
threads = []
for cls in classes:
@ -78,7 +87,7 @@ def main():
for thread in threads:
thread.join()
print("Documentation generated in 'swarms.agents' directory.")
print("Documentation generated in 'swarms.tokenizers' directory.")
if __name__ == "__main__":

@ -4,25 +4,21 @@ import re
import threading
from swarms import OpenAIChat
from scripts.auto_tests_docs.docs import TEST_WRITER_SOP_PROMPT
from zeta.nn.modules._activations import (
AccurateGELUActivation,
ClippedGELUActivation,
FastGELUActivation,
GELUActivation,
LaplaceActivation,
LinearActivation,
MishActivation,
NewGELUActivation,
PytorchGELUTanh,
QuickGELUActivation,
ReLUSquaredActivation,
#########
from swarms.tokenizers.r_tokenizers import (
SentencePieceTokenizer,
HuggingFaceTokenizer,
Tokenizer,
)
from swarms.tokenizers.base_tokenizer import BaseTokenizer
from swarms.tokenizers.openai_tokenizers import OpenAITokenizer
from swarms.tokenizers.anthropic_tokenizer import (
AnthropicTokenizer,
)
from zeta.nn.modules.dense_connect import DenseBlock
from zeta.nn.modules.dual_path_block import DualPathBlock
from zeta.nn.modules.feedback_block import FeedbackBlock
from zeta.nn.modules.highway_layer import HighwayLayer
from zeta.nn.modules.multi_scale_block import MultiScaleBlock
from zeta.nn.modules.recursive_block import RecursiveBlock
from swarms.tokenizers.cohere_tokenizer import CohereTokenizer
########
from dotenv import load_dotenv
load_dotenv()
@ -69,14 +65,16 @@ def create_test(cls):
# Process with OpenAI model (assuming the model's __call__ method takes this input and returns processed content)
processed_content = model(
TEST_WRITER_SOP_PROMPT(input_content, "zeta", "zeta.nn")
TEST_WRITER_SOP_PROMPT(
input_content, "swarms", "swarms.tokenizers"
)
)
processed_content = extract_code_from_markdown(processed_content)
doc_content = f"# {cls.__name__}\n\n{processed_content}\n"
# Create the directory if it doesn't exist
dir_path = "tests/nn/modules"
dir_path = "tests/tokenizers"
os.makedirs(dir_path, exist_ok=True)
# Write the processed documentation to a Python file
@ -87,25 +85,14 @@ def create_test(cls):
def main():
classes = [
DenseBlock,
HighwayLayer,
MultiScaleBlock,
FeedbackBlock,
DualPathBlock,
RecursiveBlock,
PytorchGELUTanh,
NewGELUActivation,
GELUActivation,
FastGELUActivation,
QuickGELUActivation,
ClippedGELUActivation,
AccurateGELUActivation,
MishActivation,
LinearActivation,
LaplaceActivation,
ReLUSquaredActivation,
SentencePieceTokenizer,
HuggingFaceTokenizer,
Tokenizer,
BaseTokenizer,
OpenAITokenizer,
AnthropicTokenizer,
CohereTokenizer,
]
threads = []
for cls in classes:
thread = threading.Thread(target=create_test, args=(cls,))
@ -116,7 +103,7 @@ def main():
for thread in threads:
thread.join()
print("Tests generated in 'docs/zeta/nn/modules' directory.")
print("Tests generated in 'tests/tokenizers' directory.")
if __name__ == "__main__":

@ -28,4 +28,4 @@ def generate_file_list(directory, output_file):
# Use the function to generate the file list
generate_file_list("docs/swarms/agents", "file_list.txt")
generate_file_list("docs/swarms/tokenizers", "file_list.txt")

@ -9,14 +9,7 @@ from swarms.models import * # noqa: E402, F403
from swarms.telemetry import * # noqa: E402, F403
from swarms.utils import * # noqa: E402, F403
from swarms.prompts import * # noqa: E402, F403
# telemetry = Telemetry('mongodb://localhost:27017/', 'mydatabase')
# telemetry.log_import('swarms.telemetry.bootup')
# telemetry.log_import('swarms.agents')
# telemetry.log_import('swarms.structs')
# telemetry.log_import('swarms.models')
# telemetry.log_import('swarms.telemetry')
# telemetry.log_import('swarms.utils')
# telemetry.log_import('swarms.prompts')
from swarms.tokenizers import * # noqa: E402, F403
from swarms.loaders import * # noqa: E402, F403
from swarms.artifacts import * # noqa: E402, F403
from swarms.chunkers import * # noqa: E402, F403

@ -0,0 +1,7 @@
from swarms.artifacts.base_artifact import BaseArtifact
from swarms.artifacts.text_artifact import TextArtifact
__all__ = [
"BaseArtifact",
"TextArtifact",
]

@ -0,0 +1,76 @@
from __future__ import annotations
from typing import Any
import json
import uuid
from abc import ABC, abstractmethod
from dataclasses import dataclass
@dataclass
class BaseArtifact(ABC):
"""
Base class for artifacts.
"""
id: str
name: str
value: Any
def __post_init__(self):
if self.id is None:
self.id = uuid.uuid4().hex
if self.name is None:
self.name = self.id
@classmethod
def value_to_bytes(cls, value: Any) -> bytes:
"""
Convert the value to bytes.
"""
if isinstance(value, bytes):
return value
else:
return str(value).encode()
@classmethod
def value_to_dict(cls, value: Any) -> dict:
"""
Convert the value to a dictionary.
"""
if isinstance(value, dict):
dict_value = value
else:
dict_value = json.loads(value)
return {k: v for k, v in dict_value.items()}
def to_text(self) -> str:
"""
Convert the value to text.
"""
return str(self.value)
def __str__(self) -> str:
"""
Return a string representation of the artifact.
"""
return self.to_text()
def __bool__(self) -> bool:
"""
Return the boolean value of the artifact.
"""
return bool(self.value)
def __len__(self) -> int:
"""
Return the length of the artifact.
"""
return len(self.value)
@abstractmethod
def __add__(self, other: BaseArtifact) -> BaseArtifact:
"""
Add two artifacts together.
"""
...

@ -0,0 +1,57 @@
from __future__ import annotations
from typing import Optional
from dataclasses import dataclass, field
from swarms.artifacts.base_artifact import BaseArtifact
from swarms.tokenizers.base_tokenizer import BaseTokenizer
@dataclass
class TextArtifact(BaseArtifact):
"""
Represents a text artifact.
Attributes:
value (str): The text value of the artifact.
encoding (str, optional): The encoding of the text (default is "utf-8").
encoding_error_handler (str, optional): The error handler for encoding errors (default is "strict").
_embedding (list[float]): The embedding of the text artifact (default is an empty list).
Properties:
embedding (Optional[list[float]]): The embedding of the text artifact.
Methods:
__add__(self, other: BaseArtifact) -> TextArtifact: Concatenates the text value of the artifact with another artifact.
__bool__(self) -> bool: Checks if the text value of the artifact is non-empty.
generate_embedding(self, model) -> Optional[list[float]]: Generates the embedding of the text artifact using the given embedding model.
token_count(self, tokenizer: BaseTokenizer) -> int: Counts the number of tokens in the text artifact using a given tokenizer.
to_bytes(self) -> bytes: Converts the text value of the artifact to bytes using the specified encoding and error handler.
"""
value: str
encoding: str = "utf-8"
encoding_error_handler: str = "strict"
_embedding: list[float] = field(default_factory=list)
@property
def embedding(self) -> Optional[list[float]]:
return None if len(self._embedding) == 0 else self._embedding
def __add__(self, other: BaseArtifact) -> TextArtifact:
return TextArtifact(self.value + other.value)
def __bool__(self) -> bool:
return bool(self.value.strip())
def generate_embedding(self, model) -> Optional[list[float]]:
self._embedding.clear()
self._embedding.extend(model.embed_string(str(self.value)))
return self.embedding
def token_count(self, tokenizer: BaseTokenizer) -> int:
return tokenizer.count_tokens(str(self.value))
def to_bytes(self) -> bytes:
return self.value.encode(
encoding=self.encoding, errors=self.encoding_error_handler
)

@ -0,0 +1,5 @@
from swarms.chunkers.chunk_seperator import ChunkSeparator
from swarms.chunkers.base_chunker import BaseChunker
from swarms.chunkers.text_chunker import TextChunker
__all__ = ["ChunkSeparator", "BaseChunker", "TextChunker"]

@ -0,0 +1,162 @@
from __future__ import annotations
from abc import ABC
from typing import Optional
from dataclasses import dataclass, field
from swarms.artifacts.text_artifact import TextArtifact
from swarms.chunkers.chunk_seperator import ChunkSeparator
from swarms.tokenizers.base_tokenizer import BaseTokenizer
from swarms.tokenizers.openai_tokenizers import OpenAITokenizer
@dataclass
class BaseChunker(ABC):
"""
Base class for chunking text into smaller chunks.
"""
DEFAULT_SEPARATORS = [ChunkSeparator(" ")]
separators: list[ChunkSeparator] = field(
default_factory=lambda: BaseChunker.DEFAULT_SEPARATORS
)
tokenizer: BaseTokenizer = field(
default_factory=lambda: OpenAITokenizer(
model=OpenAITokenizer.DEFAULT_OPENAI_GPT_3_CHAT_MODEL
)
)
    max_tokens: int = 0

    def __post_init__(self):
        # Default to the tokenizer's own limit when no explicit cap is given.
        if not self.max_tokens:
            self.max_tokens = self.tokenizer.max_tokens
    def chunk(self, text: TextArtifact | str) -> list[TextArtifact]:
"""
Chunk the given text into smaller chunks.
Args:
text (TextArtifact | str): The text to be chunked.
Returns:
list[TextArtifact]: The list of chunked text artifacts.
"""
        text = text.value if isinstance(text, TextArtifact) else text
return [
TextArtifact(c) for c in self._chunk_recursively(text)
]
def _chunk_recursively(
self,
chunk: str,
current_separator: Optional[ChunkSeparator] = None,
) -> list[str]:
"""
Recursively chunk the given chunk into smaller subchunks.
Args:
chunk (str): The chunk to be recursively chunked.
current_separator (Optional[ChunkSeparator], optional): The current separator to be used. Defaults to None.
Returns:
list[str]: The list of recursively chunked subchunks.
"""
token_count = self.tokenizer.count_tokens(chunk)
if token_count <= self.max_tokens:
return [chunk]
else:
balance_index = -1
balance_diff = float("inf")
tokens_count = 0
half_token_count = token_count // 2
# If a separator is provided, only use separators after it.
if current_separator:
separators = self.separators[
self.separators.index(current_separator) :
]
else:
separators = self.separators
# Loop through available separators to find the best split.
for separator in separators:
# Split the chunk into subchunks using the current separator.
subchunks = list(
filter(None, chunk.split(separator.value))
)
# Check if the split resulted in more than one subchunk.
if len(subchunks) > 1:
# Iterate through the subchunks and calculate token counts.
for index, subchunk in enumerate(subchunks):
if index < len(subchunks):
if separator.is_prefix:
subchunk = separator.value + subchunk
else:
subchunk = subchunk + separator.value
tokens_count += self.tokenizer.count_tokens(
subchunk
)
# Update the best split if the current one is more balanced.
if (
abs(tokens_count - half_token_count)
< balance_diff
):
balance_index = index
balance_diff = abs(
tokens_count - half_token_count
)
# Create the two subchunks based on the best separator.
if separator.is_prefix:
# If the separator is a prefix, append it before this subchunk.
first_subchunk = (
separator.value
+ separator.value.join(
subchunks[: balance_index + 1]
)
)
second_subchunk = (
separator.value
+ separator.value.join(
subchunks[balance_index + 1 :]
)
)
else:
# If the separator is not a prefix, append it after this subchunk.
first_subchunk = (
separator.value.join(
subchunks[: balance_index + 1]
)
+ separator.value
)
second_subchunk = separator.value.join(
subchunks[balance_index + 1 :]
)
# Continue recursively chunking the subchunks.
first_subchunk_rec = self._chunk_recursively(
first_subchunk.strip(), separator
)
second_subchunk_rec = self._chunk_recursively(
second_subchunk.strip(), separator
)
# Return the concatenated results of the subchunks if both are non-empty.
if first_subchunk_rec and second_subchunk_rec:
return (
first_subchunk_rec + second_subchunk_rec
)
# If only one subchunk is non-empty, return it.
elif first_subchunk_rec:
return first_subchunk_rec
elif second_subchunk_rec:
return second_subchunk_rec
else:
return []
# If none of the separators result in a balanced split, split the chunk in half.
midpoint = len(chunk) // 2
return self._chunk_recursively(
chunk[:midpoint]
) + self._chunk_recursively(chunk[midpoint:])

@ -0,0 +1,7 @@
from dataclasses import dataclass
@dataclass
class ChunkSeparator:
value: str
is_prefix: bool = False

@ -0,0 +1,13 @@
from swarms.chunkers.base_chunker import BaseChunker
from swarms.chunkers.chunk_seperator import ChunkSeparator
class TextChunker(BaseChunker):
DEFAULT_SEPARATORS = [
ChunkSeparator("\n\n"),
ChunkSeparator("\n"),
ChunkSeparator(". "),
ChunkSeparator("! "),
ChunkSeparator("? "),
ChunkSeparator(" "),
]

@ -0,0 +1,77 @@
from __future__ import annotations
from dataclasses import dataclass
from pathlib import Path
from typing import IO, Dict, List, Optional
from PyPDF2 import PdfReader
from swarms.utils.hash import str_to_hash
@dataclass
class TextArtifact:
text: str
@dataclass
class PDFLoader:
"""
A class for loading PDF files and extracting text artifacts.
Args:
tokenizer (str): The tokenizer to use for chunking the text.
max_tokens (int): The maximum number of tokens per chunk.
Methods:
load(source, password=None, *args, **kwargs):
Load a single PDF file and extract text artifacts.
load_collection(sources, password=None, *args, **kwargs):
Load a collection of PDF files and extract text artifacts.
Private Methods:
_load_pdf(stream, password=None):
Load a PDF file and extract text artifacts.
Attributes:
tokenizer (str): The tokenizer used for chunking the text.
max_tokens (int): The maximum number of tokens per chunk.
"""
tokenizer: str
max_tokens: int
def __post_init__(self):
self.chunker = PdfChunker(
tokenizer=self.tokenizer, max_tokens=self.max_tokens
)
def load(
self,
source: str | IO | Path,
password: Optional[str] = None,
*args,
**kwargs,
) -> List[TextArtifact]:
return self._load_pdf(source, password)
def load_collection(
self,
sources: List[str | IO | Path],
password: Optional[str] = None,
*args,
**kwargs,
) -> Dict[str, List[TextArtifact]]:
return {
str_to_hash(str(s)): self._load_pdf(s, password)
for s in sources
}
def _load_pdf(
self, stream: str | IO | Path, password: Optional[str]
) -> List[TextArtifact]:
reader = PdfReader(stream, strict=True, password=password)
return [
TextArtifact(text=p.extract_text()) for p in reader.pages
]

@ -4,6 +4,7 @@ from swarms.memory.short_term_memory import ShortTermMemory
from swarms.memory.sqlite import SQLiteDB
from swarms.memory.weaviate_db import WeaviateDB
from swarms.memory.visual_memory import VisualShortTermMemory
from swarms.memory.action_subtask import ActionSubtaskEntry
__all__ = [
"AbstractVectorDatabase",
@ -12,4 +13,5 @@ __all__ = [
"SQLiteDB",
"WeaviateDB",
"VisualShortTermMemory",
"ActionSubtaskEntry",
]

@ -0,0 +1,16 @@
from dataclasses import dataclass
@dataclass
class ActionSubtaskEntry:
"""Used to store ActionSubtask data to preserve TaskMemory pointers and context in the form of thought and action.
Attributes:
thought: CoT thought string from the LLM.
action: ReAct action JSON string from the LLM.
answer: tool-generated and memory-processed response from Griptape.
"""
thought: str
action: str
answer: str

@ -70,6 +70,11 @@ from swarms.models.types import (
MultimodalData,
) # noqa: E402
# ############ Embedding models
from swarms.models.base_embedding_model import BaseEmbeddingModel
__all__ = [
"AbstractLLM",
"Anthropic",
@ -112,4 +117,5 @@ __all__ = [
"CLIPQ",
"Kosmos",
"Fuyu",
"BaseEmbeddingModel",
]

@ -0,0 +1,80 @@
from __future__ import annotations
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Optional
import numpy as np
from swarms.chunkers.base_chunker import BaseChunker
from swarms.chunkers.text_chunker import TextChunker
from swarms.utils.exponential_backoff import ExponentialBackoffMixin
from swarms.artifacts.text_artifact import TextArtifact
from swarms.tokenizers.base_tokenizer import BaseTokenizer
@dataclass
class BaseEmbeddingModel(
ExponentialBackoffMixin,
ABC,
# SerializableMixin
):
"""
Attributes:
model: The name of the model to use.
tokenizer: An instance of `BaseTokenizer` to use when calculating tokens.
"""
model: str = None
tokenizer: Optional[BaseTokenizer] = None
chunker: BaseChunker = field(init=False)
def __post_init__(self) -> None:
if self.tokenizer:
self.chunker = TextChunker(tokenizer=self.tokenizer)
def embed_text_artifact(
self, artifact: TextArtifact
) -> list[float]:
return self.embed_string(artifact.to_text())
def embed_string(self, string: str) -> list[float]:
for attempt in self.retrying():
with attempt:
if (
self.tokenizer
and self.tokenizer.count_tokens(string)
> self.tokenizer.max_tokens
):
return self._embed_long_string(string)
else:
return self.try_embed_chunk(string)
else:
raise RuntimeError("Failed to embed string.")
@abstractmethod
def try_embed_chunk(self, chunk: str) -> list[float]:
...
def _embed_long_string(self, string: str) -> list[float]:
"""Embeds a string that is too long to embed in one go."""
chunks = self.chunker.chunk(string)
embedding_chunks = []
length_chunks = []
for chunk in chunks:
embedding_chunks.append(self.try_embed_chunk(chunk.value))
length_chunks.append(len(chunk))
# generate weighted averages
embedding_chunks = np.average(
embedding_chunks, axis=0, weights=length_chunks
)
# normalize length to 1
embedding_chunks = embedding_chunks / np.linalg.norm(
embedding_chunks
)
return embedding_chunks.tolist()

@ -3,10 +3,21 @@ from swarms.tokenizers.r_tokenizers import (
HuggingFaceTokenizer,
Tokenizer,
)
from swarms.tokenizers.base_tokenizer import BaseTokenizer
from swarms.tokenizers.openai_tokenizers import OpenAITokenizer
from swarms.tokenizers.anthropic_tokenizer import (
import_optional_dependency,
AnthropicTokenizer,
)
from swarms.tokenizers.cohere_tokenizer import CohereTokenizer
__all__ = [
"SentencePieceTokenizer",
"HuggingFaceTokenizer",
"Tokenizer",
"BaseTokenizer",
"OpenAITokenizer",
"import_optional_dependency",
"AnthropicTokenizer",
"CohereTokenizer",
]

@ -0,0 +1,96 @@
from __future__ import annotations
from dataclasses import dataclass
from importlib import import_module
from types import ModuleType
from typing import Optional
from anthropic import Anthropic
from swarms.tokenizers.base_tokenizer import BaseTokenizer
INSTALL_MAPPING = {
"huggingface_hub": "huggingface-hub",
"pinecone": "pinecone-client",
"opensearchpy": "opensearch-py",
}
def import_optional_dependency(name: str) -> Optional[ModuleType]:
"""Import an optional dependency.
If a dependency is missing, an ImportError with a nice message will be raised.
Args:
name: The module name.
Returns:
The imported module, when found.
    Raises:
        ImportError: If the module is not installed.
"""
package_name = INSTALL_MAPPING.get(name)
install_name = package_name if package_name is not None else name
msg = (
f"Missing optional dependency: '{install_name}'. "
f"Use poetry or pip to install '{install_name}'."
)
try:
module = import_module(name)
except ImportError:
raise ImportError(msg)
return module
@dataclass
class AnthropicTokenizer(BaseTokenizer):
"""
    Tokenizer class for Anthropic models.
"""
max_tokens: int = 500
client: Anthropic = None
model: str = "claude-2.1"
def __post_init__(self):
self.DEFAULT_MODEL: str = "claude-2.1"
self.MODEL_PREFIXES_TO_MAX_TOKENS: dict[str, int] = {
"claude-2.1": 200000,
"claude": 100000,
}
self.model = self.model # or self.DEFAULT_MODEL
self.max_tokens = self.max_tokens or self.default_max_tokens()
self.client = (
self.client
or import_optional_dependency("anthropic").Anthropic()
)
def default_max_tokens(self) -> int:
"""
Returns the default maximum number of tokens based on the model prefix.
"""
tokens = next(
v
for k, v in self.MODEL_PREFIXES_TO_MAX_TOKENS.items()
if self.model.startswith(k)
)
return tokens
def count_tokens(self, text: str | list) -> int:
"""
Counts the number of tokens in the given text.
Args:
text: The input text.
Returns:
The number of tokens in the text.
Raises:
ValueError: If the input text is not a string.
"""
if isinstance(text, str):
return self.client.count_tokens(text)
else:
raise ValueError("Text must be a string.")

@ -0,0 +1,56 @@
from __future__ import annotations
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import List, Union
@dataclass
class BaseTokenizer(ABC):
"""
Base class for tokenizers.
Attributes:
stop_sequences (List[str]): List of stop sequences.
max_tokens (int): Maximum number of tokens.
stop_token (str): Stop token.
"""
max_tokens: int
stop_token: str = "<|Response|>"
def __post_init__(self):
        self.stop_sequences: List[str] = ["<|Response|>"]
def count_tokens_left(self, text: Union[str, List[dict]]) -> int:
"""
Counts the number of tokens left based on the given text.
Args:
text (Union[str, List[dict]]): The text to count tokens from.
Returns:
int: The number of tokens left.
"""
diff = self.max_tokens - self.count_tokens(text)
if diff > 0:
return diff
else:
return 0
@abstractmethod
def count_tokens(self, text: Union[str, List[dict]]) -> int:
"""
Counts the number of tokens in the given text.
Args:
text (Union[str, List[dict]]): The text to count tokens from.
Returns:
int: The number of tokens.
"""
...

@ -0,0 +1,35 @@
from __future__ import annotations
from dataclasses import dataclass
from cohere import Client
@dataclass
class CohereTokenizer:
"""
A tokenizer class for Cohere models.
"""
model: str
client: Client
DEFAULT_MODEL: str = "command"
DEFAULT_MAX_TOKENS: int = 2048
max_tokens: int = DEFAULT_MAX_TOKENS
def count_tokens(self, text: str | list) -> int:
"""
Count the number of tokens in the given text.
Args:
text (str | list): The input text to tokenize.
Returns:
int: The number of tokens in the text.
Raises:
ValueError: If the input text is not a string.
"""
if isinstance(text, str):
return len(self.client.tokenize(text=text).tokens)
else:
raise ValueError("Text must be a string.")

@ -0,0 +1,179 @@
from __future__ import annotations
import logging
from dataclasses import dataclass, field
import tiktoken
from tiktoken import Encoding
from typing import Optional
from swarms.tokenizers.base_tokenizer import BaseTokenizer
@dataclass
class OpenAITokenizer(BaseTokenizer):
"""
A class representing an OpenAI tokenizer.
Attributes:
- DEFAULT_OPENAI_GPT_3_COMPLETION_MODEL (str): The default OpenAI GPT-3 completion model.
- DEFAULT_OPENAI_GPT_3_CHAT_MODEL (str): The default OpenAI GPT-3 chat model.
- DEFAULT_OPENAI_GPT_4_MODEL (str): The default OpenAI GPT-4 model.
- DEFAULT_ENCODING (str): The default encoding.
- DEFAULT_MAX_TOKENS (int): The default maximum number of tokens.
- TOKEN_OFFSET (int): The token offset.
- MODEL_PREFIXES_TO_MAX_TOKENS (dict): A dictionary mapping model prefixes to maximum tokens.
- EMBEDDING_MODELS (list): A list of embedding models.
- model (str): The model name.
Methods:
- __post_init__(): Initializes the OpenAITokenizer object.
- encoding(): Returns the encoding for the model.
- default_max_tokens(): Returns the default maximum number of tokens.
- count_tokens(text, model): Counts the number of tokens in the given text.
- len(text, model): Returns the length of the text in tokens.
"""
model: str = "gpt-2"
    def __post_init__(self):
        """
        Initializes the OpenAITokenizer object.
        Sets the model constants and the default maximum number of tokens.
        """
        self.DEFAULT_OPENAI_GPT_3_COMPLETION_MODEL = (
            "text-davinci-003"
        )
        self.DEFAULT_OPENAI_GPT_3_CHAT_MODEL = "gpt-3.5-turbo"
        self.DEFAULT_OPENAI_GPT_4_MODEL = "gpt-4"
        self.DEFAULT_ENCODING = "cl100k_base"
        self.DEFAULT_MAX_TOKENS = 2049
        self.TOKEN_OFFSET = 8
self.MODEL_PREFIXES_TO_MAX_TOKENS = {
"gpt-4-1106": 128000,
"gpt-4-32k": 32768,
"gpt-4": 8192,
"gpt-3.5-turbo-16k": 16384,
"gpt-3.5-turbo": 4096,
"gpt-35-turbo-16k": 16384,
"gpt-35-turbo": 4096,
"text-davinci-003": 4097,
"text-davinci-002": 4097,
"code-davinci-002": 8001,
"text-embedding-ada-002": 8191,
"text-embedding-ada-001": 2046,
}
        self.EMBEDDING_MODELS = [
            "text-embedding-ada-002",
            "text-embedding-ada-001",
        ]
        # Resolve max_tokens only after the prefix table exists; field()
        # cannot be used at runtime, so assign the value directly.
        self.max_tokens = self.max_tokens or self.default_max_tokens()
@property
def encoding(self) -> Encoding:
"""
Returns the encoding for the model.
If the model is not found, returns the default encoding.
"""
try:
return tiktoken.encoding_for_model(self.model)
except KeyError:
return tiktoken.get_encoding(self.DEFAULT_ENCODING)
    def default_max_tokens(self) -> int:
        """
        Returns the default maximum number of tokens based on the model.
        Falls back to DEFAULT_MAX_TOKENS when no prefix matches.
        """
        tokens = next(
            (
                v
                for k, v in self.MODEL_PREFIXES_TO_MAX_TOKENS.items()
                if self.model.startswith(k)
            ),
            None,
        )
        offset = (
            0
            if self.model in self.EMBEDDING_MODELS
            else self.TOKEN_OFFSET
        )
        return (
            tokens if tokens else self.DEFAULT_MAX_TOKENS
        ) - offset
def count_tokens(
self, text: str | list[dict], model: Optional[str] = None
) -> int:
"""
Counts the number of tokens in the given text.
If the text is a list of messages, counts the tokens for each message.
If a model is provided, uses that model for encoding.
"""
if isinstance(text, list):
model = model if model else self.model
try:
encoding = tiktoken.encoding_for_model(model)
except KeyError:
logging.warning(
"model not found. Using cl100k_base encoding."
)
encoding = tiktoken.get_encoding("cl100k_base")
if model in {
"gpt-3.5-turbo-0613",
"gpt-3.5-turbo-16k-0613",
"gpt-4-0314",
"gpt-4-32k-0314",
"gpt-4-0613",
"gpt-4-32k-0613",
}:
tokens_per_message = 3
tokens_per_name = 1
elif model == "gpt-3.5-turbo-0301":
tokens_per_message = 4
tokens_per_name = -1
elif "gpt-3.5-turbo" in model or "gpt-35-turbo" in model:
logging.info(
"gpt-3.5-turbo may update over time. Returning"
" num tokens assuming gpt-3.5-turbo-0613."
)
return self.count_tokens(
text, model="gpt-3.5-turbo-0613"
)
elif "gpt-4" in model:
logging.info(
"gpt-4 may update over time. Returning num tokens"
" assuming gpt-4-0613."
)
return self.count_tokens(text, model="gpt-4-0613")
else:
raise NotImplementedError(
"token_count() is not implemented for model"
f" {model}. See"
" https://github.com/openai/openai-python/blob/main/chatml.md"
" for information on how messages are converted"
" to tokens."
)
num_tokens = 0
for message in text:
num_tokens += tokens_per_message
for key, value in message.items():
num_tokens += len(encoding.encode(value))
if key == "name":
num_tokens += tokens_per_name
num_tokens += 3
return num_tokens
else:
return len(self.encoding.encode(text))
    def len(
        self, text: str | list[dict], model: Optional[str] = None
    ) -> int:
"""
Returns the length of the text in tokens.
If a model is provided, uses that model for encoding.
"""
return self.count_tokens(text, model)
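
A usage sketch covering both input shapes `count_tokens` accepts. `max_tokens` is passed explicitly here because `BaseTokenizer` declares it as a required field; the value and the messages are illustrative, and `tiktoken` fetches its encodings on first use.

```python
from swarms.tokenizers.openai_tokenizers import OpenAITokenizer

tokenizer = OpenAITokenizer(max_tokens=8192, model="gpt-4")

# Plain string: encoded with the model's tiktoken encoding.
print(tokenizer.count_tokens("Hello there!"))

# Chat-style message list: per-message overhead is added as in the
# pinned-model branches above.
messages = [
    {"role": "system", "content": "You are a terse assistant."},
    {"role": "user", "content": "Ping?"},
]
print(tokenizer.count_tokens(messages, model="gpt-4-0613"))
```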

@ -32,6 +32,7 @@ from swarms.utils.remove_json_whitespace import (
remove_whitespace_from_json,
remove_whitespace_from_yaml,
)
from swarms.utils.exponential_backoff import ExponentialBackoffMixin
__all__ = [
"SubprocessCodeInterpreter",
@ -57,4 +58,5 @@ __all__ = [
"JsonOutputParser",
"remove_whitespace_from_json",
"remove_whitespace_from_yaml",
"ExponentialBackoffMixin",
]

@ -0,0 +1,46 @@
import logging
from abc import ABC
from dataclasses import dataclass
from tenacity import Retrying, wait_exponential, stop_after_attempt
@dataclass
class ExponentialBackoffMixin(ABC):
"""
A mixin class that provides exponential backoff functionality.
"""
min_retry_delay: float = 2
"""
The minimum delay between retries in seconds.
"""
max_retry_delay: float = 10
"""
The maximum delay between retries in seconds.
"""
max_attempts: int = 10
"""
The maximum number of retry attempts.
"""
    def after_hook(self, retry_state) -> None:
        """
        A callable that is executed after each retry attempt.
        Logs the tenacity retry state as a warning.
        """
        logging.warning(retry_state)
def retrying(self) -> Retrying:
"""
Returns a Retrying object configured with the exponential backoff settings.
"""
return Retrying(
wait=wait_exponential(
min=self.min_retry_delay, max=self.max_retry_delay
),
stop=stop_after_attempt(self.max_attempts),
reraise=True,
after=self.after_hook,
)
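
A usage sketch of the mixin with tenacity's iterator-style retry loop. `FlakyClient` and `fetch` are hypothetical names; the forced `ConnectionError` simply demonstrates the exponential backoff and the `reraise=True` behaviour.

```python
from dataclasses import dataclass

from swarms.utils.exponential_backoff import ExponentialBackoffMixin


@dataclass
class FlakyClient(ExponentialBackoffMixin):
    max_attempts: int = 3  # fail fast for the demo

    def fetch(self, url: str) -> str:
        for attempt in self.retrying():
            with attempt:
                # A real network call would go here; raising triggers an
                # exponentially backed-off retry, up to max_attempts.
                raise ConnectionError(f"could not reach {url}")
        return "ok"


client = FlakyClient()
try:
    client.fetch("https://example.com")
except ConnectionError as exc:
    print(f"gave up after retries: {exc}")
```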

@ -0,0 +1,39 @@
# AnthropicTokenizer
import pytest
from swarms.tokenizers.anthropic_tokenizer import AnthropicTokenizer
def test_post_init():
tokenizer = AnthropicTokenizer()
assert tokenizer.model == "claude-2.1"
assert tokenizer.max_tokens == 200000
def test_default_max_tokens():
tokenizer = AnthropicTokenizer(model="claude")
assert tokenizer.default_max_tokens() == 100000
@pytest.mark.parametrize(
"model,tokens", [("claude-2.1", 200000), ("claude", 100000)]
)
def test_default_max_tokens_models(model, tokens):
tokenizer = AnthropicTokenizer(model=model)
assert tokenizer.default_max_tokens() == tokens
def test_count_tokens_string():
# Insert mock instantiation of anthropic client and its count_tokens function
text = "This is a test string."
tokenizer = AnthropicTokenizer()
tokens = tokenizer.count_tokens(text)
assert tokens == 5
def test_count_tokens_list():
# Insert mock instantiation of anthropic client and its count_tokens function
text = ["This", "is", "a", "test", "string."]
tokenizer = AnthropicTokenizer()
with pytest.raises(ValueError):
tokenizer.count_tokens(text)

@ -0,0 +1,45 @@
# BaseTokenizer
import pytest
from swarms.tokenizers.base_tokenizer import BaseTokenizer
# 1. Fixture for BaseTokenizer instance.
@pytest.fixture
def base_tokenizer():
return BaseTokenizer(max_tokens=100)
# 2. Tests for __post_init__.
def test_post_init(base_tokenizer):
assert base_tokenizer.stop_sequences == ["<|Response|>"]
assert base_tokenizer.stop_token == "<|Response|>"
# 3. Tests for count_tokens_left with different inputs.
def test_count_tokens_left_with_positive_diff(
base_tokenizer, monkeypatch
):
# Mocking count_tokens to return a specific value
monkeypatch.setattr(
"swarms.tokenizers.BaseTokenizer.count_tokens",
lambda x, y: 50,
)
assert base_tokenizer.count_tokens_left("some text") == 50
def test_count_tokens_left_with_zero_diff(
base_tokenizer, monkeypatch
):
monkeypatch.setattr(
"swarms.tokenizers.BaseTokenizer.count_tokens",
lambda x, y: 100,
)
assert base_tokenizer.count_tokens_left("some text") == 0
# 4. Add tests for count_tokens. This method is an abstract one, so testing it
# will be dependent on the actual implementation in the subclass. Here is just
# a general idea how to test it (we assume that test_count_tokens is implemented in some subclass).
def test_count_tokens(subclass_tokenizer_instance):
assert subclass_tokenizer_instance.count_tokens("some text") == 6

@ -0,0 +1,35 @@
# CohereTokenizer
import pytest
from swarms.tokenizers.cohere_tokenizer import CohereTokenizer
from unittest.mock import MagicMock
@pytest.fixture
def cohere_tokenizer():
mock_client = MagicMock()
mock_client.tokenize.return_value.tokens = [
"token1",
"token2",
"token3",
]
return CohereTokenizer(model="<model-name>", client=mock_client)
def test_count_tokens_with_string(cohere_tokenizer):
tokens_count = cohere_tokenizer.count_tokens("valid string")
assert tokens_count == 3
def test_count_tokens_with_non_string(cohere_tokenizer):
with pytest.raises(ValueError):
cohere_tokenizer.count_tokens(["invalid", "input"])
def test_count_tokens_with_different_length(cohere_tokenizer):
cohere_tokenizer.client.tokenize.return_value.tokens = [
"token1",
"token2",
]
tokens_count = cohere_tokenizer.count_tokens("valid string")
assert tokens_count == 2

@ -0,0 +1,66 @@
# HuggingFaceTokenizer
import pytest
import os
from unittest.mock import patch
from swarms.tokenizers.r_tokenizers import HuggingFaceTokenizer
# Test class setup
@pytest.fixture
def hftokenizer():
dir_path = os.path.join(os.getcwd(), "modeldir")
tokenizer = HuggingFaceTokenizer(dir_path)
return tokenizer
# testing __init__
@patch("os.path")
@patch("swarms.tokenizers.get_logger")
def test___init__(mock_get_logger, mock_path, hftokenizer):
mock_path.exists.return_value = False
mock_path.join.return_value = "dummy_path"
mock_get_logger.return_value = "dummy_logger"
assert hftokenizer.model_dir == "dummy_path"
assert hftokenizer.logger == "dummy_logger"
assert hftokenizer._maybe_decode_bytes is False
assert hftokenizer._prefix_space_tokens is None
# testing vocab_size property
def test_vocab_size(hftokenizer):
assert hftokenizer.vocab_size == 30522
# testing bos_token_id property
def test_bos_token_id(hftokenizer):
assert hftokenizer.bos_token_id == 101
# testing eos_token_id property
def test_eos_token_id(hftokenizer):
assert hftokenizer.eos_token_id == 102
# testing prefix_space_tokens property
def test_prefix_space_tokens(hftokenizer):
assert len(hftokenizer.prefix_space_tokens) > 0
# testing _maybe_add_prefix_space method
def test__maybe_add_prefix_space(hftokenizer):
assert (
hftokenizer._maybe_add_prefix_space(
[101, 2003, 2010, 2050, 2001, 2339], " is why"
)
== " is why"
)
assert (
hftokenizer._maybe_add_prefix_space(
[2003, 2010, 2050, 2001, 2339], "is why"
)
== " is why"
)
# continuing tests for other methods...

@ -0,0 +1,47 @@
# OpenAITokenizer
import pytest
import swarms.tokenizers.openai_tokenizers as tokenizers
@pytest.fixture()
def openai_tokenizer():
return tokenizers.OpenAITokenizer("gpt-3")
def test_init(openai_tokenizer):
assert openai_tokenizer.model == "gpt-3"
def test_default_max_tokens(openai_tokenizer):
assert openai_tokenizer.default_max_tokens() == 4096
@pytest.mark.parametrize(
"text, expected_output", [("Hello, world!", 3), (["Hello"], 4)]
)
def test_count_tokens_single(openai_tokenizer, text, expected_output):
assert (
openai_tokenizer.count_tokens(text, "gpt-3")
== expected_output
)
@pytest.mark.parametrize(
"texts, expected_output",
[(["Hello, world!", "This is a test"], 6), (["Hello"], 4)],
)
def test_count_tokens_multiple(
openai_tokenizer, texts, expected_output
):
assert (
openai_tokenizer.count_tokens(texts, "gpt-3")
== expected_output
)
@pytest.mark.parametrize(
"text, expected_output", [("Hello, world!", 3), (["Hello"], 4)]
)
def test_len(openai_tokenizer, text, expected_output):
assert openai_tokenizer.len(text, "gpt-3") == expected_output

@ -0,0 +1,81 @@
# Tokenizer
from swarms.tokenizers.r_tokenizers import Tokenizer
from unittest.mock import patch
def test_initializer_existing_model_file():
with patch("os.path.exists", return_value=True):
with patch(
"swarms.tokenizers.SentencePieceTokenizer"
) as mock_model:
tokenizer = Tokenizer("tokenizers/my_model.model")
mock_model.assert_called_with("tokenizers/my_model.model")
assert tokenizer.model == mock_model.return_value
def test_initializer_model_folder():
with patch("os.path.exists", side_effect=[False, True]):
with patch(
"swarms.tokenizers.HuggingFaceTokenizer"
) as mock_model:
tokenizer = Tokenizer("my_model_directory")
mock_model.assert_called_with("my_model_directory")
assert tokenizer.model == mock_model.return_value
def test_vocab_size():
with patch(
"swarms.tokenizers.SentencePieceTokenizer"
) as mock_model:
tokenizer = Tokenizer("tokenizers/my_model.model")
assert (
tokenizer.vocab_size == mock_model.return_value.vocab_size
)
def test_bos_token_id():
with patch(
"swarms.tokenizers.SentencePieceTokenizer"
) as mock_model:
tokenizer = Tokenizer("tokenizers/my_model.model")
assert (
tokenizer.bos_token_id
== mock_model.return_value.bos_token_id
)
def test_encode():
with patch(
"swarms.tokenizers.SentencePieceTokenizer"
) as mock_model:
tokenizer = Tokenizer("tokenizers/my_model.model")
assert (
tokenizer.encode("hello")
== mock_model.return_value.encode.return_value
)
def test_decode():
with patch(
"swarms.tokenizers.SentencePieceTokenizer"
) as mock_model:
tokenizer = Tokenizer("tokenizers/my_model.model")
assert (
tokenizer.decode([1, 2, 3])
== mock_model.return_value.decode.return_value
)
def test_call():
with patch(
"swarms.tokenizers.SentencePieceTokenizer"
) as mock_model:
tokenizer = Tokenizer("tokenizers/my_model.model")
assert (
tokenizer("hello")
== mock_model.return_value.__call__.return_value
)
# More tests can be added here