parent
7c4d663edd
commit
ff5406dd50
@ -0,0 +1,123 @@
|
||||
# AnthropicTokenizer Documentation
|
||||
|
||||
## Introduction
|
||||
|
||||
This documentation provides a complete, in-depth guide to the `AnthropicTokenizer` class in the `swarms.tokenizers` library. The `AnthropicTokenizer` is designed specifically to interface with Anthropic's AI models, primarily for text tokenization and metadata handling.
|
||||
|
||||
Understanding how to use this tokenizer effectively is crucial for developers and researchers working with natural language processing, machine learning, and text analysis using Anthropic AI models.
|
||||
|
||||
The purpose of the `AnthropicTokenizer` is to measure how raw text breaks down into the tokens consumed by Anthropic AI models, so that inputs can be kept within model limits. Tokenization is a fundamental step in text processing pipelines and affects the performance of AI models.
|
||||
|
||||
## Class Definition: AnthropicTokenizer
|
||||
|
||||
`AnthropicTokenizer` extends the functionality of a base tokenizer to provide features specifically needed for Anthropic AI models. The class is designed to manage tokenization processes such as counting tokens and ensuring that the token count is under a specified limit, which is essential for effective and efficient model performance.
|
||||
|
||||
**Class Signature:**
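The signature below is a sketch reconstructed from the dataclass fields of `AnthropicTokenizer` elsewhere in this change set:

```python
from dataclasses import dataclass

from anthropic import Anthropic

from swarms.tokenizers.base_tokenizer import BaseTokenizer


@dataclass
class AnthropicTokenizer(BaseTokenizer):
    max_tokens: int = 500
    client: Anthropic = None
    model: str = "claude-2.1"
```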
|
||||
|
||||
**Parameters:**
|
||||
|
||||
| Parameter Name | Type | Description | Default Value |
|
||||
|---------------------------|-----------|-----------------------------------------------------------------|---------------|
|
||||
| `max_tokens` | `int` | Maximum number of tokens permitted. | `500` |
|
||||
| `client` | `Anthropic` | Instance of an `Anthropic` client for tokenization services. | `None` |
|
||||
| `model` | `str` | Identifier for the Anthropic model in use. | `"claude-2.1"`|
|
||||
|
||||
**Methods and their descriptions:**
|
||||
|
||||
| Method Name | Return Type | Description |
|
||||
|----------------------|-------------|--------------------------------------------------------------|
|
||||
| `__post_init__` | `None` | Initializes default parameters and client instance. |
|
||||
| `default_max_tokens` | `int` | Returns the default maximum number of tokens. |
|
||||
| `count_tokens` | `int` | Counts tokens in the input text. Raises a ValueError if the input is not a string. |
|
||||
|
||||
## Architecture and Mechanics
|
||||
|
||||
Upon instantiation, `AnthropicTokenizer` initializes its `max_tokens` limit and sets up a client to interact with the Anthropic services. The client is responsible for providing tokenization functions critical for processing the text inputs.
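If you already have an `Anthropic` client configured, it can be supplied explicitly rather than letting the tokenizer construct one. A brief sketch (it assumes the `anthropic` package is installed and `ANTHROPIC_API_KEY` is set in the environment):

```python
from anthropic import Anthropic

from swarms.tokenizers import AnthropicTokenizer

client = Anthropic()  # picks up ANTHROPIC_API_KEY from the environment
tokenizer = AnthropicTokenizer(client=client)
```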
|
||||
|
||||
The tokenizer employs a dictionary to map specific model prefixes to their maximum token counts. This allows users to adapt the tokenizer's behavior to different models with varying token limits. The `default_max_tokens()` method dynamically retrieves the token limit based on the provided model name, ensuring compatibility and flexibility.
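As a sketch of that lookup, the prefix map and retrieval logic look roughly like this (mirroring the `MODEL_PREFIXES_TO_MAX_TOKENS` mapping set up in the tokenizer's `__post_init__`):

```python
MODEL_PREFIXES_TO_MAX_TOKENS = {
    "claude-2.1": 200000,
    "claude": 100000,
}


def default_max_tokens(model: str) -> int:
    # Return the limit of the first prefix that matches the model name.
    return next(
        limit
        for prefix, limit in MODEL_PREFIXES_TO_MAX_TOKENS.items()
        if model.startswith(prefix)
    )


print(default_max_tokens("claude-2.1"))  # 200000
```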
|
||||
|
||||
`count_tokens()` is a critical function that calculates the number of tokens in a given text. This functionality is essential for respecting the model's token limit and ensuring accurate processing by the Anthropic AI.
|
||||
|
||||
## Usage Examples
|
||||
|
||||
Before delving into the examples, make sure you have `swarms.tokenizers` installed and ready. Since `anthropic` is an optional dependency, ensure that it's installed as well.
|
||||
|
||||
### 1. Tokenizing with Default Settings
|
||||
|
||||
```python
|
||||
from swarms.tokenizers import AnthropicTokenizer
|
||||
|
||||
# Initialize the tokenizer with default settings
|
||||
tokenizer = AnthropicTokenizer()
|
||||
|
||||
# Tokenize a sample text
|
||||
text = "Hello world! This is an example text to tokenize."
|
||||
token_count = tokenizer.count_tokens(text)
|
||||
|
||||
print(f"Number of tokens: {token_count}")
|
||||
```
|
||||
|
||||
In this example, we use the `AnthropicTokenizer` to count the number of tokens in a simple text. The token count can be crucial for managing inputs to the AI model.
|
||||
|
||||
### 2. Tokenizing with Custom Model
|
||||
|
||||
```python
|
||||
from swarms.tokenizers import AnthropicTokenizer
|
||||
|
||||
# Define a custom model
|
||||
custom_model = "claude"
|
||||
|
||||
# Initialize the tokenizer with a custom model and max_tokens
|
||||
tokenizer = AnthropicTokenizer(model=custom_model, max_tokens=1000)
|
||||
|
||||
# Process a larger text
|
||||
large_text = "..." # Assume large_text is a string with meaningful content
|
||||
|
||||
token_count = tokenizer.count_tokens(large_text)
if token_count > tokenizer.max_tokens:
    print("Text exceeds the maximum token limit.")
else:
    print(f"Token count within limit: {token_count}")
|
||||
```
|
||||
|
||||
This snippet demonstrates setting up the tokenizer for a custom model and a higher maximum token limit. It is helpful when dealing with texts larger than the default token limit.
|
||||
|
||||
### 3. Handling Errors in the Token Count Function
|
||||
|
||||
```python
|
||||
from swarms.tokenizers import AnthropicTokenizer
|
||||
|
||||
# Initialize the tokenizer
|
||||
tokenizer = AnthropicTokenizer()
|
||||
|
||||
# Attempt to tokenize a non-string input (which will raise an error)
|
||||
non_string_input = ["This", "is", "a", "list", "not", "a", "string"]
|
||||
|
||||
try:
    tokenizer.count_tokens(non_string_input)
except ValueError as e:
    print(f"Error: {e}")
|
||||
```
|
||||
|
||||
This example illustrates the error management within the `count_tokens` method. It is important to handle exceptions gracefully, particularly when a non-string input is provided.
|
||||
|
||||
## Additional Tips and Considerations
|
||||
|
||||
- Always ensure the input text is a string before calling `count_tokens` to avoid unnecessary errors.
|
||||
- Be aware of the `max_tokens` limit since larger models might have significantly higher limits than defaults.
|
||||
- When tokenizing large datasets, batch processing with a loop or parallelization might provide better performance; a simple loop is sketched after this list.
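As an illustration of the batching point above, a minimal sketch (the `documents` list is hypothetical):

```python
from swarms.tokenizers import AnthropicTokenizer

tokenizer = AnthropicTokenizer()
documents = ["First document ...", "Second document ...", "Third document ..."]

# Count tokens for each document in a simple loop.
token_counts = [tokenizer.count_tokens(doc) for doc in documents]
print(token_counts)
```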
|
||||
|
||||
## Resources and References
|
||||
|
||||
Given that `AnthropicTokenizer` interacts with an AI model and optional dependencies, it is beneficial to refer to the official documentation and guides specific to those components:
|
||||
|
||||
- [Anthropic Model Documentation](#) (Link would be replaced with actual URL)
|
||||
- [swarms.tokenizers Installation Guide](#)
|
||||
- [Python `dataclasses` Documentation](https://docs.python.org/3/library/dataclasses.html)
|
||||
|
||||
Additionally, literature on best practices for tokenization and natural language processing will contribute to a more effective use of the tokenizer:
|
||||
|
||||
- Smith, B. (Year). "Advanced Tokenization Techniques for NLP Models." Journal of Machine Learning.
|
||||
- Caruthers, M. (Year). "Text Pre-processing and Tokenization for Deep Learning."
|
||||
|
||||
By following the provided documentation and recommended practices, developers and researchers can harness the power of `AnthropicTokenizer` to its full potential, facilitating optimal use of Anthropic's AI models for varied text processing tasks.
|
@ -0,0 +1,62 @@
|
||||
# Documentation for `swarms.tokenizers.BaseTokenizer`
|
||||
|
||||
## Overview and Introduction
|
||||
|
||||
The `swarms.tokenizers` library is designed to provide flexible and efficient tokenization utilities for natural language processing (NLP) tasks. The `BaseTokenizer` class serves as a foundational abstract class from which specific tokenizer implementations can be derived. This class outlines essential functions and properties all tokenizers should have, ensuring consistency and capturing common behaviors required for processing textual data.
|
||||
|
||||
## Class Definition: `BaseTokenizer`
|
||||
|
||||
### Attributes and Methods
|
||||
|
||||
| Name | Type | Description |
|
||||
| ---------------------- | ------------------------------- | ------------------------------------------------------------------------- |
|
||||
| `max_tokens` | `int` | Maximum number of tokens the tokenizer can process. |
|
||||
| `stop_token` | `str` | Token used to denote the end of processing. |
|
||||
| `stop_sequences` | `List[str]` (read-only) | List of stop sequences initialized post-instantiation. |
|
||||
| `count_tokens_left` | Method: `(text) -> int` | Computes the number of tokens that can still be added given the text. |
|
||||
| `count_tokens` | Abstract Method: `(text) -> int`| Returns the number of tokens in the given text. |
|
||||
|
||||
## Functionality and Usage
|
||||
|
||||
The `BaseTokenizer` class provides the structure for creating tokenizers. It includes methods for counting the tokens in a given text and determining how many more tokens can be added without exceeding the `max_tokens` limit. This class should be subclassed, and the `count_tokens` method must be implemented in subclasses to provide the specific token counting logic.
|
||||
|
||||
### Example: Subclassing `BaseTokenizer`
|
||||
|
||||
```python
|
||||
from typing import List, Union

from swarms.tokenizers import BaseTokenizer


class SimpleTokenizer(BaseTokenizer):
    def count_tokens(self, text: Union[str, List[dict]]) -> int:
        if isinstance(text, str):
            # Split text by spaces as a simple tokenization approach
            return len(text.split())
        elif isinstance(text, list):
            # Assume a list of dictionaries with a 'token' key
            return sum(len(item['token'].split()) for item in text)
        else:
            raise TypeError("Unsupported type for text")


# Usage example
tokenizer = SimpleTokenizer(max_tokens=100)
text = "This is an example sentence to tokenize."
print(tokenizer.count_tokens(text))  # Outputs: 7 (assuming space tokenization)

remaining_tokens = tokenizer.count_tokens_left(text)
print(remaining_tokens)  # Outputs: 93
|
||||
```
|
||||
|
||||
### Note:
|
||||
|
||||
Understand that the `stop_sequences` and `stop_token` in this particular implementation are placeholders to illustrate the pattern. The actual logic may differ based on specific tokenizer requirements.
|
||||
|
||||
## Additional Information and Tips
|
||||
|
||||
- Tokenization is a vital step in text processing for NLP. It should be tailored to the requirements of the application.
|
||||
- Ensure that tokenizer definitions are in sync with the models and datasets being used.
|
||||
|
||||
## References and Resources
|
||||
|
||||
For a deeper understanding of tokenization and its role in NLP, refer to:
|
||||
|
||||
- [Natural Language Processing (NLP) in Python — Tokenization](https://nlp.stanford.edu/IR-book/html/htmledition/tokenization-1.html)
|
||||
- [Hugging Face Tokenizers](https://huggingface.co/docs/tokenizers/python/latest/) - a popular library for tokenization, particularly in the context of transformer models.
|
@ -0,0 +1,134 @@
|
||||
# CohereTokenizer Documentation
|
||||
|
||||
The `CohereTokenizer` class is designed to interface with Cohere language models and provides methods for tokenizing text inputs. This tokenizer plays a crucial role in preparing data for a Cohere model, which operates on tokens rather than raw text.
|
||||
|
||||
---
|
||||
|
||||
### Class Name: `CohereTokenizer`
|
||||
|
||||
## Overview
|
||||
|
||||
The `CohereTokenizer` class is essential for interacting with Cohere models that require tokenized input. As models often operate on tokens, having an intuitive and efficient tokenizer directly linked to the model simplifies preprocessing tasks. This tokenizer counts the tokens in the given text, helping users to manage and understand the tokens they can work with, given limitations like the model's maximum token count.
|
||||
|
||||
## Architecture and How the Class Works
|
||||
|
||||
The `CohereTokenizer` is built as a data class, ensuring that it is lightweight and focused solely on its data attributes and methods related to tokenization. The class relies on an instance of a Cohere `Client`, which needs to be instantiated with an API key from Cohere before use.
|
||||
|
||||
Upon instantiation, the `CohereTokenizer` holds a reference to a specific Cohere model and interfaces with the `Client` to tokenize text accordingly. It provides a simple utility (`count_tokens`) to count the number of tokens that a string, or a list of strings, would be broken down into by the Cohere API.
|
||||
|
||||
## Purpose and Usage
|
||||
|
||||
The `CohereTokenizer` is specifically made for users who are working with Cohere language models. It's designed to help them in preprocessing steps by converting text into tokens and determining how many tokens their text segments contain. This is crucial for ensuring that inputs do not exceed the model's maximum token count, as exceeding this limit can result in errors or truncated text.
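A minimal guard along those lines might look like this (a sketch; `tokenizer` is assumed to be a configured `CohereTokenizer` as shown in the examples below):

```python
text = "Some long document ..."

if tokenizer.count_tokens(text) > tokenizer.max_tokens:
    # Split or truncate the text before sending it to the Cohere API.
    raise ValueError("Input exceeds the model's token limit.")
```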
|
||||
|
||||
---
|
||||
|
||||
# Class Definition
|
||||
|
||||
```python
|
||||
@dataclass
class CohereTokenizer:
    model: str
    client: Client
    DEFAULT_MODEL: str = "command"
    DEFAULT_MAX_TOKENS: int = 2048
    max_tokens: int = DEFAULT_MAX_TOKENS
|
||||
```
|
||||
|
||||
## Parameters
|
||||
|
||||
| Parameter | Type | Description | Default Value |
|
||||
| ------------------ | -------------- | ------------------------------------------------------------- | ------------- |
|
||||
| `model` | `str` | Specifies the Cohere model to be used for tokenization. | None |
|
||||
| `client` | `Client` | An instance of the Cohere client, initialized with an API key.| None |
|
||||
| `DEFAULT_MODEL` | `str` | The default model to use if none is specified. | "command" |
|
||||
| `DEFAULT_MAX_TOKENS`| `int` | Default maximum number of tokens the model accepts. | 2048 |
|
||||
| `max_tokens` | `int` | Maximum number of tokens; it can be altered to fit the model. | `DEFAULT_MAX_TOKENS`|
|
||||
|
||||
### Methods
|
||||
|
||||
The `CohereTokenizer` class contains the following method:
|
||||
|
||||
#### `count_tokens`
|
||||
|
||||
```python
|
||||
def count_tokens(self, text: str | list) -> int:
    """
    Count the number of tokens in the given text.

    Args:
        text (str | list): The input text to tokenize.

    Returns:
        int: The number of tokens in the text.

    Raises:
        ValueError: If the input text is not a string.
    """
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
# Functionality and Usage Example
|
||||
|
||||
Below are examples demonstrating how to use `CohereTokenizer`.
|
||||
|
||||
---
|
||||
|
||||
## Counting Tokens
|
||||
|
||||
### Initialization
|
||||
|
||||
First, the Cohere client must be initialized and passed in to create an instance of `CohereTokenizer`.
|
||||
|
||||
```python
|
||||
from cohere import Client
|
||||
from swarms.tokenizers import CohereTokenizer
|
||||
|
||||
# Initialize Cohere client with your API key
|
||||
cohere_client = Client('your-api-key')
|
||||
|
||||
# Instantiate the tokenizer
|
||||
tokenizer = CohereTokenizer(model='your-model-name', client=cohere_client)
|
||||
```
|
||||
|
||||
### Count Tokens Example 1
|
||||
|
||||
Counting tokens for a single string.
|
||||
|
||||
```python
|
||||
text_to_tokenize = "Hello, World!"
|
||||
token_count = tokenizer.count_tokens(text_to_tokenize)
|
||||
print(f"Number of tokens: {token_count}")
|
||||
```
|
||||
|
||||
### Count Tokens Example 2
|
||||
|
||||
Trying to pass a list instead of a single string, which would raise an error.
|
||||
|
||||
```python
|
||||
texts_to_tokenize = ["Hello, World!", "Another piece of text."]
|
||||
try:
    token_count = tokenizer.count_tokens(texts_to_tokenize)
except ValueError as e:
    print(f"Error: {e}")
|
||||
```
|
||||
|
||||
The above code would print `Error: Text must be a string.` as the `count_tokens` function expects a string, not a list.
|
||||
|
||||
---
|
||||
|
||||
# Additional Information and Tips
|
||||
|
||||
When working with the `CohereTokenizer`, here are some key points to keep in mind:
|
||||
|
||||
- The token count is important to know because Cohere models have a maximum token limit for input. If your text exceeds this limit, it must be split or truncated before being passed to the model; a naive splitting approach is sketched after this list.
|
||||
- It is always a good practice to catch exceptions when using methods like `count_tokens` to handle unexpected inputs gracefully.
|
||||
- Remember to replace `'your-api-key'` and `'your-model-name'` with your actual Cohere API key and desired model name.
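To illustrate the splitting point above, an illustrative sketch (not part of the library) that simply halves the text until the count fits:

```python
def fit_to_limit(text: str, tokenizer: CohereTokenizer) -> str:
    # Repeatedly halve the text until its token count fits the limit.
    while tokenizer.count_tokens(text) > tokenizer.max_tokens:
        text = text[: len(text) // 2]
    return text
```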
|
||||
|
||||
# References and Resources
|
||||
|
||||
For more detailed information, refer to the following resources:
|
||||
|
||||
- [Cohere API documentation](https://docs.cohere.ai/)
|
||||
- [Data Classes in Python](https://docs.python.org/3/library/dataclasses.html)
|
||||
|
@ -0,0 +1,127 @@
|
||||
# HuggingFaceTokenizer Documentation
|
||||
|
||||
`HuggingFaceTokenizer` is a comprehensive Python class that leverages the Hugging Face `transformers` library to tokenize text using the SentencePiece tokenization mechanism. This class serves as a convenient wrapper for initializing and using tokenizer models from Hugging Face's transformer models, enabling easy integration of tokenizer functionality in various NLP tasks.
|
||||
|
||||
**Purpose and Architecture:**
|
||||
|
||||
Tokenization is a critical step in processing natural language wherein text is broken down into smaller elements (tokens), which can be further used for text analysis, language modeling, and other computational linguistics tasks. The `HuggingFaceTokenizer` provides methods to encode text (turning strings into lists of token IDs) and decode lists of token IDs back into human-readable text.
|
||||
|
||||
**Table of Contents:**
|
||||
|
||||
- [Overview](#overview)
|
||||
- [Initialization](#initialization)
|
||||
- [Properties](#properties)
|
||||
- [Methods](#methods)
|
||||
- [Usage Examples](#usage-examples)
|
||||
- [References and Resources](#references-and-resources)
|
||||
|
||||
## Overview
|
||||
|
||||
The `HuggingFaceTokenizer` class is designed to streamline the process of tokenizing text for natural language processing (NLP). It encapsulates various functionalities, such as encoding text into tokens, decoding tokens into text, and identifying token IDs for special tokens.
|
||||
|
||||
## Initialization
|
||||
|
||||
`HuggingFaceTokenizer` is initialized by providing the directory containing the pretrained tokenizer model files. During its initialization, it configures its internal state for tokenization processes, prepares access to vocabulary, and establishes necessary properties for subsequent tokenization tasks.
|
||||
|
||||
### Constructor Parameters
|
||||
|
||||
| Parameter | Data Type | Description | Default |
|
||||
|------------|-----------|--------------------------------------------|---------|
|
||||
| model_dir | `str` | The directory containing the tokenizer model files. | None |
|
||||
|
||||
### Attributes
|
||||
|
||||
| Attribute | Data Type | Description |
|
||||
|-------------------|---------------------|--------------------------------------------------------|
|
||||
| vocab_size | `int` | The size of the vocabulary used by the tokenizer. |
|
||||
| bos_token_id | `int` | The token ID representing the beginning of sequence token. |
|
||||
| eos_token_id | `int` | The token ID representing the end of sequence token. |
|
||||
| prefix_space_tokens | `Set[int]` | A set of token IDs without a prefix space. |
|
||||
|
||||
## Methods
|
||||
|
||||
### Vocabulary Related Methods
|
||||
|
||||
#### `vocab_size`
|
||||
Returns the size of the tokenizer's vocabulary.
|
||||
|
||||
#### `bos_token_id`
|
||||
Returns the token ID used for the beginning of a sentence.
|
||||
|
||||
#### `eos_token_id`
|
||||
Returns the token ID used for the end of a sentence.
|
||||
|
||||
#### `prefix_space_tokens`
|
||||
Returns a set of token IDs that start without prefix spaces.
|
||||
|
||||
### Tokenization Methods
|
||||
|
||||
#### `encode`
|
||||
Encodes a given text into a sequence of token IDs.
|
||||
|
||||
#### `decode`
|
||||
Decodes a given sequence of token IDs into human-readable text.
|
||||
|
||||
#### `indexes_containing_token`
|
||||
Returns a list of token IDs that potentially could be decoded into the given token.
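No dedicated example appears in the usage section below, so here is a short sketch (assuming `tokenizer` has been initialized as in the initialization example):

```python
# Find vocabulary indexes whose decoded form may contain the token "hello".
candidate_ids = tokenizer.indexes_containing_token("hello")
print(candidate_ids[:10])
```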
|
||||
|
||||
#### `__call__`
|
||||
Tokenizes given text when the object is called like a function.
|
||||
|
||||
## Usage Examples
|
||||
|
||||
### 1. Initializing the Tokenizer
|
||||
|
||||
```python
|
||||
from swarms.tokenizers import HuggingFaceTokenizer
|
||||
|
||||
# Initialize the tokenizer with the path to your tokenizer model.
|
||||
tokenizer = HuggingFaceTokenizer('/path/to/your/model_dir')
|
||||
```
|
||||
|
||||
### 2. Encoding Text
|
||||
|
||||
```python
|
||||
# Tokenize a single sentence.
|
||||
sentence = "The quick brown fox jumps over the lazy dog."
|
||||
token_ids = tokenizer.encode(sentence)
|
||||
print(token_ids)
|
||||
```
|
||||
|
||||
### 3. Decoding Tokens
|
||||
|
||||
```python
|
||||
# Assuming 'token_ids' contains a list of token IDs
|
||||
decoded_text = tokenizer.decode(token_ids)
|
||||
print(decoded_text)
|
||||
```
|
||||
|
||||
### 4. Getting Special Token IDs
|
||||
|
||||
```python
|
||||
# Get the beginning of sequence token ID
|
||||
bos_id = tokenizer.bos_token_id
|
||||
print(f"BOS token ID: {bos_id}")
|
||||
|
||||
# Get the end of sequence token ID
|
||||
eos_id = tokenizer.eos_token_id
|
||||
print(f"EOS token ID: {eos_id}")
|
||||
```
|
||||
|
||||
### 5. Using the Tokenizer
|
||||
|
||||
```python
|
||||
# Tokenize a prompt directly by calling the object with a string.
|
||||
text = "Hello, world!"
|
||||
token_ids = tokenizer(text)
|
||||
print(token_ids)
|
||||
```
|
||||
|
||||
## References and Resources
|
||||
|
||||
For more in-depth information on the Hugging Face `transformers` library and SentencePiece, refer to the following resources:
|
||||
|
||||
- Hugging Face `transformers` library documentation: https://huggingface.co/docs/transformers/index
|
||||
- SentencePiece repository and documentation: https://github.com/google/sentencepiece
|
||||
|
||||
This documentation provides an introductory overview of the `HuggingFaceTokenizer` class. For a more extensive guide on the various parameters, functionalities, and advanced usage scenarios, users should refer to the detailed library documentation and external resources provided above.
|
@ -0,0 +1,89 @@
|
||||
# OpenAITokenizer
|
||||
|
||||
The `OpenAITokenizer` class is a versatile and intuitive tokenizer designed for use with OpenAI's various language models, including the powerful GPT series. This class addresses the need to efficiently tokenize text for submission to OpenAI's API endpoints, managing different models and their unique tokenization schemes with ease.
|
||||
|
||||
The utility of `OpenAITokenizer` centers on its key features:
|
||||
- Support for multiple OpenAI models including GPT-3 and GPT-4.
|
||||
- Dynamic token counting that considers model-specific details.
|
||||
- Straightforward API intended for easy integration with larger systems.
|
||||
|
||||
## Architecture and Design
|
||||
|
||||
The class adheres to a simple yet effective design, offering methods for calculating token lengths and embedded properties that manage model-specific characteristics such as maximum tokens and encodings. A data class structure is used for clean initializations and better management of class data.
|
||||
|
||||
The `OpenAITokenizer` uses a property-based approach and a method-based approach to provide users with a variety of functionalities essential for preparing text input for OpenAI models.
|
||||
|
||||
## Attributes
|
||||
|
||||
The class contains several key constants and properties that define defaults and settings for use with different models:
|
||||
|
||||
| Attribute | Type | Description |
|
||||
|-----------------------------------------------------|-----------------|-------------------------------------------------------------|
|
||||
| `DEFAULT_OPENAI_GPT_3_COMPLETION_MODEL` | `str` | Default completion model for OpenAI GPT-3. |
|
||||
| `DEFAULT_OPENAI_GPT_3_CHAT_MODEL` | `str` | Default chat model for OpenAI GPT-3. |
|
||||
| `DEFAULT_OPENAI_GPT_4_MODEL` | `str` | Default model for OpenAI GPT-4. |
|
||||
| `DEFAULT_ENCODING` | `str` | Default encoding for text. |
|
||||
| `DEFAULT_MAX_TOKENS` | `int` | Default maximum number of tokens based on the model. |
|
||||
| `TOKEN_OFFSET` | `int` | Token offset applicable to some models. |
|
||||
| `MODEL_PREFIXES_TO_MAX_TOKENS` | `dict` | Mapping of model prefixes to their respective max tokens. |
|
||||
| `EMBEDDING_MODELS` | `list` | List of embedding models supported. |
|
||||
| `model` | `str` | Name of the model currently being used. |
|
||||
|
||||
## Methods
|
||||
|
||||
The `OpenAITokenizer` class offers a variety of methods:
|
||||
|
||||
| Method | Arguments | Return Type | Description |
|
||||
|-----------------------|---------------------------------------------|-------------|------------------------------------------------------------------------------------------------|
|
||||
| `__post_init__` | None | `None` | Method called after the class has been initialized to set up default values. |
|
||||
| `encoding` | None | `Encoding` | Getter method that retrieves the encoding based on the specified model. |
|
||||
| `default_max_tokens` | None | `int` | Calculates the default max tokens based on the current model or defaults if not model-specific.|
|
||||
| `count_tokens` | `text: str \| list[dict]`, `model: str` | `int` | Counts the number of tokens within a given text or a list of messages. |
|
||||
| `len` | `text: str \| list[dict]`, `model: str` | `int` | Wrapper for `count_tokens`, providing a more intuitive naming convention. |
|
||||
|
||||
### Usage Examples
|
||||
|
||||
Given the extensive nature of this class, several examples are provided for each method, detailing how to use the `OpenAITokenizer` in different contexts.
|
||||
|
||||
#### Example 1: Initializing the Tokenizer
|
||||
|
||||
```python
|
||||
from swarms.tokenizers import OpenAITokenizer
|
||||
|
||||
tokenizer = OpenAITokenizer(model='gpt-4')
|
||||
```
|
||||
|
||||
This example creates a new instance of `OpenAITokenizer` set to work with the GPT-4 model.
|
||||
|
||||
#### Example 2: Counting Tokens
|
||||
|
||||
```python
|
||||
text = "Hello, this is an example text to tokenize."
|
||||
|
||||
# Initialize the tokenizer
|
||||
tokenizer = OpenAITokenizer(model='gpt-4')
|
||||
|
||||
# Count tokens
|
||||
num_tokens = tokenizer.count_tokens(text)
|
||||
print(f"Number of tokens: {num_tokens}")
|
||||
```
|
||||
|
||||
This code snippet demonstrates how to count the number of tokens in a string of text using the specified model's encoding.
|
||||
|
||||
#### Example 3: Custom Model Token Counting
|
||||
|
||||
```python
|
||||
messages = [
|
||||
{"name": "Alice", "message": "Hello! How are you?"},
|
||||
{"name": "Bob", "message": "I'm good! Just working on some code."},
|
||||
]
|
||||
|
||||
tokenizer = OpenAITokenizer(model='gpt-3.5-turbo')
|
||||
|
||||
# Count tokens for a list of messages
|
||||
num_tokens = tokenizer.len(messages, model="gpt-3.5-turbo-0613")
|
||||
print(f"Total tokens for messages: {num_tokens}")
|
||||
```
|
||||
|
||||
In this example, we're invoking the `len` method to count the tokens in a conversation thread. Each message is represented as a dictionary with a `name` and `message` field.
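#### Example 4: Inspecting the Encoding and Default Limit

The `encoding` getter and `default_max_tokens` method listed above can also be inspected directly. A sketch, assuming `encoding` is exposed as a property as the methods table suggests:

```python
tokenizer = OpenAITokenizer(model="gpt-4")

print(tokenizer.encoding)              # tiktoken Encoding resolved for the model
print(tokenizer.default_max_tokens())  # default maximum tokens for the model
```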
|
||||
|
@ -0,0 +1,163 @@
|
||||
# swarms.tokenizers Documentation
|
||||
|
||||
`swarms.tokenizers` is a PyTorch-like tokenization library designed to facilitate natural language processing (NLP) tasks by converting text inputs into a form that machine learning models can interpret. In this documentation, we will outline how to utilize the `SentencePieceTokenizer` class from the `swarms.tokenizers` library, which offers sentencepiece tokenization, a language-independent subword tokenizer and detokenizer.
|
||||
|
||||
## Purpose and Architecture of `SentencePieceTokenizer`
|
||||
|
||||
The `SentencePieceTokenizer` class uses a pre-trained sentencepiece model to tokenize and detokenize texts. SentencePiece is an unsupervised text tokenizer and detokenizer that allows the generation of a subword vocabulary from raw data. By breaking text down into subword units (like wordpieces or byte-pair-encodings), SentencePiece handles languages without a clear word boundary and can improve the performance of text processing in neural network models.
|
||||
|
||||
In `SentencePieceTokenizer`, the tokenization process is language-agnostic and encompasses a range of tokenization strategies, such as byte pair encoding (BPE), unigram, or a combination of both. The class is designed with ease of use in mind, allowing seamless integration with other components of the NLP pipeline.
|
||||
|
||||
## Class Definition
|
||||
|
||||
```python
|
||||
class SentencePieceTokenizer:
    """
    Tokenizer of sentencepiece.

    Args:
        model_file (str): the path of the tokenizer model
    """
|
||||
```
|
||||
|
||||
## Initialization Parameters
|
||||
|
||||
Parameter | Type | Description
|
||||
----------------|------|-------------
|
||||
`model_file` | `str` | The path to the pretrained sentencepiece model file.
|
||||
|
||||
## Methods and Usage
|
||||
|
||||
Below, we detail the methods available in `SentencePieceTokenizer`, including their parameters, their functionality, and usage examples.
|
||||
|
||||
### Method: `__init__`
|
||||
|
||||
Instantiates an instance of the `SentencePieceTokenizer` with the specified sentencepiece model.
|
||||
|
||||
#### Parameters
|
||||
|
||||
Parameter | Type | Description
|
||||
----------|------|-------------
|
||||
`model_file` | `str` | The path to the pretrained sentencepiece model file.
|
||||
|
||||
#### Example
|
||||
|
||||
```python
|
||||
from swarms.tokenizers import SentencePieceTokenizer
|
||||
|
||||
tokenizer = SentencePieceTokenizer(model_file='your_model.model')
|
||||
```
|
||||
|
||||
### Properties: Vocabulary Information
|
||||
|
||||
These properties provide access to various vocabulary-specific information.
|
||||
|
||||
#### `vocab_size`
|
||||
#### `bos_token_id`
|
||||
#### `eos_token_id`
|
||||
|
||||
##### Example
|
||||
|
||||
```python
|
||||
vocab_size = tokenizer.vocab_size
|
||||
print(f"Vocabulary size: {vocab_size}")
|
||||
|
||||
bos_id = tokenizer.bos_token_id
|
||||
eos_id = tokenizer.eos_token_id
|
||||
print(f"BOS token ID: {bos_id}, EOS token ID: {eos_id}")
|
||||
```
|
||||
|
||||
### Method: `indexes_containing_token`
|
||||
|
||||
Finds possible tokenizer indexes that, when decoded, may contain the input token.
|
||||
|
||||
#### Parameters
|
||||
|
||||
Parameter | Type | Description
|
||||
----------|------|-------------
|
||||
`token` | `str` | The token for which possible indexes are to be found.
|
||||
|
||||
#### Returns
|
||||
|
||||
- `List[int]`: List of tokenizer indexes that might contain the token.
|
||||
|
||||
#### Example
|
||||
|
||||
```python
|
||||
indexes = tokenizer.indexes_containing_token("▁the")
|
||||
print(f"Indexes containing '▁the': {indexes}")
|
||||
```
|
||||
|
||||
### Method: `encode`
|
||||
|
||||
Tokenizes a text prompt into a list of token IDs.
|
||||
|
||||
#### Parameters
|
||||
|
||||
Parameter | Type | Description
|
||||
----------|------|-------------
|
||||
`s` | `str` | The text prompt to tokenize.
|
||||
`add_bos` | `bool` | If `True`, it adds the beginning-of-sentence token. (default: `True`)
|
||||
|
||||
#### Returns
|
||||
- `List[int]`: List of token IDs representing the text prompt.
|
||||
|
||||
#### Example
|
||||
|
||||
```python
|
||||
encoded_ids = tokenizer.encode("Hello, world!", add_bos=True)
|
||||
print(f"Encoded token IDs: {encoded_ids}")
|
||||
```
|
||||
|
||||
### Method: `decode`
|
||||
|
||||
Detokenizes a list of token IDs into text.
|
||||
|
||||
#### Parameters
|
||||
|
||||
Parameter | Type | Description
|
||||
----------|------|-------------
|
||||
`t` | `List[int]` | A list of token IDs to detokenize.
|
||||
`offset` | `Optional[int]` | For incremental decoding. Defaults to `None`, which means it is not applied.
|
||||
|
||||
#### Returns
|
||||
|
||||
- `str`: Text representation of the decoded token IDs.
|
||||
|
||||
#### Example
|
||||
|
||||
```python
|
||||
decoded_text = tokenizer.decode([bos_id] + encoded_ids)
|
||||
print(f"Decoded text: {decoded_text}")
|
||||
```
|
||||
|
||||
### Method: `__call__`
|
||||
|
||||
Tokenizes prompts when the class instance is used as a callable.
|
||||
|
||||
#### Parameters
|
||||
|
||||
Parameter | Type | Description
|
||||
----------|------|-------------
|
||||
`s` | `Union[str, Sequence[str]]` | Text prompts to tokenize.
|
||||
`add_bos` | `bool` | If `True`, it adds the beginning-of-sentence token. (default: `False`)
|
||||
`add_eos` | `bool` | If `True`, it adds the end-of-sentence token. (default: `False`)
|
||||
|
||||
#### Returns
|
||||
|
||||
- `addict.Addict`: Object with `input_ids` containing the list of token IDs.
|
||||
|
||||
#### Example
|
||||
|
||||
```python
|
||||
input_data = tokenizer("Let's tokenize this sentence.")
|
||||
print(f"Tokenized input IDs: {input_data.input_ids}")
|
||||
```
|
||||
|
||||
## Additional Information and Tips
|
||||
|
||||
The library has efficient internals that cache information for performance benefits. For example, `indexes_containing_token` uses a deque to store the most recent lookups, which saves computation time by avoiding re-traversing the vocabulary.
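The idea behind that cache can be sketched as follows; this is purely illustrative and not the library's exact implementation:

```python
from collections import deque


class TokenIndexCache:
    """Keep the most recent token -> candidate-index lookups (illustrative)."""

    def __init__(self, maxlen: int = 10):
        self._recent = deque(maxlen=maxlen)  # (token, indexes) pairs

    def get(self, token: str):
        for cached_token, indexes in self._recent:
            if cached_token == token:
                return indexes
        return None

    def put(self, token: str, indexes: list[int]) -> None:
        self._recent.append((token, indexes))
```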
|
||||
|
||||
## Conclusion
|
||||
|
||||
This documentation provides an in-depth explanation of `swarms.tokenizers` with a focus on the `SentencePieceTokenizer` class. By following the examples and guidance detailed above, users should be able to effectively use the tokenizers for their NLP tasks. Users are also encouraged to refer to further resources and the official SentencePiece documentation for more advanced use cases and configurations.
|
@ -0,0 +1,78 @@
|
||||
# `Tokenizer` Class Documentation
|
||||
|
||||
The `Tokenizer` class is a flexible and robust tokenization tool designed to efficiently tokenize prompts into a sequence of token IDs or convert token IDs back into readable text. The class works by initializing with a path to a pretrained tokenization model and supports different tokenization backends based on the availability of configs and pretrained models.
|
||||
|
||||
## Initialization & Configuration
|
||||
|
||||
### Parameters:
|
||||
|
||||
| Parameter | Type | Description | Required |
|
||||
|------------|------|------------------------------------------|----------|
|
||||
| model_file | str | Path to the tokenizer model or directory | Yes |
|
||||
|
||||
### Attributes:
|
||||
|
||||
| Attribute | Type | Description |
|
||||
|------------------|------|------------------------------------|
|
||||
| vocab_size | int | Size of the tokenizer's vocabulary |
|
||||
| bos_token_id | int | ID of the beginning-of-sequence token |
|
||||
| eos_token_id | int | ID of the end-of-sequence token |
|
||||
|
||||
### Methods:
|
||||
|
||||
| Method | Returns | Description |
|
||||
|--------------------------------|---------|--------------------------------------------------------------|
|
||||
| encode(s, add_bos=True, **kwargs) | list[int] | Tokenizes a prompt and returns token IDs. |
|
||||
| decode(t, offset=None) | str | Decodes a list of token IDs to a string. |
|
||||
| __call__(s) | list[int] | Tokenize prompts when the instance is called directly. |
|
||||
| indexes_containing_token(token) | list[int] | Returns indexes in the vocabulary that may contain the token. |
|
||||
|
||||
## Usage Examples
|
||||
|
||||
### Tokenizing a Prompt
|
||||
|
||||
```python
|
||||
from swarms.tokenizers import Tokenizer
|
||||
|
||||
tokenizer = Tokenizer("/path/to/tokenizer.model")
|
||||
|
||||
# Tokenize a single prompt string
|
||||
prompt = "Hello, world!"
|
||||
token_ids = tokenizer.encode(prompt)
|
||||
print(token_ids)
|
||||
```
|
||||
|
||||
### Decoding Token IDs
|
||||
|
||||
```python
|
||||
# Decode token IDs back into text
|
||||
decoded_text = tokenizer.decode(token_ids)
|
||||
print(decoded_text)
|
||||
```
|
||||
|
||||
### Incremental Decoding
|
||||
|
||||
```python
|
||||
# Incremental decoding with offset (useful for streaming applications)
|
||||
partial_tokens = [token_ids[0]] # simulate partially received tokens
|
||||
decoded_partial = tokenizer.decode(partial_tokens, offset=0)
|
||||
print(decoded_partial)
|
||||
```
|
||||
|
||||
### Properties Access
|
||||
|
||||
```python
|
||||
# Access vocabulary size and special token IDs
|
||||
print("Vocabulary Size:", tokenizer.vocab_size)
|
||||
print("BOS Token ID:", tokenizer.bos_token_id)
|
||||
print("EOS Token ID:", tokenizer.eos_token_id)
|
||||
```
|
||||
|
||||
### Indexes Containing Token
|
||||
|
||||
```python
|
||||
# Find indexes that may output a specific token during decoding
|
||||
token = "world"
|
||||
indexes = tokenizer.indexes_containing_token(token)
|
||||
print("Token Indexes:", indexes)
|
||||
```
|
@ -0,0 +1,7 @@
|
||||
from swarms.artifacts.base_artifact import BaseArtifact
|
||||
from swarms.artifacts.text_artifact import TextArtifact
|
||||
|
||||
__all__ = [
|
||||
"BaseArtifact",
|
||||
"TextArtifact",
|
||||
]
|
@ -0,0 +1,76 @@
|
||||
from __future__ import annotations
|
||||
from typing import Any
|
||||
import json
|
||||
import uuid
|
||||
from abc import ABC, abstractmethod
|
||||
from dataclasses import dataclass
|
||||
|
||||
|
||||
@dataclass
|
||||
class BaseArtifact(ABC):
|
||||
"""
|
||||
Base class for artifacts.
|
||||
"""
|
||||
|
||||
id: str
|
||||
name: str
|
||||
value: Any
|
||||
|
||||
def __post_init__(self):
|
||||
if self.id is None:
|
||||
self.id = uuid.uuid4().hex
|
||||
if self.name is None:
|
||||
self.name = self.id
|
||||
|
||||
@classmethod
|
||||
def value_to_bytes(cls, value: Any) -> bytes:
|
||||
"""
|
||||
Convert the value to bytes.
|
||||
"""
|
||||
if isinstance(value, bytes):
|
||||
return value
|
||||
else:
|
||||
return str(value).encode()
|
||||
|
||||
@classmethod
|
||||
def value_to_dict(cls, value: Any) -> dict:
|
||||
"""
|
||||
Convert the value to a dictionary.
|
||||
"""
|
||||
if isinstance(value, dict):
|
||||
dict_value = value
|
||||
else:
|
||||
dict_value = json.loads(value)
|
||||
|
||||
return {k: v for k, v in dict_value.items()}
|
||||
|
||||
def to_text(self) -> str:
|
||||
"""
|
||||
Convert the value to text.
|
||||
"""
|
||||
return str(self.value)
|
||||
|
||||
def __str__(self) -> str:
|
||||
"""
|
||||
Return a string representation of the artifact.
|
||||
"""
|
||||
return self.to_text()
|
||||
|
||||
def __bool__(self) -> bool:
|
||||
"""
|
||||
Return the boolean value of the artifact.
|
||||
"""
|
||||
return bool(self.value)
|
||||
|
||||
def __len__(self) -> int:
|
||||
"""
|
||||
Return the length of the artifact.
|
||||
"""
|
||||
return len(self.value)
|
||||
|
||||
@abstractmethod
|
||||
def __add__(self, other: BaseArtifact) -> BaseArtifact:
|
||||
"""
|
||||
Add two artifacts together.
|
||||
"""
|
||||
...
|
@ -0,0 +1,57 @@
|
||||
from __future__ import annotations
|
||||
from typing import Optional
|
||||
from dataclasses import dataclass, field
|
||||
from swarms.artifacts.base_artifact import BaseArtifact
|
||||
from swarms.tokenizers.base_tokenizer import BaseTokenizer
|
||||
|
||||
|
||||
@dataclass
|
||||
class TextArtifact(BaseArtifact):
|
||||
"""
|
||||
Represents a text artifact.
|
||||
|
||||
Attributes:
|
||||
value (str): The text value of the artifact.
|
||||
encoding (str, optional): The encoding of the text (default is "utf-8").
|
||||
encoding_error_handler (str, optional): The error handler for encoding errors (default is "strict").
|
||||
_embedding (list[float]): The embedding of the text artifact (default is an empty list).
|
||||
|
||||
Properties:
|
||||
embedding (Optional[list[float]]): The embedding of the text artifact.
|
||||
|
||||
Methods:
|
||||
__add__(self, other: BaseArtifact) -> TextArtifact: Concatenates the text value of the artifact with another artifact.
|
||||
__bool__(self) -> bool: Checks if the text value of the artifact is non-empty.
|
||||
generate_embedding(self, model) -> Optional[list[float]]: Generates the embedding of the text artifact using the given embedding model (e.g., a `BaseEmbeddingModel`).
|
||||
token_count(self, tokenizer: BaseTokenizer) -> int: Counts the number of tokens in the text artifact using a given tokenizer.
|
||||
to_bytes(self) -> bytes: Converts the text value of the artifact to bytes using the specified encoding and error handler.
|
||||
"""
|
||||
|
||||
value: str
|
||||
encoding: str = "utf-8"
|
||||
encoding_error_handler: str = "strict"
|
||||
_embedding: list[float] = field(default_factory=list)
|
||||
|
||||
@property
|
||||
def embedding(self) -> Optional[list[float]]:
|
||||
return None if len(self._embedding) == 0 else self._embedding
|
||||
|
||||
def __add__(self, other: BaseArtifact) -> TextArtifact:
|
||||
return TextArtifact(self.value + other.value)
|
||||
|
||||
def __bool__(self) -> bool:
|
||||
return bool(self.value.strip())
|
||||
|
||||
def generate_embedding(self, model) -> Optional[list[float]]:
|
||||
self._embedding.clear()
|
||||
self._embedding.extend(model.embed_string(str(self.value)))
|
||||
|
||||
return self.embedding
|
||||
|
||||
def token_count(self, tokenizer: BaseTokenizer) -> int:
|
||||
return tokenizer.count_tokens(str(self.value))
|
||||
|
||||
def to_bytes(self) -> bytes:
|
||||
return self.value.encode(
|
||||
encoding=self.encoding, errors=self.encoding_error_handler
|
||||
)
|
@ -0,0 +1,5 @@
|
||||
from swarms.chunkers.chunk_seperator import ChunkSeparator
|
||||
from swarms.chunkers.base_chunker import BaseChunker
|
||||
from swarms.chunkers.text_chunker import TextChunker
|
||||
|
||||
__all__ = ["ChunkSeparator", "BaseChunker", "TextChunker"]
|
@ -0,0 +1,162 @@
|
||||
from __future__ import annotations
|
||||
from abc import ABC
|
||||
from typing import Optional
|
||||
from dataclasses import dataclass, field
|
||||
from swarms.artifacts.text_artifact import TextArtifact
|
||||
from swarms.chunkers.chunk_seperator import ChunkSeparator
|
||||
from swarms.tokenizers.base_tokenizer import BaseTokenizer
|
||||
from swarms.tokenizers.openai_tokenizers import OpenAITokenizer
|
||||
|
||||
|
||||
@dataclass
|
||||
class BaseChunker(ABC):
|
||||
"""
|
||||
Base class for chunking text into smaller chunks.
|
||||
"""
|
||||
|
||||
DEFAULT_SEPARATORS = [ChunkSeparator(" ")]
|
||||
|
||||
separators: list[ChunkSeparator] = field(
|
||||
default_factory=lambda: BaseChunker.DEFAULT_SEPARATORS
|
||||
)
|
||||
tokenizer: BaseTokenizer = field(
|
||||
default_factory=lambda: OpenAITokenizer(
|
||||
model=OpenAITokenizer.DEFAULT_OPENAI_GPT_3_CHAT_MODEL
|
||||
)
|
||||
)
|
||||
max_tokens: Optional[int] = None

def __post_init__(self):
    # Fall back to the tokenizer's own limit when no explicit value is given.
    if self.max_tokens is None:
        self.max_tokens = self.tokenizer.max_tokens
|
||||
|
||||
def chunk(self, text: TextArtifact | str) -> list[TextArtifact]:
|
||||
"""
|
||||
Chunk the given text into smaller chunks.
|
||||
|
||||
Args:
|
||||
text (TextArtifact | str): The text to be chunked.
|
||||
|
||||
Returns:
|
||||
list[TextArtifact]: The list of chunked text artifacts.
|
||||
"""
|
||||
text = text.value if isinstance(text, TextArtifact) else text
|
||||
|
||||
return [
|
||||
TextArtifact(c) for c in self._chunk_recursively(text)
|
||||
]
|
||||
|
||||
def _chunk_recursively(
|
||||
self,
|
||||
chunk: str,
|
||||
current_separator: Optional[ChunkSeparator] = None,
|
||||
) -> list[str]:
|
||||
"""
|
||||
Recursively chunk the given chunk into smaller subchunks.
|
||||
|
||||
Args:
|
||||
chunk (str): The chunk to be recursively chunked.
|
||||
current_separator (Optional[ChunkSeparator], optional): The current separator to be used. Defaults to None.
|
||||
|
||||
Returns:
|
||||
list[str]: The list of recursively chunked subchunks.
|
||||
"""
|
||||
token_count = self.tokenizer.count_tokens(chunk)
|
||||
|
||||
if token_count <= self.max_tokens:
|
||||
return [chunk]
|
||||
else:
|
||||
balance_index = -1
|
||||
balance_diff = float("inf")
|
||||
tokens_count = 0
|
||||
half_token_count = token_count // 2
|
||||
|
||||
# If a separator is provided, only use separators after it.
|
||||
if current_separator:
|
||||
separators = self.separators[
|
||||
self.separators.index(current_separator) :
|
||||
]
|
||||
else:
|
||||
separators = self.separators
|
||||
|
||||
# Loop through available separators to find the best split.
|
||||
for separator in separators:
|
||||
# Split the chunk into subchunks using the current separator.
|
||||
subchunks = list(
|
||||
filter(None, chunk.split(separator.value))
|
||||
)
|
||||
|
||||
# Check if the split resulted in more than one subchunk.
|
||||
if len(subchunks) > 1:
|
||||
# Iterate through the subchunks and calculate token counts.
|
||||
for index, subchunk in enumerate(subchunks):
|
||||
if index < len(subchunks):
|
||||
if separator.is_prefix:
|
||||
subchunk = separator.value + subchunk
|
||||
else:
|
||||
subchunk = subchunk + separator.value
|
||||
|
||||
tokens_count += self.tokenizer.count_tokens(
|
||||
subchunk
|
||||
)
|
||||
|
||||
# Update the best split if the current one is more balanced.
|
||||
if (
|
||||
abs(tokens_count - half_token_count)
|
||||
< balance_diff
|
||||
):
|
||||
balance_index = index
|
||||
balance_diff = abs(
|
||||
tokens_count - half_token_count
|
||||
)
|
||||
|
||||
# Create the two subchunks based on the best separator.
|
||||
if separator.is_prefix:
|
||||
# If the separator is a prefix, append it before this subchunk.
|
||||
first_subchunk = (
|
||||
separator.value
|
||||
+ separator.value.join(
|
||||
subchunks[: balance_index + 1]
|
||||
)
|
||||
)
|
||||
second_subchunk = (
|
||||
separator.value
|
||||
+ separator.value.join(
|
||||
subchunks[balance_index + 1 :]
|
||||
)
|
||||
)
|
||||
else:
|
||||
# If the separator is not a prefix, append it after this subchunk.
|
||||
first_subchunk = (
|
||||
separator.value.join(
|
||||
subchunks[: balance_index + 1]
|
||||
)
|
||||
+ separator.value
|
||||
)
|
||||
second_subchunk = separator.value.join(
|
||||
subchunks[balance_index + 1 :]
|
||||
)
|
||||
|
||||
# Continue recursively chunking the subchunks.
|
||||
first_subchunk_rec = self._chunk_recursively(
|
||||
first_subchunk.strip(), separator
|
||||
)
|
||||
second_subchunk_rec = self._chunk_recursively(
|
||||
second_subchunk.strip(), separator
|
||||
)
|
||||
|
||||
# Return the concatenated results of the subchunks if both are non-empty.
|
||||
if first_subchunk_rec and second_subchunk_rec:
|
||||
return (
|
||||
first_subchunk_rec + second_subchunk_rec
|
||||
)
|
||||
# If only one subchunk is non-empty, return it.
|
||||
elif first_subchunk_rec:
|
||||
return first_subchunk_rec
|
||||
elif second_subchunk_rec:
|
||||
return second_subchunk_rec
|
||||
else:
|
||||
return []
|
||||
# If none of the separators result in a balanced split, split the chunk in half.
|
||||
midpoint = len(chunk) // 2
|
||||
return self._chunk_recursively(
|
||||
chunk[:midpoint]
|
||||
) + self._chunk_recursively(chunk[midpoint:])
|
@ -0,0 +1,7 @@
|
||||
from dataclasses import dataclass
|
||||
|
||||
|
||||
@dataclass
|
||||
class ChunkSeparator:
|
||||
value: str
|
||||
is_prefix: bool = False
|
@ -0,0 +1,13 @@
|
||||
from swarms.chunkers.base_chunker import BaseChunker
|
||||
from swarms.chunkers.chunk_seperator import ChunkSeparator
|
||||
|
||||
|
||||
class TextChunker(BaseChunker):
|
||||
DEFAULT_SEPARATORS = [
|
||||
ChunkSeparator("\n\n"),
|
||||
ChunkSeparator("\n"),
|
||||
ChunkSeparator(". "),
|
||||
ChunkSeparator("! "),
|
||||
ChunkSeparator("? "),
|
||||
ChunkSeparator(" "),
|
||||
]
|
@ -0,0 +1,77 @@
|
||||
from __future__ import annotations
|
||||
|
||||
from dataclasses import dataclass
|
||||
from pathlib import Path
|
||||
from typing import IO, Dict, List, Optional
|
||||
|
||||
from PyPDF2 import PdfReader
|
||||
|
||||
from swarms.utils.hash import str_to_hash
|
||||
|
||||
|
||||
@dataclass
|
||||
class TextArtifact:
|
||||
text: str
|
||||
|
||||
|
||||
@dataclass
|
||||
class PDFLoader:
|
||||
"""
|
||||
A class for loading PDF files and extracting text artifacts.
|
||||
|
||||
Args:
|
||||
tokenizer (str): The tokenizer to use for chunking the text.
|
||||
max_tokens (int): The maximum number of tokens per chunk.
|
||||
|
||||
Methods:
|
||||
load(source, password=None, *args, **kwargs):
|
||||
Load a single PDF file and extract text artifacts.
|
||||
|
||||
load_collection(sources, password=None, *args, **kwargs):
|
||||
Load a collection of PDF files and extract text artifacts.
|
||||
|
||||
Private Methods:
|
||||
_load_pdf(stream, password=None):
|
||||
Load a PDF file and extract text artifacts.
|
||||
|
||||
Attributes:
|
||||
tokenizer (str): The tokenizer used for chunking the text.
|
||||
max_tokens (int): The maximum number of tokens per chunk.
|
||||
"""
|
||||
|
||||
tokenizer: str
|
||||
max_tokens: int
|
||||
|
||||
def __post_init__(self):
|
||||
self.chunker = PdfChunker(
|
||||
tokenizer=self.tokenizer, max_tokens=self.max_tokens
|
||||
)
|
||||
|
||||
def load(
|
||||
self,
|
||||
source: str | IO | Path,
|
||||
password: Optional[str] = None,
|
||||
*args,
|
||||
**kwargs,
|
||||
) -> List[TextArtifact]:
|
||||
return self._load_pdf(source, password)
|
||||
|
||||
def load_collection(
|
||||
self,
|
||||
sources: List[str | IO | Path],
|
||||
password: Optional[str] = None,
|
||||
*args,
|
||||
**kwargs,
|
||||
) -> Dict[str, List[TextArtifact]]:
|
||||
return {
|
||||
str_to_hash(str(s)): self._load_pdf(s, password)
|
||||
for s in sources
|
||||
}
|
||||
|
||||
def _load_pdf(
|
||||
self, stream: str | IO | Path, password: Optional[str]
|
||||
) -> List[TextArtifact]:
|
||||
reader = PdfReader(stream, strict=True, password=password)
|
||||
return [
|
||||
TextArtifact(text=p.extract_text()) for p in reader.pages
|
||||
]
|
@ -0,0 +1,16 @@
|
||||
from dataclasses import dataclass
|
||||
|
||||
|
||||
@dataclass
|
||||
class ActionSubtaskEntry:
|
||||
"""Used to store ActionSubtask data to preserve TaskMemory pointers and context in the form of thought and action.
|
||||
|
||||
Attributes:
|
||||
thought: CoT thought string from the LLM.
|
||||
action: ReAct action JSON string from the LLM.
|
||||
answer: tool-generated and memory-processed response from Griptape.
|
||||
"""
|
||||
|
||||
thought: str
|
||||
action: str
|
||||
answer: str
|
@ -0,0 +1,80 @@
|
||||
from __future__ import annotations
|
||||
|
||||
from abc import ABC, abstractmethod
|
||||
from dataclasses import dataclass, field
|
||||
from typing import Optional
|
||||
|
||||
import numpy as np
|
||||
from swarms.chunkers.base_chunker import BaseChunker
|
||||
from swarms.chunkers.text_chunker import TextChunker
|
||||
from swarms.utils.exponential_backoff import ExponentialBackoffMixin
|
||||
|
||||
from swarms.artifacts.text_artifact import TextArtifact
|
||||
from swarms.tokenizers.base_tokenizer import BaseTokenizer
|
||||
|
||||
|
||||
@dataclass
|
||||
class BaseEmbeddingModel(
|
||||
ExponentialBackoffMixin,
|
||||
ABC,
|
||||
# SerializableMixin
|
||||
):
|
||||
"""
|
||||
Attributes:
|
||||
model: The name of the model to use.
|
||||
tokenizer: An instance of `BaseTokenizer` to use when calculating tokens.
|
||||
"""
|
||||
|
||||
model: str = None
|
||||
tokenizer: Optional[BaseTokenizer] = None
|
||||
chunker: BaseChunker = field(init=False)
|
||||
|
||||
def __post_init__(self) -> None:
|
||||
if self.tokenizer:
|
||||
self.chunker = TextChunker(tokenizer=self.tokenizer)
|
||||
|
||||
def embed_text_artifact(
|
||||
self, artifact: TextArtifact
|
||||
) -> list[float]:
|
||||
return self.embed_string(artifact.to_text())
|
||||
|
||||
def embed_string(self, string: str) -> list[float]:
|
||||
for attempt in self.retrying():
|
||||
with attempt:
|
||||
if (
|
||||
self.tokenizer
|
||||
and self.tokenizer.count_tokens(string)
|
||||
> self.tokenizer.max_tokens
|
||||
):
|
||||
return self._embed_long_string(string)
|
||||
else:
|
||||
return self.try_embed_chunk(string)
|
||||
|
||||
else:
|
||||
raise RuntimeError("Failed to embed string.")
|
||||
|
||||
@abstractmethod
|
||||
def try_embed_chunk(self, chunk: str) -> list[float]:
|
||||
...
|
||||
|
||||
def _embed_long_string(self, string: str) -> list[float]:
|
||||
"""Embeds a string that is too long to embed in one go."""
|
||||
chunks = self.chunker.chunk(string)
|
||||
|
||||
embedding_chunks = []
|
||||
length_chunks = []
|
||||
for chunk in chunks:
|
||||
embedding_chunks.append(self.try_embed_chunk(chunk.value))
|
||||
length_chunks.append(len(chunk))
|
||||
|
||||
# generate weighted averages
|
||||
embedding_chunks = np.average(
|
||||
embedding_chunks, axis=0, weights=length_chunks
|
||||
)
|
||||
|
||||
# normalize length to 1
|
||||
embedding_chunks = embedding_chunks / np.linalg.norm(
|
||||
embedding_chunks
|
||||
)
|
||||
|
||||
return embedding_chunks.tolist()
|
@ -0,0 +1,96 @@
|
||||
from __future__ import annotations
|
||||
|
||||
from dataclasses import dataclass
|
||||
from importlib import import_module
|
||||
from types import ModuleType
|
||||
from typing import Optional
|
||||
|
||||
from anthropic import Anthropic
|
||||
|
||||
from swarms.tokenizers.base_tokenizer import BaseTokenizer
|
||||
|
||||
INSTALL_MAPPING = {
|
||||
"huggingface_hub": "huggingface-hub",
|
||||
"pinecone": "pinecone-client",
|
||||
"opensearchpy": "opensearch-py",
|
||||
}
|
||||
|
||||
|
||||
def import_optional_dependency(name: str) -> Optional[ModuleType]:
|
||||
"""Import an optional dependency.
|
||||
|
||||
If a dependency is missing, an ImportError with a nice message will be raised.
|
||||
|
||||
Args:
|
||||
name: The module name.
|
||||
Returns:
|
||||
The imported module, when found.
|
||||
An ImportError with an installation hint is raised when the package is not installed.
|
||||
"""
|
||||
|
||||
package_name = INSTALL_MAPPING.get(name)
|
||||
install_name = package_name if package_name is not None else name
|
||||
|
||||
msg = (
|
||||
f"Missing optional dependency: '{install_name}'. "
|
||||
f"Use poetry or pip to install '{install_name}'."
|
||||
)
|
||||
try:
|
||||
module = import_module(name)
|
||||
except ImportError:
|
||||
raise ImportError(msg)
|
||||
|
||||
return module
|
||||
|
||||
|
||||
@dataclass
|
||||
class AnthropicTokenizer(BaseTokenizer):
|
||||
"""
|
||||
Tokenizer class for Anthropic models.
|
||||
"""
|
||||
|
||||
max_tokens: int = 500
|
||||
client: Anthropic = None
|
||||
model: str = "claude-2.1"
|
||||
|
||||
def __post_init__(self):
|
||||
self.DEFAULT_MODEL: str = "claude-2.1"
|
||||
self.MODEL_PREFIXES_TO_MAX_TOKENS: dict[str, int] = {
|
||||
"claude-2.1": 200000,
|
||||
"claude": 100000,
|
||||
}
|
||||
self.model = self.model or self.DEFAULT_MODEL
|
||||
self.max_tokens = self.max_tokens or self.default_max_tokens()
|
||||
self.client = (
|
||||
self.client
|
||||
or import_optional_dependency("anthropic").Anthropic()
|
||||
)
|
||||
|
||||
def default_max_tokens(self) -> int:
|
||||
"""
|
||||
Returns the default maximum number of tokens based on the model prefix.
|
||||
"""
|
||||
tokens = next(
|
||||
v
|
||||
for k, v in self.MODEL_PREFIXES_TO_MAX_TOKENS.items()
|
||||
if self.model.startswith(k)
|
||||
)
|
||||
return tokens
|
||||
|
||||
def count_tokens(self, text: str | list) -> int:
|
||||
"""
|
||||
Counts the number of tokens in the given text.
|
||||
|
||||
Args:
|
||||
text: The input text.
|
||||
|
||||
Returns:
|
||||
The number of tokens in the text.
|
||||
|
||||
Raises:
|
||||
ValueError: If the input text is not a string.
|
||||
"""
|
||||
if isinstance(text, str):
|
||||
return self.client.count_tokens(text)
|
||||
else:
|
||||
raise ValueError("Text must be a string.")
|
@ -0,0 +1,56 @@
|
||||
from __future__ import annotations
|
||||
|
||||
from abc import ABC, abstractmethod
|
||||
from dataclasses import dataclass, field
|
||||
from typing import List, Union
|
||||
|
||||
|
||||
@dataclass
|
||||
class BaseTokenizer(ABC):
|
||||
"""
|
||||
Base class for tokenizers.
|
||||
|
||||
Attributes:
|
||||
stop_sequences (List[str]): List of stop sequences.
|
||||
max_tokens (int): Maximum number of tokens.
|
||||
stop_token (str): Stop token.
|
||||
"""
|
||||
|
||||
max_tokens: int
|
||||
stop_token: str = "<|Response|>"
|
||||
|
||||
def __post_init__(self):
|
||||
    self.stop_sequences: List[str] = ["<|Response|>"]
|
||||
|
||||
def count_tokens_left(self, text: Union[str, List[dict]]) -> int:
|
||||
"""
|
||||
Counts the number of tokens left based on the given text.
|
||||
|
||||
Args:
|
||||
text (Union[str, List[dict]]): The text to count tokens from.
|
||||
|
||||
Returns:
|
||||
int: The number of tokens left.
|
||||
"""
|
||||
diff = self.max_tokens - self.count_tokens(text)
|
||||
|
||||
if diff > 0:
|
||||
return diff
|
||||
else:
|
||||
return 0
|
||||
|
||||
@abstractmethod
|
||||
def count_tokens(self, text: Union[str, List[dict]]) -> int:
|
||||
"""
|
||||
Counts the number of tokens in the given text.
|
||||
|
||||
Args:
|
||||
text (Union[str, List[dict]]): The text to count tokens from.
|
||||
|
||||
Returns:
|
||||
int: The number of tokens.
|
||||
"""
|
||||
...
|
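A sketch of how a concrete subclass plugs into `BaseTokenizer`: the whitespace tokenizer below is a hypothetical stand-in, not part of the library; only `count_tokens` needs to be implemented and `count_tokens_left` then comes from the base class.

```python
# Hypothetical subclass used only to illustrate the contract.
from typing import List, Union

from swarms.tokenizers.base_tokenizer import BaseTokenizer


class WhitespaceTokenizer(BaseTokenizer):
    def count_tokens(self, text: Union[str, List[dict]]) -> int:
        if isinstance(text, str):
            return len(text.split())
        # For chat-style input, count whitespace tokens over every message value.
        return sum(len(str(v).split()) for m in text for v in m.values())


tok = WhitespaceTokenizer(max_tokens=10)
print(tok.count_tokens("one two three"))       # 3
print(tok.count_tokens_left("one two three"))  # 7
```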
@ -0,0 +1,35 @@
from __future__ import annotations

from dataclasses import dataclass

from cohere import Client


@dataclass
class CohereTokenizer:
    """
    A tokenizer class for Cohere models.
    """

    model: str
    client: Client
    DEFAULT_MODEL: str = "command"
    DEFAULT_MAX_TOKENS: int = 2048
    max_tokens: int = DEFAULT_MAX_TOKENS

    def count_tokens(self, text: str | list) -> int:
        """
        Count the number of tokens in the given text.

        Args:
            text (str | list): The input text to tokenize.

        Returns:
            int: The number of tokens in the text.

        Raises:
            ValueError: If the input text is not a string.
        """
        if isinstance(text, str):
            return len(self.client.tokenize(text=text).tokens)
        else:
            raise ValueError("Text must be a string.")
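A usage sketch, assuming a `cohere.Client` constructed with a valid API key stored in `COHERE_API_KEY` (the environment variable name is an assumption, and the printed count depends on the Cohere API).

```python
# Usage sketch for CohereTokenizer; requires the `cohere` package and a key.
import os

import cohere

from swarms.tokenizers.cohere_tokenizer import CohereTokenizer

client = cohere.Client(os.environ["COHERE_API_KEY"])
tokenizer = CohereTokenizer(model="command", client=client)

# Counting delegates to client.tokenize(), exactly as in the class above.
print(tokenizer.count_tokens("Hello, Cohere!"))
```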
@ -0,0 +1,179 @@
from __future__ import annotations

import logging
from dataclasses import dataclass
from typing import Optional

import tiktoken
from tiktoken import Encoding

from swarms.tokenizers.base_tokenizer import BaseTokenizer


@dataclass
class OpenAITokenizer(BaseTokenizer):
    """
    A class representing an OpenAI tokenizer.

    Attributes:
    - DEFAULT_OPENAI_GPT_3_COMPLETION_MODEL (str): The default OpenAI GPT-3 completion model.
    - DEFAULT_OPENAI_GPT_3_CHAT_MODEL (str): The default OpenAI GPT-3 chat model.
    - DEFAULT_OPENAI_GPT_4_MODEL (str): The default OpenAI GPT-4 model.
    - DEFAULT_ENCODING (str): The default encoding.
    - DEFAULT_MAX_TOKENS (int): The default maximum number of tokens.
    - TOKEN_OFFSET (int): The token offset.
    - MODEL_PREFIXES_TO_MAX_TOKENS (dict): A dictionary mapping model prefixes to maximum tokens.
    - EMBEDDING_MODELS (list): A list of embedding models.
    - model (str): The model name.

    Methods:
    - __post_init__(): Initializes the OpenAITokenizer object.
    - encoding(): Returns the encoding for the model.
    - default_max_tokens(): Returns the default maximum number of tokens.
    - count_tokens(text, model): Counts the number of tokens in the given text.
    - len(text, model): Returns the length of the text in tokens.
    """

    model: str = "gpt-2"

    def __post_init__(self):
        """
        Initializes the OpenAITokenizer object.
        Sets the model constants and the default maximum number of tokens.
        """
        self.DEFAULT_OPENAI_GPT_3_COMPLETION_MODEL = "text-davinci-003"
        self.DEFAULT_OPENAI_GPT_3_CHAT_MODEL = "gpt-3.5-turbo"
        self.DEFAULT_OPENAI_GPT_4_MODEL = "gpt-4"
        self.DEFAULT_ENCODING = "cl100k_base"
        self.DEFAULT_MAX_TOKENS = 2049
        self.TOKEN_OFFSET = 8

        self.MODEL_PREFIXES_TO_MAX_TOKENS = {
            "gpt-4-1106": 128000,
            "gpt-4-32k": 32768,
            "gpt-4": 8192,
            "gpt-3.5-turbo-16k": 16384,
            "gpt-3.5-turbo": 4096,
            "gpt-35-turbo-16k": 16384,
            "gpt-35-turbo": 4096,
            "text-davinci-003": 4097,
            "text-davinci-002": 4097,
            "code-davinci-002": 8001,
            "text-embedding-ada-002": 8191,
            "text-embedding-ada-001": 2046,
        }

        self.EMBEDDING_MODELS = [
            "text-embedding-ada-002",
            "text-embedding-ada-001",
        ]

        # `dataclasses.field()` has no effect inside __post_init__, so the
        # default token budget is resolved directly once the model tables
        # above are available.
        self.max_tokens = self.max_tokens or self.default_max_tokens()

    @property
    def encoding(self) -> Encoding:
        """
        Returns the encoding for the model.
        If the model is not found, returns the default encoding.
        """
        try:
            return tiktoken.encoding_for_model(self.model)
        except KeyError:
            return tiktoken.get_encoding(self.DEFAULT_ENCODING)

    def default_max_tokens(self) -> int:
        """
        Returns the default maximum number of tokens based on the model.
        Falls back to DEFAULT_MAX_TOKENS when no prefix matches.
        """
        tokens = next(
            (
                v
                for k, v in self.MODEL_PREFIXES_TO_MAX_TOKENS.items()
                if self.model.startswith(k)
            ),
            None,
        )
        offset = (
            0
            if self.model in self.EMBEDDING_MODELS
            else self.TOKEN_OFFSET
        )

        return (
            tokens if tokens else self.DEFAULT_MAX_TOKENS
        ) - offset

    def count_tokens(
        self, text: str | list[dict], model: Optional[str] = None
    ) -> int:
        """
        Counts the number of tokens in the given text.
        If the text is a list of messages, counts the tokens for each message.
        If a model is provided, uses that model for encoding.
        """
        if isinstance(text, list):
            model = model if model else self.model

            try:
                encoding = tiktoken.encoding_for_model(model)
            except KeyError:
                logging.warning(
                    "model not found. Using cl100k_base encoding."
                )
                encoding = tiktoken.get_encoding("cl100k_base")

            if model in {
                "gpt-3.5-turbo-0613",
                "gpt-3.5-turbo-16k-0613",
                "gpt-4-0314",
                "gpt-4-32k-0314",
                "gpt-4-0613",
                "gpt-4-32k-0613",
            }:
                tokens_per_message = 3
                tokens_per_name = 1
            elif model == "gpt-3.5-turbo-0301":
                tokens_per_message = 4
                tokens_per_name = -1
            elif "gpt-3.5-turbo" in model or "gpt-35-turbo" in model:
                logging.info(
                    "gpt-3.5-turbo may update over time. Returning"
                    " num tokens assuming gpt-3.5-turbo-0613."
                )
                return self.count_tokens(
                    text, model="gpt-3.5-turbo-0613"
                )
            elif "gpt-4" in model:
                logging.info(
                    "gpt-4 may update over time. Returning num tokens"
                    " assuming gpt-4-0613."
                )
                return self.count_tokens(text, model="gpt-4-0613")
            else:
                raise NotImplementedError(
                    "count_tokens() is not implemented for model"
                    f" {model}. See"
                    " https://github.com/openai/openai-python/blob/main/chatml.md"
                    " for information on how messages are converted"
                    " to tokens."
                )

            num_tokens = 0

            for message in text:
                num_tokens += tokens_per_message
                for key, value in message.items():
                    num_tokens += len(encoding.encode(value))
                    if key == "name":
                        num_tokens += tokens_per_name

            # Every reply is primed with <|start|>assistant<|message|>.
            num_tokens += 3

            return num_tokens
        else:
            return len(self.encoding.encode(text))

    def len(
        self, text: str | list[dict], model: Optional[str] = None
    ) -> int:
        """
        Returns the length of the text in tokens.
        If a model is provided, uses that model for encoding.
        """
        return self.count_tokens(text, model)
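Because `OpenAITokenizer` counts locally through `tiktoken`, it can be exercised offline. A brief sketch follows; the module path is taken from the test file later in this commit, and exact counts depend on the installed tiktoken encodings.

```python
# Offline usage sketch for OpenAITokenizer; no API key is required.
from swarms.tokenizers.openai_tokenizers import OpenAITokenizer

# max_tokens=0 is falsy, so __post_init__ falls back to default_max_tokens().
tokenizer = OpenAITokenizer(max_tokens=0, model="gpt-4")

# Plain strings are encoded directly with the model's encoding.
print(tokenizer.count_tokens("Hello, world!"))

# Chat-style input follows the per-message accounting rules above.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Count my tokens, please."},
]
print(tokenizer.count_tokens(messages, model="gpt-4-0613"))
```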
@ -0,0 +1,46 @@
import logging
from abc import ABC
from dataclasses import dataclass

from tenacity import Retrying, wait_exponential, stop_after_attempt


@dataclass
class ExponentialBackoffMixin(ABC):
    """
    A mixin class that provides exponential backoff functionality.
    """

    min_retry_delay: float = 2
    """The minimum delay between retries in seconds."""

    max_retry_delay: float = 10
    """The maximum delay between retries in seconds."""

    max_attempts: int = 10
    """The maximum number of retry attempts."""

    def after_hook(self, retry_state) -> None:
        """A callable that is executed after each retry attempt."""
        logging.warning(retry_state)

    def retrying(self) -> Retrying:
        """
        Returns a Retrying object configured with the exponential backoff settings.
        """
        return Retrying(
            wait=wait_exponential(
                min=self.min_retry_delay, max=self.max_retry_delay
            ),
            stop=stop_after_attempt(self.max_attempts),
            reraise=True,
            after=self.after_hook,
        )
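A sketch of how the mixin is meant to be consumed, using tenacity's attempt-iterator pattern. The import path and the `FlakyClient` class are assumptions for illustration only; this diff does not show where the mixin module lives.

```python
# Illustrative consumer of ExponentialBackoffMixin; only retrying() matters here.
from dataclasses import dataclass

from exponential_backoff import ExponentialBackoffMixin  # hypothetical path


@dataclass
class FlakyClient(ExponentialBackoffMixin):
    failures_left: int = 2

    def fetch(self) -> str:
        # Re-run the body with exponential backoff until it stops raising,
        # or re-raise the last error once max_attempts is exhausted.
        for attempt in self.retrying():
            with attempt:
                if self.failures_left > 0:
                    self.failures_left -= 1
                    raise ConnectionError("transient failure")
                return "ok"


# Small delays keep the demo quick; the defaults wait 2-10 seconds per retry.
print(FlakyClient(min_retry_delay=0.1, max_retry_delay=0.5).fetch())
```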
@ -0,0 +1,39 @@
# AnthropicTokenizer

import pytest

from swarms.tokenizers.anthropic_tokenizer import AnthropicTokenizer


def test_post_init():
    tokenizer = AnthropicTokenizer()
    assert tokenizer.model == "claude-2.1"
    assert tokenizer.max_tokens == 200000


def test_default_max_tokens():
    tokenizer = AnthropicTokenizer(model="claude")
    assert tokenizer.default_max_tokens() == 100000


@pytest.mark.parametrize(
    "model,tokens", [("claude-2.1", 200000), ("claude", 100000)]
)
def test_default_max_tokens_models(model, tokens):
    tokenizer = AnthropicTokenizer(model=model)
    assert tokenizer.default_max_tokens() == tokens


def test_count_tokens_string():
    # Insert mock instantiation of anthropic client and its count_tokens function
    text = "This is a test string."
    tokenizer = AnthropicTokenizer()
    tokens = tokenizer.count_tokens(text)
    assert tokens == 5


def test_count_tokens_list():
    # Insert mock instantiation of anthropic client and its count_tokens function
    text = ["This", "is", "a", "test", "string."]
    tokenizer = AnthropicTokenizer()
    with pytest.raises(ValueError):
        tokenizer.count_tokens(text)
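The two `count_tokens` tests above still rely on the mock their placeholder comments ask for. One way that wiring might look is sketched below; the fixture name is an assumption and it still requires the `anthropic` package to be importable.

```python
# Possible mock wiring for the count_tokens tests above (a sketch only).
from unittest.mock import MagicMock

import pytest

from swarms.tokenizers.anthropic_tokenizer import AnthropicTokenizer


@pytest.fixture
def mocked_tokenizer():
    client = MagicMock()
    client.count_tokens.return_value = 5  # deterministic, no API call
    return AnthropicTokenizer(client=client)


def test_count_tokens_string_mocked(mocked_tokenizer):
    assert mocked_tokenizer.count_tokens("This is a test string.") == 5
```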
@ -0,0 +1,45 @@
# BaseTokenizer

import pytest

from swarms.tokenizers.base_tokenizer import BaseTokenizer


# 1. Fixture for BaseTokenizer instance.
@pytest.fixture
def base_tokenizer():
    return BaseTokenizer(max_tokens=100)


# 2. Tests for __post_init__.
def test_post_init(base_tokenizer):
    assert base_tokenizer.stop_sequences == ["<|Response|>"]
    assert base_tokenizer.stop_token == "<|Response|>"


# 3. Tests for count_tokens_left with different inputs.
def test_count_tokens_left_with_positive_diff(
    base_tokenizer, monkeypatch
):
    # Mocking count_tokens to return a specific value
    monkeypatch.setattr(
        "swarms.tokenizers.BaseTokenizer.count_tokens",
        lambda x, y: 50,
    )
    assert base_tokenizer.count_tokens_left("some text") == 50


def test_count_tokens_left_with_zero_diff(
    base_tokenizer, monkeypatch
):
    monkeypatch.setattr(
        "swarms.tokenizers.BaseTokenizer.count_tokens",
        lambda x, y: 100,
    )
    assert base_tokenizer.count_tokens_left("some text") == 0


# 4. Add tests for count_tokens. This method is an abstract one, so testing it
# will be dependent on the actual implementation in the subclass. Here is just
# a general idea how to test it (we assume that test_count_tokens is implemented in some subclass).
def test_count_tokens(subclass_tokenizer_instance):
    assert subclass_tokenizer_instance.count_tokens("some text") == 6
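The last test above presumes a `subclass_tokenizer_instance` fixture exists somewhere. A hedged sketch of what such a fixture could look like, built on a hypothetical concrete subclass:

```python
# A possible `subclass_tokenizer_instance` fixture; the subclass below is
# hypothetical and exists only to make the abstract count_tokens testable.
import pytest

from swarms.tokenizers.base_tokenizer import BaseTokenizer


class _SixTokenTokenizer(BaseTokenizer):
    def count_tokens(self, text):
        return 6  # fixed count, matching the expectation in test_count_tokens


@pytest.fixture
def subclass_tokenizer_instance():
    return _SixTokenTokenizer(max_tokens=100)
```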
@ -0,0 +1,35 @@
# CohereTokenizer

from unittest.mock import MagicMock

import pytest

from swarms.tokenizers.cohere_tokenizer import CohereTokenizer


@pytest.fixture
def cohere_tokenizer():
    mock_client = MagicMock()
    mock_client.tokenize.return_value.tokens = [
        "token1",
        "token2",
        "token3",
    ]
    return CohereTokenizer(model="<model-name>", client=mock_client)


def test_count_tokens_with_string(cohere_tokenizer):
    tokens_count = cohere_tokenizer.count_tokens("valid string")
    assert tokens_count == 3


def test_count_tokens_with_non_string(cohere_tokenizer):
    with pytest.raises(ValueError):
        cohere_tokenizer.count_tokens(["invalid", "input"])


def test_count_tokens_with_different_length(cohere_tokenizer):
    cohere_tokenizer.client.tokenize.return_value.tokens = [
        "token1",
        "token2",
    ]
    tokens_count = cohere_tokenizer.count_tokens("valid string")
    assert tokens_count == 2
@ -0,0 +1,66 @@
# HuggingFaceTokenizer

import os
from unittest.mock import patch

import pytest

from swarms.tokenizers.r_tokenizers import HuggingFaceTokenizer


# Test class setup
@pytest.fixture
def hftokenizer():
    dir_path = os.path.join(os.getcwd(), "modeldir")
    tokenizer = HuggingFaceTokenizer(dir_path)
    return tokenizer


# testing __init__
@patch("os.path")
@patch("swarms.tokenizers.get_logger")
def test___init__(mock_get_logger, mock_path, hftokenizer):
    mock_path.exists.return_value = False
    mock_path.join.return_value = "dummy_path"
    mock_get_logger.return_value = "dummy_logger"
    assert hftokenizer.model_dir == "dummy_path"
    assert hftokenizer.logger == "dummy_logger"
    assert hftokenizer._maybe_decode_bytes is False
    assert hftokenizer._prefix_space_tokens is None


# testing vocab_size property
def test_vocab_size(hftokenizer):
    assert hftokenizer.vocab_size == 30522


# testing bos_token_id property
def test_bos_token_id(hftokenizer):
    assert hftokenizer.bos_token_id == 101


# testing eos_token_id property
def test_eos_token_id(hftokenizer):
    assert hftokenizer.eos_token_id == 102


# testing prefix_space_tokens property
def test_prefix_space_tokens(hftokenizer):
    assert len(hftokenizer.prefix_space_tokens) > 0


# testing _maybe_add_prefix_space method
def test__maybe_add_prefix_space(hftokenizer):
    assert (
        hftokenizer._maybe_add_prefix_space(
            [101, 2003, 2010, 2050, 2001, 2339], " is why"
        )
        == " is why"
    )
    assert (
        hftokenizer._maybe_add_prefix_space(
            [2003, 2010, 2050, 2001, 2339], "is why"
        )
        == " is why"
    )


# continuing tests for other methods...
@ -0,0 +1,47 @@
# OpenAITokenizer

import pytest

import swarms.tokenizers.openai_tokenizers as tokenizers


@pytest.fixture()
def openai_tokenizer():
    return tokenizers.OpenAITokenizer("gpt-3")


def test_init(openai_tokenizer):
    assert openai_tokenizer.model == "gpt-3"


def test_default_max_tokens(openai_tokenizer):
    assert openai_tokenizer.default_max_tokens() == 4096


@pytest.mark.parametrize(
    "text, expected_output", [("Hello, world!", 3), (["Hello"], 4)]
)
def test_count_tokens_single(openai_tokenizer, text, expected_output):
    assert (
        openai_tokenizer.count_tokens(text, "gpt-3")
        == expected_output
    )


@pytest.mark.parametrize(
    "texts, expected_output",
    [(["Hello, world!", "This is a test"], 6), (["Hello"], 4)],
)
def test_count_tokens_multiple(
    openai_tokenizer, texts, expected_output
):
    assert (
        openai_tokenizer.count_tokens(texts, "gpt-3")
        == expected_output
    )


@pytest.mark.parametrize(
    "text, expected_output", [("Hello, world!", 3), (["Hello"], 4)]
)
def test_len(openai_tokenizer, text, expected_output):
    assert openai_tokenizer.len(text, "gpt-3") == expected_output
@ -0,0 +1 @@
# SentencePieceTokenizer
@ -0,0 +1,81 @@
# Tokenizer

from unittest.mock import patch

from swarms.tokenizers.r_tokenizers import Tokenizer


def test_initializer_existing_model_file():
    with patch("os.path.exists", return_value=True):
        with patch(
            "swarms.tokenizers.SentencePieceTokenizer"
        ) as mock_model:
            tokenizer = Tokenizer("tokenizers/my_model.model")
            mock_model.assert_called_with("tokenizers/my_model.model")
            assert tokenizer.model == mock_model.return_value


def test_initializer_model_folder():
    with patch("os.path.exists", side_effect=[False, True]):
        with patch(
            "swarms.tokenizers.HuggingFaceTokenizer"
        ) as mock_model:
            tokenizer = Tokenizer("my_model_directory")
            mock_model.assert_called_with("my_model_directory")
            assert tokenizer.model == mock_model.return_value


def test_vocab_size():
    with patch(
        "swarms.tokenizers.SentencePieceTokenizer"
    ) as mock_model:
        tokenizer = Tokenizer("tokenizers/my_model.model")
        assert (
            tokenizer.vocab_size == mock_model.return_value.vocab_size
        )


def test_bos_token_id():
    with patch(
        "swarms.tokenizers.SentencePieceTokenizer"
    ) as mock_model:
        tokenizer = Tokenizer("tokenizers/my_model.model")
        assert (
            tokenizer.bos_token_id
            == mock_model.return_value.bos_token_id
        )


def test_encode():
    with patch(
        "swarms.tokenizers.SentencePieceTokenizer"
    ) as mock_model:
        tokenizer = Tokenizer("tokenizers/my_model.model")
        assert (
            tokenizer.encode("hello")
            == mock_model.return_value.encode.return_value
        )


def test_decode():
    with patch(
        "swarms.tokenizers.SentencePieceTokenizer"
    ) as mock_model:
        tokenizer = Tokenizer("tokenizers/my_model.model")
        assert (
            tokenizer.decode([1, 2, 3])
            == mock_model.return_value.decode.return_value
        )


def test_call():
    with patch(
        "swarms.tokenizers.SentencePieceTokenizer"
    ) as mock_model:
        tokenizer = Tokenizer("tokenizers/my_model.model")
        assert (
            tokenizer("hello")
            == mock_model.return_value.__call__.return_value
        )


# More tests can be added here