parent
0d73d6dd99
commit
2b13e86cf8
@ -1,123 +0,0 @@
|
|||||||
# AnthropicTokenizer Documentation
|
|
||||||
|
|
||||||
## Introduction
|
|
||||||
|
|
||||||
This document is a guide to the `AnthropicTokenizer` class in the `swarms.tokenizers` library. The class is designed to interface with Anthropic's AI models and is used primarily for text tokenization and metadata handling.
|
|
||||||
|
|
||||||
Understanding how to use this tokenizer effectively is crucial for developers and researchers working with natural language processing, machine learning, and text analysis using Anthropic AI models.
|
|
||||||
|
|
||||||
The purpose of the `AnthropicTokenizer` is to convert raw text into a sequence of tokens that can be fed into Anthropic AI models for various tasks. Tokenization is a fundamental step in text processing pipelines and affects the performance of AI models.
|
|
||||||
|
|
||||||
## Class Definition: AnthropicTokenizer
|
|
||||||
|
|
||||||
`AnthropicTokenizer` extends a base tokenizer with features specific to Anthropic AI models. It counts tokens and checks the count against a configurable limit, so that inputs stay within what the model accepts.
|
|
||||||
|
|
||||||
**Class Signature:**
|
|
||||||
|
|
||||||
**Parameters:**
|
|
||||||
|
|
||||||
| Parameter Name | Type        | Description                                                   | Default Value  |
|----------------|-------------|---------------------------------------------------------------|----------------|
| `max_tokens`   | `int`       | Maximum number of tokens permitted.                           | `500`          |
| `client`       | `Anthropic` | Instance of an `Anthropic` client for tokenization services.  | `None`         |
| `model`        | `str`       | Identifier for the Anthropic model in use.                    | `"claude-2.1"` |
|
|
||||||
|
|
||||||
**Methods and their descriptions:**
|
|
||||||
|
|
||||||
| Method Name          | Return Type | Description                                                                           |
|----------------------|-------------|---------------------------------------------------------------------------------------|
| `__post_init__`      | `None`      | Initializes default parameters and client instance.                                   |
| `default_max_tokens` | `int`       | Returns the default maximum number of tokens.                                         |
| `count_tokens`       | `int`       | Counts tokens in the input text. Raises a `ValueError` if the input is not a string.  |
|
|
||||||
|
|
||||||
## Architecture and Mechanics
|
|
||||||
|
|
||||||
Upon instantiation, `AnthropicTokenizer` initializes its `max_tokens` limit and sets up a client to interact with the Anthropic services. The client is responsible for providing tokenization functions critical for processing the text inputs.
|
|
||||||
|
|
||||||
The tokenizer employs a dictionary to map specific model prefixes to their maximum token counts. This allows users to adapt the tokenizer's behavior to different models with varying token limits. The `default_max_tokens()` method dynamically retrieves the token limit based on the provided model name, ensuring compatibility and flexibility.
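
The prefix lookup described above can be sketched as follows. This is a minimal illustration rather than the library's code: the dictionary contents, the fallback value, and the helper name `lookup_max_tokens` are assumptions chosen for the example.

```python
# Hypothetical prefix table; the real limits live inside AnthropicTokenizer.
MODEL_PREFIXES_TO_MAX_TOKENS = {
    "claude-2.1": 200_000,
    "claude": 100_000,
}
FALLBACK_MAX_TOKENS = 100_000  # assumed default when no prefix matches


def lookup_max_tokens(model: str) -> int:
    """Return the limit of the longest prefix that matches the model name."""
    matches = [p for p in MODEL_PREFIXES_TO_MAX_TOKENS if model.startswith(p)]
    if not matches:
        return FALLBACK_MAX_TOKENS
    return MODEL_PREFIXES_TO_MAX_TOKENS[max(matches, key=len)]


print(lookup_max_tokens("claude-2.1"))        # 200000
print(lookup_max_tokens("claude-instant-1"))  # 100000
```

Preferring the longest matching prefix keeps a specific entry such as `claude-2.1` from being shadowed by the more general `claude` entry.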
|
|
||||||
|
|
||||||
`count_tokens()` is a critical function that calculates the number of tokens in a given text. This functionality is essential for respecting the model's token limit and ensuring accurate processing by the Anthropic AI.
|
|
||||||
|
|
||||||
## Usage Examples
|
|
||||||
|
|
||||||
Before delving into detailed examples, make sure you have `swarms.tokenizers` installed and ready. If `anthropic` is an optional dependency, ensure that it's installed as well.
|
|
||||||
|
|
||||||
### 1. Tokenizing with Default Settings
|
|
||||||
|
|
||||||
```python
from swarms.tokenizers import AnthropicTokenizer

# Initialize the tokenizer with default settings
tokenizer = AnthropicTokenizer()

# Tokenize a sample text
text = "Hello world! This is an example text to tokenize."
token_count = tokenizer.count_tokens(text)

print(f"Number of tokens: {token_count}")
```
|
|
||||||
|
|
||||||
In this example, we use the `AnthropicTokenizer` to count the number of tokens in a simple text. The token count can be crucial for managing inputs to the AI model.
|
|
||||||
|
|
||||||
### 2. Tokenizing with Custom Model
|
|
||||||
|
|
||||||
```python
from swarms.tokenizers import AnthropicTokenizer

# Define a custom model
custom_model = "claude"

# Initialize the tokenizer with a custom model and max_tokens
tokenizer = AnthropicTokenizer(model=custom_model, max_tokens=1000)

# Process a larger text
large_text = "..."  # Assume large_text is a string with meaningful content

token_count = tokenizer.count_tokens(large_text)
if token_count > tokenizer.max_tokens:
    print("Text exceeds the maximum token limit.")
else:
    print(f"Token count within limit: {token_count}")
```
|
|
||||||
|
|
||||||
This snippet demonstrates setting up the tokenizer for a custom model and a higher maximum token limit. It is helpful when dealing with texts larger than the default token limit.
|
|
||||||
|
|
||||||
### 3. Handling Error in Token Count Function
|
|
||||||
|
|
||||||
```python
from swarms.tokenizers import AnthropicTokenizer

# Initialize the tokenizer
tokenizer = AnthropicTokenizer()

# Attempt to tokenize a non-string input (which will raise an error)
non_string_input = ["This", "is", "a", "list", "not", "a", "string"]

try:
    tokenizer.count_tokens(non_string_input)
except ValueError as e:
    print(f"Error: {e}")
```
|
|
||||||
|
|
||||||
This example illustrates the error management within the `count_tokens` method. It is important to handle exceptions gracefully, particularly when a non-string input is provided.
|
|
||||||
|
|
||||||
## Additional Tips and Considerations
|
|
||||||
|
|
||||||
- Always ensure the input text is a string before calling `count_tokens` to avoid unnecessary errors.
- Be aware of the `max_tokens` limit since larger models might have significantly higher limits than defaults.
- When tokenizing large datasets, batch processing with a loop or parallelization might provide better performance.
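
As a rough sketch of the batching tip, the helper below counts tokens for many texts with a thread pool. The names `count_many` and `whitespace_count` are hypothetical, and the whitespace counter is only a stand-in so the example runs without a client. Threads are likely to help only when each count involves a network call.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List


def count_many(
    count_tokens: Callable[[str], int],
    texts: Iterable[str],
    max_workers: int = 4,
) -> List[int]:
    """Count tokens for each text concurrently, preserving input order."""
    items = list(texts)
    # Validate up front so a bad item fails fast instead of inside a worker.
    for item in items:
        if not isinstance(item, str):
            raise ValueError("Text must be a string.")
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(count_tokens, items))


# Stand-in counter for the sketch; swap in tokenizer.count_tokens in real use.
def whitespace_count(text: str) -> int:
    return len(text.split())


print(count_many(whitespace_count, ["one two", "three four five"]))  # [2, 3]
```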
|
|
||||||
|
|
||||||
## Resources and References
|
|
||||||
|
|
||||||
Given that `AnthropicTokenizer` interacts with an AI model and optional dependencies, it is beneficial to refer to the official documentation and guides specific to those components:
|
|
||||||
|
|
||||||
- [Anthropic Model Documentation](#) (Link would be replaced with actual URL)
|
|
||||||
- [swarms.tokenizers Installation Guide](#)
|
|
||||||
- [Python `dataclasses` Documentation](https://docs.python.org/3/library/dataclasses.html)
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
By following the provided documentation and recommended practices, developers and researchers can harness the power of `AnthropicTokenizer` to its full potential, facilitating optimal use of Anthropic's AI models for varied text processing tasks.
|
|
@ -1,63 +0,0 @@
|
|||||||
# Documentation for `swarms.tokenizers.BaseTokenizer`
|
|
||||||
|
|
||||||
## Overview and Introduction
|
|
||||||
|
|
||||||
The `swarms.tokenizers` library is designed to provide flexible and efficient tokenization utilities for natural language processing (NLP) tasks. The `BaseTokenizer` class serves as a foundational abstract class from which specific tokenizer implementations can be derived. This class outlines essential functions and properties all tokenizers should have, ensuring consistency and capturing common behaviors required for processing textual data.
|
|
||||||
|
|
||||||
## Class Definition: `BaseTokenizer`
|
|
||||||
|
|
||||||
### Attributes and Methods
|
|
||||||
|
|
||||||
| Name                | Type                             | Description                                                            |
| ------------------- | -------------------------------- | ---------------------------------------------------------------------- |
| `max_tokens`        | `int`                            | Maximum number of tokens the tokenizer can process.                    |
| `stop_token`        | `str`                            | Token used to denote the end of processing.                            |
| `stop_sequences`    | `List[str]` (read-only)          | List of stop sequences initialized post-instantiation.                 |
| `count_tokens_left` | Method: `(text) -> int`          | Computes the number of tokens that can still be added given the text.  |
| `count_tokens`      | Abstract Method: `(text) -> int` | Returns the number of tokens in the given text.                        |
|
|
||||||
|
|
||||||
## Functionality and Usage
|
|
||||||
|
|
||||||
The `BaseTokenizer` class provides the structure for creating tokenizers. It includes methods for counting the tokens in a given text and determining how many more tokens can be added without exceeding the `max_tokens` limit. This class should be subclassed, and the `count_tokens` method must be implemented in subclasses to provide the specific token counting logic.
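
The relationship between `count_tokens`, `count_tokens_left`, and `max_tokens` can be pictured with the stand-in below. It is not the library class: `SketchBaseTokenizer` and `WordTokenizer` are hypothetical names, and clamping the remaining budget at zero is an assumption about how over-long text is handled.

```python
from abc import ABC, abstractmethod


class SketchBaseTokenizer(ABC):
    """Illustrative stand-in for BaseTokenizer, not the library class."""

    def __init__(self, max_tokens: int) -> None:
        self.max_tokens = max_tokens

    @abstractmethod
    def count_tokens(self, text: str) -> int:
        """Subclasses supply the actual counting logic."""

    def count_tokens_left(self, text: str) -> int:
        # Remaining budget, clamped at zero (the clamp is an assumption).
        return max(self.max_tokens - self.count_tokens(text), 0)


class WordTokenizer(SketchBaseTokenizer):
    def count_tokens(self, text: str) -> int:
        return len(text.split())


tok = WordTokenizer(max_tokens=5)
print(tok.count_tokens_left("a b c"))          # 2
print(tok.count_tokens_left("a b c d e f g"))  # 0
```

Because `count_tokens` is abstract, instantiating the base class directly raises a `TypeError`, which is what forces subclasses to provide their own counting logic.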
|
|
||||||
|
|
||||||
### Example: Subclassing `BaseTokenizer`
|
|
||||||
|
|
||||||
```python
from typing import List, Union

from swarms.tokenizers import BaseTokenizer


class SimpleTokenizer(BaseTokenizer):
    def count_tokens(self, text: Union[str, List[dict]]) -> int:
        if isinstance(text, str):
            # Split text by spaces as a simple tokenization approach
            return len(text.split())
        elif isinstance(text, list):
            # Assume list of dictionaries with 'token' key
            return sum(len(item["token"].split()) for item in text)
        else:
            raise TypeError("Unsupported type for text")


# Usage example
tokenizer = SimpleTokenizer(max_tokens=100)
text = "This is an example sentence to tokenize."
print(tokenizer.count_tokens(text))  # Outputs: 7 (assuming space tokenization)
remaining_tokens = tokenizer.count_tokens_left(text)
print(remaining_tokens)  # Outputs: 93
```
|
|
||||||
|
|
||||||
### Note:
|
|
||||||
|
|
||||||
Understand that the `stop_sequences` and `stop_token` in this particular implementation are placeholders to illustrate the pattern. The actual logic may differ based on specific tokenizer requirements.
|
|
||||||
|
|
||||||
## Additional Information and Tips
|
|
||||||
|
|
||||||
- Tokenization is a vital step in text processing for NLP. It should be tailored to the requirements of the application.
|
|
||||||
- Ensure that tokenizer definitions are in sync with the models and datasets being used.
|
|
||||||
|
|
||||||
## References and Resources
|
|
||||||
|
|
||||||
For a deeper understanding of tokenization and its role in NLP, refer to:
|
|
||||||
|
|
||||||
- [Tokenization, from *Introduction to Information Retrieval* (Stanford NLP)](https://nlp.stanford.edu/IR-book/html/htmledition/tokenization-1.html)
|
|
||||||
- [Hugging Face Tokenizers](https://huggingface.co/docs/tokenizers/python/latest/) - a popular library for tokenization, particularly in the context of transformer models.
|
|
@ -1,135 +0,0 @@
|
|||||||
# CohereTokenizer Documentation
|
|
||||||
|
|
||||||
The `CohereTokenizer` class is designed to interface with Cohere language models and provides methods for tokenizing text inputs. This tokenizer plays a crucial role in preparing data for a Cohere model, which operates on tokens rather than raw text.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
### Class Name: `CohereTokenizer`
|
|
||||||
|
|
||||||
## Overview
|
|
||||||
|
|
||||||
The `CohereTokenizer` class is essential for interacting with Cohere models that require tokenized input. As models often operate on tokens, having an intuitive and efficient tokenizer directly linked to the model simplifies preprocessing tasks. This tokenizer counts the tokens in the given text, helping users to manage and understand the tokens they can work with, given limitations like the model's maximum token count.
|
|
||||||
|
|
||||||
## Architecture and How the Class Works
|
|
||||||
|
|
||||||
The `CohereTokenizer` is built as a data class, ensuring that it is lightweight and focused solely on its data attributes and methods related to tokenization. The class relies on an instance of a Cohere `Client`, which needs to be instantiated with an API key from Cohere before use.
|
|
||||||
|
|
||||||
Upon instantiation, the `CohereTokenizer` holds a reference to a specific Cohere model and interfaces with the `Client` to tokenize text accordingly. It provides a simple utility (`count_tokens`) that returns the number of tokens a string would be broken down into by the Cohere API. Although the type hint also admits a list, the method raises a `ValueError` for any input that is not a string.
|
|
||||||
|
|
||||||
## Purpose and Usage
|
|
||||||
|
|
||||||
The `CohereTokenizer` is specifically made for users who are working with Cohere language models. It's designed to help them in preprocessing steps by converting text into tokens and determining how many tokens their text segments contain. This is crucial for ensuring that inputs do not exceed the model's maximum token count, as exceeding this limit can result in errors or truncated text.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
# Class Definition
|
|
||||||
|
|
||||||
```python
from dataclasses import dataclass

from cohere import Client


@dataclass
class CohereTokenizer:
    model: str
    client: Client
    DEFAULT_MODEL: str = "command"
    DEFAULT_MAX_TOKENS: int = 2048
    max_tokens: int = DEFAULT_MAX_TOKENS
```
|
|
||||||
|
|
||||||
## Parameters
|
|
||||||
|
|
||||||
| Parameter            | Type     | Description                                                    | Default Value        |
| -------------------- | -------- | -------------------------------------------------------------- | -------------------- |
| `model`              | `str`    | Specifies the Cohere model to be used for tokenization.        | None (required)      |
| `client`             | `Client` | An instance of the Cohere client, initialized with an API key. | None (required)      |
| `DEFAULT_MODEL`      | `str`    | The default model to use if none is specified.                 | `"command"`          |
| `DEFAULT_MAX_TOKENS` | `int`    | Default maximum number of tokens the model accepts.            | `2048`               |
| `max_tokens`         | `int`    | Maximum number of tokens; it can be altered to fit the model.  | `DEFAULT_MAX_TOKENS` |
|
|
||||||
|
|
||||||
### Methods
|
|
||||||
|
|
||||||
The `CohereTokenizer` class contains the following method:
|
|
||||||
|
|
||||||
#### `count_tokens`
|
|
||||||
|
|
||||||
```python
def count_tokens(self, text: str | list) -> int:
    """
    Count the number of tokens in the given text.

    Args:
        text (str | list): The input text to tokenize.

    Returns:
        int: The number of tokens in the text.

    Raises:
        ValueError: If the input text is not a string.
    """
```
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
# Functionality and Usage Example
|
|
||||||
|
|
||||||
Below are examples demonstrating how to use `CohereTokenizer`.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Counting Tokens
|
|
||||||
|
|
||||||
### Initialization
|
|
||||||
|
|
||||||
First, the Cohere client must be initialized and passed in to create an instance of `CohereTokenizer`.
|
|
||||||
|
|
||||||
```python
from cohere import Client

from swarms.tokenizers import CohereTokenizer

# Initialize Cohere client with your API key
cohere_client = Client("your-api-key")

# Instantiate the tokenizer
tokenizer = CohereTokenizer(model="your-model-name", client=cohere_client)
```
|
|
||||||
|
|
||||||
### Count Tokens Example 1
|
|
||||||
|
|
||||||
Counting tokens for a single string.
|
|
||||||
|
|
||||||
```python
text_to_tokenize = "Hello, World!"
token_count = tokenizer.count_tokens(text_to_tokenize)
print(f"Number of tokens: {token_count}")
```
|
|
||||||
|
|
||||||
### Count Tokens Example 2
|
|
||||||
|
|
||||||
Trying to pass a list instead of a single string, which would raise an error.
|
|
||||||
|
|
||||||
```python
texts_to_tokenize = ["Hello, World!", "Another piece of text."]
try:
    token_count = tokenizer.count_tokens(texts_to_tokenize)
except ValueError as e:
    print(f"Error: {e}")
```
|
|
||||||
|
|
||||||
The above code would print `Error: Text must be a string.` as the `count_tokens` function expects a string, not a list.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
# Additional Information and Tips
|
|
||||||
|
|
||||||
When working with the `CohereTokenizer`, here are some key points to keep in mind:
|
|
||||||
|
|
||||||
- The token count is important to know because Cohere models have a maximum token limit for input. If your text exceeds this limit, it must be split or truncated before being passed to the model.
- It is always a good practice to catch exceptions when using methods like `count_tokens` to handle unexpected inputs gracefully.
- Remember to replace `"your-api-key"` and `"your-model-name"` with your actual Cohere API key and desired model name.
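
One way to split text that exceeds the limit is to pack words greedily into chunks. The sketch below assumes a generic `count_tokens` callable. `split_to_fit` and `whitespace_count` are hypothetical helpers, and the whitespace counter only stands in for `tokenizer.count_tokens` so the example runs offline.

```python
from typing import Callable, List


def split_to_fit(
    text: str,
    count_tokens: Callable[[str], int],
    max_tokens: int,
) -> List[str]:
    """Greedily pack whitespace-separated words into chunks within max_tokens."""
    chunks: List[str] = []
    current: List[str] = []
    for word in text.split():
        candidate = " ".join(current + [word])
        if current and count_tokens(candidate) > max_tokens:
            # Adding this word would overflow, so close the current chunk.
            chunks.append(" ".join(current))
            current = [word]
        else:
            current.append(word)
    if current:
        chunks.append(" ".join(current))
    return chunks


def whitespace_count(text: str) -> int:
    return len(text.split())


print(split_to_fit("a b c d e", whitespace_count, max_tokens=2))
# ['a b', 'c d', 'e']
```

Two caveats apply to this sketch: a single word that is longer than the limit still ends up in a chunk of its own, and the counter is called once per word, which may be expensive when every count is an API request.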
|
|
||||||
|
|
||||||
# References and Resources
|
|
||||||
|
|
||||||
For more detailed information, refer to the following resources:
|
|
||||||
|
|
||||||
- [Cohere API documentation](https://docs.cohere.ai/)
|
|
||||||
- [Data Classes in Python](https://docs.python.org/3/library/dataclasses.html)
|
|
||||||
|
|
@ -1,127 +0,0 @@
|
|||||||
# HuggingFaceTokenizer Documentation
|
|
||||||
|
|
||||||
`HuggingFaceTokenizer` is a comprehensive Python class that leverages the Hugging Face `transformers` library to tokenize text using the SentencePiece tokenization mechanism. This class serves as a convenient wrapper for initializing and using tokenizer models from Hugging Face's transformer models, enabling easy integration of tokenizer functionality in various NLP tasks.
|
|
||||||
|
|
||||||
**Purpose and Architecture:**
|
|
||||||
|
|
||||||
Tokenization is a critical step in processing natural language wherein text is broken down into smaller elements (tokens), which can be further used for text analysis, language modeling, and other computational linguistics tasks. The `HuggingFaceTokenizer` provides methods to encode text (turning strings into lists of token IDs) and decode lists of token IDs back into human-readable text.
|
|
||||||
|
|
||||||
**Table of Contents:**
|
|
||||||
|
|
||||||
- [Overview](#overview)
- [Initialization](#initialization)
- [Attributes](#attributes)
- [Methods](#methods)
- [Usage Examples](#usage-examples)
- [References and Resources](#references-and-resources)
|
|
||||||
|
|
||||||
## Overview
|
|
||||||
|
|
||||||
The `HuggingFaceTokenizer` class is designed to streamline the process of tokenizing text for natural language processing (NLP). It encapsulates various functionalities, such as encoding text into tokens, decoding tokens into text, and identifying token IDs for special tokens.
|
|
||||||
|
|
||||||
## Initialization
|
|
||||||
|
|
||||||
`HuggingFaceTokenizer` is initialized by providing the directory containing the pretrained tokenizer model files. During its initialization, it configures its internal state for tokenization processes, prepares access to vocabulary, and establishes necessary properties for subsequent tokenization tasks.
|
|
||||||
|
|
||||||
### Constructor Parameters
|
|
||||||
|
|
||||||
| Parameter   | Data Type | Description                                          | Default |
|-------------|-----------|------------------------------------------------------|---------|
| `model_dir` | `str`     | The directory containing the tokenizer model files.  | None    |
|
|
||||||
|
|
||||||
### Attributes
|
|
||||||
|
|
||||||
| Attribute             | Data Type  | Description                                                 |
|-----------------------|------------|-------------------------------------------------------------|
| `vocab_size`          | `int`      | The size of the vocabulary used by the tokenizer.           |
| `bos_token_id`        | `int`      | The token ID representing the beginning of sequence token.  |
| `eos_token_id`        | `int`      | The token ID representing the end of sequence token.        |
| `prefix_space_tokens` | `Set[int]` | A set of token IDs without a prefix space.                  |
|
|
||||||
|
|
||||||
## Methods
|
|
||||||
|
|
||||||
### Vocabulary Related Methods
|
|
||||||
|
|
||||||
#### `vocab_size`
|
|
||||||
Returns the size of the tokenizer's vocabulary.
|
|
||||||
|
|
||||||
#### `bos_token_id`
Returns the token ID used for the beginning of a sequence.

#### `eos_token_id`
Returns the token ID used for the end of a sequence.
|
|
||||||
|
|
||||||
#### `prefix_space_tokens`
|
|
||||||
Returns a set of token IDs that start without prefix spaces.
|
|
||||||
|
|
||||||
### Tokenization Methods
|
|
||||||
|
|
||||||
#### `encode`
|
|
||||||
Encodes a given text into a sequence of token IDs.
|
|
||||||
|
|
||||||
#### `decode`
|
|
||||||
Decodes a given sequence of token IDs into human-readable text.
|
|
||||||
|
|
||||||
#### `indexes_containing_token`
|
|
||||||
Returns a list of token IDs that potentially could be decoded into the given token.
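
One plausible reading of this description is a substring scan over the vocabulary. The sketch below illustrates that reading with a toy vocabulary. It is an assumption about the behavior, not the class's actual implementation, and `toy_vocab` is invented for the example.

```python
from typing import List, Sequence


def indexes_containing_token(vocab: Sequence[str], token: str) -> List[int]:
    """Return IDs of vocabulary pieces whose text contains `token`."""
    return [idx for idx, piece in enumerate(vocab) if token in piece]


# Toy vocabulary; the index in the list plays the role of the token ID.
toy_vocab = ["<s>", "</s>", "▁the", "the", "▁then", "cat"]
print(indexes_containing_token(toy_vocab, "the"))  # [2, 3, 4]
```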
|
|
||||||
|
|
||||||
#### `__call__`
|
|
||||||
Tokenizes given text when the object is called like a function.
|
|
||||||
|
|
||||||
## Usage Examples
|
|
||||||
|
|
||||||
### 1. Initializing the Tokenizer
|
|
||||||
|
|
||||||
```python
from swarms.tokenizers import HuggingFaceTokenizer

# Initialize the tokenizer with the path to your tokenizer model.
tokenizer = HuggingFaceTokenizer("/path/to/your/model_dir")
```
|
|
||||||
|
|
||||||
### 2. Encoding Text
|
|
||||||
|
|
||||||
```python
# Tokenize a single sentence.
sentence = "The quick brown fox jumps over the lazy dog."
token_ids = tokenizer.encode(sentence)
print(token_ids)
```
|
|
||||||
|
|
||||||
### 3. Decoding Tokens
|
|
||||||
|
|
||||||
```python
# Assuming 'token_ids' contains a list of token IDs
decoded_text = tokenizer.decode(token_ids)
print(decoded_text)
```
|
|
||||||
|
|
||||||
### 4. Getting Special Token IDs
|
|
||||||
|
|
||||||
```python
# Get the beginning of sequence token ID
bos_id = tokenizer.bos_token_id
print(f"BOS token ID: {bos_id}")

# Get the end of sequence token ID
eos_id = tokenizer.eos_token_id
print(f"EOS token ID: {eos_id}")
```
|
|
||||||
|
|
||||||
### 5. Using the Tokenizer
|
|
||||||
|
|
||||||
```python
# Tokenize a prompt directly by calling the object with a string.
text = "Hello, world!"
token_ids = tokenizer(text)
print(token_ids)
```
|
|
||||||
|
|
||||||
## References and Resources
|
|
||||||
|
|
||||||
For more in-depth information on the Hugging Face `transformers` library and SentencePiece, refer to the following resources:
|
|
||||||
|
|
||||||
- Hugging Face `transformers` library documentation: https://huggingface.co/docs/transformers/index
|
|
||||||
- SentencePiece repository and documentation: https://github.com/google/sentencepiece
|
|
||||||
|
|
||||||
This documentation provides an introductory overview of the `HuggingFaceTokenizer` class. For a more extensive guide on the various parameters, functionalities, and advanced usage scenarios, users should refer to the detailed library documentation and external resources provided above.
|
|
@ -1,89 +0,0 @@
|
|||||||
# OpenAITokenizer
|
|
||||||
|
|
||||||
The `OpenAITokenizer` class is a versatile and intuitive tokenizer designed for use with OpenAI's various language models, including the powerful GPT series. This class addresses the need to efficiently tokenize text for submission to OpenAI's API endpoints, managing different models and their unique tokenization schemes with ease.
|
|
||||||
|
|
||||||
Its key features are:
- Support for multiple OpenAI models including GPT-3 and GPT-4.
- Dynamic token counting that considers model-specific details.
- Straightforward API intended for easy integration with larger systems.
|
|
||||||
|
|
||||||
## Architecture and Design
|
|
||||||
|
|
||||||
The class adheres to a simple yet effective design, offering methods for calculating token lengths and embedded properties that manage model-specific characteristics such as maximum tokens and encodings. A data class structure is used for clean initializations and better management of class data.
|
|
||||||
|
|
||||||
The `OpenAITokenizer` exposes model-specific settings through getters such as `encoding` and `default_max_tokens`, and token counting through the `count_tokens` and `len` methods.
|
|
||||||
|
|
||||||
## Attributes
|
|
||||||
|
|
||||||
The class contains several key constants and properties that define defaults and settings for use with different models:
|
|
||||||
|
|
||||||
| Attribute                               | Type   | Description                                                |
|-----------------------------------------|--------|------------------------------------------------------------|
| `DEFAULT_OPENAI_GPT_3_COMPLETION_MODEL` | `str`  | Default completion model for OpenAI GPT-3.                 |
| `DEFAULT_OPENAI_GPT_3_CHAT_MODEL`       | `str`  | Default chat model for OpenAI GPT-3.                       |
| `DEFAULT_OPENAI_GPT_4_MODEL`            | `str`  | Default model for OpenAI GPT-4.                            |
| `DEFAULT_ENCODING`                      | `str`  | Default encoding for text.                                 |
| `DEFAULT_MAX_TOKENS`                    | `int`  | Default maximum number of tokens based on the model.       |
| `TOKEN_OFFSET`                          | `int`  | Token offset applicable to some models.                    |
| `MODEL_PREFIXES_TO_MAX_TOKENS`          | `dict` | Mapping of model prefixes to their respective max tokens.  |
| `EMBEDDING_MODELS`                      | `list` | List of embedding models supported.                        |
| `model`                                 | `str`  | Name of the model currently being used.                    |
|
|
||||||
|
|
||||||
## Methods
|
|
||||||
|
|
||||||
The `OpenAITokenizer` class offers a variety of methods:
|
|
||||||
|
|
||||||
| Method               | Arguments                               | Return Type | Description                                                                                      |
|----------------------|-----------------------------------------|-------------|--------------------------------------------------------------------------------------------------|
| `__post_init__`      | None                                    | `None`      | Method called after the class has been initialized to set up default values.                     |
| `encoding`           | None                                    | `Encoding`  | Getter method that retrieves the encoding based on the specified model.                          |
| `default_max_tokens` | None                                    | `int`       | Calculates the default max tokens based on the current model or defaults if not model-specific.  |
| `count_tokens`       | `text: str \| list[dict]`, `model: str` | `int`       | Counts the number of tokens within a given text or a list of messages.                           |
| `len`                | `text: str \| list[dict]`, `model: str` | `int`       | Wrapper for `count_tokens`, providing a more intuitive naming convention.                        |
|
|
||||||
|
|
||||||
### Usage Examples
|
|
||||||
|
|
||||||
The examples below show how to initialize the `OpenAITokenizer` and how to count tokens for plain text and for a list of messages.
|
|
||||||
|
|
||||||
#### Example 1: Initializing the Tokenizer
|
|
||||||
|
|
||||||
```python
from swarms.tokenizers import OpenAITokenizer

tokenizer = OpenAITokenizer(model="gpt-4")
```
|
|
||||||
|
|
||||||
This example creates a new instance of `OpenAITokenizer` set to work with the GPT-4 model.
|
|
||||||
|
|
||||||
#### Example 2: Counting Tokens
|
|
||||||
|
|
||||||
```python
text = "Hello, this is an example text to tokenize."

# Initialize the tokenizer
tokenizer = OpenAITokenizer(model="gpt-4")

# Count tokens
num_tokens = tokenizer.count_tokens(text)
print(f"Number of tokens: {num_tokens}")
```
|
|
||||||
|
|
||||||
This code snippet demonstrates how to count the number of tokens in a string of text using the specified model's encoding.
|
|
||||||
|
|
||||||
#### Example 3: Custom Model Token Counting
|
|
||||||
|
|
||||||
```python
messages = [
    {"name": "Alice", "message": "Hello! How are you?"},
    {"name": "Bob", "message": "I'm good! Just working on some code."},
]

tokenizer = OpenAITokenizer(model="gpt-3.5-turbo")

# Count tokens for a list of messages
num_tokens = tokenizer.len(messages, model="gpt-3.5-turbo-0613")
print(f"Total tokens for messages: {num_tokens}")
```
|
|
||||||
|
|
||||||
In this example, we're invoking the `len` method to count the tokens in a conversation thread. Each message is represented as a dictionary with a `name` and `message` field.
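
To make the idea of message-level counting concrete, here is a minimal sketch of how a total might be assembled from per-field token counts plus fixed overheads. The three overhead constants, the `count_message_tokens` helper, and the `fake_encode` stand-in are all assumptions for illustration. The real class relies on the model's encoding and may use different values.

```python
from typing import Callable, Dict, List

TOKENS_PER_MESSAGE = 3  # assumed per-message framing overhead
TOKENS_PER_NAME = 1     # assumed extra cost when a "name" field is present
REPLY_PRIMING = 3       # assumed tokens that prime the model's reply


def count_message_tokens(
    messages: List[Dict[str, str]],
    encode: Callable[[str], List[int]],
) -> int:
    """Sum encoded field lengths plus fixed per-message overheads."""
    total = REPLY_PRIMING
    for message in messages:
        total += TOKENS_PER_MESSAGE
        for key, value in message.items():
            total += len(encode(value))
            if key == "name":
                total += TOKENS_PER_NAME
    return total


# Stand-in encoder: one fake ID per whitespace-separated word.
def fake_encode(text: str) -> List[int]:
    return list(range(len(text.split())))


messages = [{"name": "Alice", "message": "Hello! How are you?"}]
print(count_message_tokens(messages, fake_encode))  # 12
```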
|
|
||||||
|
|
@ -1,163 +0,0 @@
|
|||||||
# swarms.tokenizers Documentation
|
|
||||||
|
|
||||||
`swarms.tokenizers` is a PyTorch-like tokenization library designed to facilitate natural language processing (NLP) tasks by converting text inputs into a form that machine learning models can interpret. In this documentation, we will outline how to utilize the `SentencePieceTokenizer` class from the `swarms.tokenizers` library, which offers sentencepiece tokenization, a language-independent subword tokenizer and detokenizer.
|
|
||||||
|
|
||||||
## Purpose and Architecture of `SentencePieceTokenizer`
|
|
||||||
|
|
||||||
The `SentencePieceTokenizer` class uses a pre-trained sentencepiece model to tokenize and detokenize texts. SentencePiece is an unsupervised text tokenizer and detokenizer that allows the generation of a subword vocabulary from raw data. By breaking text down into subword units (like wordpieces or byte-pair-encodings), SentencePiece handles languages without a clear word boundary and can improve the performance of text processing in neural network models.
|
|
||||||
|
|
||||||
In `SentencePieceTokenizer`, the tokenization process is language-agnostic. The segmentation strategy, such as byte pair encoding (BPE) or the unigram language model, is fixed when the sentencepiece model is trained. The class is designed with ease of use in mind, allowing seamless integration with other components of the NLP pipeline.
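
The toy example below illustrates the subword idea and the `▁` whitespace marker that SentencePiece uses to keep detokenization reversible. It is only a sketch: the vocabulary is invented, and greedy longest-match segmentation is a simple stand-in for the BPE or unigram algorithms used by real sentencepiece models.

```python
from typing import Dict, List

# Toy vocabulary (assumption); "▁" marks a piece that begins a new word.
VOCAB: Dict[str, int] = {
    "<unk>": 0, "▁": 1, "▁token": 2, "izer": 3, "▁un": 4, "known": 5,
}
ID_TO_PIECE = {idx: piece for piece, idx in VOCAB.items()}


def encode(text: str) -> List[int]:
    """Greedy longest-match segmentation over the toy vocabulary."""
    normalized = "▁" + text.replace(" ", "▁")
    ids: List[int] = []
    pos = 0
    while pos < len(normalized):
        # Try the longest candidate first; fall back to <unk> for one char.
        for end in range(len(normalized), pos, -1):
            piece = normalized[pos:end]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                pos = end
                break
        else:
            ids.append(VOCAB["<unk>"])
            pos += 1
    return ids


def decode(ids: List[int]) -> str:
    """Join pieces and turn the whitespace marker back into spaces."""
    text = "".join(ID_TO_PIECE[i] for i in ids)
    return text.replace("▁", " ").lstrip(" ")


ids = encode("tokenizer unknown")
print(ids)          # [2, 3, 4, 5]
print(decode(ids))  # tokenizer unknown
```

Because spaces are encoded as part of the pieces, the original text can be rebuilt by concatenation alone, without language-specific rules about where words begin.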
|
|
||||||
|
|
||||||
## Class Definition
|
|
||||||
|
|
||||||
```python
class SentencePieceTokenizer:
    """
    Tokenizer of sentencepiece.

    Args:
        model_file (str): the path of the tokenizer model
    """
```
|
|
||||||
|
|
||||||
## Initialization Parameters
|
|
||||||
|
|
||||||
Parameter | Type | Description
----------|------|-------------
`model_file` | `str` | The path to the pretrained sentencepiece model file.
|
|
||||||
|
|
||||||
## Methods and Usage
|
|
||||||
|
|
||||||
Below, we detail the methods available in `SentencePieceTokenizer`, including their parameters, their functionality, and usage examples.
|
|
||||||
|
|
||||||
### Method: `__init__`
|
|
||||||
|
|
||||||
Creates an instance of `SentencePieceTokenizer` from the specified sentencepiece model file.
|
|
||||||
|
|
||||||
#### Parameters
|
|
||||||
|
|
||||||
Parameter | Type | Description
----------|------|-------------
`model_file` | `str` | The path to the pretrained sentencepiece model file.
|
|
||||||
|
|
||||||
#### Example
|
|
||||||
|
|
||||||
```python
from swarms.tokenizers import SentencePieceTokenizer

tokenizer = SentencePieceTokenizer(model_file="your_model.model")
```
|
|
||||||
|
|
||||||
### Properties: Vocabulary Information
|
|
||||||
|
|
||||||
These properties provide access to various vocabulary-specific information.
|
|
||||||
|
|
||||||
#### `vocab_size`
|
|
||||||
#### `bos_token_id`
|
|
||||||
#### `eos_token_id`
|
|
||||||
|
|
||||||
##### Example
|
|
||||||
|
|
||||||
```python
vocab_size = tokenizer.vocab_size
print(f"Vocabulary size: {vocab_size}")

bos_id = tokenizer.bos_token_id
eos_id = tokenizer.eos_token_id
print(f"BOS token ID: {bos_id}, EOS token ID: {eos_id}")
```
|
|
||||||
|
|
||||||
### Method: `indexes_containing_token`
|
|
||||||
|
|
||||||
Finds possible tokenizer indexes that, when decoded, may contain the input token.
|
|
||||||
|
|
||||||
#### Parameters
|
|
||||||
|
|
||||||
Parameter | Type | Description
----------|------|-------------
`token` | `str` | The token for which possible indexes are to be found.
|
|
||||||
|
|
||||||
#### Returns
|
|
||||||
|
|
||||||
- `List[int]`: List of tokenizer indexes that might contain the token.
|
|
||||||
|
|
||||||
#### Example
|
|
||||||
|
|
||||||
```python
indexes = tokenizer.indexes_containing_token("▁the")
print(f"Indexes containing '▁the': {indexes}")
```
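
The lookup can be pictured as a scan over the vocabulary for pieces whose text contains the query. The following is a minimal sketch under that assumption; `indexes_containing` and `toy_vocab` are hypothetical names for this example, and the real method may apply additional rules.

```python
def indexes_containing(token: str, vocab: dict[str, int]) -> list[int]:
    """Return the IDs of every vocabulary piece whose text contains `token`."""
    return sorted(idx for piece, idx in vocab.items() if token in piece)


# Hypothetical piece-to-ID mapping, for illustration only.
toy_vocab = {"▁the": 0, "▁then": 1, "▁other": 2, "the": 3, "▁cat": 4}

print(indexes_containing("the", toy_vocab))   # [0, 1, 2, 3]
print(indexes_containing("▁the", toy_vocab))  # [0, 1]
```

Including the `▁` word-boundary marker in the query narrows the result to pieces that start a word, which is why the example above searches for `"▁the"` rather than `"the"`.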
|
|
||||||
|
|
||||||
### Method: `encode`
|
|
||||||
|
|
||||||
Tokenizes a text prompt into a list of token IDs.
|
|
||||||
|
|
||||||
#### Parameters
|
|
||||||
|
|
||||||
Parameter | Type | Description
----------|------|-------------
`s` | `str` | The text prompt to tokenize.
`add_bos` | `bool` | If `True`, it adds the beginning-of-sentence token. (default: `True`)
|
|
||||||
|
|
||||||
#### Returns
|
|
||||||
- `List[int]`: List of token IDs representing the text prompt.
|
|
||||||
|
|
||||||
#### Example
|
|
||||||
|
|
||||||
```python
encoded_ids = tokenizer.encode("Hello, world!", add_bos=True)
print(f"Encoded token IDs: {encoded_ids}")
```
|
|
||||||
|
|
||||||
### Method: `decode`
|
|
||||||
|
|
||||||
Detokenizes a list of token IDs into text.
|
|
||||||
|
|
||||||
#### Parameters
|
|
||||||
|
|
||||||
Parameter | Type | Description
----------|------|-------------
`t` | `List[int]` | A list of token IDs to detokenize.
`offset` | `Optional[int]` | For incremental decoding. Defaults to `None`, which means it is not applied.
|
|
||||||
|
|
||||||
#### Returns
|
|
||||||
|
|
||||||
- `str`: Text representation of the decoded token IDs.
|
|
||||||
|
|
||||||
#### Example
|
|
||||||
|
|
||||||
```python
# encoded_ids already starts with the BOS token because encode() was
# called with add_bos=True, so it is not prepended again here.
decoded_text = tokenizer.decode(encoded_ids)
print(f"Decoded text: {decoded_text}")
```
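
The `offset` parameter is described only as being "for incremental decoding". One plausible reading, sketched below with a toy decoder, is that it marks the first token that has not yet been turned into text, so a streaming consumer decodes only the new tail each time. Both `toy_decode` and `id_to_piece` are hypothetical stand-ins, and the actual semantics of `decode(t, offset=...)` should be checked against the implementation.

```python
def toy_decode(ids, id_to_piece):
    """Join pieces and map the "▁" boundary marker back to spaces."""
    return "".join(id_to_piece[i] for i in ids).replace("▁", " ")


# Hypothetical ID-to-piece table, for illustration only.
id_to_piece = {0: "▁Hello", 1: ",", 2: "▁wor", 3: "ld", 4: "!"}
stream = [0, 1, 2, 3, 4]  # token IDs arriving one at a time

received, offset, chunks = [], 0, []
for token_id in stream:
    received.append(token_id)
    # Decode only the tokens that arrived since the last emitted chunk.
    chunks.append(toy_decode(received[offset:], id_to_piece))
    offset = len(received)

print(chunks)                     # [' Hello', ',', ' wor', 'ld', '!']
print("".join(chunks).lstrip())   # Hello, world!
```

Decoding only the tail keeps per-token work constant, but the leading-space handling of each chunk matters: dropping the space in front of `" wor"` would glue words together in the streamed output.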
|
|
||||||
|
|
||||||
### Method: `__call__`
|
|
||||||
|
|
||||||
Tokenizes prompts when the class instance is used as a callable.
|
|
||||||
|
|
||||||
#### Parameters
|
|
||||||
|
|
||||||
Parameter | Type | Description
|
|
||||||
----------|------|-------------
|
|
||||||
`s` | `Union[str, Sequence[str]]` | Text prompts to tokenize.
`add_bos` | `bool` | If `True`, it adds the beginning-of-sentence token. (default: `False`)
`add_eos` | `bool` | If `True`, it adds the end-of-sentence token. (default: `False`)
|
|
||||||
|
|
||||||
#### Returns
|
|
||||||
|
|
||||||
- `addict.Addict`: Object with `input_ids` containing the list of token IDs.
|
|
||||||
|
|
||||||
#### Example
|
|
||||||
|
|
||||||
```python
input_data = tokenizer("Let's tokenize this sentence.")
print(f"Tokenized input IDs: {input_data.input_ids}")
```
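
The callable pattern itself can be sketched with a toy class. Everything here is an assumption made for illustration: `ToyTokenizer` splits on whitespace instead of using SentencePiece, and `types.SimpleNamespace` stands in for `addict.Addict` so the example needs only the standard library.

```python
from types import SimpleNamespace
from typing import Sequence, Union


class ToyTokenizer:
    """Whitespace tokenizer that mimics the callable interface."""

    def __init__(self, vocab: dict):
        self.vocab = vocab

    def encode(self, s: str) -> list:
        return [self.vocab[word] for word in s.split()]

    def __call__(self, s: Union[str, Sequence[str]]):
        # A single string yields one ID list; a sequence yields one list per item.
        if isinstance(s, str):
            input_ids = self.encode(s)
        else:
            input_ids = [self.encode(item) for item in s]
        # Attribute access (result.input_ids) mirrors the documented return object.
        return SimpleNamespace(input_ids=input_ids)


toy = ToyTokenizer({"hello": 1, "world": 2})
print(toy("hello world").input_ids)               # [1, 2]
print(toy(["hello", "world hello"]).input_ids)    # [[1], [2, 1]]
```

Returning an object rather than a bare list leaves room for extra fields later without changing the call signature.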
|
|
||||||
|
|
||||||
## Additional Information and Tips
|
|
||||||
|
|
||||||
The library caches some results internally for performance. For example, `indexes_containing_token` uses a deque to store the most recent lookups, so repeated queries for the same token do not re-traverse the vocabulary.
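
A bounded cache of recent lookups can be built on `collections.deque` with `maxlen`. The sketch below shows the general pattern only; the `RecentLookups` class, its capacity, and the `slow_scan` stand-in are assumptions for this example and do not reproduce the library's internals.

```python
from collections import deque


class RecentLookups:
    """Remember the results of the last `maxlen` lookups."""

    def __init__(self, maxlen: int = 10):
        # With maxlen set, appending to a full deque drops the oldest entry.
        self._recent = deque(maxlen=maxlen)

    def get(self, key, compute):
        for cached_key, value in self._recent:
            if cached_key == key:
                return value
        value = compute(key)
        self._recent.append((key, value))
        return value


calls = []


def slow_scan(token):
    """Stand-in for an expensive vocabulary traversal."""
    calls.append(token)
    return [len(token)]


cache = RecentLookups(maxlen=2)
cache.get("the", slow_scan)
cache.get("the", slow_scan)  # served from the cache
cache.get("cat", slow_scan)
cache.get("dog", slow_scan)  # cache is full, so "the" is evicted
cache.get("the", slow_scan)  # recomputed after eviction
print(calls)  # ['the', 'cat', 'dog', 'the']
```

A linear scan over a handful of entries is cheap, and the fixed `maxlen` bounds memory use no matter how many distinct tokens are queried.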
|
|
||||||
|
|
||||||
## Conclusion
|
|
||||||
|
|
||||||
This documentation provides an in-depth explanation of `swarms.tokenizers` with a focus on the `SentencePieceTokenizer` class. By following the examples and guidance detailed above, users should be able to effectively use the tokenizers for their NLP tasks. Users are also encouraged to refer to further resources and the official SentencePiece documentation for more advanced use cases and configurations.
|
|
@ -1,78 +0,0 @@
|
|||||||
# `Tokenizer` Class Documentation
|
|
||||||
|
|
||||||
The `Tokenizer` class tokenizes prompts into sequences of token IDs and converts token IDs back into readable text. It is initialized with a path to a pretrained tokenizer model and selects a tokenization backend based on which configuration files and pretrained models are available at that path.
|
|
||||||
|
|
||||||
## Initialization & Configuration
|
|
||||||
|
|
||||||
### Parameters:
|
|
||||||
|
|
||||||
| Parameter  | Type | Description                              | Required |
|------------|------|------------------------------------------|----------|
| model_file | str  | Path to the tokenizer model or directory | Yes      |
|
|
||||||
|
|
||||||
### Attributes:
|
|
||||||
|
|
||||||
| Attribute    | Type | Description                           |
|--------------|------|---------------------------------------|
| vocab_size   | int  | Size of the tokenizer's vocabulary    |
| bos_token_id | int  | ID of the beginning-of-sequence token |
| eos_token_id | int  | ID of the end-of-sequence token       |
|
|
||||||
|
|
||||||
### Methods:
|
|
||||||
|
|
||||||
| Method                              | Returns     | Description                                                    |
|-------------------------------------|-------------|----------------------------------------------------------------|
| `encode(s, add_bos=True, **kwargs)` | `list[int]` | Tokenizes a prompt and returns token IDs.                      |
| `decode(t, offset=None)`            | `str`       | Decodes a list of token IDs to a string.                       |
| `__call__(s)`                       | `list[int]` | Tokenizes prompts when the instance is called directly.        |
| `indexes_containing_token(token)`   | `list[int]` | Returns indexes in the vocabulary that may contain the token.  |
|
|
||||||
|
|
||||||
## Usage Examples
|
|
||||||
|
|
||||||
### Tokenizing a Prompt
|
|
||||||
|
|
||||||
```python
from swarms.tokenizers import Tokenizer

tokenizer = Tokenizer("/path/to/tokenizer.model")

# Tokenize a single prompt string
prompt = "Hello, world!"
token_ids = tokenizer.encode(prompt)
print(token_ids)
```
|
|
||||||
|
|
||||||
### Decoding Token IDs
|
|
||||||
|
|
||||||
```python
# Decode token IDs back into text
decoded_text = tokenizer.decode(token_ids)
print(decoded_text)
```
|
|
||||||
|
|
||||||
### Incremental Decoding
|
|
||||||
|
|
||||||
```python
# Incremental decoding with offset (useful for streaming applications)
partial_tokens = [token_ids[0]]  # simulate partially received tokens
decoded_partial = tokenizer.decode(partial_tokens, offset=0)
print(decoded_partial)
```
|
|
||||||
|
|
||||||
### Properties Access
|
|
||||||
|
|
||||||
```python
# Access vocabulary size and special token IDs
print("Vocabulary Size:", tokenizer.vocab_size)
print("BOS Token ID:", tokenizer.bos_token_id)
print("EOS Token ID:", tokenizer.eos_token_id)
```
|
|
||||||
|
|
||||||
### Indexes Containing Token
|
|
||||||
|
|
||||||
```python
# Find indexes that may output a specific token during decoding
token = "world"
indexes = tokenizer.indexes_containing_token(token)
print("Token Indexes:", indexes)
```
|
|