5.2 KiB
CohereTokenizer Documentation
The CohereTokenizer
class is designed to interface with Cohere language models and provides methods for tokenizing text inputs. This tokenizer plays a crucial role in preparing data for a Cohere model, which operates on tokens rather than raw text.
Class Name: CohereTokenizer
Overview
The CohereTokenizer
class is essential for interacting with Cohere models that require tokenized input. As models often operate on tokens, having an intuitive and efficient tokenizer directly linked to the model simplifies preprocessing tasks. This tokenizer counts the tokens in the given text, helping users to manage and understand the tokens they can work with, given limitations like the model's maximum token count.
Architecture and How the Class Works
The CohereTokenizer
is built as a data class, ensuring that it is lightweight and focused solely on its data attributes and methods related to tokenization. The class relies on an instance of a Cohere Client
, which needs to be instantiated with an API key from Cohere before use.
Upon instantiation, the CohereTokenizer
holds a reference to a specific Cohere model and interfaces with the Client
to tokenize text accordingly. It provides a simple utility (count_tokens
) to count the number of tokens that a string, or a list of strings, would be broken down into by the Cohere API.
Purpose and Usage
The CohereTokenizer
is specifically made for users who are working with Cohere language models. It's designed to help them in preprocessing steps by converting text into tokens and determining how many tokens their text segments contain. This is crucial for ensuring that inputs do not exceed the model's maximum token count, as exceeding this limit can result in errors or truncated text.
Class Definition
@dataclass
class CohereTokenizer:
model: str
client: Client
DEFAULT_MODEL: str = "command"
DEFAULT_MAX_TOKENS: int = 2048
max_tokens: int = DEFAULT_MAX_TOKENS
Parameters
Parameter | Type | Description | Default Value |
---|---|---|---|
model |
str |
Specifies the Cohere model to be used for tokenization. | None |
client |
Client |
An instance of the Cohere client, initialized with an API key. | None |
DEFAULT_MODEL |
str |
The default model to use if none is specified. | "command" |
DEFAULT_MAX_TOKENS |
int |
Default maximum number of tokens the model accepts. | 2048 |
max_tokens |
int |
Maximum number of tokens; it can be altered to fit the model. | DEFAULT_MAX_TOKENS |
Methods
The CohereTokenizer
class contains the following method:
count_tokens
def count_tokens(self, text: str | list) -> int:
"""
Count the number of tokens in the given text.
Args:
text (str | list): The input text to tokenize.
Returns:
int: The number of tokens in the text.
Raises:
ValueError: If the input text is not a string.
"""
Functionality and Usage Example
Below are examples demonstrating how to use CohereTokenizer
.
Counting Tokens
Initialization
First, the Cohere client must be initialized and passed in to create an instance of CohereTokenizer
.
from cohere import Client
from swarms.tokenizers import CohereTokenizer
# Initialize Cohere client with your API key
cohere_client = Client("your-api-key")
# Instantiate the tokenizer
tokenizer = CohereTokenizer(model="your-model-name", client=cohere_client)
Count Tokens Example 1
Counting tokens for a single string.
text_to_tokenize = "Hello, World!"
token_count = tokenizer.count_tokens(text_to_tokenize)
print(f"Number of tokens: {token_count}")
Count Tokens Example 2
Trying to pass a list instead of a single string, which would raise an error.
texts_to_tokenize = ["Hello, World!", "Another piece of text."]
try:
token_count = tokenizer.count_tokens(texts_to_tokenize)
except ValueError as e:
print(f"Error: {e}")
The above code would print Error: Text must be a string.
as the count_tokens
function expects a string, not a list.
Additional Information and Tips
When working with the CohereTokenizer
, here are some key points to keep in mind:
- The token count is important to know because Cohere models have a maximum token limit for input. If your text exceeds this limit, it must be split or truncated before being passed to the model.
- It is always a good practice to catch exceptions when using methods like
count_tokens
to handle unexpected inputs gracefully. - Remember to replace
'your-api-key'
and'your-model-name'
with your actual Cohere API key and desired model name.
References and Resources
For more detailed information, refer to the following resources: