# OpenAITokenizer

The `OpenAITokenizer` class is a versatile tokenizer designed for OpenAI's language models, including the GPT series. It tokenizes text for submission to OpenAI's API endpoints, handling each model's tokenization scheme and token limits.

The utility of `OpenAITokenizer` centers on its key features:

- Support for multiple OpenAI models, including GPT-3 and GPT-4.
- Dynamic token counting that accounts for model-specific encodings and limits.
- A straightforward API intended for easy integration with larger systems.

## Architecture and Design

The class adheres to a simple yet effective design: it offers methods for calculating token lengths alongside properties that manage model-specific characteristics such as maximum tokens and encodings. A dataclass structure keeps initialization clean and class data well organized.

`OpenAITokenizer` combines property-based and method-based access to give users the functionality needed to prepare text input for OpenAI models, as sketched below.
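
To make the dataclass-plus-properties design concrete, here is a minimal sketch of how such a tokenizer can be structured. The class name, default values, fallback encoding, and tiktoken calls are illustrative assumptions, not the actual swarms implementation:

```python
from dataclasses import dataclass, field

import tiktoken  # assumed dependency for OpenAI encodings


@dataclass
class SketchTokenizer:
    """Illustrative sketch only; names, defaults, and fallbacks are
    assumptions, not the actual swarms implementation."""

    model: str = "gpt-4"
    max_tokens: int = field(init=False)

    def __post_init__(self):
        # Fill in model-dependent defaults after the dataclass __init__ runs.
        self.max_tokens = self.default_max_tokens

    @property
    def encoding(self) -> tiktoken.Encoding:
        # Property-based access to the model-specific encoding.
        try:
            return tiktoken.encoding_for_model(self.model)
        except KeyError:
            return tiktoken.get_encoding("cl100k_base")  # assumed fallback

    @property
    def default_max_tokens(self) -> int:
        # Hypothetical prefix lookup standing in for
        # MODEL_PREFIXES_TO_MAX_TOKENS.
        return 8192 if self.model.startswith("gpt-4") else 4096

    def count_tokens(self, text: str) -> int:
        # Method-based counting built on the encoding property.
        return len(self.encoding.encode(text))


tokenizer = SketchTokenizer(model="gpt-4")
print(tokenizer.count_tokens("Hello, world!"))
```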

## Attributes

The class contains several key constants and properties that define defaults and settings for use with different models:

| Attribute | Type | Description |
|-----------|------|-------------|
| `DEFAULT_OPENAI_GPT_3_COMPLETION_MODEL` | `str` | Default completion model for OpenAI GPT-3. |
| `DEFAULT_OPENAI_GPT_3_CHAT_MODEL` | `str` | Default chat model for OpenAI GPT-3. |
| `DEFAULT_OPENAI_GPT_4_MODEL` | `str` | Default model for OpenAI GPT-4. |
| `DEFAULT_ENCODING` | `str` | Default encoding for text. |
| `DEFAULT_MAX_TOKENS` | `int` | Default maximum number of tokens, based on the model. |
| `TOKEN_OFFSET` | `int` | Token offset applicable to some models. |
| `MODEL_PREFIXES_TO_MAX_TOKENS` | `dict` | Mapping of model prefixes to their respective maximum token counts. |
| `EMBEDDING_MODELS` | `list` | List of supported embedding models. |
| `model` | `str` | Name of the model currently in use. |
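
Because these constants are ordinary class attributes, you can inspect them directly to see which defaults apply in your installed version (the printed values vary by version):

```python
from swarms.tokenizers import OpenAITokenizer

# Class-level constants can be read without an instance.
print(OpenAITokenizer.DEFAULT_ENCODING)
print(OpenAITokenizer.MODEL_PREFIXES_TO_MAX_TOKENS)

# Instance-level state reflects the model chosen at construction time.
tokenizer = OpenAITokenizer(model="gpt-4")
print(tokenizer.model)
```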

## Methods

The `OpenAITokenizer` class offers the following methods:

| Method | Arguments | Return Type | Description |
|--------|-----------|-------------|-------------|
| `__post_init__` | None | `None` | Called after initialization to set up default values. |
| `encoding` | None | `Encoding` | Getter that retrieves the encoding for the specified model. |
| `default_max_tokens` | None | `int` | Calculates the default maximum tokens for the current model, falling back to a general default if the model has no specific entry. |
| `count_tokens` | `text: str \| list[dict]`, `model: str` | `int` | Counts the number of tokens in a given text or list of messages. |
| `len` | `text: str \| list[dict]`, `model: str` | `int` | Wrapper around `count_tokens` with a more intuitive name. |
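
When you need the raw token IDs rather than just a count, the `encoding` getter can be used directly. The sketch below assumes it returns a tiktoken `Encoding` object, the usual backing for OpenAI tokenization:

```python
from swarms.tokenizers import OpenAITokenizer

tokenizer = OpenAITokenizer(model="gpt-4")

# Assumption: `encoding` exposes a tiktoken Encoding with encode/decode.
enc = tokenizer.encoding
token_ids = enc.encode("Tokenize me, please.")
print(token_ids)               # token IDs under the model's encoding
print(enc.decode(token_ids))   # round-trips back to the original text
```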

## Usage Examples

Several examples are provided below, detailing how to use `OpenAITokenizer` in different contexts.

### Example 1: Initializing the Tokenizer

```python
from swarms.tokenizers import OpenAITokenizer

tokenizer = OpenAITokenizer(model="gpt-4")
```

This example creates a new instance of `OpenAITokenizer` set to work with the GPT-4 model.

### Example 2: Counting Tokens

text = "Hello, this is an example text to tokenize."

# Initialize the tokenizer
tokenizer = OpenAITokenizer(model="gpt-4")

# Count tokens
num_tokens = tokenizer.count_tokens(text)
print(f"Number of tokens: {num_tokens}")

This code snippet demonstrates how to count the number of tokens in a string of text using the specified model's encoding.
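
Since `count_tokens` also takes a `model` argument (see the method table above), a single call can target a different model without constructing a new tokenizer. Passing the argument by keyword here is an assumption, consistent with how `len` is called in Example 3:

```python
from swarms.tokenizers import OpenAITokenizer

tokenizer = OpenAITokenizer(model="gpt-4")
text = "Hello, this is an example text to tokenize."

# Override the model for one call; the instance's own model is unchanged.
num_tokens = tokenizer.count_tokens(text, model="gpt-3.5-turbo")
print(f"Tokens under gpt-3.5-turbo: {num_tokens}")
```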

### Example 3: Custom Model Token Counting

```python
from swarms.tokenizers import OpenAITokenizer

messages = [
    {"name": "Alice", "message": "Hello! How are you?"},
    {"name": "Bob", "message": "I'm good! Just working on some code."},
]

tokenizer = OpenAITokenizer(model="gpt-3.5-turbo")

# Count tokens for a list of messages
num_tokens = tokenizer.len(messages, model="gpt-3.5-turbo-0613")
print(f"Total tokens for messages: {num_tokens}")
```

In this example, we invoke the `len` method to count the tokens in a conversation thread. Each message is represented as a dictionary with `name` and `message` fields.