# limit_tokens_from_string
## Introduction

The `swarms.utils` library contains utility functions used across the codebase for machine learning and other operations. Among them is `limit_tokens_from_string()`, a function that truncates a given string to a specified number of tokens.
## Function: limit_tokens_from_string()

The `swarms.utils` library exposes the function `limit_tokens_from_string(string: str, model: str = "gpt-4", limit: int = 500) -> str`.
## Description

The function `limit_tokens_from_string()` limits the number of tokens in a given string to a specified threshold. It is primarily useful when handling large text data that needs to be chunked or truncated to a certain length, for example when computational resources are limited or when a model enforces a maximum input length.
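For instance, a long document can be capped to a fixed token budget before being passed to a model. A minimal sketch (the `long_document` value is a hypothetical placeholder):

```python
from swarms.utils import limit_tokens_from_string

long_document = "Lorem ipsum dolor sit amet. " * 2000  # hypothetical large text blob

# Cap the text at the default 500-token budget before sending it to a model.
prompt = limit_tokens_from_string(long_document, model="gpt-4", limit=500)
```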
## Parameters

| Parameter | Type | Default Value | Description |
| :---------- | :---------- | :------------ | :---------- |
| `string` | `str` | required | The input string whose tokens are to be limited. |
| `model` | `str` | `"gpt-4"` | The model whose tokenizer is used to encode and decode the string. Any model supported by `tiktoken` can be specified; if the model is not found, the function falls back to the `gpt2` encoding. |
| `limit` | `int` | `500` | The maximum number of tokens to keep. |
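The fallback mirrors how `tiktoken` resolves model names. A quick illustration of the lookup behavior (assuming `tiktoken` is installed; the model name here is deliberately invalid):

```python
import tiktoken

try:
    encoding = tiktoken.encoding_for_model("definitely-not-a-model")
except KeyError:
    # Unknown model names raise KeyError, so fall back to the gpt2 encoding.
    encoding = tiktoken.get_encoding("gpt2")

print(encoding.name)  # "gpt2"
```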
## Returns

| Return | Type | Description |
| :---------- | :---------- | :---------- |
| `out` | `str` | The string reconstructed from the encoded tokens after they have been limited to at most `limit` tokens. |
## Method Detail and Usage Examples

The method `limit_tokens_from_string()` takes three parameters: `string`, `model`, and `limit`.
First, it tries to get the encoding for the model specified in the `model` argument using `tiktoken.encoding_for_model(model)`. If the specified model is not found, the function falls back to the `gpt2` encoding.

Next, the input `string` is tokenized with the encoding's `encode` method, which produces a list of token IDs, `encoded`.

Then, the function slices `encoded` to keep only the first `limit` tokens.

Finally, the truncated tokens are converted back into a string with the encoding's `decode` method, and the resulting string `out` is returned.
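Putting the steps together, the logic corresponds to a short implementation. The following is a minimal sketch reconstructed from the behavior described above, not necessarily the exact source:

```python
import tiktoken


def limit_tokens_from_string(string: str, model: str = "gpt-4", limit: int = 500) -> str:
    """Truncate `string` to at most `limit` tokens using the model's tokenizer."""
    try:
        # Look up the tokenizer for the requested model.
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        # Fall back to the gpt2 encoding if the model is unknown.
        encoding = tiktoken.get_encoding("gpt2")

    encoded = encoding.encode(string)  # list of token IDs
    out = encoding.decode(encoded[:limit])  # keep only the first `limit` tokens
    return out
```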
### Example 1:

```python
from swarms.utils import limit_tokens_from_string

# longer input string
string = "This is a very long string that needs to be tokenized. This string might exceed the maximum token limit, so it will need to be truncated."

# lower token limit
limit = 10

output = limit_tokens_from_string(string, limit=limit)
```
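Because `limit=10`, the returned text contains at most 10 tokens. You can verify this by re-encoding the output (a quick check, assuming `tiktoken` is available):

```python
import tiktoken

encoding = tiktoken.encoding_for_model("gpt-4")
assert len(encoding.encode(output)) <= 10
```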
### Example 2:

```python
from swarms.utils import limit_tokens_from_string

# longer input string with a different model
string = "This string will be tokenized using gpt2 model. If the string is too long, it will be truncated."

# model
model = "gpt2"

output = limit_tokens_from_string(string, model=model)
```
### Example 3:

```python
from swarms.utils import limit_tokens_from_string

# try with a model name that tiktoken does not recognize
string = "In case the method does not find the specified model, it will fall back to gpt2 model."

# an unrecognized model name triggers the gpt2 fallback
model = "unknown-model"

output = limit_tokens_from_string(string, model=model)
```
**Note:** If you intentionally specify a model not supported by `tiktoken`, the function falls back to the `gpt2` encoding.