# swarms.tokenizers Documentation

`swarms.tokenizers` is a PyTorch-like tokenization library designed to facilitate natural language processing (NLP) tasks by converting text inputs into a form that machine learning models can interpret. This documentation outlines how to use the `SentencePieceTokenizer` class from the `swarms.tokenizers` library, which provides SentencePiece tokenization: a language-independent subword tokenizer and detokenizer.
## Purpose and Architecture of SentencePieceTokenizer

The `SentencePieceTokenizer` class uses a pre-trained SentencePiece model to tokenize and detokenize text. SentencePiece is an unsupervised text tokenizer and detokenizer that generates a subword vocabulary from raw data. By breaking text down into subword units (such as wordpieces or byte-pair encodings), SentencePiece handles languages without clear word boundaries and can improve the performance of text processing in neural network models.

In `SentencePieceTokenizer`, the tokenization process is language-agnostic and supports a range of tokenization strategies, such as byte pair encoding (BPE), unigram, or a combination of both. The class is designed with ease of use in mind, allowing seamless integration with other components of the NLP pipeline.
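To make the subword behavior concrete, here is a minimal sketch using the `sentencepiece` package directly, assuming the model file is a standard pretrained SentencePiece model. `your_model.model` is a placeholder path, and the pieces shown in the comment are illustrative; actual output depends entirely on the trained model.

```python
import sentencepiece as spm

# Load a pretrained SentencePiece model (placeholder path).
sp = spm.SentencePieceProcessor(model_file="your_model.model")

# A rare word is split into known subword pieces; a BPE model might
# produce something like ['▁un', 'believ', 'able'].
print(sp.encode("unbelievable", out_type=str))
```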
Class Definition
class SentencePieceTokenizer:
"""
Tokenizer of sentencepiece.
Args:
model_file (str): the path of the tokenizer model
"""
### Initialization Parameters

| Parameter | Type | Description |
|---|---|---|
| `model_file` | `str` | The path to the pretrained SentencePiece model file. |
## Methods and Usage

Below, we detail the methods available in `SentencePieceTokenizer`, including their parameters, functionality, and usage examples.
### Method: `__init__`

Instantiates a `SentencePieceTokenizer` with the specified SentencePiece model.
#### Parameters

| Parameter | Type | Description |
|---|---|---|
| `model_file` | `str` | The path to the pretrained SentencePiece model file. |
#### Example

```python
from swarms.tokenizers import SentencePieceTokenizer

tokenizer = SentencePieceTokenizer(model_file='your_model.model')
```
### Properties: Vocabulary Information

These properties provide access to vocabulary-specific information:

- `vocab_size`
- `bos_token_id`
- `eos_token_id`
#### Example

```python
vocab_size = tokenizer.vocab_size
print(f"Vocabulary size: {vocab_size}")

bos_id = tokenizer.bos_token_id
eos_id = tokenizer.eos_token_id
print(f"BOS token ID: {bos_id}, EOS token ID: {eos_id}")
```
### Method: `indexes_containing_token`

Finds possible tokenizer indexes that, when decoded, may contain the input token.
#### Parameters

| Parameter | Type | Description |
|---|---|---|
| `token` | `str` | The token for which possible indexes are to be found. |
#### Returns

`List[int]`: a list of tokenizer indexes that might contain the token.
#### Example

```python
indexes = tokenizer.indexes_containing_token("▁the")
print(f"Indexes containing '▁the': {indexes}")
```
### Method: `encode`

Tokenizes a text prompt into a list of token IDs.
#### Parameters

| Parameter | Type | Description |
|---|---|---|
| `s` | `str` | The text prompt to tokenize. |
| `add_bos` | `bool` | If `True`, adds the beginning-of-sentence token (default: `True`). |
#### Returns

`List[int]`: a list of token IDs representing the text prompt.
#### Example

```python
encoded_ids = tokenizer.encode("Hello, world!", add_bos=True)
print(f"Encoded token IDs: {encoded_ids}")
```
### Method: `decode`

Detokenizes a list of token IDs into text.
#### Parameters

| Parameter | Type | Description |
|---|---|---|
| `t` | `List[int]` | A list of token IDs to detokenize. |
| `offset` | `Optional[int]` | Used for incremental decoding. Defaults to `None`, meaning it is not applied. |
#### Returns

`str`: the text representation of the decoded token IDs.
#### Example

```python
# encoded_ids already starts with the BOS token (encode was called with
# add_bos=True above), so it can be decoded directly.
decoded_text = tokenizer.decode(encoded_ids)
print(f"Decoded text: {decoded_text}")
```
### Method: `__call__`

Tokenizes prompts when the class instance is used as a callable.
#### Parameters

| Parameter | Type | Description |
|---|---|---|
| `s` | `Union[str, Sequence[str]]` | Text prompt(s) to tokenize. |
| `add_bos` | `bool` | If `True`, adds the beginning-of-sentence token (default: `False`). |
| `add_eos` | `bool` | If `True`, adds the end-of-sentence token (default: `False`). |
#### Returns

`addict.Addict`: an object whose `input_ids` attribute contains the list of token IDs.
#### Example

```python
input_data = tokenizer("Let's tokenize this sentence.")
print(f"Tokenized input IDs: {input_data.input_ids}")
```
## Additional Information and Tips

The library caches information internally for performance. For example, `indexes_containing_token` uses a deque to store the most recent lookups, which saves computation by avoiding repeated traversals of the vocabulary. A sketch of this pattern follows.
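This is an illustration of the caching idea, not the library's actual implementation; the cache capacity and lookup strategy are assumptions.

```python
from collections import deque

# Illustrative sketch: a bounded deque of (token, indexes) pairs acts as a
# small most-recently-used cache in front of the expensive vocabulary scan.
_recent_lookups = deque(maxlen=10)  # assumed capacity

def indexes_containing_token_cached(tokenizer, token):
    for cached_token, cached_indexes in _recent_lookups:
        if cached_token == token:
            return cached_indexes  # cache hit: skip the vocabulary traversal
    indexes = tokenizer.indexes_containing_token(token)
    _recent_lookups.append((token, indexes))  # oldest entry drops off when full
    return indexes
```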
## Conclusion

This documentation provides an in-depth explanation of `swarms.tokenizers`, with a focus on the `SentencePieceTokenizer` class. By following the examples and guidance above, users should be able to use the tokenizers effectively in their NLP tasks. For more advanced use cases and configurations, refer to the official SentencePiece documentation.