# Language Model Interface Documentation

## Table of Contents

1. [Introduction](#introduction)
2. [Abstract Language Model](#abstract-language-model)
    - [Initialization](#initialization)
    - [Attributes](#attributes)
    - [Methods](#methods)
3. [Implementation](#implementation)
4. [Usage Examples](#usage-examples)
5. [Additional Features](#additional-features)
6. [Performance Metrics](#performance-metrics)
7. [Logging and Checkpoints](#logging-and-checkpoints)
8. [Resource Utilization Tracking](#resource-utilization-tracking)
9. [Conclusion](#conclusion)

---
## 1. Introduction <a name="introduction"></a>

The Language Model Interface (`AbstractLLM`) is a flexible and extensible framework for working with various language models. This documentation provides a comprehensive guide to the interface, its attributes, methods, and usage examples. Whether you're using a pre-trained language model or building your own, the interface helps streamline text generation, chat, summarization, and more.
## 2. Abstract Language Model <a name="abstract-language-model"></a>

### Initialization <a name="initialization"></a>

The `AbstractLLM` class provides a common interface for language models. It can be initialized with various parameters to customize model behavior. Here are the initialization parameters:

| Parameter               | Description                                                               | Default Value |
|-------------------------|---------------------------------------------------------------------------|---------------|
| `model_name`            | The name of the language model to use.                                    | None          |
| `max_tokens`            | The maximum number of tokens in the generated text.                       | None          |
| `temperature`           | The temperature parameter for controlling randomness in text generation.  | None          |
| `top_k`                 | The top-k parameter for filtering tokens during sampling.                 | None          |
| `top_p`                 | The top-p (nucleus) parameter for filtering tokens during sampling.       | None          |
| `system_prompt`         | A system-level prompt to set context for generation.                      | None          |
| `beam_width`            | The beam width for beam search.                                           | None          |
| `num_return_sequences`  | The number of sequences to return in the output.                          | None          |
| `seed`                  | The random seed for reproducibility.                                      | None          |
| `frequency_penalty`     | The frequency penalty parameter for promoting word diversity.             | None          |
| `presence_penalty`      | The presence penalty parameter for discouraging repetition.               | None          |
| `stop_token`            | A stop token that marks the end of generated text.                        | None          |
| `length_penalty`        | The length penalty parameter for controlling output length.               | None          |
| `role`                  | The role of the language model (e.g., assistant, user).                   | None          |
| `max_length`            | The maximum length of generated sequences.                                | None          |
| `do_sample`             | Whether to use sampling during text generation.                           | None          |
| `early_stopping`        | Whether to use early stopping during text generation.                     | None          |
| `num_beams`             | The number of beams to use in beam search.                                | None          |
| `repetition_penalty`    | The repetition penalty parameter for discouraging repeated tokens.        | None          |
| `pad_token_id`          | The token ID used for padding.                                            | None          |
| `eos_token_id`          | The token ID that marks the end of a sequence.                            | None          |
| `bos_token_id`          | The token ID that marks the beginning of a sequence.                      | None          |
| `device`                | The device to run the model on (e.g., 'cpu' or 'cuda').                   | None          |
### Attributes <a name="attributes"></a>

- `model_name`: The name of the language model being used.
- `max_tokens`: The maximum number of tokens in generated text.
- `temperature`: The temperature parameter controlling randomness.
- `top_k`: The top-k parameter for token filtering.
- `top_p`: The top-p parameter for token filtering.
- `system_prompt`: A system-level prompt for context.
- `beam_width`: The beam width for beam search.
- `num_return_sequences`: The number of output sequences.
- `seed`: The random seed for reproducibility.
- `frequency_penalty`: The frequency penalty parameter.
- `presence_penalty`: The presence penalty parameter.
- `stop_token`: The stop token that marks the end of generated text.
- `length_penalty`: The length penalty parameter.
- `role`: The role of the language model.
- `max_length`: The maximum length of generated sequences.
- `do_sample`: Whether to use sampling during generation.
- `early_stopping`: Whether to use early stopping.
- `num_beams`: The number of beams in beam search.
- `repetition_penalty`: The repetition penalty parameter.
- `pad_token_id`: The token ID used for padding.
- `eos_token_id`: The token ID that marks the end of a sequence.
- `bos_token_id`: The token ID that marks the beginning of a sequence.
- `device`: The device used for model execution.
- `history`: A list holding the conversation history.
### Methods <a name="methods"></a>

The `AbstractLLM` class defines several methods for working with language models:
- `run(task: Optional[str] = None, *args, **kwargs) -> str`: Generate text using the language model. This method is abstract and must be implemented by subclasses.
- `arun(task: Optional[str] = None, *args, **kwargs)`: An asynchronous version of `run` for concurrent text generation.
- `batch_run(tasks: List[str], *args, **kwargs)`: Generate text for a batch of tasks.
- `abatch_run(tasks: List[str], *args, **kwargs)`: An asynchronous version of `batch_run` for concurrent batch generation.
- `chat(task: str, history: str = "") -> str`: Conduct a chat with the model, providing a conversation history.
- `__call__(task: str) -> str`: Call the model to generate text.
- `_tokens_per_second() -> float`: Calculate tokens generated per second.
- `_num_tokens(text: str) -> int`: Calculate the number of tokens in a text.
- `_time_for_generation(task: str) -> float`: Measure the time taken for text generation.
- `generate_summary(text: str) -> str`: Generate a summary of the provided text.
- `set_temperature(value: float)`: Set the temperature parameter.
- `set_max_tokens(value: int)`: Set the maximum number of tokens.
- `clear_history()`: Clear the conversation history.
- `enable_logging(log_file: str = "model.log")`: Initialize logging for the model.
- `log_event(message: str)`: Log an event.
- `save_checkpoint(checkpoint_dir: str = "checkpoints")`: Save the model state as a checkpoint.
- `load_checkpoint(checkpoint_path: str)`: Load the model state from a checkpoint.
- `toggle_creative_mode(enable: bool)`: Toggle creative mode for the model.
- `track_resource_utilization()`: Track and report resource utilization.
- `get_generation_time() -> float`: Get the time taken for text generation.
- `set_max_length(max_length: int)`: Set the maximum length of generated sequences.
- `set_model_name(model_name: str)`: Set the model name.
- `set_frequency_penalty(frequency_penalty: float)`: Set the frequency penalty parameter.
- `set_presence_penalty(presence_penalty: float)`: Set the presence penalty parameter.
- `set_stop_token(stop_token: str)`: Set the stop token.
- `set_length_penalty(length_penalty: float)`: Set the length penalty parameter.
- `set_role(role: str)`: Set the role of the model.
- `set_top_k(top_k: int)`: Set the top-k parameter.
- `set_top_p(top_p: float)`: Set the top-p parameter.
- `set_num_beams(num_beams: int)`: Set the number of beams.
- `set_do_sample(do_sample: bool)`: Set whether to use sampling.
- `set_early_stopping(early_stopping: bool)`: Set whether to use early stopping.
- `set_seed(seed: int)`: Set the random seed.
- `set_device(device: str)`: Set the device for model execution.
## 3. Implementation <a name="implementation"></a>

The `AbstractLLM` class serves as the base for implementing specific language models. Subclasses of `AbstractLLM` should implement the `run` method to define how text is generated for a given task. This design allows different language models to be integrated behind a common interface. A minimal sketch of such a subclass follows.
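For illustration only, here is a minimal sketch of a concrete subclass; `EchoLLM` and its echo-based generation logic are hypothetical stand-ins for a real model backend, and this assumes the constructor stores `model_name` and `max_tokens` as attributes as described above:

```python
from typing import Optional

from swarms.models import AbstractLLM


class EchoLLM(AbstractLLM):
    """Hypothetical subclass that simply echoes the task back."""

    def run(self, task: Optional[str] = None, *args, **kwargs) -> str:
        # A real implementation would tokenize the task, call the
        # underlying model, and decode the generated tokens. Here we
        # echo the task back, truncated to `max_tokens` words.
        words = (task or "").split()
        limit = self.max_tokens or len(words)
        return " ".join(words[:limit])


model = EchoLLM(model_name="echo-demo", max_tokens=8)
print(model.run("The quick brown fox jumps over the lazy dog"))
```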
## 4. Usage Examples <a name="usage-examples"></a>

To demonstrate how to use the `AbstractLLM` interface, let's create an example using a hypothetical language model. We'll initialize an instance of the model and generate text for a simple task.
```python
# `EchoLLM` is the hypothetical AbstractLLM subclass sketched in the
# Implementation section; AbstractLLM itself cannot be instantiated
# because `run` is abstract, so any concrete subclass works here.
language_model = EchoLLM(
    model_name="my_language_model",
    max_tokens=50,
    temperature=0.7,
    top_k=50,
    top_p=0.9,
    device="cuda",
)

# Generate text for a task
task = "Translate the following English text to French: 'Hello, world.'"
generated_text = language_model.run(task)

# Print the generated text
print(generated_text)
```

In this example, we've created an instance of our hypothetical language model, configured its parameters, and used the `run` method to generate text for a translation task.
## 5. Additional Features <a name="additional-features"></a>

The `AbstractLLM` interface provides additional features for customization and control:
- `batch_run`: Generate text for a batch of tasks efficiently.
- `arun` and `abatch_run`: Asynchronous versions of `run` and `batch_run` for concurrent text generation.
- `chat`: Conduct a conversation with the model by providing a history of the conversation.
- `__call__`: Allow the model to be called directly to generate text.
These features enhance the flexibility and utility of the interface in various applications, including chatbots, language translation, and content generation.
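As a brief sketch, assuming the hypothetical `EchoLLM` subclass from Section 3:

```python
# Hypothetical usage sketch; `EchoLLM` is the subclass from Section 3.
model = EchoLLM(model_name="echo-demo", max_tokens=50)

# Generate text for a batch of tasks
summaries = model.batch_run(["Summarize text A.", "Summarize text B."])

# Chat with an explicit conversation history
reply = model.chat("And in French?", history="User: Say hello.\nAssistant: Hello!")

# Call the model directly
haiku = model("Write a haiku about the sea.")
```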
## 6. Performance Metrics <a name="performance-metrics"></a>

The `AbstractLLM` class offers methods for tracking performance metrics:
- `_tokens_per_second`: Calculate tokens generated per second.
- `_num_tokens`: Calculate the number of tokens in a text.
- `_time_for_generation`: Measure the time taken for text generation.
These metrics help assess the efficiency and speed of text generation, enabling optimizations as needed.
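For illustration, here is one way such measurements might be taken by hand, assuming a concrete `model` instance such as `EchoLLM` above; the timing logic is a sketch, not the library's internal implementation:

```python
import time

# Time a single generation and derive a tokens-per-second figure.
task = "Explain beam search in one paragraph."

start = time.time()
output = model.run(task)
elapsed = time.time() - start

tokens = model._num_tokens(output)
print(f"{tokens} tokens in {elapsed:.2f}s ({tokens / elapsed:.1f} tokens/s)")
```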
## 7. Logging and Checkpoints <a name="logging-and-checkpoints"></a>

Logging and checkpointing are crucial for tracking model behavior and ensuring reproducibility:
- `enable_logging`: Initialize logging for the model.
- `log_event`: Log events and activities.
- `save_checkpoint`: Save the model state as a checkpoint.
- `load_checkpoint`: Load the model state from a checkpoint.
These capabilities aid in debugging, monitoring, and resuming model experiments.
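A sketch of the typical workflow, assuming a concrete `model` instance; the checkpoint file name below is hypothetical:

```python
# Turn on logging and record events around a generation run.
model.enable_logging(log_file="model.log")
model.log_event("Starting generation run")

output = model.run("Draft a product description.")
model.log_event("Generation finished")

# Save the model state, then restore it later.
model.save_checkpoint(checkpoint_dir="checkpoints")
model.load_checkpoint("checkpoints/checkpoint.ckpt")  # hypothetical file name
```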
## 8. Resource Utilization Tracking <a name="resource-utilization-tracking"></a>

The `track_resource_utilization` method is a placeholder for tracking and reporting resource utilization, such as CPU and memory usage. It can be customized to suit specific monitoring needs.
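As one possible customization, a subclass might override the placeholder using the third-party `psutil` package; this is a sketch under that assumption, not library behavior:

```python
import psutil  # third-party: pip install psutil


class MonitoredLLM(EchoLLM):
    """Hypothetical subclass that fills in the monitoring placeholder."""

    def track_resource_utilization(self):
        # Report process-wide CPU and memory usage via psutil.
        cpu = psutil.cpu_percent(interval=0.1)
        rss_mib = psutil.Process().memory_info().rss / (1024**2)
        self.log_event(f"CPU: {cpu:.1f}%, RSS: {rss_mib:.1f} MiB")
```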
## 9. Conclusion <a name="conclusion"></a>

The Language Model Interface (`AbstractLLM`) is a versatile framework for working with language models. Whether you're using pre-trained models or developing your own, this interface provides a consistent and extensible foundation. By following the provided guidelines and examples, you can integrate and customize language models for various natural language processing tasks.

---
# `BaseMultiModalModel` Documentation
Swarms is a Python library that provides a framework for running multimodal AI models. It lets you combine text and image inputs and generate coherent, context-aware responses. The library is designed to be extensible, so you can integrate various multimodal models.
## Table of Contents

1. [Introduction](#introduction)
2. [Installation](#installation)
3. [Getting Started](#getting-started)
4. [BaseMultiModalModel Class](#basemultimodalmodel-class)
    - [Initialization](#initialization)
    - [Methods](#methods)
5. [Usage Examples](#usage-examples)
6. [Additional Tips](#additional-tips)
7. [References and Resources](#references-and-resources)
## 1. Introduction <a name="introduction"></a>

Swarms is designed to simplify the process of working with multimodal AI models, which can understand and generate content from both textual and image inputs. With this library, you can run such models and receive context-aware responses.
## 2. Installation <a name="installation"></a>

To install swarms, you can use pip:

```bash
pip install swarms
```
## 3. Getting Started <a name="getting-started"></a>

To get started with Swarms, you'll need to import the library and create an instance of the `BaseMultiModalModel` class. This class serves as the foundation for running multimodal models.
```python
from swarms.models import BaseMultiModalModel

model = BaseMultiModalModel(
    model_name="your_model_name",
    temperature=0.5,
    max_tokens=500,
    max_workers=10,
    top_p=1,
    top_k=50,
    beautify=False,
    device="cuda",
    max_new_tokens=500,
    retries=3,
)
```
You can customize the initialization parameters based on your model's requirements.
## 4. BaseMultiModalModel Class <a name="basemultimodalmodel-class"></a>

### Initialization <a name="initialization"></a>

The `BaseMultiModalModel` class is initialized with several parameters that control its behavior. Here's a breakdown of the initialization parameters:

| Parameter        | Description                                                                | Default Value |
|------------------|-----------------------------------------------------------------------------|---------------|
| `model_name`     | The name of the multimodal model to use.                                    | None          |
| `temperature`    | The temperature parameter for controlling randomness in text generation.    | 0.5           |
| `max_tokens`     | The maximum number of tokens in the generated text.                         | 500           |
| `max_workers`    | The maximum number of concurrent workers for running tasks.                 | 10            |
| `top_p`          | The top-p parameter for filtering words in text generation.                 | 1             |
| `top_k`          | The top-k parameter for filtering words in text generation.                 | 50            |
| `beautify`       | Whether to beautify the output text.                                        | False         |
| `device`         | The device to run the model on (e.g., 'cuda' or 'cpu').                     | 'cuda'        |
| `max_new_tokens` | The maximum number of new tokens allowed in generated responses.            | 500           |
| `retries`        | The number of retries in case of an error during text generation.           | 3             |
| `system_prompt`  | A system-level prompt to set context for generation.                        | None          |
| `meta_prompt`    | A meta prompt providing guidance for including image labels in responses.   | None          |
### Methods <a name="methods"></a>

The `BaseMultiModalModel` class defines various methods for running multimodal models and managing interactions; a short sketch of the image helpers appears after the list:

- `run(task: str, img: str) -> str`: Run the multimodal model with a text task and an image URL to generate a response.
- `arun(task: str, img: str) -> str`: Run the multimodal model asynchronously with a text task and an image URL to generate a response.
- `get_img_from_web(img: str) -> Image`: Fetch an image from a URL and return it as a PIL Image.
- `encode_img(img: str) -> str`: Encode an image to base64 format.
- `get_img(img: str) -> Image`: Load an image from the local file system and return it as a PIL Image.
- `clear_chat_history()`: Clear the chat history maintained by the model.
- `run_many(tasks: List[str], imgs: List[str]) -> List[str]`: Run the model on multiple text tasks and image URLs concurrently and return a list of responses.
- `run_batch(tasks_images: List[Tuple[str, str]]) -> List[str]`: Process a batch of text tasks and image URLs and return a list of responses.
- `run_batch_async(tasks_images: List[Tuple[str, str]]) -> List[str]`: Process a batch of text tasks and image URLs asynchronously and return a list of responses.
- `run_batch_async_with_retries(tasks_images: List[Tuple[str, str]]) -> List[str]`: Process a batch of text tasks and image URLs asynchronously, retrying on errors, and return a list of responses.
- `unique_chat_history() -> List[str]`: Get the unique chat history stored by the model.
- `run_with_retries(task: str, img: str) -> str`: Run the model with retries in case of an error.
- `run_batch_with_retries(tasks_images: List[Tuple[str, str]]) -> List[str]`: Run a batch of tasks with retries in case of errors and return a list of responses.
- `_tokens_per_second() -> float`: Calculate the tokens generated per second during text generation.
- `_time_for_generation(task: str) -> float`: Measure the time taken for text generation for a specific task.
- `generate_summary(text: str) -> str`: Generate a summary of the provided text.
- `set_temperature(value: float)`: Set the temperature parameter for controlling randomness in text generation.
- `set_max_tokens(value: int)`: Set the maximum number of tokens allowed in generated responses.
- `get_generation_time() -> float`: Get the time taken for text generation for the last task.
- `get_chat_history() -> List[str]`: Get the chat history, including all interactions.
- `get_unique_chat_history() -> List[str]`: Get the unique chat history, with duplicate interactions removed.
- `get_chat_history_length() -> int`: Get the length of the chat history.
- `get_unique_chat_history_length() -> int`: Get the length of the unique chat history.
- `get_chat_history_tokens() -> int`: Get the total number of tokens in the chat history.
- `print_beautiful(content: str, color: str = 'cyan')`: Print content using colored text.
- `stream(content: str)`: Stream the content, printing it character by character.
- `meta_prompt() -> str`: Get the meta prompt that provides guidance for including image labels in responses.
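A sketch of how the image helpers might be combined, assuming the `model` instance from the Getting Started section; the local file path is hypothetical:

```python
# Fetch an image from a URL as a PIL Image.
pil_image = model.get_img_from_web("https://www.example.com/image.jpg")

# Load an image from disk as a PIL Image.
local_image = model.get_img("images/photo.jpg")  # hypothetical local path

# Encode a local image to base64 for APIs that expect inline image data.
b64_payload = model.encode_img("images/photo.jpg")
```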
## 5. Usage Examples <a name="usage-examples"></a>

Let's explore some usage examples of the library:

### Example 1: Running the Model
```python
# Import the library
from swarms.models import BaseMultiModalModel

# Create an instance of the model
model = BaseMultiModalModel(
    model_name="your_model_name",
    temperature=0.5,
    max_tokens=500,
    device="cuda",
)

# Run the model with a text task and an image URL
response = model.run(
    "Generate a summary of this text",
    "https://www.example.com/image.jpg",
)
print(response)
```
### Example 2: Running Multiple Tasks Concurrently

```python
# Import the library
from swarms.models import BaseMultiModalModel

# Create an instance of the model
model = BaseMultiModalModel(
    model_name="your_model_name",
    temperature=0.5,
    max_tokens=500,
    max_workers=4,
    device="cuda",
)

# Define a list of tasks and image URLs
tasks = ["Task 1", "Task 2", "Task 3"]
images = ["https://image1.jpg", "https://image2.jpg", "https://image3.jpg"]

# Run the model on multiple tasks concurrently
responses = model.run_many(tasks, images)
for response in responses:
    print(response)
```
### Example 3: Running the Model Asynchronously

```python
# Import the library
from swarms.models import BaseMultiModalModel

# Create an instance of the model
model = BaseMultiModalModel(
    model_name="your_model_name",
    temperature=0.5,
    max_tokens=500,
    device="cuda",
)

# Define a list of (task, image URL) pairs
tasks_images = [
    ("Task 1", "https://image1.jpg"),
    ("Task 2", "https://image2.jpg"),
    ("Task 3", "https://image3.jpg"),
]

# Run the model on the batch of tasks asynchronously
responses = model.run_batch_async(tasks_images)
for response in responses:
    print(response)
```
### Example 4: Inheriting from `BaseMultiModalModel`
```python
from swarms.models import BaseMultiModalModel


class CustomMultiModalModel(BaseMultiModalModel):
    def __init__(self, model_name, custom_parameter, *args, **kwargs):
        # Call the parent class constructor
        super().__init__(model_name=model_name, *args, **kwargs)
        # Initialize custom parameters specific to your model
        self.custom_parameter = custom_parameter

    def __call__(self, text, img):
        # Implement the multimodal model logic here.
        # You can use self.custom_parameter and other inherited attributes.
        pass

    def generate_summary(self, text):
        # Implement the summary generation logic using your model.
        # You can use self.custom_parameter and other inherited attributes.
        pass


# Create an instance of your custom multimodal model
custom_model = CustomMultiModalModel(
    model_name="your_custom_model_name",
    custom_parameter="your_custom_value",
    temperature=0.5,
    max_tokens=500,
    device="cuda",
)

# Run your custom model
response = custom_model.run(
    "Generate a summary of this text",
    "https://www.example.com/image.jpg",
)
print(response)

# Generate a summary using your custom model
summary = custom_model.generate_summary("This is a sample text to summarize.")
print(summary)
```
In the code above:

1. We define a `CustomMultiModalModel` class that inherits from `BaseMultiModalModel`.
2. In the constructor of our custom class, we call the parent class constructor using `super()` and initialize any custom parameters specific to our model. In this example, we introduce a `custom_parameter`.
3. We override the `__call__` method, which is responsible for running the multimodal model logic, taking both text and image inputs into account.
4. We override the `generate_summary` method, where you can implement custom summarization logic for text input.
5. We create an instance of our custom model, passing the required parameters, including the custom parameter.
6. We demonstrate how to run the custom model and generate a summary with it.
By inheriting from `BaseMultiModalModel`, you can leverage the library's prebuilt features and methods while customizing the behavior of your multimodal model. This allows you to create powerful, specialized models for various multimodal tasks.

These examples demonstrate how to run multimodal models with text and image inputs. You can adjust the parameters and methods to suit your specific use cases.
## 6. Additional Tips <a name="additional-tips"></a>

Here are some additional tips and considerations for using the library effectively:

- **Custom Models**: You can create your own multimodal models and inherit from the `BaseMultiModalModel` class to integrate them with this library.
- **Retries**: Where text generation might fail for various reasons (e.g., server issues), the methods with retries can be helpful; see the sketch after this list.
- **Monitoring**: You can monitor the performance of your model using methods like `_tokens_per_second()` and `_time_for_generation()`.
- **Chat History**: The library maintains a chat history, allowing you to keep track of interactions.
- **Streaming**: The `stream()` method can be useful for displaying output character by character.
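A short sketch of the retry and streaming helpers, assuming the `model` instance from Example 1:

```python
# Retry the run automatically if generation fails.
response = model.run_with_retries(
    "Describe this image", "https://www.example.com/image.jpg"
)

# Display the response character by character.
model.stream(response)
```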
## 7. References and Resources <a name="references-and-resources"></a>

Here are some references and resources that you may find useful for working with multimodal models:

- [Hugging Face Transformers Library](https://huggingface.co/transformers/): A library for working with various transformer-based models.
- [Pillow (PIL fork)](https://pillow.readthedocs.io/en/stable/): Documentation for working with images in Python using the Pillow library.
- [Concurrent Programming in Python](https://docs.python.org/3/library/concurrent.futures.html): Official Python documentation for concurrent programming.
- [Requests Library Documentation](https://docs.python-requests.org/en/latest/): Documentation for the Requests library, used for making HTTP requests.
- [Base64 Encoding in Python](https://docs.python.org/3/library/base64.html): Official Python documentation for base64 encoding and decoding.
This concludes the documentation for the `BaseMultiModalModel` interface. You can now explore the library further and integrate it with your multimodal AI projects.