pull/197/head
Kye 1 year ago
parent dfa4197b04
commit 66d9f7086b

@@ -1,251 +1,201 @@
# `GPT4Vision` Documentation # `GPT4VisionAPI` Documentation
## Table of Contents **Table of Contents**
- [Overview](#overview) - [Introduction](#introduction)
- [Installation](#installation) - [Installation](#installation)
- [Initialization](#initialization) - [Module Overview](#module-overview)
- [Methods](#methods) - [Class: GPT4VisionAPI](#class-gpt4visionapi)
- [process_img](#process_img) - [Initialization](#initialization)
- [__call__](#__call__) - [Methods](#methods)
- [encode_image](#encode_image)
- [run](#run) - [run](#run)
- [arun](#arun) - [__call__](#__call__)
- [Configuration Options](#configuration-options) - [Examples](#examples)
- [Usage Examples](#usage-examples) - [Example 1: Basic Usage](#example-1-basic-usage)
- [Additional Tips](#additional-tips) - [Example 2: Custom API Key](#example-2-custom-api-key)
- [References and Resources](#references-and-resources) - [Example 3: Adjusting Maximum Tokens](#example-3-adjusting-maximum-tokens)
- [Additional Information](#additional-information)
--- - [References](#references)
## Overview
The GPT4Vision Model API is designed to provide an easy-to-use interface for interacting with the OpenAI GPT-4 Vision model. This model can generate textual descriptions for images and answer questions related to visual content. Whether you want to describe images or perform other vision-related tasks, GPT4Vision makes it simple and efficient.
The library offers a straightforward way to send images and tasks to the GPT-4 Vision model and retrieve the generated responses. It handles API communication, authentication, and retries, making it a powerful tool for developers working with computer vision and natural language processing tasks.
## Installation
To use the GPT4Vision Model API, you need to install the required dependencies and configure your environment. Follow these steps to get started:
1. Install the required Python package:
```bash
pip3 install --upgrade swarms
```
2. Make sure you have an OpenAI API key. You can obtain one by signing up on the [OpenAI platform](https://beta.openai.com/signup/).
3. Set your OpenAI API key as an environment variable. You can do this in your code or your environment configuration. Alternatively, you can provide the API key directly when initializing the `GPT4Vision` class.
## Initialization ## Introduction<a name="introduction"></a>
To start using the GPT4Vision Model API, you need to create an instance of the `GPT4Vision` class. You can customize its behavior by providing various configuration options, but it also comes with sensible defaults. Welcome to the documentation for the `GPT4VisionAPI` module! This module is a powerful wrapper for the OpenAI GPT-4 Vision model. It allows you to interact with the model to generate descriptions or answers related to images. This documentation will provide you with comprehensive information on how to use this module effectively.
Here's how you can initialize the `GPT4Vision` class: ## Installation<a name="installation"></a>
```python Before you start using the `GPT4VisionAPI` module, make sure you have the required dependencies installed. You can install them using the following commands:
from swarms.models.gpt4v import GPT4Vision
gpt4vision = GPT4Vision( ```bash
api_key="Your Key" pip3 install --upgrade swarms
)
``` ```
The above code initializes the `GPT4Vision` class with default settings. You can adjust these settings as needed. ## Module Overview<a name="module-overview"></a>
## Methods The `GPT4VisionAPI` module serves as a bridge between your application and the OpenAI GPT-4 Vision model. It allows you to send requests to the model and retrieve responses related to images. Here are some key features and functionality provided by this module:
### `process_img` - Encoding images to base64 format.
- Running the GPT-4 Vision model with specified tasks and images.
- Customization options such as setting the OpenAI API key and maximum token limit.
The `process_img` method is used to preprocess an image before sending it to the GPT-4 Vision model. It takes the image path as input and returns the processed image in a format suitable for API requests. ## Class: GPT4VisionAPI<a name="class-gpt4visionapi"></a>
```python The `GPT4VisionAPI` class is the core component of this module. It encapsulates the functionality required to interact with the GPT-4 Vision model. Below, we'll dive into the class in detail.
processed_img = gpt4vision.process_img(img_path)
```
- `img_path` (str): The file path or URL of the image to be processed. ### Initialization<a name="initialization"></a>
### `__call__` When initializing the `GPT4VisionAPI` class, you have the option to provide the OpenAI API key and set the maximum token limit. Here are the parameters and their descriptions:
The `__call__` method is the main method for interacting with the GPT-4 Vision model. It sends the image and tasks to the model and returns the generated response.
```python
response = gpt4vision(img, tasks)
```
- `img` (Union[str, List[str]]): Either a single image URL or a list of image URLs to be used for the API request. | Parameter | Type | Default Value | Description |
- `tasks` (List[str]): A list of tasks or questions related to the image(s). |---------------------|----------|-------------------------------|----------------------------------------------------------------------------------------------------------|
| openai_api_key | str | `OPENAI_API_KEY` environment variable (if available) | The OpenAI API key. If not provided, it defaults to the `OPENAI_API_KEY` environment variable. |
| max_tokens | int | 300 | The maximum number of tokens to generate in the model's response. |
This method returns a `GPT4VisionResponse` object, which contains the generated answer. Here's how you can initialize the `GPT4VisionAPI` class:
### `run` ```python
from swarms.models import GPT4VisionAPI
The `run` method is an alternative way to interact with the GPT-4 Vision model. It takes a single task and image URL as input and returns the generated response. # Initialize with default API key and max_tokens
api = GPT4VisionAPI()
```python # Initialize with custom API key and max_tokens
response = gpt4vision.run(task, img) custom_api_key = "your_custom_api_key"
api = GPT4VisionAPI(openai_api_key=custom_api_key, max_tokens=500)
``` ```
- `task` (str): The task or question related to the image. ### Methods<a name="methods"></a>
- `img` (str): The image URL to be used for the API request.
This method simplifies interactions when dealing with a single task and image.
### `arun` #### encode_image<a name="encode_image"></a>
The `arun` method is an asynchronous version of the `run` method. It allows for asynchronous processing of API requests, which can be useful in certain scenarios. This method allows you to encode an image from a URL to base64 format. It's a utility function used internally by the module.
```python ```python
import asyncio def encode_image(img: str) -> str:
"""
Encode image to base64.
async def main(): Parameters:
response = await gpt4vision.arun(task, img) - img (str): URL of the image to encode.
print(response)
loop = asyncio.get_event_loop() Returns:
loop.run_until_complete(main()) str: Base64 encoded image.
"""
``` ```
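As a rough illustration of what base64 encoding of an image involves, the sketch below encodes raw bytes with Python's standard `base64` module. The helper names `encode_image_bytes` and `to_data_url` are hypothetical and are not part of `GPT4VisionAPI`; how the module fetches the image from its URL is not shown in this documentation, so the sketch starts from bytes already in memory.

```python
import base64


def encode_image_bytes(data: bytes) -> str:
    """Return the base64 text representation of raw image bytes."""
    # b64encode returns bytes; decode to get a plain str
    return base64.b64encode(data).decode("utf-8")


def to_data_url(data: bytes, mime: str = "image/jpeg") -> str:
    """Wrap base64 text in a data URL (hypothetical helper for illustration)."""
    return f"data:{mime};base64,{encode_image_bytes(data)}"
```

In practice the bytes would come from reading a file or downloading the image; the encoded string can then be embedded in a JSON request body.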
- `task` (str): The task or question related to the image. #### run<a name="run"></a>
- `img` (str): The image URL to be used for the API request.
## Configuration Options
The `GPT4Vision` class provides several configuration options that allow you to customize its behavior:
- `max_retries` (int): The maximum number of retries to make to the API. Default: 3 The `run` method is the primary way to interact with the GPT-4 Vision model. It sends a request to the model with a task and an image URL, and it returns the model's response.
- `backoff_factor` (float): The backoff factor to use for exponential backoff. Default: 2.0
- `timeout_seconds` (int): The timeout in seconds for the API request. Default: 10
- `api_key` (str): The API key to use for the API request. Default: None (set via environment variable)
- `quality` (str): The image detail level used for the request. Options: 'low' or 'high'. Default: 'low'
- `max_tokens` (int): The maximum number of tokens to use for the API request. Default: 200
## Usage Examples ```python
def run(task: str, img: str) -> str:
"""
Run the GPT-4 Vision model.
### Example 1: Generating Image Descriptions Parameters:
- task (str): The task or question related to the image.
- img (str): URL of the image to analyze.
```python Returns:
gpt4vision = GPT4Vision() str: The model's response.
img = "https://example.com/image.jpg" """
tasks = ["Describe this image."]
response = gpt4vision(img, tasks)
print(response.answer)
``` ```
In this example, we create an instance of `GPT4Vision`, provide an image URL, and ask the model to describe the image. The response contains the generated description. #### __call__<a name="__call__"></a>
### Example 2: Custom Configuration The `__call__` method is a convenient way to run the GPT-4 Vision model. It has the same functionality as the `run` method.
```python ```python
custom_config = { def __call__(task: str, img: str) -> str:
"max_retries": 5, """
"timeout_seconds": 20, Run the GPT-4 Vision model (callable).
"quality": "high",
"max_tokens": 300,
}
gpt4vision = GPT4Vision(**custom_config)
img = "https://example.com/another_image.jpg"
tasks = ["What objects can you identify in this image?"]
response = gpt4vision(img, tasks)
print(response.answer)
```
In this example, we create an instance of `GPT4Vision` with custom configuration options. We set a higher timeout, request high-quality images, and allow more tokens in the response. Parameters:
- task (str): The task or question related to the image.
- img (str): URL of the image to analyze.
### Example 3: Using the `run` Method
```python Returns:
gpt4vision = GPT4Vision() str: The model's response.
img = "https://example.com/image.jpg" """
task = "Describe this image in detail."
response = gpt4vision.run(task, img)
print(response)
``` ```
In this example, we use the `run` method to simplify the interaction by providing a single task and image URL. ## Examples<a name="examples"></a>
# Model Usage and Image Understanding Let's explore some usage examples of the `GPT4VisionAPI` module to better understand how to use it effectively.
When several images are passed in one request, the GPT-4 Vision model can answer questions about all of them together or about each one independently. Here's an overview: ### Example 1: Basic Usage<a name="example-1-basic-usage"></a>
| Purpose | Description | In this example, we'll use the module with the default API key and maximum tokens to analyze an image.
| --------------------------------------- | ---------------------------------------------------------------------------------------------------------------- |
| Image Understanding | The model is shown two copies of the same image and can answer questions about both or each of the images independently. |
# Image Detail Control ```python
from swarms.models import GPT4VisionAPI
You have control over how the model processes the image and generates textual understanding by using the `detail` parameter, which has two options: `low` and `high`. # Initialize with default API key and max_tokens
api = GPT4VisionAPI()
| Detail | Description | # Define the task and image URL
| -------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | task = "What is the color of the object?"
| low | Disables the "high-res" model. The model receives a low-res 512 x 512 version of the image and represents the image with a budget of 65 tokens. Ideal for use cases not requiring high detail. | img = "https://i.imgur.com/2M2ZGwC.jpeg"
| high | Enables "high-res" mode. The model first sees the low-res image and then creates detailed crops of input images as 512px squares based on the input image size. Uses a total of 129 tokens. |
# Managing Images # Run the GPT-4 Vision model
response = api.run(task, img)
To use the Chat Completions API effectively, you must manage the images you pass to the model. Here are some key considerations: # Print the model's response
print(response)
```
| Management Aspect | Description | ### Example 2: Custom API Key<a name="example-2-custom-api-key"></a>
| ------------------------- | ------------------------------------------------------------------------------------------------- |
| Image Reuse | To pass the same image multiple times, include the image with each API request. |
| Image Size Optimization | Improve latency by downsizing images to meet the expected size requirements. |
| Image Deletion | After processing, images are deleted from OpenAI servers and not retained. No data is used for training. |
# Limitations If you have a custom API key, you can initialize the module with it as shown in this example.
While GPT-4 with Vision is powerful, it has some limitations: ```python
from swarms.models import GPT4VisionAPI
| Limitation | Description | # Initialize with custom API key and max_tokens
| -------------------------------------------- | --------------------------------------------------------------------------------------------------- | custom_api_key = "your_custom_api_key"
| Medical Images | Not suitable for interpreting specialized medical images like CT scans. | api = GPT4VisionAPI(openai_api_key=custom_api_key, max_tokens=500)
| Non-English Text | May not perform optimally when handling non-Latin alphabets, such as Japanese or Korean. |
| Small Text in Images | Enlarge text within images for readability, but avoid cropping important details. |
| Rotated or Upside-Down Text/Images | May misinterpret rotated or upside-down text or images. |
| Complex Visual Elements | May struggle to understand complex graphs or text with varying colors or styles. |
| Spatial Reasoning | Struggles with tasks requiring precise spatial localization, such as identifying chess positions. |
| Accuracy | May generate incorrect descriptions or captions in certain scenarios. |
| Panoramic and Fisheye Images | Struggles with panoramic and fisheye images. |
# Calculating Costs # Define the task and image URL
task = "What is the object in the image?"
img = "https://i.imgur.com/3T3ZHwD.jpeg"
Image inputs are metered and charged in tokens. The token cost depends on the image size and detail option. # Run the GPT-4 Vision model
response = api.run(task, img)
| Example | Token Cost | # Print the model's response
| --------------------------------------------- | ----------- | print(response)
| 1024 x 1024 square image in detail: high mode | 765 tokens | ```
| 2048 x 4096 image in detail: high mode | 1105 tokens |
| 4096 x 8192 image in detail: low mode | 85 tokens |
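The three figures in the cost table are consistent with a simple tiling rule, sketched here as an assumption rather than as the billing implementation: low detail costs a flat 85 tokens; high detail scales the image to fit within 2048 x 2048, then scales it so the shortest side is 768px, and charges 170 tokens per 512px tile plus an 85-token base. The function `image_token_cost` is a hypothetical helper, not part of `swarms`, and the choice to only downscale (never upscale) small images is also an assumption.

```python
import math


def image_token_cost(width: int, height: int, detail: str = "high") -> int:
    """Estimate image token cost under the assumed tiling rule."""
    if detail == "low":
        return 85  # flat cost, independent of image size

    # Step 1: fit within a 2048 x 2048 square, keeping aspect ratio
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale

    # Step 2: shrink so the shortest side is at most 768px
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale

    # Step 3: count 512px tiles; 170 tokens each plus an 85-token base
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles
```

For example, a 1024 x 1024 image is resized to 768 x 768, which needs four tiles: 4 x 170 + 85 = 765 tokens, matching the first row of the table.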
# FAQ ### Example 3: Adjusting Maximum Tokens<a name="example-3-adjusting-maximum-tokens"></a>
Here are some frequently asked questions about GPT-4 with Vision: You can also customize the maximum token limit when initializing the module. In this example, we set it to 1000 tokens.
| Question | Answer | ```python
| -------------------------------------------- | -------------------------------------------------------------------------------------------------- | from swarms.models import GPT4VisionAPI
| Fine-Tuning Image Capabilities | No, fine-tuning the image capabilities of GPT-4 is not supported at this time. |
| Generating Images | GPT-4 is used for understanding images, not generating them. |
| Supported Image File Types | Supported image file types include PNG (.png), JPEG (.jpeg and .jpg), WEBP (.webp), and non-animated GIF (.gif). |
| Image Size Limitations | Image uploads are restricted to 20MB per image. |
| Image Deletion | Uploaded images are automatically deleted after processing by the model. |
| Learning More | For more details about GPT-4 with Vision, refer to the GPT-4 with Vision system card. |
| CAPTCHA Submission | CAPTCHAs are blocked for safety reasons. |
| Rate Limits | Image processing counts toward your tokens per minute (TPM) limit. Refer to the calculating costs section for details. |
| Image Metadata | The model does not receive image metadata. |
| Handling Unclear Images | If an image is unclear, the model will do its best to interpret it, but results may be less accurate. |
# Initialize with default API key and custom max_tokens
api = GPT4VisionAPI(max_tokens=1000)
# Define the task and image URL
task = "Describe the scene in the image."
img = "https://i.imgur.com/4P4ZRxU.jpeg"
## Additional Tips # Run the GPT-4 Vision model
response = api.run(task, img)
- Make sure to handle potential exceptions and errors when making API requests. The library includes retries and error handling, but it's essential to handle exceptions gracefully in your code. # Print the model's response
- Experiment with different configuration options to optimize the trade-off between response quality and response time based on your specific requirements. print(response)
```
## References and Resources ## Additional Information<a name="additional-information"></a>
- [OpenAI Platform](https://beta.openai.com/signup/): Sign up for an OpenAI API key. - If you encounter any errors or issues with the module, make sure to check your API key and internet connectivity.
- [OpenAI API Documentation](https://platform.openai.com/docs/api-reference/chat/create): Official API documentation for the GPT-4 Vision model. - Handle exceptions raised by the module so that a failed request does not crash your application.
- You can further customize the module to fit your specific use case by modifying the code as needed.
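One way to handle failed requests gracefully is a small retry helper with exponential backoff. The sketch below is a generic pattern, not the module's own retry logic: `call_with_retries` is a hypothetical name, and its defaults simply mirror the `max_retries` of 3 and `backoff_factor` of 2.0 mentioned in this documentation. The exception types the module raises are not documented here, so the sketch catches `Exception` broadly; real code should narrow this.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    fn: Callable[[], T],
    max_retries: int = 3,
    backoff_factor: float = 2.0,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on any exception with exponentially growing delays."""
    if max_retries < 0:
        raise ValueError("max_retries must be non-negative")
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            # Out of retries: re-raise the last error to the caller
            if attempt == max_retries:
                raise
            # Wait base_delay, then base_delay * factor, and so on
            sleep(base_delay * backoff_factor ** attempt)
    raise RuntimeError("unreachable")
```

A call such as `call_with_retries(lambda: api.run(task, img))` would then retry a failing request up to three times before surfacing the error. The `sleep` parameter exists only so the delay can be replaced in tests.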
Now you have a comprehensive understanding of the GPT4Vision Model API, its configuration options, and how to use it for various computer vision and natural language processing tasks. Start experimenting and integrating it into your projects to leverage the power of GPT-4 Vision for image-related tasks. ## References<a name="references"></a>
# Conclusion - [OpenAI API Documentation](https://beta.openai.com/docs/)
With GPT-4 Vision, you have a powerful tool for understanding and generating textual descriptions for images. By considering its capabilities, limitations, and cost calculations, you can effectively leverage this model for various image-related tasks. This documentation provides a comprehensive guide on how to use the `GPT4VisionAPI` module effectively. It covers initialization, methods, usage examples, and additional information to ensure a smooth experience when working with the GPT-4 Vision model.