# vLLM Integration Guide

!!! info "Overview"

    vLLM is a high-performance and easy-to-use library for LLM inference and serving. This guide explains how to integrate vLLM with Swarms for efficient, production-grade language model deployment.

## Installation

!!! note "Prerequisites"

    Before you begin, make sure you have Python 3.8+ installed on your system.

=== "pip"
|
|
```bash
|
|
pip install -U vllm swarms
|
|
```
|
|
|
|
=== "poetry"
|
|
```bash
|
|
poetry add vllm swarms
|
|
```
|
|
|
|
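
To verify the installation, check that both packages import and print the installed vLLM version:

```bash
python -c "import vllm, swarms; print(vllm.__version__)"
```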
## Basic Usage

Here's a simple example of how to use vLLM with Swarms:

```python title="basic_usage.py"
from swarms.utils.vllm_wrapper import VLLMWrapper

# Initialize the vLLM wrapper
vllm = VLLMWrapper(
    model_name="meta-llama/Llama-2-7b-chat-hf",
    system_prompt="You are a helpful assistant.",
    temperature=0.7,
    max_tokens=4000,
)

# Run inference
response = vllm.run("What is the capital of France?")
print(response)
```

## VLLMWrapper Class

!!! abstract "Class Overview"

    The `VLLMWrapper` class provides a convenient interface for working with vLLM models.

### Key Parameters

| Parameter | Type | Description | Default |
|-----------|------|-------------|---------|
| `model_name` | str | Name of the model to use | "meta-llama/Llama-2-7b-chat-hf" |
| `system_prompt` | str | System prompt to use | None |
| `stream` | bool | Whether to stream the output | False |
| `temperature` | float | Sampling temperature | 0.5 |
| `max_tokens` | int | Maximum number of tokens to generate | 4000 |

### Example with Custom Parameters

```python title="custom_parameters.py"
from swarms.utils.vllm_wrapper import VLLMWrapper

vllm = VLLMWrapper(
    model_name="meta-llama/Llama-2-13b-chat-hf",
    system_prompt="You are an expert in artificial intelligence.",
    temperature=0.8,
    max_tokens=2000,
)
```
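
### Streaming Example

The `stream` parameter listed above enables incremental output. The sketch below is an illustration under an assumption: it assumes `run` yields text chunks when `stream=True`, which isn't documented here, so check the `VLLMWrapper` source before relying on it.

```python title="streaming_example.py"
from swarms.utils.vllm_wrapper import VLLMWrapper

# Assumption: with stream=True, run() yields text chunks rather than
# returning one string. Verify against the wrapper's implementation.
vllm = VLLMWrapper(
    model_name="meta-llama/Llama-2-7b-chat-hf",
    system_prompt="You are a helpful assistant.",
    stream=True,
)

for chunk in vllm.run("Explain tokenization in one paragraph."):
    print(chunk, end="", flush=True)
```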

## Integration with Agents

You can easily integrate vLLM with Swarms agents for more complex workflows:

```python title="agent_integration.py"
from swarms import Agent
from swarms.utils.vllm_wrapper import VLLMWrapper

# Initialize vLLM
vllm = VLLMWrapper(
    model_name="meta-llama/Llama-2-7b-chat-hf",
    system_prompt="You are a helpful assistant.",
)

# Create an agent with vLLM
agent = Agent(
    agent_name="Research-Agent",
    agent_description="Expert in conducting research and analysis",
    system_prompt="""You are an expert research agent. Your tasks include:
    1. Analyzing complex topics
    2. Providing detailed summaries
    3. Making data-driven recommendations""",
    llm=vllm,
    max_loops=1,
)

# Run the agent
response = agent.run("Research the impact of AI on healthcare")
```

## Advanced Features

### Batch Processing

!!! tip "Performance Optimization"

    Use batch processing for efficient handling of multiple tasks simultaneously.

```python title="batch_processing.py"
tasks = [
    "What is machine learning?",
    "Explain neural networks",
    "Describe deep learning",
]

results = vllm.batched_run(tasks, batch_size=3)
```

### Error Handling

!!! warning "Error Management"

    Always implement proper error handling in production environments.

```python title="error_handling.py"
from loguru import logger

try:
    response = vllm.run("Complex task")
except Exception as error:
    logger.error(f"Error occurred: {error}")
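
For transient failures, a retry loop with exponential backoff is a common complement to logging. Here is a minimal sketch using only the standard library and loguru; the attempt count and delays are illustrative defaults, not values from the Swarms docs. It reuses the `vllm` wrapper initialized earlier.

```python title="retry_with_backoff.py"
import time

from loguru import logger

def run_with_retries(task: str, max_attempts: int = 3, base_delay: float = 1.0) -> str:
    """Retry vllm.run with exponential backoff, re-raising after the last attempt."""
    for attempt in range(1, max_attempts + 1):
        try:
            return vllm.run(task)
        except Exception as error:
            logger.warning(f"Attempt {attempt}/{max_attempts} failed: {error}")
            if attempt == max_attempts:
                raise
            # Back off: 1s, 2s, 4s, ... before the next attempt
            time.sleep(base_delay * 2 ** (attempt - 1))

response = run_with_retries("Complex task")
```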

## Best Practices

!!! success "Recommended Practices"

    === "Model Selection"

        - Choose appropriate model sizes based on your requirements
        - Consider the trade-off between model size and inference speed

    === "System Resources"

        - Ensure sufficient GPU memory for your chosen model
        - Monitor resource usage during batch processing (see the monitoring sketch after this section)

    === "Prompt Engineering"

        - Use clear and specific system prompts
        - Structure user prompts for optimal results

    === "Error Handling"

        - Implement proper error handling and logging
        - Set up monitoring for production deployments

    === "Performance"

        - Use batch processing for multiple tasks
        - Adjust `max_tokens` based on your use case
        - Fine-tune `temperature` for optimal output quality
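
To put the "System Resources" advice into practice, here is a minimal monitoring sketch built on PyTorch's CUDA utilities (PyTorch is installed as a vLLM dependency). Note that `torch.cuda.memory_allocated` only reports memory allocated by PyTorch in the current process, so treat the figures as approximate for a vLLM-backed workload.

```python title="gpu_monitoring.py"
import torch
from loguru import logger

def log_gpu_memory(tag: str) -> None:
    """Log allocated vs. total memory for GPU 0 (approximate under vLLM)."""
    if not torch.cuda.is_available():
        logger.warning(f"[{tag}] No CUDA device available")
        return
    allocated_gib = torch.cuda.memory_allocated(0) / 1024**3
    total_gib = torch.cuda.get_device_properties(0).total_memory / 1024**3
    logger.info(f"[{tag}] GPU memory: {allocated_gib:.2f} / {total_gib:.2f} GiB")

# Reuses `vllm` and `tasks` from the batch-processing example above
log_gpu_memory("before batch")
results = vllm.batched_run(tasks, batch_size=3)
log_gpu_memory("after batch")
```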

## Example: Multi-Agent System

Here's an example of creating a multi-agent system using vLLM:

```python title="multi_agent_system.py"
from swarms import Agent, ConcurrentWorkflow
from swarms.utils.vllm_wrapper import VLLMWrapper

# Initialize vLLM
vllm = VLLMWrapper(
    model_name="meta-llama/Llama-2-7b-chat-hf",
    system_prompt="You are a helpful assistant.",
)

# Create specialized agents
research_agent = Agent(
    agent_name="Research-Agent",
    agent_description="Expert in research",
    system_prompt="You are a research expert.",
    llm=vllm,
)

analysis_agent = Agent(
    agent_name="Analysis-Agent",
    agent_description="Expert in analysis",
    system_prompt="You are an analysis expert.",
    llm=vllm,
)

# Create a workflow
agents = [research_agent, analysis_agent]

workflow = ConcurrentWorkflow(
    name="Research-Analysis-Workflow",
    description="Comprehensive research and analysis workflow",
    agents=agents,
)

# Run the workflow
result = workflow.run("Analyze the impact of renewable energy")
```