# Autonomous Evaluation Implementation Summary

## 🎯 Feature Overview

I have successfully implemented the autonomous evaluation feature for AutoSwarmBuilder as requested in issue #939. This feature creates an iterative improvement loop where agents are built, evaluated, and improved automatically based on feedback.

## 🔧 Implementation Details

### Core Architecture
- **Task** → **Build Agents** → **Run/Execute** → **Evaluate/Judge** → **Next Loop with Improved Agents**

### Key Components Added

#### 1. Data Models
- `EvaluationResult`: Stores comprehensive evaluation data for each iteration
- `IterativeImprovementConfig`: Configuration for the evaluation process

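For reference, here is a minimal sketch of these two models, assuming Pydantic `BaseModel` as used elsewhere in swarms; field names and defaults mirror the configuration options listed below and the `EvaluationResult` schema shown in the documentation file, and the authoritative definitions live in `swarms/structs/auto_swarm_builder.py`.

```python
from typing import Dict, List

from pydantic import BaseModel, Field


class IterativeImprovementConfig(BaseModel):
    """Configuration for the autonomous evaluation loop (sketch)."""

    max_iterations: int = 3
    improvement_threshold: float = 0.1
    evaluation_dimensions: List[str] = Field(
        default=["accuracy", "helpfulness", "coherence", "instruction_adherence"]
    )
    use_judge_agent: bool = True
    store_all_iterations: bool = True


class EvaluationResult(BaseModel):
    """Evaluation data captured for a single iteration (sketch)."""

    iteration: int
    task: str
    output: str
    evaluation_scores: Dict[str, float]  # dimension name -> score in [0.0, 1.0]
    feedback: str
    strengths: List[str]
    weaknesses: List[str]
    suggestions: List[str]
```
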
#### 2. Enhanced AutoSwarmBuilder
- Added `enable_evaluation` parameter to toggle autonomous evaluation
- Integrated CouncilAsAJudge for multi-dimensional evaluation
- Created improvement strategist agent for analyzing feedback

#### 3. Evaluation System
- Multi-dimensional evaluation (accuracy, helpfulness, coherence, instruction adherence)
- Autonomous feedback generation and parsing
- Performance tracking across iterations
- Best iteration identification

#### 4. Iterative Improvement Loop
- `_run_with_autonomous_evaluation()`: Main evaluation loop
- `_evaluate_swarm_output()`: Evaluates each iteration's output (see the sketch below)
- `create_agents_with_feedback()`: Creates improved agents based on feedback
- `_generate_improvement_suggestions()`: AI-driven improvement recommendations

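A simplified sketch of the evaluation step is shown below. It assumes the judge council exposes a `run()` method returning free-text feedback (as the test suite's mocks suggest) and reuses the `_extract_dimension_score()` and `_parse_feedback()` helpers; the attribute name `judge_council` is illustrative, and the real method in `auto_swarm_builder.py` may differ in detail.

```python
def _evaluate_swarm_output(self, task: str, output: str, iteration: int) -> EvaluationResult:
    """Judge one iteration's output and package it as an EvaluationResult (sketch)."""
    # Ask the judge council for free-text, multi-dimensional feedback.
    feedback = self.judge_council.run(
        f"Task: {task}\n\nOutput to evaluate:\n{output}"
    )

    # Convert the feedback into per-dimension scores in [0.0, 1.0].
    scores = {
        dimension: self._extract_dimension_score(feedback, dimension)
        for dimension in self.evaluation_config.evaluation_dimensions
    }

    # Split the feedback into strengths, weaknesses, and suggestions.
    strengths, weaknesses, suggestions = self._parse_feedback(feedback)

    return EvaluationResult(
        iteration=iteration,
        task=task,
        output=output,
        evaluation_scores=scores,
        feedback=feedback,
        strengths=strengths,
        weaknesses=weaknesses,
        suggestions=suggestions,
    )
```
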
## 📁 Files Modified/Created
### Core Implementation
- **`swarms/structs/auto_swarm_builder.py`**: Enhanced with autonomous evaluation capabilities

### Documentation
- **`docs/swarms/structs/autonomous_evaluation.md`**: Comprehensive documentation
- **`AUTONOMOUS_EVALUATION_IMPLEMENTATION.md`**: This implementation summary

### Examples and Tests
- **`examples/autonomous_evaluation_example.py`**: Working examples
- **`tests/structs/test_autonomous_evaluation.py`**: Comprehensive test suite

## 🚀 Usage Example

```python
from swarms.structs.auto_swarm_builder import (
    AutoSwarmBuilder,
    IterativeImprovementConfig,
)

# Configure evaluation
eval_config = IterativeImprovementConfig(
    max_iterations=3,
    improvement_threshold=0.1,
    evaluation_dimensions=["accuracy", "helpfulness", "coherence"],
)

# Create swarm with evaluation enabled
swarm = AutoSwarmBuilder(
    name="AutonomousResearchSwarm",
    description="A self-improving research swarm",
    enable_evaluation=True,
    evaluation_config=eval_config,
)

# Run with autonomous evaluation
result = swarm.run("Research quantum computing developments")

# Access evaluation results
evaluations = swarm.get_evaluation_results()
best_iteration = swarm.get_best_iteration()
```

## 🔄 Workflow Process
1. **Initial Agent Creation**: Build agents for the given task
2. **Task Execution**: Run the swarm to complete the task
3. **Multi-dimensional Evaluation**: Judge output on multiple criteria
4. **Feedback Generation**: Create detailed improvement suggestions
5. **Agent Improvement**: Build enhanced agents based on feedback
6. **Iteration Control**: Continue until convergence or max iterations (see the loop sketch below)
7. **Best Result Selection**: Return the highest-scoring iteration

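The loop sketched below condenses steps 1-7, assuming the overall score of an iteration is the mean of its dimension scores; helper names such as `initialize_swarm_router()` follow the test suite's usage, while the exact signatures in `auto_swarm_builder.py` may differ.

```python
def _run_with_autonomous_evaluation(self, task: str) -> str:
    """Build -> run -> evaluate -> improve until convergence (illustrative sketch)."""
    best_output, best_score, previous_score = None, float("-inf"), None

    for iteration in range(1, self.evaluation_config.max_iterations + 1):
        agents = self.create_agents_with_feedback(task)      # uses prior feedback after iteration 1
        output = self.initialize_swarm_router(agents, task)  # execute the task with these agents
        evaluation = self._evaluate_swarm_output(task, output, iteration)
        self.evaluation_history.append(evaluation)

        score = sum(evaluation.evaluation_scores.values()) / len(evaluation.evaluation_scores)
        if score > best_score:
            best_output, best_score = output, score

        # Iteration control: stop once the gain drops below the configured threshold.
        if previous_score is not None and (score - previous_score) < self.evaluation_config.improvement_threshold:
            break
        previous_score = score

    return best_output
```
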
## 🎛️ Configuration Options
### IterativeImprovementConfig
- `max_iterations`: Maximum improvement cycles (default: 3)
- `improvement_threshold`: Minimum improvement to continue (default: 0.1)
- `evaluation_dimensions`: Aspects to evaluate (default: ["accuracy", "helpfulness", "coherence", "instruction_adherence"])
- `use_judge_agent`: Enable CouncilAsAJudge evaluation (default: True)
- `store_all_iterations`: Keep history of all iterations (default: True)

### AutoSwarmBuilder New Parameters
- `enable_evaluation`: Enable autonomous evaluation (default: False)
- `evaluation_config`: Evaluation configuration object

## 📊 Evaluation Metrics

### Dimension Scores (0.0 - 1.0)
- **Accuracy**: Factual correctness and reliability
- **Helpfulness**: Practical value and problem-solving
- **Coherence**: Logical structure and flow
- **Instruction Adherence**: Compliance with requirements

### Tracking Data
- Per-iteration scores across all dimensions
- Identified strengths and weaknesses
- Specific improvement suggestions
- Overall performance trends

## 🔍 Key Features

### Autonomous Feedback Loop
- AI judges evaluate output quality
- Improvement strategist analyzes feedback
- Enhanced agents built automatically
- Performance tracking across iterations

### Multi-dimensional Evaluation
- CouncilAsAJudge integration for comprehensive assessment
- Configurable evaluation dimensions
- Detailed feedback with specific suggestions
- Scoring system for objective comparison

### Intelligent Convergence
- Automatic stopping when improvement plateaus
- Configurable improvement thresholds
- Best iteration tracking and selection
- Performance optimization controls

## 🧪 Testing & Validation

### Test Coverage
- Unit tests for all evaluation components
- Integration tests for the complete workflow
- Configuration validation tests
- Error handling and edge case tests

### Example Scenarios
- Research tasks with iterative improvement
- Content creation with quality enhancement
- Analysis tasks with accuracy optimization
- Creative tasks with coherence improvement

## 🔧 Integration Points

### Existing Swarms Infrastructure
- Leverages existing CouncilAsAJudge evaluation system
- Integrates with SwarmRouter for task execution
- Uses existing Agent and OpenAIFunctionCaller infrastructure
- Maintains backward compatibility

### Extensibility
- Pluggable evaluation dimensions
- Configurable judge agents
- Custom improvement strategies
- Performance optimization options

## 📈 Performance Considerations

### Efficiency Optimizations
- Parallel evaluation when possible
- Configurable evaluation depth
- Optional judge agent disabling for speed
- Iteration limit controls

### Resource Management
- Memory-efficient iteration storage
- Evaluation result caching
- Configurable history retention
- Performance monitoring hooks

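For example, history retention and iteration limits are both driven by the same config object; a lightweight setup along these lines (field values are illustrative) keeps memory usage down on long batch runs:

```python
from swarms.structs.auto_swarm_builder import IterativeImprovementConfig

# Keep only limited evaluation history and cap the loop at two passes.
lean_config = IterativeImprovementConfig(
    max_iterations=2,
    store_all_iterations=False,
)
```
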
## 🎯 Success Criteria Met

✅ **Task → Build Agents**: Implemented agent creation with task analysis
✅ **Run Test/Eval**: Integrated comprehensive evaluation system
✅ **Judge Agent**: CouncilAsAJudge integration for multi-dimensional assessment
✅ **Next Loop**: Iterative improvement with feedback-driven agent enhancement
✅ **Autonomous Operation**: Fully automated evaluation and improvement process

## 🚀 Benefits Delivered

1. **Improved Output Quality**: Iterative refinement leads to better results
2. **Autonomous Operation**: No manual intervention required for improvement
3. **Comprehensive Evaluation**: Multi-dimensional assessment ensures quality
4. **Performance Tracking**: Detailed metrics for optimization insights
5. **Flexible Configuration**: Adaptable to different use cases and requirements

## 🔮 Future Enhancement Opportunities

- **Custom Evaluation Metrics**: User-defined evaluation criteria
- **Evaluation Dataset Integration**: Benchmark-based performance assessment
- **Real-time Feedback**: Live evaluation during task execution
- **Ensemble Evaluation**: Multiple evaluation models for consensus
- **Performance Prediction**: ML-based iteration outcome forecasting

## 🎉 Implementation Status

**Status**: ✅ **COMPLETED**

The autonomous evaluation feature has been successfully implemented and integrated into the AutoSwarmBuilder. The system now supports:

- Iterative agent improvement through evaluation feedback
- Multi-dimensional performance assessment
- Autonomous convergence and optimization
- Comprehensive result tracking and analysis
- Flexible configuration for different use cases

The implementation addresses all requirements from issue #939 and provides a robust foundation for self-improving AI agent swarms.

# Autonomous Evaluation for AutoSwarmBuilder
## Overview

The Autonomous Evaluation feature enhances the AutoSwarmBuilder with iterative improvement capabilities. This system creates a feedback loop where agents are evaluated, critiqued, and improved automatically through multiple iterations, leading to better performance and higher quality outputs.

## Key Features

- **Iterative Improvement**: Automatically improves agent performance across multiple iterations
- **Multi-dimensional Evaluation**: Evaluates agents on accuracy, helpfulness, coherence, and instruction adherence
- **Autonomous Feedback Loop**: Uses AI judges and critics to provide detailed feedback
- **Performance Tracking**: Tracks improvement metrics across iterations
- **Configurable Evaluation**: Customizable evaluation parameters and thresholds

## Architecture

The autonomous evaluation system consists of several key components:

### 1. Evaluation Judges
- **CouncilAsAJudge**: Multi-agent evaluation system that assesses performance across dimensions
- **Improvement Strategist**: Analyzes feedback and suggests specific improvements

### 2. Feedback Loop
1. **Build Agents** → Create initial agent configuration
2. **Execute Task** → Run the swarm on the given task
3. **Evaluate Output** → Judge performance across multiple dimensions
4. **Generate Feedback** → Create detailed improvement suggestions
5. **Improve Agents** → Build enhanced agents based on feedback
6. **Repeat** → Continue until convergence or max iterations

### 3. Performance Tracking
- Dimension scores (0.0 to 1.0 scale)
- Strengths and weaknesses identification
- Improvement suggestions
- Best iteration tracking

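As a rough illustration, best-iteration tracking reduces to picking the stored result with the highest mean dimension score; the helper below is a sketch written against the `EvaluationResult` model shown later on this page, not the library's exact implementation.

```python
from statistics import mean


def pick_best_iteration(history):
    """Return the EvaluationResult with the highest mean dimension score, or None."""
    if not history:
        return None
    return max(history, key=lambda result: mean(result.evaluation_scores.values()))
```
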
## Usage
### Basic Usage with Evaluation

```python
from swarms.structs.auto_swarm_builder import (
    AutoSwarmBuilder,
    IterativeImprovementConfig,
)

# Configure evaluation parameters
eval_config = IterativeImprovementConfig(
    max_iterations=3,
    improvement_threshold=0.1,
    evaluation_dimensions=["accuracy", "helpfulness", "coherence"],
    use_judge_agent=True,
    store_all_iterations=True,
)

# Create AutoSwarmBuilder with evaluation enabled
swarm = AutoSwarmBuilder(
    name="SmartResearchSwarm",
    description="A self-improving research swarm",
    enable_evaluation=True,
    evaluation_config=eval_config,
)

# Run with autonomous evaluation
task = "Research the latest developments in quantum computing"
result = swarm.run(task)

# Access evaluation results
evaluation_history = swarm.get_evaluation_results()
best_iteration = swarm.get_best_iteration()
```

### Configuration Options

#### IterativeImprovementConfig

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `max_iterations` | int | 3 | Maximum number of improvement iterations |
| `improvement_threshold` | float | 0.1 | Minimum improvement required to continue |
| `evaluation_dimensions` | List[str] | ["accuracy", "helpfulness", "coherence", "instruction_adherence"] | Dimensions to evaluate |
| `use_judge_agent` | bool | True | Whether to use CouncilAsAJudge for evaluation |
| `store_all_iterations` | bool | True | Whether to store results from all iterations |

#### AutoSwarmBuilder Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `enable_evaluation` | bool | False | Enable autonomous evaluation |
| `evaluation_config` | IterativeImprovementConfig | None | Evaluation configuration |

## Evaluation Dimensions
### Accuracy
Evaluates factual correctness and reliability of information:
- Cross-references factual claims
- Identifies inconsistencies
- Detects technical inaccuracies
- Flags unsupported assertions

### Helpfulness
Assesses practical value and problem-solving efficacy:
- Alignment with user intent
- Solution feasibility
- Inclusion of essential context
- Proactive addressing of follow-ups

### Coherence
Analyzes structural integrity and logical flow:
- Information hierarchy
- Transition effectiveness
- Logical argument structure
- Clear connections between ideas

### Instruction Adherence
Measures compliance with requirements:
- Coverage of prompt requirements
- Adherence to constraints
- Output format compliance
- Scope appropriateness

## Examples
### Research Task with Evaluation

```python
from swarms.structs.auto_swarm_builder import AutoSwarmBuilder, IterativeImprovementConfig

# Configure for research tasks
config = IterativeImprovementConfig(
    max_iterations=4,
    improvement_threshold=0.15,
    evaluation_dimensions=["accuracy", "helpfulness", "coherence"],
)

swarm = AutoSwarmBuilder(
    name="ResearchSwarm",
    description="Advanced research analysis swarm",
    enable_evaluation=True,
    evaluation_config=config,
)

task = """
Analyze the current state of renewable energy technology,
including market trends, technological breakthroughs,
and policy implications for the next decade.
"""

result = swarm.run(task)

# Print evaluation summary
for i, eval_result in enumerate(swarm.get_evaluation_results()):
    score = sum(eval_result.evaluation_scores.values()) / len(eval_result.evaluation_scores)
    print(f"Iteration {i+1}: Overall Score = {score:.3f}")
```

### Content Creation with Evaluation

```python
config = IterativeImprovementConfig(
    max_iterations=3,
    evaluation_dimensions=["helpfulness", "coherence", "instruction_adherence"],
)

swarm = AutoSwarmBuilder(
    name="ContentCreationSwarm",
    enable_evaluation=True,
    evaluation_config=config,
)

task = """
Create a comprehensive marketing plan for a new SaaS product
targeting small businesses, including market analysis,
positioning strategy, and go-to-market tactics.
"""

result = swarm.run(task)
```

## Evaluation Results

### EvaluationResult Model

```python
from typing import Dict, List

from pydantic import BaseModel


class EvaluationResult(BaseModel):
    iteration: int                        # Iteration number
    task: str                             # Original task
    output: str                           # Swarm output
    evaluation_scores: Dict[str, float]   # Dimension scores (0.0-1.0)
    feedback: str                         # Detailed feedback
    strengths: List[str]                  # Identified strengths
    weaknesses: List[str]                 # Identified weaknesses
    suggestions: List[str]                # Improvement suggestions
```

### Accessing Results

```python
# Get all evaluation results
evaluations = swarm.get_evaluation_results()

# Get best performing iteration
best = swarm.get_best_iteration()

# Print detailed results
for eval_result in evaluations:
    print(f"Iteration {eval_result.iteration}:")
    overall = sum(eval_result.evaluation_scores.values()) / len(eval_result.evaluation_scores)
    print(f"  Overall Score: {overall:.3f}")

    for dimension, score in eval_result.evaluation_scores.items():
        print(f"  {dimension}: {score:.3f}")

    print(f"  Strengths: {len(eval_result.strengths)}")
    print(f"  Weaknesses: {len(eval_result.weaknesses)}")
    print(f"  Suggestions: {len(eval_result.suggestions)}")
```

## Best Practices
### 1. Task Complexity Matching
- Simple tasks: 1-2 iterations
- Medium tasks: 2-3 iterations
- Complex tasks: 3-5 iterations

### 2. Evaluation Dimension Selection
- **Research tasks**: accuracy, helpfulness, coherence
- **Creative tasks**: helpfulness, coherence, instruction_adherence
- **Analysis tasks**: accuracy, coherence, instruction_adherence
- **All-purpose**: All four dimensions

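The two guidelines above can be folded into a small, entirely optional helper; the profile table and `config_for()` function below are illustrative and not part of the swarms API.

```python
from swarms.structs.auto_swarm_builder import IterativeImprovementConfig

# Hypothetical task profiles mirroring the guidelines above: (max_iterations, dimensions).
TASK_PROFILES = {
    "research": (3, ["accuracy", "helpfulness", "coherence"]),
    "creative": (3, ["helpfulness", "coherence", "instruction_adherence"]),
    "analysis": (3, ["accuracy", "coherence", "instruction_adherence"]),
    "general": (4, ["accuracy", "helpfulness", "coherence", "instruction_adherence"]),
}


def config_for(task_type: str) -> IterativeImprovementConfig:
    """Build an IterativeImprovementConfig from a coarse task profile."""
    max_iterations, dimensions = TASK_PROFILES.get(task_type, TASK_PROFILES["general"])
    return IterativeImprovementConfig(
        max_iterations=max_iterations,
        evaluation_dimensions=dimensions,
    )
```
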
### 3. Threshold Configuration
- **Conservative**: 0.05-0.10 (more iterations)
- **Balanced**: 0.10-0.15 (moderate iterations)
- **Aggressive**: 0.15-0.25 (fewer iterations)

### 4. Performance Monitoring
```python
# Track improvement across iterations
scores = []
for eval_result in swarm.get_evaluation_results():
    overall_score = sum(eval_result.evaluation_scores.values()) / len(eval_result.evaluation_scores)
    scores.append(overall_score)

# Calculate improvement
if len(scores) > 1:
    improvement = scores[-1] - scores[0]
    print(f"Total improvement: {improvement:.3f}")
```

## Advanced Configuration

### Custom Evaluation Dimensions

```python
custom_config = IterativeImprovementConfig(
    max_iterations=3,
    evaluation_dimensions=["accuracy", "creativity", "practicality"],
    improvement_threshold=0.12,
)

# Note: Custom dimensions require corresponding keywords
# in the evaluation system
```

### Disabling Judge Agent (Performance Mode)

```python
performance_config = IterativeImprovementConfig(
    max_iterations=2,
    use_judge_agent=False,  # Faster but less detailed evaluation
    evaluation_dimensions=["helpfulness", "coherence"],
)
```

## Troubleshooting
### Common Issues

1. **High iteration count without improvement**
   - Raise the improvement threshold so the loop stops sooner
   - Reduce max_iterations
   - Check evaluation dimension relevance

2. **Evaluation system errors**
   - Verify OpenAI API key configuration (see the check after this list)
   - Check network connectivity
   - Ensure proper model access

3. **Inconsistent scoring**
   - Use more evaluation dimensions
   - Increase iteration count
   - Review task complexity

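For the API key issue in particular (item 2 above), a quick sanity check before launching the swarm avoids a confusing mid-run evaluation failure; this assumes the standard `OPENAI_API_KEY` environment variable and the same `.env` loading used in the example script.

```python
import os

from dotenv import load_dotenv

load_dotenv()

# Fail fast with a clear message instead of an opaque judge-agent error later on.
if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is not set; the evaluation judges cannot run.")
```
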
### Performance Optimization

1. **Reduce evaluation overhead**
   - Set `use_judge_agent=False` for faster evaluation
   - Limit evaluation dimensions
   - Reduce max_iterations

2. **Improve convergence**
   - Adjust improvement threshold
   - Add more specific evaluation dimensions
   - Enhance task clarity

## Integration Examples
### With Existing Workflows

```python
from swarms.structs.auto_swarm_builder import AutoSwarmBuilder, IterativeImprovementConfig


def research_pipeline(topic: str):
    """Research pipeline with autonomous evaluation"""

    config = IterativeImprovementConfig(
        max_iterations=3,
        evaluation_dimensions=["accuracy", "helpfulness"],
    )

    swarm = AutoSwarmBuilder(
        name=f"Research-{topic}",
        enable_evaluation=True,
        evaluation_config=config,
    )

    result = swarm.run(f"Research {topic}")

    # Return both result and evaluation metrics
    best_iteration = swarm.get_best_iteration()
    return {
        "result": result,
        "quality_score": sum(best_iteration.evaluation_scores.values()) if best_iteration else 0,
        "iterations": len(swarm.get_evaluation_results()),
    }
```

### Batch Processing with Evaluation
```python
from typing import List

from swarms.structs.auto_swarm_builder import AutoSwarmBuilder, IterativeImprovementConfig


def batch_process_with_evaluation(tasks: List[str]):
    """Process multiple tasks with evaluation tracking"""

    results = []
    for task in tasks:
        swarm = AutoSwarmBuilder(
            enable_evaluation=True,
            evaluation_config=IterativeImprovementConfig(max_iterations=2)
        )

        result = swarm.run(task)
        best = swarm.get_best_iteration()

        results.append({
            "task": task,
            "result": result,
            "quality": sum(best.evaluation_scores.values()) if best else 0,
        })

    return results
```

## Future Enhancements

- **Custom evaluation metrics**: User-defined evaluation criteria
- **Evaluation dataset integration**: Benchmark-based evaluation
- **Real-time feedback**: Live evaluation during execution
- **Ensemble evaluation**: Multiple evaluation models
- **Performance prediction**: ML-based iteration outcome prediction

## Conclusion

The Autonomous Evaluation feature transforms the AutoSwarmBuilder into a self-improving system that automatically enhances agent performance through iterative feedback loops. This leads to higher quality outputs, better task completion, and more reliable AI agent performance across diverse use cases.

"""
Example demonstrating the autonomous evaluation feature for AutoSwarmBuilder.

This example shows how to use the enhanced AutoSwarmBuilder with autonomous evaluation
that iteratively improves agent performance through feedback loops.
"""

from swarms.structs.auto_swarm_builder import (
    AutoSwarmBuilder,
    IterativeImprovementConfig,
)
from dotenv import load_dotenv

load_dotenv()


def main():
    """Demonstrate autonomous evaluation in AutoSwarmBuilder"""

    # Configure the evaluation process
    eval_config = IterativeImprovementConfig(
        max_iterations=3,  # Maximum 3 improvement iterations
        improvement_threshold=0.1,  # Stop if improvement < 10%
        evaluation_dimensions=[
            "accuracy",
            "helpfulness",
            "coherence",
            "instruction_adherence"
        ],
        use_judge_agent=True,
        store_all_iterations=True,
    )

    # Create AutoSwarmBuilder with autonomous evaluation enabled
    swarm = AutoSwarmBuilder(
        name="AutonomousResearchSwarm",
        description="A self-improving swarm for research tasks",
        verbose=True,
        max_loops=1,
        enable_evaluation=True,
        evaluation_config=eval_config,
    )

    # Define a research task
    task = """
    Research and analyze the current state of autonomous vehicle technology,
    including key players, recent breakthroughs, challenges, and future outlook.
    Provide a comprehensive report with actionable insights.
    """

    print("=" * 80)
    print("AUTONOMOUS EVALUATION DEMO")
    print("=" * 80)
    print(f"Task: {task}")
    print("\nStarting autonomous evaluation process...")
    print("The swarm will iteratively improve based on evaluation feedback.\n")

    # Run the swarm with autonomous evaluation
    try:
        result = swarm.run(task)

        print("\n" + "=" * 80)
        print("FINAL RESULT")
        print("=" * 80)
        print(result)

        # Display evaluation results
        print("\n" + "=" * 80)
        print("EVALUATION SUMMARY")
        print("=" * 80)

        evaluation_results = swarm.get_evaluation_results()
        print(f"Total iterations completed: {len(evaluation_results)}")

        for i, eval_result in enumerate(evaluation_results):
            print(f"\n--- Iteration {i+1} ---")
            overall_score = sum(eval_result.evaluation_scores.values()) / len(eval_result.evaluation_scores)
            print(f"Overall Score: {overall_score:.3f}")

            print("Dimension Scores:")
            for dimension, score in eval_result.evaluation_scores.items():
                print(f"  {dimension}: {score:.3f}")

            print(f"Strengths: {len(eval_result.strengths)} identified")
            print(f"Weaknesses: {len(eval_result.weaknesses)} identified")
            print(f"Suggestions: {len(eval_result.suggestions)} provided")

        # Show best iteration
        best_iteration = swarm.get_best_iteration()
        if best_iteration:
            best_score = sum(best_iteration.evaluation_scores.values()) / len(best_iteration.evaluation_scores)
            print(f"\nBest performing iteration: {best_iteration.iteration} (Score: {best_score:.3f})")

    except Exception as e:
        print(f"Error during execution: {str(e)}")
        print("This might be due to missing API keys or network issues.")


def basic_example():
    """Show basic usage without evaluation for comparison"""
    print("\n" + "=" * 80)
    print("BASIC MODE (No Evaluation)")
    print("=" * 80)

    # Basic swarm without evaluation
    basic_swarm = AutoSwarmBuilder(
        name="BasicResearchSwarm",
        description="A basic swarm for research tasks",
        verbose=True,
        enable_evaluation=False,  # Evaluation disabled
    )

    task = "Write a brief summary of renewable energy trends."

    try:
        result = basic_swarm.run(task)
        print("Basic Result (no iterative improvement):")
        print(result)

    except Exception as e:
        print(f"Error during basic execution: {str(e)}")


if __name__ == "__main__":
    main()
    basic_example()

"""
Tests for the autonomous evaluation feature in AutoSwarmBuilder.

This test suite validates the iterative improvement functionality and evaluation system.
"""

import pytest
from unittest.mock import patch, MagicMock

from swarms.structs.auto_swarm_builder import (
    AutoSwarmBuilder,
    IterativeImprovementConfig,
    EvaluationResult,
)


class TestAutonomousEvaluation:
    """Test suite for autonomous evaluation features"""

    def test_iterative_improvement_config_defaults(self):
        """Test default configuration values"""
        config = IterativeImprovementConfig()

        assert config.max_iterations == 3
        assert config.improvement_threshold == 0.1
        assert "accuracy" in config.evaluation_dimensions
        assert "helpfulness" in config.evaluation_dimensions
        assert config.use_judge_agent is True
        assert config.store_all_iterations is True

    def test_iterative_improvement_config_custom(self):
        """Test custom configuration values"""
        config = IterativeImprovementConfig(
            max_iterations=5,
            improvement_threshold=0.2,
            evaluation_dimensions=["accuracy", "coherence"],
            use_judge_agent=False,
            store_all_iterations=False,
        )

        assert config.max_iterations == 5
        assert config.improvement_threshold == 0.2
        assert len(config.evaluation_dimensions) == 2
        assert config.use_judge_agent is False
        assert config.store_all_iterations is False

    def test_evaluation_result_model(self):
        """Test EvaluationResult model creation and validation"""
        result = EvaluationResult(
            iteration=1,
            task="Test task",
            output="Test output",
            evaluation_scores={"accuracy": 0.8, "helpfulness": 0.7},
            feedback="Good performance",
            strengths=["Clear response"],
            weaknesses=["Could be more detailed"],
            suggestions=["Add more examples"],
        )

        assert result.iteration == 1
        assert result.task == "Test task"
        assert result.evaluation_scores["accuracy"] == 0.8
        assert len(result.strengths) == 1
        assert len(result.weaknesses) == 1
        assert len(result.suggestions) == 1

    def test_auto_swarm_builder_init_with_evaluation(self):
        """Test AutoSwarmBuilder initialization with evaluation enabled"""
        config = IterativeImprovementConfig(max_iterations=2)

        with patch('swarms.structs.auto_swarm_builder.CouncilAsAJudge'):
            with patch('swarms.structs.auto_swarm_builder.Agent'):
                swarm = AutoSwarmBuilder(
                    name="TestSwarm",
                    description="Test swarm with evaluation",
                    enable_evaluation=True,
                    evaluation_config=config,
                )

                assert swarm.enable_evaluation is True
                assert swarm.evaluation_config.max_iterations == 2
                assert swarm.current_iteration == 0
                assert len(swarm.evaluation_history) == 0

    def test_auto_swarm_builder_init_without_evaluation(self):
        """Test AutoSwarmBuilder initialization with evaluation disabled"""
        swarm = AutoSwarmBuilder(
            name="TestSwarm",
            description="Test swarm without evaluation",
            enable_evaluation=False,
        )

        assert swarm.enable_evaluation is False
        assert swarm.current_iteration == 0
        assert len(swarm.evaluation_history) == 0

    @patch('swarms.structs.auto_swarm_builder.CouncilAsAJudge')
    @patch('swarms.structs.auto_swarm_builder.Agent')
    def test_evaluation_system_initialization(self, mock_agent, mock_council):
        """Test evaluation system initialization"""
        config = IterativeImprovementConfig()

        swarm = AutoSwarmBuilder(
            name="TestSwarm",
            enable_evaluation=True,
            evaluation_config=config,
        )

        # Verify CouncilAsAJudge was initialized
        mock_council.assert_called_once()

        # Verify improvement agent was created
        mock_agent.assert_called_once()
        assert mock_agent.call_args[1]['agent_name'] == 'ImprovementStrategist'

    def test_get_improvement_agent_prompt(self):
        """Test improvement agent prompt generation"""
        swarm = AutoSwarmBuilder(enable_evaluation=False)
        prompt = swarm._get_improvement_agent_prompt()

        assert "improvement strategist" in prompt.lower()
        assert "evaluation feedback" in prompt.lower()
        assert "recommendations" in prompt.lower()

    def test_extract_dimension_score(self):
        """Test dimension score extraction from feedback"""
        swarm = AutoSwarmBuilder(enable_evaluation=False)

        # Test positive feedback
        positive_feedback = "The response is accurate and helpful"
        accuracy_score = swarm._extract_dimension_score(positive_feedback, "accuracy")
        helpfulness_score = swarm._extract_dimension_score(positive_feedback, "helpfulness")

        assert accuracy_score > 0.5
        assert helpfulness_score > 0.5

        # Test negative feedback
        negative_feedback = "The response is inaccurate and unhelpful"
        accuracy_score_neg = swarm._extract_dimension_score(negative_feedback, "accuracy")
        helpfulness_score_neg = swarm._extract_dimension_score(negative_feedback, "helpfulness")

        assert accuracy_score_neg < 0.5
        assert helpfulness_score_neg < 0.5

        # Test neutral feedback
        neutral_feedback = "The response exists"
        neutral_score = swarm._extract_dimension_score(neutral_feedback, "accuracy")
        assert neutral_score == 0.5

    def test_parse_feedback(self):
        """Test feedback parsing into strengths, weaknesses, and suggestions"""
        swarm = AutoSwarmBuilder(enable_evaluation=False)

        feedback = """
        The response shows good understanding of the topic.
        However, there are some issues with clarity.
        I suggest adding more examples to improve comprehension.
        The strength is in the factual accuracy.
        The weakness is the lack of structure.
        Recommend reorganizing the content.
        """

        strengths, weaknesses, suggestions = swarm._parse_feedback(feedback)

        assert len(strengths) > 0
        assert len(weaknesses) > 0
        assert len(suggestions) > 0

    def test_get_evaluation_results(self):
        """Test getting evaluation results"""
        swarm = AutoSwarmBuilder(enable_evaluation=False)

        # Initially empty
        assert len(swarm.get_evaluation_results()) == 0

        # Add mock evaluation result
        mock_result = EvaluationResult(
            iteration=1,
            task="test",
            output="test output",
            evaluation_scores={"accuracy": 0.8},
            feedback="good",
            strengths=["clear"],
            weaknesses=["brief"],
            suggestions=["expand"],
        )
        swarm.evaluation_history.append(mock_result)

        results = swarm.get_evaluation_results()
        assert len(results) == 1
        assert results[0].iteration == 1

    def test_get_best_iteration(self):
        """Test getting the best performing iteration"""
        swarm = AutoSwarmBuilder(enable_evaluation=False)

        # No iterations initially
        assert swarm.get_best_iteration() is None

        # Add mock evaluation results
        result1 = EvaluationResult(
            iteration=1,
            task="test",
            output="output1",
            evaluation_scores={"accuracy": 0.6, "helpfulness": 0.5},
            feedback="ok",
            strengths=[],
            weaknesses=[],
            suggestions=[],
        )

        result2 = EvaluationResult(
            iteration=2,
            task="test",
            output="output2",
            evaluation_scores={"accuracy": 0.8, "helpfulness": 0.7},
            feedback="better",
            strengths=[],
            weaknesses=[],
            suggestions=[],
        )

        swarm.evaluation_history.extend([result1, result2])

        best = swarm.get_best_iteration()
        assert best.iteration == 2  # Second iteration has higher scores

    @patch('swarms.structs.auto_swarm_builder.OpenAIFunctionCaller')
    def test_create_agents_with_feedback_first_iteration(self, mock_function_caller):
        """Test agent creation for first iteration (no feedback)"""
        swarm = AutoSwarmBuilder(enable_evaluation=False)

        # Mock the function caller
        mock_instance = MagicMock()
        mock_function_caller.return_value = mock_instance
        mock_instance.run.return_value.model_dump.return_value = {
            "agents": [
                {
                    "name": "TestAgent",
                    "description": "A test agent",
                    "system_prompt": "You are a test agent"
                }
            ]
        }

        # Mock build_agent method
        with patch.object(swarm, 'build_agent') as mock_build_agent:
            mock_agent = MagicMock()
            mock_build_agent.return_value = mock_agent

            agents = swarm.create_agents_with_feedback("test task")

            assert len(agents) == 1
            mock_build_agent.assert_called_once()

    def test_run_single_iteration_mode(self):
        """Test running in single iteration mode (evaluation disabled)"""
        swarm = AutoSwarmBuilder(enable_evaluation=False)

        with patch.object(swarm, 'create_agents') as mock_create:
            with patch.object(swarm, 'initialize_swarm_router') as mock_router:
                mock_create.return_value = []
                mock_router.return_value = "test result"

                result = swarm.run("test task")

                assert result == "test result"
                mock_create.assert_called_once_with("test task")
                mock_router.assert_called_once()


class TestEvaluationIntegration:
    """Integration tests for the evaluation system"""

    @patch('swarms.structs.auto_swarm_builder.CouncilAsAJudge')
    @patch('swarms.structs.auto_swarm_builder.Agent')
    @patch('swarms.structs.auto_swarm_builder.OpenAIFunctionCaller')
    def test_evaluation_workflow(self, mock_function_caller, mock_agent, mock_council):
        """Test the complete evaluation workflow"""
        # Setup mocks
        mock_council_instance = MagicMock()
        mock_council.return_value = mock_council_instance
        mock_council_instance.run.return_value = "Evaluation feedback"

        mock_agent_instance = MagicMock()
        mock_agent.return_value = mock_agent_instance
        mock_agent_instance.run.return_value = "Improvement suggestions"

        mock_function_caller_instance = MagicMock()
        mock_function_caller.return_value = mock_function_caller_instance
        mock_function_caller_instance.run.return_value.model_dump.return_value = {
            "agents": [
                {
                    "name": "TestAgent",
                    "description": "Test",
                    "system_prompt": "Test prompt"
                }
            ]
        }

        # Configure swarm
        config = IterativeImprovementConfig(max_iterations=1)
        swarm = AutoSwarmBuilder(
            name="TestSwarm",
            enable_evaluation=True,
            evaluation_config=config,
        )

        # Mock additional methods
        with patch.object(swarm, 'build_agent') as mock_build:
            with patch.object(swarm, 'initialize_swarm_router') as mock_router:
                mock_build.return_value = mock_agent_instance
                mock_router.return_value = "Task output"

                # Run the swarm
                result = swarm.run("test task")

                # Verify evaluation was performed
                assert len(swarm.evaluation_history) == 1
                assert result == "Task output"


if __name__ == "__main__":
    pytest.main([__file__])