
# CouncilAsAJudge

The CouncilAsAJudge is an evaluation system that employs multiple AI agents to assess model responses across several dimensions. Through parallel evaluation and aggregation, it produces a comprehensive, multi-dimensional analysis of a model's output.

## Overview

The CouncilAsAJudge implements a council of specialized AI agents that evaluate different aspects of a model's response. Each agent focuses on a specific dimension of evaluation, and their findings are aggregated into a comprehensive report.

```mermaid
graph TD
    A[User Query] --> B[Base Agent]
    B --> C[Model Response]
    C --> D[CouncilAsAJudge]

    subgraph "Evaluation Dimensions"
        D --> E1[Accuracy Agent]
        D --> E2[Helpfulness Agent]
        D --> E3[Harmlessness Agent]
        D --> E4[Coherence Agent]
        D --> E5[Conciseness Agent]
        D --> E6[Instruction Adherence Agent]
    end

    E1 --> F[Evaluation Aggregation]
    E2 --> F
    E3 --> F
    E4 --> F
    E5 --> F
    E6 --> F

    F --> G[Comprehensive Report]

    style D fill:#f9f,stroke:#333,stroke-width:2px
    style F fill:#bbf,stroke:#333,stroke-width:2px
```
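
Conceptually, the council fans a single task/response pair out to one judge per dimension and merges their findings. The sketch below illustrates that fan-out/aggregate pattern with a plain `ThreadPoolExecutor`; it is a simplified illustration of the flow above, not the library's internal implementation, and `judge_dimension` is a hypothetical stand-in for the specialized agents.

```python
from concurrent.futures import ThreadPoolExecutor

DIMENSIONS = [
    "accuracy", "helpfulness", "harmlessness",
    "coherence", "conciseness", "instruction adherence",
]

def judge_dimension(dimension: str, task: str, response: str) -> str:
    # Hypothetical stand-in: in CouncilAsAJudge this is a specialized
    # agent that assesses `response` against `task` for one dimension.
    return f"[{dimension}] findings for the response"

def evaluate(task: str, response: str) -> str:
    # Fan out: one judge per dimension, run in parallel.
    with ThreadPoolExecutor(max_workers=len(DIMENSIONS)) as pool:
        futures = [
            pool.submit(judge_dimension, dim, task, response)
            for dim in DIMENSIONS
        ]
        # Aggregate: merge every judge's findings into one report.
        return "\n".join(f.result() for f in futures)
```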

## Key Features

- Parallel evaluation across multiple dimensions
- Caching system for improved performance (see the sketch below)
- Dynamic model selection
- Comprehensive evaluation metrics
- Thread-safe execution
- Detailed technical analysis
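
The caching feature can be pictured as a bounded LRU cache over prompt construction, so identical evaluation prompts are not rebuilt on every run. Below is a minimal sketch of that idea, assuming prompts are keyed by dimension and task; the library's actual cache keys and contents are internal details.

```python
from functools import lru_cache

@lru_cache(maxsize=128)  # mirrors the default cache_size=128
def build_judge_prompt(dimension: str, task: str) -> str:
    # Hypothetical prompt builder: repeated (dimension, task) pairs
    # hit the cache instead of re-rendering the string.
    return (
        f"You are the {dimension} judge. Evaluate the model's answer "
        f"to the following task:\n\n{task}"
    )
```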

## Installation

```bash
pip install swarms
```

## Basic Usage

```python
from swarms import Agent, CouncilAsAJudge

# Create a base agent
base_agent = Agent(
    agent_name="Financial-Analysis-Agent",
    system_prompt="You are a financial expert helping users understand and establish ROTH IRAs.",
    model_name="claude-opus-4-20250514",
    max_loops=1,
)

# Run the base agent
user_query = "How can I establish a ROTH IRA to buy stocks and get a tax break?"
model_output = base_agent.run(user_query)

# Create and run the council
panel = CouncilAsAJudge()
results = panel.run(user_query, model_output)
print(results)
```

## Advanced Usage

### Custom Model Configuration

```python
from swarms import CouncilAsAJudge

# Initialize with custom model
council = CouncilAsAJudge(
    model_name="anthropic/claude-3-sonnet-20240229",
    output_type="all",
    cache_size=256,
    max_workers=4,
    random_model_name=False
)
```
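
The customized council is invoked exactly like the default one: pass the original task and the model's response to `run`. A short usage sketch reusing the `council` instance from the block above (the inputs are illustrative):

```python
# Illustrative inputs; any task/response pair works here
task = "Summarize the key risks of margin trading."
model_response = "Margin trading amplifies both gains and losses..."

report = council.run(task, model_response)
print(report)
```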

### Parallel Processing Configuration

```python
from swarms import CouncilAsAJudge

# Configure parallel processing
council = CouncilAsAJudge(
    max_workers=8,  # Custom number of worker threads
    random_model_name=True  # Enable dynamic model selection
)
```
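
When tuning `max_workers` yourself, a common starting point is to derive the count from the machine's CPU cores. This is a general sizing heuristic, not a documented library recommendation:

```python
import os

from swarms import CouncilAsAJudge

# One worker per core, keeping a core free for the main thread
# (general heuristic, not library-specific guidance)
workers = max(1, (os.cpu_count() or 2) - 1)

council = CouncilAsAJudge(max_workers=workers)
```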

## Evaluation Dimensions

The council evaluates responses across six key dimensions:

| Dimension | Evaluation Criteria |
|-----------|---------------------|
| Accuracy | • Factual correctness<br>• Source credibility<br>• Temporal consistency<br>• Technical accuracy |
| Helpfulness | • Problem-solving efficacy<br>• Solution feasibility<br>• Context inclusion<br>• Proactive addressing of follow-ups |
| Harmlessness | • Safety assessment<br>• Ethical considerations<br>• Age-appropriateness<br>• Content sensitivity |
| Coherence | • Structural integrity<br>• Logical flow<br>• Information hierarchy<br>• Transition effectiveness |
| Conciseness | • Communication efficiency<br>• Information density<br>• Redundancy elimination<br>• Focus maintenance |
| Instruction Adherence | • Requirement coverage<br>• Constraint compliance<br>• Format matching<br>• Scope appropriateness |
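
One way to picture how these criteria drive the judges: each dimension's checklist becomes the instruction set of a specialized agent. The mapping below is a hypothetical illustration of that structure; the library's actual judge prompts are internal and may differ.

```python
# Hypothetical criteria table mirroring the dimensions above
EVALUATION_CRITERIA = {
    "accuracy": [
        "Factual correctness", "Source credibility",
        "Temporal consistency", "Technical accuracy",
    ],
    "helpfulness": [
        "Problem-solving efficacy", "Solution feasibility",
        "Context inclusion", "Proactive addressing of follow-ups",
    ],
    "harmlessness": [
        "Safety assessment", "Ethical considerations",
        "Age-appropriateness", "Content sensitivity",
    ],
    # coherence, conciseness, and instruction adherence follow
    # the same pattern
}

def judge_instructions(dimension: str) -> str:
    # Turn a dimension's checklist into judge instructions
    criteria = "\n".join(f"- {c}" for c in EVALUATION_CRITERIA[dimension])
    return f"Evaluate the response for {dimension} against:\n{criteria}"
```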

## API Reference

### CouncilAsAJudge

```python
class CouncilAsAJudge:
    def __init__(
        self,
        id: str = swarm_id(),
        name: str = "CouncilAsAJudge",
        description: str = "Evaluates the model's response across multiple dimensions",
        model_name: str = "gpt-4o-mini",
        output_type: str = "all",
        cache_size: int = 128,
        max_workers: int = None,
        random_model_name: bool = True,
    )
```

#### Parameters

- `id` (str): Unique identifier for the council
- `name` (str): Display name of the council
- `description` (str): Description of the council's purpose
- `model_name` (str): Name of the model to use for evaluations
- `output_type` (str): Type of output to return
- `cache_size` (int): Size of the LRU cache for prompts
- `max_workers` (int): Maximum number of worker threads
- `random_model_name` (bool): Whether to use random model selection

### Methods

#### run

```python
def run(self, task: str, model_response: str) -> Any
```

Evaluates a model response across all dimensions.

**Parameters**

- `task` (str): Original user prompt
- `model_response` (str): Model's response to evaluate

**Returns**

- Comprehensive evaluation report (its shape depends on `output_type`)

## Examples

### Financial Analysis Example

```python
from swarms import Agent, CouncilAsAJudge

# Create financial analysis agent
financial_agent = Agent(
    agent_name="Financial-Analysis-Agent",
    system_prompt="You are a financial expert helping users understand and establish ROTH IRAs.",
    model_name="claude-opus-4-20250514",
    max_loops=1,
)

# Run analysis
query = "How can I establish a ROTH IRA to buy stocks and get a tax break?"
response = financial_agent.run(query)

# Evaluate response
council = CouncilAsAJudge()
evaluation = council.run(query, response)
print(evaluation)
```

### Technical Documentation Example

```python
from swarms import Agent, CouncilAsAJudge

# Create documentation agent
doc_agent = Agent(
    agent_name="Documentation-Agent",
    system_prompt="You are a technical documentation expert.",
    model_name="gpt-4",
    max_loops=1,
)

# Generate documentation
query = "Explain how to implement a REST API using FastAPI"
response = doc_agent.run(query)

# Evaluate documentation quality
council = CouncilAsAJudge(
    model_name="anthropic/claude-3-sonnet-20240229",
    output_type="all"
)
evaluation = council.run(query, response)
print(evaluation)
```

## Best Practices

### Model Selection

!!! tip "Model Selection Best Practices"
    - Choose appropriate models for your use case
    - Consider using random model selection for diverse evaluations
    - Match model capabilities to evaluation requirements

### Performance Optimization

!!! note "Performance Tips"
    - Adjust cache size based on memory constraints
    - Configure worker threads based on CPU cores
    - Monitor memory usage with large responses

### Error Handling

!!! warning "Error Handling Guidelines"
    - Implement proper exception handling (see the sketch below)
    - Monitor evaluation failures
    - Log evaluation results for analysis
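
A minimal sketch of those guidelines; since the documented API does not enumerate the exceptions `run` can raise, the handler below deliberately catches broadly and logs the failure:

```python
import logging

from swarms import CouncilAsAJudge

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("council-eval")

council = CouncilAsAJudge()

def safe_evaluate(task: str, model_response: str):
    try:
        report = council.run(task, model_response)
        logger.info("Evaluation succeeded for task: %.60s", task)
        return report
    except Exception:
        # Exception types are not part of the documented API,
        # so catch broadly, log, and let the caller decide.
        logger.exception("Evaluation failed for task: %.60s", task)
        return None
```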

### Resource Management

!!! info "Resource Management"
    - Clean up resources after evaluation
    - Monitor thread pool usage
    - Implement proper shutdown procedures

## Troubleshooting

### Memory Issues

!!! danger "Memory Problems"
    If you encounter memory-related problems:

    - Reduce the cache size
    - Decrease the number of worker threads
    - Process smaller chunks of text
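
For example, a deliberately conservative configuration (the values are illustrative starting points, not benchmarked recommendations):

```python
from swarms import CouncilAsAJudge

# Conservative settings for memory-constrained environments
council = CouncilAsAJudge(
    cache_size=32,  # smaller LRU prompt cache
    max_workers=2,  # fewer concurrent judge threads
)
```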

### Performance Problems

!!! warning "Performance Issues"
    To improve performance:

    - Increase the cache size
    - Adjust the worker thread count
    - Use more efficient models
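
Concretely, a throughput-oriented configuration might look like this; the values are illustrative and should be tuned for your hardware, and `gpt-4o-mini` is simply the council's documented default model:

```python
import os

from swarms import CouncilAsAJudge

# Throughput-oriented settings (illustrative, tune for your setup)
council = CouncilAsAJudge(
    cache_size=512,              # larger prompt cache
    max_workers=os.cpu_count(),  # one judge thread per core
    model_name="gpt-4o-mini",    # a lighter, faster model
)
```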

### Evaluation Failures

!!! danger "Evaluation Issues"
    When evaluations fail:

    - Check model availability
    - Verify the input format
    - Monitor the error logs

## Contributing

!!! success "Contributing"
    Contributions are welcome! Please feel free to submit a Pull Request.

## License

!!! info "License"
    This project is licensed under the MIT License - see the LICENSE file for details.