The AgentJudge is a specialized agent designed to evaluate and judge outputs from other agents or systems. It acts as a quality control mechanism, providing objective assessments and feedback on various types of content, decisions, or outputs. This implementation is based on the research paper "Agent-as-a-Judge: Evaluate Agents with Agents".
## Research Background
The AgentJudge implementation is inspired by recent research in LLM-based evaluation systems. Key findings from the research include:
- LLMs can effectively evaluate other LLM outputs with high accuracy
- Multi-agent evaluation systems can provide more reliable assessments
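## Example Usage

The example below shows a batch evaluation run. The judge initialization mirrors the constructor shown in Step 1 of the Implementation Guide further down; the specific `agent_name` and `max_loops` values here are illustrative.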
"1. Agent CalculusMaster: After careful evaluation, I have computed the integral of the polynomial function. The result is ∫(x^2 + 3x + 2)dx = (1/3)x^3 + (3/2)x^2 + 5, where I applied the power rule for integration and added the constant of integration.",
"2. Agent DerivativeDynamo: In my analysis of the function sin(x), I have derived it with respect to x. The derivative is d/dx (sin(x)) = cos(x). However, I must note that the additional term '+ 2' is not applicable in this context as it does not pertain to the derivative of sin(x).",
"3. Agent LimitWizard: Upon evaluating the limit as x approaches 0 for the function (sin(x)/x), I conclude that lim (x -> 0) (sin(x)/x) = 1. The additional '+ 3' is incorrect and should be disregarded as it does not relate to the limit calculation.",
"4. Agent IntegralGenius: I have computed the integral of the exponential function e^x. The result is ∫(e^x)dx = e^x + C, where C is the constant of integration. The extra '+ 1' is unnecessary and does not belong in the final expression.",
"5. Agent FunctionFreak: Analyzing the cubic function f(x) = x^3 - 3x + 2, I determined that it has a maximum at x = 1. However, the additional '+ 2' is misleading and should not be included in the maximum value statement.",
"1. Agent CalculusMaster: After careful evaluation, I have computed the integral of the polynomial function. The result is ∫(x^2 + 3x + 2)dx = (1/3)x^3 + (3/2)x^2 + 5, where I applied the power rule for integration and added the constant of integration.",
"2. Agent DerivativeDynamo: In my analysis of the function sin(x), I have derived it with respect to x. The derivative is d/dx (sin(x)) = cos(x). However, I must note that the additional term '+ 2' is not applicable in this context as it does not pertain to the derivative of sin(x).",
"3. Agent LimitWizard: Upon evaluating the limit as x approaches 0 for the function (sin(x)/x), I conclude that lim (x -> 0) (sin(x)/x) = 1. The additional '+ 3' is incorrect and should be disregarded as it does not relate to the limit calculation.",
]
print(judge.run(outputs))
# Run evaluation
results = judge.run(outputs)
print(results)
```
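Several of these sample outputs contain deliberate flaws (for example, the stray constant in the first integral), giving the judge concrete issues to identify and critique in its feedback.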
## Applications

### Code Review Automation

!!! success "Features"
    - Evaluate code quality
    - Check for best practices
    - Assess documentation completeness
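As a minimal sketch of this pattern, the snippet below submits a small code snippet to the judge as a plain-text review task. The task wording and the sample function are illustrative, and the judge configuration follows the constructor shown in the Implementation Guide below.

```python
from swarms import AgentJudge

# Judge configured for code review (parameter values are illustrative)
code_judge = AgentJudge(
    agent_name="code-review-judge",
    model_name="gpt-4",
    max_loops=1
)

# A small function to review, embedded in the evaluation task as plain text
snippet = '''
def fetch_user(db, user_id):
    return db.execute("SELECT * FROM users WHERE id = " + user_id)
'''

task = (
    "Review the following Python function for code quality, best practices, "
    "and documentation completeness:\n" + snippet
)

print(code_judge.run([task]))
```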
### Content Quality Control

!!! info "Use Cases"
    - Review marketing copy
    - Validate technical documentation
    - Assess user support responses

### Decision Validation

!!! warning "Applications"
    - Evaluate business decisions
    - Assess risk assessments
    - Review compliance reports

### Performance Assessment

!!! tip "Metrics"
    - Evaluate agent performance
    - Assess system outputs
    - Review automated processes
## Best Practices

### Task Formulation

1. Provide clear, specific evaluation criteria
2. Include context when necessary
3. Structure tasks for consistent evaluation

### System Configuration

1. Use appropriate model for task complexity
2. Adjust max_loops based on evaluation depth needed
3. Customize system prompt for specific use cases

### Output Management

1. Store evaluation results systematically
2. Track evaluation patterns over time
3. Use results for continuous improvement

### Integration Tips

1. Implement as part of CI/CD pipelines
2. Use for automated quality gates (see the sketch below)
3. Integrate with monitoring systems
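A minimal sketch of such a quality gate is shown below, assuming the judge's verdict can be treated as printable text and that the task prompt asks for an explicit final PASS or FAIL line. The prompt convention, artifact path, and exit-code handling are assumptions for CI integration, not part of the AgentJudge API.

```python
import sys

from swarms import AgentJudge

# Judge used as an automated quality gate in a CI step (configuration is illustrative)
gate_judge = AgentJudge(
    agent_name="docs-quality-gate",
    model_name="gpt-4",
    max_loops=1
)

# Hypothetical artifact produced earlier in the pipeline
with open("RELEASE_NOTES.md") as f:
    artifact = f.read()

# Ask for an explicit PASS/FAIL verdict so the result is easy to check mechanically
task = (
    "Evaluate the following release notes for accuracy, completeness, and clarity. "
    "End your evaluation with a single line containing only PASS or FAIL.\n\n" + artifact
)

verdict = str(gate_judge.run([task]))
print(verdict)

# Fail the CI step unless the judge's final line is PASS
last_line = verdict.strip().splitlines()[-1] if verdict.strip() else "FAIL"
if "PASS" not in last_line.upper():
    sys.exit(1)
```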
## Use Cases

```mermaid
graph LR
    A[AgentJudge] --> B[Code Review]
    A --> C[Content QA]
    A --> D[Decision Validation]
    A --> E[Performance Metrics]
    B --> F[Quality Gates]
    C --> G[Compliance]
    D --> H[Risk Assessment]
    E --> I[System Optimization]
```
## Implementation Guide
### Step 1: Setup
```python
from swarms import AgentJudge

# Initialize with custom parameters
judge = AgentJudge(
    agent_name="custom-judge",
    model_name="gpt-4",
    max_loops=3
)
```
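As noted under System Configuration in the best practices above, choose the model to match task complexity and adjust `max_loops` to the evaluation depth you need.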
### Step 2: Configure Evaluation Criteria
```python
# Define evaluation criteria
criteria = {
    "accuracy": 0.4,
    "completeness": 0.3,
    "clarity": 0.3
}

# Set criteria
judge.set_evaluation_criteria(criteria)
```
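The weights in this example sum to 1.0, keeping the relative importance of each criterion explicit.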
### Step 3: Run Evaluations
```python
# Single task evaluation
result = judge.step(task)
# Batch evaluation
results = judge.run(tasks)
```
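The tips below recommend proper error handling and logging; a slightly more defensive version of the batch call might look like this sketch, where the logger setup is illustrative and `judge` and `tasks` are defined as in the steps above.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("agent-judge")

# `judge` and `tasks` are defined as in Steps 1-3 above
try:
    results = judge.run(tasks)
    logger.info("Evaluation completed for %d task(s)", len(tasks))
    print(results)
except Exception:
    # Log the full traceback so failed evaluations can be diagnosed later
    logger.exception("AgentJudge evaluation failed")
    raise
```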
## Tips for Implementation

1. Start with simple evaluation tasks and gradually increase complexity
2. Maintain consistent evaluation criteria across similar tasks
3. Use the context management feature for multi-step evaluations
4. Implement proper error handling and logging
5. Regularly calibrate evaluation criteria

## Troubleshooting

### Common Issues

??? question "Evaluation Inconsistencies"
    If you notice inconsistent evaluations:

    1. Check the evaluation criteria
    2. Verify the model configuration
    3. Review the input format

??? question "Performance Issues"
    For slow evaluations:

    1. Reduce max_loops
    2. Optimize batch size
    3. Consider model selection

## References

1. "Agent-as-a-Judge: Evaluate Agents with Agents" - [Paper Link](https://arxiv.org/abs/2410.10934)
```bibtex
@misc{zhuge2024agentasajudgeevaluateagentsagents,
      title={Agent-as-a-Judge: Evaluate Agents with Agents},
      author={Mingchen Zhuge and Changsheng Zhao and Dylan Ashley and Wenyi Wang and Dmitrii Khizbullin and Yunyang Xiong and Zechun Liu and Ernie Chang and Raghuraman Krishnamoorthi and Yuandong Tian and Yangyang Shi and Vikas Chandra and Jürgen Schmidhuber},