# Qdrant RAG Example with Document Ingestion

This example demonstrates how to use the agent structure from `example.py` with Qdrant RAG to ingest large collections of PDF documents and text files for quantitative trading analysis.

## 🚀 Features

- **Document Ingestion**: Process PDF, TXT, and Markdown files automatically
- **Qdrant Vector Database**: High-performance vector storage with similarity search
- **Sentence Transformer Embeddings**: Local embedding generation using state-of-the-art models
- **Intelligent Chunking**: Smart text chunking with overlap for better retrieval
- **Concurrent Processing**: Multi-threaded document processing for large collections
- **RAG Integration**: Seamless integration with Swarms Agent framework
- **Financial Analysis**: Specialized for quantitative trading and financial research

## 📋 Prerequisites

- Python 3.10+
- Qdrant client (local or cloud)
- Sentence transformers for embeddings
- Swarms framework

## 🛠️ Installation

1. **Install dependencies**:
```bash
pip install -r requirements.txt
```

2. **Set up environment variables** (optional, for cloud deployment):
```bash
export QDRANT_URL="your_qdrant_url"
export QDRANT_API_KEY="your_qdrant_api_key"
```

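
The environment variables above can be turned into Qdrant client settings roughly as follows. This is a minimal sketch, not the example's actual code: the helper names `resolve_qdrant_settings` and `make_client` are hypothetical, and falling back to qdrant-client's in-memory mode when `QDRANT_URL` is unset is an assumption about reasonable local behavior.

```python
import os
from typing import Mapping, Optional


def resolve_qdrant_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Build keyword arguments for QdrantClient from environment variables."""
    env = os.environ if env is None else env
    url = env.get("QDRANT_URL")
    if url:
        settings = {"url": url}
        api_key = env.get("QDRANT_API_KEY")
        if api_key:
            settings["api_key"] = api_key
        return settings
    # Assumption: with no URL configured, use qdrant-client's in-process mode.
    return {"location": ":memory:"}


def make_client(settings: dict):
    """Create the client lazily so this module imports without qdrant-client."""
    from qdrant_client import QdrantClient

    return QdrantClient(**settings)
```

Calling `make_client(resolve_qdrant_settings())` would then connect to the cloud instance when the variables are exported and to a throwaway in-memory instance otherwise.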
## 🏗️ Architecture

The example consists of three main components:

### 1. DocumentProcessor
- Handles file discovery and text extraction
- Supports PDF, TXT, and Markdown formats
- Concurrent processing for large document collections
- Error handling and validation

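
The discovery and concurrent-loading steps listed above could look roughly like this, using only the standard library. It is a sketch under stated assumptions: it reads `.txt` and `.md` files only (PDF extraction needs a third-party parser and is left out), and the names `discover_files`, `read_text`, and `load_documents` are illustrative, not the actual `DocumentProcessor` API.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Dict, Iterable, List, Tuple

TEXT_EXTENSIONS = (".txt", ".md")  # assumption: plain-text formats only


def discover_files(root: str, extensions: Iterable[str] = TEXT_EXTENSIONS) -> List[Path]:
    """Recursively collect files under root whose suffix is supported."""
    wanted = {ext.lower() for ext in extensions}
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in wanted
    )


def read_text(path: Path) -> str:
    """Read one file as UTF-8, replacing undecodable bytes."""
    return path.read_text(encoding="utf-8", errors="replace")


def load_documents(root: str, max_workers: int = 4) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Read all supported files concurrently; return (documents, failures)."""
    documents: Dict[str, str] = {}
    failures: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(read_text, p): p for p in discover_files(root)}
        for future, path in futures.items():
            try:
                text = future.result()
            except OSError as exc:
                # One unreadable file should not abort the whole batch.
                failures[str(path)] = str(exc)
                continue
            if text.strip():
                documents[str(path)] = text
            else:
                failures[str(path)] = "empty document"
    return documents, failures
```

Returning failures alongside the loaded documents keeps one bad file from stopping a large ingestion run while still reporting what was skipped.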
### 2. QdrantRAGMemory
- Vector database management with Qdrant
- Intelligent text chunking with overlap
- Semantic search capabilities
- Metadata storage and retrieval

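
To illustrate what chunking with overlap means, here is a minimal character-based sketch. It is a plain sliding window and may differ from the splitting logic `QdrantRAGMemory` actually uses; the function name `chunk_text` is hypothetical, and the defaults mirror the `chunk_size` and `chunk_overlap` defaults documented in this README.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> List[str]:
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # distance between chunk starts
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

Each chunk starts `chunk_size - chunk_overlap` characters after the previous one, so the tail of one chunk is repeated at the head of the next and a sentence cut at a boundary still appears whole in at least one chunk.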
### 3. QuantitativeTradingRAGAgent
- Combines Swarms Agent with RAG capabilities
- Financial analysis specialization
- Document context enhancement
- Query processing and response generation

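
Document context enhancement generally means prepending retrieved chunks to the task before it reaches the LLM. The sketch below shows one way to do that. The function `build_prompt` and the prompt layout are assumptions, not the agent's actual implementation; the result keys (`document_name`, `similarity_score`, `chunk_text`) are the ones used in the Querying Documents example in this README.

```python
from typing import Dict, List


def build_prompt(task: str, results: List[Dict], max_chars_per_chunk: int = 500) -> str:
    """Prepend the highest-scoring retrieved chunks to the task text."""
    if not results:
        return task  # nothing retrieved: send the task unchanged
    ordered = sorted(results, key=lambda r: r["similarity_score"], reverse=True)
    blocks = []
    for i, r in enumerate(ordered, start=1):
        excerpt = r["chunk_text"][:max_chars_per_chunk]
        blocks.append(
            f"[{i}] {r['document_name']} (score {r['similarity_score']:.3f})\n{excerpt}"
        )
    context = "\n\n".join(blocks)
    return f"Context from ingested documents:\n\n{context}\n\nTask: {task}"
```

Truncating each excerpt keeps the prompt bounded when many chunks are retrieved.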
## 📖 Usage

### Basic Setup

```python
from qdrant_rag_example import QuantitativeTradingRAGAgent

# Initialize the agent
agent = QuantitativeTradingRAGAgent(
    agent_name="Financial-Analysis-Agent",
    collection_name="financial_documents",
    model_name="claude-sonnet-4-20250514"
)
```

### Document Ingestion

```python
# Ingest documents from a directory
documents_path = "./financial_documents"
num_ingested = agent.ingest_documents(documents_path)
print(f"Ingested {num_ingested} documents")
```

### Querying Documents

```python
# Search for relevant information
results = agent.query_documents("gold ETFs investment strategies", limit=5)
for result in results:
    print(f"Document: {result['document_name']}")
    print(f"Relevance: {result['similarity_score']:.3f}")
    print(f"Content: {result['chunk_text'][:200]}...")
```

### Running Analysis

```python
# Run financial analysis with RAG context
task = "What are the top 3 ETFs for gold coverage?"
response = agent.run_analysis(task)
print(response)
```

## 📁 Directory Structure

```
financial_documents/
├── research_papers/
│   ├── gold_etf_analysis.pdf
│   ├── market_research.pdf
│   └── portfolio_strategies.pdf
├── company_reports/
│   ├── annual_reports.txt
│   └── quarterly_updates.md
└── market_data/
    ├── historical_prices.csv
    └── volatility_analysis.txt
```

## ⚙️ Configuration Options

### Agent Configuration
- `agent_name`: Name of the agent
- `collection_name`: Qdrant collection name
- `model_name`: LLM model to use
- `max_loops`: Maximum agent execution loops
- `chunk_size`: Text chunk size (default: 1000)
- `chunk_overlap`: Overlap between chunks (default: 200)

### Document Processing
- `supported_extensions`: File types to process
- `max_workers`: Concurrent processing threads
- `score_threshold`: Similarity search threshold

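
As an illustration of how a `score_threshold` typically interacts with a result limit, the sketch below drops low-scoring hits before truncating to the top results. The function name `filter_results` and the default of `0.7` are assumptions; this README does not state the example's real default or whether filtering happens client-side or inside Qdrant.

```python
from typing import Dict, List


def filter_results(results: List[Dict], score_threshold: float = 0.7, limit: int = 5) -> List[Dict]:
    """Keep hits at or above the threshold, best first, capped at limit."""
    kept = [r for r in results if r["similarity_score"] >= score_threshold]
    kept.sort(key=lambda r: r["similarity_score"], reverse=True)
    return kept[:limit]
```

A higher threshold returns fewer but more relevant chunks; a threshold that is too high can return nothing, so callers should handle an empty list.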
## 🔍 Advanced Features

### Custom Embedding Models
```python
# Use different sentence transformer models
from sentence_transformers import SentenceTransformer

custom_model = SentenceTransformer("all-mpnet-base-v2")
# Update the embedding model in QdrantRAGMemory
```

### Cloud Deployment
```python
# Connect to Qdrant cloud
agent = QuantitativeTradingRAGAgent(
    qdrant_url="https://your-instance.qdrant.io",
    qdrant_api_key="your_api_key"
)
```

### Batch Processing
```python
# Process multiple directories
directories = ["./docs1", "./docs2", "./docs3"]
for directory in directories:
    agent.ingest_documents(directory)
```

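
When ingesting several directories, it can help to record per-directory counts and keep going if one directory fails. The sketch below assumes, as in the Document Ingestion example, that `ingest_documents` returns the number of ingested documents. The helper `ingest_all` is hypothetical, and catching a broad `Exception` is a simplification because the exceptions the agent raises are not documented here.

```python
from typing import Dict, Iterable, Tuple


def ingest_all(agent, directories: Iterable[str]) -> Tuple[Dict[str, int], Dict[str, str]]:
    """Ingest each directory; return (counts per directory, errors per directory)."""
    counts: Dict[str, int] = {}
    errors: Dict[str, str] = {}
    for directory in directories:
        try:
            counts[directory] = agent.ingest_documents(directory)
        except Exception as exc:  # assumption: error types are not documented
            errors[directory] = repr(exc)
    return counts, errors
```

For example, `counts, errors = ingest_all(agent, ["./docs1", "./docs2", "./docs3"])` followed by `sum(counts.values())` gives the total number of documents ingested.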
## 📊 Performance Considerations

- **Chunk Size**: Use larger chunks (1000-2000 characters) for detailed analysis and smaller chunks (500-1000 characters) for precise retrieval
- **Overlap**: Keep 10-20% overlap between chunks for better context continuity
- **Concurrency**: Adjust `max_workers` based on your system capabilities
- **Vector Size**: Must match the embedding model's output dimension, for example 384 for `all-MiniLM-L6-v2`, 768 for `all-mpnet-base-v2`, and 1536 for OpenAI `text-embedding-3-small`

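
To see how chunk size and overlap affect the number of vectors stored, the following sketch estimates the chunk count for a document of a given length. It assumes a simple character-based sliding window, which may not match the example's splitter exactly; `estimate_chunk_count` is a hypothetical helper.

```python
import math


def estimate_chunk_count(n_chars: int, chunk_size: int = 1000, chunk_overlap: int = 200) -> int:
    """Estimate how many sliding-window chunks a document of n_chars yields."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    if n_chars <= 0:
        return 0
    if n_chars <= chunk_size:
        return 1
    step = chunk_size - chunk_overlap  # new characters covered per extra chunk
    return 1 + math.ceil((n_chars - chunk_size) / step)


# A 100,000-character document with the documented defaults (20% overlap)
print(estimate_chunk_count(100_000))            # 125
# Halving the chunk size at the same 20% overlap doubles the count
print(estimate_chunk_count(100_000, 500, 100))  # 250
```

Smaller chunks give more precise hits but increase the number of embeddings to compute and store, which is the trade-off behind the chunk size guidance above.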
## 🚨 Error Handling

The system includes comprehensive error handling for:
- File not found errors
- Unsupported file types
- Processing failures
- Network connectivity issues
- Invalid document content

## 🔧 Troubleshooting

### Common Issues

1. **Import Errors**: Ensure all dependencies are installed
```bash
pip install -r requirements.txt
```

2. **Memory Issues**: Reduce chunk size or use cloud Qdrant
```python
agent = QuantitativeTradingRAGAgent(chunk_size=500)
```

3. **Processing Failures**: Check file permissions and formats
```python
# Verify supported formats
processor = DocumentProcessor(supported_extensions=['.pdf', '.txt'])
```

### Performance Optimization

- Use SSD storage for document processing
- Increase `max_workers` for multi-core systems
- Consider cloud Qdrant for large document collections
- Implement document caching for frequently accessed files

## 📈 Use Cases

- **Financial Research**: Analyze market reports, earnings calls, and research papers
- **Legal Document Review**: Process contracts, regulations, and case law
- **Academic Research**: Index research papers and academic literature
- **Compliance Monitoring**: Track regulatory changes and compliance requirements
- **Risk Assessment**: Analyze risk reports and market analysis

## 🤝 Contributing

To extend this example:
1. Add support for additional file formats
2. Implement custom embedding strategies
3. Add document versioning and change tracking
4. Integrate with other vector databases
5. Add document summarization capabilities

## 📄 License

This example is part of the Swarms framework and follows the same licensing terms.

## 🆘 Support

For issues and questions:
- Check the Swarms documentation
- Review the example code and error messages
- Ensure all dependencies are properly installed
- Verify Qdrant connection and configuration