# Weaviate Local RAG Integration with Swarms

## Overview

Weaviate Local is a self-hosted deployment of the Weaviate vector database that runs on your own infrastructure. It exposes the same GraphQL API, multi-modal capabilities, and AI integrations as Weaviate Cloud, while giving you full control over data, deployment, and customization. It suits organizations that need data sovereignty, custom configurations, or air-gapped deployments without giving up enterprise-grade vector search.

## Key Features

- **Self-Hosted Control**: Full ownership of data and infrastructure
- **GraphQL API**: Flexible query language for complex data operations
- **Multi-Modal Support**: Built-in support for text, images, and other data types
- **Custom Modules**: Extensible architecture with custom vectorization modules
- **Docker Deployment**: Easy containerized deployment and scaling
- **Schema Flexibility**: Dynamic schema with automatic type inference
- **Hybrid Search**: Combine vector similarity with keyword search
- **Real-time Updates**: Live data updates without service interruption

## Architecture

Weaviate Local integrates with Swarms agents as a self-hosted, customizable vector database:

```
[Agent] -> [Weaviate Local Memory] -> [Local GraphQL + Vector Engine] -> [Custom Results] -> [Retrieved Context]
```

The system provides full control over the deployment environment while maintaining Weaviate's advanced search capabilities.
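
As a rough illustration of the last two stages of that flow (results to retrieved context), the sketch below turns a search result into a prompt-ready string. The helper name `build_context`, its `min_score` and `max_chars` parameters, and the `[title] text` layout are assumptions made for this example; the input shape (`documents`, `metadata`, `scores`) mirrors what the `search` method in the Code Example section returns.

```python
from typing import Any, Dict, List


def build_context(results: Dict[str, List[Any]],
                  min_score: float = 0.0,
                  max_chars: int = 2000) -> str:
    """Join retrieved passages into one context string, best-scoring first.

    Hypothetical helper: not part of Swarms or the Weaviate client.
    """
    ranked = sorted(
        zip(
            results.get("documents", []),
            results.get("metadata", []),
            results.get("scores", []),
        ),
        key=lambda row: row[2],
        reverse=True,
    )

    parts: List[str] = []
    used = 0
    for text, meta, score in ranked:
        if score < min_score:
            continue  # drop weak matches
        entry = f"[{meta.get('title') or 'untitled'}] {text}"
        if used + len(entry) > max_chars:
            break  # stay within the character budget
        parts.append(entry)
        used += len(entry)
    return "\n".join(parts)
```

A score threshold and a size budget are two simple ways to keep low-quality or oversized context out of the agent prompt; suitable values depend on the embedding model and the LLM's context window.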

## Setup & Configuration

### Installation

```bash
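# Docker installation (recommended); the tag matches the Compose file below
docker pull semitechnologies/weaviate:1.22.4

# Python client (the example below uses the v3 client API, i.e. weaviate.Client)
pip install "weaviate-client<4"
pip install swarms
pip install litellm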
```

### Docker Deployment

```yaml
# docker-compose.yml
version: '3.4'
services:
  weaviate:
    command:
    - --host
    - 0.0.0.0
    - --port
    - '8080'
    - --scheme
    - http
    image: semitechnologies/weaviate:1.22.4
    ports:
    - "8080:8080"
    restart: on-failure:0
    environment:
      QUERY_DEFAULTS_LIMIT: 25
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
      PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
      DEFAULT_VECTORIZER_MODULE: 'none'
      ENABLE_MODULES: 'text2vec-openai,text2vec-cohere,text2vec-huggingface'
      CLUSTER_HOSTNAME: 'node1'
    volumes:
    - weaviate_data:/var/lib/weaviate
volumes:
  weaviate_data:
```
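
After `docker compose up -d`, the container can take a few seconds before it accepts requests. The following is a minimal sketch of a startup wait that polls Weaviate's readiness endpoint (`/v1/.well-known/ready`) using only the standard library. The function name `wait_until_ready`, its defaults, and the injectable `probe` argument are assumptions made for this example.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, Optional


def wait_until_ready(base_url: str = "http://localhost:8080",
                     timeout_s: float = 60.0,
                     interval_s: float = 2.0,
                     probe: Optional[Callable[[str], bool]] = None) -> bool:
    """Poll the readiness endpoint until it answers 200 or the timeout expires."""
    ready_url = base_url.rstrip("/") + "/v1/.well-known/ready"

    def http_probe(url: str) -> bool:
        # True only for HTTP 200; refused connections and HTTP errors mean "not yet".
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                return resp.status == 200
        except (urllib.error.URLError, OSError):
            return False

    check = probe or http_probe  # `probe` lets callers swap in a stub
    deadline = time.monotonic() + timeout_s
    while True:
        if check(ready_url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

Calling `wait_until_ready()` before creating the client avoids a connection error while the container is still starting.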

### Environment Variables

```bash
# Local Weaviate connection
export WEAVIATE_URL="http://localhost:8080"

# Optional: Authentication (if enabled)
export WEAVIATE_USERNAME="admin"
export WEAVIATE_PASSWORD="password"

# API keys for built-in modules
export OPENAI_API_KEY="your-openai-key"
export COHERE_API_KEY="your-cohere-key"
export HUGGINGFACE_API_KEY="your-hf-key"
```
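
One way to consume these variables is to collect them into a single settings object at startup. The sketch below is an assumption about how that could look: `WeaviateSettings` and `load_settings` are hypothetical names, and the `auth_config` dictionary follows the shape that `WeaviateLocalMemory` in the Code Example section expects. The header names are the ones Weaviate's vectorizer modules read API keys from.

```python
import os
from dataclasses import dataclass
from typing import Dict, Mapping, Optional


@dataclass
class WeaviateSettings:
    url: str
    auth_config: Optional[Dict[str, str]]
    module_headers: Dict[str, str]


def load_settings(env: Optional[Mapping[str, str]] = None) -> WeaviateSettings:
    """Build connection settings from environment variables (hypothetical helper)."""
    env = os.environ if env is None else env
    url = env.get("WEAVIATE_URL", "http://localhost:8080")

    # Credentials are optional; only build auth_config when both are present.
    username = env.get("WEAVIATE_USERNAME")
    password = env.get("WEAVIATE_PASSWORD")
    auth_config = None
    if username and password:
        auth_config = {
            "type": "username_password",
            "username": username,
            "password": password,
        }

    # Map provider API keys to the request headers Weaviate modules expect.
    header_names = {
        "OPENAI_API_KEY": "X-OpenAI-Api-Key",
        "COHERE_API_KEY": "X-Cohere-Api-Key",
        "HUGGINGFACE_API_KEY": "X-HuggingFace-Api-Key",
    }
    module_headers = {
        header: env[name] for name, header in header_names.items() if env.get(name)
    }
    return WeaviateSettings(url, auth_config, module_headers)
```

Passing a plain dictionary as `env` makes the loader easy to exercise in tests without touching the real environment.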

## Code Example

```python
"""
Weaviate Local RAG Integration with Swarms Agent

This example demonstrates how to integrate self-hosted Weaviate as a customizable
vector database for RAG operations with full local control.
"""

import os
import uuid
from datetime import datetime
from typing import List, Dict, Any, Optional

import weaviate
from litellm import embedding
from swarms import Agent

class WeaviateLocalMemory:
    """Weaviate Local-based memory system for RAG operations"""

    def __init__(self,
                 url: str = "http://localhost:8080",
                 class_name: str = "LocalDocument",
                 embedding_model: str = "text-embedding-3-small",
                 use_builtin_vectorization: bool = False,
                 auth_config: Optional[Dict] = None):
        """
        Initialize Weaviate Local memory system

        Args:
            url: Weaviate server URL
            class_name: Name of the Weaviate class
            embedding_model: LiteLLM embedding model name
            use_builtin_vectorization: Use Weaviate's built-in vectorization
            auth_config: Authentication configuration
        """
        self.url = url
        self.class_name = class_name
        self.embedding_model = embedding_model
        self.use_builtin_vectorization = use_builtin_vectorization

        # Initialize client
        self.client = self._create_client(auth_config)

        # Create schema
        self._create_schema()

    def _create_client(self, auth_config: Optional[Dict] = None):
        """Create Weaviate local client"""
        client_config = {"url": self.url}

        if auth_config:
            if auth_config.get("type") == "api_key":
                client_config["auth_client_secret"] = weaviate.AuthApiKey(
                    api_key=auth_config["api_key"]
                )
            elif auth_config.get("type") == "username_password":
                client_config["auth_client_secret"] = weaviate.AuthClientPassword(
                    username=auth_config["username"],
                    password=auth_config["password"]
                )

        # Add API keys for modules
        additional_headers = {}
        if "OPENAI_API_KEY" in os.environ:
            additional_headers["X-OpenAI-Api-Key"] = os.environ["OPENAI_API_KEY"]

        if additional_headers:
            client_config["additional_headers"] = additional_headers

        client = weaviate.Client(**client_config)

        # Test connection
        try:
            client.schema.get()
            print(f"Connected to Weaviate Local: {self.url}")
        except Exception as e:
            raise ConnectionError(f"Failed to connect to Weaviate: {e}") from e

        return client

    def _create_schema(self):
        """Create Weaviate class schema"""
        schema = self.client.schema.get()
        existing_classes = [c["class"] for c in schema.get("classes", [])]

        if self.class_name in existing_classes:
            print(f"Class '{self.class_name}' already exists")
            return

        # Define comprehensive schema
        class_obj = {
            "class": self.class_name,
            "description": "Local document class for Swarms RAG operations",
            "vectorizer": "none" if not self.use_builtin_vectorization else "text2vec-openai",
            "properties": [
                {
                    "name": "text",
                    "dataType": ["text"],
                    "description": "Document content",
                    "indexFilterable": True,
                    "indexSearchable": True,
                    "tokenization": "word"
                },
                {
                    "name": "title",
                    "dataType": ["string"],
                    "description": "Document title",
                    "indexFilterable": True
                },
                {
                    "name": "category",
                    "dataType": ["string"],
                    "description": "Document category",
                    "indexFilterable": True
                },
                {
                    "name": "tags",
                    "dataType": ["string[]"],
                    "description": "Document tags",
                    "indexFilterable": True
                },
                {
                    "name": "author",
                    "dataType": ["string"],
                    "description": "Document author",
                    "indexFilterable": True
                },
                {
                    "name": "created_at",
                    "dataType": ["date"],
                    "description": "Creation date"
                },
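                {
                    "name": "metadata",
                    "dataType": ["object"],
                    "description": "Additional metadata",
                    # Object properties must declare their nested fields;
                    # these two match the sample documents below.
                    "nestedProperties": [
                        {"name": "difficulty", "dataType": ["text"]},
                        {"name": "topic", "dataType": ["text"]}
                    ]
                }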
            ]
        }

        self.client.schema.create_class(class_obj)
        print(f"Created local class '{self.class_name}'")

    def add_documents(self, documents: List[Dict]) -> List[str]:
        """Add documents with rich metadata to Weaviate Local"""
        doc_ids = []

        with self.client.batch as batch:
            batch.batch_size = 100

            for doc_data in documents:
                doc_id = str(uuid.uuid4())
                doc_ids.append(doc_id)

                # Prepare properties
                properties = {
                    "text": doc_data.get("text", ""),
                    "title": doc_data.get("title", ""),
                    "category": doc_data.get("category", ""),
                    "tags": doc_data.get("tags", []),
                    "author": doc_data.get("author", ""),
                    # Weaviate expects RFC 3339 dates, so include the UTC offset
                    "created_at": doc_data.get("created_at", datetime.now().astimezone().isoformat()),
                    "metadata": doc_data.get("metadata", {})
                }

                # Keys follow the v3 client's batch.add_data_object() parameters
                batch_obj = {
                    "data_object": properties,
                    "class_name": self.class_name,
                    "uuid": doc_id
                }

                # Add vector if using external embeddings
                if not self.use_builtin_vectorization:
                    text_content = doc_data.get("text", "")
                    if text_content:
                        embedding_vec = self._get_embeddings([text_content])[0]
                        batch_obj["vector"] = embedding_vec

                batch.add_data_object(**batch_obj)

        print(f"Added {len(documents)} documents to local Weaviate")
        return doc_ids

    def _get_embeddings(self, texts: List[str]) -> List[List[float]]:
        """Generate embeddings using LiteLLM"""
        response = embedding(model=self.embedding_model, input=texts)
        return [item["embedding"] for item in response["data"]]

    def search(self, query: str, limit: int = 3, **kwargs) -> Dict[str, Any]:
        """Search documents with flexible filtering"""
        # Build query
        query_builder = (
            self.client.query
            # Object properties need an explicit sub-selection in GraphQL
            .get(self.class_name, ["text", "title", "category", "tags", "author", "metadata { difficulty topic }"])
        )

        # Add vector search
        if self.use_builtin_vectorization:
            query_builder = query_builder.with_near_text({"concepts": [query]})
        else:
            query_embedding = self._get_embeddings([query])[0]
            query_builder = query_builder.with_near_vector({"vector": query_embedding})

        # Add optional filters
        if "where_filter" in kwargs:
            query_builder = query_builder.with_where(kwargs["where_filter"])

        # Execute query
        result = (
            query_builder
            .with_limit(limit)
            .with_additional(["certainty", "distance", "id"])
            .do()
        )

        # Format results
        formatted_results = {"documents": [], "metadata": [], "scores": [], "ids": []}

        if "data" in result and "Get" in result["data"]:
            for item in result["data"]["Get"].get(self.class_name, []):
                formatted_results["documents"].append(item.get("text", ""))

                # Combine all metadata
                metadata = {
                    "title": item.get("title", ""),
                    "category": item.get("category", ""),
                    "tags": item.get("tags", []),
                    "author": item.get("author", ""),
                    # Guard against a null metadata value in the response
                    **(item.get("metadata") or {})
                }
                formatted_results["metadata"].append(metadata)
                formatted_results["ids"].append(item["_additional"]["id"])

                score = item["_additional"].get("certainty", 0.0)
                formatted_results["scores"].append(float(score))

        return formatted_results

# Sample usage
memory = WeaviateLocalMemory(
    url="http://localhost:8080",
    class_name="SwarmsLocalKB",
    embedding_model="text-embedding-3-small"
)

# Add rich documents
documents = [
    {
        "text": "Weaviate Local provides full control over vector database deployment and data sovereignty.",
        "title": "Local Deployment Benefits",
        "category": "deployment",
        "tags": ["weaviate", "local", "control"],
        "author": "System",
        "metadata": {"difficulty": "intermediate", "topic": "infrastructure"}
    },
    {
        "text": "Self-hosted Weaviate enables custom configurations and air-gapped deployments for sensitive data.",
        "title": "Security and Compliance",
        "category": "security",
        "tags": ["security", "compliance", "air-gap"],
        "author": "Admin",
        "metadata": {"difficulty": "advanced", "topic": "security"}
    }
]

# Add documents, then create the agent
memory.add_documents(documents)

agent = Agent(
    agent_name="Local-Weaviate-Agent",
    agent_description="Agent with self-hosted Weaviate for private RAG operations",
    model_name="gpt-4o",
    max_loops=1,
    dynamic_temperature_enabled=True,
)

def query_local_rag(query: str, limit: int = 3) -> str:
    """Query local Weaviate with RAG"""
    results = memory.search(query, limit=limit)

    if not results["documents"]:
        return agent.run(query)

    context = "\n".join(results["documents"])

    enhanced_prompt = f"""
Based on this local knowledge base context:

{context}

Question: {query}

Provide a comprehensive answer using the context.
"""

    return agent.run(enhanced_prompt)

# Example usage
response = query_local_rag("What are the benefits of local Weaviate deployment?")
print(response)
```
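
The `search` method forwards an optional `where_filter` keyword argument to Weaviate's `with_where`, which takes a filter dictionary. The helpers below are a minimal sketch of how such filters could be built; `equal_filter` and `all_of` are hypothetical names introduced here, and the example assumes the filtered properties (`category`, `author`) are filterable text fields as defined in the schema above.

```python
from typing import Any, Dict


def equal_filter(prop: str, value: str) -> Dict[str, Any]:
    """Match objects whose text property `prop` equals `value`."""
    return {"path": [prop], "operator": "Equal", "valueText": value}


def all_of(*filters: Dict[str, Any]) -> Dict[str, Any]:
    """Combine several filters so that every one of them must match."""
    return {"operator": "And", "operands": list(filters)}


# Restrict results to security documents written by "Admin"
security_by_admin = all_of(
    equal_filter("category", "security"),
    equal_filter("author", "Admin"),
)
```

The resulting dictionary can be passed straight through, for example `memory.search("air-gapped deployments", limit=5, where_filter=security_by_admin)`.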

## Use Cases

### 1. **Data Sovereignty & Compliance**
- Government and healthcare organizations
- GDPR/HIPAA compliance requirements
- Sensitive data processing

### 2. **Air-Gapped Environments**
- Military and defense applications
- High-security research facilities
- Offline AI systems

### 3. **Custom Infrastructure**
- Specific hardware requirements
- Custom networking configurations
- Specialized security measures

### 4. **Development & Testing**
- Local development environments
- CI/CD integration
- Performance testing

## Deployment Options

### Docker Compose
```yaml
version: '3.4'
services:
  weaviate:
    image: semitechnologies/weaviate:1.22.4
    restart: on-failure:0
    ports:
      - "8080:8080"
    environment:
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
      PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
      DEFAULT_VECTORIZER_MODULE: 'none'
      ENABLE_MODULES: 'text2vec-openai,backup-filesystem'
      # Required by the backup-filesystem module
      BACKUP_FILESYSTEM_PATH: '/var/lib/weaviate/backups'
    volumes:
      - ./weaviate_data:/var/lib/weaviate
```

### Kubernetes
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: weaviate
spec:
  replicas: 1
  selector:
    matchLabels:
      app: weaviate
  template:
    metadata:
      labels:
        app: weaviate
    spec:
      containers:
      - name: weaviate
        image: semitechnologies/weaviate:1.22.4
        ports:
        - containerPort: 8080
        env:
        - name: PERSISTENCE_DATA_PATH
          value: '/var/lib/weaviate'
        volumeMounts:
        - name: weaviate-storage
          mountPath: /var/lib/weaviate
      volumes:
      - name: weaviate-storage
        persistentVolumeClaim:
          claimName: weaviate-pvc
```

## Best Practices

1. **Resource Planning**: Allocate sufficient memory and storage for your dataset
2. **Backup Strategy**: Implement regular backups using Weaviate's backup modules
3. **Monitoring**: Set up health checks and performance monitoring
4. **Security**: Configure authentication and network security appropriately
5. **Scaling**: Plan for horizontal scaling with clustering if needed
6. **Updates**: Establish update procedures for Weaviate versions
7. **Data Migration**: Plan migration strategies for schema changes
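
For the backup strategy in item 2, a backup can be started through Weaviate's REST endpoint `POST /v1/backups/{backend}`. The sketch below only builds the request and assumes the `backup-filesystem` module is enabled and `BACKUP_FILESYSTEM_PATH` is set on the server. The function name `build_backup_request` and the example backup ID are assumptions made for illustration.

```python
import json
import urllib.request
from typing import List, Optional


def build_backup_request(backup_id: str,
                         include_classes: Optional[List[str]] = None,
                         base_url: str = "http://localhost:8080",
                         backend: str = "filesystem") -> urllib.request.Request:
    """Build (but do not send) the HTTP request that starts a backup."""
    payload = {"id": backup_id}
    if include_classes:
        # Omitting "include" backs up every class
        payload["include"] = list(include_classes)
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/backups/{backend}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical backup ID; sending it would be urllib.request.urlopen(request)
request = build_backup_request("kb-backup-001", include_classes=["SwarmsLocalKB"])
```

Keeping request construction separate from sending makes the call easy to schedule from a cron job or CI step and to verify without a running server.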

This guide covers the essentials of deploying and integrating Weaviate Local with Swarms agents for private, self-controlled RAG applications.