From ee43b2ebb6189c27573c88decaf8d1e2fa35a6c1 Mon Sep 17 00:00:00 2001
From: Kye Gomez
Date: Fri, 12 Jul 2024 10:58:27 -0700
Subject: [PATCH] [DOCS][FAISS]

---
 docs/swarms_memory/faiss.md | 232 ++++++++++++++++++++++++++++++++++++
 1 file changed, 232 insertions(+)
 create mode 100644 docs/swarms_memory/faiss.md

diff --git a/docs/swarms_memory/faiss.md b/docs/swarms_memory/faiss.md
new file mode 100644
index 00000000..d4c143f5
--- /dev/null
+++ b/docs/swarms_memory/faiss.md
@@ -0,0 +1,232 @@
+# FAISSDB: Documentation
+
+The `FAISSDB` class is a highly customizable wrapper for the FAISS (Facebook AI Similarity Search) library, designed for efficient similarity search and clustering of dense vectors. This class facilitates the creation of a Retrieval-Augmented Generation (RAG) system by providing methods to add documents to a FAISS index and query the index for similar documents. It supports custom embedding models, preprocessing functions, and other customizations to fit various use cases.
+
+
+### Parameters
+
+| Parameter | Type | Default | Description |
+|------------------------|--------------------------------------------------|-------------------------------|-----------------------------------------------------------------------------|
+| `dimension` | `int` | `768` | Dimension of the document embeddings. |
+| `index_type` | `str` | `'Flat'` | Type of FAISS index to use (`'Flat'` or `'IVF'`). |
+| `embedding_model` | `Optional[Any]` | `None` | Custom embedding model. |
+| `embedding_function` | `Optional[Callable[[str], List[float]]]` | `None` | Custom function to generate embeddings from text. |
+| `preprocess_function` | `Optional[Callable[[str], str]]` | `None` | Custom function to preprocess text before embedding. |
+| `postprocess_function` | `Optional[Callable[[List[Dict[str, Any]]], List[Dict[str, Any]]]]` | `None` | Custom function to postprocess the results. |
+| `metric` | `str` | `'cosine'` | Distance metric for FAISS index (`'cosine'` or `'l2'`). |
+| `logger_config` | `Optional[Dict[str, Any]]` | `None` | Configuration for the logger. |
+
+## Methods
+
+### `__init__`
+
+Initializes the FAISSDB instance, setting up the logger, creating the FAISS index, and configuring custom functions if provided.
+
+### `add`
+
+Adds a document to the FAISS index.
+
+#### Parameters
+
+| Parameter | Type | Default | Description |
+|-----------|-------------------------|---------|-------------------------------------------------|
+| `doc` | `str` | None | The document to be added. |
+| `metadata`| `Optional[Dict[str, Any]]` | None | Additional metadata for the document. |
+
+#### Example Usage
+
+```python
+db = FAISSDB(dimension=768)
+db.add("This is a sample document.", {"category": "sample"})
+```
+
+### `query`
+
+Queries the FAISS index for similar documents.
+
+#### Parameters
+
+| Parameter | Type | Default | Description |
+|-----------|------|---------|-------------|
+| `query` | `str` | None | The query string. |
+| `top_k` | `int` | `5` | The number of top results to return. |
+
+#### Returns
+
+| Type | Description |
+|------|-------------|
+| `List[Dict[str, Any]]` | A list of dictionaries containing the top_k most similar documents. |
+
+#### Example Usage
+
+```python
+results = db.query("What is artificial intelligence?")
+for result in results:
+    print(f"Score: {result['score']}, Text: {result['metadata']['text']}")
+```
+
+## Internal Methods
+
+### `_setup_logger`
+
+Sets up the logger with the given configuration.
+
+#### Parameters
+
+| Parameter | Type | Default | Description |
+|-----------|-------------------------|---------|------------------------------------------|
+| `config` | `Optional[Dict[str, Any]]` | None | Configuration for the logger. |
+
+### `_create_index`
+
+Creates and returns a FAISS index based on the specified type and metric.
+
+#### Parameters
+
+| Parameter | Type | Default | Description |
+|-----------|-------|---------|----------------------------------------------|
+| `index_type` | `str` | `'Flat'` | Type of FAISS index to use. |
+| `metric` | `str` | `'cosine'` | Distance metric for FAISS index. |
+
+#### Returns
+
+| Type | Description |
+|------|------------------|
+| `faiss.Index` | FAISS index instance. |
+
+### `_default_embedding_function`
+
+Default embedding function using the SentenceTransformer model.
+
+#### Parameters
+
+| Parameter | Type | Default | Description |
+|-----------|------|---------|----------------------|
+| `text` | `str` | None | The input text to embed. |
+
+#### Returns
+
+| Type | Description |
+|------|-------------------|
+| `List[float]` | Embedding vector for the input text. |
+
+### `_default_preprocess_function`
+
+Default preprocessing function.
+
+#### Parameters
+
+| Parameter | Type | Default | Description |
+|-----------|------|---------|--------------------|
+| `text` | `str` | None | The input text to preprocess. |
+
+#### Returns
+
+| Type | Description |
+|------|------------------|
+| `str` | Preprocessed text. |
+
+### `_default_postprocess_function`
+
+Default postprocessing function.
+
+#### Parameters
+
+| Parameter | Type | Default | Description |
+|-----------|------|---------|--------------------------------|
+| `results` | `List[Dict[str, Any]]` | None | The results to postprocess. |
+
+#### Returns
+
+| Type | Description |
+|------|--------------------------|
+| `List[Dict[str, Any]]` | Postprocessed results. |
+
+## Usage Examples
+
+### Example 1: Basic Usage
+
+```python
+# Initialize the FAISSDB instance
+db = FAISSDB(dimension=768, index_type="Flat")
+
+# Add documents to the FAISS index
+db.add("This is a document about AI.", {"category": "AI"})
+db.add("Python is great for data science.", {"category": "Programming"})
+
+# Query the FAISS index
+results = db.query("Tell me about AI")
+for result in results:
+    print(f"Score: {result['score']}, Text: {result['metadata']['text']}")
+```
+
+### Example 2: Custom Functions
+
+```python
+from typing import Any, Dict, List
+
+from transformers import AutoTokenizer, AutoModel
+import torch
+
+# Custom embedding function using a HuggingFace model
+def custom_embedding_function(text: str) -> List[float]:
+    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
+    model = AutoModel.from_pretrained("bert-base-uncased")
+    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)
+    with torch.no_grad():
+        outputs = model(**inputs)
+    embeddings = outputs.last_hidden_state.mean(dim=1).squeeze().tolist()
+    return embeddings
+
+# Custom preprocessing function
+def custom_preprocess(text: str) -> str:
+    return text.lower().strip()
+
+# Custom postprocessing function
+def custom_postprocess(results: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
+    for result in results:
+        result["custom_score"] = result["score"] * 2  # Example modification
+    return results
+
+# Initialize the FAISSDB instance with custom functions
+db = FAISSDB(
+    dimension=768,
+    index_type="Flat",
+    embedding_function=custom_embedding_function,
+    preprocess_function=custom_preprocess,
+    postprocess_function=custom_postprocess,
+    metric="cosine",
+    logger_config={
+        "handlers": [
+            {"sink": "custom_faiss_rag_wrapper.log", "rotation": "1 GB"},
+            {"sink": lambda msg: print(f"Custom log: {msg}", end="")}
+        ],
+    },
+)
+
+# Add documents to the FAISS index
+db.add("This is a document about machine learning.", {"category": "ML"})
+db.add("Python is a versatile programming language.", {"category": "Programming"})
+
+# Query the FAISS index
+results = db.query("Explain machine learning")
+for result in results:
+    print(f"Score: {result['score']}, Custom Score: {result['custom_score']}, Text: {result['metadata']['text']}")
+```
+
+## Additional Information and Tips
+
+- Ensure that the dimension of the document embeddings matches the `dimension` passed when initializing the FAISSDB instance (768 for `bert-base-uncased`, as in Example 2).
+- Use custom embedding functions to leverage domain-specific models for generating embeddings.
+- Custom preprocessing and postprocessing functions can help tailor the text processing and result formatting to specific needs.
+- FAISS supports various types of indices; choose the one that best fits the application requirements (e.g., `Flat` for brute-force search, `IVF` for faster search with some accuracy trade-off).
+- Properly configure the logger to monitor and debug the operations of the FAISSDB instance.
+
+## References and Resources
+
+- [FAISS GitHub Repository](https://github.com/facebookresearch/faiss)
+- [Sentence Transformers Documentation](https://www.sbert.net/)
+- [Loguru Documentation](https://loguru.readthedocs.io/en/stable/)
+- [HuggingFace Transformers](https://huggingface.co/transformers/)
+
+By following this documentation, users can effectively utilize the `FAISSDB` class for various similarity search and document retrieval tasks, customizing it to their specific needs through the provided hooks and functions.
\ No newline at end of file
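As a reading aid for the `add`/`query` flow and the tips documented in this patch, here is a minimal, standard-library-only sketch of how the three hooks (preprocess, embed, postprocess) and the dimension check might fit together around a brute-force "Flat" cosine search. It does not use FAISS and is not the `FAISSDB` implementation: `TinyFlatStore`, `toy_embedding`, and `normalize` are hypothetical names introduced only for illustration. The result shape (`score`, `metadata["text"]`) mirrors the usage examples in the documentation.

```python
import math
from typing import Any, Callable, Dict, List, Optional

Vector = List[float]


def toy_embedding(text: str, dimension: int = 8) -> Vector:
    """Hypothetical stand-in for a real model: bucket character codes into a fixed-size vector."""
    vec = [0.0] * dimension
    for ch in text:
        vec[ord(ch) % dimension] += 1.0
    return vec


def normalize(vec: Vector) -> Vector:
    """Scale to unit length so that the inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


class TinyFlatStore:
    """Illustrative brute-force ("Flat") store; an assumption-laden sketch, not FAISSDB."""

    def __init__(
        self,
        dimension: int,
        embedding_function: Callable[[str], Vector],
        preprocess_function: Optional[Callable[[str], str]] = None,
        postprocess_function: Optional[
            Callable[[List[Dict[str, Any]]], List[Dict[str, Any]]]
        ] = None,
    ) -> None:
        self.dimension = dimension
        self.embedding_function = embedding_function
        # Fall back to identity hooks when no custom function is supplied.
        self.preprocess_function = preprocess_function or (lambda text: text)
        self.postprocess_function = postprocess_function or (lambda results: results)
        self.vectors: List[Vector] = []
        self.metadata: List[Dict[str, Any]] = []

    def _embed(self, text: str) -> Vector:
        vec = self.embedding_function(self.preprocess_function(text))
        # The embedding size must match the dimension the store was created with.
        if len(vec) != self.dimension:
            raise ValueError(
                f"embedding has {len(vec)} dimensions, expected {self.dimension}"
            )
        return normalize(vec)

    def add(self, doc: str, metadata: Optional[Dict[str, Any]] = None) -> None:
        entry = dict(metadata or {})
        entry["text"] = doc  # keep the original text alongside the user metadata
        self.vectors.append(self._embed(doc))
        self.metadata.append(entry)

    def query(self, query: str, top_k: int = 5) -> List[Dict[str, Any]]:
        q = self._embed(query)
        # Exhaustive scan: score every stored vector against the query.
        scored = [
            {"score": sum(a * b for a, b in zip(q, v)), "metadata": m}
            for v, m in zip(self.vectors, self.metadata)
        ]
        scored.sort(key=lambda r: r["score"], reverse=True)
        return self.postprocess_function(scored[:top_k])


store = TinyFlatStore(
    dimension=8,
    embedding_function=toy_embedding,
    preprocess_function=lambda text: text.lower().strip(),
)
store.add("This is a document about AI.", {"category": "AI"})
store.add("Python is great for data science.", {"category": "Programming"})

results = store.query("this is a document about ai.", top_k=1)
for result in results:
    print(f"Score: {result['score']}, Text: {result['metadata']['text']}")
```

Normalizing every vector before scoring is one common way to get cosine similarity from a plain inner product. The exhaustive scan corresponds to the `Flat` trade-off named in the tips: exact results, at a cost that grows linearly with the number of stored documents.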