# FAISSDB: Documentation
The `FAISSDB` class is a highly customizable wrapper for the FAISS (Facebook AI Similarity Search) library, designed for efficient similarity search and clustering of dense vectors. This class facilitates the creation of a Retrieval-Augmented Generation (RAG) system by providing methods to add documents to a FAISS index and query the index for similar documents. It supports custom embedding models, preprocessing functions, and other customizations to fit various use cases.
### Parameters
| Parameter | Type | Default | Description |
|------------------------|--------------------------------------------------|-------------------------------|-----------------------------------------------------------------------------|
| `dimension` | `int` | `768` | Dimension of the document embeddings. |
| `index_type` | `str` | `'Flat'` | Type of FAISS index to use (`'Flat'` or `'IVF'`). |
| `embedding_model` | `Optional[Any]` | `None` | Custom embedding model. |
| `embedding_function` | `Optional[Callable[[str], List[float]]]` | `None` | Custom function to generate embeddings from text. |
| `preprocess_function` | `Optional[Callable[[str], str]]` | `None` | Custom function to preprocess text before embedding. |
| `postprocess_function` | `Optional[Callable[[List[Dict[str, Any]]], List[Dict[str, Any]]]]` | `None` | Custom function to postprocess the results. |
| `metric` | `str` | `'cosine'` | Distance metric for FAISS index (`'cosine'` or `'l2'`). |
| `logger_config` | `Optional[Dict[str, Any]]` | `None` | Configuration for the logger. |
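
The `metric` parameter decides how closeness between two embeddings is measured. The sketch below is a plain-Python illustration of the two measures named in the table; it is not the `FAISSDB` implementation, and the helper names `l2_distance` and `cosine_similarity` are hypothetical. Note the opposite orientation: with L2 a smaller value means a closer match, while with cosine a larger value does.

```python
import math
from typing import Sequence


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between a and b: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between a and b: larger means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch only
    return dot / (norm_a * norm_b)


query = [1.0, 0.0]
same_direction = [2.0, 0.0]  # same direction as the query, different length
orthogonal = [0.0, 3.0]      # unrelated direction

print(cosine_similarity(query, same_direction))  # 1.0 (length is ignored)
print(l2_distance(query, same_direction))        # 1.0 (length matters)
print(cosine_similarity(query, orthogonal))      # 0.0
```

Cosine similarity ignores vector length, which is often what is wanted for text embeddings; L2 distance does not.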
## Methods
### `__init__`
Initializes the FAISSDB instance, setting up the logger, creating the FAISS index, and configuring custom functions if provided.
### `add`
Adds a document to the FAISS index.
#### Parameters
| Parameter | Type | Default | Description |
|-----------|-------------------------|---------|-------------------------------------------------|
| `doc` | `str` | Required | The document to be added. |
| `metadata`| `Optional[Dict[str, Any]]` | None | Additional metadata for the document. |
#### Example Usage
```python
db = FAISSDB(dimension=768)
db.add("This is a sample document.", {"category": "sample"})
```
### `query`
Queries the FAISS index for similar documents.
#### Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `query` | `str` | Required | The query string. |
| `top_k` | `int` | `5` | The number of top results to return. |
#### Returns
| Type | Description |
|------|-------------|
| `List[Dict[str, Any]]` | A list of dictionaries containing the top_k most similar documents. |
#### Example Usage
```python
results = db.query("What is artificial intelligence?")
for result in results:
    print(f"Score: {result['score']}, Text: {result['metadata']['text']}")
```
## Internal Methods
### `_setup_logger`
Sets up the logger with the given configuration.
#### Parameters
| Parameter | Type | Default | Description |
|-----------|-------------------------|---------|------------------------------------------|
| `config` | `Optional[Dict[str, Any]]` | None | Configuration for the logger. |
### `_create_index`
Creates and returns a FAISS index based on the specified type and metric.
#### Parameters
| Parameter | Type | Default | Description |
|-----------|-------|---------|----------------------------------------------|
| `index_type` | `str` | `'Flat'` | Type of FAISS index to use. |
| `metric` | `str` | `'cosine'` | Distance metric for FAISS index. |
#### Returns
| Type | Description |
|------|------------------|
| `faiss.Index` | FAISS index instance. |
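
To make the `'Flat'` index type concrete, the following pure-Python sketch shows what an exhaustive (brute-force) cosine search does conceptually: compare the query against every stored vector and keep the best `top_k`. It does not call FAISS, and `normalize` and `flat_cosine_search` are hypothetical names. A common way to get cosine scores from an inner-product index is to normalize vectors first, as done here; whether `_create_index` takes that route is not stated in this document.

```python
import math
from typing import List, Sequence, Tuple


def normalize(v: Sequence[float]) -> List[float]:
    """Scale v to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def flat_cosine_search(
    vectors: Sequence[Sequence[float]],
    query: Sequence[float],
    top_k: int = 5,
) -> List[Tuple[int, float]]:
    """Score every stored vector against the query and return the best top_k.

    Returns (position, score) pairs, highest score first.
    """
    q = normalize(query)
    scored = [
        (i, sum(a * b for a, b in zip(normalize(v), q)))
        for i, v in enumerate(vectors)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


stored = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hits = flat_cosine_search(stored, [1.0, 0.1], top_k=2)
print([position for position, _ in hits])  # [0, 2]
```

Every query touches every stored vector, so cost grows linearly with the collection size. That linear scan is the cost an `'IVF'` index reduces by searching only part of the collection.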
### `_default_embedding_function`
Default embedding function using the SentenceTransformer model.
#### Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|----------------------|
| `text` | `str` | Required | The input text to embed. |
#### Returns
| Type | Description |
|------|-------------------|
| `List[float]` | Embedding vector for the input text. |
### `_default_preprocess_function`
Default preprocessing function.
#### Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|--------------------|
| `text` | `str` | Required | The input text to preprocess. |
#### Returns
| Type | Description |
|------|------------------|
| `str` | Preprocessed text. |
### `_default_postprocess_function`
Default postprocessing function.
#### Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|--------------------------------|
| `results` | `List[Dict[str, Any]]` | Required | The results to postprocess. |
#### Returns
| Type | Description |
|------|--------------------------|
| `List[Dict[str, Any]]` | Postprocessed results. |
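
A postprocessing hook receives the list of result dictionaries and returns a list of the same shape. The sketch below shows one possible custom hook that drops weak matches and sorts the rest. The factory name `make_score_filter` and the sample data are hypothetical. The `score` and `metadata` keys follow the usage examples in this document, and the sketch assumes that a higher score means a closer match, which fits a cosine metric but may not hold for `'l2'`.

```python
from typing import Any, Callable, Dict, List

Result = Dict[str, Any]


def make_score_filter(min_score: float) -> Callable[[List[Result]], List[Result]]:
    """Build a postprocess function that keeps results scoring at least min_score."""

    def postprocess(results: List[Result]) -> List[Result]:
        kept = [r for r in results if r["score"] >= min_score]
        # Best match first; assumes higher score = more similar.
        return sorted(kept, key=lambda r: r["score"], reverse=True)

    return postprocess


# Hand-written sample results, shaped like the ones printed in the examples.
sample_results = [
    {"score": 0.42, "metadata": {"text": "Python is great for data science."}},
    {"score": 0.91, "metadata": {"text": "This is a document about AI."}},
]

keep_strong = make_score_filter(min_score=0.5)
filtered = keep_strong(sample_results)
print([r["metadata"]["text"] for r in filtered])  # ['This is a document about AI.']
```

A function built this way has the `Callable[[List[Dict[str, Any]]], List[Dict[str, Any]]]` signature that `postprocess_function` expects.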
## Usage Examples
### Example 1: Basic Usage
```python
# Initialize the FAISSDB instance
db = FAISSDB(dimension=768, index_type="Flat")
# Add documents to the FAISS index
db.add("This is a document about AI.", {"category": "AI"})
db.add("Python is great for data science.", {"category": "Programming"})
# Query the FAISS index
results = db.query("Tell me about AI")
for result in results:
    print(f"Score: {result['score']}, Text: {result['metadata']['text']}")
```
### Example 2: Custom Functions
```python
from typing import Any, Dict, List

import torch
from transformers import AutoModel, AutoTokenizer

# Load the tokenizer and model once, rather than on every embedding call
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Custom embedding function using a HuggingFace model
def custom_embedding_function(text: str) -> List[float]:
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean-pool the token embeddings into one 768-dimensional vector
    embeddings = outputs.last_hidden_state.mean(dim=1).squeeze().tolist()
    return embeddings

# Custom preprocessing function
def custom_preprocess(text: str) -> str:
    return text.lower().strip()

# Custom postprocessing function
def custom_postprocess(results: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    for result in results:
        result["custom_score"] = result["score"] * 2  # Example modification
    return results
# Initialize the FAISSDB instance with custom functions
db = FAISSDB(
    dimension=768,
    index_type="Flat",
    embedding_function=custom_embedding_function,
    preprocess_function=custom_preprocess,
    postprocess_function=custom_postprocess,
    metric="cosine",
    logger_config={
        "handlers": [
            {"sink": "custom_faiss_rag_wrapper.log", "rotation": "1 GB"},
            {"sink": lambda msg: print(f"Custom log: {msg}", end="")},
        ],
    },
)
# Add documents to the FAISS index
db.add("This is a document about machine learning.", {"category": "ML"})
db.add("Python is a versatile programming language.", {"category": "Programming"})
# Query the FAISS index
results = db.query("Explain machine learning")
for result in results:
    print(f"Score: {result['score']}, Custom Score: {result['custom_score']}, Text: {result['metadata']['text']}")
```
## Additional Information and Tips
- Ensure that the dimension of the document embeddings matches the dimension specified during the initialization of the FAISSDB instance.
- Use custom embedding functions to leverage domain-specific models for generating embeddings.
- Custom preprocessing and postprocessing functions can help tailor the text processing and result formatting to specific needs.
- FAISS supports various types of indices; choose the one that best fits the application requirements (e.g., `Flat` for brute-force search, `IVF` for faster search with some accuracy trade-off).
- Properly configure the logger to monitor and debug the operations of the FAISSDB instance.
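
The first tip, matching the embedding dimension to the `dimension` argument, can be enforced before any vector reaches the index. The following is a minimal sketch under stated assumptions: `with_dimension_check` and `toy_embedding` are hypothetical helpers that are not part of `FAISSDB`, and the toy function merely stands in for a real embedding model.

```python
from typing import Callable, List

EmbeddingFn = Callable[[str], List[float]]


def with_dimension_check(embed: EmbeddingFn, dimension: int) -> EmbeddingFn:
    """Wrap an embedding function so a size mismatch fails early and clearly."""

    def checked(text: str) -> List[float]:
        vector = embed(text)
        if len(vector) != dimension:
            raise ValueError(
                f"Embedding has {len(vector)} values, expected {dimension}"
            )
        return vector

    return checked


def toy_embedding(text: str) -> List[float]:
    """Stand-in for a real model: always returns a 3-dimensional vector."""
    return [float(len(text)), float(text.count(" ")), 1.0]


safe_embed = with_dimension_check(toy_embedding, dimension=3)
print(len(safe_embed("hello world")))  # 3

try:
    with_dimension_check(toy_embedding, dimension=768)("hello world")
except ValueError as error:
    print(error)  # Embedding has 3 values, expected 768
```

The wrapped function keeps the `Callable[[str], List[float]]` signature, so it could be supplied as `embedding_function` in place of the unwrapped one.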
## References and Resources
- [FAISS GitHub Repository](https://github.com/facebookresearch/faiss)
- [Sentence Transformers Documentation](https://www.sbert.net/)
- [Loguru Documentation](https://loguru.readthedocs.io/en/stable/)
- [HuggingFace Transformers](https://huggingface.co/transformers/)
By following this documentation, users can effectively utilize the `FAISSDB` class for various similarity search and document retrieval tasks, customizing it to their specific needs through the provided hooks and functions.