parent
c4b5b4c609
commit
ee43b2ebb6
@ -0,0 +1,232 @@
|
||||
# FAISSDB: Documentation
|
||||
|
||||
The `FAISSDB` class is a highly customizable wrapper for the FAISS (Facebook AI Similarity Search) library, designed for efficient similarity search and clustering of dense vectors. This class facilitates the creation of a Retrieval-Augmented Generation (RAG) system by providing methods to add documents to a FAISS index and query the index for similar documents. It supports custom embedding models, preprocessing functions, and other customizations to fit various use cases.
|
||||
|
||||
|
||||
### Parameters
|
||||
|
||||
| Parameter | Type | Default | Description |
|
||||
|------------------------|--------------------------------------------------|-------------------------------|-----------------------------------------------------------------------------|
|
||||
| `dimension` | `int` | `768` | Dimension of the document embeddings. |
|
||||
| `index_type` | `str` | `'Flat'` | Type of FAISS index to use (`'Flat'` or `'IVF'`). |
|
||||
| `embedding_model` | `Optional[Any]` | `None` | Custom embedding model. |
|
||||
| `embedding_function` | `Optional[Callable[[str], List[float]]]` | `None` | Custom function to generate embeddings from text. |
|
||||
| `preprocess_function` | `Optional[Callable[[str], str]]` | `None` | Custom function to preprocess text before embedding. |
|
||||
| `postprocess_function` | `Optional[Callable[[List[Dict[str, Any]]], List[Dict[str, Any]]]]` | `None` | Custom function to postprocess the results. |
|
||||
| `metric` | `str` | `'cosine'` | Distance metric for FAISS index (`'cosine'` or `'l2'`). |
|
||||
| `logger_config` | `Optional[Dict[str, Any]]` | `None` | Configuration for the logger. |
|
||||
|
||||
## Methods
|
||||
|
||||
### `__init__`
|
||||
|
||||
Initializes the FAISSDB instance, setting up the logger, creating the FAISS index, and configuring custom functions if provided.
|
||||
|
||||
### `add`
|
||||
|
||||
Adds a document to the FAISS index.
|
||||
|
||||
#### Parameters
|
||||
|
||||
| Parameter | Type | Default | Description |
|
||||
|-----------|-------------------------|---------|-------------------------------------------------|
|
||||
| `doc` | `str` | None | The document to be added. |
|
||||
| `metadata`| `Optional[Dict[str, Any]]` | None | Additional metadata for the document. |
|
||||
|
||||
#### Example Usage
|
||||
|
||||
```python
|
||||
db = FAISSDB(dimension=768)
|
||||
db.add("This is a sample document.", {"category": "sample"})
|
||||
```
|
||||
|
||||
### `query`
|
||||
|
||||
Queries the FAISS index for similar documents.
|
||||
|
||||
#### Parameters
|
||||
|
||||
| Parameter | Type | Default | Description |
|
||||
|-----------|------|---------|-------------|
|
||||
| `query` | `str` | None | The query string. |
|
||||
| `top_k` | `int` | `5` | The number of top results to return. |
|
||||
|
||||
#### Returns
|
||||
|
||||
| Type | Description |
|
||||
|------|-------------|
|
||||
| `List[Dict[str, Any]]` | A list of dictionaries containing the top_k most similar documents. |
|
||||
|
||||
#### Example Usage
|
||||
|
||||
```python
|
||||
results = db.query("What is artificial intelligence?")
|
||||
for result in results:
|
||||
print(f"Score: {result['score']}, Text: {result['metadata']['text']}")
|
||||
```
|
||||
|
||||
## Internal Methods
|
||||
|
||||
### `_setup_logger`
|
||||
|
||||
Sets up the logger with the given configuration.
|
||||
|
||||
#### Parameters
|
||||
|
||||
| Parameter | Type | Default | Description |
|
||||
|-----------|-------------------------|---------|------------------------------------------|
|
||||
| `config` | `Optional[Dict[str, Any]]` | None | Configuration for the logger. |
|
||||
|
||||
### `_create_index`
|
||||
|
||||
Creates and returns a FAISS index based on the specified type and metric.
|
||||
|
||||
#### Parameters
|
||||
|
||||
| Parameter | Type | Default | Description |
|
||||
|-----------|-------|---------|----------------------------------------------|
|
||||
| `index_type` | `str` | 'Flat' | Type of FAISS index to use. |
|
||||
| `metric` | `str` | 'cosine' | Distance metric for FAISS index. |
|
||||
|
||||
#### Returns
|
||||
|
||||
| Type | Description |
|
||||
|------|------------------|
|
||||
| `faiss.Index` | FAISS index instance. |
|
||||
|
||||
### `_default_embedding_function`
|
||||
|
||||
Default embedding function using the SentenceTransformer model.
|
||||
|
||||
#### Parameters
|
||||
|
||||
| Parameter | Type | Default | Description |
|
||||
|-----------|------|---------|----------------------|
|
||||
| `text` | `str` | None | The input text to embed. |
|
||||
|
||||
#### Returns
|
||||
|
||||
| Type | Description |
|
||||
|------|-------------------|
|
||||
| `List[float]` | Embedding vector for the input text. |
|
||||
|
||||
### `_default_preprocess_function`
|
||||
|
||||
Default preprocessing function.
|
||||
|
||||
#### Parameters
|
||||
|
||||
| Parameter | Type | Default | Description |
|
||||
|-----------|------|---------|--------------------|
|
||||
| `text` | `str` | None | The input text to preprocess. |
|
||||
|
||||
#### Returns
|
||||
|
||||
| Type | Description |
|
||||
|------|------------------|
|
||||
| `str` | Preprocessed text. |
|
||||
|
||||
### `_default_postprocess_function`
|
||||
|
||||
Default postprocessing function.
|
||||
|
||||
#### Parameters
|
||||
|
||||
| Parameter | Type | Default | Description |
|
||||
|-----------|------|---------|--------------------------------|
|
||||
| `results` | `List[Dict[str, Any]]` | None | The results to postprocess. |
|
||||
|
||||
#### Returns
|
||||
|
||||
| Type | Description |
|
||||
|------|--------------------------|
|
||||
| `List[Dict[str, Any]]` | Postprocessed results. |
|
||||
|
||||
## Usage Examples
|
||||
|
||||
### Example 1: Basic Usage
|
||||
|
||||
```python
|
||||
# Initialize the FAISSDB instance
|
||||
db = FAISSDB(dimension=768, index_type="Flat")
|
||||
|
||||
# Add documents to the FAISS index
|
||||
db.add("This is a document about AI.", {"category": "AI"})
|
||||
db.add("Python is great for data science.", {"category": "Programming"})
|
||||
|
||||
# Query the FAISS index
|
||||
results = db.query("Tell me about AI")
|
||||
for result in results:
|
||||
print(f"Score: {result['score']}, Text: {result['metadata']['text']}")
|
||||
```
|
||||
|
||||
### Example 2: Custom Functions
|
||||
|
||||
```python
|
||||
from transformers import AutoTokenizer, AutoModel
|
||||
import torch
|
||||
|
||||
# Custom embedding function using a HuggingFace model
|
||||
def custom_embedding_function(text: str) -> List[float]:
|
||||
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
|
||||
model = AutoModel.from_pretrained("bert-base-uncased")
|
||||
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)
|
||||
with torch.no_grad():
|
||||
outputs = model(**inputs)
|
||||
embeddings = outputs.last_hidden_state.mean(dim=1).squeeze().tolist()
|
||||
return embeddings
|
||||
|
||||
# Custom preprocessing function
|
||||
def custom_preprocess(text: str) -> str:
|
||||
return text.lower().strip()
|
||||
|
||||
# Custom postprocessing function
|
||||
def custom_postprocess(results: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
|
||||
for result in results:
|
||||
result["custom_score"] = result["score"] * 2 # Example modification
|
||||
return results
|
||||
|
||||
# Initialize the FAISSDB instance with custom functions
|
||||
db = FAISSDB(
|
||||
dimension=768,
|
||||
index_type="Flat",
|
||||
embedding_function=custom_embedding_function,
|
||||
preprocess_function=custom_preprocess,
|
||||
postprocess_function=custom_postprocess,
|
||||
metric="cosine",
|
||||
logger_config={
|
||||
"handlers": [
|
||||
{"sink": "custom_faiss_rag_wrapper.log", "rotation": "1 GB"},
|
||||
{"sink": lambda msg: print(f"Custom log: {msg}", end="")}
|
||||
],
|
||||
},
|
||||
)
|
||||
|
||||
# Add documents to the FAISS index
|
||||
db.add("This is a document about machine learning.", {"category": "ML"})
|
||||
db.add("Python is a versatile programming language.", {"category": "Programming"})
|
||||
|
||||
# Query the FAISS index
|
||||
results = db.query("Explain machine learning")
|
||||
for result in results:
|
||||
print(f"Score: {result['score']}, Custom Score: {result['custom_score']}, Text: {result['metadata']['text']}")
|
||||
```
|
||||
|
||||
## Additional Information and Tips
|
||||
|
||||
- Ensure that the dimension of the document embeddings matches the dimension specified during the initialization of the FAISSDB instance.
|
||||
- Use custom embedding functions to leverage domain-specific models for generating embeddings.
|
||||
- Custom preprocessing and postprocessing functions can help tailor the text processing and
|
||||
|
||||
result formatting to specific needs.
|
||||
- FAISS supports various types of indices; choose the one that best fits the application requirements (e.g., `Flat` for brute-force search, `IVF` for faster search with some accuracy trade-off).
|
||||
- Properly configure the logger to monitor and debug the operations of the FAISSDB instance.
|
||||
|
||||
## References and Resources
|
||||
|
||||
- [FAISS GitHub Repository](https://github.com/facebookresearch/faiss)
|
||||
- [Sentence Transformers Documentation](https://www.sbert.net/)
|
||||
- [Loguru Documentation](https://loguru.readthedocs.io/en/stable/)
|
||||
- [HuggingFace Transformers](https://huggingface.co/transformers/)
|
||||
|
||||
By following this documentation, users can effectively utilize the `FAISSDB` class for various similarity search and document retrieval tasks, customizing it to their specific needs through the provided hooks and functions.
|
Loading…
Reference in new issue