FAISSDB: Documentation

The FAISSDB class is a highly customizable wrapper for the FAISS (Facebook AI Similarity Search) library, designed for efficient similarity search and clustering of dense vectors. This class facilitates the creation of a Retrieval-Augmented Generation (RAG) system by providing methods to add documents to a FAISS index and query the index for similar documents. It supports custom embedding models, preprocessing functions, and other customizations to fit various use cases.

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `dimension` | `int` | `768` | Dimension of the document embeddings. |
| `index_type` | `str` | `'Flat'` | Type of FAISS index to use (`'Flat'` or `'IVF'`). |
| `embedding_model` | `Optional[Any]` | `None` | Custom embedding model. |
| `embedding_function` | `Optional[Callable[[str], List[float]]]` | `None` | Custom function to generate embeddings from text. |
| `preprocess_function` | `Optional[Callable[[str], str]]` | `None` | Custom function to preprocess text before embedding. |
| `postprocess_function` | `Optional[Callable[[List[Dict[str, Any]]], List[Dict[str, Any]]]]` | `None` | Custom function to postprocess the results. |
| `metric` | `str` | `'cosine'` | Distance metric for FAISS index (`'cosine'` or `'l2'`). |
| `logger_config` | `Optional[Dict[str, Any]]` | `None` | Configuration for the logger. |
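
The two `metric` values are closely related. A common way to get cosine similarity out of FAISS is to L2-normalize every vector and search by inner product; whether FAISSDB does exactly this is an assumption, not something this documentation confirms. The standard-library sketch below (all function names are illustrative) shows that on unit vectors the inner product equals the cosine similarity, and the squared L2 distance equals `2 - 2 * cosine`.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0.0 else [x / norm for x in vec]


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity computed as the inner product of normalized vectors."""
    return inner_product(l2_normalize(a), l2_normalize(b))


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


a, b = [3.0, 4.0], [4.0, 3.0]
cos = cosine_similarity(a, b)
d2 = squared_l2(l2_normalize(a), l2_normalize(b))

# On unit vectors the two metrics rank neighbors identically.
print(math.isclose(d2, 2 - 2 * cos))
```

In practice this means the choice between `'cosine'` and `'l2'` only changes rankings when the embeddings are not normalized.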

Methods

`__init__`

Initializes the FAISSDB instance, setting up the logger, creating the FAISS index, and configuring custom functions if provided.

`add`

Adds a document to the FAISS index.

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `doc` | `str` | None | The document to be added. |
| `metadata` | `Optional[Dict[str, Any]]` | None | Additional metadata for the document. |

Example Usage

db = FAISSDB(dimension=768)
db.add("This is a sample document.", {"category": "sample"})
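
To make the flow behind `add` concrete, here is a minimal in-memory sketch of the steps it plausibly performs: preprocess the text, embed it, then store the vector next to its metadata. `TinyStore` and `toy_embedding` are hypothetical stand-ins, not part of FAISSDB, and storing the original text under a `"text"` key is an assumption based on the query examples in this document, which read `result['metadata']['text']`.

```python
from typing import Any, Callable, Dict, List, Optional


class TinyStore:
    """Hypothetical in-memory stand-in that mimics an add() pipeline."""

    def __init__(
        self,
        embedding_function: Callable[[str], List[float]],
        preprocess_function: Optional[Callable[[str], str]] = None,
    ) -> None:
        self.embed = embedding_function
        self.preprocess = preprocess_function or (lambda text: text)
        self.vectors: List[List[float]] = []
        self.metadata: List[Dict[str, Any]] = []

    def add(self, doc: str, metadata: Optional[Dict[str, Any]] = None) -> None:
        vector = self.embed(self.preprocess(doc))
        entry = dict(metadata or {})  # copy so the caller's dict is not mutated
        entry["text"] = doc  # assumption: the original text is kept with the metadata
        self.vectors.append(vector)
        self.metadata.append(entry)


def toy_embedding(text: str) -> List[float]:
    """Toy 2-dimensional embedding: character count and space count."""
    return [float(len(text)), float(text.count(" "))]


store = TinyStore(embedding_function=toy_embedding, preprocess_function=str.lower)
store.add("This is a sample document.", {"category": "sample"})
```

Keeping vectors and metadata in parallel lists means the position returned by a nearest-neighbor search can be used directly to look up the matching metadata.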

`query`

Queries the FAISS index for similar documents.

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `query` | `str` | None | The query string. |
| `top_k` | `int` | `5` | The number of top results to return. |

Returns

| Type | Description |
| --- | --- |
| `List[Dict[str, Any]]` | A list of dictionaries containing the top_k most similar documents. |

Example Usage

results = db.query("What is artificial intelligence?")
for result in results:
    print(f"Score: {result['score']}, Text: {result['metadata']['text']}")
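
The result shape used above (a `score` plus a `metadata` dictionary holding the text) can be illustrated with a brute-force search in plain Python. This is only a sketch of the idea: `top_k_results` is a hypothetical helper, and a real FAISS index performs the search far more efficiently.

```python
import heapq
import math
from typing import Any, Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 0.0 if norm_a == 0.0 or norm_b == 0.0 else dot / (norm_a * norm_b)


def top_k_results(
    query_vector: Sequence[float],
    vectors: List[Sequence[float]],
    metadata: List[Dict[str, Any]],
    top_k: int = 5,
) -> List[Dict[str, Any]]:
    """Score every stored vector and return the top_k best, highest score first."""
    scored = ((cosine(query_vector, vec), idx) for idx, vec in enumerate(vectors))
    best = heapq.nlargest(top_k, scored)
    return [{"score": score, "metadata": metadata[idx]} for score, idx in best]


vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
metadata = [{"text": "doc A"}, {"text": "doc B"}, {"text": "doc C"}]
results = top_k_results([1.0, 0.1], vectors, metadata, top_k=2)
for result in results:
    print(f"Score: {result['score']:.3f}, Text: {result['metadata']['text']}")
```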

Internal Methods

`_setup_logger`

Sets up the logger with the given configuration.

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `config` | `Optional[Dict[str, Any]]` | None | Configuration for the logger. |
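
The `sink` and `rotation` keys that Example 2 passes in `logger_config` resemble loguru handler options. Assuming loguru is the logging backend (this documentation does not confirm it), a configuration of that shape could be built and applied roughly as follows. `build_logger_config` is a hypothetical helper, and the import is guarded so the sketch still runs where loguru is not installed.

```python
import sys
from typing import Any, Dict, Optional

try:
    from loguru import logger  # assumption: loguru is the logging backend
except ImportError:  # keep the sketch runnable without loguru
    logger = None


def build_logger_config(log_path: Optional[str] = None) -> Dict[str, Any]:
    """Hypothetical helper: build a dict shaped like the logger_config in Example 2."""
    handlers = [{"sink": sys.stderr, "level": "INFO"}]
    if log_path is not None:
        # Rotate the log file once it reaches 1 GB, as in Example 2.
        handlers.append({"sink": log_path, "rotation": "1 GB"})
    return {"handlers": handlers}


config = build_logger_config()
if logger is not None:
    logger.configure(**config)  # replaces any previously registered handlers
```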

`_create_index`

Creates and returns a FAISS index based on the specified type and metric.

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `index_type` | `str` | `'Flat'` | Type of FAISS index to use. |
| `metric` | `str` | `'cosine'` | Distance metric for FAISS index. |

Returns

| Type | Description |
| --- | --- |
| `faiss.Index` | FAISS index instance. |
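
One way such a factory could map `index_type` and `metric` onto FAISS is through `faiss.index_factory`, which takes a dimension, a description string, and a metric constant. The sketch below is an assumption about how this might be done, not FAISSDB's actual implementation: `faiss_factory_spec` and its `nlist` parameter are hypothetical, and `'cosine'` is mapped to inner product on the assumption that vectors are normalized before insertion. The import is guarded so the mapping logic runs even without FAISS installed.

```python
from typing import Tuple

try:
    import faiss  # optional: only needed to build a real index
except ImportError:
    faiss = None


def faiss_factory_spec(
    index_type: str = "Flat", metric: str = "cosine", nlist: int = 100
) -> Tuple[str, str]:
    """Hypothetical helper: return (index_factory description, FAISS metric constant name)."""
    specs = {"Flat": "Flat", "IVF": f"IVF{nlist},Flat"}
    metrics = {"cosine": "METRIC_INNER_PRODUCT", "l2": "METRIC_L2"}
    if index_type not in specs:
        raise ValueError(f"unsupported index_type: {index_type!r}")
    if metric not in metrics:
        raise ValueError(f"unsupported metric: {metric!r}")
    return specs[index_type], metrics[metric]


spec, metric_name = faiss_factory_spec("IVF", "cosine", nlist=100)

if faiss is not None:
    # Build a 768-dimensional index from the description string.
    index = faiss.index_factory(768, spec, getattr(faiss, metric_name))
```

Note that an IVF index has to be trained with `index.train(...)` on representative vectors before documents can be added, and that inner product only equals cosine similarity when the vectors are L2-normalized.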

`_default_embedding_function`

Default embedding function using the SentenceTransformer model.

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `text` | `str` | None | The input text to embed. |

Returns

| Type | Description |
| --- | --- |
| `List[float]` | Embedding vector for the input text. |
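
Any function that replaces the default has to honor the same contract: take a string and return a list of floats whose length equals the `dimension` given at initialization. A small wrapper can enforce this before a vector reaches the index. `checked_embedding` and `toy_embedding` below are hypothetical helpers for illustration, not part of FAISSDB.

```python
from typing import Callable, List, Sequence


def checked_embedding(
    embedding_function: Callable[[str], Sequence[float]], dimension: int
) -> Callable[[str], List[float]]:
    """Wrap an embedding function so it always returns `dimension` floats or raises."""

    def wrapper(text: str) -> List[float]:
        vector = [float(x) for x in embedding_function(text)]
        if len(vector) != dimension:
            raise ValueError(
                f"embedding has {len(vector)} dimensions, expected {dimension}"
            )
        return vector

    return wrapper


def toy_embedding(text: str) -> List[int]:
    """Toy embedding: character count, space count, digit count."""
    return [len(text), text.count(" "), sum(ch.isdigit() for ch in text)]


embed = checked_embedding(toy_embedding, dimension=3)
vector = embed("FAISS 101")
```

Failing early with a clear message is usually easier to debug than a shape error raised from inside the index.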

`_default_preprocess_function`

Default preprocessing function.

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `text` | `str` | None | The input text to preprocess. |

Returns

| Type | Description |
| --- | --- |
| `str` | Preprocessed text. |
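
This documentation does not state what the default preprocessing does. As an illustration of the `Callable[[str], str]` contract only, a typical text normalizer might look like the sketch below; `normalize_text` is a hypothetical name and its steps are assumptions, not the behavior of `_default_preprocess_function`.

```python
import re
import unicodedata


def normalize_text(text: str) -> str:
    """Normalize Unicode forms, collapse runs of whitespace, trim, and lowercase."""
    text = unicodedata.normalize("NFKC", text)  # e.g. ligatures become plain letters
    text = re.sub(r"\s+", " ", text)  # newlines and tabs become single spaces
    return text.strip().lower()


print(normalize_text("  Hello\n\tWorld  "))
```

Whatever preprocessing is chosen should be applied identically to documents and queries, otherwise their embeddings are not comparable.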

`_default_postprocess_function`

Default postprocessing function.

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `results` | `List[Dict[str, Any]]` | None | The results to postprocess. |

Returns

| Type | Description |
| --- | --- |
| `List[Dict[str, Any]]` | Postprocessed results. |
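
A postprocessing hook receives the raw result list and returns a list of the same type, so it is a natural place to filter or re-rank. The sketch below is a hypothetical example, not the default implementation: `filter_and_rank` drops low-scoring hits and adds a `rank` field, assuming that a higher `score` means a closer match (true for cosine similarity, reversed for raw L2 distance).

```python
from typing import Any, Dict, List


def filter_and_rank(
    results: List[Dict[str, Any]], min_score: float = 0.5
) -> List[Dict[str, Any]]:
    """Keep results scoring at least min_score, best first, and number them from 1."""
    kept = [dict(result) for result in results if result["score"] >= min_score]
    kept.sort(key=lambda result: result["score"], reverse=True)
    for rank, result in enumerate(kept, start=1):
        result["rank"] = rank
    return kept


raw = [
    {"score": 0.91, "metadata": {"text": "doc A"}},
    {"score": 0.42, "metadata": {"text": "doc B"}},
    {"score": 0.77, "metadata": {"text": "doc C"}},
]
ranked = filter_and_rank(raw, min_score=0.5)
```

The function copies each result before modifying it, so the list passed in is left untouched.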

Usage Examples

Example 1: Basic Usage

# Initialize the FAISSDB instance
db = FAISSDB(dimension=768, index_type="Flat")

# Add documents to the FAISS index
db.add("This is a document about AI.", {"category": "AI"})
db.add("Python is great for data science.", {"category": "Programming"})

# Query the FAISS index
results = db.query("Tell me about AI")
for result in results:
    print(f"Score: {result['score']}, Text: {result['metadata']['text']}")

Example 2: Custom Functions

from typing import Any, Dict, List

import torch
from transformers import AutoModel, AutoTokenizer

# Load the tokenizer and model once instead of on every call
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Custom embedding function using a HuggingFace model
def custom_embedding_function(text: str) -> List[float]:
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean-pool the token embeddings into one 768-dimensional vector
    embeddings = outputs.last_hidden_state.mean(dim=1).squeeze(0).tolist()
    return embeddings

# Custom preprocessing function
def custom_preprocess(text: str) -> str:
    return text.lower().strip()

# Custom postprocessing function
def custom_postprocess(results: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    for result in results:
        result["custom_score"] = result["score"] * 2  # Example modification
    return results

# Initialize the FAISSDB instance with custom functions
db = FAISSDB(
    dimension=768,
    index_type="Flat",
    embedding_function=custom_embedding_function,
    preprocess_function=custom_preprocess,
    postprocess_function=custom_postprocess,
    metric="cosine",
    logger_config={
        "handlers": [
            {"sink": "custom_faiss_rag_wrapper.log", "rotation": "1 GB"},
            {"sink": lambda msg: print(f"Custom log: {msg}", end="")}
        ],
    },
)

# Add documents to the FAISS index
db.add("This is a document about machine learning.", {"category": "ML"})
db.add("Python is a versatile programming language.", {"category": "Programming"})

# Query the FAISS index
results = db.query("Explain machine learning")
for result in results:
    print(f"Score: {result['score']}, Custom Score: {result['custom_score']}, Text: {result['metadata']['text']}")

Additional Information and Tips

Ensure that the dimension of the document embeddings matches the dimension specified during the initialization of the FAISSDB instance.
Use custom embedding functions to leverage domain-specific models for generating embeddings.
Custom preprocessing and postprocessing functions can help tailor the text processing and result formatting to specific needs.

FAISS supports various types of indices; choose the one that best fits the application requirements (e.g., Flat for brute-force search, IVF for faster search with some accuracy trade-off).
Properly configure the logger to monitor and debug the operations of the FAISSDB instance.
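
The Flat versus IVF trade-off mentioned above can be shown with a toy inverted-file search in plain Python. Vectors are grouped by their nearest centroid, and a query only scans the `nprobe` closest groups. This is a conceptual sketch with hand-picked centroids and hypothetical helper names, not how FAISS is implemented internally, and it needs Python 3.8+ for `math.dist`.

```python
import math
from typing import Dict, List, Sequence


def build_cells(
    vectors: List[Sequence[float]], centroids: List[Sequence[float]]
) -> Dict[int, List[int]]:
    """Assign each vector index to the cell of its nearest centroid."""
    cells: Dict[int, List[int]] = {c: [] for c in range(len(centroids))}
    for idx, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: math.dist(vec, centroids[c]))
        cells[nearest].append(idx)
    return cells


def ivf_search(
    query: Sequence[float],
    vectors: List[Sequence[float]],
    centroids: List[Sequence[float]],
    cells: Dict[int, List[int]],
    nprobe: int = 1,
) -> int:
    """Return the index of the closest vector among the nprobe nearest cells."""
    order = sorted(range(len(centroids)), key=lambda c: math.dist(query, centroids[c]))
    candidates = [idx for c in order[:nprobe] for idx in cells[c]]
    return min(candidates, key=lambda idx: math.dist(query, vectors[idx]))


centroids = [[0.0, 0.0], [10.0, 0.0]]  # fixed by hand; real IVF learns them by clustering
vectors = [[1.0, 0.0], [4.99, 0.0], [5.5, 0.0], [9.0, 0.0]]
cells = build_cells(vectors, centroids)
query = [5.1, 0.0]

fast = ivf_search(query, vectors, centroids, cells, nprobe=1)   # scans one cell
exact = ivf_search(query, vectors, centroids, cells, nprobe=2)  # scans every cell
print(fast, exact)
```

With `nprobe=1` the search misses the true nearest neighbor because it sits just across a cell boundary; probing more cells recovers it at the cost of scanning more vectors, which is the accuracy-for-speed trade-off that IVF indices expose.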

References and Resources

By following this documentation, users can effectively utilize the FAISSDB class for various similarity search and document retrieval tasks, customizing it to their specific needs through the provided hooks and functions.
