The FAISSDB class is a customizable wrapper around FAISS (Facebook AI Similarity Search), a library for efficient similarity search and clustering of dense vectors. It supports building a Retrieval-Augmented Generation (RAG) system by providing methods to add documents to a FAISS index and to query that index for similar documents. Custom embedding models, preprocessing and postprocessing functions, distance metrics, and logger settings can be supplied to fit different use cases.
Parameters
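
| Parameter | Type | Default | Description |
|---|---|---|---|
| `dimension` | `int` | `768` | Dimension of the document embeddings. |
| `index_type` | `str` | `'Flat'` | Type of FAISS index to use (`'Flat'` or `'IVF'`). |
| `embedding_model` | `Optional[Any]` | `None` | Custom embedding model. |
| `embedding_function` | `Optional[Callable[[str], List[float]]]` | `None` | Custom function to generate embeddings from text. |
| `preprocess_function` | `Optional[Callable[[str], str]]` | `None` | Custom function to preprocess text before embedding. |
| `postprocess_function` | `Optional[Callable[[List[Dict[str, Any]]], List[Dict[str, Any]]]]` | `None` | Custom function to postprocess query results. |
| `metric` | `str` | `'cosine'` | Distance metric for the FAISS index (`'cosine'` or `'l2'`). |
| `logger_config` | `Optional[Dict[str, Any]]` | `None` | Configuration for the logger. |
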
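
To make the two `metric` options ('cosine' and 'l2') concrete, the following minimal sketch computes both quantities by hand in plain Python. It is independent of FAISS and of the FAISSDB internals; the helper names `l2_normalize`, `cosine_similarity`, and `l2_distance` are illustrative and not part of the class. It relies on the standard identity that cosine similarity equals the inner product of L2-normalized vectors.

```python
import math
from typing import List


def l2_normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity as the inner product of the normalized vectors."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


def l2_distance(a: List[float], b: List[float]) -> float:
    """Euclidean (L2) distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


a = [1.0, 0.0]
b = [2.0, 0.0]  # same direction as a, different length
c = [0.0, 3.0]  # orthogonal to a

print(cosine_similarity(a, b))  # depends on direction only
print(l2_distance(a, b))        # vector length matters for L2
print(cosine_similarity(a, c))  # orthogonal vectors
```

Note that the two metrics order results in opposite directions: a higher cosine similarity means a closer match, while a lower L2 distance does.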
Methods
__init__
Initializes the FAISSDB instance, setting up the logger, creating the FAISS index, and configuring custom functions if provided.
add
Adds a document to the FAISS index.
Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `doc` | `str` | `None` | The document to be added. |
| `metadata` | `Optional[Dict[str, Any]]` | `None` | Additional metadata for the document. |

Example Usage

    db = FAISSDB(dimension=768)
    db.add("This is a sample document.", {"category": "sample"})

query
Queries the FAISS index for similar documents.
Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `query` | `str` | `None` | The query string. |
| `top_k` | `int` | `5` | The number of top results to return. |

Returns

| Type | Description |
|---|---|
| `List[Dict[str, Any]]` | A list of dictionaries containing the `top_k` most similar documents. |

Example Usage

    results = db.query("What is artificial intelligence?")
    for result in results:
        print(f"Score: {result['score']}, Text: {result['metadata']['text']}")

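
The add/query flow can be pictured with a toy, pure-Python stand-in. This is a sketch under stated assumptions, not FAISSDB's implementation: `ToyFlatStore` and `toy_embed` are hypothetical, the search is an exhaustive scan (conceptually what a Flat index does), and the result shape (a `score` plus a `metadata` dictionary with a `text` key) is inferred from the usage examples in this document.

```python
import math
from typing import Any, Callable, Dict, List, Optional


def toy_embed(text: str, dimension: int = 8) -> List[float]:
    """Hypothetical embedding: count characters into buckets by code point."""
    vec = [0.0] * dimension
    for ch in text.lower():
        vec[ord(ch) % dimension] += 1.0
    return vec


class ToyFlatStore:
    """Illustrative in-memory store; every query scans all stored vectors."""

    def __init__(self, embedding_function: Callable[[str], List[float]]) -> None:
        self.embedding_function = embedding_function
        self.vectors: List[List[float]] = []
        self.metadata: List[Dict[str, Any]] = []

    @staticmethod
    def _normalize(vec: List[float]) -> List[float]:
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec] if norm > 0 else list(vec)

    def add(self, doc: str, metadata: Optional[Dict[str, Any]] = None) -> None:
        # Keep the original text next to the caller's metadata so that
        # query results can return it.
        entry = dict(metadata or {})
        entry["text"] = doc
        self.vectors.append(self._normalize(self.embedding_function(doc)))
        self.metadata.append(entry)

    def query(self, query: str, top_k: int = 5) -> List[Dict[str, Any]]:
        q = self._normalize(self.embedding_function(query))
        scored = [
            {"score": sum(a * b for a, b in zip(q, vec)), "metadata": meta}
            for vec, meta in zip(self.vectors, self.metadata)
        ]
        # Cosine similarity: larger means more similar.
        scored.sort(key=lambda r: r["score"], reverse=True)
        return scored[:top_k]


store = ToyFlatStore(toy_embed)
store.add("This is a sample document.", {"category": "sample"})
store.add("Completely different words here.", {"category": "other"})

results = store.query("This is a sample document.", top_k=1)
for result in results:
    print(f"Score: {result['score']:.3f}, Text: {result['metadata']['text']}")
```

An exhaustive scan like this is exact but scales linearly with the number of stored documents, which is the trade-off that approximate index types such as IVF are meant to address.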
Internal Methods
_setup_logger
Sets up the logger with the given configuration.
Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `config` | `Optional[Dict[str, Any]]` | `None` | Configuration for the logger. |

_create_index
Creates and returns a FAISS index based on the specified type and metric.
Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `index_type` | `str` | `'Flat'` | Type of FAISS index to use. |
| `metric` | `str` | `'cosine'` | Distance metric for FAISS index. |

Returns

| Type | Description |
|---|---|
| `faiss.Index` | FAISS index instance. |

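
One way to picture the dispatch on `index_type` and `metric` is through FAISS factory strings. The sketch below only builds the arguments and does not import FAISS; `describe_index` and the `nlist` default of 100 are hypothetical, and how FAISSDB actually constructs its index is not specified in this document. The returned pair is meant for `faiss.index_factory`, with cosine similarity expressed as inner product over L2-normalized vectors.

```python
from typing import Tuple


def describe_index(
    index_type: str = "Flat", metric: str = "cosine", nlist: int = 100
) -> Tuple[str, str]:
    """Map (index_type, metric) to a FAISS factory string and metric constant name."""
    if index_type == "Flat":
        factory = "Flat"              # exhaustive (brute-force) search
    elif index_type == "IVF":
        factory = f"IVF{nlist},Flat"  # inverted file with nlist clusters
    else:
        raise ValueError(f"Unsupported index_type: {index_type!r}")

    if metric == "cosine":
        metric_name = "METRIC_INNER_PRODUCT"  # expects L2-normalized vectors
    elif metric == "l2":
        metric_name = "METRIC_L2"
    else:
        raise ValueError(f"Unsupported metric: {metric!r}")
    return factory, metric_name


print(describe_index("Flat", "cosine"))
print(describe_index("IVF", "l2"))

# With FAISS installed, the pair could be used like this (not executed here):
#   import faiss
#   factory, metric_name = describe_index("IVF", "l2")
#   index = faiss.index_factory(768, factory, getattr(faiss, metric_name))
#   # An IVF index must be trained on representative vectors before adding data.
```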
_default_embedding_function
Default embedding function using the SentenceTransformer model.
Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `text` | `str` | `None` | The input text to embed. |

Returns

| Type | Description |
|---|---|
| `List[float]` | Embedding vector for the input text. |

_default_preprocess_function
Default preprocessing function.
Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `text` | `str` | `None` | The input text to preprocess. |

Returns

| Type | Description |
|---|---|
| `str` | Preprocessed text. |

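
This document does not say what the default preprocessing does, so the following is only an illustration of the `Callable[[str], str]` contract that a `preprocess_function` has to satisfy. `normalize_text` is a hypothetical name; it lowercases, trims, and collapses runs of whitespace.

```python
import re


def normalize_text(text: str) -> str:
    """Lowercase, trim, and collapse internal whitespace to single spaces."""
    return re.sub(r"\s+", " ", text).strip().lower()


print(normalize_text("  Python   is GREAT\nfor data science.  "))
```

Whatever preprocessing is chosen, applying the same function to documents and queries keeps their embeddings comparable.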
_default_postprocess_function
Default postprocessing function.
Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `results` | `List[Dict[str, Any]]` | `None` | The results to postprocess. |

Returns

| Type | Description |
|---|---|
| `List[Dict[str, Any]]` | Postprocessed results. |

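
As an illustration of the postprocessing contract (a list of result dictionaries in, a list of result dictionaries out), here is a small sketch. `keep_confident` and its `min_score` threshold are hypothetical, and the code assumes each result carries a numeric `score` where larger is better, as with cosine similarity.

```python
from typing import Any, Dict, List


def keep_confident(
    results: List[Dict[str, Any]], min_score: float = 0.5
) -> List[Dict[str, Any]]:
    """Drop results below min_score and return the rest sorted best-first."""
    kept = [r for r in results if r["score"] >= min_score]
    return sorted(kept, key=lambda r: r["score"], reverse=True)


sample = [
    {"score": 0.42, "metadata": {"text": "weak match"}},
    {"score": 0.91, "metadata": {"text": "strong match"}},
    {"score": 0.67, "metadata": {"text": "decent match"}},
]
for result in keep_confident(sample):
    print(result["score"], result["metadata"]["text"])
```

Because `min_score` has a default, the function can be called with a single argument; `functools.partial(keep_confident, min_score=0.8)` would fix a different threshold while keeping the one-argument shape.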
Usage Examples
Example 1: Basic Usage

    # Initialize the FAISSDB instance
    db = FAISSDB(dimension=768, index_type="Flat")

    # Add documents to the FAISS index
    db.add("This is a document about AI.", {"category": "AI"})
    db.add("Python is great for data science.", {"category": "Programming"})

    # Query the FAISS index
    results = db.query("Tell me about AI")
    for result in results:
        print(f"Score: {result['score']}, Text: {result['metadata']['text']}")

Example 2: Custom Functions

    from typing import Any, Dict, List

    import torch
    from transformers import AutoModel, AutoTokenizer

    # Load the tokenizer and model once instead of on every call
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")


    # Custom embedding function using a HuggingFace model
    def custom_embedding_function(text: str) -> List[float]:
        inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)
        with torch.no_grad():
            outputs = model(**inputs)
        # Mean-pool the token embeddings into a single 768-dimensional vector
        embeddings = outputs.last_hidden_state.mean(dim=1).squeeze().tolist()
        return embeddings


    # Custom preprocessing function
    def custom_preprocess(text: str) -> str:
        return text.lower().strip()


    # Custom postprocessing function
    def custom_postprocess(results: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        for result in results:
            result["custom_score"] = result["score"] * 2  # Example modification
        return results


    # Initialize the FAISSDB instance with custom functions
    db = FAISSDB(
        dimension=768,
        index_type="Flat",
        embedding_function=custom_embedding_function,
        preprocess_function=custom_preprocess,
        postprocess_function=custom_postprocess,
        metric="cosine",
        logger_config={
            "handlers": [
                {"sink": "custom_faiss_rag_wrapper.log", "rotation": "1 GB"},
                {"sink": lambda msg: print(f"Custom log: {msg}", end="")},
            ],
        },
    )

    # Add documents to the FAISS index
    db.add("This is a document about machine learning.", {"category": "ML"})
    db.add("Python is a versatile programming language.", {"category": "Programming"})

    # Query the FAISS index
    results = db.query("Explain machine learning")
    for result in results:
        print(
            f"Score: {result['score']}, Custom Score: {result['custom_score']}, "
            f"Text: {result['metadata']['text']}"
        )

Additional Information and Tips
Ensure that the dimension of the document embeddings matches the dimension specified during the initialization of the FAISSDB instance.
Use custom embedding functions to leverage domain-specific models for generating embeddings.
Custom preprocessing and postprocessing functions can help tailor the text processing and result formatting to specific needs.
FAISS supports various types of indices; choose the one that best fits the application requirements (e.g., Flat for brute-force search, IVF for faster search with some accuracy trade-off).
Properly configure the logger to monitor and debug the operations of the FAISSDB instance.
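
The first tip, matching the embedding length to `dimension`, can be enforced mechanically. The wrapper below is a sketch: `with_dimension_check` and `fake_embed` are hypothetical names, and whether FAISSDB performs a similar check internally is not stated in this document.

```python
from typing import Callable, List


def with_dimension_check(
    embedding_function: Callable[[str], List[float]], dimension: int
) -> Callable[[str], List[float]]:
    """Wrap an embedding function so a wrong-sized vector fails loudly."""

    def checked(text: str) -> List[float]:
        vector = embedding_function(text)
        if len(vector) != dimension:
            raise ValueError(
                f"Expected embedding of length {dimension}, got {len(vector)}"
            )
        return vector

    return checked


def fake_embed(text: str) -> List[float]:
    """Stand-in embedding that always returns 4 values."""
    return [float(len(text)), 0.0, 0.0, 0.0]


ok = with_dimension_check(fake_embed, dimension=4)
print(len(ok("hello")))

bad = with_dimension_check(fake_embed, dimension=768)
try:
    bad("hello")
except ValueError as err:
    print(err)
```

The wrapped function keeps the `Callable[[str], List[float]]` shape, so it could be supplied wherever a custom `embedding_function` is accepted.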
By following this documentation, users can effectively utilize the FAISSDB class for various similarity search and document retrieval tasks, customizing it to their specific needs through the provided hooks and functions.