{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Introduction\n",
"- This notebook is used to understand the behavior of the `generate_data.py` script"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Chunking and FAISS index generation notes\n",
"\n",
"- Saved 341 chunks to saved_data/chunks.pkl\n",
"- The chunking process is deterministic. **chunks.pkl will be identical across runs**\n",
"- The **FAISS index creation can be non-deterministic**:\n",
" - FAISS uses approximate nearest neighbor search algorithms, random initialization, and floating-point precision issues.\n",
" - The order of vector insertion can affect the final index structure"
]
},
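{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sanity-check sketch for the determinism claim above: split the document twice and compare content hashes. Assumes `../data/mission_report.md` exists (same path as the generation cell below); this is an illustration, not part of `generate_data.py`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch (assumed path, not part of generate_data.py): split the markdown\n",
"# document twice and compare hashes of the chunk contents to confirm that\n",
"# chunking is deterministic.\n",
"import hashlib\n",
"\n",
"from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
"from langchain_community.document_loaders import UnstructuredMarkdownLoader\n",
"\n",
"\n",
"def chunk_digest(path: str) -> str:\n",
"    docs = UnstructuredMarkdownLoader(path).load()\n",
"    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
"    chunks = splitter.split_documents(docs)\n",
"    return hashlib.sha256(\n",
"        \"\\n\".join(c.page_content for c in chunks).encode(\"utf-8\")\n",
"    ).hexdigest()\n",
"\n",
"\n",
"digest_a = chunk_digest(\"../data/mission_report.md\")\n",
"digest_b = chunk_digest(\"../data/mission_report.md\")\n",
"print(f\"Run A: {digest_a}\")\n",
"print(f\"Run B: {digest_b}\")\n",
"print(f\"Identical: {digest_a == digest_b}\")"
]
},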
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import sys\n",
"\n",
"sys.path.append(\"..\")\n",
"\n",
"import json\n",
"import os\n",
"import pickle\n",
"import re\n",
"from typing import Dict, List, Optional, Tuple\n",
"\n",
"from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
"\n",
"\n",
"from langchain_community.document_loaders import UnstructuredMarkdownLoader\n",
"from langchain_community.vectorstores import FAISS\n",
"\n",
"from embeddings import CustomHuggingFaceEmbeddings"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# # ========= Part 1: Document Processing and Embedding Generation =========\n",
"# # Load and split the markdown document using LangChain\n",
"# # Load your markdown file (adjust the path as needed)\n",
"# loader = UnstructuredMarkdownLoader(\"../data/mission_report.md\")\n",
"# docs = loader.load()\n",
"\n",
"# # Split the document into smaller chunks (each 1000 characters, no overlap)\n",
"# text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
"# chunks = text_splitter.split_documents(docs)\n",
"\n",
"# # Save chunks for later use\n",
"# os.makedirs(\"saved_data\", exist_ok=True)\n",
"# with open(\"saved_data/chunks.pkl\", \"wb\") as f:\n",
"# pickle.dump(chunks, f)\n",
"# print(f\"Saved {len(chunks)} chunks to saved_data/chunks.pkl\")\n",
"\n",
"# embeddings = CustomHuggingFaceEmbeddings()\n",
"\n",
"# # Create a FAISS vector store from the document chunks and save it locally\n",
"# vectorstore = FAISS.from_documents(chunks, embeddings)\n",
"# vectorstore.save_local(\"faiss_index\")\n",
"# print(\"Saved FAISS index to 'faiss_index'\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Try loading the chunks and parahprasing them"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Load the chunks from the pickle file\n",
"with open(\"saved_data/chunks.pkl\", \"rb\") as f:\n",
" chunks = pickle.load(f)\n",
"\n",
"print(f\"Loaded {len(chunks)} chunks from saved_data/chunks.pkl\")\n",
"\n",
"# Print content of first 3 chunks\n",
"for i, chunk in enumerate(\n",
" chunks[:3]\n",
"): # Ok cool, so this is much simpler than i expected!\n",
" print(f\"\\n--- Chunk {i + 1}/{len(chunks)} ---\")\n",
" print(chunk.page_content)\n",
" print(\"-\" * 50)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Try loading simple local model for paraphrasing (work with 16GB VRAM!)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from unsloth import FastLanguageModel\n",
"from vllm import SamplingParams\n",
"\n",
"# Load the Llama model\n",
"model, tokenizer = FastLanguageModel.from_pretrained(\n",
" # model_name=\"meta-llama/meta-Llama-3.1-8B-Instruct\",\n",
" model_name=\"Qwen/Qwen2-0.5B-Instruct\",\n",
" max_seq_length=2048, # Reduced from 4096\n",
" load_in_4bit=True,\n",
" fast_inference=True,\n",
" gpu_memory_utilization=0.8, # Increased from 0.6\n",
")\n",
"\n",
"# Define sampling parameters\n",
"sampling_params = SamplingParams(\n",
" temperature=0.3,\n",
" top_p=0.95,\n",
" max_tokens=512, # Reduced for simple responses\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## ✅ Paraphrase V0\n",
" Ok this should be good enough for now.\n",
"```bash\n",
"-- Original Chunk 2/3 ---\n",
"Mission Spacecraft Description Laumch date Launch site PA-1 BP-6 First pad abort Nov.7, 1963 White Sands Missile Range; A-001 BP-12 Transonic abort May 13, 1964 N.Mex. White Sands Missile Range, AS-101 BP-13 Nominal launch and exit environment May 28, 1964 N. Mex. Cape Kennedy. Fla. AS-102 BP-15 Nominal launch and exit environment Sept.18,1964 Cape Kennedy, Fla. A-002 BP-23 Maximum dynamic pressure abort Dec.8, 1964 White Sands Missile Range, AS-103 BP-16 Micrometeoroid experiment Feb. 16, 1965 N.Mex. Cape Kennedy, Fla. A-003 BP-22 Low-altitude abort (planned high- May 19, 1965 White Sands Missile Range, AS-104 BP-26 altitude abort) Micrometeoroid experiment and service module May 25, 1965 N.Mex, Cape Kennedy, Fla. PA-2 BP-23A RCS launch environment Second pad abort June 29,1965 White Sands Missile Range, AS-105 BP-9A Micrometeoroid experiment and service module July 30, 1965 N. Mex. Cape Kennedy, Fla. A-004 SC-002 RCS launch environment Power-on tumbling boundary abort Jan.20,1966\n",
"--------------------------------------------------\n",
"\n",
"--- Style 1 Paraphrase ---\n",
"The mission spacecraft described in the text is a spacecraft that is designed to launch and exit into a specific environment based on the specifications provided. The spacecraft is named \"Mission Spacecraft\" and is launched from the White Sands Missile Range in New Mexico on July 28, 1964. The spacecraft is\n",
"--------------------------------------------------\n",
"\n",
"--- Style 2 Paraphrase ---\n",
"The mission spacecraft described in the text is a rocket that is designed to launch into the space. It is launched from a specific site, which is located in the United States. The spacecraft is launched into the space on a nominal launch and exit environment, which is described as a high-pressure abort. The spacecraft is also launched into the space on a micrometeoroid experiment, which is described as a planned high-altitude abort. The spacecraft is launched into the space on a RCS launch environment, which is described as a second pad abort. The spacecraft is launched into the space on a micrometeoroid experiment and service module, which is described as a power-on tumbling boundary abort.\n",
"--------------------------------------------------\n",
"\n",
"--- Style 3 Paraphrase ---\n",
"The mission spacecraft, known as the Laumch, was launched from the site of the White Sands Missile Range, located in the state of New Mexico. The spacecraft was scheduled to enter the launch and exit environment on May 28, 1964, from the White Sands Missile Range in New Mexico. The spacecraft was also scheduled to enter the nominal launch and exit environment on June 18, 1964, from the White Sands Missile Range in New Mexico. The spacecraft was also scheduled to enter the nominal launch and exit environment on September 8, 1964, from the White Sands Missile Range in New Mexico. The spacecraft was also scheduled to enter the nominal launch and exit environment on January 8, 1965, from the White Sands Missile Range in New Mexico. The spacecraft was also scheduled to enter the nominal launch and exit environment on February 16, 1965, from the White Sands Missile Range in New Mexico. The spacecraft was also scheduled to enter the nominal launch and exit environment on May 19, 1965, from the White Sands Missile Range in New Mexico. The spacecraft was also scheduled to enter the nominal launch and exit environment on June 29, 1965, from the White Sands Missile Range in New Mexico. The spacecraft was also scheduled to enter the nominal launch and exit environment on July 30, 1965, from the White Sands Missile Range in New Mexico. The spacecraft was also scheduled to enter the nominal launch and exit environment on October 20, 1966, from the White Sands Missile Range in New Mexico\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"PARAPHRASE_PROMPTS = [\n",
" \"\"\"Rewrite this text in a formal, scholarly tone. Keep it very concise - summarize in 1-2 short sentences. Only output the paraphrased text:\n",
"\n",
" TEXT: {text}\"\"\",\n",
" \"\"\"Rewrite this text in a clear, simple way that's easy to understand. Provide a medium-length explanation with key details. Only output the paraphrased text:\n",
" \n",
" TEXT: {text}\"\"\",\n",
" \"\"\"Rewrite this text in a vivid, engaging style. Expand on the details and provide a comprehensive, detailed version. Only output the paraphrased text:\n",
" \n",
" TEXT: {text}\"\"\",\n",
"]\n",
"\n",
"# Update sampling parameters for each style\n",
"sampling_params_short = SamplingParams(\n",
" temperature=0.3,\n",
" top_p=0.95,\n",
" max_tokens=64, # Short responses\n",
")\n",
"\n",
"sampling_params_medium = SamplingParams(\n",
" temperature=0.3,\n",
" top_p=0.95,\n",
" max_tokens=256, # Medium responses\n",
")\n",
"\n",
"sampling_params_long = SamplingParams(\n",
" temperature=0.3,\n",
" top_p=0.95,\n",
" max_tokens=512, # Long responses\n",
")\n",
"\n",
"\n",
"def generate_response(text: str) -> list[str]:\n",
" \"\"\"\n",
" Generate three different paraphrased versions with varying lengths.\n",
"\n",
" Args:\n",
" text: Text to paraphrase\n",
"\n",
" Returns:\n",
" List of three paraphrased versions (short, medium, long)\n",
" \"\"\"\n",
" responses = []\n",
" sampling_params_list = [\n",
" sampling_params_short,\n",
" sampling_params_medium,\n",
" sampling_params_long,\n",
" ]\n",
"\n",
" for prompt_template, sampling_params in zip(\n",
" PARAPHRASE_PROMPTS, sampling_params_list\n",
" ):\n",
" formatted_prompt = tokenizer.apply_chat_template(\n",
" [{\"role\": \"user\", \"content\": prompt_template.format(text=text)}],\n",
" tokenize=False,\n",
" add_generation_prompt=True,\n",
" )\n",
"\n",
" output = model.fast_generate(\n",
" [formatted_prompt], sampling_params=sampling_params\n",
" )\n",
" responses.append(output[0].outputs[0].text)\n",
"\n",
" return responses\n",
"\n",
"\n",
"# Test example\n",
"test_text = \"The blue whale is the largest animal on Earth\"\n",
"responses = generate_response(test_text)\n",
"for i, response in enumerate(responses, 1):\n",
" print(f\"\\nStyle {i}:\")\n",
" print(response)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# paraphrase only the first 3 chunks with multiple styles\n",
"paraphrased_chunks = []\n",
"for chunk in chunks[:3]:\n",
" styles = generate_response(chunk.page_content) # Now returns list of 3 styles\n",
" paraphrased_chunks.extend(styles)\n",
"\n",
"from pprint import pprint\n",
"\n",
"pprint(paraphrased_chunks) # single list of 3*len(chunks) items"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# TODO: add checkpoint (save to file and resume from there) to this process, as it's long running and resource intensive\n",
"paraphrased_chunks = []\n",
"for chunk in chunks: # all chunks\n",
" styles = generate_response(chunk.page_content) # Now returns list of 3 styles\n",
" paraphrased_chunks.extend(styles) # should be single list of 3*len(chunks) items"
]
},
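{
"cell_type": "markdown",
"metadata": {},
"source": [
"A hedged sketch of the checkpointing TODO above: flush partial results to a pickle every few chunks and resume from it on restart. The checkpoint path and save interval are made-up names for illustration, not part of `generate_data.py`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Checkpointing sketch (assumed file name and interval, not part of the\n",
"# script): resume from a partial pickle if one exists, then flush every\n",
"# SAVE_EVERY chunks.\n",
"CHECKPOINT_PATH = \"saved_data/paraphrase_checkpoint.pkl\"  # hypothetical path\n",
"SAVE_EVERY = 10\n",
"\n",
"if os.path.exists(CHECKPOINT_PATH):\n",
"    with open(CHECKPOINT_PATH, \"rb\") as f:\n",
"        paraphrased_chunks = pickle.load(f)\n",
"else:\n",
"    paraphrased_chunks = []\n",
"\n",
"start = len(paraphrased_chunks) // 3  # 3 styles per already-processed chunk\n",
"for i, chunk in enumerate(chunks[start:], start=start):\n",
"    paraphrased_chunks.extend(generate_response(chunk.page_content))\n",
"    if (i + 1) % SAVE_EVERY == 0:\n",
"        with open(CHECKPOINT_PATH, \"wb\") as f:\n",
"            pickle.dump(paraphrased_chunks, f)\n",
"        print(f\"Checkpointed {i + 1}/{len(chunks)} chunks\")\n",
"\n",
"# Final flush so a completed run is also persisted\n",
"with open(CHECKPOINT_PATH, \"wb\") as f:\n",
"    pickle.dump(paraphrased_chunks, f)"
]
},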
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"df = pd.DataFrame(paraphrased_chunks)\n",
"# add column names\n",
"df.columns = [\"paraphrased_text\"]\n",
"df.to_csv(\"saved_data/paraphrased_chunks.csv\", index=False)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# print number of rows\n",
"print(f\"Number of rows: {len(df)}\")\n",
"# wtf, ah 341 * 3 = 1023, make sense"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"df = pd.read_csv(\"saved_data/paraphrased_chunks.csv\")\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## ✅ Append final vectorstore here"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Load the existing FAISS index\n",
"from langchain_community.vectorstores import FAISS\n",
"from embeddings import CustomHuggingFaceEmbeddings\n",
"\n",
"\n",
"# Load the paraphrased chunks\n",
"df = pd.read_csv(\"saved_data/paraphrased_chunks.csv\")\n",
"print(f\"Loaded {len(df)} paraphrased chunks\")\n",
"\n",
"# Convert DataFrame to Document objects\n",
"from langchain.schema import Document\n",
"\n",
"paraphrased_documents = [\n",
" Document(page_content=row[\"paraphrased_text\"], metadata={})\n",
" for _, row in df.iterrows()\n",
"]\n",
"\n",
"# Initialize the embeddings model\n",
"embeddings = CustomHuggingFaceEmbeddings()\n",
"\n",
"# Create embeddings for the paraphrased chunks\n",
"paraphrased_vectorstore = FAISS.from_documents(paraphrased_documents, embeddings)\n",
"print(\n",
" f\"Created FAISS index for paraphrased chunks with {paraphrased_vectorstore.index.ntotal} vectors\"\n",
")\n",
"\n",
"# NOTE: so this load the already generated vectorstore first\n",
"# Load the existing vectorstore - add allow_dangerous_deserialization=True to fix the error\n",
"existing_vectorstore = FAISS.load_local(\n",
" \"faiss_index\", embeddings, allow_dangerous_deserialization=True\n",
")\n",
"print(f\"Loaded existing FAISS index with {existing_vectorstore.index.ntotal} vectors\")\n",
"\n",
"# Merge the two vectorstores\n",
"# Side effects:\n",
"# Original IDs are not preserved - new IDs are assigned sequentially #TODO: does the final dataset need this information?\n",
"# If vectors are duplicates/very similar, they'll still be added (no deduplication) (don't care for now)\n",
"\n",
"existing_vectorstore.merge_from(paraphrased_vectorstore)\n",
"print(f\"Merged vectorstores, now contains {existing_vectorstore.index.ntotal} vectors\")\n",
"\n",
"# Save the updated vectorstore\n",
"existing_vectorstore.save_local(\"faiss_index_with_paraphrased\")\n",
"print(\"Saved updated FAISS index to 'faiss_index_with_paraphrased'\")\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# TODO: Try loading the new vectorstore and see if it works\n",
"# Expected output size: 341 * 4 = 1364\n"
]
},
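{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch for the TODO above: reload the merged index and check that it holds\n",
"# the expected 341 originals + 1023 paraphrases = 1364 vectors.\n",
"from langchain_community.vectorstores import FAISS\n",
"\n",
"from embeddings import CustomHuggingFaceEmbeddings\n",
"\n",
"embeddings = CustomHuggingFaceEmbeddings()\n",
"merged_vectorstore = FAISS.load_local(\n",
"    \"faiss_index_with_paraphrased\",\n",
"    embeddings,\n",
"    allow_dangerous_deserialization=True,\n",
")\n",
"print(f\"Merged index contains {merged_vectorstore.index.ntotal} vectors\")\n",
"assert merged_vectorstore.index.ntotal == 341 * 4, \"unexpected vector count\""
]
},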
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 🔍 Inspect FAISS vector store\n",
"Ok so basically the faiss vector store contains: \n",
"- Document ID\n",
"- **Full content** -> no need to save the original chunks.pkl file anymore \n",
"- Metadata\n",
"- **Full vector embedding**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Load your index\n",
"from langchain_community.vectorstores import FAISS\n",
"from embeddings import CustomHuggingFaceEmbeddings\n",
"\n",
"embeddings = CustomHuggingFaceEmbeddings()\n",
"vectorstore = FAISS.load_local(\n",
" \"faiss_index\",\n",
" embeddings,\n",
" allow_dangerous_deserialization=True,\n",
")\n",
"\n",
"# View contents\n",
"docs = vectorstore.docstore._dict\n",
"print(f\"Total documents: {len(docs)}\\n\")\n",
"\n",
"# Print first 5 docs as sample\n",
"for doc_id, doc in list(docs.items())[:5]:\n",
" print(f\"ID: {doc_id}\")\n",
" print(f\"Content: {doc.page_content[:200]}...\") # First 200 chars\n",
" print(f\"Metadata: {doc.metadata}\\n\")\n",
" print(\"-\" * 80 + \"\\n\")\n",
"\n",
"# Print total vectors for verification\n",
"print(f\"Total vectors in index: {vectorstore.index.ntotal}\")\n",
"print(f\"Vector dimension: {vectorstore.index.d}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 📦 Testing simple 1 prompt and choosing model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"```bash\n",
"# llama3.1-8b-instruct\n",
"Processed prompts: 100%|██████████| 1/1 [00:03<00:00, 3.38s/it, est. speed input: 50.57 toks/s, output: 39.34 toks/s]\n",
"```\n",
"- about 3s/item paraphrasing is llama3.1-8b-instruct locally, might take forever to run, LOL\n",
"- should go with smaller/quantized model first, or probaly just use groq/openrouter API for this?\n",
"\n",
"```bash\n",
"# Qwen2-0.5B-Instruct\n",
"Processed prompts: 100%|██████████| 1/1 [00:00<00:00, 2.32it/s, est. speed input: 369.09 toks/s, output: 167.13 toks/s]\n",
"```\n",
"- ok cool much faster now.\n",
"\n",
"- Noo it's hallucinated, lol, but does this matter 👀?\n",
"```\n",
"--- Chunk 1/3 ---\n",
"MSC-02680\n",
"\n",
"DISTRIBUTION AND REFERENCING\n",
"\n",
"This paper is not suitable for general distribution or referencing. It may be referenced only in other working correspondence and documents by participating organizations.\n",
"\n",
"MANNED SPACECRAFT CENTER HOUSTON.TEXAS SEPTEMBER1970\n",
"--------------------------------------------------\n",
"\n",
"--- Paraphrased Chunk 1/3 ---\n",
"The estimated total energy transferred to the command module was approximately 129 ampere hours. The total of 410 ampere hours remained in the lunar module batteries at the time of undocking. The figure 7.2-l.- Lunar module water usage shows the amount of water used in the lunar module during flight. The figure 7.2-2.- Lunar module total battery capacity during flight shows the total capacity of the lunar module batteries at the time of flight. The figure ll.l-2.- Field meter locations in the proximity of the launch complex shows the locations of field meters in the proximity of the launch complex. The figure ll.l-2.- Field meter locations in the proximity of the launch complex shows the locations of field meters in the proximity of the launch complex. The field meter locations show the locations of field meters in the proximity of the launch complex.\n",
"```\n"
]
},
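{
"cell_type": "markdown",
"metadata": {},
"source": [
"A rough way to answer the \"does this matter?\" question: compare embeddings of the original chunk and its paraphrases with cosine similarity. Assumes `paraphrased_chunks` is the flat list from the multi-style cell above and that `CustomHuggingFaceEmbeddings` implements the standard LangChain `embed_query` interface; the 0.5 threshold is an untuned guess."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hallucination check sketch: cosine similarity between the original chunk\n",
"# and each of its 3 paraphrases. Assumes CustomHuggingFaceEmbeddings exposes\n",
"# the LangChain embed_query interface; the 0.5 threshold is an untuned guess.\n",
"import numpy as np\n",
"\n",
"embeddings = CustomHuggingFaceEmbeddings()\n",
"\n",
"\n",
"def cosine(a, b):\n",
"    a, b = np.asarray(a), np.asarray(b)\n",
"    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))\n",
"\n",
"\n",
"original_vec = embeddings.embed_query(chunks[0].page_content)\n",
"for style_idx, paraphrase in enumerate(paraphrased_chunks[:3], 1):\n",
"    sim = cosine(original_vec, embeddings.embed_query(paraphrase))\n",
"    flag = \"\" if sim >= 0.5 else \"  <-- possible drift\"\n",
"    print(f\"Style {style_idx}: similarity = {sim:.3f}{flag}\")"
]
},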
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# paraphrase only the first 3 chunks\n",
"paraphrased_chunks = []\n",
"for chunk in chunks[:3]:\n",
" paraphrased_chunks.append(\n",
" generate_response(f\"Paraphrase the following chunk: {chunk.page_content}\")\n",
" )\n",
"\n",
"# print the first 3 chunks and their paraphrased versions LGTM!\n",
"for i, chunk in enumerate(chunks[:3]):\n",
" print(f\"\\n--- Chunk {i + 1}/3 ---\")\n",
" print(chunk.page_content)\n",
" print(\"-\" * 50)\n",
" print(f\"\\n--- Paraphrased Chunk {i + 1}/3 ---\")\n",
" print(paraphrased_chunks[i])\n",
" print(\"-\" * 50)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.11"
}
},
"nbformat": 4,
"nbformat_minor": 2
}