From 1bdee261b64bc94cae46fcca510af3aaf0eeb0e3 Mon Sep 17 00:00:00 2001
From: thinhlpg
Date: Mon, 24 Mar 2025 09:02:54 +0700
Subject: [PATCH] feat: add draft data generation and documentation

- Updated `00_worklog.md` to reflect optimizations for speed and quality in dataset generation.
- Introduced new documentation files: `choosing-llm-and-prompt-101.md`, `ds-pipeline-v0.md`, and `paraphrase-prompt.md` for better clarity on LLM choices and the dataset pipeline.
- Added a Jupyter notebook `250324_generate_data_anatomy.ipynb` to explore the data generation process.
---
 docs/00_worklog.md                           |   3 +-
 docs/choosing-llm-and-prompt-101.md          |  12 +
 docs/ds-pipeline-v0.md                       |  34 ++
 docs/paraphrase-prompt.md                    |   1 +
 docs/search-backends.md                      |   1 -
 docs/understanding-search-engine-101.md      |   1 +
 generate_data.py                             |   2 +
 notebooks/250324_generate_data_anatomy.ipynb | 343 +++++++++++++++++++
 8 files changed, 395 insertions(+), 2 deletions(-)
 create mode 100644 docs/choosing-llm-and-prompt-101.md
 create mode 100644 docs/ds-pipeline-v0.md
 create mode 100644 docs/paraphrase-prompt.md
 create mode 100644 docs/understanding-search-engine-101.md
 create mode 100644 notebooks/250324_generate_data_anatomy.ipynb

diff --git a/docs/00_worklog.md b/docs/00_worklog.md
index 3b15ca6..20efe9a 100644
--- a/docs/00_worklog.md
+++ b/docs/00_worklog.md
@@ -4,7 +4,8 @@

 - [ ] @thinhlpg transfers the project to @bachvudinh
 - [ ] Modify `generate_dataset.py` (**ONLY AFTER** the simple training and benchmark works):
-  - [ ] As a data dataset maker, I want to change from LLama 3.1 8B to API call, like claude, gemini or openai. Originally they use 3.1 8B for `Self-Bootstrapping` demonstration, but the dataset quality is low, for sure.
+  - [ ] Optimize speed (different LLM models, APIs, tools, etc.)
+  - [ ] Optimize quality. As a dataset maker, I want to change from Llama 3.1 8B to an API call (Claude, Gemini, or OpenAI). The original work uses 3.1 8B for the `Self-Bootstrapping` demonstration, but the resulting dataset quality is clearly low.
 - [ ] Experimenting with different chunking strategies
 - [ ] [search-backends.md](search-backends.md) design (for more dataset noise (**ONLY AFTER** the simple training dataset works))

diff --git a/docs/choosing-llm-and-prompt-101.md b/docs/choosing-llm-and-prompt-101.md
new file mode 100644
index 0000000..c09a0bd
--- /dev/null
+++ b/docs/choosing-llm-and-prompt-101.md
@@ -0,0 +1,12 @@
+# Choosing LLM 101
+
+This doc documents the choice of LLM for the project:
+- Architecture
+- Language
+- Size
+- Why
+- Preferably easily loaded and used from the Hugging Face Hub
+
+## Comparison of LLMs for paraphrasing
+
+## Prompt
diff --git a/docs/ds-pipeline-v0.md b/docs/ds-pipeline-v0.md
new file mode 100644
index 0000000..e16cf0f
--- /dev/null
+++ b/docs/ds-pipeline-v0.md
@@ -0,0 +1,34 @@
+# Dataset pipeline v0
+
+- Why not just create a whole new dataset?
+  - we want to keep the same dataset for training and evaluation
+  - because the initial dataset is already good
+  - we don't want to waste it
+
+- Goal: introduce paraphrased document chunks to the training process
+- Ok, let's just go with the plan below cuz it's FAST to implement!
+  - Smol model (0.5B)
+  - Simple prompts: 3 prompts -> 3 paraphrased chunks for each original chunk (why 3? idk, it was revealed to me in a dream, but it's smol and fast to run)
+    - short, medium, long
+    - 3 different styles/personalities
+
+- Next (v0.1):
+  - Try this
+
+## How?
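+
+A rough sketch of the flow described in the notes below (not a final implementation: it reuses the `saved_data/chunks.pkl` and `faiss_index` paths and the `CustomHuggingFaceEmbeddings` class from `generate_data.py`, wraps paraphrases in LangChain `Document`s, and `paraphrase_chunk()` is only a placeholder for the prompts in [paraphrase-prompt.md](paraphrase-prompt.md)):
+
+```python
+import pickle
+
+from langchain_community.vectorstores import FAISS
+from langchain_core.documents import Document
+
+from embeddings import CustomHuggingFaceEmbeddings
+
+
+def paraphrase_chunk(text: str) -> list[str]:
+    """Placeholder: return e.g. 3 paraphrases (short/medium/long) of `text`."""
+    raise NotImplementedError
+
+
+# 1. Load the original chunks produced by generate_data.py
+with open("saved_data/chunks.pkl", "rb") as f:
+    chunks = pickle.load(f)
+
+# 2. Keep the originals and append one Document per paraphrase
+all_docs = list(chunks)
+for chunk in chunks:
+    for paraphrase in paraphrase_chunk(chunk.page_content):
+        all_docs.append(Document(page_content=paraphrase, metadata=dict(chunk.metadata)))
+
+# 3. Re-build the FAISS index over originals + paraphrases and save it
+embeddings = CustomHuggingFaceEmbeddings()
+vectorstore = FAISS.from_documents(all_docs, embeddings)
+vectorstore.save_local("faiss_index")
+
+# 4. Optionally overwrite chunks.pkl with the extended chunk list
+with open("saved_data/chunks.pkl", "wb") as f:
+    pickle.dump(all_docs, f)
+```
+
+Because the paraphrases are only appended after the original chunks, the original chunk ids/positions stay untouched, which matches the "just append" idea in the notes below.
+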
+- Please refer to [250324_generate_data_anatomy.ipynb](../notebooks/250324_generate_data_anatomy.ipynb) for more details
+  - There are already 3 files generated by the original `generate_dataset.py` script. There are chunk ids in the question JSON file.
+  - Should modify the `chunks` file to include paraphrased chunks
+  - Re-run the FAISS index
+
+- The final data has a "chunk_id" field in the question JSON file. Is it used or important for the training process or evaluation? - No (checked with Ctrl + F): only "question" and "answer" matter -> **so I can just iterate over the chunk file and add paraphrased chunks to the vector store**
+  - How do I iterate over the `chunk.pkl` file?
+    - use pickle to load the file
+    - iterate over the file
+    - paraphrase the chunk [paraphrase-prompt.md](paraphrase-prompt.md)
+    - add the paraphrased chunks to the vector store (how? will it affect the original chunk ids?)
+      - Can we just append the new chunks to the existing file?
+    - save the vector store
+    - save the question JSON file
+- [ ] Should I add wrong information or not? How correct should the paraphrased chunks be? How many paraphrased chunks should I add for each original chunk? - **V0.1: for now, just use simple paraphrasing with correct information.**
diff --git a/docs/paraphrase-prompt.md b/docs/paraphrase-prompt.md
new file mode 100644
index 0000000..5054580
--- /dev/null
+++ b/docs/paraphrase-prompt.md
@@ -0,0 +1 @@
+# Paraphrase Prompt
diff --git a/docs/search-backends.md b/docs/search-backends.md
index 0549a05..cfe4398 100644
--- a/docs/search-backends.md
+++ b/docs/search-backends.md
@@ -5,4 +5,3 @@
 - Embedding models, Retrieval mechanisms (BM25, dense, hybrid), Query expansion methods, Reranking strategies
 - Random search engine assignment per query
 - Noise and inconsistency injection to prevent shortcut learning
-
diff --git a/docs/understanding-search-engine-101.md b/docs/understanding-search-engine-101.md
new file mode 100644
index 0000000..49dd653
--- /dev/null
+++ b/docs/understanding-search-engine-101.md
@@ -0,0 +1 @@
+# 101 Understanding Search Engine #TODO
diff --git a/generate_data.py b/generate_data.py
index 6e8deec..02d4824 100644
--- a/generate_data.py
+++ b/generate_data.py
@@ -47,6 +47,8 @@ vectorstore = FAISS.from_documents(chunks, embeddings)
 vectorstore.save_local("faiss_index")
 print("Saved FAISS index to 'faiss_index'")

+# TODO: add the paraphrased chunks to the vector store
+
 # ========= Part 2: QA Generation using Llama Backend =========

 # Setup Llama backend via unsloth and vLLM
diff --git a/notebooks/250324_generate_data_anatomy.ipynb b/notebooks/250324_generate_data_anatomy.ipynb
new file mode 100644
index 0000000..edb626c
--- /dev/null
+++ b/notebooks/250324_generate_data_anatomy.ipynb
@@ -0,0 +1,343 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Introduction\n",
+    "- This notebook is used to understand the behavior of the `generate_data.py` script"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Chunking and FAISS index generation notes\n",
+    "\n",
+    "- Saved 341 chunks to saved_data/chunks.pkl\n",
+    "- The chunking process is deterministic. 
**chunks.pkl will be identical across runs**\n",
+    "- The **FAISS index creation can be non-deterministic**:\n",
+    "  - FAISS uses approximate nearest neighbor search algorithms, random initialization, and floating-point precision issues.\n",
+    "  - The order of vector insertion can affect the final index structure"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import sys\n",
+    "\n",
+    "sys.path.append(\"..\")\n",
+    "\n",
+    "import json\n",
+    "import os\n",
+    "import pickle\n",
+    "import re\n",
+    "from typing import Dict, List, Optional, Tuple\n",
+    "\n",
+    "from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
+    "\n",
+    "# ========= Part 1: Document Processing and Embedding Generation =========\n",
+    "# Load and split the markdown document using LangChain\n",
+    "from langchain_community.document_loaders import UnstructuredMarkdownLoader\n",
+    "from langchain_community.vectorstores import FAISS\n",
+    "\n",
+    "from embeddings import CustomHuggingFaceEmbeddings\n",
+    "\n",
+    "# Load your markdown file (adjust the path as needed)\n",
+    "loader = UnstructuredMarkdownLoader(\"../data/mission_report.md\")\n",
+    "docs = loader.load()\n",
+    "\n",
+    "# Split the document into smaller chunks (each 1000 characters, no overlap)\n",
+    "text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
+    "chunks = text_splitter.split_documents(docs)\n",
+    "\n",
+    "# Save chunks for later use\n",
+    "os.makedirs(\"saved_data\", exist_ok=True)\n",
+    "with open(\"saved_data/chunks.pkl\", \"wb\") as f:\n",
+    "    pickle.dump(chunks, f)\n",
+    "print(f\"Saved {len(chunks)} chunks to saved_data/chunks.pkl\")\n",
+    "\n",
+    "embeddings = CustomHuggingFaceEmbeddings()\n",
+    "\n",
+    "# Create a FAISS vector store from the document chunks and save it locally\n",
+    "vectorstore = FAISS.from_documents(chunks, embeddings)\n",
+    "vectorstore.save_local(\"faiss_index\")\n",
+    "print(\"Saved FAISS index to 'faiss_index'\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Try loading the chunks and paraphrasing them"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Load the chunks from the pickle file\n",
+    "with open(\"saved_data/chunks.pkl\", \"rb\") as f:\n",
+    "    chunks = pickle.load(f)\n",
+    "\n",
+    "print(f\"Loaded {len(chunks)} chunks from saved_data/chunks.pkl\")\n",
+    "\n",
+    "# Print content of first 3 chunks\n",
+    "for i, chunk in enumerate(\n",
+    "    chunks[:3]\n",
+    "):  # Ok cool, so this is much simpler than i expected!\n",
+    "    print(f\"\\n--- Chunk {i + 1}/{len(chunks)} ---\")\n",
+    "    print(chunk.page_content)\n",
+    "    print(\"-\" * 50)\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Try loading a simple local model for paraphrasing (works with 16GB VRAM!)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from unsloth import FastLanguageModel\n",
+    "from vllm import SamplingParams\n",
+    "\n",
+    "# Load the Llama model\n",
+    "model, tokenizer = FastLanguageModel.from_pretrained(\n",
+    "    # model_name=\"meta-llama/meta-Llama-3.1-8B-Instruct\",\n",
+    "    model_name=\"Qwen/Qwen2-0.5B-Instruct\",\n",
+    "    max_seq_length=2048,  # Reduced from 4096\n",
+    "    load_in_4bit=True,\n",
+    "    fast_inference=True,\n",
+    "    gpu_memory_utilization=0.8,  # Increased from 0.6\n",
+    ")\n",
+    "\n",
+    "# Define sampling parameters\n",
+    "sampling_params = 
SamplingParams(\n", + " temperature=0.3,\n", + " top_p=0.95,\n", + " max_tokens=512, # Reduced for simple responses\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## ✅ Paraphrase V0\n", + " Ok this should be good enough for now.\n", + "```bash\n", + "-- Original Chunk 2/3 ---\n", + "Mission Spacecraft Description Laumch date Launch site PA-1 BP-6 First pad abort Nov.7, 1963 White Sands Missile Range; A-001 BP-12 Transonic abort May 13, 1964 N.Mex. White Sands Missile Range, AS-101 BP-13 Nominal launch and exit environment May 28, 1964 N. Mex. Cape Kennedy. Fla. AS-102 BP-15 Nominal launch and exit environment Sept.18,1964 Cape Kennedy, Fla. A-002 BP-23 Maximum dynamic pressure abort Dec.8, 1964 White Sands Missile Range, AS-103 BP-16 Micrometeoroid experiment Feb. 16, 1965 N.Mex. Cape Kennedy, Fla. A-003 BP-22 Low-altitude abort (planned high- May 19, 1965 White Sands Missile Range, AS-104 BP-26 altitude abort) Micrometeoroid experiment and service module May 25, 1965 N.Mex, Cape Kennedy, Fla. PA-2 BP-23A RCS launch environment Second pad abort June 29,1965 White Sands Missile Range, AS-105 BP-9A Micrometeoroid experiment and service module July 30, 1965 N. Mex. Cape Kennedy, Fla. A-004 SC-002 RCS launch environment Power-on tumbling boundary abort Jan.20,1966\n", + "--------------------------------------------------\n", + "\n", + "--- Style 1 Paraphrase ---\n", + "The mission spacecraft described in the text is a spacecraft that is designed to launch and exit into a specific environment based on the specifications provided. The spacecraft is named \"Mission Spacecraft\" and is launched from the White Sands Missile Range in New Mexico on July 28, 1964. The spacecraft is\n", + "--------------------------------------------------\n", + "\n", + "--- Style 2 Paraphrase ---\n", + "The mission spacecraft described in the text is a rocket that is designed to launch into the space. It is launched from a specific site, which is located in the United States. The spacecraft is launched into the space on a nominal launch and exit environment, which is described as a high-pressure abort. The spacecraft is also launched into the space on a micrometeoroid experiment, which is described as a planned high-altitude abort. The spacecraft is launched into the space on a RCS launch environment, which is described as a second pad abort. The spacecraft is launched into the space on a micrometeoroid experiment and service module, which is described as a power-on tumbling boundary abort.\n", + "--------------------------------------------------\n", + "\n", + "--- Style 3 Paraphrase ---\n", + "The mission spacecraft, known as the Laumch, was launched from the site of the White Sands Missile Range, located in the state of New Mexico. The spacecraft was scheduled to enter the launch and exit environment on May 28, 1964, from the White Sands Missile Range in New Mexico. The spacecraft was also scheduled to enter the nominal launch and exit environment on June 18, 1964, from the White Sands Missile Range in New Mexico. The spacecraft was also scheduled to enter the nominal launch and exit environment on September 8, 1964, from the White Sands Missile Range in New Mexico. The spacecraft was also scheduled to enter the nominal launch and exit environment on January 8, 1965, from the White Sands Missile Range in New Mexico. The spacecraft was also scheduled to enter the nominal launch and exit environment on February 16, 1965, from the White Sands Missile Range in New Mexico. 
The spacecraft was also scheduled to enter the nominal launch and exit environment on May 19, 1965, from the White Sands Missile Range in New Mexico. The spacecraft was also scheduled to enter the nominal launch and exit environment on June 29, 1965, from the White Sands Missile Range in New Mexico. The spacecraft was also scheduled to enter the nominal launch and exit environment on July 30, 1965, from the White Sands Missile Range in New Mexico. The spacecraft was also scheduled to enter the nominal launch and exit environment on October 20, 1966, from the White Sands Missile Range in New Mexico\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "PARAPHRASE_PROMPTS = [\n", + " \"\"\"Rewrite this text in a formal, scholarly tone. Keep it very concise - summarize in 1-2 short sentences. Only output the paraphrased text:\n", + "\n", + " TEXT: {text}\"\"\",\n", + " \n", + " \"\"\"Rewrite this text in a clear, simple way that's easy to understand. Provide a medium-length explanation with key details. Only output the paraphrased text:\n", + " \n", + " TEXT: {text}\"\"\",\n", + " \n", + " \"\"\"Rewrite this text in a vivid, engaging style. Expand on the details and provide a comprehensive, detailed version. Only output the paraphrased text:\n", + " \n", + " TEXT: {text}\"\"\"\n", + "]\n", + "\n", + "# Update sampling parameters for each style\n", + "sampling_params_short = SamplingParams(\n", + " temperature=0.3,\n", + " top_p=0.95,\n", + " max_tokens=64, # Short responses\n", + ")\n", + "\n", + "sampling_params_medium = SamplingParams(\n", + " temperature=0.3,\n", + " top_p=0.95,\n", + " max_tokens=256, # Medium responses\n", + ")\n", + "\n", + "sampling_params_long = SamplingParams(\n", + " temperature=0.3,\n", + " top_p=0.95,\n", + " max_tokens=512, # Long responses\n", + ")\n", + "\n", + "def generate_response(text: str) -> list[str]:\n", + " \"\"\"\n", + " Generate three different paraphrased versions with varying lengths.\n", + "\n", + " Args:\n", + " text: Text to paraphrase\n", + "\n", + " Returns:\n", + " List of three paraphrased versions (short, medium, long)\n", + " \"\"\"\n", + " responses = []\n", + " sampling_params_list = [sampling_params_short, sampling_params_medium, sampling_params_long]\n", + "\n", + " for prompt_template, sampling_params in zip(PARAPHRASE_PROMPTS, sampling_params_list):\n", + " formatted_prompt = tokenizer.apply_chat_template(\n", + " [{\"role\": \"user\", \"content\": prompt_template.format(text=text)}],\n", + " tokenize=False,\n", + " add_generation_prompt=True,\n", + " )\n", + "\n", + " output = model.fast_generate(\n", + " [formatted_prompt], sampling_params=sampling_params\n", + " )\n", + " responses.append(output[0].outputs[0].text)\n", + "\n", + " return responses\n", + "\n", + "\n", + "# Test example\n", + "test_text = \"The blue whale is the largest animal on Earth\"\n", + "responses = generate_response(test_text)\n", + "for i, response in enumerate(responses, 1):\n", + " print(f\"\\nStyle {i}:\")\n", + " print(response)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# paraphrase only the first 3 chunks with multiple styles\n", + "paraphrased_chunks = []\n", + "for chunk in chunks[:3]:\n", + " styles = generate_response(chunk.page_content) # Now returns list of 3 styles\n", + " paraphrased_chunks.append(styles)\n", + "\n", + "# print the first 3 chunks and their paraphrased versions\n", + "for i, chunk in 
enumerate(chunks[:3]):\n",
+    "    print(f\"\\n--- Original Chunk {i + 1}/3 ---\")\n",
+    "    print(chunk.page_content)\n",
+    "    print(\"-\" * 50)\n",
+    "    \n",
+    "    for j, style in enumerate(paraphrased_chunks[i], 1):\n",
+    "        print(f\"\\n--- Style {j} Paraphrase ---\")\n",
+    "        print(style)\n",
+    "        print(\"-\" * 50)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 📦 Testing simple 1 prompt and choosing model"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "\n",
+    "```bash\n",
+    "# llama3.1-8b-instruct\n",
+    "Processed prompts: 100%|██████████| 1/1 [00:03<00:00, 3.38s/it, est. speed input: 50.57 toks/s, output: 39.34 toks/s]\n",
+    "```\n",
+    "- about 3 s/item when paraphrasing with llama3.1-8b-instruct locally; might take forever to run, LOL\n",
+    "- should go with a smaller/quantized model first, or probably just use the groq/openrouter API for this?\n",
+    "\n",
+    "```bash\n",
+    "# Qwen2-0.5B-Instruct\n",
+    "Processed prompts: 100%|██████████| 1/1 [00:00<00:00, 2.32it/s, est. speed input: 369.09 toks/s, output: 167.13 toks/s]\n",
+    "```\n",
+    "- ok cool much faster now.\n",
+    "\n",
+    "- Noo it's hallucinated, lol, but does this matter 👀?\n",
+    "```\n",
+    "--- Chunk 1/3 ---\n",
+    "MSC-02680\n",
+    "\n",
+    "DISTRIBUTION AND REFERENCING\n",
+    "\n",
+    "This paper is not suitable for general distribution or referencing. It may be referenced only in other working correspondence and documents by participating organizations.\n",
+    "\n",
+    "MANNED SPACECRAFT CENTER HOUSTON.TEXAS SEPTEMBER1970\n",
+    "--------------------------------------------------\n",
+    "\n",
+    "--- Paraphrased Chunk 1/3 ---\n",
+    "The estimated total energy transferred to the command module was approximately 129 ampere hours. The total of 410 ampere hours remained in the lunar module batteries at the time of undocking. The figure 7.2-l.- Lunar module water usage shows the amount of water used in the lunar module during flight. The figure 7.2-2.- Lunar module total battery capacity during flight shows the total capacity of the lunar module batteries at the time of flight. The figure ll.l-2.- Field meter locations in the proximity of the launch complex shows the locations of field meters in the proximity of the launch complex. The figure ll.l-2.- Field meter locations in the proximity of the launch complex shows the locations of field meters in the proximity of the launch complex. 
The field meter locations show the locations of field meters in the proximity of the launch complex.\n", + "```\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# paraphrase only the first 3 chunks\n", + "paraphrased_chunks = []\n", + "for chunk in chunks[:3]:\n", + " paraphrased_chunks.append(\n", + " generate_response(f\"Paraphrase the following chunk: {chunk.page_content}\")\n", + " )\n", + "\n", + "# print the first 3 chunks and their paraphrased versions LGTM!\n", + "for i, chunk in enumerate(chunks[:3]):\n", + " print(f\"\\n--- Chunk {i + 1}/3 ---\")\n", + " print(chunk.page_content)\n", + " print(\"-\" * 50)\n", + " print(f\"\\n--- Paraphrased Chunk {i + 1}/3 ---\")\n", + " print(paraphrased_chunks[i])\n", + " print(\"-\" * 50)\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": ".venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.11" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +}