# Dataset This document describes the creation of a data pipeline to generate a dataset. ## Implementation Phases - [ ] 1.Simple chunk paraphrasing logic that's just work - After splitting, feed the splitted chunks into LLM to paraphrase - Rebuil the FAISS index with the paraphrased chunks - Don't touch `question.json` - [ ] 2.Enhance the dataset quality with API (check backlog) ## Inital idea from @tikikun - Take a dataset and break it into chunks. - Use the **ground truth chunks** (the original, correct ones). - Use an AI model to **paraphrase** those chunks—rewrite them in a different way while keeping the same meaning. - During training, give the model these **paraphrased chunks** and ask it to **search for the original chunk**. - If the model finds the correct original chunk, it gets a **reward**. - This way, the model learns to **retrieve the most accurate chunk** even when given noisy or reworded input. ### Why Does This Work? - **Paraphrasing adds noise**, making the training more realistic. - The model learns to **recover the true information** from different ways of saying it. - It ensures the model **only stops searching when it finds the exact right answer**. - This makes retrieval stronger because it trains the model to handle **real-world variations in wording**. ### Derived from flow matcing - Flow Matching is a generative modeling technique that trains models to transform simple distributions **(like noise)** into complex data distributions by learning continuous transformations, or "flows" - Paraphrase as Noise Introduction: By paraphrasing original data chunks, we introduce controlled noise, creating variations that maintain the original meaning but differ in wording.​ - Model Training with Paraphrased Data: The model is trained to map these paraphrased (noisy) chunks back to their original form, learning to navigate from noise to truth. ## Dataset Format - Should start from AutoDidact `generate_dataset.py` - Output 3 things: - Document chunks.pkl - questions.json - faiss_index - `question.json`: ```json { "chunk_id": "chunk_1", "question": "What is the capital of France?", "answer": "Paris", "difficulty": "easy" } ``` - Original Flow: load markdown -> split into chunks -> generate embeddings -> build FAISS index -> generate questions ```mermaid graph TD %% === Document Processing and Embedding Generation === A1[Load Markdown Document] -->|mission_report.md| A2[Split Document into Chunks] A2 -->|Save as Pickle| A3[💾 Chunks saved_data/chunks.pkl] A3 -->|Load Chunks| B1[Generate Embeddings] B1 -->|Save Embeddings| B2[Build FAISS Vector Store] B2 -->|Save FAISS Index| B3[💾 FAISS Index faiss_index] %% === QA Pair Generation === C1[Load Llama Model] -->|meta-Llama-3.1-8B-Instruct| C2[Configure Sampling Params] A3 -->|Load Chunks| D1[Prepare Sliding Window Prompts] D1 -->|Batch Generation| D2[Generate QA Pairs] D2 -->|Parse & Validate QA Pairs| D3[Filter Valid Pairs] D3 -->|Save Questions| D4[💾 QA Pairs saved_data/questions.json] %% === Error Handling === D2 -->|Retry Failed Prompts| D5[Retry Batch Generation] D5 -->|Parse & Validate QA Pairs| D3 %% Dependencies A1 -.->|Required| B1 A3 -.->|Required| D1 C1 -.->|Required| D2 C2 -.->|Required| D2 B3 -.->|Required| D2 ``` ## Get a sense of how to prepare the dataset for GRPO - - > Your dataset should still have at least **2 columns for question and answer pairs**. However the **answer must not reveal the reasoning behind** how it derived the answer from the question. See below for an example: - Cool basic stuff