Dataset
This document describes the data pipeline used to generate the training dataset.
Implementation Phases
- 1. Simple chunk paraphrasing logic that just works (a minimal sketch follows this list)
  - After splitting, feed the split chunks into an LLM to paraphrase them
  - Rebuild the FAISS index with the paraphrased chunks
  - Don't touch `questions.json`
- 2. Enhance the dataset quality with an API (check backlog)
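A minimal sketch of phase 1, assuming `saved_data/chunks.pkl` holds a plain list of chunk strings, that embeddings come from a sentence-transformers model, and that the index is a flat L2 FAISS index written to `faiss_index`. `paraphrase_with_llm` is a placeholder for whatever model and prompt end up being used, and the real pipeline may save the index through a vector-store wrapper instead.

```python
# Sketch of phase 1 (assumptions noted above): paraphrase every chunk, then
# rebuild the FAISS index from the paraphrased text. questions.json is untouched.
import pickle

import faiss
from sentence_transformers import SentenceTransformer

def paraphrase_with_llm(chunk: str) -> str:
    """Placeholder: send the chunk to an LLM with a 'rewrite, keep the meaning' prompt."""
    raise NotImplementedError

# Load the original chunks produced by generate_dataset.py
with open("saved_data/chunks.pkl", "rb") as f:
    chunks = pickle.load(f)

# Paraphrase each chunk; keep list order so chunk ids stay aligned with the originals
paraphrased = [paraphrase_with_llm(c) for c in chunks]

# Re-embed the paraphrased chunks and rebuild the index
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(paraphrased, convert_to_numpy=True).astype("float32")

index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)
faiss.write_index(index, "faiss_index")
```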
Initial idea from @tikikun
- Take a dataset and break it into chunks.
- Use the ground truth chunks (the original, correct ones).
- Use an AI model to paraphrase those chunks—rewrite them in a different way while keeping the same meaning.
- During training, give the model these paraphrased chunks and ask it to search for the original chunk.
- If the model finds the correct original chunk, it gets a reward (a minimal reward sketch follows this list).
- This way, the model learns to retrieve the most accurate chunk even when given noisy or reworded input.
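A minimal sketch of that reward signal, assuming each rollout ends with the id of the chunk the model retrieved and that the ground-truth `chunk_id` is known from `questions.json`; the names are illustrative, not the actual training code.

```python
# Illustrative reward for the retrieval objective described above: the model only
# gets rewarded when the chunk it settles on is the exact original chunk the
# question was generated from.
def retrieval_reward(retrieved_chunk_id: str, ground_truth_chunk_id: str) -> float:
    """1.0 only for an exact match with the original chunk, otherwise 0.0."""
    return 1.0 if retrieved_chunk_id == ground_truth_chunk_id else 0.0

# The question came from chunk_3; the model searched with the paraphrased text.
assert retrieval_reward("chunk_3", "chunk_3") == 1.0  # found the original -> reward
assert retrieval_reward("chunk_7", "chunk_3") == 0.0  # wrong chunk -> no reward
```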
Why Does This Work?
- Paraphrasing adds noise, making the training more realistic.
- The model learns to recover the true information from different ways of saying it.
- It ensures the model only stops searching when it finds the exact right answer.
- This makes retrieval stronger because it trains the model to handle real-world variations in wording.
Derived from flow matching
- Flow Matching is a generative modeling technique that trains models to transform simple distributions (like noise) into complex data distributions by learning continuous transformations, or "flows"
- Paraphrase as Noise Introduction: By paraphrasing original data chunks, we introduce controlled noise, creating variations that maintain the original meaning but differ in wording.
- Model Training with Paraphrased Data: The model is trained to map these paraphrased (noisy) chunks back to their original form, learning to navigate from noise to truth (the standard flow matching objective is sketched below for reference).
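For reference, the standard conditional flow matching objective with a linear path is sketched below: x₀ is a sample from the simple source distribution, x₁ a data sample, x_t the linear interpolation between them, and v_θ the learned velocity field. In this project the mapping is only an analogy (paraphrased chunk ≈ noisy x₀, original chunk ≈ x₁); the retrieval training does not literally optimize this loss.

```math
\mathcal{L}_{\mathrm{CFM}}(\theta) = \mathbb{E}_{t \sim \mathcal{U}[0,1],\; x_0 \sim p_0,\; x_1 \sim p_{\mathrm{data}}} \big\| v_\theta(x_t, t) - (x_1 - x_0) \big\|^2, \qquad x_t = (1-t)\,x_0 + t\,x_1
```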
Dataset Format
- Should start from AutoDidact's `generate_dataset.py`
- Outputs 3 things (a sketch of loading them back follows the format example below):
  - Document chunks: `chunks.pkl`
  - `questions.json`
  - `faiss_index`
`questions.json` format:

```json
{
  "chunk_id": "chunk_1",
  "question": "What is the capital of France?",
  "answer": "Paris",
  "difficulty": "easy"
}
```
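A minimal sketch of loading the three outputs back, assuming the paths from the diagram below, a `questions.json` that is a list of records shaped like the example above, and a `chunk_N` naming convention that maps to list index N-1. If the index was saved through a vector-store wrapper, use its loader instead of `faiss.read_index`.

```python
# Sketch of reading the three artifacts back (assumptions noted above).
import json
import pickle

import faiss

with open("saved_data/chunks.pkl", "rb") as f:
    chunks = pickle.load(f)            # list of chunk texts

with open("saved_data/questions.json") as f:
    questions = json.load(f)           # [{"chunk_id": ..., "question": ..., ...}, ...]

index = faiss.read_index("faiss_index")  # assumes a raw FAISS index file

# Map a question back to the chunk it was generated from ("chunk_1" -> index 0).
q = questions[0]
chunk_idx = int(q["chunk_id"].split("_")[1]) - 1
print(q["question"], "->", chunks[chunk_idx][:80])
```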
- Original Flow: load markdown -> split into chunks -> generate embeddings -> build FAISS index -> generate questions (diagram below; a simplified Python sketch follows the diagram)
```mermaid
graph TD
    %% === Document Processing and Embedding Generation ===
    A1[Load Markdown Document] -->|mission_report.md| A2[Split Document into Chunks]
    A2 -->|Save as Pickle| A3[💾 Chunks saved_data/chunks.pkl]
    A3 -->|Load Chunks| B1[Generate Embeddings]
    B1 -->|Save Embeddings| B2[Build FAISS Vector Store]
    B2 -->|Save FAISS Index| B3[💾 FAISS Index faiss_index]

    %% === QA Pair Generation ===
    C1[Load Llama Model] -->|meta-Llama-3.1-8B-Instruct| C2[Configure Sampling Params]
    A3 -->|Load Chunks| D1[Prepare Sliding Window Prompts]
    D1 -->|Batch Generation| D2[Generate QA Pairs]
    D2 -->|Parse & Validate QA Pairs| D3[Filter Valid Pairs]
    D3 -->|Save Questions| D4[💾 QA Pairs saved_data/questions.json]

    %% === Error Handling ===
    D2 -->|Retry Failed Prompts| D5[Retry Batch Generation]
    D5 -->|Parse & Validate QA Pairs| D3

    %% Dependencies
    A1 -.->|Required| B1
    A3 -.->|Required| D1
    C1 -.->|Required| D2
    C2 -.->|Required| D2
    B3 -.->|Required| D2
```
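A heavily simplified Python sketch of the flow above, assuming sentence-transformers for embeddings and a plain flat FAISS index. `generate_qa_pairs` is a placeholder for the Llama-3.1-8B-Instruct batched generation, parsing, validation, and retry steps, and the naive splitter stands in for whatever chunking the real script uses.

```python
# Simplified end-to-end sketch: load markdown -> split -> embed -> index -> QA pairs.
import json
import pickle
from pathlib import Path

import faiss
from sentence_transformers import SentenceTransformer

def split_into_chunks(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Naive fixed-size splitter with overlap; the real script may use a smarter one."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

def generate_qa_pairs(chunk_id: str, chunk: str) -> list[dict]:
    """Placeholder for prompting the LLM and parsing/validating its QA output."""
    raise NotImplementedError

# 1. Load the markdown document and split it into chunks
text = Path("mission_report.md").read_text()
chunks = split_into_chunks(text)
Path("saved_data").mkdir(exist_ok=True)
with open("saved_data/chunks.pkl", "wb") as f:
    pickle.dump(chunks, f)

# 2. Embed the chunks and build the FAISS index
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(chunks, convert_to_numpy=True).astype("float32")
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)
faiss.write_index(index, "faiss_index")

# 3. Generate, validate, and save QA pairs
questions = []
for i, chunk in enumerate(chunks, start=1):
    questions.extend(generate_qa_pairs(f"chunk_{i}", chunk))
with open("saved_data/questions.json", "w") as f:
    json.dump(questions, f, indent=2)
```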
Get a sense of how to prepare the dataset for GRPO
- https://docs.unsloth.ai/basics/reasoning-grpo-and-rl/tutorial-train-your-own-reasoning-model-with-grpo#data-preparation
- From the linked tutorial: "Your dataset should still have at least 2 columns for question and answer pairs. However, the answer must not reveal the reasoning behind how it derived the answer from the question. See below for an example:"
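An illustrative sketch of such a two-column dataset (rows invented here, not taken from the tutorial), plus one way to wrap it with the `datasets` library; the tutorial's own prompt formatting and reward functions are not shown.

```python
# Illustrative two-column rows: the answer holds only the final result,
# never the reasoning that produced it.
rows = [
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "What is 17 * 3?", "answer": "51"},
]

# One way to turn the rows into a Hugging Face dataset for GRPO training
# (assumes the `datasets` library is installed).
from datasets import Dataset

train_dataset = Dataset.from_list(rows)
print(train_dataset)  # Dataset({features: ['question', 'answer'], num_rows: 2})
```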
- Cool basic stuff: https://docs.unsloth.ai/basics/datasets-101