
Dataset

This document describes the data pipeline used to generate the training dataset.

Implementation Phases

  • V0 Initial dataset from AutoDidact (V-1)

    • saved_data/chunks.pkl (keep this; it is needed to create the later dataset versions)
    • saved_data/questions.json
    • faiss_index/
  • V1 Paraphrased dataset

    • paraphrased_chunks.pkl (not needed; the quality was too poor to keep)
    • saved_data/chunks.pkl (holds the ground-truth chunks)
    • saved_data/questions.json
    • faiss_index/ (already contains all the documents, including the 3 new paraphrased chunks)
  • V2 Paraphrased dataset with API

    • API-based paraphrasing for better quality (see the paraphrasing sketch after this list)
    • questions.json
    • faiss_index/ (already contains all the documents, including the 3 new paraphrased chunks)
  • V3

    • To be defined; let's get through V1 first.
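A minimal sketch of the V1/V2 paraphrasing step, assuming an OpenAI-compatible chat API and the saved_data/chunks.pkl produced by V0; the client setup, model name, and prompt are placeholders, not the pipeline's actual code.

```python
import pickle
from openai import OpenAI  # any OpenAI-compatible endpoint works

client = OpenAI()  # assumes OPENAI_API_KEY is set; point base_url at a local server if needed

with open("saved_data/chunks.pkl", "rb") as f:
    chunks = pickle.load(f)  # ground-truth chunks from the V0 pipeline

def paraphrase(text: str, n: int = 3) -> list[str]:
    """Ask the model for n paraphrases that keep the meaning but change the wording."""
    variants = []
    for _ in range(n):
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=[
                {"role": "system", "content": "Rewrite the text with different wording but identical meaning."},
                {"role": "user", "content": text},
            ],
            temperature=0.9,
        )
        variants.append(resp.choices[0].message.content)
    return variants

# Map each ground-truth chunk id to its paraphrased variants (3 per chunk here).
paraphrased = {i: paraphrase(chunk) for i, chunk in enumerate(chunks)}
```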

Initial idea from @tikikun

  • Take a dataset and break it into chunks.
  • Use the ground truth chunks (the original, correct ones).
  • Use an AI model to paraphrase those chunks—rewrite them in a different way while keeping the same meaning.
  • During training, give the model these paraphrased chunks and ask it to search for the original chunk.
  • If the model finds the correct original chunk, it gets a reward (see the reward sketch after this list).
  • This way, the model learns to retrieve the most accurate chunk even when given noisy or reworded input.
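A minimal sketch of that reward, assuming a FAISS index built over the ground-truth chunks and a hypothetical embed() helper that returns a query vector; the reward is 1.0 only when the top retrieved chunk is the one the paraphrase came from.

```python
import numpy as np
import faiss

def retrieval_reward(index: faiss.Index, embed, paraphrased_text: str, true_chunk_id: int) -> float:
    """Return 1.0 if the nearest chunk in the index is the ground-truth chunk, else 0.0."""
    query = np.asarray(embed(paraphrased_text), dtype="float32").reshape(1, -1)
    _, ids = index.search(query, 1)  # FAISS returns (distances, indices)
    return 1.0 if ids[0][0] == true_chunk_id else 0.0
```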

Why Does This Work?

  • Paraphrasing adds noise, making the training more realistic.
  • The model learns to recover the true information from different ways of saying it.
  • It ensures the model only stops searching when it finds the exact right answer.
  • This makes retrieval stronger because it trains the model to handle real-world variations in wording.

Derived from flow matching

  • Flow Matching is a generative modeling technique that trains models to transform simple distributions (like noise) into complex data distributions by learning continuous transformations, or "flows" (the standard objective is written out after this list).
  • Paraphrase as Noise Introduction: By paraphrasing original data chunks, we introduce controlled noise, creating variations that maintain the original meaning but differ in wording.
  • Model Training with Paraphrased Data: The model is trained to map these paraphrased (noisy) chunks back to their original form, learning to navigate from noise to truth.
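For reference, the standard conditional flow matching objective; in this document it is only an analogy, with the paraphrased chunk playing the role of the noised sample and the original chunk the role of the clean data.

```latex
\mathcal{L}_{\mathrm{CFM}}(\theta)
  = \mathbb{E}_{t \sim \mathcal{U}[0,1],\; x_1 \sim q(x_1),\; x_t \sim p_t(x_t \mid x_1)}
    \big\| v_\theta(x_t, t) - u_t(x_t \mid x_1) \big\|^2
```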

Dataset Format

  • Should start from AutoDidact's generate_dataset.py
  • Outputs 3 things:
    • chunks.pkl (document chunks)
    • questions.json
    • faiss_index/
  • questions.json entry format:
{
    "chunk_id": "chunk_1",
    "question": "What is the capital of France?",
    "answer": "Paris",
    "difficulty": "easy"
}
  • Original Flow: load markdown -> split into chunks -> generate embeddings -> build FAISS index -> generate questions (diagrammed below; a Python sketch follows the diagram)
graph TD

    %% === Document Processing and Embedding Generation ===
    A1[Load Markdown Document] -->|mission_report.md| A2[Split Document into Chunks]
    A2 -->|Save as Pickle| A3[💾 Chunks saved_data/chunks.pkl]

    A3 -->|Load Chunks| B1[Generate Embeddings]
    B1 -->|Save Embeddings| B2[Build FAISS Vector Store]
    B2 -->|Save FAISS Index| B3[💾 FAISS Index faiss_index]

    %% === QA Pair Generation ===
    C1[Load Llama Model] -->|meta-Llama-3.1-8B-Instruct| C2[Configure Sampling Params]

    A3 -->|Load Chunks| D1[Prepare Sliding Window Prompts]
    D1 -->|Batch Generation| D2[Generate QA Pairs]
    D2 -->|Parse & Validate QA Pairs| D3[Filter Valid Pairs]
    D3 -->|Save Questions| D4[💾 QA Pairs saved_data/questions.json]

    %% === Error Handling ===
    D2 -->|Retry Failed Prompts| D5[Retry Batch Generation]
    D5 -->|Parse & Validate QA Pairs| D3

    %% Dependencies
    A1 -.->|Required| B1
    A3 -.->|Required| D1
    C1 -.->|Required| D2
    C2 -.->|Required| D2
    B3 -.->|Required| D2
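A minimal sketch of that original flow, assuming sentence-transformers for embeddings; the file names follow the diagram, the embedding model is a placeholder, and the QA-generation step (which the real pipeline batches through Llama) is stubbed out.

```python
import json
import os
import pickle

import faiss
from sentence_transformers import SentenceTransformer

os.makedirs("saved_data", exist_ok=True)
os.makedirs("faiss_index", exist_ok=True)

# 1. Load the markdown document and split it into fixed-size chunks (naive splitter).
text = open("mission_report.md", encoding="utf-8").read()
chunk_size = 1000
chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
with open("saved_data/chunks.pkl", "wb") as f:
    pickle.dump(chunks, f)

# 2. Embed the chunks and build a FAISS index.
embedder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model
embeddings = embedder.encode(chunks, convert_to_numpy=True).astype("float32")
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)
faiss.write_index(index, "faiss_index/index.faiss")

# 3. Generate QA pairs per chunk (stub; the real pipeline batches prompts through the LLM).
def generate_qa(chunk_id: str, chunk: str) -> dict:
    raise NotImplementedError("call the QA-generation model here")

questions = []  # list of {"chunk_id", "question", "answer", "difficulty"} dicts
# questions = [generate_qa(f"chunk_{i}", c) for i, c in enumerate(chunks)]
with open("saved_data/questions.json", "w") as f:
    json.dump(questions, f, indent=2)
```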

Goal: get a sense of how to prepare the dataset for GRPO training (a sketch of one possible format follows).
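A minimal sketch of turning questions.json into prompt/answer records for GRPO-style rollouts; the chat template, system prompt, and record fields beyond those in the example above are assumptions, not a fixed format.

```python
import json

with open("saved_data/questions.json") as f:
    qa_pairs = json.load(f)  # entries shaped like the example above

# Each GRPO sample needs a prompt to roll out from and reference fields to score against.
dataset = [
    {
        "prompt": [
            {"role": "system", "content": "Answer the question; use the search tool to find the source chunk."},
            {"role": "user", "content": qa["question"]},
        ],
        "answer": qa["answer"],      # used by an answer-matching reward
        "chunk_id": qa["chunk_id"],  # used by the retrieval reward sketched earlier
    }
    for qa in qa_pairs
]
```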