
Dataset

This document describes the data pipeline used to generate the training dataset.

Implementation Phases

  • V0 Initial dataset from AutoDidact (V-1)

    • saved_data/chunks.pkl (keep this; it is needed to create the later dataset versions)
    • saved_data/questions.json
    • faiss_index/
  • V1 Paraphrased dataset

    • paraphrased_chunks.pkl (not needed; the quality was too poor to keep)
    • saved_data/chunks.pkl (holds the ground-truth chunks)
    • saved_data/questions.json
    • faiss_index/ (already contains all the documents, including the 3 new paraphrased chunks)
  • V2 Paraphrased dataset with API

    • API-based paraphrasing for better quality (see the paraphrasing sketch after this list)
    • questions.json
    • faiss_index/ (already contains all the documents, including the 3 new paraphrased chunks)
  • V3

    • To be defined; let's get through V1 first.
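A minimal sketch of the V1/V2 paraphrasing step, assuming an OpenAI-compatible chat API and the saved_data/chunks.pkl produced by V0; the client setup, model name, and prompt are placeholders, not the pipeline's actual code.

```python
import pickle
from openai import OpenAI  # any OpenAI-compatible endpoint works

client = OpenAI()  # assumes OPENAI_API_KEY is set; point base_url at a local server if needed

with open("saved_data/chunks.pkl", "rb") as f:
    chunks = pickle.load(f)  # ground-truth chunks from the V0 pipeline

def paraphrase(text: str, n: int = 3) -> list[str]:
    """Ask the model for n paraphrases that keep the meaning but change the wording."""
    variants = []
    for _ in range(n):
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=[
                {"role": "system", "content": "Rewrite the text with different wording but identical meaning."},
                {"role": "user", "content": text},
            ],
            temperature=0.9,
        )
        variants.append(resp.choices[0].message.content)
    return variants

# Map each ground-truth chunk id to its paraphrased variants (3 per chunk here).
paraphrased = {i: paraphrase(chunk) for i, chunk in enumerate(chunks)}
```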

Initial idea from @tikikun

  • Take a dataset and break it into chunks.
  • Use the ground truth chunks (the original, correct ones).
  • Use an AI model to paraphrase those chunks—rewrite them in a different way while keeping the same meaning.
  • During training, give the model these paraphrased chunks and ask it to search for the original chunk.
  • If the model finds the correct original chunk, it gets a reward (see the reward sketch after this list).
  • This way, the model learns to retrieve the most accurate chunk even when given noisy or reworded input.
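A minimal sketch of that reward, assuming a FAISS index built over the ground-truth chunks and a hypothetical embed() helper that returns a query vector; the reward is 1.0 only when the top retrieved chunk is the one the paraphrase came from.

```python
import numpy as np
import faiss

def retrieval_reward(index: faiss.Index, embed, paraphrased_text: str, true_chunk_id: int) -> float:
    """Return 1.0 if the nearest chunk in the index is the ground-truth chunk, else 0.0."""
    query = np.asarray(embed(paraphrased_text), dtype="float32").reshape(1, -1)
    _, ids = index.search(query, 1)  # FAISS returns (distances, indices)
    return 1.0 if ids[0][0] == true_chunk_id else 0.0
```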

Why Does This Work?

  • Paraphrasing adds noise, making the training more realistic.
  • The model learns to recover the true information from different ways of saying it.
  • It ensures the model only stops searching when it finds the exact right answer.
  • This makes retrieval stronger because it trains the model to handle real-world variations in wording.

Derived from flow matching

  • Flow Matching is a generative modeling technique that trains models to transform simple distributions (like noise) into complex data distributions by learning continuous transformations, or "flows" (the standard objective is written out after this list).
  • Paraphrase as Noise Introduction: By paraphrasing original data chunks, we introduce controlled noise, creating variations that maintain the original meaning but differ in wording.
  • Model Training with Paraphrased Data: The model is trained to map these paraphrased (noisy) chunks back to their original form, learning to navigate from noise to truth.
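For reference, the standard conditional flow matching objective; in this document it is only an analogy, with the paraphrased chunk playing the role of the noised sample and the original chunk the role of the clean data.

```latex
\mathcal{L}_{\mathrm{CFM}}(\theta)
  = \mathbb{E}_{t \sim \mathcal{U}[0,1],\; x_1 \sim q(x_1),\; x_t \sim p_t(x_t \mid x_1)}
    \big\| v_\theta(x_t, t) - u_t(x_t \mid x_1) \big\|^2
```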

Dataset Format

  • Should start from AutoDidact's generate_dataset.py
  • Outputs 3 things:
    • chunks.pkl (document chunks)
    • questions.json
    • faiss_index/
  • questions.json entry format:
{
    "chunk_id": "chunk_1",
    "question": "What is the capital of France?",
    "answer": "Paris",
    "difficulty": "easy"
}
  • Original Flow: load markdown -> split into chunks -> generate embeddings -> build FAISS index -> generate questions (diagrammed below; a Python sketch follows the diagram)
graph TD

    %% === Document Processing and Embedding Generation ===
    A1[Load Markdown Document] -->|mission_report.md| A2[Split Document into Chunks]
    A2 -->|Save as Pickle| A3[💾 Chunks saved_data/chunks.pkl]

    A3 -->|Load Chunks| B1[Generate Embeddings]
    B1 -->|Save Embeddings| B2[Build FAISS Vector Store]
    B2 -->|Save FAISS Index| B3[💾 FAISS Index faiss_index]

    %% === QA Pair Generation ===
    C1[Load Llama Model] -->|meta-Llama-3.1-8B-Instruct| C2[Configure Sampling Params]

    A3 -->|Load Chunks| D1[Prepare Sliding Window Prompts]
    D1 -->|Batch Generation| D2[Generate QA Pairs]
    D2 -->|Parse & Validate QA Pairs| D3[Filter Valid Pairs]
    D3 -->|Save Questions| D4[💾 QA Pairs saved_data/questions.json]

    %% === Error Handling ===
    D2 -->|Retry Failed Prompts| D5[Retry Batch Generation]
    D5 -->|Parse & Validate QA Pairs| D3

    %% Dependencies
    A1 -.->|Required| B1
    A3 -.->|Required| D1
    C1 -.->|Required| D2
    C2 -.->|Required| D2
    B3 -.->|Required| D2
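A minimal sketch of that original flow, assuming sentence-transformers for embeddings; the file names follow the diagram, the embedding model is a placeholder, and the QA-generation step (which the real pipeline batches through Llama) is stubbed out.

```python
import json
import os
import pickle

import faiss
from sentence_transformers import SentenceTransformer

os.makedirs("saved_data", exist_ok=True)
os.makedirs("faiss_index", exist_ok=True)

# 1. Load the markdown document and split it into fixed-size chunks (naive splitter).
text = open("mission_report.md", encoding="utf-8").read()
chunk_size = 1000
chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
with open("saved_data/chunks.pkl", "wb") as f:
    pickle.dump(chunks, f)

# 2. Embed the chunks and build a FAISS index.
embedder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model
embeddings = embedder.encode(chunks, convert_to_numpy=True).astype("float32")
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)
faiss.write_index(index, "faiss_index/index.faiss")

# 3. Generate QA pairs per chunk (stub; the real pipeline batches prompts through the LLM).
def generate_qa(chunk_id: str, chunk: str) -> dict:
    raise NotImplementedError("call the QA-generation model here")

questions = []  # list of {"chunk_id", "question", "answer", "difficulty"} dicts
# questions = [generate_qa(f"chunk_{i}", c) for i, c in enumerate(chunks)]
with open("saved_data/questions.json", "w") as f:
    json.dump(questions, f, indent=2)
```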

Goal: get a sense of how to prepare the dataset for GRPO training (a sketch of one possible format follows).
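A minimal sketch of turning questions.json into prompt/answer records for GRPO-style rollouts; the chat template, system prompt, and record fields beyond those in the example above are assumptions, not a fixed format.

```python
import json

with open("saved_data/questions.json") as f:
    qa_pairs = json.load(f)  # entries shaped like the example above

# Each GRPO sample needs a prompt to roll out from and reference fields to score against.
dataset = [
    {
        "prompt": [
            {"role": "system", "content": "Answer the question; use the search tool to find the source chunk."},
            {"role": "user", "content": qa["question"]},
        ],
        "answer": qa["answer"],      # used by an answer-matching reward
        "chunk_id": qa["chunk_id"],  # used by the retrieval reward sketched earlier
    }
    for qa in qa_pairs
]
```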