feat: add initial project structure and core functionality

- Added initial files from AutoDidact as a starting point
- Enhanced `README.md` with project overview and setup instructions.
- Removed `ugly_code_file.py` as part of cleanup.
- Added various documentation files and assets for project clarity.
- Included Jupyter notebooks for training and experimentation.
main
thinhlpg 2 months ago
parent 91c2476c28
commit a58722e16f

@ -0,0 +1,2 @@
HF_TOKEN=
OPENROUTER_API_KEY=

7
.gitignore vendored

@ -1,3 +1,10 @@
# DeepSearch
.ruff_cache/
saved_data/
saved_models/
faiss_index/
.vscode/
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]

@ -1 +1,39 @@
# DeepSearch
# DeepSearch - A Hard Working Search Engine 🔍
DeepSearch trains a small language model to develop effective search behaviors instead of memorizing static data. It interacts with multiple synthetic search engines, each with unique retrieval mechanisms, to refine queries and persist in searching until it finds exact answers. The project focuses on reinforcement learning, preventing overfitting, and optimizing for efficiency in real-world search applications.
![Project Whiteboard](docs/assets/whiteboard.drawio.png)
## Setup
```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```
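
The project also expects a `.env` file with `HF_TOKEN` and `OPENROUTER_API_KEY` (see the template added in this commit). A minimal sketch of loading it with `python-dotenv`, which is already in `requirements.txt`:

```python
# Hedged sketch: load the API keys from the .env file at the repo root.
import os

from dotenv import load_dotenv

load_dotenv()  # reads HF_TOKEN and OPENROUTER_API_KEY from .env
hf_token = os.getenv("HF_TOKEN")
openrouter_api_key = os.getenv("OPENROUTER_API_KEY")
```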
## Models
You can find our models on Hugging Face 🤗! We're committed to open-source and easy access for the research community.
| Model | Backbone | Size | Link |
|-------|----------|------|------|
| - | - | - | - |
## Datasets
We've released our datasets on Hugging Face 🤗 to support reproducibility and further research.
| Dataset | Description | Size | Link |
|--------------------------------------|-----------------------------------------------------|-------|-----------------------------------------------------------------------------------------|
| - | - | - | - |
| - | - | - | - |
| - | - | - | - |
## References
- This project is kickstarted from [AutoDidact](https://github.com/dCaples/AutoDidact)
## Personal Notes
- **This is research code**, so I'm prioritizing speed over code quality for now. Expect things to be messy—both the code and commit history. Roasting is welcome, but don't judge me too hard; I'll clean it up later. **I don't know what I don't know**, but I'm eager (and desperate) to learn and improve, so any constructive feedback is highly appreciated! 💖

File diff suppressed because it is too large

Binary file not shown.

File diff suppressed because it is too large

@ -0,0 +1,63 @@
# Worklog
## Backlog
- [ ] Modify `generate_dataset.py` (**ONLY AFTER** the simple training and benchmark work):
- [ ] As a dataset maker, I want to switch from Llama 3.1 8B to an API call (Claude, Gemini, or OpenAI). The original project uses Llama 3.1 8B for its `Self-Bootstrapping` demonstration, but the resulting dataset quality is certainly low.
- [ ] Experimenting with different chunking strategies
- [ ] [search-backends.md](search-backends.md) design (for more dataset noise (**ONLY AFTER** the simple training dataset works))
- [ ] Research a little bit on Agentic Reward Modeling (for designing better reward function maybe?)
- <https://medium.com/@techsachin/agentic-reward-modeling-combine-human-preferences-with-verifiable-correctness-signals-for-reliable-76c408b3491c>
- <https://arxiv.org/pdf/2502.19328>
- <https://github.com/THU-KEG/Agentic-Reward-Modeling>
- <https://www.themoonlight.io/en/review/agentic-reward-modeling-integrating-human-preferences-with-verifiable-correctness-signals-for-reliable-reward-systems>
## yymmdd
- [ ] task description
## 250324
- [ ] @thinhlpg transfers the project to @bachvudinh
## 250323
- [ ] Train the model
- [ ] Make the dataset
- [ ] Upload datasets to HF Hub
- Initial dataset from AutoDidact
- Paraphrased dataset
- [ ] Make a simple gradio demo app
## 250322
- [x] Moved all the scattered and disorganized stuff I've been working on for the past week into this repo.
- [x] Write proposal for DeepSearch
- [x] [evaluation.md](evaluation.md) design (list out the metrics and why)
- [x] [dataset.md](dataset.md) design (pipeline, data structure,...)
- [x] [reward-functions.md](reward-functions.md) design (list out the functions and why)
- [x] As a new member of the research team, I'm curious how we did GRPO with Alphamaze, so that I can inherit the good stuff and improve the workflow!!!
- [Alphamaze](https://github.com/menloresearch/visual-thinker)?
- <https://www.menlo.ai/blog/alpha-maze>
- <https://arxiv.org/pdf/2502.14669>
- > Our training process involved two key stages: creating a specialized dataset and then using a combination of supervised fine-tuning (SFT) and reinforcement learning (RL) to train the model.
- LLaMA-Factory for SFT **(1.5B 6xA6000 1.5 hour)** and Unsloth for GRPO
- 💡 Hmm, so for SFT we have 50% successful data and 50% retry data, and all-successful data for GRPO. Can I apply this to DeepSearch as well? #HACK
## 250321
- [x] Inspect the code of AutoDidact in a more detailed way <https://github.com/menloresearch/DeepSearch/issues/4>
## 250320
- Research on GRPO <https://github.com/menloresearch/DeepSearch/issues/2>
## 250319
- Research on GRPO <https://github.com/menloresearch/DeepSearch/issues/2>
- Run the training script of AutoDidact
## 250318
- Idea received <https://github.com/menloresearch/DeepSearch/issues/1>

@ -0,0 +1,31 @@
# Adaptive Search Behavior
- [Agent Action](agent-action.md) -> mostly: recognize that something is missing -> perform a "refined query"
- [ ] As a model trainer, I want to inspect the full chat state of the agent to know what's going on so I can improve it -> implement a simple CLI inspect tool after training that just prints the full chat state (see the sketch after the example below).
- Example from AutoDidact:
```markdown
Example Question
What was the reason for substituting the backup Command Module Pilot 3 days prior to the Apollo 13 flight?
Step-by-Step Search Process
Query : "Apollo 13 Command Module Pilot substitution"
Outcome: Retrieved operational support details, but no explanation for the substitution.
Agent's Action: Recognized missing information → **Refined query**.
Query : "Apollo 13 Command Module Pilot substitution reason"
Outcome: Retrieved general mission anomaly details, but still no direct answer.
Agent's Action: **Increased query specificity**.
Query : "Apollo 13 John 'Jack' Swigert substitution"
Outcome: Found general mission reports, but still lacked a clear reason for substitution.
Agent's Action: Hypothesized illness might be a factor → **Refined query** accordingly.
Query : "Apollo 13 Jack Swigert illness substitution"
Outcome: Retrieved the exact explanation: "Several days prior to launch, the backup Lunar Module Pilot became sick with measles. Examinations of the prime crew indicated that the Command Module Pilot was not immune to the disease; therefore, the backup Command Module Pilot was substituted."
Final Answer
The original Command Module Pilot lacked immunity to measles, necessitating his replacement by Jack Swigert.
This example shows how llama learns to do multiple searches to find answers to its questions.
```
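
A minimal sketch of that CLI inspect tool, assuming chat states are dumped to a JSON file as a list of `{"messages": [...]}` dicts (the same structure built in `rl_helpers.py`); the file path and CLI are hypothetical:

```python
# Hedged sketch of a post-training chat-state inspector. It assumes chat states were
# saved as JSON: a list of {"messages": [{"role": ..., "content": ...}, ...]} dicts,
# matching the structure built in rl_helpers.py. Path and CLI flags are hypothetical.
import argparse
import json


def print_chat_state(chat_state: dict) -> None:
    for message in chat_state["messages"]:
        print(f"--- {message['role']} ---")
        print(message["content"])
        print()


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Print full chat states after training.")
    parser.add_argument("path", help="JSON file containing a list of chat states")
    args = parser.parse_args()
    with open(args.path) as f:
        chat_states = json.load(f)
    for i, chat_state in enumerate(chat_states):
        print(f"===== Chat state {i} =====")
        print_chat_state(chat_state)
```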

@ -0,0 +1,5 @@
# Agent Action
- [ ] Research a bit more on this because I'm a bit outdated on the training side
- [ ] What does the dataset look like?
- [ ] How to evaluate the performance?

Binary file not shown.


@ -0,0 +1,91 @@
# Dataset
This document describes the creation of a data pipeline to generate a dataset.
## Implementation Phases
- [ ] 1. Simple chunk paraphrasing logic that just works (see the sketch below)
- After splitting, feed the split chunks into an LLM to paraphrase them
- Rebuild the FAISS index with the paraphrased chunks
- Don't touch `questions.json`
- [ ] 2. Enhance the dataset quality with an API (check backlog)
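
A minimal sketch of phase 1, assuming the existing `saved_data/chunks.pkl` and `CustomHuggingFaceEmbeddings` from this repo; the `paraphrase` helper is a hypothetical placeholder for whichever LLM ends up doing the rewriting:

```python
# Hedged sketch of phase 1: paraphrase the split chunks and rebuild the FAISS index.
# `paraphrase` is a hypothetical helper; questions.json is left untouched.
import pickle

from langchain_community.vectorstores import FAISS

from embeddings import CustomHuggingFaceEmbeddings


def paraphrase(text: str) -> str:
    """Placeholder: call an LLM to rewrite `text` while keeping its meaning."""
    raise NotImplementedError


with open("saved_data/chunks.pkl", "rb") as f:
    chunks = pickle.load(f)

for chunk in chunks:
    chunk.page_content = paraphrase(chunk.page_content)

embeddings = CustomHuggingFaceEmbeddings()
vectorstore = FAISS.from_documents(chunks, embeddings)
vectorstore.save_local("faiss_index")  # overwrite, or point the trainer at a new path
```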
## Initial idea from @tikikun
- Take a dataset and break it into chunks.
- Use the **ground truth chunks** (the original, correct ones).
- Use an AI model to **paraphrase** those chunks—rewrite them in a different way while keeping the same meaning.
- During training, give the model these **paraphrased chunks** and ask it to **search for the original chunk**.
- If the model finds the correct original chunk, it gets a **reward**.
- This way, the model learns to **retrieve the most accurate chunk** even when given noisy or reworded input.
### Why Does This Work?
- **Paraphrasing adds noise**, making the training more realistic.
- The model learns to **recover the true information** from different ways of saying it.
- It ensures the model **only stops searching when it finds the exact right answer**.
- This makes retrieval stronger because it trains the model to handle **real-world variations in wording**.
### Derived from flow matching
- Flow Matching is a generative modeling technique that trains models to transform simple distributions **(like noise)** into complex data distributions by learning continuous transformations, or "flows"
- Paraphrase as Noise Introduction: By paraphrasing original data chunks, we introduce controlled noise, creating variations that maintain the original meaning but differ in wording.
- Model Training with Paraphrased Data: The model is trained to map these paraphrased (noisy) chunks back to their original form, learning to navigate from noise to truth.
## Dataset Format
- Should start from AutoDidact `generate_dataset.py`
- Outputs 3 things:
- Document chunks (`chunks.pkl`)
- `questions.json`
- FAISS index (`faiss_index`)
- `questions.json` entry example:
```json
{
    "chunk_id": "chunk_1",
    "question": "What is the capital of France?",
    "answer": "Paris",
    "difficulty": "easy"
}
```
- Original Flow: load markdown -> split into chunks -> generate embeddings -> build FAISS index -> generate questions
```mermaid
graph TD
%% === Document Processing and Embedding Generation ===
A1[Load Markdown Document] -->|mission_report.md| A2[Split Document into Chunks]
A2 -->|Save as Pickle| A3[💾 Chunks saved_data/chunks.pkl]
A3 -->|Load Chunks| B1[Generate Embeddings]
B1 -->|Save Embeddings| B2[Build FAISS Vector Store]
B2 -->|Save FAISS Index| B3[💾 FAISS Index faiss_index]
%% === QA Pair Generation ===
C1[Load Llama Model] -->|meta-Llama-3.1-8B-Instruct| C2[Configure Sampling Params]
A3 -->|Load Chunks| D1[Prepare Sliding Window Prompts]
D1 -->|Batch Generation| D2[Generate QA Pairs]
D2 -->|Parse & Validate QA Pairs| D3[Filter Valid Pairs]
D3 -->|Save Questions| D4[💾 QA Pairs saved_data/questions.json]
%% === Error Handling ===
D2 -->|Retry Failed Prompts| D5[Retry Batch Generation]
D5 -->|Parse & Validate QA Pairs| D3
%% Dependencies
A1 -.->|Required| B1
A3 -.->|Required| D1
C1 -.->|Required| D2
C2 -.->|Required| D2
B3 -.->|Required| D2
```
## Get a sense of how to prepare the dataset for GRPO
- <https://docs.unsloth.ai/basics/reasoning-grpo-and-rl/tutorial-train-your-own-reasoning-model-with-grpo#data-preparation>
- > Your dataset should still have at least **2 columns for question and answer pairs**. However the **answer must not reveal the reasoning behind** how it derived the answer from the question. See below for an example:
- Cool basic stuff <https://docs.unsloth.ai/basics/datasets-101> (a minimal two-column sketch follows)
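
A minimal sketch of the two-column question/answer shape Unsloth describes, using the `datasets` library from `requirements.txt` (the column names are an assumption; match whatever the training notebook expects):

```python
# Hedged sketch: a GRPO dataset needs at least a question column and an answer column,
# and the answer must not reveal the reasoning. Column names here are assumptions.
from datasets import Dataset

rows = [
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "2 + 2", "answer": "4"},
]
grpo_dataset = Dataset.from_list(rows)
print(grpo_dataset)
```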

@ -0,0 +1,52 @@
# Evaluation
- **Goal**:
- 1. Better performance than the original one (by auto eval script)
- 2. Better performance by real human eval/preference
## Implementation Phases
- [x] 1. Just take the eval function from the original repo (it simply uses accuracy, i.e. exact match) and do a quick glance at the output quality (see the sketch below).
- [ ] 2. Find more common and conventional datasets and benchmarks (still an auto script)
- [ ] 3. Set up human eval
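
A minimal sketch of a plain exact-match accuracy metric for phase 1 (note that the repo's `run_eval` in `rl_helpers.py` actually grades with an LLM verifier, so treat this as the simplest possible baseline):

```python
# Hedged sketch: plain exact-match accuracy. The repo's run_eval grades with an LLM
# verifier instead, so this is only the simplest baseline metric.
from typing import List


def exact_match_accuracy(predictions: List[str], answers: List[str]) -> float:
    if not predictions or len(predictions) != len(answers):
        raise ValueError("predictions and answers must be non-empty and the same length")
    correct = sum(
        prediction.strip().lower() == answer.strip().lower()
        for prediction, answer in zip(predictions, answers)
    )
    return correct / len(answers)


print(exact_match_accuracy(["Paris", "4"], ["paris", "5"]))  # 0.5
```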
## Baseline
- Info from AutoDidact:
- After just 100 steps of GRPO training (1 hour on a single RTX 4090 GPU), Llama-8B significantly improved its ability to research and answer questions from the Apollo 13 mission report
- On a validation set of 68 questions, accuracy more than doubled from 23% to 59%.
- Training log: I don't know why, but the result I got from actually running the training is a bit lower.
```bash
ceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
completion_ids = [torch.tensor(ids, device=device) for ids in completion_ids]
Processed prompts: 100%|████████████████| 16/16 [00:00<00:00, 39.27it/s, est. speed input: 6827.13 toks/s, output: 81.01 toks/s]
rewards_per_func: tensor([0.6875, 0.7000], device='cuda:0'):05, 2.55it/s, est. speed input: 385.80 toks/s, output: 5.11 toks/s]
{'loss': 0.0003, 'grad_norm': 0.5810762047767639, 'learning_rate': 0.0, 'rewards/reward_correctness': 0.6875, 'rewards/reward_formatting': 0.699999988079071, 'reward': 1.3875000476837158, 'reward_std': 0.44403791427612305, 'completion_length': 224.125, 'kl': 0.00834659393876791, 'epoch': 0.34}
{'train_runtime': 7992.2854, 'train_samples_per_second': 0.202, 'train_steps_per_second': 0.013, 'train_loss': 0.0005197484556535774, 'epoch': 0.34}
100%|███████████████████████████████████████████████████████████████████████████████████████| 101/101 [2:13:12<00:00, 79.13s/it]
Processed prompts: 100%|████████████████| 67/67 [00:20<00:00, 3.28it/s, est. speed input: 950.44 toks/s, output: 394.51 toks/s]
Processed prompts: 100%|███████████████| 66/66 [00:20<00:00, 3.15it/s, est. speed input: 2383.55 toks/s, output: 323.82 toks/s]
Processed prompts: 100%|███████████████| 20/20 [00:17<00:00, 1.13it/s, est. speed input: 1320.49 toks/s, output: 146.76 toks/s]
Processed prompts: 100%|████████████████| 17/17 [00:16<00:00, 1.04it/s, est. speed input: 1620.28 toks/s, output: 98.35 toks/s]
Processed prompts: 100%|██████████████████| 9/9 [00:15<00:00, 1.73s/it, est. speed input: 1165.77 toks/s, output: 71.38 toks/s]
Processed prompts: 100%|████████████████| 67/67 [00:04<00:00, 16.31it/s, est. speed input: 3617.28 toks/s, output: 61.11 toks/s]
RESULTS:
percentage of correct answers: 0.5074626865671642
==============================
Processed prompts: 100%|███████████████| 67/67 [00:15<00:00, 4.46it/s, est. speed input: 1292.29 toks/s, output: 561.32 toks/s]
Processed prompts: 100%|███████████████| 44/44 [00:18<00:00, 2.44it/s, est. speed input: 1800.84 toks/s, output: 244.13 toks/s]
Processed prompts: 100%|███████████████| 13/13 [00:12<00:00, 1.05it/s, est. speed input: 1209.04 toks/s, output: 126.32 toks/s]
Processed prompts: 100%|███████████████| 10/10 [00:13<00:00, 1.32s/it, est. speed input: 1225.46 toks/s, output: 109.78 toks/s]
Processed prompts: 100%|██████████████████| 7/7 [00:12<00:00, 1.86s/it, est. speed input: 1149.18 toks/s, output: 76.05 toks/s]
Processed prompts: 100%|████████████████| 67/67 [00:02<00:00, 31.53it/s, est. speed input: 6047.70 toks/s, output: 83.31 toks/s]
RESULTS:
percentage of correct answers: 0.19402985074626866
==============================
[rank0]:[W320 07:13:50.651270455 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
```
## Getting some sense of the eval data or benchmark
- > For example, benchmarks like ARC-AGI, which involve visual reasoning, remain challenging for these models, even though they might seem straightforward to a human. (ichigo)

@ -0,0 +1,15 @@
# GRPO idea
- The training flow of R1 is really simple (thanks to my friend, professional yapper @vTuanpham, for initially clarifying my dumbness 🤣). A code sketch of the tiered scoring follows the quote.
```text
1. Train a base model that knows how to use tools with pure SFT as a boost
Tuan
2. Then let it run free with GRPO: roughly correct syntax gets 0.5, correct syntax but the params are too far off gets 0.65, both correct gets 0.85, ...
```
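
A hedged sketch of that tiered scoring, with a hypothetical JSON tool-call parser standing in for whatever syntax check the trainer actually uses:

```python
# Hedged sketch of the tiered reward from the quote above; the parser and the
# expected-call comparison are hypothetical stand-ins, not the actual implementation.
import json
from typing import Optional


def parse_tool_call(completion: str) -> Optional[dict]:
    """Hypothetical parser: treat the completion as a JSON tool call, None on failure."""
    try:
        call = json.loads(completion)
        return call if isinstance(call, dict) else None
    except json.JSONDecodeError:
        return None


def tiered_tool_reward(completion: str, expected_call: dict) -> float:
    call = parse_tool_call(completion)
    if call is None:
        return 0.0   # no parsable call at all
    if call.get("name") != expected_call["name"]:
        return 0.5   # roughly valid syntax, but not the expected tool
    if call.get("parameters") != expected_call["parameters"]:
        return 0.65  # right tool, parameters too far off
    return 0.85      # syntax and parameters both correct
```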
## Unsloth's guide
- <https://unsloth.ai/blog/r1-reasoning>
- Heheboi let's steal this notebook <https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb>
- <https://docs.unsloth.ai/basics/reasoning-grpo-and-rl> - This is probably the simplest guide

@ -0,0 +1,10 @@
```mermaid
graph TD
A[User Query] -->|Random Search Engine Assigned| B{Synthetic Search Engine}
B -->|Retrieves Initial Results| C[Model Analyzes Results]
C -->|Refines Query if Needed| D[Iterative Search Process]
D -->|Final Answer Found| E[Return Best Match]
E -->|Rewards/Penalties Applied| F[Reinforcement Learning Update]
F -->|Optimized Search Strategy| B
```

@ -0,0 +1,282 @@
# Reward functions
This note is a collection of stolen reward functions and tips from other projects.
- NEED SOMETHING THAT MAKES THE MODEL WORK HARDER!!!
- [x] Goal: design reward functions (Search Task!) for DeepSearch's GRPO training (likely to be exact match) (**Try the suggestion by unsloth below, lol**)
- > You can refer to the examples below. You can input your generations into an LLM like ChatGPT 4o or Llama 3.1 (8B) and design a reward function and verifier to evaluate it. **For example, feed your generations into a LLM of your choice and set a rule: "If the answer sounds too robotic, deduct 3 points." This helps refine outputs based on quality criteria**
- Label Studio suggests consulting domain experts -> ask the LLM to act as a search engine expert??
- Starting from the AutoDidact defaults should be good enough; figure out big-brain moves from there
## Implementation Phases
- [ ] 1. Just keep the default ones from AutoDidact and add the Exact Match idea
- Oh, they only use 2 reward functions: `reward_correctness` and `reward_formatting`
- [ ] 2. Add more if needed.
## Pseudo code
```python
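# Hedged pseudo-code sketch, not the actual implementation: keep AutoDidact's two
# reward functions (reward_correctness, reward_formatting) and add an exact-match bonus.
# The completion/answer structures mirror the ones used in rl_helpers.py.
def reward_exact_match(prompts, completions, **reward_kwargs):
    answers = reward_kwargs["answer"]
    finals = [completion["messages"][-1]["content"] for completion in completions]
    return [
        1.0 if final.strip().lower() == answer.strip().lower() else 0.0
        for final, answer in zip(finals, answers)
    ]


# reward_funcs = [reward_correctness, reward_formatting, reward_exact_match]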
```
## Get a sense of Reward functions
- <https://github.com/kubernetes-bad/reward-composer>
- Reward Composer is a collection of simple building blocks for making your perfect reward function for Reinforcement Learning training of language models... It's like Lego for GRPO.
- <https://gist.github.com/willccbb/4676755236bb08cab5f4e54a0475d6fb>
- Really minimalist and simple GRPO training script (only 171 lines :O)
- Example from unsloth's blog <https://docs.unsloth.ai/basics/reasoning-grpo-and-rl#reward-function-examples>
- > You can **reuse data** across multiple epochs. - What does this mean 👀?
- From <https://labelstud.io/blog/reinforcement-learning-from-verifiable-rewards/#how-to-design-a-verifiable-reward-function>
- Factual Accuracy: Checking whether the output contains verifiable facts.
- Logical Consistency: Ensuring that arguments or narratives are internally consistent, e.g. when solving propositional logic reasoning problems
- Exact Match and Heuristics: Use deterministic rules to check correctness (e.g., exact match in math answers, passing test cases in code, **matching the predefined categories or taxonomy** etc.)
- > Designing a verifiable reward function **requires expert knowledge, domain expertise**, and structured data interfaces - Can I just LLM Roleplaying search engine expert? 👀
- Multi-Level Scoring: Implement tiered scoring mechanisms to reward partial correctness where applicable. (cool, might try this)
- > 3. Validate the Reward Model Based on Generated Examples
  > Run Controlled Tests: Generate model outputs and measure how well the reward function distinguishes correct from incorrect responses.
  > Evaluate for Robustness: Ensure the function avoids penalizing correct responses due to formatting issues or minor variations.
  > A/B Testing with RL Agents: Compare performance between models trained with and without the verifiable reward function.
## Reward Function vs Verifier
Stolen note from unsloth's docs (a tiny code sketch follows the table):
| Component | Purpose | Characteristics | Examples |
|-----------|---------|-----------------|----------|
| **Verifier** | Determines correctness | - No numerical scoring<br>- Binary correct/incorrect judgment | - Checks if "2+2=5" is wrong<br>- Executes code to validate syntax/logic |
| **Reward Function** | Assigns numerical scores | - Converts verification to numbers<br>- Can include multiple criteria | - Wrong answer: -1 or -2<br>- Correct answer: +1 or +2<br>- Penalties for length/readability |
| **Key Differences** | | - Verifier: checks correctness without scoring<br>- Reward Function: assigns scores without necessarily verifying<br>- Reward Function can use a Verifier, but they're distinct components | |
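
A tiny illustrative sketch of the split described in the table (an arithmetic example, not code from either project): the verifier only judges correctness, and the reward function turns that judgment into a score.

```python
# Illustrative only: a verifier returns a correct/incorrect judgment,
# while a reward function converts that judgment into a numerical score.
def verify_answer(predicted: str, expected: str) -> bool:
    """Verifier: binary judgment, no scoring."""
    return predicted.strip() == expected.strip()


def correctness_reward(predicted: str, expected: str) -> float:
    """Reward function: uses the verifier and assigns a score."""
    return 2.0 if verify_answer(predicted, expected) else -1.0


print(correctness_reward("4", "4"))  # 2.0
print(correctness_reward("5", "4"))  # -1.0
```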
## Idea examples
Note taken from unsloth's docs.
Example #1: Simple Arithmetic Task
- Question: "2 + 2"
- Answer: "4"
- Reward Function 1:
- If a number is detected → +1
- If no number is detected → -1
Example #2: Email Automation Task (a sketch follows this list)
- Question: Inbound email
- Answer: Outbound email
- Reward Functions:
- If the answer contains a required keyword → +1
- If the answer exactly matches the ideal response → +1
- If the response is too long → -1
- If the recipient's name is included → +1
- If a signature block (phone, email, address) is present → +1
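
A hedged sketch of Example #2's reward list; the keyword, recipient, signature markers, and the length threshold are all assumptions for illustration, not values from Unsloth's docs.

```python
# Illustrative sketch of the email-automation rewards listed above; every threshold
# and marker here is an assumption.
def email_reward(response: str, ideal_response: str, keyword: str, recipient: str) -> float:
    score = 0.0
    if keyword.lower() in response.lower():
        score += 1.0  # required keyword present
    if response.strip() == ideal_response.strip():
        score += 1.0  # exact match with the ideal response
    if len(response.split()) > 300:
        score -= 1.0  # too long (300 words is an assumed threshold)
    if recipient.lower() in response.lower():
        score += 1.0  # recipient's name included
    if any(marker in response.lower() for marker in ("phone:", "email:", "address:")):
        score += 1.0  # crude signature-block check
    return score
```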
## Code Examples
- Below is a code snippet from @unslothai sample notebook, which is taken from @willccbb's gist
```python
# Reward functions
def correctness_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
responses = [completion[0]["content"] for completion in completions]
q = prompts[0][-1]["content"]
extracted_responses = [extract_xml_answer(r) for r in responses]
print(
"-" * 20,
f"Question:\n{q}",
f"\nAnswer:\n{answer[0]}",
f"\nResponse:\n{responses[0]}",
f"\nExtracted:\n{extracted_responses[0]}",
)
return [2.0 if r == a else 0.0 for r, a in zip(extracted_responses, answer)]
def int_reward_func(completions, **kwargs) -> list[float]:
responses = [completion[0]["content"] for completion in completions]
extracted_responses = [extract_xml_answer(r) for r in responses]
return [0.5 if r.isdigit() else 0.0 for r in extracted_responses]
def strict_format_reward_func(completions, **kwargs) -> list[float]:
"""Reward function that checks if the completion has a specific format."""
pattern = r"^<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>\n$"
responses = [completion[0]["content"] for completion in completions]
matches = [re.match(pattern, r) for r in responses]
return [0.5 if match else 0.0 for match in matches]
def soft_format_reward_func(completions, **kwargs) -> list[float]:
"""Reward function that checks if the completion has a specific format."""
pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
responses = [completion[0]["content"] for completion in completions]
matches = [re.match(pattern, r) for r in responses]
return [0.5 if match else 0.0 for match in matches]
def count_xml(text) -> float:
count = 0.0
if text.count("<reasoning>\n") == 1:
count += 0.125
if text.count("\n</reasoning>\n") == 1:
count += 0.125
if text.count("\n<answer>\n") == 1:
count += 0.125
count -= len(text.split("\n</answer>\n")[-1]) * 0.001
if text.count("\n</answer>") == 1:
count += 0.125
count -= (len(text.split("\n</answer>")[-1]) - 1) * 0.001
return count
def xmlcount_reward_func(completions, **kwargs) -> list[float]:
contents = [completion[0]["content"] for completion in completions]
return [count_xml(c) for c in contents]
...
trainer = GRPOTrainer(
model=model,
processing_class=tokenizer,
reward_funcs=[ # Personal note: didn't expect this be so simple to implement @@
xmlcount_reward_func,
soft_format_reward_func,
strict_format_reward_func,
int_reward_func,
correctness_reward_func,
],
args=training_args,
train_dataset=dataset,
)
trainer.train()
```
- [x] Just curious, how did the team implement the reward functions for [Alphamaze](https://github.com/menloresearch/visual-thinker)?
- Below is from Alphamaze's repo
- > We designed a reward function 3 components. Correctness Reward (+0.2 per solution step): This reward is scaled according to the number of steps in the maze solution. Each valid movement step adds 0.2 points to the total score. For example, a solution requiring 4 steps earns a reward of 0.2 × 4 = 0.8 points, incentivizing both accuracy and efficiency in navigation. Integrity Reward (+0.5): This reward is given for each valid movement token (<|up|>, <|down|>, <|left|>, <|right|>) in the predicted sequence, encouraging the generation of meaningful and valid movement steps.
- > Thinking Reward (+0.25): This reward is given for correctly using the <think> tag in the output, ensuring completeness and consistency in the reasoning format. These reward components were weighted to prioritize correctness while also encouraging valid movement sequences and proper reasoning formatting with <think> tag. We adapted the Group Relative Policy Optimization (GRPO) algorithm, as employed in DeepSeek-R1 [Guo et al., 2025], to perform reinforcement learning. GRPO estimates advantages based on relative group scores, offering computational efficiency compared to critic-based methods.
```python
def xmlcount_reward_func(completions, **kwargs) -> List[float]:
"""
Reward function based on proper XML tag usage.
Args:
completions: Model completions
Returns:
List of reward scores
"""
contents = [completion[0]["content"] for completion in completions]
return [count_xml(c) for c in contents]
def int_reward_func(completions, **kwargs) -> List[float]:
"""
Reward function that checks if responses contain valid direction tokens.
Args:
completions: Model completions
Returns:
List of reward scores
"""
allowed_tokens = {"<|up|>", "<|down|>", "<|right|>", "<|left|>"}
responses = [completion[0]['content'] for completion in completions]
extracted_responses = [extract_xml_answer(r) for r in responses]
def correctness_reward_func(prompts, completions, answer, **kwargs) -> List[float]:
"""
Reward function that checks correctness of answers.
Args:
prompts: Input prompts
completions: Model completions
answer: Ground truth answers
Returns:
List of reward scores
"""
rewards = []
responses = [completion[0]['content'] for completion in completions]
q = prompts[0][-1]['content']
extracted_responses = [extract_xml_answer(r) for r in responses]
logger.debug('-'*20)
logger.debug(f"Question:\n{q}")
logger.debug(f"\nAnswer:\n{answer[0]}")
logger.debug(f"\nResponse:\n{responses[0]}")
logger.debug(f"\nExtracted:\n{extracted_responses[0]}")
for r, a in zip(extracted_responses, answer):
if r == a:
direction = r.split("|><|")
rewards.append(len(direction)*0.2)
else:
rewards.append(0.0)
return rewards
# def strict_format_reward_func(completions, **kwargs) -> List[float]:
# """
# Reward function that checks if completions strictly follow the required format.
# Args:
# completions: Model completions
# Returns:
# List of reward scores
# """
# pattern = r"^<think>\n.*?\n</think>\n\n.*?\n$"
# responses = [completion[0]["content"] for completion in completions]
# matches = [re.match(pattern, r, re.DOTALL) for r in responses]
# return [0.5 if match else 0.0 for match in matches]
# def soft_format_reward_func(completions, **kwargs) -> List[float]:
# """
# Reward function that checks if completions loosely follow the required format.
# Args:
# completions: Model completions
# Returns:
# List of reward scores
# """
# pattern = r"<think>.*?</think>\s*.*?"
# responses = [completion[0]["content"] for completion in completions]
# matches = [re.match(pattern, r, re.DOTALL) for r in responses]
# return [0.5 if match else 0.0 for match in matches]
...
reward_funcs=[
xmlcount_reward_func,
# soft_format_reward_func,
# strict_format_reward_func,
int_reward_func,
correctness_reward_func,
],
```
## Comparison of Alphamaze's reward functions and unsloth's
| Feature | Unsloth Example | AlphaMaze | Similarities | Differences |
| :-------------------------- | :------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :------------------------------------------------------------------------------------------------------------- | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Overall Purpose** | To evaluate and score the quality of model-generated text based on various criteria (format, correctness, content). | Same as Unsloth. | Both aim to provide numerical rewards for model outputs based on defined criteria. | AlphaMaze appears more focused on a specific maze-solving task (directions in the answer), while Unsloth's examples are more general, including evaluating whether a number prediction can be cast to integer . |
| **Function Structure** | Functions generally take `completions` (and sometimes `prompts`, `answer`) as input. Return a list of floats (rewards). | Same as Unsloth. | Both use functions that take model outputs (and sometimes inputs) and return lists of reward scores. | AlphaMaze's `correctness_reward_func` calculates a reward based on the *length* of the correct answer (number of directions), while Unsloth's gives a fixed reward (2.0) for a correct answer, and 0 otherwise. |
| **Reward Types** | - `correctness_reward_func`: Checks if the extracted answer matches the ground truth. Binary reward (2.0 or 0.0).<br> - `int_reward_func`: Checks if extracted answer is a digit. Binary reward (0.5 or 0.0).<br> - `strict_format_reward_func`, `soft_format_reward_func`: Check for specific XML-like formatting using regular expressions. Binary reward (0.5 or 0.0).<br> - `xmlcount_reward_func`: Counts XML tags, providing a fractional reward based on tag presence and penalizing trailing text. | - `correctness_reward_func`: Checks if extracted answer matches ground truth. Reward is proportional to answer length (0.2 per direction).<br> - `int_reward_func`: Checks if the answer consists of allowed tokens. The implementation in this code is not complete. <br> - `xmlcount_reward_func`: Same as Unsloth's.<br> - `strict_format_reward_func` (commented out): Checks for a specific format using regex.<br> - `soft_format_reward_func` (commented out): Checks for a looser format. | - Both have `correctness_reward_func`, `int_reward_func`, `xmlcount_reward_func` (though implemented slightly differently).<br>- Both use regular expressions for format checking. | - Unsloth uses a simpler binary reward for correctness. AlphaMaze uses a length-based reward.<br>- Unsloth's `int_reward_func` check if castable to integer, AlphaMaze's intends to check for allowed direction tokens (but the implementation is not finished).<br>- AlphaMaze's formatting functions are commented out. |
| **`correctness_reward_func`** | Compares extracted answer to ground truth. Prints debugging information. Returns 2.0 for correct, 0.0 otherwise. | Compares extracted answer to ground truth, calculates reward based on the *length* of the correct answer (number of direction steps, 0.2 per step). Logs debugging information. | Both compare the extracted answer to the ground truth answer and print/log debugging information. | - Unsloth returns a fixed reward (2.0) for a correct answer.<br> - AlphaMaze's reward is proportional to the length of the correct answer (0.2 per direction). |
| **`int_reward_func`** | Checks if the extracted response `isdigit()`. Returns 0.5 if true, 0.0 otherwise. | Intended to check if the response contains allowed direction tokens (`<|up|>`,`<|down|>`, etc.). The provided code *does not* actually implement this check. The lines where the response is processes are incomplete and non-functional. | Both are intended to evaluate specific characteristics of the extracted response. | - Unsloth's checks for digits.<br>- AlphaMaze's *intended* functionality is to check for specific tokens, but the code, as shown, does not implement this, and the reward return is not defined. |
| **`xmlcount_reward_func`** | Same implementation in both. Counts opening/closing tags, penalizes extra text. | Same implementation in both. | Identical implementation. | None. |
| **Format Checking** | Uses `strict_format_reward_func` and `soft_format_reward_func` with different regular expressions. | Has `strict_format_reward_func` and `soft_format_reward_func` (commented out) with different regular expressions. | Both use regular expressions to check for specific formatting patterns. | - Unsloth's format checks look for `<reasoning>` and `<answer>` tags.<br>- AlphaMaze's (commented out) checks look for `<think>` tags and a general structure.<br>- Unsloth's are active; AlphaMaze's are commented out. |
| **Extracted Answer** | Both use an `extract_xml_answer` function (not shown in the provided snippets, but assumed to be defined elsewhere). | Same as Unsloth. | Both rely on an external function to extract the relevant part of the response for evaluation. | We don't know the exact implementation of `extract_xml_answer`, so there might be subtle differences. However, the *use* is the same. |

@ -0,0 +1,8 @@
# Search backends
- Purpose: add more noise to the training process (we already did this in the initial dataset)
- Different search strategies? Semantic search, keyword search, BM25, an actual API call (see the BM25 sketch below)
- Embedding models, Retrieval mechanisms (BM25, dense, hybrid), Query expansion methods, Reranking strategies
- Random search engine assignment per query
- Noise and inconsistency injection to prevent shortcut learning
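
A minimal sketch of one alternative backend: keyword/BM25 search over the saved chunks, using the `rank_bm25` package (an assumed extra dependency, not in `requirements.txt`).

```python
# Hedged sketch of a BM25 keyword-search backend over the saved chunks.
# rank_bm25 is an assumed extra dependency; chunks.pkl comes from generate_dataset.py.
import pickle
from typing import List

from rank_bm25 import BM25Okapi

with open("saved_data/chunks.pkl", "rb") as f:
    chunks = pickle.load(f)

corpus = [chunk.page_content for chunk in chunks]
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])


def bm25_search(query: str, k: int = 2) -> List[str]:
    """Return the top-k chunk texts for a keyword query."""
    return bm25.get_top_n(query.lower().split(), corpus, n=k)


print(bm25_search("command module pilot substitution"))
```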

@ -0,0 +1,5 @@
# Self Verification
- [x] Investigate this term: the word is mentioned in AutoDidact's About section and also in the DeepSeek R1 paper (not in much detail), but not in blogs or the code base. I think this term is important and should be investigated
- Lol, a "Verifier" is just a synonym for **reward function**
- <https://docs.unsloth.ai/basics/reasoning-grpo-and-rl#reward-functions-verifier>

@ -0,0 +1,92 @@
from typing import List, Union
import torch
import torch.nn.functional as F
from langchain.embeddings.base import Embeddings
from transformers import AutoModel, AutoTokenizer
# Set a default model here
DEFAULT_MODEL_NAME = "avsolatorio/NoInstruct-small-Embedding-v0"
class CustomHuggingFaceEmbeddings(Embeddings):
"""
A custom embeddings class that wraps a Hugging Face model for generating embeddings.
Supports two modes:
- "sentence": uses the [CLS] token representation for sentence/document embeddings.
- "query": uses mean pooling over tokens (weighted by the attention mask) for query embeddings.
"""
def __init__(
self, model_name: str = DEFAULT_MODEL_NAME, default_mode: str = "sentence"
):
self.model_name = model_name
# Set device to GPU if available, else CPU
self.device = "cuda" if torch.cuda.is_available() else "cpu"
self.model = AutoModel.from_pretrained(model_name).to(self.device)
self.tokenizer = AutoTokenizer.from_pretrained(model_name)
self.default_mode = default_mode # "sentence" or "query"
self.model.eval() # Set model to evaluation mode
def get_embedding(self, text: Union[str, List[str]], mode: str = None):
if mode is None:
mode = self.default_mode
assert mode in (
"query",
"sentence",
), f"Unsupported mode: {mode}. Only 'query' and 'sentence' are supported."
# Ensure we are working with a list of texts
if isinstance(text, str):
text = [text]
# Tokenize the input texts
inp = self.tokenizer(text, return_tensors="pt", padding=True, truncation=True)
# Move the input tensors to the same device as the model
inp = {key: value.to(self.device) for key, value in inp.items()}
# Forward pass (no gradients needed)
with torch.no_grad():
output = self.model(**inp)
if mode == "query":
# Mean pooling: weight by attention mask and average across tokens
vectors = output.last_hidden_state * inp["attention_mask"].unsqueeze(2)
vectors = vectors.sum(dim=1) / inp["attention_mask"].sum(dim=-1).view(-1, 1)
else:
# Sentence/document embedding: use the [CLS] token (first token) representation
vectors = output.last_hidden_state[:, 0, :]
return vectors
def embed_documents(self, texts: List[str]) -> List[List[float]]:
"""
Compute embeddings for a list of documents (using sentence mode).
"""
vectors = self.get_embedding(texts, mode="sentence")
return vectors.cpu().numpy().tolist()
def embed_query(self, text: str) -> List[float]:
"""
Compute an embedding for a single query.
"""
vector = self.get_embedding(text, mode="query")
return vector.cpu().numpy()[0].tolist()
# For quick testing
if __name__ == "__main__":
embeddings = CustomHuggingFaceEmbeddings()
# Example texts for document embeddings
texts = [
"Illustration of the REaLTabFormer model. The left block shows the non-relational tabular data model using GPT-2.",
"Predicting human mobility holds significant practical value, with applications in disaster planning and epidemic simulation.",
"As economies adopt digital technologies, policy makers are asking how to prepare the workforce for emerging labor demands.",
]
doc_embeddings = embeddings.embed_documents(texts)
print("Document embeddings:", doc_embeddings)
# Example query embedding
query_embedding = embeddings.embed_query("Which sentence talks about jobs?")
print("Query embedding:", query_embedding)

@ -0,0 +1,230 @@
"""
This script performs two main tasks:
1. It loads a markdown document, splits it into chunks, generates embeddings,
and builds a FAISS index (which is saved locally).
2. It generates QA pairs from the document using llama.
For each chunk (using a sliding window for context), it generates multiple question-answer pairs
with different difficulties. The generation is performed in batch with one retry for failed prompts.
Successfully generated QA pairs are saved to "saved_data/questions.json".
Requirements:
pip install langchain faiss-cpu unsloth vllm
"""
import json
import os
import pickle
import re
from typing import Dict, List, Optional, Tuple
from langchain.text_splitter import RecursiveCharacterTextSplitter
# ========= Part 1: Document Processing and Embedding Generation =========
# Load and split the markdown document using LangChain
from langchain_community.document_loaders import UnstructuredMarkdownLoader
from langchain_community.vectorstores import FAISS
from embeddings import CustomHuggingFaceEmbeddings
# Load your markdown file (adjust the path as needed)
loader = UnstructuredMarkdownLoader("./data/mission_report.md")
docs = loader.load()
# Split the document into smaller chunks (each 1000 characters, no overlap)
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
chunks = text_splitter.split_documents(docs)
# Save chunks for later use
os.makedirs("saved_data", exist_ok=True)
with open("saved_data/chunks.pkl", "wb") as f:
pickle.dump(chunks, f)
print(f"Saved {len(chunks)} chunks to saved_data/chunks.pkl")
embeddings = CustomHuggingFaceEmbeddings()
# Create a FAISS vector store from the document chunks and save it locally
vectorstore = FAISS.from_documents(chunks, embeddings)
vectorstore.save_local("faiss_index")
print("Saved FAISS index to 'faiss_index'")
# ========= Part 2: QA Generation using Llama Backend =========
# Setup Llama backend via unsloth and vLLM
from unsloth import FastLanguageModel
from vllm import SamplingParams
import rl_helpers # Ensure you have this or remove if not used
# Load the Llama model (adjust parameters as needed)
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="meta-llama/meta-Llama-3.1-8B-Instruct",
max_seq_length=4096,
load_in_4bit=True, # Use 4-bit quantization if desired
fast_inference=True, # Enable fast inference
gpu_memory_utilization=0.6, # Adjust based on your GPU memory
)
# Define sampling parameters for generation
sampling_params = SamplingParams(
temperature=0.3,
top_p=0.95,
max_tokens=4096,
)
def batch_generate(prompts: List[str]) -> List[str]:
"""
Given a list of prompt strings, returns a list of generated outputs.
"""
def format_input(text: str) -> str:
return tokenizer.apply_chat_template(
[{"role": "user", "content": text}],
tokenize=False,
add_generation_prompt=True,
)
formatted = [format_input(p) for p in prompts]
outputs = model.fast_generate(formatted, sampling_params=sampling_params)
return [output.outputs[0].text for output in outputs]
def parse_qa_block(block: str) -> Optional[Tuple[str, str, str]]:
"""
Parses a QA block that should contain exactly three non-empty lines:
- A line starting with "Question:"
- A line starting with "Answer:"
- A line starting with "Difficulty:"
If the markers are not present but the block contains exactly three lines,
those are used in order.
Returns a tuple (question, answer, difficulty) or None if parsing fails.
"""
lines = [line.strip() for line in block.splitlines() if line.strip()]
if not lines:
return None
question, answer, difficulty = None, None, None
for line in lines:
lower = line.lower()
if question is None and lower.startswith("question:"):
question = line[len("question:") :].strip()
elif answer is None and lower.startswith("answer:"):
answer = line[len("answer:") :].strip()
elif difficulty is None and lower.startswith("difficulty:"):
difficulty = line[len("difficulty:") :].strip()
if question and answer and difficulty:
return question, answer, difficulty
if len(lines) == 3:
return lines[0], lines[1], lines[2]
return None
def parse_multiple_qa_output(output: str) -> List[Tuple[str, str, str]]:
"""
Splits the output into blocks (separated by one or more blank lines) and
attempts to parse each as a QA pair.
Returns a list of successfully parsed QA tuples.
"""
blocks = re.split(r"\n\s*\n", output.strip())
qa_pairs = []
for block in blocks:
parsed = parse_qa_block(block)
if parsed:
qa_pairs.append(parsed)
return qa_pairs
def generate_question_batch_for_chunks(
chunks: List, num_questions: int = 2, difficulty: str = None
) -> List[Dict]:
"""
Generates QA pairs for multiple chunks in batch.
For each chunk (except the first and last), a sliding window is used for context:
- before: previous chunk's content
- current: current chunk's content
- after: next chunk's content
Each prompt instructs the model to output exactly three lines per QA pair with markers.
Failed prompts are retried once in batch; if still unsuccessful, they are skipped.
Returns a list of dicts with keys: "chunk_id", "question", "answer", "difficulty".
"""
prompts = []
chunk_ids = []
# Prepare prompts using a sliding window
for i in range(1, len(chunks) - 1):
before = chunks[i - 1].page_content
current = chunks[i].page_content
after = chunks[i + 1].page_content
prompt = (
f"From the text within ==BEGIN== and ==END==, generate {num_questions} questions with answers.\n"
"For each QA pair, output exactly three lines with no extra commentary:\n"
"Line 1: Question: <your question>\n"
"Line 2: Answer: <the answer>\n"
"Line 3: Difficulty: <easy, medium, or hard>\n"
"Do not include any additional text.\n\n"
"==BEGIN==\n"
f"{before}\n{current}\n{after}\n"
"==END==\n"
)
prompts.append(prompt)
chunk_ids.append(i)
# First batch generation
outputs = batch_generate(prompts)
results = [None] * len(outputs)
failed_indices = []
# Parse each output
for idx, output in enumerate(outputs):
qa_pairs = parse_multiple_qa_output(output)
if qa_pairs is None or len(qa_pairs) < num_questions:
failed_indices.append(idx)
else:
results[idx] = qa_pairs[:num_questions]
# Retry failed prompts in batch
if failed_indices:
print(f"Retrying {len(failed_indices)} failed prompt(s)...")
retry_prompts = [prompts[i] for i in failed_indices]
retry_outputs = batch_generate(retry_prompts)
for j, idx in enumerate(failed_indices):
qa_pairs = parse_multiple_qa_output(retry_outputs[j])
if qa_pairs is not None and len(qa_pairs) >= num_questions:
results[idx] = qa_pairs[:num_questions]
else:
results[idx] = None # Mark as failed
# Build final output, skipping prompts that failed even after retry
final_questions = []
for i, qa_list in enumerate(results):
if qa_list is not None:
for qa in qa_list:
final_questions.append(
{
"chunk_id": chunk_ids[i],
"question": qa[0],
"answer": qa[1],
"difficulty": qa[2],
}
)
return final_questions
# Generate QA pairs in batch (using a sliding window over the chunks)
all_questions = generate_question_batch_for_chunks(
chunks, num_questions=2, difficulty="medium"
)
print(f"Generated {len(all_questions)} QA pairs.")
# Save the QA pairs to a JSON file
questions_path = os.path.join("saved_data", "questions.json")
with open(questions_path, "w") as f:
json.dump(all_questions, f, indent=2)
print(f"Saved questions to {questions_path}")

@ -0,0 +1,2 @@
unsloth_compiled_cache
0_*

File diff suppressed because it is too large

File diff suppressed because it is too large

File diff suppressed because it is too large

@ -0,0 +1,17 @@
datasets
faiss-cpu
langchain
langchain-community
Markdown
tokenizers
transformers
unsloth==2025.3.6
unsloth_zoo==2025.3.4
unstructured
vllm
wandb
ipykernel
python-dotenv
loguru
gradio

@ -0,0 +1,540 @@
"""
RL helpers module for handling tool-based conversations.
This module provides utility functions for handling chat-based tool interactions
and calculating rewards based on the quality of responses.
"""
import asyncio
import json
import re
from dataclasses import dataclass
from datetime import datetime
import nest_asyncio
import torch
from search_module import get_qa_dataset, search
nest_asyncio.apply()
from typing import Callable, List
from trl.trainer.grpo_trainer import apply_chat_template
# Constants for prompts and tool definitions
def get_system_prompt():
"""Get the system prompt with current date."""
current_date = datetime.now().strftime("%d %b %Y")
return f"""Cutting Knowledge Date: December 2023
Today Date: {current_date}
When you receive a tool call response, use the output to format an answer to the original user question.
You are a helpful assistant with tool calling capabilities.
"""
# Tool definition for search corpus
SEARCH_TOOL_DEFINITION = {
"type": "function",
"function": {
"name": "search_corpus",
"description": "Search over the knowledge corpus with a given query",
"parameters": {
"type": "object",
"properties": {
"query": {
"type": "string",
"description": "The query to search the knowledge corpus with",
},
},
"required": ["query"],
},
},
}
def build_user_prompt(q):
"""
Build a user prompt with the question and search tool definition.
Args:
q (str): The question to ask
Returns:
str: Formatted user prompt
"""
user_prompt = f"""You are a research assistant, and you use the search_corpus tool to find answers to questions.
Given a question, answer it using by doing searches using the search_corpus tool.
To use the search_corpus tool, respond with a JSON for a function call with its proper arguments.
You may also reason in any message, thinking step by step about how to answer the question. Wrap your reasoning in <reasoning> and </reasoning> tags.
{json.dumps(SEARCH_TOOL_DEFINITION, indent=2)}
Question: {q}
"""
return user_prompt
def get_initial_chat(question):
"""
Initialize a chat state with the question.
Args:
question (str): The question to ask
Returns:
dict: Initial chat state with system and user messages
"""
return {
"messages": [
{"role": "system", "content": get_system_prompt()},
{"role": "user", "content": build_user_prompt(question)},
]
}
def extract_json_objects(text):
"""
Extracts JSON objects (dictionaries) from a text that may contain multiple JSON objects.
Args:
text (str): The input text possibly containing JSON objects.
Returns:
list: A list of parsed JSON objects (dictionaries) extracted from the text.
"""
results = []
length = len(text)
i = 0
while i < length:
# Look for the start of a JSON object
if text[i] == "{":
start = i
stack = 1
i += 1
# Continue until we find the matching closing brace
while i < length and stack > 0:
if text[i] == "{":
stack += 1
elif text[i] == "}":
stack -= 1
i += 1
# Only attempt to decode if the braces are balanced
if stack == 0:
candidate = text[start:i]
try:
obj = json.loads(candidate)
# Optionally, ensure it's a dictionary if that's what you expect
if isinstance(obj, dict):
results.append(obj)
except json.JSONDecodeError:
# If it's not valid JSON, skip it.
pass
else:
i += 1
return results
def remove_reasoning(text: str) -> str:
"""
Removes all content between <reasoning> and </reasoning> tags,
including the tags themselves.
Parameters:
text (str): The input text that may contain <reasoning>...</reasoning> tags.
Returns:
str: The text with the tags and their content removed.
"""
# The regex pattern matches from <reasoning> to </reasoning> non-greedily.
pattern = r"<reasoning>.*?</reasoning>"
cleaned_text = re.sub(pattern, "", text, flags=re.DOTALL)
return cleaned_text
def run_agent_generations(generate_fn, tokenizer, chat_states):
"""
Run generation for chat states requiring assistant responses.
Args:
generate_fn: Function to generate responses
tokenizer: Tokenizer for processing text
chat_states: List of chat states
Returns:
list: Updated chat states
"""
prompts = []
batch_indices = []
# Prepare prompts for chat states needing an assistant response.
for idx, chat_state in enumerate(chat_states):
if chat_state.get("finished"):
continue
if chat_state["messages"][-1]["role"] in ["ipython", "user"]:
prompt = apply_chat_template(chat_state, tokenizer=tokenizer)["text"]
prompts.append(prompt)
batch_indices.append(idx)
if prompts:
responses = generate_fn(prompts)
for i, idx in enumerate(batch_indices):
chat_state = chat_states[idx]
full_response = responses[i].outputs[0].text
assistant_response = full_response.split(
"<|start_header_id|>assistant<|end_header_id|>"
)[-1]
chat_state["messages"].append(
{"role": "assistant", "content": assistant_response}
)
return chat_states
def check_finished_chats(chat_states):
"""
Check which chat states are finished (no more function calls).
Args:
chat_states: List of chat states
Returns:
list: Updated chat states with finished flag
"""
for chat_state in chat_states:
if chat_state.get("finished"):
continue
assert (
chat_state["messages"][-1]["role"] == "assistant"
), "Expected the last role to be assistant"
assistant_response = chat_state["messages"][-1]["content"]
function_calls = extract_json_objects(assistant_response)
if len(function_calls) == 0:
chat_state["finished"] = True
return chat_states
def run_tool_calls(chat_states):
"""
Execute tool calls found in chat states.
Args:
chat_states: List of chat states
Returns:
list: Updated chat states with tool call results
"""
for chat_state in chat_states:
if chat_state.get("finished"):
continue
assert (
chat_state["messages"][-1]["role"] == "assistant"
), "Expected the last role to be assistant to run tool calls"
try:
assistant_response = chat_state["messages"][-1]["content"]
function_calls = extract_json_objects(assistant_response)
if len(function_calls) > 1:
raise ValueError(
"Expected only one function call in assistant response"
)
elif len(function_calls) == 1:
function_call = function_calls[0]
query = function_call["function"]["parameters"]["query"]
results = search(query, return_type=str, results=2)
chat_state["messages"].append({"role": "ipython", "content": results})
except Exception as e:
chat_state["messages"].append(
{"role": "system", "content": f"Error during post-processing: {str(e)}"}
)
chat_state["finished"] = True
return chat_states
def get_mask(text, tokenizer):
encoding = tokenizer(text, add_special_tokens=False)
start_header_id = tokenizer.convert_tokens_to_ids("<|start_header_id|>")
assistant_token = tokenizer.convert_tokens_to_ids("assistant")
end_header_id = tokenizer.convert_tokens_to_ids("<|end_header_id|>")
eot_id = tokenizer.convert_tokens_to_ids("<|eot_id|>")
assistant_ranges = []
i = 0
while i < len(encoding.input_ids) - 1:
if (
encoding.input_ids[i] == start_header_id
and encoding.input_ids[i + 1] == assistant_token
):
i += 2
while (
i < len(encoding.input_ids) and encoding.input_ids[i] != end_header_id
):
i += 1
i += 2
start_idx = i
while i < len(encoding.input_ids) and encoding.input_ids[i] != eot_id:
i += 1
end_idx = i
assistant_ranges.append((start_idx, end_idx))
else:
i += 1
mask = [0] * len(encoding.input_ids)
for start_idx, end_idx in assistant_ranges:
for idx in range(start_idx, end_idx):
mask[idx] = 1
return torch.tensor(mask, dtype=torch.int)
def check_exceeded_max_new_tokens(chat_states, max_new_tokens, tokenizer):
for chat_state in chat_states:
if chat_state.get("finished"):
continue
initial_length = chat_state["initial_length"]
new_length = get_chat_num_tokens(chat_state, tokenizer)
if new_length - initial_length > max_new_tokens:
chat_state["finished"] = True
return chat_states
@dataclass
class AgenticOutputs:
prompt_tokens: list[torch.Tensor]
response_tokens: list[torch.Tensor]
response_masks: list[torch.Tensor]
final_response_str: list[str]
full_chat_states: list[dict]
def get_chat_num_tokens(chat_state, tokenizer):
chat_text = apply_chat_template(chat_state, tokenizer=tokenizer)["text"]
return (
tokenizer(chat_text, add_special_tokens=False, return_tensors="pt")["input_ids"]
.squeeze()
.shape[0]
)
def run_agent(
generate_fn, tokenizer, questions, max_generations=5, max_new_tokens=4096
):
"""
Run the agent to completion for a batch of questions.
Args:
generate_fn: Function to generate model responses
tokenizer: Tokenizer for processing text
batch: Batch of data containing questions
max_generations: Maximum number of generation steps
Returns:
list: Final answers for each question
"""
chat_states = [get_initial_chat(q) for q in questions]
# set the initial_prompt length
for chat_state in chat_states:
chat_state["initial_length"] = get_chat_num_tokens(chat_state, tokenizer)
# agent loop
for i in range(max_generations):
chat_states = run_agent_generations(generate_fn, tokenizer, chat_states)
chat_states = check_finished_chats(chat_states)
chat_states = run_tool_calls(chat_states)
chat_states = check_exceeded_max_new_tokens(
chat_states, max_new_tokens, tokenizer
)
answers = []
for chat in chat_states:
answers.append(chat["messages"][-1]["content"])
def split_prompt_assistant(convo_text):
marker = "<|start_header_id|>assistant<|end_header_id|>"
idx = convo_text.find(marker)
if idx == -1:
raise ValueError("Could not find assistant marker in conversation text.")
return convo_text, ""
# Include the marker in the prompt by slicing up to the end of the marker.
prompt = convo_text[: idx + len(marker)]
# The assistant response is everything after the marker.
assistant_response = convo_text[idx + len(marker) :]
return prompt, assistant_response
str_chats = [
apply_chat_template(chat, tokenizer=tokenizer)["text"] for chat in chat_states
]
prompt_toks, response_toks, response_masks = [], [], []
for str_chat in str_chats:
prompt, response = split_prompt_assistant(str_chat)
prompt_toks.append(
tokenizer(prompt, add_special_tokens=False, return_tensors="pt")[
"input_ids"
].squeeze()
)
response_toks.append(
tokenizer(response, add_special_tokens=False, return_tensors="pt")[
"input_ids"
].squeeze()[:max_new_tokens]
)
mask = get_mask(str_chat, tokenizer)[len(prompt_toks[-1]) :][:max_new_tokens]
response_masks.append(mask)
final_response_str = [chat["messages"][-1]["content"] for chat in chat_states]
full_chat_states = chat_states
agentic_outputs = AgenticOutputs(
prompt_tokens=prompt_toks,
response_tokens=response_toks,
response_masks=response_masks,
final_response_str=final_response_str,
full_chat_states=full_chat_states,
)
return agentic_outputs
# Verification
async def check_correctness(question, student_answer, answer):
"""
Calculate reward for a given student answer.
Args:
question (str): The original question
student_answer (str): The model's answer
answer (str): The ground truth answer
Returns:
float: Reward value (1 for correct, 0 for incorrect)
"""
# log to "./reward_func.log"
with open("reward_func.log", "a") as f:
f.write("\n" + "==" * 40 + "\n\n")
f.write(f"Question: {question}\n")
f.write(f"Student Answer: {student_answer}\n")
f.write(f"Answer: {answer}\n")
if student_answer.startswith("Error during"):
f.write(f"failed function call")
return 0
if len(student_answer) < 5:
f.write(f"failed Too short answer\n")
return 0
else:
f.write(f"last message didn't fail\n")
student_answer_clean = remove_reasoning(student_answer)
is_correct = await verify(student_answer_clean, question, answer)
f.write(f"Is Correct: {is_correct}, so reward is {int(is_correct)}\n")
return 1 if is_correct else 0
def check_student_answers(
questions: List[str],
answers: List[str],
student_answers: List[str],
vllm_generate_func: Callable[[List[str]], List[str]],
tokenizer,
log_file: str = "qa_log.txt",
) -> List[bool]:
"""
Evaluates a list of student answers against the true answers using a vLLM generate function.
The function applies the chat template to each prompt before passing it to the generate function.
It also appends the details of each QA pair and the verifier's response to a log file.
Args:
questions: A list of strings representing the questions.
answers: A list of strings representing the correct answers.
student_answers: A list of strings containing the student's answers.
vllm_generate_func: A function that takes a list of chat-formatted prompt strings and returns a list of generated outputs.
tokenizer: The tokenizer used to apply the chat template.
log_file: Optional; path to the file where the QA pairs and verification responses will be appended.
Returns:
A list of booleans indicating whether each student's answer is correct.
"""
if not (len(questions) == len(answers) == len(student_answers)):
raise ValueError(
"The number of questions, answers, and student answers must be equal."
)
prompts = []
for question, answer, student_ans in zip(questions, answers, student_answers):
# Construct the plain text prompt for each QA pair.
prompt_text = (
"You are grading a student's answer. For the following question, "
"compare the student's answer to the correct answer. Reply with 'Yes' if the student's answer is correct, or 'No' if it is completely incorrect.\n\n"
f"Question: {question}\n"
f"Correct Answer: {answer}\n"
f"Student Answer: {student_ans}\n"
)
# Apply the chat template to the prompt.
formatted_prompt = tokenizer.apply_chat_template(
[{"role": "user", "content": prompt_text}],
tokenize=False,
add_generation_prompt=True,
)
prompts.append(formatted_prompt)
# Get the model responses in batch (each response should ideally be "Yes" or "No")
responses = vllm_generate_func(prompts)
responses_text = [response.outputs[0].text for response in responses]
# Evaluate each response and mark as correct if "yes" appears in the answer (case-insensitive)
results = []
for response in responses_text:
results.append("yes" in response.lower())
# Append the QA details and verifier's response to the specified log file
with open(log_file, "a") as file:
for question, answer, student_ans, verifier_response in zip(
questions, answers, student_answers, responses_text
):
file.write("Question: " + question + "\n")
file.write("Correct Answer: " + answer + "\n")
file.write("Student Answer: " + student_ans + "\n")
file.write("Verifier said: " + verifier_response + "\n")
file.write("-" * 40 + "\n")
return results
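# Sketch of how this verifier can be wired up (the concrete values are
# assumptions; vllm_generate_func must return vLLM-style outputs where
# response.outputs[0].text is the generated string):
#   is_correct = check_student_answers(
#       questions=["What year did the program launch?"],
#       answers=["2019"],
#       student_answers=["It launched in 2019."],
#       vllm_generate_func=model.fast_generate,
#       tokenizer=tokenizer,
#   )
#   # -> [True] if the grader model replies with "Yes"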
def build_reward_correctness_fn(generate_fn, tokenizer):
def reward_correctness(prompts, completions, **reward_kwargs):
teacher_answers = reward_kwargs["answer"]
student_answers = [
completion["messages"][-1]["content"] for completion in completions
]
correct = check_student_answers(
prompts,
teacher_answers,
student_answers,
vllm_generate_func=generate_fn,
tokenizer=tokenizer,
)
return correct
return reward_correctness
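# Minimal sketch of how the returned reward function is called by the trainer
# (argument shapes are assumptions inferred from the GRPO usage further below):
#   reward_correctness = build_reward_correctness_fn(verifier_generate_fn, tokenizer)
#   scores = reward_correctness(
#       prompts=["Who founded SpaceX?"],
#       completions=[{"messages": [{"role": "assistant", "content": "Elon Musk"}]}],
#       answer=["Elon Musk"],
#   )
#   # -> one True/False per completion, used directly as the correctness reward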
def reward_formatting(prompts, completions, **reward_kwargs):
# make sure the full chats don't contain any failed tool calls
has_error = [False] * len(completions)
for i, chat in enumerate(completions):
for message in chat["messages"]:
if "Error during" in message["content"]:
has_error[i] = True
break
return [0.7 if not e else 0 for e in has_error]
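# Example (hypothetical chat state): a completion whose messages contain an
# "Error during ..." tool message scores 0, otherwise 0.7:
#   reward_formatting(prompts=["q"], completions=[{"messages": [{"content": "ok"}]}])
#   # -> [0.7]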
def run_eval(generate_fn, verify_fn, tokenizer):
train_dataset, test_dataset = get_qa_dataset()
questions = test_dataset["prompt"]
agentic_outputs = run_agent(generate_fn, tokenizer, questions)
full_chat_states = agentic_outputs.full_chat_states
final_responses = agentic_outputs.final_response_str
rewards = verify_fn(questions, full_chat_states, answer=test_dataset["answer"])
print("RESULTS:")
print("percentage of correct answers:", sum(rewards) / len(rewards))
print("=" * 30)
return full_chat_states

@ -0,0 +1,194 @@
"""
Search module for RL training loop.
This module provides functions to search through vectorized documents and retrieve question-answer pairs.
"""
import pickle
import json
import random
from typing import List, Optional, Union
from langchain.vectorstores import FAISS
from datasets import Dataset
from embeddings import CustomHuggingFaceEmbeddings
# Load pre-saved vectorstore
def load_vectorstore():
"""Load the pre-saved FAISS index"""
try:
import os
embeddings = CustomHuggingFaceEmbeddings()
# Load the FAISS index with absolute path
index_path = os.path.join(
os.path.dirname(os.path.abspath(__file__)), "faiss_index"
)
print(f"Loading FAISS index from: {index_path}")
vectorstore = FAISS.load_local(
index_path, embeddings, allow_dangerous_deserialization=True
)
print("Successfully loaded FAISS index")
return vectorstore
except Exception as e:
print(f"Error loading vectorstore: {e}")
import traceback
traceback.print_exc()
return None
# Load the vectorstore when module is imported
try:
vectorstore = load_vectorstore()
if vectorstore is None:
print("Warning: FAISS vectorstore could not be loaded.")
except Exception as e:
print(f"Error loading vectorstore: {e}")
vectorstore = None
def search(query: str, return_type=str, results: int = 5) -> Union[str, List[str]]:
"""
Search for relevant chunks using similarity search.
Args:
query: The search query
return_type: Return as string or list (default: str)
results: Number of results to return (default: 5)
Returns:
Results as string or list depending on return_type
"""
if vectorstore is None:
raise ValueError("Vectorstore not loaded. Please ensure FAISS index exists.")
search_results = vectorstore.similarity_search(query, k=results)
if return_type == str:
str_results = ""
for idx, result in enumerate(search_results, start=1):
str_results += f"Result {idx}:\n"
str_results += result.page_content + "\n"
str_results += "------\n"
return str_results
elif return_type == list:
return [result.page_content for result in search_results]
else:
raise ValueError("Invalid return_type. Use str or list.")
# Load questions from saved data
def load_qa_data():
"""Load the pre-generated questions and document chunks"""
try:
import os
# Get absolute paths to data files
base_dir = os.path.dirname(os.path.abspath(__file__))
chunks_path = os.path.join(base_dir, "saved_data", "chunks.pkl")
questions_path = os.path.join(base_dir, "saved_data", "questions.json")
print(f"Loading chunks from: {chunks_path}")
print(f"Loading questions from: {questions_path}")
# Load the chunks
with open(chunks_path, "rb") as f:
chunks = pickle.load(f)
# Load the questions
with open(questions_path, "r") as f:
questions = json.load(f)
print(
f"Successfully loaded {len(chunks)} chunks and {len(questions)} questions"
)
return chunks, questions
except Exception as e:
print(f"Error loading QA data: {e}")
import traceback
traceback.print_exc()
return None, None
# Load chunks and questions when module is imported
try:
chunks, questions = load_qa_data()
if chunks is None or questions is None:
print("Warning: Could not load QA data.")
except Exception as e:
print(f"Error initializing QA data: {e}")
chunks, questions = None, None
def get_question_answer(
idx: Optional[int] = None, return_both: bool = True
) -> Union[dict, str]:
"""
Get a question-answer pair either by index or randomly.
Args:
idx: Index of the question to retrieve (if None, selects random question)
return_both: Whether to return both question and answer (default: True)
Returns:
A dict with "question" and "answer" keys if return_both=True, otherwise just the question string
"""
if questions is None:
raise ValueError("Questions not loaded. Please ensure questions.json exists.")
if idx is None:
# Select a random question
qa_pair = random.choice(questions)
elif 0 <= idx < len(questions):
# Select question by index
qa_pair = questions[idx]
else:
raise ValueError(
f"Index out of range. Must be between 0 and {len(questions)-1}"
)
question = qa_pair["question"]
answer = qa_pair["answer"]
if return_both:
return {"question": question, "answer": answer}
else:
return question
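# Example usage (assumes saved_data/questions.json has been loaded):
#   qa = get_question_answer()          # random {"question": ..., "answer": ...}
#   q_only = get_question_answer(0, return_both=False)  # just the question text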
# Function to get the total number of questions
def get_question_count() -> int:
"""Get the total number of available questions"""
if questions is None:
raise ValueError("Questions not loaded. Please ensure questions.json exists.")
return len(questions)
def get_qa_dataset():
"""
Return a HuggingFace Dataset containing question and answer pairs.
This dataset is constructed from the loaded questions data (questions.json).
Each element in the dataset is a dictionary that includes at least:
- "question": The question text.
- "answer": The corresponding answer text.
Additional keys present in the original questions data will also be included.
Returns:
A HuggingFace Dataset object.
"""
if questions is None:
raise ValueError("Questions not loaded. Please ensure questions.json exists.")
qa_dataset = Dataset.from_list(questions)
full_dataset = qa_dataset.shuffle(seed=42)
train_dataset = full_dataset.train_test_split(test_size=0.1, seed=42)["train"]
test_dataset = full_dataset.train_test_split(test_size=0.1, seed=42)["test"]
# rename the "question" column to "prompt" in both splits
train_dataset = train_dataset.rename_column("question", "prompt")
test_dataset = test_dataset.rename_column("question", "prompt")
return train_dataset, test_dataset
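# Quick sanity check (columns after the rename are "prompt", "answer", plus any
# extra keys carried over from questions.json):
#   train_ds, test_ds = get_qa_dataset()
#   print(len(train_ds), len(test_ds), train_ds[0]["prompt"])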

@ -0,0 +1,200 @@
#!/usr/bin/env python3
"""
Simple command-line Q&A environment for testing with search functionality.
"""
import asyncio
import json
import random
import time
# Import our search module (ensure these functions follow the new interfaces)
from search_module import get_question_answer, get_question_count, search
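# NOTE: process_answer() below awaits a `verify(...)` coroutine and passes a
# `router` object; both are assumed to be provided by the project's verification
# utilities and are not imported in this file.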
class SimpleQAEnvironment:
"""Simple command-line environment for Q&A with search capability."""
def __init__(self):
self.score = {"correct": 0, "incorrect": 0, "total": 0}
self.session_data = []
self.current_question = None
def display_welcome(self):
"""Display welcome message and instructions."""
print("\n===== Search & Answer Environment =====")
print("Answer questions using the search tool to find relevant information.")
print("Type 'q' to quit, 'h' for help.\n")
def display_help(self):
"""Display help information."""
print("\n===== Commands =====")
print("n - Get a new question")
print("s <query> - Search for information (e.g., s program launch date)")
print("a <answer> - Submit your answer")
print("h - Display this help message")
print("q - Quit the program\n")
def display_question(self, question: str):
"""Display the current question."""
print("\n===== QUESTION =====")
print(question)
print("=====================\n")
def get_new_question(self) -> str:
"""Get a new random question and set it as current."""
total_questions = get_question_count()
question_id = random.randint(0, total_questions - 1)
# Updated to match new interface: get_question_answer now returns a dict.
qa = get_question_answer(question_id)
question = qa["question"]
correct_answer = qa["answer"]
question_data = {
"id": question_id,
"question": question,
"correct_answer": correct_answer,
"start_time": time.time(),
"searches": [],
}
self.current_question = question_data
return question
def perform_search(self, query: str):
"""Perform a search with the given query."""
if not query:
print("Please provide a search query.")
return
try:
print("\n===== SEARCH RESULTS =====")
results = search(query)
print(results)
print("==========================\n")
# Record search in current question data if available.
if self.current_question is not None:
self.current_question["searches"].append(query)
except Exception as e:
print(f"Error searching: {str(e)}")
async def process_answer(self, user_answer: str):
"""Process and verify the user's answer."""
if self.current_question is None:
print("Please get a question first.")
return
if not user_answer:
print("Please provide an answer.")
return
# Record answer and calculate time taken.
self.current_question["user_answer"] = user_answer
self.current_question["end_time"] = time.time()
self.current_question["time_taken"] = (
self.current_question["end_time"] - self.current_question["start_time"]
)
try:
print("\nVerifying your answer...")
correct = await verify(
user_answer,
self.current_question["question"],
self.current_question["correct_answer"],
router,
)
# Update score and inform the user.
self.score["total"] += 1
if correct:
self.score["correct"] += 1
print("\n✓ Your answer is CORRECT!")
else:
self.score["incorrect"] += 1
print("\n✗ Your answer is INCORRECT.")
print(
f"\nThe correct answer is:\n{self.current_question['correct_answer']}"
)
print(f"\nScore: {self.score['correct']}/{self.score['total']}")
# Record the result and add the current question to the session data.
self.current_question["is_correct"] = correct
self.session_data.append(self.current_question)
# Clear the current question.
self.current_question = None
except Exception as e:
print(f"Error verifying answer: {str(e)}")
def save_session(self):
"""Save the session data to a file."""
if not self.session_data:
return
timestamp = time.strftime("%Y%m%d_%H%M%S")
filename = f"qa_session_{timestamp}.json"
session_data = {
"timestamp": time.strftime("%Y-%m-%d %H:%M:%S"),
"score": self.score,
"questions": self.session_data,
}
try:
with open(filename, "w") as f:
json.dump(session_data, f, indent=2)
print(f"\nSession data saved to {filename}")
except Exception as e:
print(f"Error saving session data: {str(e)}")
async def run(self):
"""Run the main command loop."""
self.display_welcome()
while True:
command = input("\n> ").strip()
if not command:
continue
# Process commands.
if command.lower() == "q":
break
elif command.lower() == "h":
self.display_help()
elif command.lower() == "n":
question = self.get_new_question()
self.display_question(question)
elif command.lower().startswith("s "):
query = command[2:].strip()
self.perform_search(query)
elif command.lower().startswith("a "):
answer = command[2:].strip()
await self.process_answer(answer)
else:
print("Unknown command. Type 'h' for help.")
# Save session data on exit.
self.save_session()
print("\nThank you for using the Q&A environment!")
async def main():
"""Main function to start the application."""
env = SimpleQAEnvironment()
await env.run()
if __name__ == "__main__":
try:
asyncio.run(main())
except KeyboardInterrupt:
print("\nProgram terminated by user.")
except Exception as e:
print(f"\nError: {str(e)}")

@ -0,0 +1,189 @@
# %%
from unsloth import FastLanguageModel
# %%
from unsloth import is_bfloat16_supported
import torch
max_seq_length = 4096 * 2 # Can increase for longer reasoning traces
lora_rank = 64 # Larger rank = smarter, but slower
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="meta-llama/meta-Llama-3.1-8B-Instruct",
max_seq_length=max_seq_length,
load_in_4bit=True, # False for LoRA 16bit
fast_inference=True, # Enable vLLM fast inference
max_lora_rank=lora_rank,
gpu_memory_utilization=0.6, # Reduce if out of memory
)
model = FastLanguageModel.get_peft_model(
model,
r=lora_rank, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
target_modules=[
"q_proj",
"k_proj",
"v_proj",
"o_proj",
"gate_proj",
"up_proj",
"down_proj",
], # Remove QKVO if out of memory
lora_alpha=lora_rank,
use_gradient_checkpointing="unsloth", # Enable long context finetuning
random_state=3407,
)
# %%
import re
from datasets import load_dataset, Dataset
from search_module import search, get_question_answer, get_question_count
from rl_helpers import get_qa_dataset
train_dataset, test_dataset = get_qa_dataset()
# %% [markdown]
# <a name="Train"></a>
# ### Train the model
#
# Now set up GRPO Trainer and all configurations!
# %%
import os
os.environ["WANDB_PROJECT"] = "bootstrap-search-rl"
# %%
# from UnslothGRPOTrainerTemp import UnslothGRPOConfig, _UnslothGRPOTrainer
import UnslothGRPOTrainerTemp
training_args = UnslothGRPOTrainerTemp.UnslothGRPOConfig(
use_vllm=True, # use vLLM for fast inference!
use_agentic_generate=True, # use agentic generation
learning_rate=5e-6,
adam_beta1=0.9,
adam_beta2=0.99,
weight_decay=0.1,
warmup_ratio=0.1,
lr_scheduler_type="cosine",
optim="paged_adamw_8bit",
logging_steps=1,
bf16=is_bfloat16_supported(),
fp16=not is_bfloat16_supported(),
per_device_train_batch_size=8,
gradient_accumulation_steps=1, # Increase to 4 for smoother training
num_generations=8, # Decrease if out of memory
max_prompt_length=1024,
max_completion_length=1024,
# num_train_epochs = 1, # Set to 1 for a full training run
max_steps=101,
save_steps=50,
max_grad_norm=0.1,
report_to="none", # Can use Weights & Biases
output_dir="full_local_training",
)
# %%
import rl_helpers
# importlib.reload(rl_helpers)
def agentic_generate(
prompts: list[str],
generate_fn,
max_generations: int = 6,
):
return run_agent(generate_fn, tokenizer, prompts, max_generations)
model.agentic_generate = agentic_generate
from vllm import SamplingParams
verifier_sampling_params = SamplingParams(
temperature=0.1,
top_p=0.95,
max_tokens=4096,
)
def verifier_generate_fn(inputs):
return model.fast_generate(
inputs,
sampling_params=verifier_sampling_params,
)
run_agent = rl_helpers.run_agent
reward_correctness = rl_helpers.build_reward_correctness_fn(
verifier_generate_fn,
tokenizer,
)
reward_formatting = rl_helpers.reward_formatting
import UnslothGRPOTrainerTemp
trainer = UnslothGRPOTrainerTemp.UnslothGRPOTrainer(
model=model,
processing_class=tokenizer,
reward_funcs=[
reward_correctness,
reward_formatting,
],
args=training_args,
train_dataset=train_dataset,
)
# %%
trainer.train()
# %% [markdown]
# <a name="Inference"></a>
# ### Inference
# Now let's benchmark the model we trained!
# %%
from vllm import SamplingParams
import rl_helpers
sampling_params = SamplingParams(
temperature=0.5,
top_p=0.95,
max_tokens=4096,
)
def eval_generate_fn(inputs):
return model.fast_generate(
inputs,
sampling_params=sampling_params,
lora_request=model.load_lora(
"full_local_training/checkpoint-101"
), # load the trained LoRA
)
rl_helpers.run_eval(
generate_fn=eval_generate_fn,
verify_fn=reward_correctness,
tokenizer=tokenizer,
)
# %%
# eval w/o lora
def eval_generate_fn(inputs):
return model.fast_generate(
inputs,
sampling_params=sampling_params,
)
rl_helpers.run_eval(
generate_fn=eval_generate_fn,
verify_fn=reward_correctness,
tokenizer=tokenizer,
)

@ -1,26 +0,0 @@
import sys,os,random,time
def f(x):return x*2 if x%2==0 else x+1
class C:
def __init__(self,v):self.v=v
def p(self):print("Value:",self.v)
def m(l):return [f(x) for x in l]
x=[random.randint(1,100) for _ in range(10)]
print("Original:",x)
print("Processed:",m(x))
for i in range(len(x)):
if i%2==0:
x[i]*=2
elif i%3==0:
x[i]+=3
else:
x[i]-=1
c=C(sum(x))
c.p()
try:
for i in range(5):print(i,x[i],f(x[i]))
except:pass
with open("temp.txt","w") as f:f.write("Hello, world!")
while True:
time.sleep(0.1)
if random.random()>0.9:break
print("Done!")