# Worklog

## Backlog

- [ ] @thinhlpg transfers the project to @bachvudinh
- [ ] Modify `generate_dataset.py` (**ONLY AFTER** the simple training and benchmark work):
  - [ ] As a dataset maker, I want to switch from Llama 3.1 8B to API calls (Claude, Gemini, or OpenAI). The original work used 3.1 8B to demonstrate `Self-Bootstrapping`, but the resulting dataset quality is low, for sure.
  - [ ] Experiment with different chunking strategies
- [ ] [search-backends.md](search-backends.md) design (for more dataset noise; **ONLY AFTER** the simple training dataset works)
- [ ] Research Agentic Reward Modeling a bit (maybe for designing a better reward function?)

---

## yymmdd

- [ ] task description

## 250324

- [ ] Train the model v0
- [ ] Make the dataset v0
  - [ ] Upload dataset v0 to HF Hub
    - Initial dataset from AutoDidact
    - Paraphrased dataset
- [ ] Make a simple Gradio demo app

## 250323

- brain.exe and back.exe refused to work 😭

## 250322

- [x] Move all the scattered, disorganized stuff I've been working on for the past week into this repo.
- [x] Write proposal for DeepSearch
  - [x] [evaluation.md](evaluation.md) design (list out the metrics and why)
  - [x] [dataset.md](dataset.md) design (pipeline, data structure, ...)
  - [x] [reward-functions.md](reward-functions.md) design (list out the functions and why)
- [x] As a new member of the research team, I'm curious how we did GRPO with [AlphaMaze](https://github.com/menloresearch/visual-thinker), so that I can inherit the good stuff and improve the workflow!
  > Our training process involved two key stages: creating a specialized dataset and then using a combination of supervised fine-tuning (SFT) and reinforcement learning (RL) to train the model.
  - LLaMA-Factory for SFT **(1.5B, 6xA6000, 1.5 hours)** and Unsloth for GRPO
  - 💡 Hmm, so for SFT they used 50% successful data and 50% retry data, and fully successful data for GRPO. Can I apply this to DeepSearch as well? (See the sketch at the end of this log.) #HACK

## 250321

- [x] Inspect the code of AutoDidact in more detail

## 250320

- Research on GRPO

## 250319

- Research on GRPO
- Run the training script of AutoDidact

## 250318

- Idea received
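
---

💡 A minimal sketch of the 250322 data-mix idea (50% successful / 50% retry for SFT, fully successful data for GRPO), assuming trajectories are already split into two pools; all function and variable names here are hypothetical, not from the AlphaMaze or AutoDidact codebases.

```python
import random

def build_sft_mix(successful: list, retry: list, seed: int = 42) -> list:
    """Build a 50/50 SFT mixture of successful and retry trajectories."""
    rng = random.Random(seed)
    # Cap at the smaller pool so the split stays a true 50/50.
    n = min(len(successful), len(retry))
    mix = rng.sample(successful, n) + rng.sample(retry, n)
    rng.shuffle(mix)
    return mix

def build_grpo_pool(successful: list) -> list:
    """GRPO stage uses only fully successful trajectories, per the note above."""
    return list(successful)
```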