# Worklog
## Backlog
- @thinhlpg transfers the project to @bachvudinh
- Modify `generate_dataset.py` (ONLY AFTER the simple training and benchmark works):
    - Optimize speed (different LLM models, APIs, tools, etc.)
    - Optimize quality. As a dataset maker, I want to switch from Llama 3.1 8B to an API call, like Claude, Gemini, or OpenAI (see the sketch below). The original authors used Llama 3.1 8B for the Self-Bootstrapping demonstration, but the dataset quality is low, for sure.
    - Experiment with different chunking strategies
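A minimal sketch of what the API swap could look like, assuming the `openai` Python client (Claude or Gemini would look similar); the helper name, model, and prompt are illustrative, not the repo's actual code:

```python
# Hypothetical sketch of replacing local Llama 3.1 8B generation with an API
# call inside generate_dataset.py. The model name, prompt, and helper name
# are placeholders, not repo code.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_question(chunk: str) -> str:
    """Ask a stronger API model to write one question answerable from `chunk`."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; could point at Claude or Gemini instead
        messages=[
            {"role": "system",
             "content": "Write one factual question answerable from the given text."},
            {"role": "user", "content": chunk},
        ],
        temperature=0.7,
    )
    return response.choices[0].message.content
```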
- `search-backends.md` design (for more dataset noise; ONLY AFTER the simple training dataset works). A rough sketch of the noise idea follows.
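Purely as a hypothetical illustration (the interface and names below are not taken from `search-backends.md`): a pluggable backend protocol plus a wrapper that mixes distractor chunks into results.

```python
# Hypothetical sketch: inject noise into search results so generated training
# data is not always grounded in a clean retriever. Names are illustrative only.
import random
from typing import Protocol

class SearchBackend(Protocol):
    def search(self, query: str, k: int = 5) -> list[str]: ...

class NoisySearch:
    """Wrap any backend and replace each hit with a distractor with probability p."""

    def __init__(self, backend: SearchBackend, distractors: list[str], p: float = 0.3):
        self.backend = backend
        self.distractors = distractors
        self.p = p

    def search(self, query: str, k: int = 5) -> list[str]:
        hits = self.backend.search(query, k)
        return [random.choice(self.distractors) if random.random() < self.p else hit
                for hit in hits]
```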
- Research a bit on Agentic Reward Modeling (for designing a better reward function, maybe? A toy sketch of the core idea follows this list.)
    - https://medium.com/@techsachin/agentic-reward-modeling-combine-human-preferences-with-verifiable-correctness-signals-for-reliable-76c408b3491c
    - https://arxiv.org/pdf/2502.19328
    - https://github.com/THU-KEG/Agentic-Reward-Modeling
    - https://www.themoonlight.io/en/review/agentic-reward-modeling-integrating-human-preferences-with-verifiable-correctness-signals-for-reliable-reward-systems
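The one-line takeaway from those links is to combine a verifiable correctness signal with a preference score instead of relying on either alone. A toy sketch of that idea (the weights and names are made up, not from the paper):

```python
# Toy sketch of the Agentic Reward Modeling idea: mix a verifiable
# correctness check with a preference score. Weights are arbitrary here.
def combined_reward(answer: str, gold: str, preference_score: float,
                    w_correct: float = 0.7, w_pref: float = 0.3) -> float:
    correctness = 1.0 if answer.strip().lower() == gold.strip().lower() else 0.0
    return w_correct * correctness + w_pref * preference_score
```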
Entries below use a `yymmdd` date heading followed by task descriptions.
## 250324
- Train the model v0
- Make the dataset v0
- Upload dataset v0 to HF Hub (see the sketch below)
    - Initial dataset from AutoDidact
    - Paraphrased dataset
- Make a simple Gradio demo app
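For the record, the upload itself is a couple of lines with the `datasets` library; the repo id and fields below are placeholders, not the actual dataset:

```python
# Minimal sketch of pushing dataset v0 to the HF Hub. Repo id and fields
# are placeholders; requires `huggingface-cli login` beforehand.
from datasets import Dataset

records = [{"question": "...", "answer": "...", "source": "autodidact"}]
ds = Dataset.from_list(records)
ds.push_to_hub("menloresearch/deepsearch-dataset-v0")
```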
## 250323
- brain.exe and back.exe refused to work 😭
## 250322
- Moved all the scattered and disorganized stuff I've been working on for the past week into this repo.
- Write proposal for DeepSearch
- `evaluation.md` design (list out the metrics and why)
- `dataset.md` design (pipeline, data structure, ...)
- `reward-functions.md` design (list out the functions and why)
- As a new member of the research team, I'm curious how we did GRPO with AlphaMaze, so that I can inherit the good stuff and improve the workflow!
    - AlphaMaze?
        - https://www.menlo.ai/blog/alpha-maze
        - https://arxiv.org/pdf/2502.14669
    - "Our training process involved two key stages: creating a specialized dataset and then using a combination of supervised fine-tuning (SFT) and reinforcement learning (RL) to train the model."
    - LLaMA-Factory for SFT (1.5B model, 6x A6000, 1.5 hours) and Unsloth for GRPO
    - 💡 Hmm, so for SFT we have 50% successful data and 50% retry data, and only successful data for GRPO. Can I apply this to DeepSearch as well? (Sketch below.) #HACK
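A quick sketch of what that split could look like for DeepSearch trajectories; the `success` field and the function are assumptions about our data format, not AlphaMaze's actual pipeline:

```python
# Sketch of the AlphaMaze-style stage split: 50% successful + 50% retry
# trajectories for SFT, successful-only for GRPO. Field names are assumed.
import random

def split_for_stages(trajectories: list[dict], seed: int = 42):
    rng = random.Random(seed)
    successes = [t for t in trajectories if t["success"]]
    retries = [t for t in trajectories if not t["success"]]
    n = min(len(successes), len(retries))
    sft_data = rng.sample(successes, n) + rng.sample(retries, n)  # 50/50 mix
    rng.shuffle(sft_data)
    grpo_data = successes  # GRPO trains only on successful trajectories
    return sft_data, grpo_data
```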
## 250321
- Inspect the code of AutoDidact in more detail: https://github.com/menloresearch/DeepSearch/issues/4
## 250320
- Research on GRPO https://github.com/menloresearch/DeepSearch/issues/2
## 250319
- Research on GRPO https://github.com/menloresearch/DeepSearch/issues/2
- Run the training script of AutoDidact
## 250318
- Idea received https://github.com/menloresearch/DeepSearch/issues/1