Worklog

Backlog

  • @thinhlpg transfers the project to @bachvudinh

  • Modify generate_dataset.py (ONLY AFTER the simple training and benchmark work):

    • Optimize speed (different LLM models, APIs, tools, etc.)
    • Optimize quality. As a dataset maker, I want to switch from Llama 3.1 8B to API calls (Claude, Gemini, or OpenAI). The original work used 3.1 8B for the self-bootstrapping demonstration, but the resulting dataset quality is clearly low. (See the sketch after this list.)
    • Experiment with different chunking strategies
  • Design search-backends.md (to add more dataset noise; ONLY AFTER the simple training dataset works)

  • Train an SFT first stage, then GRPO (new idea from @tikikun, 250326)

    • I think this idea is already implemented in the Search-R1 repo; I'll double-check it later.
  • Implement quality-of-life scripts from brain-rotting-multiple-gpu-workflow-for-dummies.md

  • Better verification logic, please (it should be fixed across all experiments, not tied to the base model itself)
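
A minimal sketch of what the API-based generation in generate_dataset.py could look like, assuming the `openai` Python SDK; the prompt, model name, and `generate_qa_pair` helper are placeholders for illustration, not the actual script.

```python
# Hypothetical sketch: swap local Llama 3.1 8B generation for an API call.
# Assumes the `openai` SDK (>=1.0) and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

def generate_qa_pair(chunk: str, model: str = "gpt-4o-mini") -> str:
    """Ask the API to write one question-answer pair grounded in `chunk`."""
    response = client.chat.completions.create(
        model=model,  # placeholder; could be a Claude or Gemini client instead
        messages=[
            {"role": "system",
             "content": "Write one question and its answer based only on the given text."},
            {"role": "user", "content": chunk},
        ],
        temperature=0.7,
    )
    return response.choices[0].message.content

# Usage with a placeholder chunk:
# print(generate_qa_pair("Paris is the capital of France."))
```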

yymmdd

  • task description

250329

  • brain.exe and back.exe refused to work

250328

  • Watch Solo Leveling with bro @tikikun 🔥
  • Figuring out how to keep multiple experiments organized. The repos on the server are a mess 💀💀 (but at least they work for now)

250328 - D-Day

  • Show the results, or demo

250327

  • CLEAN THE REPO PLEASE IT'S A MESS 😭😭😭
    • Double-checked all scripts, they ran well :3
  • Write a script to train x-deepseek-r1-distil models (the original script only supports Llama-Instruct models)
  • Script to continue training from the last checkpoint (see the sketch after this list)
  • Make a simple demo app (or just a CLI inference script should be good)
  • Upload datasets to HF Hub
  • Research Agentic Reward Modeling a bit (maybe for designing better reward functions?): agentic-reward-modeling.md
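
A minimal sketch of the checkpoint-resume logic, assuming a Hugging Face Trainer-compatible trainer (e.g. a TRL/Unsloth GRPO trainer) that writes checkpoint-* folders into its output directory; the trainer and directory are placeholders supplied by the caller.

```python
# Hypothetical sketch: continue training from the newest checkpoint, if any.
from transformers import Trainer
from transformers.trainer_utils import get_last_checkpoint

def train_with_resume(trainer: Trainer, output_dir: str) -> None:
    """Resume from the latest checkpoint-* folder in output_dir, else start fresh."""
    last_checkpoint = get_last_checkpoint(output_dir)
    if last_checkpoint is not None:
        print(f"Resuming from {last_checkpoint}")
        trainer.train(resume_from_checkpoint=last_checkpoint)
    else:
        print("No checkpoint found, training from scratch")
        trainer.train()
```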

250326

  • Fix the exact-match reward function bug (a minimal sketch of an exact-match reward follows this list)
  • Enhance the training script with better logging and monitoring
  • Train new models
  • Write new eval script
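
For reference, a minimal sketch of an exact-match reward in the TRL/Unsloth GRPO style (completions in, a list of floats out); the `<answer>` tag convention and the function names are assumptions, not the actual function from reward-functions.md or the specific bug that was fixed.

```python
# Hypothetical sketch of an exact-match reward, not the repo's actual function.
# Assumes chat-style completions and plain-string reference answers.
import re

def extract_answer(text: str) -> str:
    """Pull the text inside <answer>...</answer> tags, falling back to the full text."""
    match = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return (match.group(1) if match else text).strip().lower()

def exact_match_reward(completions, answer, **kwargs) -> list[float]:
    """1.0 when the extracted answer matches the reference exactly, else 0.0."""
    responses = [completion[0]["content"] for completion in completions]
    return [
        1.0 if extract_answer(response) == reference.strip().lower() else 0.0
        for response, reference in zip(responses, answer)
    ]
```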

250325

  • Read Search-R1 to get more ideas on how to improve the reward functions (pretty similar idea, I suppose)
  • Update the new reward functions in reward-functions.md
  • Train model v0 (with the new data and reward functions; might take another 2 hours)
    • Spoiler: it's not good

250324

  • Make the dataset v0
  • Train with new data and default reward functions (it took 2 hours on 1xA6000 😭)
    • Got poor results (accuracy dropped from 50% to 35%) 📉

250323

  • brain.exe and back.exe refused to work 😭

250322

  • Moving all the scattered and disorganized stuff I've been working on for the past week into this repo.
  • Write proposal for DeepSearch
  • As a new member of the research team, I'm curious how we did GRPO with AlphaMaze, so that I can inherit the good stuff and improve the workflow!!!
    • Alphamaze?
    • https://www.menlo.ai/blog/alpha-maze
    • https://arxiv.org/pdf/2502.14669
    • Our training process involved two key stages: creating a specialized dataset and then using a combination of supervised fine-tuning (SFT) and reinforcement learning (RL) to train the model.

    • LLaMA-Factory for SFT (1.5B model, 6xA6000, 1.5 hours) and Unsloth for GRPO
    • 💡 Hmm, so for SFT we have 50% successful data and 50% retry data, and all-successful data for GRPO. Can I apply this to DeepSearch as well? #HACK (rough sketch below)
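
A rough sketch of how that AlphaMaze-style split could be wired up for DeepSearch, assuming the traces are already labeled with a success flag; the field name, ratio, and `build_splits` function are placeholders for the idea, not an existing pipeline.

```python
# Hypothetical sketch of the 50/50 SFT mix and successful-only GRPO data.
# Assumes each trace dict carries a boolean "success" flag.
import random

def build_splits(traces: list[dict], seed: int = 42) -> tuple[list[dict], list[dict]]:
    """SFT mix: 50% successful + 50% retry traces. GRPO: successful traces only."""
    rng = random.Random(seed)
    successful = [t for t in traces if t["success"]]
    retries = [t for t in traces if not t["success"]]

    n = min(len(successful), len(retries))  # balance the two halves
    sft_data = rng.sample(successful, n) + rng.sample(retries, n)
    rng.shuffle(sft_data)

    grpo_data = successful  # the GRPO stage sees only successful traces
    return sft_data, grpo_data
```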

250321

250320

250319

250318

Graveyard 💀