Worklog

Backlog

Modify generate_dataset.py (ONLY AFTER the simple training and benchmark work):
- As a dataset maker, I want to switch from Llama 3.1 8B to an API call (e.g. Claude, Gemini, or OpenAI). The original authors used Llama 3.1 8B for the self-bootstrapping demonstration, but the resulting dataset quality is low.
- Experiment with different chunking strategies
search-backends.md design, to add more noise to the dataset (ONLY AFTER the simple training dataset works)
Research a little bit on Agentic Reward Modeling (for designing better reward function maybe?)
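
For the chunking experiments listed in the backlog, a minimal sketch of two strategies that could be compared is shown here. The function names `chunk_fixed` and `chunk_paragraphs` are hypothetical and are not taken from generate_dataset.py; sizes are in characters purely for simplicity, and a token-based variant may suit the real pipeline better.

```python
def chunk_fixed(text: str, size: int, overlap: int = 0) -> list[str]:
    """Split text into character windows of `size`; consecutive windows share `overlap` characters."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        # Stop once the current window reaches the end of the text,
        # so no trailing chunk is fully contained in the previous one.
        if start + size >= len(text):
            break
        start += step
    return chunks


def chunk_paragraphs(text: str, max_chars: int) -> list[str]:
    """Greedily pack blank-line-separated paragraphs into chunks of at most `max_chars`.

    A single paragraph longer than `max_chars` is kept whole as its own chunk.
    """
    chunks = []
    current = ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks
```

Fixed-size windows are simple and predictable but can cut sentences in half; paragraph packing keeps natural boundaries at the cost of uneven chunk lengths. Which one helps retrieval quality here is an open question for the experiment.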

yymmdd

task description

250324

@thinhlpg transfers the project to @bachvudinh

250323

Train the model
Make the dataset
Upload datasets to HF Hub
- Initial dataset from AutoDidact
- Paraphrased dataset
Make a simple gradio demo app

250322

Move all the scattered, disorganized stuff I've been working on for the past week into this repo.
Write proposal for DeepSearch
- evaluation.md design (list out the metrics and why)
- dataset.md design (pipeline, data structure,...)
- reward-functions.md design (list out the functions and why)
As a new member of the research team, I'm curious how we did GRPO with AlphaMaze, so that I can inherit the good parts and improve the workflow.
- AlphaMaze?
- https://www.menlo.ai/blog/alpha-maze
- https://arxiv.org/pdf/2502.14669
- Our training process involved two key stages: creating a specialized dataset and then using a combination of supervised fine-tuning (SFT) and reinforcement learning (RL) to train the model.
- LLaMA-Factory for SFT (1.5B model, 6x A6000, 1.5 hours) and Unsloth for GRPO
- 💡 Hmm, so for SFT we have 50% successful data and 50% retry data, and fully successful data for GRPO. Can I apply this to DeepSearch as well? #HACK
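
To make the data-mix idea from the note above concrete, here is a rough sketch of how such a split could be assembled. It follows this worklog's reading of the AlphaMaze recipe (half successful and half retry samples for SFT, successful samples only for GRPO); the function names, the `success_ratio` parameter, and the list-of-dicts sample format are all assumptions made for illustration, not the AlphaMaze or DeepSearch code.

```python
import random


def build_sft_mix(successful, retry, n_total, success_ratio=0.5, seed=0):
    """Sample `n_total` SFT examples, `success_ratio` of them from `successful`
    and the rest from `retry`, then shuffle. Raises if either pool is too small."""
    n_success = int(n_total * success_ratio)
    n_retry = n_total - n_success
    if n_success > len(successful) or n_retry > len(retry):
        raise ValueError("not enough samples to build the requested mix")
    rng = random.Random(seed)  # fixed seed so the mix is reproducible
    mix = rng.sample(successful, n_success) + rng.sample(retry, n_retry)
    rng.shuffle(mix)
    return mix


def build_grpo_set(successful):
    """GRPO set under the note's assumption: successful samples only."""
    return list(successful)


if __name__ == "__main__":
    # Toy data standing in for real trajectories (hypothetical format).
    successful = [{"id": i, "kind": "success"} for i in range(10)]
    retry = [{"id": i, "kind": "retry"} for i in range(10)]
    sft = build_sft_mix(successful, retry, n_total=8)
    print(sum(s["kind"] == "success" for s in sft), "successful of", len(sft))
```

Keeping the ratio as a parameter makes it easy to test whether 50/50 is actually the right balance for DeepSearch rather than assuming it carries over.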

250321

Inspect the AutoDidact code in more detail https://github.com/menloresearch/DeepSearch/issues/4

250320

Research on GRPO https://github.com/menloresearch/DeepSearch/issues/2

250319

Research on GRPO https://github.com/menloresearch/DeepSearch/issues/2
Run the training script of AutoDidact

250318

Idea received https://github.com/menloresearch/DeepSearch/issues/1
