
# GRPO idea

- The training flow of R1 is really simple (thanks to my friend and professional yapper @vTuanpham for initially clearing up my dumbness 🤣)

```text
1. Train a base model that knows how to use tools with plain SFT first, to give it a boost
2. Then let it run free with GRPO: nearly correct syntax gets 0.5, correct syntax but params way off gets 0.65, both correct gets 0.85,...
```
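
A minimal sketch of how the reward tiers in step 2 could look as a function. Everything except the three scores is an assumption for illustration: the `<tool_call>` tag format, the JSON payload with `name` and `arguments`, the `tool_call_reward` name, the 0.0 floor for outputs with no tool call, and treating a wrong tool name the same as wrong params. "Params off" is simplified here to "not an exact match".

```python
import json
import re

# Hypothetical tool-call format (an assumption, not taken from the notes):
# <tool_call>{"name": "search", "arguments": {"query": "..."}}</tool_call>
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def tool_call_reward(completion: str, expected_name: str, expected_args: dict) -> float:
    """Score one completion using the 0.5 / 0.65 / 0.85 tiers from the notes."""
    match = TOOL_CALL_RE.search(completion)
    if match is None:
        # No recognizable tool call at all. The 0.0 floor is an assumption.
        return 0.0

    try:
        call = json.loads(match.group(1))
    except json.JSONDecodeError:
        # Tags are present but the payload is not valid JSON: nearly correct syntax.
        return 0.5

    well_formed = (
        isinstance(call, dict)
        and isinstance(call.get("name"), str)
        and isinstance(call.get("arguments"), dict)
    )
    if not well_formed:
        # Valid JSON but not the expected shape: still only nearly correct syntax.
        return 0.5

    if call["name"] != expected_name or call["arguments"] != expected_args:
        # Syntax is right, but the tool name or the params are off.
        return 0.65

    # Syntax and params are both right.
    return 0.85
```

Calling `tool_call_reward(model_output, "search", {"query": "grpo"})` on each sampled completion gives one float per sample, which is the shape of signal GRPO needs to compare samples within a group. The tiers beyond 0.85 are left open here, same as in the notes.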

## Unsloth's guide

- <https://unsloth.ai/blog/r1-reasoning>
- Heheboi let's steal this notebook <https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb>
- <https://docs.unsloth.ai/basics/reasoning-grpo-and-rl> - This is probably the simplest one

## Hugging Face's GRPO trainer

- <https://github.com/huggingface/trl/blob/main/docs/source/grpo_trainer.md>