# GRPO idea

- The training flow of R1 is really simple (thanks to my friend, professional yapper @vTuanpham, for initially clearing up my dumbness 🤣).

```python
# 1. Train a base model that already knows how to use tools, with plain SFT, to give it a head start.
# Tuan
# 2. Then let it run free with GRPO. Reward tiers: nearly correct syntax gets 0.5,
#    correct syntax but params too far off gets 0.65, both correct gets 0.85,...
```
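
The reward tiers in step 2 can be sketched as a plain Python function. This is only a rough sketch, not the actual reward used by anyone: the JSON tool-call shape (`name` plus `arguments`), the 0.0 floor, the "looks like a call but does not parse" test for "nearly correct syntax", and the names `score_tool_call`, `tool_call_reward` and `expected_calls` are all my assumptions. The wrapper takes `completions` plus keyword arguments and returns a list of floats, which is how I understand custom reward functions are passed to TRL's `GRPOTrainer`.

```python
import json


def score_tool_call(completion: str, expected: dict) -> float:
    """Tiered reward for one completion, using the tiers from the notes above."""
    text = completion.strip()
    try:
        call = json.loads(text)
    except json.JSONDecodeError:
        # Assumption: "nearly correct syntax" means it looks like a call
        # (opens a JSON object and mentions "name") but does not parse.
        looks_like_call = text.startswith("{") and '"name"' in text
        return 0.5 if looks_like_call else 0.0

    # Assumption: a syntactically correct call is {"name": str, "arguments": dict}.
    well_formed = (
        isinstance(call, dict)
        and isinstance(call.get("name"), str)
        and isinstance(call.get("arguments"), dict)
    )
    if not well_formed:
        # Parses as JSON but is missing pieces: treat as "nearly correct" at best.
        return 0.5 if isinstance(call, dict) and "name" in call else 0.0

    if call["name"] == expected["name"] and call["arguments"] == expected["arguments"]:
        return 0.85  # syntax and params both correct
    return 0.65  # syntax correct, params (or tool name) off


def tool_call_reward(completions, expected_calls, **kwargs):
    """Batch wrapper: one float per completion (expected_calls is a hypothetical column)."""
    return [score_tool_call(c, e) for c, e in zip(completions, expected_calls)]


expected = {"name": "get_weather", "arguments": {"city": "Hanoi"}}
samples = [
    '{"name": "get_weather", "arguments": {"city": "Hanoi"}}',
    '{"name": "get_weather", "arguments": {"city": "Hue"}}',
    '{"name": "get_weather", "arguments": {"city": "Hanoi"',
    "I do not know",
]
print(tool_call_reward(samples, [expected] * len(samples)))  # [0.85, 0.65, 0.5, 0.0]
```

The "too far off" part of the 0.65 tier is collapsed here into a simple equality check; a graded params distance would be the obvious next step.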

## Unsloth's guide

- <https://unsloth.ai/blog/r1-reasoning>
- Heheboi, let's steal this notebook: <https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb>
- <https://docs.unsloth.ai/basics/reasoning-grpo-and-rl> - This is pretty much the simplest one

## Hugging Face's GRPO trainer

- <https://github.com/huggingface/trl/blob/main/docs/source/grpo_trainer.md>