diff --git a/README.md b/README.md
index a5f3d2d..fd0adbc 100644
--- a/README.md
+++ b/README.md
@@ -6,7 +6,7 @@
 ReZero trains a small language model to develop effective search behaviors instead of memorizing static data. It interacts with multiple synthetic search engines, each with unique retrieval mechanisms, to refine queries and persist in searching until it finds exact answers. The project focuses on reinforcement learning, preventing overfitting, and optimizing for efficiency in real-world search applications.
 
-[**Quick Demo**](#quick-demo-) | [**Setup**](#setup-️) | [**Data and Training**](#data-and-training-) | [**Models**](#models-) | [**References**](#references-) | [**Acknowledgements**](#acknowledgements-)
+[**Quick Demo**](#quick-demo-) | [**Setup**](#setup-️) | [**Data and Training**](#data-and-training-) | [**Models**](#models-) | [**Experiments**](#experiments-) | [**References**](#references-) | [**Acknowledgements**](#acknowledgements-)
 
@@ -68,6 +68,14 @@ You can find our models on Hugging Face 🤗! We're committed to open-source and
 |-------|----------|------|------|
 | ReZero-v0.1 | Llama-3.2-3B | 3B | [🤗 Menlo/ReZero-v0.1-llama-3.2-3b-it-grpo-250404](https://huggingface.co/Menlo/ReZero-v0.1-llama-3.2-3b-it-grpo-250404) |
 
+## Experiments 🧪
+
+| Run ID | Model Config | Dataset | Steps | Hardware | TensorBoard | Description |
+|--------|--------------|---------|-------|----------|-------------|-------------|
+| exp-01 | [Llama-3.2-3b-instruct](https://huggingface.co/janhq/250404-llama-3.2-3b-instruct-grpo-01) | Apollo Mission Report | 300 | ~2 hours on 1xH200 | [📊](https://huggingface.co/janhq/250404-llama-3.2-3b-instruct-grpo-01/tensorboard) | Added reward_search_strategy and reward_search_quality. Reward weights: [4.0, 2.0, 1.0, 1.0, 1.0, 1.0]. Loss crashed after step 400. Best accuracy: 31.25% at step 400. Max agent turns: 10. |
+| exp-02 | [Llama-3.2-3b-instruct](https://huggingface.co/janhq/250404-llama-3.2-3b-instruct-grpo-02) | Apollo Mission Report | 1000 | ~7 hours on 1xH200 | [📊](https://huggingface.co/janhq/250404-llama-3.2-3b-instruct-grpo-02/tensorboard) | Improved reward_retry logic to only reward search when answers found. Increased max agent turns to 20. Reward weights: [4.0, 2.0, 1.0, 1.0, 1.0, 1.0]. Best accuracy: 46.88% at step 250. Higher early reward_correctness (~0.6 vs 0.4-0.5). Loss stable but reward crashed after step 350. |
+| exp-03 | [Llama-3.2-3b-instruct](https://huggingface.co/janhq/250409-llama-3.2-3b-instruct-grpo-01-no-retry) | Apollo Mission Report | 1000 | ~7 hours on 1xH200 | [📊](https://huggingface.co/janhq/250409-llama-3.2-3b-instruct-grpo-01-no-retry/tensorboard) | Same as exp-02 but without the retry reward function. |
+
 ## References 📖
 
 ## Acknowledgements 🤝