# Evaluation - **Goal**: - 1. Better performance than the original one (by auto eval script) - 2. Better performance by real human eval/preference ## Implementation Phases - [x] 1. Just take the eval function from the original repo (it simply uses accuracy (exact match)) and simple quick glance on the output quality. - [ ] 2. Find some more common and conventional dataset and benchmarks (still auto script) - [ ] 3. Setup human eval ## Baseline - Info from autodidact - After just 100 steps of GRPO training (1 hour on a single RTX 4090 GPU), Llama-8B significantly improved its ability to research and answer questions from the Apollo 13 mission report - On a validation set of 68 questions, accuracy more than doubled from 23% to 59%. - Training log: idk why but the result that I got from acutally running the training is a bit lower. ```bash ceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor). completion_ids = [torch.tensor(ids, device=device) for ids in completion_ids] Processed prompts: 100%|████████████████| 16/16 [00:00<00:00, 39.27it/s, est. speed input: 6827.13 toks/s, output: 81.01 toks/s] rewards_per_func: tensor([0.6875, 0.7000], device='cuda:0'):05, 2.55it/s, est. speed input: 385.80 toks/s, output: 5.11 toks/s] {'loss': 0.0003, 'grad_norm': 0.5810762047767639, 'learning_rate': 0.0, 'rewards/reward_correctness': 0.6875, 'rewards/reward_formatting': 0.699999988079071, 'reward': 1.3875000476837158, 'reward_std': 0.44403791427612305, 'completion_length': 224.125, 'kl': 0.00834659393876791, 'epoch': 0.34} {'train_runtime': 7992.2854, 'train_samples_per_second': 0.202, 'train_steps_per_second': 0.013, 'train_loss': 0.0005197484556535774, 'epoch': 0.34} 100%|███████████████████████████████████████████████████████████████████████████████████████| 101/101 [2:13:12<00:00, 79.13s/it] Processed prompts: 100%|████████████████| 67/67 [00:20<00:00, 3.28it/s, est. speed input: 950.44 toks/s, output: 394.51 toks/s] Processed prompts: 100%|███████████████| 66/66 [00:20<00:00, 3.15it/s, est. speed input: 2383.55 toks/s, output: 323.82 toks/s] Processed prompts: 100%|███████████████| 20/20 [00:17<00:00, 1.13it/s, est. speed input: 1320.49 toks/s, output: 146.76 toks/s] Processed prompts: 100%|████████████████| 17/17 [00:16<00:00, 1.04it/s, est. speed input: 1620.28 toks/s, output: 98.35 toks/s] Processed prompts: 100%|██████████████████| 9/9 [00:15<00:00, 1.73s/it, est. speed input: 1165.77 toks/s, output: 71.38 toks/s] Processed prompts: 100%|████████████████| 67/67 [00:04<00:00, 16.31it/s, est. speed input: 3617.28 toks/s, output: 61.11 toks/s] RESULTS: percentage of correct answers: 0.5074626865671642 ============================== Processed prompts: 100%|███████████████| 67/67 [00:15<00:00, 4.46it/s, est. speed input: 1292.29 toks/s, output: 561.32 toks/s] Processed prompts: 100%|███████████████| 44/44 [00:18<00:00, 2.44it/s, est. speed input: 1800.84 toks/s, output: 244.13 toks/s] Processed prompts: 100%|███████████████| 13/13 [00:12<00:00, 1.05it/s, est. speed input: 1209.04 toks/s, output: 126.32 toks/s] Processed prompts: 100%|███████████████| 10/10 [00:13<00:00, 1.32s/it, est. speed input: 1225.46 toks/s, output: 109.78 toks/s] Processed prompts: 100%|██████████████████| 7/7 [00:12<00:00, 1.86s/it, est. speed input: 1149.18 toks/s, output: 76.05 toks/s] Processed prompts: 100%|████████████████| 67/67 [00:02<00:00, 31.53it/s, est. speed input: 6047.70 toks/s, output: 83.31 toks/s] RESULTS: percentage of correct answers: 0.19402985074626866 ============================== [rank0]:[W320 07:13:50.651270455 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator()) ``` - Training log with paraphrased dataset (no new reward function yet!) - Disappointing results ```bash .2587745785713196, 'completion_length': 374.3125, 'kl': 0.004571444820612669, 'epoch': 0.34} {'train_runtime': 7419.1437, 'train_samples_per_second': 0.218, 'train_steps_per_second': 0.014, 'train_loss': 0.00037626780881639505, 'epoch': 0.34} 100%|████████████████████████████████████████████████████████| 101/101 [2:03:39<00:00, 73.46s/it] Processed prompts: 100%|█| 67/67 [00:19<00:00, 3.51it/s, est. speed input: 1016.34 toks/s, outpu Processed prompts: 100%|█| 66/66 [00:21<00:00, 3.03it/s, est. speed input: 2086.78 toks/s, outpu Processed prompts: 100%|█| 19/19 [00:14<00:00, 1.28it/s, est. speed input: 1326.10 toks/s, outpu Processed prompts: 100%|█| 14/14 [00:14<00:00, 1.03s/it, est. speed input: 1363.04 toks/s, outpu Processed prompts: 100%|█| 9/9 [00:13<00:00, 1.55s/it, est. speed input: 1153.10 toks/s, output: Processed prompts: 100%|█| 67/67 [00:02<00:00, 28.46it/s, est. speed input: 5843.91 toks/s, outpu RESULTS: percentage of correct answers: 0.3582089552238806 ============================== Processed prompts: 100%|█| 67/67 [00:20<00:00, 3.20it/s, est. speed input: 925.56 toks/s, output Processed prompts: 100%|█| 36/36 [00:13<00:00, 2.63it/s, est. speed input: 1755.08 toks/s, outpu Processed prompts: 100%|█| 11/11 [00:09<00:00, 1.19it/s, est. speed input: 1254.10 toks/s, outpu Processed prompts: 100%|█| 8/8 [00:09<00:00, 1.15s/it, est. speed input: 1192.77 toks/s, output: Processed prompts: 100%|█| 4/4 [00:06<00:00, 1.67s/it, est. speed input: 1063.38 toks/s, output: Processed prompts: 100%|█| 67/67 [00:02<00:00, 29.78it/s, est. speed input: 5244.11 toks/s, outpu RESULTS: percentage of correct answers: 0.2835820895522388 ============================== [rank0]:[W324 11:21:27.955684565 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator()) ``` ## Getting some sense of the eval data or benchmark - > For example, benchmarks like ARC-AGI, which involve visual reasoning, remain challenging for these models, even though they might seem straightforward to a human. (ichigo) - LLama3 1B on my local machine, with new retry reward function ``` torch.tensor(ids, device=device) for ids in completion_ids Processed prompts: 100%|█████| 8/8 [00:03<00:00, 2.33it/s, est. speed input: 611.91 toks/s, output: 285.83 toks/s] rewards_per_func: tensor([0.1250, 0.1750, 0.0225], device='cuda:0')eed input: 611.91 toks/s, output: 285.83 toks/s] {'loss': 0.0001, 'grad_norm': 0.5529439449310303, 'learning_rate': 0.0, 'rewards/reward_correctness': 0.125, 'rewards/reward_formatting': 0.17499999701976776, 'rewards/reward_retry_behavior': 0.02252296172082424, 'reward': 0.32252296805381775, 'reward_std': 0.6055484414100647, 'completion_length': 333.125, 'kl': 0.002497631125152111, 'epoch': 0.17} {'train_runtime': 2145.4442, 'train_samples_per_second': 0.377, 'train_steps_per_second': 0.047, 'train_loss': 7.476110755337125e-05, 'epoch': 0.17} 100%|████████████████████████████████████████████████████████████████████████████| 101/101 [35:45<00:00, 21.24s/it] Processed prompts: 100%|██████████████████| 67/67 [01:27<00:00, 1.30s/it, est. speed input: 221.09 toks/s, output: 446.71 toks/s] Processed prompts: 20%|▏| 2/10 [00:02<00:11, 1.45s/it, est. speed input: 713.29 toks/s, output: 4Processed prompts: 100%|███████████████| 10/10 [00:06<00:00, 1.51it/s, est. speed input: 1464.06 toks/s, output: 255.67 toks/s] Processed prompts: 100%|██████████████████| 1/1 [00:00<00:00, 2.20it/s, est. speed input: 3494.66 toks/s, output: 59.45 toks/s] Processed prompts: 100%|███████████████| 67/67 [00:03<00:00, 18.46it/s, est. speed input: 5495.01 toks/s, output: 154.33 toks/s] RESULTS: percentage of correct answers: 0.3283582089552239 ============================== Processed prompts: 100%|████████████████| 67/67 [01:20<00:00, 1.21s/it, est. speed input: 238.22 toks/s, output: 529.58 toks/s] Processed prompts: 100%|███████████████| 14/14 [00:06<00:00, 2.20it/s, est. speed input: 2025.55 toks/s, output: 315.92 toks/s] Processed prompts: 100%|█████████████████| 3/3 [00:00<00:00, 3.05it/s, est. speed input: 4166.43 toks/s, output: 125.21 toks/s] Processed prompts: 100%|███████████████| 67/67 [00:04<00:00, 16.35it/s, est. speed input: 6068.44 toks/s, output: 184.25 toks/s] RESULTS: percentage of correct answers: 0.29850746268656714 ============================== [rank0]:[W325 18:27:54.262290956 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see (function operator()) (.venv) (base) ``` - LLama3 1B with new data and 4 reward functions ```bash 2025-03-25 21:59:47.993 | DEBUG | rl_helpers:check_student_answers:493 - Verification result: True 2025-03-25 21:59:47.993 | DEBUG | rl_helpers:check_student_answers:493 - Verification result: True 2025-03-25 21:59:47.993 | DEBUG | rl_helpers:check_student_answers:493 - Verification result: False 2025-03-25 21:59:47.993 | DEBUG | rl_helpers:check_student_answers:493 - Verification result: False 2025-03-25 21:59:47.993 | INFO | rl_helpers:check_student_answers:495 - Verification complete. 11/32 answers correct 2025-03-25 21:59:47.994 | INFO | rl_helpers:run_eval:634 - EVALUATION RESULTS: 2025-03-25 21:59:47.994 | INFO | rl_helpers:run_eval:635 - Percentage of correct answers: 0.344 2025-03-25 21:59:47.994 | INFO | rl_helpers:run_eval:636 - ============================== [rank0]:[W325 21:59:48.406952498 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator()) (.venv) (base) ~/code/DeepSearch dev ✗ ``` - Llama3 7B with new data and 4 reward functions (bro wtf :()) ```bash 2025-03-25 17:07:05.533 | DEBUG | rl_helpers:check_student_answers:493 - Verification result: False 2025-03-25 17:07:05.533 | DEBUG | rl_helpers:check_student_answers:493 - Verification result: False 2025-03-25 17:07:05.533 | DEBUG | rl_helpers:check_student_answers:493 - Verification result: False 2025-03-25 17:07:05.533 | DEBUG | rl_helpers:check_student_answers:493 - Verification result: False 2025-03-25 17:07:05.533 | INFO | rl_helpers:check_student_answers:495 - Verification complete. 1/32 answers correct 2025-03-25 17:07:05.535 | INFO | rl_helpers:run_eval:634 - EVALUATION RESULTS: 2025-03-25 17:07:05.535 | INFO | rl_helpers:run_eval:635 - Percentage of correct answers: 0.031 2025-03-25 17:07:05.535 | INFO | rl_helpers:run_eval:636 - ============================== [rank0]:[W325 17:07:06.452081140 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator()) ``` - Llama3 7B with new data only ```bash 2025-03-25 16:48:33.168 | DEBUG | rl_helpers:check_student_answers:493 - Verification result: False 2025-03-25 16:48:33.168 | DEBUG | rl_helpers:check_student_answers:493 - Verification result: True 2025-03-25 16:48:33.168 | DEBUG | rl_helpers:check_student_answers:493 - Verification result: True 2025-03-25 16:48:33.168 | DEBUG | rl_helpers:check_student_answers:493 - Verification result: False 2025-03-25 16:48:33.168 | DEBUG | rl_helpers:check_student_answers:493 - Verification result: False 2025-03-25 16:48:33.168 | DEBUG | rl_helpers:check_student_answers:493 - Verification result: True 2025-03-25 16:48:33.168 | INFO | rl_helpers:check_student_answers:495 - Verification complete. 9/32 answers correct 2025-03-25 16:48:33.176 | INFO | rl_helpers:run_eval:634 - EVALUATION RESULTS: 2025-03-25 16:48:33.177 | INFO | rl_helpers:run_eval:635 - Percentage of correct answers: 0.281 2025-03-25 16:48:33.177 | INFO | rl_helpers:run_eval:636 - ============================== [rank0]:[W325 16:48:34.303740078 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator()) ``` ```bash === Evaluation 2025-03-26 14:06:01 === Model: meta-llama/Llama-3.2-1B-Instruct Checkpoint: trainer_output_meta-llama_Llama-3.2-1B-Instruct_gpu0_20250326_140600/checkpoint-101 Trained model accuracy: 0.281 (28.1%) Base model accuracy: 0.188 (18.8%) I mprovement: 0.094 (9.4%) ``` ```bash === Evaluation 2025-03-26 15:25:13 === Model: meta-llama/Llama-3.1-8B-Instruct Checkpoint: trainer_output_meta-llama_Llama-3.1-8B-Instruct_gpu1_20250326_134236/checkpoint-101 Trained model accuracy: 0.281 (28.1%) Base model accuracy: 0.188 (18.8%) Improvement: 0.094 (9.4%) ``` - waht the f*ck they have the same accuracy??? this is really bullshit. - new eval script ```bash orrect Sample outputs saved to trainer_output_meta-llama_Llama-3.1-8B-Instruct_gpu0_20250326_223903/lora_model_debug_outputs.txt Evaluation of LoRA model completed Accuracy: 0.7500 Results saved to ./test_results/20250326_223637/model_comparison_results.txt Model comparison completed. Base Model Accuracy: 0.6562 LoRA Model Accuracy: 0.7500 Improvement: 0.0938 Results saved to ./test_results/20250326_223637/model_comparison_results.txt [rank0]:[W326 22:42:43.889848919 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator()) Model comparison results saved to ./test_results/20250326_223637/model_comparison_results.txt ``` ## ✅ BRO YAY IT'S FKING WORKING ```bash Llama 3.1 8B + 4 Reward functions Base Model Accuracy: 0.0938 LoRA Model Accuracy: 0.3125 Improvement: 0.2188 Llama 3.1 8B + 2 reward fucntions Base Model Accuracy: 0.0625 LoRA Model Accuracy: 0.2188 Improvement: 0.1562 ``` - Bro the 1B model suck 👀 ```bash Sample outputs saved to trainer_output_meta-llama_Llama-3.2-1B-Instruct_gpu0_20250327_110154/lora_model_debug_outputs.txt Evaluation of LoRA model completed Accuracy: 0.0312 Results saved to model_comparison_results.txt Model comparison completed. Base Model Accuracy: 0.0625 LoRA Model Accuracy: 0.0312 Improvement: -0.0312 ```