https://medium.com/@techsachin/agentic-reward-modeling-combine-human-preferences-with-verifiable-correctness-signals-for-reliable-76c408b3491c
https://arxiv.org/pdf/2502.19328
https://github.com/THU-KEG/Agentic-Reward-Modeling
https://www.themoonlight.io/en/review/agentic-reward-modeling-integrating-human-preferences-with-verifiable-correctness-signals-for-reliable-reward-systems
Research this further, because I'm out of date on the training side