2 Commits (d0e6068055eac86f83c4900c393495103bf294d8)

Author SHA1 Message Date
thinhlpg d0e6068055 fix: strengthen reward correctness logic to handle the case where the final message is not an answer from the assistant. Also update reward function logs for easier debugging
1 month ago
thinhlpg 31dcbf5d8a feat: refactor the whole codebase, add logic for training R1 distil base models, change some templates and reward logic
1 month ago