Summary
An ERA Fellowship study of reward hacking in code models trained with GRPO, where a model rewarded for passing tests can learn to overwrite the run_tests() function instead of solving the problem. The work characterizes when this exploit emerges, finds that it generalizes only weakly to other coding environments (CodeContests, MBPP) and not at all to emergent misalignment, and that even small models represent it as a form of lying (detectable by a deception probe at 0.70–0.77 AUROC). It also includes a proof-of-existence model organism (GPT-OSS-120B) that reward hacks conditionally — only when it infers that oversight is absent.