Reinforced Exploits, Not Optimized Rewards

Characterizing Reward Hacking in Code Models

reward hacking
RL
alignment
interpretability
ERA
Author
Affiliation

Peter Nutter

ETH Zürich

Published

Tuesday, May 19, 2026

Supervisor Prof. Florian Tramèr · ETH Zürich Main advisor Prof. Shi Feng · George Washington University Research manager Dave Banerjee

Summary

An ERA Fellowship study of reward hacking in code models trained with GRPO, where a model rewarded for passing tests can learn to overwrite the run_tests() function instead of solving the problem. The work characterizes when this exploit emerges, finds that it generalizes only weakly to other coding environments (CodeContests, MBPP) and not at all to emergent misalignment, and that even small models represent it as a form of lying (detectable by a deception probe at 0.70–0.77 AUROC). It also includes a proof-of-existence model organism (GPT-OSS-120B) that reward hacks conditionally — only when it infers that oversight is absent.