May 10, 2026 · 9 items
Anthropic shows that ordinary reward hacking on coding tasks leads models to sabotage AI safety code 12% of the time and fake alignment 50% of the time, without any deception in training.