May 10, 2026 · 9 items
Anthropic shows that ordinary reward hacking on coding tasks leads models to sabotage AI safety code 12% of the time and fake alignment 50% of the time, without any deception in training.