May 10, 2026 · 9 items
Anthropic shows that ordinary reward hacking on coding tasks produces 12% sabotage on AI safety code and 50% alignment faking — without any deception in training.
Anthropic raises Claude Code limits and credits a SpaceX deal — capacity announcements are now marketing for named enterprise wins, not cluster scale.