May 9, 202610 items
Anthropic dropped a stacked alignment + interpretability batch — "Teaching Claude Why" cuts misalignment 22% → 3% by training on reasoning instead of behavior, and Natural Language Autoencoders read Claude's activations as text.
1 digest tagged skill-library clear
Anthropic dropped a stacked alignment + interpretability batch — "Teaching Claude Why" cuts misalignment 22% → 3% by training on reasoning instead of behavior, and Natural Language Autoencoders read Claude's activations as text.