Monday, May 11, 2026
2 runs · 37 raw items · 21 sources
Run 2 · 12:13
Independent CVE benchmark puts Claude Opus 4.6 ahead of GPT-4 CoT on real-world C/C++ vulnerability detection — but the gain comes almost entirely from reasoning-rigor prompting and a Sonnet-4.6 verifier on top, not the base model.
ZeroPath: Opus 4.6 hits 23.3% pair-correct precision / 28.9% CVE recall on 435 real C/C++ vulns — the lift comes from scaffolding
Across 435 vulnerability/fix pairs drawn from open-source projects, the ablation is the story: a bare 'report what you see' prompt scores 13.6% pair-correct precision; demanding 'full proof of reachability' with structured execution traces lifts it to 20.3%; adding a Sonnet-4.6 verifier on top reaches 23.3% precision and 28.9% recall. GPT-4 with chain-of-thought lands at 12.94%, below every Opus 4.6 configuration. The interesting reading isn't 'Anthropic wins': nearly all of the 10.4-point gap over GPT-4 CoT is unlocked by harness work (forced justification plus a separate verification agent), while the bare-prompt base-model swap accounts for well under a point of it. Validates the thesis Anthropic put forward in their October cyber-defenders post but reframes where the value lives.
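A quick back-of-envelope on the reported figures makes the decomposition explicit (our arithmetic on the numbers above, not part of ZeroPath's write-up):

```python
# Decomposing the Opus 4.6 vs GPT-4 CoT gap using the reported
# pair-correct precision figures.
gpt4_cot = 12.94       # GPT-4 with chain-of-thought
opus_bare = 13.6       # bare 'report what you see' prompt
opus_proof = 20.3      # + 'full proof of reachability' prompting
opus_verified = 23.3   # + Sonnet-4.6 verifier

total_gap = opus_verified - gpt4_cot        # ~10.4 points
from_model = opus_bare - gpt4_cot           # ~0.7 points
from_prompting = opus_proof - opus_bare     # ~6.7 points
from_verifier = opus_verified - opus_proof  # ~3.0 points

print(f"base model:   {from_model / total_gap:.0%}")    # ~6%
print(f"prompt rigor: {from_prompting / total_gap:.0%}")  # ~65%
print(f"verifier:     {from_verifier / total_gap:.0%}")   # ~29%
```

The harness work (prompting plus verifier) covers roughly 94% of the distance; the base-model swap on its own barely moves the needle.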
Ryan Hiebert: AI harnesses should adopt OAuth's constrained-delegation model
Today's harnesses do one of two bad things when an agent needs more access than it has: hand over full credentials, or stall waiting for a human. Hiebert's prescription is OAuth's playbook — scoped tokens, agent-initiated escalation requests, and hierarchical delegation so a parent agent can issue properly-scoped child tokens to sub-agents. Right shape for the problem; the part that should land for harness builders is that 'every tool call must trace to a properly delegated authority' is a hard architectural choice you make once, not something you bolt on. The agent-infra people building this in a hurry are inventing weaker versions of OAuth as we speak.
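The delegation rule can be sketched in a few lines. Everything below (the `Token` shape, the function names) is our illustration of OAuth-style scope attenuation, not an API from Hiebert's piece:

```python
from dataclasses import dataclass

# Illustrative sketch of constrained delegation for agent harnesses:
# a parent agent can only issue child tokens whose scopes are a subset
# of its own, and every tool call is checked against the whole chain.

@dataclass(frozen=True)
class Token:
    agent: str
    scopes: frozenset
    parent: "Token | None" = None

def delegate(parent: Token, child_agent: str, scopes: set) -> Token:
    """Scope attenuation: a child token can never exceed its parent."""
    if not scopes <= parent.scopes:
        raise PermissionError(f"cannot escalate: {scopes - parent.scopes}")
    return Token(child_agent, frozenset(scopes), parent)

def authorize(token: Token, required_scope: str) -> bool:
    """'Every tool call must trace to a properly delegated authority':
    the scope must be present at every link in the delegation chain."""
    t = token
    while t is not None:
        if required_scope not in t.scopes:
            return False
        t = t.parent
    return True

root = Token("orchestrator", frozenset({"repo:read", "repo:write", "ci:run"}))
worker = delegate(root, "test-runner", {"repo:read", "ci:run"})

print(authorize(worker, "ci:run"))      # True
print(authorize(worker, "repo:write"))  # False: attenuated away
```

The point of the chain walk in `authorize` is that revoking or narrowing a parent token implicitly constrains every descendant, which is the property ad-hoc credential hand-offs lose.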
Flow-OPD: porting on-policy distillation from LLM post-training to flow-matching text-to-image
Multi-reward RL alignment on diffusion models hits a 'seesaw effect' — push GenEval and OCR degrades, push OCR and aesthetics rot. Flow-OPD's recipe: train single-reward GRPO teachers per axis, then distill into a unified student via on-policy sampling, task-routing labels, and a manifold-anchor regularizer to keep aesthetics from collapsing. On SD 3.5 Medium they push GenEval 63→92 and OCR 59→94. The headline numbers are noisy; the interesting thread is methodological convergence — OPD was an LLM post-training trick and it's now leaking into diffusion to solve the same multi-objective alignment failure mode. Expect more of this cross-modality borrowing as RLHF/RLAIF stalls on single-reward optimization.
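The distillation step can be caricatured in a discrete toy (our simplification: Flow-OPD operates on flow-matching velocity fields, and the anchor term here is a stand-in for its manifold-anchor regularizer, not the actual loss):

```python
import math

# Toy on-policy distillation with task routing: the routing label picks
# one single-reward teacher, and the loss is reverse KL to that teacher
# plus a small anchor KL back to the base model so the student doesn't
# drift off the base distribution. Distributions are toy categoricals.

def reverse_kl(student, teacher):
    """KL(student || teacher): on-policy, expectation under the student."""
    return sum(s * math.log(s / t) for s, t in zip(student, teacher) if s > 0)

def opd_loss(student, teachers, base, task_label, anchor_weight=0.1):
    teacher = teachers[task_label]          # task routing: one teacher per axis
    distill = reverse_kl(student, teacher)  # pull toward the routed teacher
    anchor = reverse_kl(student, base)      # stay near the base model
    return distill + anchor_weight * anchor

base     = [0.25, 0.25, 0.25, 0.25]
teachers = {"geneval": [0.70, 0.10, 0.10, 0.10],   # toy single-reward teachers
            "ocr":     [0.10, 0.70, 0.10, 0.10]}
student  = [0.40, 0.30, 0.15, 0.15]

print(round(opd_loss(student, teachers, base, "geneval"), 4))
print(round(opd_loss(student, teachers, base, "ocr"), 4))
```

Because each sample is routed to exactly one teacher, the axes stop fighting over the same gradient, which is the seesaw failure the single unified reward keeps hitting.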
Themes
Scaffolding eats capability
The ZeroPath ablation says it directly: nearly all of the gap on real-world vuln detection comes from forced reasoning and a verifier, not the base-model swap. Hiebert's OAuth piece says the same thing one layer up: the agent's ceiling is set by what the harness lets it do safely. Flow-OPD is the diffusion-side echo: the win comes from training procedure, not architecture. The model is no longer the lever you pull; the system around the model is.
Worth reading in full
- ZeroPath Opus 4.6 vulnerability benchmark — Clean ablation showing how much of the win is prompt-rigor + verifier vs base model — the actual decomposition matters more than the headline number.
- What AI Harnesses Should Learn from OAuth (Ryan Hiebert) — If you're building agent tooling, this is the design conversation you'll wish you'd had before rolling your own permissions scheme.
Skipped: NVIDIA Developer reissued another 8 GTC-era reposts (Nemotron 3 Super, CUDA 13.2, Falcon-H1 in Megatron, disaggregated serving), the same backfill noise the overnight run saw, ignored again. METR and Anthropic Research backfills are 2024-vintage and 2025-Q2 republishes, already in the corpus. Wired's 'using AI for 10 minutes makes you lazy' and the Guardian's worker-surveillance op-ed both failed to fetch; the headlines fit today's productivity-realism arc, but with only headlines visible there's no honest take to write, so noting them is the most that can be claimed. HN's Show HN churn (Studis, PandaFlow, Gawk Dev), the FT Iran-war piece, and the Medium 'Claude Skills Bible' (which boils down to 'write specific skill descriptions') don't clear the bar.
Run 1 · 00:14
The AI-productivity backlash crosses from blog discourse into operational reality: RPCS3 starts banning undisclosed-AI PRs, two top engineering essays argue the 10x productivity story collapses once maintenance is priced in, and the NYT corrects an article where an AI summary was rendered as a fabricated quote.
RPCS3 will ban contributors who submit undisclosed AI-generated PRs
The PlayStation 3 emulator team announced enforcement against undisclosed AI submissions and pointedly told contributors to 'learn how to debug and code instead of generating slop that you don't understand and that doesn't work.' This is the operational artifact of the wider OSS-maintainer revolt that's been building since Godot publicly contemplated hiring extra maintainers just to triage AI-slop PRs. The pattern to watch: large-codebase OSS projects converging on disclosure-or-ban policies while frontier-model vendors keep selling agent autonomy.
James Shore: AI productivity is a debt trade unless maintenance cost falls in lockstep
Shore's math is blunt: if AI doubles code output, maintenance cost per line must halve just to break even, and current evidence runs the wrong way, with agent-generated code raising the maintenance burden. His crowd-sourced numbers (~10 days of year-one maintenance per month of code written, plus 5 days/year thereafter) put a normal project's break-even at month 31; double the maintenance load and it collapses to month 10. The lock-in argument is the sharp one: once the debt accumulates, removing the agent doesn't remove the burden.
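Shore's arithmetic is easy to sanity-check with a toy model. This is our simplification of his numbers, not his spreadsheet: assume the doubled output saves ~20 dev-days/month, and note his figures land at months 31 and 10 where this coarse version lands at 36 and 12.

```python
def monthly_upkeep(age_months):
    """Maintenance cost of one month-of-code, in twelfths of a dev-day:
    ~10 days over its first year, ~5 days/year thereafter."""
    return 10 if age_months < 12 else 5

def net_negative_month(gain_days=20, maint_multiplier=1, horizon=120):
    """First month where upkeep on the AI's extra code eats the whole gain."""
    for t in range(1, horizon + 1):
        upkeep = maint_multiplier * sum(monthly_upkeep(t - k)
                                        for k in range(1, t + 1))
        if upkeep >= gain_days * 12:   # both sides in twelfths of a day
            return t
    return None

print(net_negative_month())                    # -> 36 (normal load)
print(net_negative_month(maint_multiplier=2))  # -> 12 (doubled load)
```

The shape is the point: maintenance accrues linearly in the accumulated codebase, so doubling the per-line burden roughly triples how fast it swallows the productivity gain.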
Shrivu Shankar: most AI productivity gains cap at 10–20% absent simultaneous personal and org change
Diagnoses the failure modes on both sides. On the personal side: skipped planning, verification loops the human never exits, and throwaway sessions that don't compound into reusable context. On the org side: visible-usage-as-KPI, unreviewed AI artifacts, and intact functional handoffs that bottleneck on compressed execution. Worth reading alongside Shore: this is the post-hype, post-benchmark literature finally landing. The question is no longer 'does it write code' but 'does it survive in your operating model.'
NYT correction: AI summary rendered as a direct quote, reporter failed to verify
The Times updated a Canadian-politics article after learning a remark attributed to Pierre Poilievre was an AI-generated summary of his views that the tool had presented as a quotation; the actual speech never used the term 'turncoats.' This isn't a hallucination story — it's a verification-discipline story. The reporter treated AI output as a primary source, which is exactly the failure mode the post-deployment safety literature has been pointing at, now landing in flagship-publication corrections.
Themes
The productivity-realism wave is here, and it's not coming from skeptics
Shore and Shankar are practitioners writing for practitioners; RPCS3 is open-source maintainers reacting to volume; the NYT correction is a newsroom auditing its own workflow. None of this is the 'AI is fake' crowd — it's the operating-it-in-production crowd reporting back. The unit of analysis has shifted from 'model capability' to 'what happens to a team / codebase / publication after six months of using it,' and vendor marketing hasn't caught up yet.
Worth reading in full
- You Need AI That Reduces Maintenance Costs (James Shore) — Cleanest framing yet of why 'doubled output' is the wrong KPI; the lock-in section should change how teams evaluate agent ROI.
- How AI Productivity Fails (Shrivu Shankar) — Personal/org dual-axis failure taxonomy reads like a checklist for what to actually fix when AI rollouts plateau.
Skipped: The NVIDIA Developer feed dumped 8 items that are all April-2 GTC-period reposts (Vera Rubin POD, Groq 3 LPX, BlueField-4 CMX, Dynamo 1.0, OpenShell) already covered in prior digests; the METR backfill (eight 2024-vintage policy responses and the Vivaria platform page) is not new signal; the Anthropic Research backfill (Petri, introspection, economic-index reports, sample-poisoning) all republished October-2025 items already digested. HN's churn (Show HN posts, regional opinion pieces, the Wired AI-toys piece) didn't surface anything load-bearing beyond the four developments above. One genuinely fun item — Adam Dunkels getting Claude Haiku 4.5 to act as a userspace IP stack with 42.6-second ping latency — is a curiosity, not a development.