Thursday, May 14, 2026

1 run · 27 raw items · 20 sources, 1 failed

00:14

1Password's monolith-refactor postmortem is the most useful data point of the day: agents delivered a 20–30% lift on the hard parts, but only after humans built the scaffolding and caught the speculation.

1Password publishes concrete numbers on agent-driven refactor of its Go monolith

B5 is 1Password's Go monolith and they used agents to decompose it for the Unified Access product, touching over 3,000 call sites. The honest numbers are modest: 20–30% productivity gain on service extraction, with the headline win being a database transaction migration that 'took a matter of hours' once a human had specified the change and built the tooling. The named failure modes are useful: agents tried to backfill UUID columns before updating insertion code (data-loss risk), and speculated about formats like ULID when context was missing, triggering rollbacks. This is the rare deployment post that says how it actually went rather than how it could go.

1Password blog HN thread

agent deployment software-engineering case-study

ToolCUA pushes OSWorld-MCP to 46.85% by learning when to drop GUI clicks for tool calls

Computer-use agents have a hybrid action space — clicks and types versus API/tool calls — and ToolCUA treats the switching decision as the actual hard problem, not the individual actions. They synthesize interleaved GUI–Tool trajectories from existing static GUI data, then do a staged SFT → single-turn RL → online agentic RL pipeline with a Tool-Efficient Path Reward. 46.85% on OSWorld-MCP is a 66% relative jump over the baseline at comparable scale; the more interesting bit is that the field is now optimizing path orchestration, not pixel grounding.

Paper

agent computer-use paper rl benchmark

MCP-Cosmos bolts a generative world model onto MCP agents

The thesis: MCP gave agents a uniform tool interface but no internal model of the environment, so they either plan blindly upfront (ReAct) or react without foresight. MCP-Cosmos lets you 'Bring Your Own World Model' to simulate state transitions in latent space before committing to a real tool call. The results on 20+ MCP-Bench tasks show gains in tool success rate and parameter accuracy, which is exactly where current MCP agents bleed. World models for agents is a direction worth watching — it's the same insight that drove model-based RL fifteen years ago, re-purposed for the LLM tool-use stack.

Paper

agent paper world-model mcp

Token-Superposition Training claims 2.5× pre-training speedup at 10B MoE scale

TST trains in two phases: a 'superposition' phase that bags many contiguous tokens together and trains with multi-hot cross-entropy, then a recovery phase that reverts to standard training. No changes to parallelism, optimizer, tokenizer, data, or architecture — a drop-in replacement. Validated at 270M and 600M, then on 3B and a 10B A1B MoE; 2.5× wall-clock reduction at equal loss. If this replicates at frontier scale it's a free lunch most labs will adopt within a quarter; the catch is always whether 'equal loss' actually maps to equal downstream behavior in a regime where training time costs nine figures.

Paper

paper pretraining efficiency moe

Sean Goedecke walks through the actual physics of space datacenter cooling

Conventional wisdom says vacuum makes cooling impossible because conduction and convection don't work. Goedecke's point is that radiation does, and radiating heat against the 3K background of shaded space is genuinely easier than dumping it into a 300K Earth atmosphere. The cost is area: ~2,500 m² of radiator per megawatt, so a 100 MW facility needs ~250,000 m² — a 250× scale-up over the ISS's radiators and 100–500 Starship launches of hardware. The piece is a useful corrective when 'space datacenters are pure hype' becomes the lazy take; the actual blocker is launch cost and area, not thermodynamics.

Sean Goedecke

infrastructure compute physics

Themes

The 'what is an agent' fog refuses to lift

Boris Mann (via Simon Willison) nails it: '11 AI agents' is as informative as '11 spreadsheets' or '11 browser tabs.' The 1Password write-up is the practical other side of the same coin — what they actually shipped looks nothing like a generalist agent and everything like a specialized harness with structured outputs, human-written specifications, and rollback gates. The industry's marketing vocabulary and its engineering vocabulary are now fully detached from each other.

Agent research is converging on path orchestration

ToolCUA optimizes when to switch between GUI actions and tool calls; MCP-Cosmos optimizes which tool call to commit to by simulating it first. Both are about the meta-decision layer above the action, not the action itself. Pixel grounding and tool invocation are increasingly treated as solved primitives — the interesting margins are upstream.

Worth reading in full

1Password: What we learned using AI agents to refactor a monolith — The few production-grade agent postmortems out there are worth more than a hundred demo videos; this one names its failure modes.
Anders Hejlsberg on TypeScript, C#, Turbo Pascal — and AI's future for software engineering — The architect of three of the most widely-used languages of the last 40 years on where the craft is headed; even a brief paywall preview is worth your time.
Sean Goedecke: AI datacenters in space do not have a cooling problem — Physics-grounded contrarian piece; the kind of analysis missing from most space-compute discourse.

Skipped: Anthropic Research's RSS backfilled eight 2024-dated posts (Circuits Updates, Claude's Character, etc.) — interesting historically but not news. The GitHub Blog roguelikes piece is not AI. Most HN low-vote items are self-promo (yeah CLI, Robyx-AI, free-scene movie maker). NVIDIA Developer's video-AI and XANI X-ray posts are vendor marketing. Microsoft Research feed timed out; will retry next slot.