Friday, May 15, 2026

1 run · 17 raw items · 20 sources, 1 failed

12:14

Codex goes mobile at 4M weekly users while an Ontario audit guts the doctor's-AI-scribe story — agent-first wins one product surface and loses another on the same day.

Codex in ChatGPT mobile, 4M weekly users

OpenAI shipped a fully-featured mobile companion that drives Codex sessions on your laptop, dev box, or remote env — review diffs, approve commands, dispatch new tasks from the phone. The interesting number is in the companion Sea/Shopee post: 87% weekly-active among rolled-out users at one large engineering org. That's an adoption curve you only see when a tool stops being optional.

Codex mobile Sea case study

agent coding-agents vendor:openai product

NVIDIA Vera Rubin + Groq LPX: agentic inference gets its own fabric

The pitch is that multi-turn agents with growing KV caches make traditional data-center networking the bottleneck — jitter compounds across hundreds of round-trips per session. Rubin NVL72 (3,600 PFLOPS NVFP4, 20.7 TB HBM4, 1.6 PB/s per rack) pairs with Groq's LPX scale-up — 96 C2C links/LPU at 112 Gbps, compiler-scheduled static routing, plesiosynchronous timing across thousands of chips, 128 GB unified SRAM per domain. Headline claim: 400 tok/s/user on a trillion-param MoE at 400K context, 35× perf/MW. The architectural bet is clear: deterministic, jitter-free, non-RoCE fabric for the agent era.

NVIDIA blog

compute infrastructure vendor:nvidia agent inference

Ontario auditors gut the AI-scribe category

The Office of the Auditor General evaluated 20 approved AI scribe vendors used by 5,000+ Ontario physicians: 45% fabricated treatment-plan content never said in the recording, 60% inserted incorrect drug information, 85% missed mental-health flags. Most damning is the procurement math — accuracy weighed 4% of vendor scoring while 'Ontario domestic presence' weighed 30%. First hard external audit of the medical-AI scribe pitch, and it lands while Abridge is doing victory-lap podcasts about being the OS of healthcare.

healthcare evaluation policy hallucination

SU-01: gold-medal IMO/IPhO from a 30B-A3B with a 'simple, unified' recipe

Reverse-perplexity-curriculum SFT on ~340K sub-8K trajectories, then 200 steps of two-stage RL (verifiable-rewards → proof-level), then test-time scaling out to 100K+ token traces. The interesting claim isn't the medal — frontier labs already cleared IMO — it's that a 30B MoE with a published recipe reportedly reproduces it. If it replicates, the moat on olympiad-level reasoning is mostly post-training engineering, not parameter count.

HF paper

paper reasoning rlhf open-weight

Hashimoto: programming languages are no longer lock-in

Off the back of Bun's Zig→Rust migration: 'Bun has shown they can be in probably any language they want in roughly a week or two. Rust is expendable.' Willison adds an anecdote about a coding-agent-driven rewrite of paired iOS/Android apps to React Native, on the logic that if it turns out wrong, agents can port it back. The lock-in calculus that decided architecture choices for two decades is dissolving in real time.

Willison post

coding-agents agentic-engineering industry

Themes

Agent-first IDE convergence (the Conductor crab)

Smol AI's framing today: GitHub Copilot App, VS Code Agents, Codex mobile, and Conductor are independently evolving the same body plan — parallel task management, multi-agent supervision, 'review on phone, run on box.' The code-first IDE is becoming a vestigial sub-mode inside an agent-coordination shell.

Multimodal-memory benchmarks are catching up to the marketing

Two HF papers today (MemLens, MemEye) both target long-horizon multimodal memory and converge on the same conclusion: long-context VLMs degrade as conversations grow, memory-augmented agents lose visual fidelity under compression, and multi-session reasoning caps most systems below 30%. Hybrid retrieval + long-context is the only path forward — but nobody has it yet.

Worth reading in full

Latent Space — AI-Native Healthcare with Abridge — The sell-side framing the Ontario audit just collided with — read both back-to-back.
Pragmatic Engineer — Forward deployed engineering heats up again — Paywalled preview, but the FDE meme is back as the org-design answer to agentic coding.
Simon Willison — Not so locked in any more — Language-fungibility thesis with a concrete React Native rewrite anecdote.

Skipped: The AI-poop-app-sells-user-data story, an HN flurry of one-off AI-policy blog posts and local-LLM-without-GPU content, and the Meta-keystroke-logging-for-AI-training piece. The Anthropic Research feed only returned 2023 backfill on a fetch failure. The Verge 'AI slop in peer review' and ArsTechnica Claude Code interview both 403'd on deep-fetch — RSS excerpts too thin to summarize without inventing detail.