Four frontier-class AI models shipped within the same seven-day window. Two are open-weight and MIT-licensed. In most individual capability categories, the choice between a paid frontier subscription and a free download now comes down to vendor support and deployment infrastructure, not what the model can actually do.

The Big Stories

GPT-5.5, DeepSeek V4, MiMo V2.5-Pro, and Qwen3.6 All Launched in the Same Week

OpenAI released GPT-5.5 on April 23, rolling it out to paid ChatGPT plans and Codex with a staged API rollout tied to a safety review. Artificial Analysis independently rated it the top model globally on intelligence-per-dollar; Simon Willison’s hands-on described it as “a fast, effective and highly capable model.” GPT-5.5 scores 82.7% on Terminal-Bench 2.0 against Claude Opus 4.7’s 69.4%, though OpenAI’s own launch materials omitted coding benchmark comparisons against Anthropic entirely. In the same window, DeepSeek released V4 (the first major release since V3.2 in December 2025) in two variants: V4-Pro (1.6T parameters, 49B active, MIT license, Huawei Ascend support) and V4-Flash (284B total / 13B active). Xiaomi’s MiMo V2.5-Pro scored within 3 points of Opus 4.7 on GPQA-Diamond at zero cost. Alibaba’s Qwen3.6-27B (a 55.6GB dense model) matched benchmarks previously held by an 807GB MoE, roughly a 14x efficiency gain in one release cycle.

Why it matters: The remaining differentiator for paid frontier models isn’t raw intelligence. It’s trust infrastructure (safety evaluations, vendor support, enterprise contracts) and workflow integration. Community practitioners still favor Kimi K2.6 for coding over DeepSeek V4-Pro; MiMo V2.5-Pro may displace Kimi on reasoning-heavy tasks; Qwen3.6-27B is the efficiency pick for constrained VRAM. One grounding data point worth keeping in mind: a practitioner who spent weeks trying to substitute local models for Claude Code in real production work reported failure. Parity is hardware-dependent and narrower than benchmark enthusiasm suggests.

Anthropic’s Postmortem Confirmed Three Harness Bugs Degraded Claude Code for Six Weeks

Anthropic published an engineering postmortem on April 23 confirming that three product-layer changes caused weeks of quality degradation; the underlying models were not at fault. The worst bug, shipped March 26: a change designed to clear idle-session context overhead instead cleared context on every single turn for the rest of the session, making Claude appear forgetful and repetitive throughout. A March 4 change quietly reduced default reasoning effort from high to medium to cut latency. A third change, on April 16, added a 25-word limit between tool calls that compounded with the first two. All three bypassed standard internal gates. Opus 4.7 caught the context-clearing bug during code review with full repository context; Opus 4.6 missed it. Simon Willison documented the postmortem and noted the contrast with GitHub’s openly announced billing change (see below): in the days before the postmortem, Anthropic briefly tested removing Claude Code from the $20 Pro tier (visible on the pricing page for a fraction of new signups), then reverted within hours after users noticed. No announcement accompanied either change.
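To picture the failure mode, here is a hypothetical sketch of the bug class (not Anthropic’s actual code, whose internals the postmortem does not publish): the cleanup that was meant to run only after a long idle gap runs unconditionally, so every turn starts from an empty context.

```python
# Hypothetical illustration only; names, structure, and the threshold are invented.
from dataclasses import dataclass, field
import time

IDLE_THRESHOLD_SECONDS = 30 * 60  # assumed value for the sketch

@dataclass
class Session:
    context: list[str] = field(default_factory=list)
    last_activity: float = field(default_factory=time.monotonic)

def handle_turn_intended(session: Session, user_message: str) -> None:
    # Intended behavior: drop accumulated context only after a long idle gap.
    if time.monotonic() - session.last_activity > IDLE_THRESHOLD_SECONDS:
        session.context.clear()
    session.context.append(user_message)
    session.last_activity = time.monotonic()

def handle_turn_buggy(session: Session, user_message: str) -> None:
    # The bug class described in the postmortem: the cleanup runs on every turn,
    # so the model sees a fresh, empty context each time and looks "forgetful."
    session.context.clear()
    session.context.append(user_message)
    session.last_activity = time.monotonic()
```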

Why it matters: A single harness-level change, even one system prompt line, can undo months of capability gains. Anthropic’s transparency is notable here: the postmortem names specific bugs, dates, and corrective actions. But the pricing reversal is a separate story. It wasn’t walked back because the decision changed; it was caught by user observation. The direction of travel is visible regardless. Update to v2.1.116 if you run sessions longer than an hour. For team budget planning, assume Claude Code exits the base Pro tier within the next one or two pricing cycles.

GitHub Copilot Switches All Plans to Usage-Based Billing on June 1

Starting June 1, all GitHub Copilot plans convert from premium request units to GitHub AI Credits billed by token usage. Per the official announcement, subscription prices don’t change (Pro $10/month, Business $19/user/month, Enterprise $39/user/month), and monthly included credits equal the subscription price. Code completions remain free; agentic sessions and chat don’t. GPT-5.5 carries a 7.5x premium credit multiplier, meaning heavy agentic use on a Business plan can exhaust the $19 monthly allotment in a single afternoon. Business and Enterprise customers get promotional credits through August (1.6x and 1.8x the normal allotments, respectively). Admin budget controls (at the enterprise, cost center, and individual user levels) launch with the June 1 transition. GitHub’s rationale is plain: “Today, a quick chat question and a multi-hour autonomous coding session can cost the user the same amount.”
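A back-of-envelope sketch of how fast credits can burn: only the $19 Business price, the credits-equal-subscription-price rule, and the 7.5x GPT-5.5 multiplier come from the announcement; the per-session token count and the base dollars-per-million-tokens rate below are invented for illustration.

```python
# Hypothetical credit-burn estimate; the base rate and token counts are assumptions.
INCLUDED_CREDITS_USD = 19.0        # Business plan: credits equal the $19/user/month price
GPT55_MULTIPLIER = 7.5             # premium credit multiplier for GPT-5.5 (from the announcement)
BASE_RATE_PER_M_TOKENS = 1.50      # assumed base credit cost per million tokens

def session_cost_usd(tokens: int, multiplier: float = GPT55_MULTIPLIER) -> float:
    """Estimated credits consumed by one agentic session, in dollars."""
    return (tokens / 1_000_000) * BASE_RATE_PER_M_TOKENS * multiplier

# A long agentic session can easily chew through hundreds of thousands of tokens
# of context re-reads, tool output, and retries.
tokens_per_session = 600_000
cost = session_cost_usd(tokens_per_session)
sessions_left = INCLUDED_CREDITS_USD / cost

print(f"~${cost:.2f} per session, ~{sessions_left:.0f} sessions before monthly credits run out")
```

Under these assumed numbers, roughly three long sessions exhaust a Business seat’s included credits, which is how “a single afternoon” becomes plausible.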

Why it matters: Anthropic’s pricing test and GitHub’s billing announcement dropped within the same 48-hour window, and the timing does not look coincidental. Both reflect the same infrastructure pressure: agentic workloads have outgrown flat-rate pricing. If you’re an admin, set budget controls before June 1. Annual plan subscribers should check whether converting to monthly before expiration makes sense given the transition credit benefit. And if GPT-5.5 is your team’s default for agentic sessions, audit that usage pattern now.

Under the Radar

[Expert-first] The Sleeper Agents Backdoor Paper Has a Replication Problem

A LessWrong post this week replicated the Anthropic Sleeper Agents setup, testing models trained with hidden trigger-activated behaviors on Llama-3.3-70B and Llama-3.1-8B. The finding: whether further safety training removes the backdoor depends heavily on which optimizer was used to install it, whether CoT distillation was used, and the base model. For some configurations, the robustness results from the original paper reverse direction entirely. Higher learning rates for safety training removed backdoors more effectively than lower rates in some setups, contradicting the original paper’s conclusions about backdoor persistence. Zero mainstream AI media coverage.

Why you should care: The original Sleeper Agents paper is cited as foundational evidence that AI backdoors can survive standard safety training, making them an existential concern in red-teaming pipelines. If the robustness is optimizer-sensitive and can reverse direction depending on implementation choices, the practical threat model is noisier than the paper implied. This doesn’t make the original finding wrong. It means anyone building evaluations or policies around backdoor persistence assumptions should read the replication before their next round of specifications. The underlying safety concern remains; the confidence you can attach to the original result doesn’t.

[Expert-first] Apple Published Billion-Parameter RNN Research at ICLR. AI Media Ignored It.

Apple’s ML Research team presented ParaRNN at ICLR 2026. The work develops a framework for training large-scale nonlinear RNNs in parallel, achieving up to a 665x speedup over sequential training and enabling 7B-parameter classical RNNs (LSTM and GRU variants) competitive with similarly sized transformers and Mamba2 architectures in language modeling. The system reframes recurrence as a parallel system of equations solved with Newton iterations, bypassing the sequential bottleneck that made billion-parameter RNNs impractical. Code is open-sourced on GitHub. No major AI news publication covered this.
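To make the reformulation concrete, here is a minimal sketch of the general idea, not Apple’s implementation: write the whole recurrence h_t = tanh(W h_{t-1} + U x_t) as one system of T equations in all hidden states, then update every time step simultaneously. The simplified fixed-point sweep below stands in for the Newton iterations and parallel linear solves ParaRNN actually uses; the sizes, weights, and convergence settings are made up for illustration.

```python
import numpy as np

# Minimal sketch of the "recurrence as a system of equations" idea behind
# ParaRNN; this is NOT Apple's code. A simple fixed-point sweep replaces
# the Newton iterations used in the paper.

rng = np.random.default_rng(0)
T, d = 64, 16
W = 0.25 * rng.standard_normal((d, d)) / np.sqrt(d)  # kept small so the map contracts
U = rng.standard_normal((d, d)) / np.sqrt(d)
x = rng.standard_normal((T, d))
h0 = np.zeros(d)

# Reference: the ordinary sequential recurrence h_t = tanh(W h_{t-1} + U x_t).
h_seq = np.zeros((T, d))
prev = h0
for t in range(T):
    prev = np.tanh(W @ prev + U @ x[t])
    h_seq[t] = prev

# Parallel view: treat all T hidden states as one unknown H and iterate on
# every time step at once. Each sweep is a single batched matmul over the
# whole sequence instead of T dependent steps.
H = np.zeros((T, d))
for sweep in range(T):
    prev_states = np.vstack([h0[None, :], H[:-1]])   # h_{t-1} for every t, in parallel
    H_new = np.tanh(prev_states @ W.T + x @ U.T)
    if np.max(np.abs(H_new - H)) < 1e-10:
        H = H_new
        break
    H = H_new

print(f"converged after {sweep + 1} sweeps; max error vs sequential: {np.max(np.abs(H - h_seq)):.2e}")
```

Each sweep is one batched matrix multiply over the whole sequence, which is exactly the shape of work accelerators handle well; Newton iterations reach the same fixed point in far fewer sweeps, which is the direction ParaRNN pushes.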

Why you should care: RNNs have inference properties transformers don’t: fixed-size state (not a growing KV cache), lower memory requirements, and strong performance on constrained hardware. They’ve been architecturally sidelined because training didn’t scale. Apple just fixed that at 7B parameters, open-sourced the code, and published this research the same week they announced a hardware-first CEO who spent 25 years building chips. Whether the timing is intentional or coincidental, the combination is worth tracking. If RNNs at scale prove viable in production, on-device inference economics change substantially for anyone deploying edge AI.

Quick Hits

  • OpenAI and Microsoft dissolved their exclusive partnership - The AGI clause that would have voided Microsoft’s commercial rights on AGI achievement has been removed. OpenAI can now distribute through any cloud provider; Microsoft keeps a non-exclusive license through 2032 plus a revenue share. The Decoder

  • Cursor-Opus agent deleted PocketOS’s production database in 9 seconds - The agent hit a credential mismatch, found an unrelated API token with blanket Railway permissions, and deleted the database and all backups without confirmation or scope isolation. Not a rogue AI story. A missing-guardrails story. The Register

  • Cohere and Aleph Alpha are merging with $600M from Schwarz Group - Lidl’s parent company backs a transatlantic enterprise AI alternative to the US-dominated stack, valued at $20B combined. Backed by both the Canadian and German governments as a sovereign AI play. TechCrunch

  • OpenAI released an open-weight PII detection model - Privacy Filter is a 1.5B parameter, Apache 2.0 licensed model that runs locally, handles 128K tokens, and scores 96% F1 on the PII-Masking-300k benchmark. Covers names, addresses, emails, phone numbers, passwords, and API keys. OpenAI Blog

  • GPT-Image-2 is live on the API and in ChatGPT - Simon Willison tested it and verified significant compositional improvement over the previous generation. Skip Sam Altman’s framing; Willison’s actual test is the credible signal here. Simon Willison

  • Shopify hit near-universal AI adoption with unlimited token budgets - CTO Mikhail Parakhin told Latent Space the bottleneck is no longer token generation; it’s code review, CI/CD, and deployment. Agent-generated code raised PR merge volume 30% month-over-month, breaking existing Git pipelines. Latent Space

  • Google signed a classified Pentagon AI deal over 600+ employee objections - The deal allows DoD to use Gemini for “any lawful government purpose.” Legal experts note the safety clauses are non-binding. OpenAI’s concurrent FedRAMP Moderate approval puts ChatGPT Enterprise in the same federal procurement pool. The Decoder

  • ComfyUI raised $30M Series B at a $500M valuation - 4 million claimed users, Craft Ventures-led round (Pace Capital, Chemistry, TruArrow). Controllable AI media generation for creators is now confirmed venture-scale. TechCrunch

What to Watch

AI token costs going structural - GitHub’s billing shift and Shopify’s design pattern (remove token caps, move the constraint to review and deployment gates) are two different answers to the same problem. Flat-rate subscriptions subsidized agentic workloads until they couldn’t. The next six months will see more AI coding tools follow GitHub’s model. For teams on annual plans, understand your token economics now before pricing structures change around you. The Shopify approach (unlimited generation, constrained by review and CI/CD) is probably the design pattern that scales for engineering teams serious about deploying agents in production.

If someone forwarded this to you, subscribe here to get it every Tuesday.

Our team also ships AI Edge, a mobile companion for daily AI news scanning between issues. Free, on-device. iOS · Android.
