An AI agent ran 700 machine learning experiments on Andrej Karpathy’s training script over two days and found 20 optimizations he’d missed over 20 years. A single GPU. No internet access. Then Shopify’s CEO ran the same idea on internal data and cut rendering time by 53%. Autonomous research isn’t coming. It’s already running overnight jobs.

The Big Stories

An AI Agent Found 20 Improvements Karpathy Had Missed in Two Decades

Andrej Karpathy released autoresearch on GitHub: a 630-line Python framework that lets an AI agent run ML experiments autonomously. The agent reads a training script, forms a hypothesis, modifies the code, trains for five minutes, evaluates results, and loops. Two days on a single GPU produced 700 experiments and an 11% training speedup, including improvements Karpathy says he hadn’t found in 20 years of working on the same codebase. Shopify CEO Tobias Lütke ran the same approach on internal data: 37 overnight experiments, 19% performance gain, 53% faster Liquid template rendering, 93 automated commits, all 974 unit tests passing. The repo hit 42,000 GitHub stars in its first week. (The Decoder, Fortune, Latent Space)
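
The loop is simple enough to sketch from that description. The outline below is illustrative, not Karpathy's code: propose_change and evaluate are hypothetical stand-ins for the LLM edit step and the five-minute train-and-measure step.

```python
import time


def propose_change(source: str) -> str:
    """Stand-in for the agent step: an LLM reads the current training
    script, forms a hypothesis, and returns a modified copy."""
    raise NotImplementedError("wire this to your model of choice")


def evaluate(source: str) -> float:
    """Stand-in for a short training run that ends in one scalar,
    e.g. tokens/sec for a speed objective (higher is better)."""
    raise NotImplementedError("wire this to your training harness")


def research_loop(script_path: str, budget_hours: float = 48.0) -> tuple[str, float]:
    """Hypothesize -> edit -> train -> evaluate -> keep or revert, repeated."""
    with open(script_path) as f:
        best = f.read()
    best_score = evaluate(best)              # baseline from the unmodified script
    deadline = time.time() + budget_hours * 3600

    while time.time() < deadline:
        candidate = propose_change(best)     # agent proposes one edit
        score = evaluate(candidate)          # ~5-minute run, one number out
        if score > best_score:               # the scalar decides keep or revert
            best, best_score = candidate, score
    return best, best_score
```

Everything interesting lives in those two stubs; the loop itself is greedy hill-climbing on one number.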

Why it matters: The architecture is deliberately minimal. A single program.md file carries the agent’s instructions, constraints, and stopping criteria. That’s the entire design. One constraint applies: autoresearch only works where you can measure quality with a single scalar metric. Alignment, interpretability, and product decisions don’t qualify. But for ML training, hyperparameter tuning, and performance benchmarking, the default starting point changed this week. Run autoresearch before touching anything by hand.
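
To make the single-file design concrete, here is a guess at the shape such a file takes. The coverage doesn't reproduce autoresearch's actual program.md, so every line below is hypothetical, including the script name.

```markdown
# program.md (hypothetical example, not the repo's actual file)

## Objective
Reduce wall-clock training time for train_gpt.py.
Metric: tokens/sec on a fixed eval harness (higher is better).

## Constraints
- One GPU, no internet access.
- Each experiment trains for at most 5 minutes.
- All unit tests must pass before a change is kept.

## Stopping criteria
- Stop after 48 hours of wall-clock time.
- Revert any change that regresses the metric.
```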

Senator Warren Calls the Pentagon’s Anthropic Blacklisting “Retaliation” as Injunction Hearing Looms

Senator Elizabeth Warren sent a letter to Secretary of Defense Pete Hegseth calling the “supply chain risk” designation against Anthropic “what appears to be retaliation.” That designation (historically reserved for Huawei and Kaspersky) forces every Pentagon contractor to certify they don’t use Anthropic products; because major contractors serve agencies far beyond DoD, it amounts to exclusion from the entire U.S. government supply chain. Warren’s argument: if DoD’s objection was AI safety guardrails, normal contract termination was available; applying a foreign-adversary-grade blacklist was a deliberate choice. She simultaneously opened a parallel investigation into OpenAI over a DoD contract signed one day after Anthropic was blacklisted, with answers demanded by April 6. District Judge Rita Lin heard Anthropic’s preliminary injunction motion this week. (TechCrunch, The Decoder, warren.senate.gov)

Why it matters: Warren’s letter adds congressional pressure but no enforcement mechanism. The real decision point is Judge Lin’s injunction ruling. If granted, it preserves Anthropic’s market position through the full trial; if denied, the blacklisting stands as the status quo. This story has now appeared in six consecutive weekly issues. The April 6 deadline for the OpenAI investigation response is the next concrete marker before any ruling lands.

Jensen Huang Declares “We’ve Achieved AGI”: The Definition Is Doing All the Work

On the Lex Fridman Podcast last week, Nvidia CEO Jensen Huang declared: “I think it’s now. I think we’ve achieved AGI.” His definition: a system capable of autonomously creating something of economic value, with billion-dollar AI apps cited as examples. He also pushed back on the “AI will destroy software” narrative, arguing agents will use existing software layers rather than replace them, a position that conveniently serves Nvidia, whose business depends on the full software stack surviving. Three news sources covered the statement. Zero independent AI researchers weighed in. (The Verge, The Decoder)

Why it matters: Huang’s definition is doing strategic work: under this framing, AGI is already here, uncertainty is resolved, keep buying GPUs. The implication most coverage missed is more concrete. Many OpenAI enterprise agreements include AGI clauses that trigger governance changes when AGI is reached. Huang’s economic-threshold definition, if adopted as any kind of industry standard, would mean those clauses are active today. Anyone holding contracts with AGI-contingent terms should check the specific language.

Under the Radar

[Expert-first] The Case That AI Self-Improvement Is Already “Lossy”

Nathan Lambert’s Interconnects newsletter published an analysis arguing we’re already in a phase of “lossy self-improvement.” His case: labs are optimizing coding assistants and agent swarms faster than they’re improving alignment, interpretability, or error correction. Gains in functional capability are real; the ability to catch and correct mistakes is atrophying relative to them. Lambert frames this as an oligopoly consolidation moment: a few labs pulling ahead on performance metrics while governance infrastructure lags. Lambert runs AI2’s post-training work. No mainstream tech publication covered this piece. (Interconnects)

Why you should care: The “lossy” framing is precise. Systems aren’t getting worse overall; specific capabilities are degrading even as performance metrics improve: knowing when the system is wrong, understanding what it’s optimizing for, being able to correct it once it achieves sufficient autonomy. If you’re deploying systems with scalar feedback loops, RLHF variants, or any agent that improves through repeated reward, this is the actual risk model. Not “AI goes rogue.” Lossy self-improvement.

[Expert-first] China Built an LLM Specifically for Electronic Warfare

Import AI Issue 450 (Jack Clark) reported that China has developed an electronic warfare LLM trained to analyze and disrupt enemy electronic systems. First confirmed military AI model targeting EW capabilities, as distinct from command-and-control, logistics, or drone guidance. Zero mainstream tech coverage this week. (Import AI)

Why you should care: Electronic warfare operates at radio-frequency speed, targeting radar, communications, and navigation in real time. A model trained specifically to attack these systems is a different threat category than autonomous drones or surveillance AI. It accelerates adversarial capability in a domain where the U.S. and allies have historically maintained technical superiority. Defense contractors, critical infrastructure operators, and anyone working in communications should treat this as a concrete capability signal, not a theoretical one. Clark isn’t given to sensationalism; he flagged this because it cleared his bar.

Quick Hits

  • Claude gets desktop control - Anthropic’s computer use feature for Claude Code and Claude Cowork is in preview for Pro/Max subscribers: mouse, keyboard, and app navigation on macOS. Windows x64 coming. The Decoder

  • Gemini task automation on Android phones - Limited beta on Pixel 10 Pro and Galaxy S26 Ultra for food delivery and rideshare; first functioning phone-based AI agent in production use. The Verge

  • Altman steps down from Helion board; OpenAI buying 12.5% of its power output - OpenAI is vertically integrating into its own energy supply for AI compute. TechCrunch

  • Gimlet Labs: $80M Series A for hardware-agnostic AI inference - Runs inference across Nvidia, AMD, Intel, ARM, and Cerebras simultaneously; relevant if you’re evaluating single-vendor lock-in risk. TechCrunch

  • OpenSeeker matches Alibaba AI search with 11,700 training examples - Open-source search agent challenging the assumption that search quality requires massive proprietary training data. The Decoder

  • Simon Willison on Git with coding agents - New practical guide on using Git as a safety and recovery mechanism for agentic engineering workflows. Simon Willison

  • Delve AI accused of “fake compliance” - AI compliance startup allegedly falsely certified hundreds of customers; worth tracking as AI compliance services proliferate. TechCrunch

What to Watch

The Anthropic injunction ruling. Judge Rita Lin in the Northern District of California heard Anthropic’s preliminary injunction motion this week. A ruling could come any day. If granted, Anthropic’s market access is preserved while the full case runs; if denied, the supply-chain-risk blacklisting stays in effect through trial. This is the most consequential near-term legal decision in AI policy. Watch the federal docket for the Northern District of California.

The scalar measurement problem in autonomous research. Karpathy’s autoresearch framework is the most credible demonstration yet of agents running real scientific work independently. The hard constraint is evaluation: it only works with a single quantifiable metric. Every major lab is now heading toward some version of this. The research question to follow: who solves evaluation design for non-scalar domains first. Natural language quality, alignment, interpretability: the first credible method for automatically measuring those unlocks autonomous improvement in domains that still require human judgment today.

If someone forwarded this to you, subscribe here to get it weekly.
