This Article Might Make You Upset
All posts
AI AutoResearch Agentic Coding Productivity Software Engineering Claude Code

This Article Might Make You Upset

Tim IllguthMarch 28, 202610 min read

AI isn't going to take your job but it might make you obsolete. How can both of these statements be true?

On March 7, 2026, Andrej Karpathy quietly released a tool to GitHub that you need to know about. It is 630 lines of Python, MIT-licensed, and runs on a single GPU. It has no slick landing page. There was no product launch. Just a repository, a README, and a pattern that changes what a developer can accomplish in a single night.

The tool is called AutoResearch. And if you understand what it actually does, you will understand why some developers right now have what feels like a legitimate superpower — and why the gap between them and everyone else is compounding every week.


The Ratchet Loop

To understand AutoResearch, you need to understand the pattern at its core. Karpathy calls it the ratchet loop, and it is elegantly simple.

Here is the full picture:

  1. A human provides high-level direction in a Markdown file
  2. An AI agent proposes a change to the codebase
  3. The agent runs a fast, fixed-time evaluation against a strict metric
  4. If the change improves the metric without degrading anything else — it gets committed to git
  5. The loop starts over, immediately, and runs indefinitely

The key design decision — the one that makes everything work — is that the evaluation harness is immutable. The agent cannot change the rules of the scorecard. It cannot game the metric. It can only find genuine improvements that satisfy the criteria you defined at the start.

This is a direct nod to Goodhart's Law: when a measure becomes a target, it ceases to be a good measure. By making the eval harness untouchable from inside the loop, Karpathy built a system where every committed change represents a real, validated win — not an agent that learned to look good on paper.

The result is an AI running your iterative experimental work at silicon speed. Not assisting you. Running it. Autonomously. Overnight.


What It Produced

Karpathy ran AutoResearch on a single-GPU nanochat LLM training setup — code he had already spent serious personal time optimizing. Not a fresh codebase with obvious wins waiting to be found. His own polished work.

The AI agent executed 650 to 700 short experiments over approximately two days.

It discovered and validated ~20 genuine, additive improvements: normalization scaling oversights, regularization tweaks on value embeddings, AdamW beta adjustments, initialization changes, attention modifications. These improvements stacked cleanly and transferred from a small depth-12 model up to a larger depth-24 model.

The measurable result: an ~11% improvement on the public "Time to GPT-2" leaderboard benchmark.

On code a world-class ML researcher had already optimized.

The agent also found subtle bugs Karpathy had missed for months.

Think about what that means. Highly tuned, expert-level code still had low-hanging fruit that a tireless, systematic loop found in 48 hours. No meetings. No context switching. No calling it a day at 5pm. Just relentless iteration against a defined goal.


The Karpathy Loop Goes Viral

The community response was immediate.

The GitHub repo hit ~50,000 stars and 8,000 forks within days. Media coverage from Fortune, VentureBeat, and others followed quickly. The pattern got a name: the Karpathy Loop.

Shopify CEO Tobias Lütke adapted it overnight. After 37 autonomous experiments, his implementation produced a 0.8B model that outperformed his hand-tuned 1.6B model — a +19% quality and speed improvement. He said he learned more from watching the agent reason through one night of experiments than from months of following ML papers.

Then developers started applying the pattern outside of ML entirely:

  • Prompt optimization — iterating LLM prompts against measurable output quality metrics
  • Code performance — finding micro-optimizations in hot paths automatically
  • Marketing copy — running experiments on email subject lines and ad variants with conversion as the eval
  • Business workflows — any process with a fast, quantifiable feedback loop

The insight that spread: give an agent a clear scoreboard and a safe iteration loop, and it will drive relentless improvement while you sleep. The domain almost does not matter. The pattern works anywhere you can define a goal and measure progress toward it.


This Is What Agentic Coding Actually Means

AutoResearch is the most dramatic example of a broader shift in how software gets built. That shift is called agentic coding — and it is not what most people think it is.

Agentic coding is not using an AI chat window to answer questions. That is assisted Googling with better formatting. It captures maybe 10% of what these tools can do.

Agentic coding means your AI is operating autonomously inside your development environment — reading your files, writing code, running tests, catching its own errors, and iterating — while you operate at the level of goals, architecture, and review.

Compare the three modes:

The old way: You write a function. You get stuck. You paste the error into ChatGPT. You read the explanation. You go back to your editor and fix the line. Repeat.

The agentic way: You describe the feature, the constraints, the data model, and the edge cases you care about. Your agent reads the relevant files in your codebase, writes the implementation, runs the tests, finds the failing cases, fixes them, and surfaces the result for your review.

The AutoResearch way: You define the goal. You define the metric. You define the eval harness. You go to sleep. You wake up to a git history of validated improvements.

Each mode is not just faster — it is operating at a fundamentally different level. The agent is not helping you execute. You are directing an agent that executes autonomously.


Why This Creates a Superpower Gap

Before AutoResearch, the bottleneck in iterative development was human time. You formed a hypothesis, you implemented it, you waited for results, you interpreted them, you formed the next hypothesis. Each cycle took hours, days, sometimes weeks.

The constraint was how fast a human brain could move through the loop.

AutoResearch removes that constraint for any problem you can define a measurable evaluation for.

The agent does not get tired after experiment 47 produces nothing. It does not stop at 5pm. It does not need a meeting to decide whether to try the next hypothesis. It just runs.

Developers who know how to set this up — who know how to define a real goal, build a trustworthy eval, and let an agent iterate against it — are not 20% more productive. They are operating at a categorically different speed. A weekend's worth of agentic iteration can cover ground that would take a non-agentic developer months.

Snail versus racecar is not an exaggeration. They are both going the same direction. One of them arrives on Friday. The other arrives next year.


The One Thing the Agent Cannot Do

Here is the part that makes this a superpower for developers rather than a replacement of them.

The agent cannot define the goal. It cannot decide what is worth measuring. It cannot build the eval harness and protect it from being gamed. It cannot tell you whether the improvements it found are the right improvements for your actual users.

That judgment is the job. And it requires knowing your domain.

The developers getting the most out of AutoResearch and agentic coding are not the ones who handed everything to an AI. They are the ones who understood their problem deeply enough to define it precisely — and then got out of the agent's way.

Karpathy did not just point AutoResearch at a codebase and walk away. He wrote the Markdown file. He chose the benchmark. He designed the eval so the agent could not cheat it. The 700 experiments were the agent's work. The setup was his.

That setup is the leverage point. Write it well, and you get 700 validated experiments. Write it poorly, and you get 700 confidently wrong ones.


How to Start Using This Today

You do not need a research lab or a GPU cluster to apply the ratchet loop pattern. The core idea — define a goal, build a fast eval, let an agent iterate — works in any domain with a measurable feedback loop.

The tools that make agentic coding real right now:

Claude Code — Terminal-based agent with deep filesystem access, git integration, and the ability to run commands. Reads a CLAUDE.md context file automatically. Strong for complex multi-file tasks.

Cursor — VS Code fork with agent mode for multi-file edits. Reads .cursorrules for standing context. Strong for developers who want to stay in a familiar editor.

Agent-Zero — A self-hosted agentic coding platform that works with any LLM. It provides a framework for defining goals, metrics, and safe iteration loops.Run with Ollama and your local models to keep your data private and you from runnng out of tokens. This is where you really move to the next level.

The differentiator is not which tool is smartest. It is which one you have given the best context. A well-contextualized agent in any of these tools outperforms a poorly-contextualized agent in the best one.

Start with a context file. Write down what your project is, how to build and test it, and what the agent should never touch without asking. Then give it a goal specific enough that you can tell whether it succeeded. Then get out of the way.

The racecar is real. It is available today. And the gap between developers running it and developers who are not is getting wider every week.


Karpathy has framed AutoResearch as "the final boss battle" for frontier AI labs — eventually all major labs will run swarms of these agents, promote promising ideas to larger scales, and collaborate across branches. That is complex engineering. But as he put it: it is just engineering, and it is inevitable.

For individual developers, the version of that future is already here. It fits in 630 lines of Python. It runs on your GPU. It commits to your git repo while you sleep.

The developers who understand this are not waiting for the future. They are already in it.