Karpathy Autoresearch: 700 AI Experiments in 2 Days With One GPU

By Beau Johnson·March 23, 2026·13 min read

Andrej Karpathy open sourced a 630-line Python script that runs hundreds of ML experiments while you sleep. One GPU. No human babysitting. The agent modifies training code, checks if the model improved, keeps what worked, throws away what didn't, and loops. Karpathy ran 700 experiments in 48 hours and found 20 improvements in code he thought was already optimized. Including a bug he'd missed for months.

That's the short version. Here's why this matters for way more than machine learning.

  • 700 experiments in 2 days vs. a human researcher doing 8-10 per day
  • 20 genuine improvements found in already-optimized code
  • 11% efficiency gain on Karpathy's time-to-GPT-2 benchmark
  • 3 files, 630 lines total. The whole repo.
  • MIT licensed and fully open source on GitHub

Who Is Andrej Karpathy and Why Does This Matter?

Quick context if you're not deep in AI Twitter. Karpathy was the head of AI at Tesla. He built the neural networks behind Autopilot. He was a founding member of OpenAI. He created nanoGPT, which basically taught the internet how language models work under the hood. When this guy releases something, people pay attention.

Autoresearch hit GitHub in early March 2026 and exploded. 26,000 stars in under a week. VentureBeat covered it. Fortune wrote about "The Karpathy Loop." Reddit's r/singularity had a field day. And Shopify's CEO tried it the same night it dropped.

The reason everyone lost it? Because the idea is stupid simple. And it works.

How Autoresearch Actually Works (3 Files, That's It)

The entire repo is basically three files. No massive framework. No dependency hell. Three files.

prepare.py sets up the training data. You don't touch it.

train.py is 630 lines of training code. This is the only file the AI agent modifies. On each experiment, the agent reads this file, forms a hypothesis, makes a change, runs training for exactly 5 minutes, and checks the result.
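That read-hypothesize-edit-train-check loop can be sketched in a few lines of Python. Everything below is a hypothetical stand-in for the agent's real tooling: `propose_change` and `run_training` are made-up names, not functions from the autoresearch repo.

```python
import random

def propose_change(code: str) -> str:
    """Agent forms a hypothesis and edits train.py (stubbed as a string)."""
    return code + f"  # tweak {random.randint(0, 999)}"

def run_training(code: str) -> float:
    """Train under the fixed 5-minute budget, return val_bpb (stubbed)."""
    return random.uniform(0.9, 1.1)

def autoresearch_loop(code: str, n_experiments: int) -> tuple[str, float]:
    # Establish a baseline first, so every change has something to beat.
    best_code, best_bpb = code, run_training(code)
    for _ in range(n_experiments):
        candidate = propose_change(best_code)
        bpb = run_training(candidate)
        if bpb < best_bpb:  # lower val_bpb is better
            best_code, best_bpb = candidate, bpb  # keep the win
        # otherwise discard the change: no sunk costs, just the metric
    return best_code, best_bpb

code, bpb = autoresearch_loop("train.py contents", n_experiments=20)
```

The whole design is in that `if` statement: improvements compound because every experiment starts from the best version found so far.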

program.md is where it gets brilliant. This is a plain English markdown file that tells the agent what to explore. Think of it as research directions you'd give a PhD student. You're not writing code anymore. You're writing instructions.
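To make that concrete, here's the shape such a file might take. This is an illustrative sketch, not the actual file from Karpathy's repo:

```markdown
# Research directions

## Goal
Minimize val_bpb. Every run gets exactly 5 minutes of GPU time.

## Directions to explore
- Learning rate schedules: warmup length, cosine vs. linear decay.
- Attention: head counts, scaling factors, embedding variants.
- Initialization: default vs. scaled init on the output projection.

## Constraints
- Only modify train.py.
- Keep a change only if val_bpb goes down.
```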

| Component | What It Does | Who Touches It |
| --- | --- | --- |
| prepare.py | Data preprocessing | Nobody (set and forget) |
| train.py | 630 lines of training code | The AI agent only |
| program.md | Research directions in plain English | You (the human) |

Karpathy put it perfectly. The human's job is no longer writing training code. The human's job is writing research directions. You're the research advisor. The agent is the PhD student who never sleeps and never complains.

The Design Choices That Make It Work

There are some really intentional constraints baked into autoresearch that make the whole thing click.

Fixed 5-minute time budget. Every experiment gets exactly 5 minutes of GPU time. That means all results are comparable. You can actually tell if one change was better than another because conditions were identical. Like a controlled science experiment, not guesswork.

One metric: val_bpb. Validation bits per byte. Lower is better. That's it. Did the number go down? Keep the change. Did it go up? Throw it away. The agent doesn't need to understand why something works. It just needs to know if the number moved in the right direction.
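For a byte-level model, bits per byte is just the average cross-entropy over the validation set converted from nats to bits. A minimal sketch (the function name and setup are mine, not the repo's):

```python
import math

def val_bpb(total_nll_nats: float, num_bytes: int) -> float:
    """Validation bits per byte: total negative log-likelihood in nats,
    averaged per byte and converted to bits (divide by ln 2)."""
    return total_nll_nats / (num_bytes * math.log(2))

# Sanity check: a model that assigns every byte probability 1/256
# should score log2(256) = 8 bits per byte.
uniform = val_bpb(total_nll_nats=1000 * math.log(256), num_bytes=1000)
```

A single scalar like this is what lets the loop run unsupervised: the comparison is always "did this number drop?", never a judgment call.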

Single file modification. Only train.py gets changed. This constraint is actually what makes it powerful. When you limit what can change, the agent has to be creative within those boundaries. You're not giving it the keys to the whole house. You're giving it one room and saying make this room as good as possible.

These three constraints (fixed time, single metric, one file) create the perfect sandbox for autonomous experimentation. Remove any one of them and the whole thing falls apart. Unbounded run time and results stop being comparable. No clear metric and the agent can't tell progress from noise. Too many files to change and it wastes time on dead ends.

Real Results: What Karpathy and Others Actually Found

Here's where it gets wild. Karpathy had been hand-tuning this code for months. This is a guy with 20 years of deep learning experience. The agent found 20 genuine improvements in code he thought was done.

One of those improvements was a bug in his attention implementation. A missing scalar multiplier that was making attention too spread out across heads. He'd missed it. For months. The AI caught it because it doesn't get tired, doesn't get distracted, and doesn't decide to get coffee after the fifteenth failed experiment.

Stacking all improvements together dropped his time-to-GPT-2 metric from 2.02 hours down to 1.80 hours. That's an 11% efficiency gain on code that one of the best researchers in the world already considered optimized.

Shopify CEO Tobi Lutke applied the same pattern to an internal query expansion model overnight. He woke up to a 0.8 billion parameter model scoring 19% higher than his previous hand-tuned 1.6 billion parameter model. A smaller model beat one literally twice its size because the agent optimized architecture for his specific hardware instead of defaulting to "bigger is better."

Hyperspace AI distributed the pattern across a peer-to-peer network. On March 8th, 35 autonomous agents ran 333 experiments unsupervised. Agents on big H100 GPUs used brute force. Agents on regular laptops with just CPUs had to get creative, focusing on initialization strategies and normalization choices. They shared discoveries in real time via a gossip protocol. When one agent found an initialization technique that dropped loss by 21%, that discovery spread through the network. Within hours, 23 other agents had incorporated it.

In 17 hours, these agents independently rediscovered ML milestones that took human researchers at Google Brain and OpenAI nearly 8 years to formalize.

This Pattern Works Way Beyond Machine Learning

Even if you never train a language model. Even if you don't have a GPU. Even if PyTorch makes your eyes glaze over. This pattern matters for you.

Think about what's actually happening. You have an agent. You give it a clear goal. A measurable metric. Constraints. And you let it run experiments, keep what works, throw away what doesn't, compound improvements over time.

That's not just ML research. That's everything.

| Business Area | What to Test | The Metric | How to Iterate |
| --- | --- | --- | --- |
| Landing pages | Headlines, CTAs, layout | Conversion rate | A/B test variants, keep winners |
| Email marketing | Subject lines, send times | Open rate / click rate | Batch test, promote best performers |
| Content strategy | Topics, formats, hooks | Views / engagement | Publish, measure, double down on what hits |
| Pricing | Price points, tiers, anchors | Revenue per visitor | Test cohorts, track conversion × revenue |
| Prompt engineering | System prompts, temperature, structure | Output quality score | Eval loop with rubric, keep highest scores |
| YouTube thumbnails | Text, colors, expressions | Click-through rate | Upload variants, measure CTR at 48h |

The feedback loop is the product. That's the real insight. Karpathy just demonstrated it with neural network training. But the loop is universal. Clear objective. Measurable metric. Constrained experiments. Autonomous iteration. Compounding gains.

A community member on the Claude Code subreddit already built a skill that applies this exact pattern to any task, not just ML. If you have something you can measure, you can autoresearch it.

How to Get Started With Autoresearch Today

The repo is MIT licensed and fully open source. Here's what you need.

If you have an NVIDIA GPU: Clone the repo, install PyTorch, and run it tonight. The README walks you through setup in about 10 minutes. You'll need a coding agent (Claude Code, Codex, or similar) that can read a markdown file and modify a Python script.

If you're on Mac (Apple Silicon): The community has forked it with MPS support. Same workflow, slightly different setup.

If you have no GPU at all: You can still apply the pattern. Write a program.md for your specific optimization task. Set up a metric. Use any AI coding agent to run the experiment loop. The GPU part is only needed for the ML training use case. The pattern itself just needs a metric and an agent.

Step by step:

  1. Identify one thing you want to optimize that has a clear, measurable metric
  2. Write a program.md describing what the agent should explore and what constraints to respect
  3. Set up the metric measurement (even if it's manual at first)
  4. Let the agent run. Start with 10 experiments. Then 50. Then let it go overnight.
  5. Review results in the morning. Stack the winners.
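The five steps above are the same keep-the-winner loop regardless of domain. A domain-agnostic sketch, where `propose` and `measure` stand in for whatever variant generator and metric you wire up (made-up names, not part of any library):

```python
from typing import Callable, TypeVar

T = TypeVar("T")

def optimize(baseline: T,
             propose: Callable[[T], T],
             measure: Callable[[T], float],
             n_experiments: int) -> tuple[T, float]:
    """Generic autoresearch loop (higher measure() is better here).
    Establish a baseline, try variants, keep only what improves it."""
    best, best_score = baseline, measure(baseline)
    for _ in range(n_experiments):
        candidate = propose(best)
        score = measure(candidate)
        if score > best_score:  # keep the winner, discard the rest
            best, best_score = candidate, score
    return best, best_score

# Toy example: grow a string toward a target length of 40 characters.
best, score = optimize(
    baseline="Hello",
    propose=lambda s: s + "!",
    measure=lambda s: -abs(len(s) - 40),  # closer to 40 scores higher
    n_experiments=50,
)
```

Swap in a real `measure` (a conversion rate, a rubric score, a latency number) and a real `propose` (an AI agent drafting variants) and you have the pattern.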

The hardest part isn't the technical setup. The hardest part is having the discipline to actually throw away what doesn't work instead of getting emotionally attached to it. The agent doesn't have that problem. It doesn't care about sunk costs. If the number goes the wrong direction, it moves on. No ego. Just the metric.

When Autoresearch Is NOT the Right Tool

Let's be real. This pattern doesn't work for everything.

If your metric is fuzzy or subjective, autoresearch won't help. "Does this feel better?" isn't a metric. You need something the agent can measure without human judgment on every iteration.

If experiments are expensive or slow, the loop breaks down. Autoresearch works because each experiment takes 5 minutes. If your experiments take 3 hours each, you'll get maybe 4 overnight instead of 100. Still useful, but the compounding effect is much weaker.

If you're optimizing for multiple competing objectives, a single metric loop won't cut it. Trading off accuracy vs. speed vs. cost requires human judgment about which tradeoff is acceptable. The agent can optimize one at a time, but multi-objective optimization needs a human in the loop to set priorities.

If you don't have a baseline to compare against, start there first. You need to know where you are before you can measure improvement. Run your current setup, establish a benchmark, then let the agent try to beat it.

The API costs are worth watching too. Running 100+ experiments means 100+ agent calls. With Claude or GPT-5.4 as the coding agent, expect $5-30 per overnight session depending on complexity. Not huge, but not zero.
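Back-of-the-envelope, the math is simple. All figures here are illustrative, not measured prices:

```python
def overnight_cost(n_experiments: int,
                   cost_per_call: float,
                   calls_per_experiment: int = 1) -> float:
    """Rough API spend for a session: each experiment needs at least
    one agent call to read the code, hypothesize, and edit."""
    return n_experiments * calls_per_experiment * cost_per_call

# 100 experiments at $0.05-$0.30 per call spans the $5-$30 range above.
low = overnight_cost(100, 0.05)
high = overnight_cost(100, 0.30)
```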

The Bigger Picture: What This Means for Builders

Karpathy wrote a sci-fi intro to the repo that honestly gave me chills. He described a future where research is "entirely the domain of autonomous swarms of AI agents running across compute cluster megastructures in the skies." Human researchers don't do the experiments anymore. They set the direction. They write the program.md. They decide what questions are worth asking. The agents do the actual work of answering those questions at a speed no human team could match.

Some people hear that and get scared. I hear it and get excited. Because this is the exact opportunity I've been talking about since day one. The people who learn to set up these loops, who learn to think in terms of metrics and experiments and autonomous iteration, those are the people who are going to win.

It doesn't matter if you can't code. It doesn't matter if you don't have a CS degree. What matters is understanding the pattern. Clear objective. Measurable metric. Constrained experiments. Let the agent iterate. Keep what works.

That's the formula. Whether you're training a language model or testing YouTube thumbnails or optimizing your email subject lines. The loop is the same.

FAQ

What is Karpathy's autoresearch?

Autoresearch is an open source tool by Andrej Karpathy that lets an AI agent autonomously run machine learning experiments on a single GPU. The agent modifies training code, measures results using validation bits per byte (val_bpb), keeps improvements, and discards failures. It ran 700 experiments in 2 days on Karpathy's setup.

Do I need an NVIDIA GPU to run autoresearch?

The original repo targets NVIDIA GPUs, but the community has forked it for Mac (Apple Silicon via MPS), Windows, and AMD GPUs. You also need a coding agent like Claude Code, Codex, or any agent that can read markdown and modify Python scripts.

Can autoresearch be used for things other than ML training?

Yes. The core pattern (clear metric, constrained experiments, keep winners, discard losers) applies to any optimization task. People are already adapting it for landing page testing, prompt engineering, pricing experiments, and content optimization. You need a measurable metric and a willingness to let the agent iterate.

How much does it cost to run autoresearch overnight?

GPU compute is free if you own the hardware. The main cost is the AI agent API calls. Running 100 experiments overnight with Claude Code or Codex typically costs between $5 and $30 depending on model choice and experiment complexity.

What results did Karpathy get from autoresearch?

Karpathy found 20 genuine improvements in code he had already hand-tuned for months, including a bug in his attention implementation. Stacking all improvements dropped his time-to-GPT-2 metric from 2.02 hours to 1.80 hours, an 11% efficiency gain. Shopify CEO Tobi Lutke used the same pattern and got a 19% performance gain with a model half the size of his previous one.

If you want to learn how to set up autonomous loops like this for your own business, that's exactly what we do inside Shipping Skool. Over a hundred members building with AI agents every day. Live calls six times a week. Real people shipping real products. Come check it out.

Ready to start building with AI?

Join Shipping Skool and ship your first product in weeks.

Join Shipping Skool