artwhisper

The Warp Engine: How Self-Improving Agents Change Everything

In March 2026, Andrej Karpathy's AutoResearch script ran 50 experiments overnight. The run showed what the Karpathy Loop is built from: one editable asset, one scalar metric, and a time-boxed cycle. The pattern extends beyond ML training, but teams also need to close the loop on the agent's own instructions and keep clear, version-controlled `program.md` documents.

Mar 15, 2026


Karpathy ran 50 experiments overnight with a 630-line script. The community connected the dots. Here is what the convergence means — and why taste is now infrastructure.


On the night of March 7, Andrej Karpathy pushed a Python script to GitHub and went to sleep. By morning, the agent had run 50 experiments, found a better learning rate, and committed the results to git — without a single human instruction in between. The story making the rounds is about autonomous ML research. The more important story is about what this design pattern reveals when you follow it to its logical end.

Three Primitives, Not One

AutoResearch is not magic. It is built from three precise constraints working together:

| Primitive | What it does |
| --- | --- |
| Editable asset | One file the agent may modify |
| Scalar metric | One number that defines "better" |
| Time-boxed cycle | Fixed duration, so every run is comparable |

Each constraint is doing specific engineering work. The fixed-time budget makes every experiment directly comparable. The single editable file keeps the search space interpretable — the git log becomes a legible experiment journal. The scalar metric eliminates ambiguity about what "better" means, and it must be computable without human judgment.

Together, these three things — not the GPU, not the model architecture — are what make the loop generalizable far beyond ML training.
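To make the three primitives concrete, here is a minimal sketch of a keep-if-better loop. It is not Karpathy's script: `run_loop`, `toy_propose`, and `toy_evaluate` are hypothetical names, the "asset" is a one-key dict rather than a file, and the metric is an invented curve standing in for a fixed-duration training run where lower is better.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class LoopResult:
    best_asset: dict
    best_metric: float
    # One (cycle, metric, kept) row per experiment: a stand-in for the git log.
    journal: List[Tuple[int, float, bool]] = field(default_factory=list)


def run_loop(
    asset: dict,
    propose: Callable[[dict, random.Random], dict],
    evaluate: Callable[[dict], float],
    cycles: int,
    seed: int = 0,
) -> LoopResult:
    """Keep-if-better search over one editable asset and one scalar metric.

    `evaluate` is assumed to spend the same budget on every call, so the
    numbers it returns are directly comparable. Lower is better.
    """
    rng = random.Random(seed)
    result = LoopResult(best_asset=dict(asset), best_metric=evaluate(asset))
    for cycle in range(cycles):
        candidate = propose(result.best_asset, rng)
        metric = evaluate(candidate)
        kept = metric < result.best_metric
        if kept:  # in a real repository this is where a commit would happen
            result.best_asset, result.best_metric = candidate, metric
        result.journal.append((cycle, metric, kept))
    return result


# Toy stand-ins: the asset is a single learning rate, and the metric is a
# made-up bowl-shaped curve with its minimum at lr = 0.003.
def toy_propose(asset: dict, rng: random.Random) -> dict:
    return {"lr": asset["lr"] * rng.choice([0.5, 0.8, 1.25, 2.0])}


def toy_evaluate(asset: dict) -> float:
    return (asset["lr"] - 0.003) ** 2


if __name__ == "__main__":
    out = run_loop({"lr": 0.1}, toy_propose, toy_evaluate, cycles=50)
    print(out.best_asset, out.best_metric)
```

Notice that the loop itself contains no judgment: everything it "knows" about better lives in `evaluate`, and everything it is allowed to change lives in `propose`.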

The Most Underappreciated File in the Repo

It is not train.py. It is program.md.

This single Markdown document carries three registers simultaneously: instructions (what to search for), constraints (what must not change), and stopping criteria (when to wrap up). No other common format handles all three. YAML encodes structure but not reasoning. Python is executable but not legible as a strategy. JSON has no narrative.
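One way to keep those three registers honest is to check for them before a run starts. The sketch below assumes a program.md that uses the headings "Instructions", "Constraints", and "Stopping criteria"; those heading names, the `missing_sections` helper, and the example text are all hypothetical and are not taken from the actual AutoResearch repository.

```python
import re
from typing import Dict, List

# Assumed heading names; a real program.md may organise itself differently.
REQUIRED_SECTIONS = ("instructions", "constraints", "stopping criteria")


def split_sections(markdown: str) -> Dict[str, str]:
    """Map each lower-cased Markdown heading to the text beneath it."""
    sections: Dict[str, str] = {}
    current = None
    for line in markdown.splitlines():
        heading = re.match(r"^#{1,6}\s+(.*\S)\s*$", line)
        if heading:
            current = heading.group(1).lower()
            sections[current] = ""
        elif current is not None:
            sections[current] += line + "\n"
    return sections


def missing_sections(markdown: str) -> List[str]:
    """Return the required sections that are absent or empty."""
    sections = split_sections(markdown)
    return [name for name in REQUIRED_SECTIONS if not sections.get(name, "").strip()]


# Invented example document, for illustration only.
EXAMPLE = """\
# Instructions
Search for a learning rate schedule that lowers val_bpb.

# Constraints
Only edit train.py. Do not change the dataset or the eval split.

# Stopping criteria
Stop after 50 experiments.
"""

if __name__ == "__main__":
    print(missing_sections(EXAMPLE))
```

A check like this says nothing about whether the instructions are any good. It only catches the document that forgot to say when to stop.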

"Writing a good program.md is the highest-leverage skill in an autonomous experiment loop. Not writing the training script. Not configuring the agent. The document."

The broader pattern is already visible across the tooling ecosystem. CLAUDE.md files in Claude Code repositories govern agent behaviour per codebase. Cursor rules files encode project conventions. System prompt documents are now version-controlled alongside application code. Teams across the industry are independently converging on the same discovery: structured natural language documents are the most durable way to encode human intent for autonomous agents.

The git history of program.md is as valuable as the git history of train.py. Teams that treat it as a throwaway prompt will get throwaway results.

The Pattern Generalizes

The ML training loop is the first instantiation of the Karpathy Loop — not its definition. Any system that exposes a scriptable asset, produces a measurable scalar outcome, and tolerates a time-boxed evaluation cycle is a candidate for the same pattern.

| Domain | Editable asset | Scalar metric | Fixed constraint |
| --- | --- | --- | --- |
| ML training | train.py (hyperparams) | Validation bits per byte | Dataset, eval split |
| DB query tuning | Query config file | p95 latency | Schema, benchmark dataset |
| Support routing | Routing rules / prompt | Accuracy on hold-out set | Category taxonomy |
| Agent optimisation | agent.py (prompts, tools) | Eval score (LangSmith) | Evaluation harness, test cases |
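The same four rows can be written down as data, which makes one hidden detail explicit: each metric has a direction. The sketch below is illustrative only. `LoopSpec` is a hypothetical name, and the `lower_is_better` flags are assumptions about how each metric would normally be read (lower for bits per byte and latency, higher for accuracy and eval score).

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class LoopSpec:
    domain: str
    editable_asset: str
    metric: str
    lower_is_better: bool  # assumed direction, not stated in the table
    fixed: Tuple[str, ...]

    def improves(self, candidate: float, baseline: float) -> bool:
        """True when `candidate` beats `baseline` in this metric's direction."""
        if self.lower_is_better:
            return candidate < baseline
        return candidate > baseline


SPECS = [
    LoopSpec("ML training", "train.py", "validation bits per byte", True,
             ("dataset", "eval split")),
    LoopSpec("DB query tuning", "query config file", "p95 latency", True,
             ("schema", "benchmark dataset")),
    LoopSpec("Support routing", "routing rules / prompt", "hold-out accuracy", False,
             ("category taxonomy",)),
    LoopSpec("Agent optimisation", "agent.py", "eval score", False,
             ("evaluation harness", "test cases")),
]

if __name__ == "__main__":
    for spec in SPECS:
        print(f"{spec.domain}: edit {spec.editable_asset}, hold {', '.join(spec.fixed)} fixed")
```

If a domain cannot be filled into all five fields without hand-waving, that is a hint the loop does not apply to it yet.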

Harrison Chase, co-founder of LangChain, validated this within days of Karpathy's release by publishing autoresearch-agents, which adapts the loop entirely for agent optimisation. The editable asset is agent.py. The fixed components are the evaluation harness and test dataset. The metric is automated. The loop runs overnight.

The Missing Piece: Skills That Degrade in Silence

The Karpathy Loop optimises execution. But there is a second, equally important loop that most teams leave completely open: the loop that governs the agent's own instructions.

Vasilije Trifunović (@tricalt) named the problem precisely: "Skills are usually static, while the environment around them is not." A skill that worked three weeks ago can quietly start failing when the codebase changes, when the model behaves differently, or when the kinds of tasks users ask for shift over time. In most systems, those failures are invisible until someone notices the output is worse — or it stops working entirely.

The proposed fix — cognee-skills — closes that loop:

Skill runs → Execution observed → Failures accumulate
                                          ↓
Skill updated ← Amendment evaluated ← Inspection + amend

Every change is tracked with its rationale. Self-improvement becomes a structured, auditable process — not uncontrolled drift.
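The observe, amend, evaluate cycle can be sketched in a few lines. To be clear, this is not the cognee-skills API: `Skill`, `Amendment`, `maybe_amend`, the failure threshold of 3, and the toy helpers are all hypothetical, chosen only to show how a tracked rationale and an evaluation gate fit together.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Amendment:
    version: int
    text: str
    rationale: str


@dataclass
class Skill:
    text: str
    history: List[Amendment] = field(default_factory=list)
    failures: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        if not self.history:
            self.history.append(Amendment(1, self.text, "initial version"))

    def observe(self, ok: bool, note: str = "") -> None:
        """Record the outcome of one run; only failures are accumulated."""
        if not ok:
            self.failures.append(note)

    def maybe_amend(
        self,
        propose: Callable[[str, List[str]], Tuple[str, str]],
        evaluate: Callable[[str], float],
        threshold: int = 3,
    ) -> bool:
        """Once enough failures pile up, propose an amendment and adopt it
        only if it scores strictly higher than the current text."""
        if len(self.failures) < threshold:
            return False
        candidate, rationale = propose(self.text, self.failures)
        if evaluate(candidate) <= evaluate(self.text):
            return False  # rejected: failures stay on record for the next attempt
        self.history.append(Amendment(len(self.history) + 1, candidate, rationale))
        self.text = candidate
        self.failures.clear()
        return True


# Toy stand-ins for the inspection step and the evaluation harness.
def toy_amend(text: str, failures: List[str]) -> Tuple[str, str]:
    return text + " Handle empty input.", f"{len(failures)} failures on empty input"


def toy_score(text: str) -> float:
    return 1.0 if "empty input" in text else 0.5


if __name__ == "__main__":
    skill = Skill("Summarise the ticket.")
    for _ in range(3):
        skill.observe(False, "crashed on empty input")
    print(skill.maybe_amend(toy_amend, toy_score), skill.history[-1])
```

The `history` list plays the role the git log plays in the experiment loop: every adopted change carries the reason it was made.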

Stack this on top of the Karpathy Loop and you get something qualitatively different from previous automation. The agent improves its task execution and its self-knowledge simultaneously — and every failure becomes training signal rather than invisible debt.

The Warp Engine

Traditional automation is additive. You save time linearly. What Karpathy and the cognee-skills pattern describe together is multiplicative — compounding intelligence.

The reason it functions as a warp engine is the feedback structure. Each cycle, the agent gets smarter about how to get smarter. That is not linear productivity gain. It is recursive self-improvement within a bounded domain. The program.md is the only thing keeping it from going sideways.

This is also where it becomes clear why the loop is different from previous automation waves. Old automation replaced repetitive human actions. This replaces repetitive human judgment calls — the "should I keep this change?" decisions that previously required expertise just to evaluate. When those decisions run overnight at 50x the human rate, the compound effect is dramatic.
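The additive versus multiplicative distinction is easy to state as arithmetic. The toy comparison below uses invented rates (a fixed gain of 2 units per cycle against a 2% improvement per cycle) purely to show the shape of the two curves; it is not a measurement of any real system, and both function names are hypothetical.

```python
def additive(start: float, gain_per_cycle: float, cycles: int) -> float:
    """Each cycle adds the same fixed amount."""
    return start + gain_per_cycle * cycles


def compounding(start: float, rate_per_cycle: float, cycles: int) -> float:
    """Each cycle improves on the result of the previous cycle."""
    return start * (1 + rate_per_cycle) ** cycles


if __name__ == "__main__":
    for cycles in (10, 50, 100):
        print(cycles, additive(100, 2, cycles), round(compounding(100, 0.02, cycles), 1))
```

The two start out nearly identical and then diverge, which is why the difference is easy to miss after one night and hard to miss after a month.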

The Unsolved Problem at the Center

Both loops still depend on one thing that cannot be automated: the evaluation criterion. What does "better" actually mean?

For Karpathy it is val_bpb. For cognee-skills, that question is still open. Fewer errors? User satisfaction? Task completion rate? This remains the hard, human-judgment-requiring problem. The loop is only as good as the scalar you feed it.
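A small sketch shows why the choice of scalar matters so much. The two skill versions and all of their numbers below are invented for illustration, and the names `CANDIDATES`, `SCALARS`, and `winner` are hypothetical. The point is only that the same candidates produce different winners depending on which number the loop is told to optimise.

```python
from typing import Callable, Dict

# Invented measurements for two hypothetical skill versions.
CANDIDATES: Dict[str, Dict[str, float]] = {
    "skill_v1": {"error_rate": 0.04, "completion_rate": 0.81, "satisfaction": 4.4},
    "skill_v2": {"error_rate": 0.09, "completion_rate": 0.93, "satisfaction": 4.1},
}

# Each scalar maps a candidate's measurements to one number, higher is better.
SCALARS: Dict[str, Callable[[Dict[str, float]], float]] = {
    "fewer errors": lambda m: -m["error_rate"],
    "task completion": lambda m: m["completion_rate"],
    "user satisfaction": lambda m: m["satisfaction"],
}


def winner(scalar_name: str) -> str:
    """Return the candidate the loop would keep under the given scalar."""
    score = SCALARS[scalar_name]
    return max(CANDIDATES, key=lambda name: score(CANDIDATES[name]))


if __name__ == "__main__":
    for name in SCALARS:
        print(f"{name}: {winner(name)}")
```

The loop will happily converge under any of the three. Deciding which one reflects what you actually want is the part that stays with a human.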

"The loop is a taste amplifier, not a taste generator. It cannot see outside the walls. It cannot question whether the walls are in the right place. It cannot decide the walls are ugly."

This is not a soft limitation. It is a structural one. A better camera does not make a better photographer. A faster agent running 1,000 experiments overnight only finds the best version of what you knew to ask for.

Taste Is Infrastructure

The uncomfortable corollary of all this is that AI automation widens the gap between people with good judgment and people without — it does not close it.

Someone with shallow taste plus a powerful autonomous loop optimises toward mediocrity very efficiently. Someone with deep taste and vision plus the same loop reaches the ceiling of their vision overnight instead of over years. The leverage is identical. The output quality is completely different.

The human role is not diminished in this system — it shifts. The researcher's contribution is no longer running experiments. It is experimental design: deciding what to hold fixed, what to vary, what the goal actually is. That 30-minute investment in a well-written program.md is the binding constraint on everything that follows. A researcher who cannot articulate those three things does not yet understand the problem well enough to experiment on it usefully.

The gap between "running experiments manually" and "having an agent run experiments autonomously" is smaller than most teams assume. The primary investment required is in document authorship, not infrastructure. The discipline of writing clear, constraining, version-controlled instruction documents will define which teams produce reliable results — and which produce confidently optimised noise.

As autonomous experiment loops mature from ML training into evaluation research, the skill of writing precise program.md-style documents will become genuinely scarce. Not coding. Not infrastructure. Not even ML. The ability to specify what you want with enough precision that a fast, tireless agent can find it — while knowing what you want in the first place.

In the age of autonomous loops, taste is infrastructure.


AutoResearch: github.com/karpathy/autoresearch
cognee-skills: github.com/topoteretes/cognee
Agent loop adaptation: github.com/hwchase17/autoresearch-agents