The script pointed an AI agent at Karpathy's own training code and gave it a single metric to optimise. The agent ran 700 experiments and found 20 genuine improvements, including a bug in Karpathy's attention implementation that he had missed.
It cut training time by 11% on a codebase Karpathy had already optimised for months.
The agent was not smarter than Karpathy. It was faster, more persistent and immune to boredom.
The script has since been bookmarked 41,000 times and spawned a pattern now being called the Karpathy loop. Its implications stretch well beyond training code.
That is the argument made by Nate B. Jones, the Seattle-based AI strategist and former head of product at Amazon Prime Video, in a recent episode of his AI News & Strategy Daily podcast.
Jones spent more than 15 years building products across three continents and led global initiatives in machine learning personalisation, content delivery and interactive formats for more than 200 million viewers at Amazon.
He now advises Fortune 500 leaders on translating AI breakthroughs into business value and publishes a widely read Substack on applied AI strategy.
His analysis of the Karpathy loop focuses not on the code itself but on what it means for organisations trying to deploy self-improving agents.
Three files, one metric, no fatigue
The Karpathy loop is deliberately minimal.
It consists of three files, only one of which the agent can edit.
The agent proposes a change, runs a short training experiment, checks a single metric and either commits the change or reverts it.
The human's job is to define the constraints: one file, one metric, one fixed time budget.
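The control flow described above can be sketched in a few lines. This is an illustrative skeleton, not Karpathy's actual script: `propose_change` and `run_experiment` are stand-ins for the agent's edit and the short training run, and the metric here is a random number standing in for validation loss.

```python
import random
import time

def propose_change(source: str) -> str:
    """Stand-in for the agent proposing an edit to the one editable file."""
    return source + f"\n# tweak {random.randint(0, 999)}"

def run_experiment(source: str) -> float:
    """Stand-in for a short training run returning the single metric
    (lower is better, e.g. validation loss)."""
    return random.uniform(0.0, 1.0)

def optimisation_loop(source: str, time_budget_s: float) -> tuple[str, float]:
    best_metric = run_experiment(source)    # baseline before any edits
    deadline = time.monotonic() + time_budget_s
    while time.monotonic() < deadline:
        candidate = propose_change(source)
        metric = run_experiment(candidate)
        if metric < best_metric:            # commit only on improvement...
            source, best_metric = candidate, metric
        # ...otherwise the change is simply discarded (reverted)
    return source, best_metric
```

The human sets the three constraints (the editable `source`, the metric inside `run_experiment`, and `time_budget_s`); the loop does everything else.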
A productive human researcher might manage eight to 10 experiment cycles in a working day, mostly waiting for the GPU.
The agent executes around 12 experiments an hour, roughly 100 overnight, without waiting or context-switching.
The hit rate may not be high.
The iteration rate is inhuman.
Shopify chief executive Tobi Lütke applied the same pattern to internal company data and achieved a 19% performance gain from 37 experiments in eight hours.

A separate team ran 910 experiments on a 16-GPU Kubernetes cluster in eight hours and discovered important insights about scaling model width and using faster GPUs for validation.
The total compute cost was under $300.
From training code to agent behaviour
Jones argues that the more consequential development is what happens when the loop is applied not to model weights but to the scaffolding around the model: the system prompt, tools and orchestration logic that determine how an agent behaves.
A YC startup called Third Layer did exactly this.
A meta agent rewrote the task agent's entire scaffolding overnight, claiming scores of 96.5% on the spreadsheet benchmark and 55.1% on the terminal benchmark.
Those scores have not appeared on official leaderboards and remain unverified.
The highest verified spreadsheet benchmark entry is Opus 4.6 at 34%.
But the direction matters more than the specific numbers.
The pattern works by splitting the system into two agents.
The meta agent acts as a harness engineer, reading failure traces from the task agent and diagnosing what went wrong.
It then modifies the harness and runs the benchmark again.
Same-model pairings dramatically outperform cross-model pairings because the meta agent has an implicit understanding of how the inner model reasons.
The team observed behaviours that were never programmed, including the meta agent inventing spot-checking, forced verification loops and formatting validators.
It also discovered progressive disclosure and built task-specific sub-agents on its own initiative.
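The two-agent split can be made concrete with a toy sketch. Third Layer's actual system is not public, so the benchmark, failure traces and harness edits below are invented stand-ins; the point is only the shape of the loop, in which the meta agent reads failure traces and patches the harness the task agent runs inside.

```python
def run_task_agent(harness: dict, tasks: list[str]) -> list[str]:
    """Run the inner agent and return failure traces (empty = all passed)."""
    failures = []
    for task in tasks:
        # Toy rule: the task agent only 'succeeds' when the harness
        # already contains an instruction matching the task's keyword.
        if task not in harness["system_prompt"]:
            failures.append(f"failed: no instruction for '{task}'")
    return failures

def meta_agent_step(harness: dict, failures: list[str]) -> dict:
    """Read the failure traces, diagnose, and patch the harness."""
    patched = dict(harness)
    for trace in failures:
        missing = trace.split("'")[1]       # extract the missing keyword
        patched["system_prompt"] += f" Always verify {missing} output."
    return patched

tasks = ["spreadsheet", "terminal"]
harness = {"system_prompt": "You are a careful task agent."}
for _ in range(3):                          # outer optimisation loop
    failures = run_task_agent(harness, tasks)
    if not failures:
        break
    harness = meta_agent_step(harness, failures)
```

In a real system the diagnosis step is itself a model call, which is where the same-model pairing advantage Jones describes comes in: the meta agent is reading traces produced by a reasoning process it implicitly understands.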
The infrastructure most companies lack
The pattern sounds simple.
The prerequisites are not.
Jones is clear-eyed about this.
Auto-improvement requires a context layer, domain understanding, evaluation infrastructure, governance and technical depth that most organisations have not yet built.
The most foundational problem is memory.
Without structured external memory, every agent session reinvents its definition of done and guesses at what happened before.
A meta agent optimising a task agent with no persistent memory and no structured state is optimising in the dark.
The optimisation loop is only as good as the infrastructure underneath it.
For most organisations, Jones says, that infrastructure is currently held together with conversation history and hope.
Evaluation is the second gap.
Most teams struggle to write a reliable evaluation suite.
They measure activity instead of outcome, use metrics that do not correlate with business results, or lack testing infrastructure entirely.
Auto-improvement without accurate scoring is not improvement.
It is drift.
Then there is governance.
Who owns the output?
Who reviews changes?
Who decides what gets promoted to production?
Organisations that already struggle with these questions will not find them easier when an agent is running 100 experiments a night.
Small teams hold the structural advantage
Karpathy's auto-research was built by one person.
SkyPilot scaled auto-research for under $300 in compute.
The iteration speed advantage of small teams is multiple orders of magnitude, according to Jones.
A team of three to five people that can define metrics, build harnesses and let the loop run will move faster than an enterprise team that needs months to spec, approve and execute a similar optimisation cycle.
Enterprise teams can close the gap, but only if leaders intentionally remove obstacles and empower small groups to move quickly.
The failure patterns associated with agents are often a product of organisational complexity, not technical limitation.
The safety problem is quiet, not dramatic
In a business context, Jones argues, the safety concerns around self-improving agents are not about intelligence explosions.
They are about quiet, specific failure modes.
Agents can game metrics, producing inflated scores that do not reflect real capability.
They can optimise in ways that harm customer trust or find creative interpretations of rules that a human reviewer would flag immediately.
Silent degradation is particularly dangerous.
Subtle policy drift and quality erosion can persist undetected when monitoring infrastructure is inadequate.
Contamination is another risk, where the agent's optimisation loop influences the data it is being evaluated against.
A bad optimisation in one system can cascade into interconnected business processes, compounding errors across the organisation.
The mitigations are straightforward in principle: tight loops, clear baselines, version control, the ability to revert any change, and a human inspecting the results.
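One way to make those mitigations concrete is a guard that compares every candidate change against a pinned baseline and decides whether to promote, hold or revert. This is an illustrative sketch, not a prescribed design; the baseline and tolerance values are hypothetical.

```python
BASELINE = 0.82      # hypothetical pinned metric for the current version
TOLERANCE = 0.01     # regression allowed before automatic revert

def guard(candidate_metric: float,
          baseline: float = BASELINE,
          tolerance: float = TOLERANCE) -> str:
    """Return the action the loop should take for a candidate change."""
    if candidate_metric >= baseline + tolerance:
        return "promote"    # clear improvement: queue for human review
    if candidate_metric >= baseline - tolerance:
        return "hold"       # within noise: keep the current version, log it
    return "revert"         # regression: roll back immediately
```

Note that "promote" still routes through a human reviewer; the guard automates only the revert side, which is where silent degradation would otherwise accumulate.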
The honest path forward
The Karpathy loop is a graduate-level capability.
It demands that an organisation has already solved basic agent deployment before attempting self-improvement.
Jones recommends starting by picking one business system and defining what he calls the Karpathy triplet: one editable surface, one metric and one time budget.
Build the scoring function so it accurately reflects business value.
Write a test suite that covers failure modes.
Create a sandboxed execution environment where experiments can run without touching production.
Start with systems where failure is cheap, not customer-facing workflows or compliance processes.
Design for auditability from the beginning, logging every experiment, every edit and every metric trajectory.
The experiment log becomes institutional knowledge about which optimisations work in a specific domain.
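A minimal version of that log is an append-only record, one JSON object per experiment. The field names below are illustrative; what matters is that every edit, metric and commit decision is captured in a form that can be queried later.

```python
import io
import json
import time

def log_experiment(log: io.TextIOBase, edit: str, metric: float,
                   committed: bool) -> None:
    record = {
        "ts": time.time(),        # when the experiment ran
        "edit": edit,             # what the agent changed
        "metric": metric,         # the single score it produced
        "committed": committed,   # whether the change was kept
    }
    log.write(json.dumps(record) + "\n")   # one JSON object per line

# In production this would be a file; a StringIO keeps the sketch self-contained.
buf = io.StringIO()
log_experiment(buf, "reduce attention dropout", 0.412, True)
log_experiment(buf, "swap optimiser", 0.455, False)
```

The JSON Lines shape makes the log trivially appendable and greppable, which suits its role as institutional knowledge rather than a dashboard.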
Most organisations will underinvest in evaluation infrastructure because it does not produce visible outputs.
That is a mistake.
Automation is not possible without scoring.
The role of humans in this system is not lower-skilled.
It is higher-leverage, requiring deep domain knowledge, clear thinking about metrics and the ability to spot when an agent is gaming the system.
Anthropic and OpenAI are already pursuing the same loop at larger scale, with ambitions to build fully recursive research systems.
The open-source versions operate on smaller systems with narrower objectives, but the mechanism is identical.
The tools are available now as MIT-licensed open-source projects.
The infrastructure to support them is what most organisations still need to build.
Jones believes the companies that get this right in the next six months will build compounding advantages that are difficult to reverse.
The ones that skip the prerequisites will fail in ways that are equally instructive.