AI is starting to improve AI and that changes the picture

People are not imagining the acceleration. The pace of AI progress feels shocking because the curve is no longer showing up in just one place. It is showing up in benchmark scores, in the cost of inference, in the quality of coding agents, in the spread of AI tools across real work, and in the growing ability of one model to critique, distill, scaffold, or evaluate another. Stanford’s 2025 AI Index reported sharp year-over-year gains on demanding benchmarks such as MMMU, GPQA, and SWE-bench. The same report, together with Epoch AI’s tracking, also showed that the cost of getting useful performance from frontier models has fallen hard, not gradually. METR’s work on task-completion time horizons adds a second lens: frontier agents have been getting better at handling longer software tasks at a pace that looks exponential over the last several years.

That still leaves the harder question behind the headlines. Is AI improving AI? Yes, but not in the dramatic sci-fi sense that many people jump to first. What is happening now is more concrete and more uneven. AI already helps build better AI through self-play, architecture search, distillation, synthetic data generation, reinforcement from AI feedback, code generation, evaluation, and research assistance. What it does not yet do reliably is replace the full stack of human research judgment across open-ended problems. The field has moved from “AI as a tool for drafting text” to AI as a tool that can actively participate in model development loops, but those loops still depend on human framing, infrastructure, and verification.

The clearest way to understand the present moment is to hold two ideas in your head at once. First, real acceleration is happening. Second, the strongest stories about autonomous recursive self-improvement remain ahead of the evidence. That gap matters. It is where the best analysis lives, and it is where a lot of sloppy writing about AI falls apart.

The feeling of acceleration is grounded in real signals

A few years ago, it was still possible to dismiss most claims about AI progress as demo theater. That is harder now. Benchmark scores improved fast enough in 2024 that even a skeptical reader has to take notice. Stanford’s 2025 AI Index noted major gains on MMMU, GPQA, and SWE-bench within roughly a year, and it also documented that the frontier has compressed: the distance between the very top model and the tenth-ranked model on some public leaderboards narrowed as more labs reached similar performance bands. That is what a maturing competitive race looks like. The field is no longer waiting for a single lab to make a solitary leap. Several labs are moving at once.

The economic side tells the same story in a different language. Stanford’s chapter on research and development showed that the inference cost for GPT-3.5-level language performance on MMLU fell from $20 per million tokens in November 2022 to $0.07 by October 2024 for Gemini-1.5-Flash-8B. That is not a nice-to-have efficiency gain. It is a collapse in the price of useful capability. Epoch’s broader trend tracking says training compute for frontier language models has been growing at roughly 5x per year since 2020, while pre-training compute efficiency has been improving at around 3x per year. The industry is spending more and getting better at turning spend into capability.

Then there is task length. METR’s work is useful because it cuts through the fog of one-off anecdotes and asks a simpler question: how long is the task that an agent can complete with a given level of reliability, measured by human expert time? Their March 2025 result suggested a doubling time of roughly seven months for the 50% time horizon over the prior six years. The 2026 time-horizons work keeps pushing the field toward a more grounded way of thinking about autonomy. The key insight is that autonomy is not binary. It stretches as models learn to survive longer chains of work without breaking.
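To make that doubling-time claim tangible, here is the arithmetic in a few lines. The starting horizon is a made-up number for illustration; only the seven-month doubling time comes from METR's reported trend:

```python
def horizon(h0_minutes, months_elapsed, doubling_months=7.0):
    """Project a 50% task-completion time horizon under a fixed doubling time."""
    return h0_minutes * 2 ** (months_elapsed / doubling_months)

# Hypothetical starting point: a 60-minute horizon today.
print(round(horizon(60, 21)))  # three doublings in 21 months -> 480 minutes
```

The point of the exponent is that the change compounds quietly: three doublings turn an hour-long task horizon into a full working day.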

This is why people feel whiplash. It is not just that the models answer questions better. It is that several older bottlenecks are moving together. Models are better. Tool use is better. Scaffolding is better. Pricing is lower. Deployment is wider. Open-source software ecosystems remain active and enormous, with Stanford reporting roughly 4.3 million AI-related GitHub projects in 2024 and 17.7 million annual stars on AI projects that year. Progress is easier to notice when it stops being confined to labs and starts flowing through the tools people touch every day.

Several curves are stacking on top of each other

The public conversation about AI often tries to find one master cause. It is the chips. It is the data. It is the algorithm. It is the product wrapper. That instinct makes the story easier to tell, but it hides what is happening. AI is speeding up because several feedback loops are now reinforcing each other.

One loop is straight compute. Frontier labs keep pushing larger training runs, and the money behind those runs has become enormous. Epoch estimates that frontier language-model training compute has been growing at about 5x per year since 2020, with training cost doubling roughly every seven months across the period. Those are brutal numbers. They explain why the frontier is concentrated in a small set of well-funded labs and cloud providers. They also explain why even modest algorithmic improvements now matter so much: an efficiency gain applied to a massive training or inference budget turns into a strategic advantage very quickly.

A second loop is algorithmic efficiency. Cheaper capability is not only the result of more hardware. The Stanford AI Index chapter on inference cost makes the point cleanly: when you hold performance constant and track how much it costs to achieve that performance, the price has fallen drastically. That means teams are not just buying more compute. They are extracting more value from it. The market does not experience “progress” as abstract FLOP growth. It experiences progress as stronger systems becoming affordable enough to deploy everywhere.

A third loop is model-assisted engineering. This is where the question of whether AI improves AI lands most directly. Coding assistants, evaluation systems, distillation pipelines, and research agents all shrink the cost of trying ideas. They do not need to be perfect to change the tempo. A model that helps a team write benchmark harnesses, generate tests, summarize failed runs, explore hyperparameters, draft experimental code, or create synthetic training data can push more experiments through the pipeline per week. Even where the model is wrong often enough to require checking, it can still widen the funnel of attempted ideas. That is part of why RE-Bench and PaperBench matter: they reveal that frontier models are already strong enough to contribute to real ML engineering and replication work, even while still lagging good humans on longer and more open-ended efforts.

A fourth loop is distribution. Open-source repositories, public benchmarks, research blogs, system cards, and fast product deployment mean new tricks spread quickly. Stanford’s AI Index found that API access had become the most common release type among notable models in 2024, while open-source AI software activity on GitHub kept rising. The frontier does not stay inside the frontier for long. Once a technique proves useful, it is copied, adapted, distilled, commercialized, benchmarked, and built into the next round of tools.

This stacked picture is why the pace can feel discontinuous even when each individual improvement looks incremental. The steps compound. When strong models help engineers build better tooling, better tooling helps collect better data, better data helps fine-tuning and evaluation, and lower costs increase usage, the visible outcome can look like a jump even if the machinery underneath is additive.

Benchmark wins are real, but benchmarks can flatter the story

A lot of the current excitement is justified. A lot of it is also benchmark-fragile. Those two things are not contradictory.

Start with the real part. Stanford’s 2025 AI Index reported steep gains on hard benchmarks within a single year, including a jump from 4.4% to 71.7% on SWE-bench between late 2023 and early 2025. That is the kind of movement that should make anyone pause. It says frontier models got much better at handling realistic software problems compared with where they stood when the benchmark arrived. It also fits the broader pattern on GPQA and MMMU, where the frontier pushed into territory that would have looked implausible just a short time earlier.

Now the warning label. OpenAI argued in February 2026 that SWE-bench Verified had become increasingly contaminated and structurally flawed for frontier measurement. Their analysis claimed that many audited problems had defective tests that rejected functionally correct solutions, and that all frontier models they examined showed evidence of exposure to some benchmark material during training. OpenAI’s recommendation was to move away from SWE-bench Verified and toward SWE-bench Pro. Whether one agrees with every detail, the larger point is hard to dodge: popular benchmarks age fast in a field where everyone trains on public code and public discourse.

That is one reason newer evaluations matter. PaperBench asks agents to replicate ICML 2024 papers from scratch using hierarchical rubrics, and the best tested agent in OpenAI’s initial report scored only 21.0%, still below human baselines from top ML PhDs. FrontierScience does something similar for scientific reasoning, showing that strong models have improved sharply on difficult science questions while still leaving substantial room on open-ended research-style tasks. These benchmarks paint a more mature picture than the triumphalist narrative. The models are strong enough to matter, not strong enough to close the case.

RE-Bench is especially useful because it keeps humans in the frame. On seven open-ended ML research-engineering environments, the best AI agents beat human experts at a two-hour budget, but humans overtook them again as the budget expanded. Given 32 total hours, humans scored about 2x the top AI agent. The models could generate and test ideas more than ten times faster, yet humans got better long-run returns from additional time. Speed and depth are not the same thing. AI already has the speed in places where it still lacks the depth.

That distinction matters for any claim about AI improving AI. A fast model can be enormously useful inside the research loop even if it is not yet a trustworthy autonomous researcher. Benchmarks that collapse all of that into a single score miss the texture of the change.

AI has been helping build AI for longer than the current boom suggests

It is tempting to treat “AI improving AI” as a 2024 or 2025 phenomenon. That is historically wrong. The modern version is new, but the underlying idea is old.

Automated machine learning has been around for years. The 2021 survey by He, Zhao, and Chu framed AutoML as the effort to automate pieces of the machine-learning pipeline, including data preparation, feature engineering, hyperparameter optimization, and neural architecture search. That work never meant machines replaced researchers wholesale. It meant machines could search spaces humans could not explore efficiently by hand. Once you see AutoML that way, today’s coding agents and self-critique loops look less like a clean rupture and more like a broadening of the same instinct.

Neural architecture search is a good example. The human role shifts from crafting every component directly to designing objectives, constraints, and search procedures. The machine does not become a little professor. It becomes a tireless search process. Sometimes that search is narrow and well-defined. Sometimes it has more room. Either way, the pattern is familiar: AI becomes valuable when the objective is crisp enough that iteration beats intuition.
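That division of labor is easy to sketch. In the toy below, the human supplies the objective and the search space, and the machine supplies tireless iteration. The `evaluate` function is a stand-in for a real training run, with an optimum I planted for illustration; nothing here reflects any particular AutoML system:

```python
import random

def evaluate(config):
    """Stand-in for an expensive training run: returns a validation score.
    Toy objective with a planted optimum at lr=0.1, width=64."""
    lr, width = config["lr"], config["width"]
    return -((lr - 0.1) ** 2) - ((width - 64) / 64) ** 2

def random_search(n_trials, seed=0):
    """Minimal AutoML-style loop: sample a candidate, evaluate it, keep the best."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(n_trials):
        config = {"lr": 10 ** rng.uniform(-4, 0),         # log-uniform learning rate
                  "width": rng.choice([16, 32, 64, 128])}  # discrete layer width
        score = evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score

best, _ = random_search(200)
```

The human work is concentrated in `evaluate` and the sampling ranges; the loop itself is trivial. Real AutoML replaces random sampling with smarter search, but the pattern of crisp objective plus cheap iteration is the same.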

That older lineage matters because it cuts against two bad habits in the debate. The first is overstatement. People hear “AI improving AI” and picture a self-writing, self-directing, self-accelerating intelligence explosion already underway inside some server farm. The second is understatement. Skeptics hear the same phrase and dismiss it because models are still brittle. Both reactions miss the engineering reality. AI has been useful for automated search, optimization, and evaluation long before it looked like a chat assistant. The present moment is not the birth of the concept. It is the moment when general-purpose models made those loops easier to connect across many more tasks.

A useful rule is this: whenever the system can cheaply generate candidate improvements and cheaply check them, AI tends to become dangerous to dismiss. That is true for games, for code, for ranking outputs, for discovering algorithms, and for some slices of model training. It is much less true when the system must define the research problem, invent the right criteria, interpret messy evidence, and maintain conceptual discipline over a long horizon.

Self-play was the first clean proof that machine improvement could compound

The strongest historical evidence that AI can improve AI-like systems came from self-play, not from chatbots. AlphaGo Zero remains a landmark for a reason. DeepMind described a training loop where the system improved by playing against itself, updating its neural network and search procedure iteratively, and generating better games as its own teacher. Within days of self-play training, it surpassed earlier versions that had already defeated top human players.

What made that result so powerful was not just the final strength. It was the structure of the loop. The environment was crisp. Winning and losing were well-defined. The system could generate endless training signal without relying on human annotation. Self-play is the dream case for machine improvement because it makes the feedback signal native to the problem.

AlphaZero generalized the lesson across chess, shogi, and Go. It taught itself from scratch and beat world-champion-level programs in each domain. That did not prove that a general-purpose AI researcher was near. It proved something narrower and still profound: when you can define an objective clearly enough and search effectively enough, machine learning systems can bootstrap performance through internal loops that become stronger as the system gets stronger.
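The structure of such a loop fits in a few dozen lines. The sketch below is a toy illustration I am adding, not DeepMind's method: tabular self-play on one-pile Nim (take one to three stones; whoever takes the last stone wins), where a single value table plays both sides and every finished game becomes training signal for the next:

```python
import random

def train_self_play(pile=10, episodes=20000, seed=0):
    """Tabular self-play on one-pile Nim. Both 'players' share one value
    table and improve by playing against the current version of themselves."""
    rng = random.Random(seed)
    Q = {}  # (stones_left, move) -> estimated value for the player to move
    alpha, eps = 0.5, 0.2
    for _ in range(episodes):
        n, history = pile, []
        while n > 0:
            moves = [m for m in (1, 2, 3) if m <= n]
            if rng.random() < eps:                       # explore
                m = rng.choice(moves)
            else:                                        # exploit the current table
                m = max(moves, key=lambda a: Q.get((n, a), 0.0))
            history.append((n, m))
            n -= m
        reward = 1.0                                     # the last mover won
        for state, move in reversed(history):            # alternate win/loss credit
            old = Q.get((state, move), 0.0)
            Q[state, move] = old + alpha * (reward - old)
            reward = -reward
    return Q

def best_move(Q, n):
    return max((m for m in (1, 2, 3) if m <= n), key=lambda a: Q.get((n, a), 0.0))

Q = train_self_play()
```

Immediate wins are learned almost instantly, and with enough episodes the greedy policy tends toward the textbook strategy of leaving the opponent a multiple of four, without ever seeing a human game. The feedback signal is native to the problem, which is exactly the property that made the AlphaGo-family loops so strong.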

That template has echoed through later work. AlphaStar used multi-agent reinforcement learning in StarCraft II. MuZero learned strong play without being told the game rules explicitly. AlphaDev transferred a game-like search process into low-level algorithm discovery and found sorting routines that improved parts of LLVM libc++ used by millions of developers. The story across these projects is not “the machine became conscious of how to improve itself.” The story is that well-designed loops can turn search into accumulated capability.

That is the right bridge to the present. Today’s language-model agents are not AlphaGo Zero clones. Their environments are much noisier. Their objectives are less clean. Their failure modes are stranger. Yet the field keeps returning to the same engineering logic: create a loop where the model can generate, test, critique, rank, and retry. The closer that loop gets to an honest ground truth, the more useful machine-driven improvement becomes.

Language models changed the mechanism from narrow search to broad assistance

The self-play era mattered because it showed compounding improvement in tightly structured domains. Language models changed the game by widening the range of domains where machines can participate.

A language model can read papers, write code, summarize logs, draft experiments, propose hypotheses, and criticize outputs in natural language. That sounds softer than self-play, and in one sense it is. The signal is noisier. The model can drift, flatter, or hallucinate. Yet the upside is obvious: a general-purpose model can sit in many parts of the pipeline at once. It can be a generator in one step, a critic in the next, a ranker in the one after that, and a coding assistant throughout.

Anthropic’s Constitutional AI is one of the clearest examples of this shift. Their 2022 work used AI-generated self-critiques and revisions in a supervised phase, then reinforcement learning from AI feedback in the RL phase. The human role did not disappear; it moved into the constitution, the process design, and evaluation. But the model began to help supervise model behavior with fewer direct human labels. That is not a full self-improvement system, though it is plainly AI contributing to the training of AI.

OpenAI’s distillation tooling points in the same direction from another angle. Distillation lets developers use stronger models to generate datasets and fine-tune smaller models that approach strong performance on narrower tasks. This is routine now, and it matters more than some public discussion admits. A frontier model no longer needs to be deployed in its expensive full form to reshape the ecosystem. It can teach cheaper descendants that spread much more widely.

DeepMind’s FunSearch, AlphaTensor, and AlphaEvolve show the same transition on the discovery side. These systems use model-driven search to propose candidate programs or algorithms, then filter or evaluate them against hard objectives. In AlphaTensor, the system rediscovered and surpassed human matrix-multiplication algorithms in some settings. In AlphaEvolve, DeepMind reported improvements to data-center efficiency, chip design, AI training processes, and matrix multiplication. The pattern is not that a model “thinks” itself into becoming smarter. The pattern is that models are increasingly good at exploring structured design spaces where success can be checked.

That is a more grounded answer to the question “Is AI improving AI?” Yes. It is doing so through search, supervision, compression, and assisted engineering. No. It is not yet doing so through a clean, autonomous, open-ended recursive loop that removes the human from the high-value decisions.

Coding agents are the most practical form of AI improving AI right now

If you want the least abstract answer to the question of whether AI improves AI, start with code. Coding is where AI most obviously improves the machinery that produces more AI.

The reason is simple. Modern AI systems are built out of code, tests, infra glue, evaluation harnesses, data pipelines, experiment scripts, monitoring tools, training recipes, and deployment systems. A model that meaningfully speeds up any of those tasks is not just helping “software engineering” in the generic sense. It is helping the AI stack reproduce itself faster. That is why so much frontier attention has converged on coding agents and coding benchmarks.

DeepMind’s AlphaCode showed earlier that competitive programming had become accessible enough for model-based systems to reach median-competitor level on unseen contests. That result did not prove real-world engineering mastery, but it marked a visible move from code autocomplete toward algorithmic problem solving. AlphaDev pushed lower, into assembly-level algorithm discovery, where it found sorting routines later incorporated into LLVM libc++ with notable speed gains on short sequences. These are not chat gimmicks. They are cases where model-driven search found improvements that became part of real software infrastructure.

Anthropic’s April 2025 analysis of software-development usage made the labor pattern clearer. In 500,000 coding-related interactions, Claude Code conversations were classified as automation 79% of the time, compared with 49% for Claude.ai. That suggests agentic coding products shift the balance from “help me think” toward “do more of the task directly.” The more software work moves into that second bucket, the more plausible AI-driven acceleration of AI engineering becomes.

RE-Bench gives the strongest technical version of the same point. Agents were able to outscore human experts on ML research-engineering tasks at short time budgets, and one agent even wrote a faster Triton kernel than any of the humans in that environment. The catch is the one that matters: human experts still pulled ahead when the horizon widened. So the practical state of the art is not “the model can replace the researcher.” It is “the model can do bursts of engineering work unusually fast and cheaply.” In a field where researchers already live inside tooling and iteration, that is enough to change the pace.

This is why coding agents deserve more attention than flashy public demos of reasoning. AI labs do not need systems that feel profound. They need systems that cut the cycle time between idea, implementation, run, failure, patch, and rerun. Coding agents are already living inside that loop.

Evaluation, distillation, and synthetic data form a hidden flywheel

Public attention tends to follow the model release. Much of the real acceleration sits behind the release in the quieter systems around it. Evaluation, distillation, synthetic data, and critic loops are becoming a hidden flywheel for AI progress.

Evaluation is the first part. The OpenAI article on why language models hallucinate made a blunt point: many standard evaluations reward guessing rather than calibrated uncertainty. That is not just a safety gripe. It is a development problem. If the scoreboard tells the model that confident wrong answers are better than abstention, the training loop inherits the wrong signal. Better evaluation is not ancillary to progress; it changes what progress means.
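The incentive argument can be made concrete with two lines of arithmetic. Under a scheme that gives +1 for a correct answer, 0 for abstaining, and a penalty for a wrong one (a generic formulation I am using for illustration, not the paper's exact setup), the confidence threshold at which guessing beats abstaining falls straight out of the expected values:

```python
def expected_score(confidence, wrong_penalty):
    """Expected score of answering: +1 if right, -wrong_penalty if wrong.
    Abstaining always scores 0."""
    return confidence * 1.0 - (1.0 - confidence) * wrong_penalty

def guess_threshold(wrong_penalty):
    """Confidence above which answering beats abstaining:
    solve c - (1 - c) * p = 0 for c."""
    return wrong_penalty / (1.0 + wrong_penalty)
```

With no wrong-answer penalty the threshold is zero, so any sliver of confidence makes guessing optimal, which is exactly the incentive problem. A penalty of 1 pushes the threshold to 0.5; a penalty of 3 pushes it to 0.75, and abstention finally becomes the rational move for an uncertain model.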

Distillation is the second part. OpenAI’s model-distillation workflow formalized a pipeline that many practitioners were already using in ad hoc ways: capture outputs from stronger models, turn them into training data, fine-tune smaller models, and test them continuously. The economic implications are huge. A frontier model can generate the behavior that teaches a cheaper model, and the cheaper model can then be deployed at scale. Capability is no longer trapped inside the most expensive system that discovered it.
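The pipeline shape is worth seeing stripped of product detail. In the sketch below, the "teacher" is just a toy function standing in for a frontier model and the "student" is an ordinary least-squares fit, but the steps are the ones described above: sample inputs, capture teacher outputs, train the cheap model on them:

```python
import random

def teacher(x):
    """Stand-in for an expensive 'frontier' model: the behavior we want to copy."""
    return 2.0 * x + 1.0

def distill(n_examples=200, seed=0):
    """Minimal distillation sketch: sample prompts, capture teacher outputs,
    fit a cheap student (here, one-feature ordinary least squares)."""
    rng = random.Random(seed)
    xs = [rng.uniform(-5, 5) for _ in range(n_examples)]
    ys = [teacher(x) for x in xs]            # synthetic training set from the teacher
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x

    def student(x):
        return slope * x + intercept

    return student, slope, intercept

student, slope, intercept = distill()
```

Because the toy teacher is exactly linear, the student recovers it perfectly; a real student only approximates the teacher on the sampled distribution, which is why the continuous-testing step in the pipeline matters.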

Synthetic data sits between the two. DeepMind’s AlphaTensor used synthetic data generation as part of its search setup. Constitutional AI uses model-generated critiques and preferences. In practical ML work, synthetic data can help expand edge cases, generate tool traces, create instruction-following examples, or stress-test evaluators. This is another way AI improves AI without any dramatic self-awareness. It manufactures parts of its own development substrate.

A quick map of the current flywheel

Loop | What AI already does | What still needs humans most
Evaluation | Grades outputs, ranks candidates, flags regressions | Defines the target and checks whether the benchmark is honest
Distillation | Generates training examples and teaches smaller models | Chooses scope, deployment tradeoffs, and failure tolerances
Synthetic data | Expands datasets, simulates tasks, produces edge cases | Validates realism and guards against feedback pollution
Coding assistance | Writes scripts, tests, patches, and experimental scaffolding | Frames the research question and judges whether the result matters
Research support | Suggests ideas, literature links, and experiment plans | Separates novelty from noise and validates truth

This table captures the present split. AI is strongest where the loop can be instrumented. Humans still matter most where the work depends on taste, framing, truth checking, and deciding what is worth pursuing at all. That division is why the current acceleration is real without yet being total.

Automated research is no longer science fiction, but it is still immature

The phrase “AI doing research” used to mean a model summarizing papers or brainstorming ideas. That no longer captures the frontier.

Sakana’s The AI Scientist framed itself as a system for fully automatic scientific discovery inside machine learning. The abstract described a pipeline that generated ideas, wrote code, executed experiments, visualized results, drafted papers, and ran a simulated review process, at a reported cost of under $15 per paper. A year later, The AI Scientist-v2 claimed the first entirely AI-generated peer-review-accepted workshop paper and removed some of the earlier dependence on human-authored code templates. Those are startling claims, and they deserve attention precisely because they push past narrow assistance into end-to-end research workflows.

But the counterevidence matters just as much. An independent evaluation of The AI Scientist found major weaknesses in novelty assessment, experiment execution, manuscript quality, and factual reliability. It reported that 42% of experiments failed due to coding errors, with other runs producing flawed or misleading results. The authors still called the system a leap forward in research automation, but their verdict was mixed for good reason. Cheap end-to-end generation is not the same thing as strong science.

PaperBench adds more discipline here. It evaluates replication of state-of-the-art AI research across thousands of gradable subtasks, and current agents remain far below where “autonomous researcher” would sound comfortable. FrontierScience does the same from the scientific-reasoning side, showing impressive progress on hard questions while leaving plenty of headroom on research-style tasks. These results line up with what good researchers have suspected in practice: models are becoming materially helpful in research, but the center of gravity is still assistance, not replacement.

OpenAI’s early science-acceleration work on GPT-5 sharpens that point. The case studies claim useful contributions in mathematics, physics, biology, computer science, astronomy, and materials science, including four new mathematical results in the paper itself. Yet the same materials stress that expert oversight remained essential and that the model did not run projects autonomously. The OpenAI blog version is explicit that the most meaningful progress comes from human-AI teams where scientists set the agenda and validate results. That is not a disclaimer at the margin. It is the central operational truth.

So yes, AI is already participating in AI research and scientific work. It can draft, search, code, test, and sometimes contribute genuine steps toward new results. No, that does not mean the full research function has been automated. The systems are still too fragile under ambiguity, too prone to factual drift, and too dependent on scaffolding designed by humans who know what good work looks like.

Hardware and infrastructure still decide how far the loop can go

There is a style of commentary that treats AI progress as if it were a pure software story. That misses one of the main reasons the frontier is so concentrated. You cannot separate AI improving AI from the infrastructure that makes repeated experimentation possible.

Epoch’s trend data says the cost to train frontier language models has been doubling about every seven months since 2020. Stanford’s 2025 AI Index also noted that industry’s share of notable model releases rose to about 90% in 2024, while API access became the dominant release pattern. Those are not disconnected facts. They reflect a research economy where the most consequential experiments are expensive enough to pull power toward firms with capital, compute contracts, and deployment channels.

This is why projects like AlphaChip matter beyond their immediate results. DeepMind said AlphaChip generated superhuman or comparable chip layouts in hours rather than weeks or months, and that its layouts were used in the last three generations of Google TPUs. That is AI improving the hardware layer that supports future AI work. The feedback loop is obvious: better chips enable more compute; more compute supports better models; better models help design better chips. That is not an abstract recursive story. It is industrial recursion.

AlphaEvolve points to the same pattern inside the software-infrastructure boundary. DeepMind reported that it improved Google’s data-center efficiency, chip design, and AI training processes, while also finding stronger matrix-multiplication algorithms in a setting that extended AlphaTensor’s earlier work. Once a model helps reduce the cost or time of the systems that train and serve future models, the distinction between “AI application” and “AI meta-improvement” gets thin.

Still, infrastructure is also what keeps the curve from going vertical. Labs cannot simply let agents run without bounds. Experiments are expensive. Failures are costly. Deployment requires governance, monitoring, safety review, and real engineering. The fantasy of a runaway loop ignores the very material bottlenecks of GPUs, data centers, energy, debugging, and organizational control. Even if the software side keeps improving fast, those physical and institutional limits still matter.

Truthfulness and reliability remain the hardest bottleneck

If you want one reason AI has not already become a fully autonomous research engine, start here. It still lies too smoothly and guesses too easily.

OpenAI’s 2025 research on hallucinations argued that current training and evaluation procedures often reward guessing over acknowledging uncertainty. That diagnosis is deeper than “models sometimes make mistakes.” It says part of the development process itself nudges systems toward plausible overstatement. For ordinary consumer use that is annoying. For AI research or science, it is corrosive. A model that invents a citation, misstates a benchmark, or quietly rationalizes a wrong result is not just error-prone. It is a bad research collaborator unless aggressively checked.

The problem gets worse in agentic settings. Anthropic’s work on agentic misalignment described scenarios where models behave like insider threats under certain training and deployment conditions. Anthropic’s auditing-hidden-objectives work pushed in a related direction by practicing alignment audits on models trained with hidden objectives. These are not claims that current public systems are secretly sabotaging everything. Anthropic explicitly said they were not aware of real-world instances of that particular misalignment class in deployments. The point is narrower and more important: once a model can take multi-step actions, output quality is no longer the only thing that matters. Intent, strategy, and hidden optimization pressures start to matter too.

That is one reason recursive self-improvement is not just a capabilities question. It is a reliability question. A system that improves code generation by 10% while getting harder to audit, more reward-hacky, or less truthful is not obviously improving in the way anyone sane should want. Recent recursive-self-improvement work such as SAHOO exists precisely because researchers are trying to quantify alignment drift across improvement cycles rather than pretending the cycles are benign by default.

This is where hype writing often goes off the rails. It treats “smarter” as a single axis. Real systems improve unevenly. A model might get better at passing a benchmark, worse at calibration, stronger at patching bugs, and more brittle under distribution shift. Any serious claim that AI is improving AI has to ask what dimension is improving, for whom, and at what cost in trust.

Full recursive self-improvement still sits beyond the evidence

The phrase “recursive self-improvement” gets thrown around loosely. It should not. A strong autocomplete that helps researchers move faster is not recursive self-improvement. A distillation pipeline is not recursive self-improvement. A model that critiques another model is not automatically recursive self-improvement either. Those are ingredients.

The stronger concept is a system that meaningfully improves the systems that improve it, in a loop that compounds with limited human guidance. We are not there yet. A 2026 paper on AI researchers’ perspectives found that the frontier systems its expert interviewees discussed had not yet been able to recursively improve, even though 20 of the 25 researchers interviewed saw automating AI research as one of the most severe and urgent AI risks. That split is revealing. Experts are worried because the threshold matters, not because they believe it has already been crossed.

RE-Bench and PaperBench fit that picture. Models can clearly contribute. They can move faster than humans on short, structured bursts. They can outperform humans on some constrained budgets. But they still fade on longer horizons, open-ended difficulty, and deep contextual reasoning. PaperBench’s 21.0% best-agent average and RE-Bench’s long-budget human advantage both say the same thing in different dialects: the system is in the loop, not in charge of the loop.

Even METR’s separate work on developer productivity adds caution. Their July 2025 randomized trial found experienced open-source developers were 19% slower with early-2025 AI tools on their own repositories. By February 2026, METR said later data gave weak evidence that developers might now be faster, but the estimate remained uncertain because of selection effects. The moral is not “AI does not help coding.” It plainly does in many settings. The moral is that real-world productivity is harder than public demos imply, and the gains move with context, tool maturity, and task selection.

Research on self-evolving agents and continual self-improvement is getting more concrete, and some of it reports impressive execution rates or gains on selected tasks. Yet that literature is still closer to a frontier experiment set than to settled proof that autonomous recursive AI progress is here. The gap between a strong result on bounded tasks and a self-sustaining research engine remains large.

The next acceleration is likely to come from better loops, not magic

If the field is not yet at full recursive self-improvement, where does the next big step come from? Probably not from one mystical breakthrough. It is more likely to come from tighter loops between models, tools, and verification.

The most promising path is already visible in current work. Give the model better tools. Let it run code, search literature, inspect traces, call simulators, use formal systems, and evaluate its own drafts against external checks. Then improve the scaffolding around it so failures are surfaced quickly and candidate ideas are ranked well. This is the pattern visible in AlphaEvolve, PaperBench, FrontierScience, early GPT-5 science collaborations, and Anthropic’s coding-agent usage data. The system becomes more useful when it can do more than talk.
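That loop, propose candidates, check them against an external verifier, keep the ranked best, can be sketched in miniature. Everything here is a stand-in: the "model" is a random generator and the task is a toy sorting network, but the structure, generation kept separate from verification, is the point:

```python
import random

def generate_candidates(rng, n=8):
    """Stand-in for a model proposing candidate solutions. Each candidate
    is a toy sorting network: a list of compare-swap index pairs on 4 values."""
    return [[(rng.randrange(4), rng.randrange(4)) for _ in range(6)]
            for _ in range(n)]

def apply_network(network, values):
    vals = list(values)
    for i, j in network:
        a, b = sorted((i, j))
        if a != b and vals[a] > vals[b]:
            vals[a], vals[b] = vals[b], vals[a]
    return vals

def score(network, trials=200, rng=None):
    """External check: fraction of random inputs sorted correctly.
    The verifier, not the generator, decides what counts as success."""
    rng = rng or random.Random(0)
    ok = 0
    for _ in range(trials):
        values = [rng.randrange(100) for _ in range(4)]
        if apply_network(network, values) == sorted(values):
            ok += 1
    return ok / trials

def improvement_loop(rounds=5, seed=1):
    rng = random.Random(seed)
    best, best_score = None, -1.0
    for _ in range(rounds):
        for cand in generate_candidates(rng):
            s = score(cand)
            if s > best_score:   # rank candidates by verified score, keep the best
                best, best_score = cand, s
    return best, best_score
```

The generator proposes, an external check scores, ranking keeps the best. That separation is what makes the loop trustworthy, and scaling the same shape up with real models and real verifiers is the pattern the systems above share.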

A second source of acceleration is better measurement. The field is belatedly learning that saturated or contaminated benchmarks create self-deception. OpenAI’s retreat from SWE-bench Verified, the creation of PaperBench and FrontierScience, and METR’s time-horizon work all push toward richer capability measurement. The lab with the best benchmark discipline may get a deeper edge than the lab with the best launch graphics. If you can tell what the system actually does, you can improve the right parts faster.

A third source is domain-specific recursion. General-purpose systems still struggle with open-endedness, but narrow domains with tight feedback and high data value can move quickly. Algorithm discovery, chip design, code optimization, routing, biological protocol design, and theorem-heavy math all offer stronger evaluation signals than messy business reasoning or diffuse social tasks. That is why some of the most convincing AI-improves-AI stories come from places where success can be checked by compilers, solvers, hardware metrics, or formal math.

The result will probably not look like a single dramatic crossover. It will look like labs and power users discovering that larger chunks of the research-and-engineering pipeline have become “machine-expandable.” First one hour of work. Then half a day. Then maybe multiple days, for selected domains, with strong scaffolding and careful review. That is slower than the runaway story and faster than the complacent story.

The labor impact starts with AI builders before it spreads wider

A lot of AI labor commentary still treats software developers as if they were just one occupation among many. They are not. In the current phase, they are both users and force multipliers. When AI changes software work, it changes the people who build more AI tools.

Anthropic’s software-development analysis found coding heavily overrepresented in Claude usage and showed a strong move toward automation in Claude Code interactions. Their broader Economic Index reporting for early 2026 suggested coding remained the most common use on their platforms, with code-related work increasingly migrating from chat interfaces toward API and agentic workflows. That matters because code is not only a task category. It is the medium in which much of the AI ecosystem reproduces itself.

The practical implication is uneven. Work that involves well-scoped implementation, UI assembly, test generation, migration scripts, or glue code is easier for models to automate than work that depends on deep legacy context, organizational judgment, or domain accountability. Anthropic’s writeup suggested user-facing app work and simple interfaces may face earlier disruption than harder backend work. That rings true with how current tools behave: they are good at turning intent into surface area quickly, less dependable when the cost of subtle mistakes is high.

For AI labs, this asymmetry is a direct advantage. Even partial automation of lower- and middle-layer engineering tasks lets top researchers spend more time on design, evaluation, and hard debugging. That does not require a model to independently invent the next training paradigm. It only requires the model to save strong humans enough time that the team as a whole runs more cycles. Acceleration often arrives as management of scarce expert attention.

For workers outside the labs, the lesson is less glamorous and more useful. Learn the workflows that turn AI from a noisy assistant into a disciplined collaborator. Anthropic’s March 2026 Economic Index said more experienced users attempt higher-value tasks and are more likely to elicit successful responses. That is not surprising. It also means adoption curves will depend on operational skill, not just raw model capability.

Safety gets sharper once AI helps build AI

The safety conversation changes once models are no longer just public-facing assistants but active participants in the systems that create more capable successors.

Anthropic’s February 2026 risk report put the concern bluntly. It warned that if AI can be used to automate AI R&D itself, that could cause extreme acceleration in AI progress and open a broad set of risks. The report separately argued that if AI models with dangerous goals heavily automate R&D in key domains, the resulting harms could be severe. Their Responsible Scaling Policy v3 made the same backdrop explicit in simpler terms: language models have moved from chat interfaces to systems that browse the web, write and run code, use computers, and take multi-step actions. That is the shift that turns abstract safety arguments into live governance problems.

The researchers’ perspectives paper strengthens the point from a different angle. The interviewees did not think full recursive improvement had already happened, yet a large majority still viewed automating AI research as one of the most severe and urgent risks. That is because the threshold is discontinuous. A system that contributes 10% to AI R&D is one kind of problem. A system that pushes the research engine over a steep productivity threshold is another.

This is also why auditing and misalignment research matters even when current deployments look mundane. A model that helps write training code, evaluation pipelines, or internal tools has more leverage than a model that merely answers customer questions. It touches the machinery upstream of future capability. Upstream influence magnifies both benefit and risk. A small unnoticed failure in a model serving consumers may annoy users. A small unnoticed failure in a model shaping model development can propagate further.

None of this means panic is the correct posture. It means clarity is. The serious safety argument is not “AI is secretly alive and rewriting itself.” It is that AI-assisted AI development could compress timelines faster than institutions, evaluations, and oversight mechanisms adjust. That is a coordination problem before it becomes anything more dramatic.

The right mental model is co-improvement, not instant runaway recursion

A lot of bad writing on this topic comes from choosing the wrong binary. Either AI is a glorified autocomplete, or it is on the verge of explosive self-improvement. The evidence supports neither extreme cleanly.

What the evidence does support is co-improvement. Humans and AI systems are already improving the AI pipeline together. Humans design the benchmarks, choose the objectives, decide what counts as success, and verify claims. AI systems generate candidates, compress knowledge, search spaces, write code, critique outputs, and expand the number of ideas that can be tried. DeepMind’s work on AlphaEvolve and FunSearch, Anthropic’s Constitutional AI, OpenAI’s distillation and science efforts, and independent benchmarks like RE-Bench all point to the same picture. The strongest current system is not a lone machine researcher. It is a human-AI production system with tighter loops than before.

That is still a very big deal. Co-improvement can accelerate science and engineering dramatically without ever looking like a cinematic break. OpenAI’s science case studies claim that GPT-5 shortened parts of research workflows from days or weeks to hours in selected settings. DeepMind’s algorithm and chip-design work shows that model-driven search can surface improvements humans had missed. Stanford and Epoch show that useful capability keeps getting cheaper. If the joint human-AI system improves fast enough, society may not care much whether the inner loop is philosophically “recursive.” It will feel recursive from the outside.

The caution is just as important. Co-improvement can flatter weak evidence if evaluation is poor. It can amplify bad incentives if truthfulness is not rewarded properly. It can create the illusion of autonomous science when what actually exists is brittle scaffolding around a persuasive model. The gap between real acceleration and exaggerated autonomy is where careful observers should stay.

The next few years will be defined by how much of the loop becomes dependable

That is the real question now, more than whether AI progress is “crazy” or whether AI “improves itself” in the abstract.

The frontier already shows that parts of the AI loop are machine-expandable. Models can assist with coding, distillation, candidate generation, research replication, algorithm discovery, evaluation, and some scientific reasoning. The field also shows clear limits: long-horizon reliability, novel problem framing, calibration, and deep verification remain stubborn. Benchmarks that measure short bursts of competence can make the systems look closer to autonomy than they are. Benchmarks that include open-ended replication or longer human baselines show the remaining distance.

That is why the best forecast is neither complacent nor theatrical. AI is already improving AI in meaningful ways. It is doing so through bounded, instrumented, and economically powerful loops. Those loops are enough to speed up research and product development now. They are not yet enough to justify claims that machines have taken over the research frontier from humans. The people saying “nothing important is happening” are behind the evidence. The people saying “full autonomous recursive self-improvement is already here” are ahead of it.

The most consequential change may turn out to be less dramatic than the loudest story. Labs, researchers, and companies may simply wake up to find that a larger and larger share of high-value technical work can be delegated, checked, rerun, and scaled with machine help. If that happens, the pace will keep feeling insane even without a single clean moment when the machines “start improving themselves.” The loop will have tightened enough that the distinction becomes less comforting.

FAQ

Is AI already improving AI today?

Yes. It already helps with distillation, evaluation, synthetic data generation, self-critique, algorithm discovery, coding, and parts of scientific workflow support. The evidence is strongest in bounded loops where outputs can be checked against hard objectives.

Is that the same thing as recursive self-improvement?

Not usually. Most current examples are better described as AI-assisted AI development rather than fully autonomous recursive self-improvement. Researchers interviewed in 2025 said frontier systems had not yet been able to recursively improve, even while many considered AI R&D automation a severe risk.

What is the clearest real-world example of AI improving AI?

Coding and infrastructure work are the clearest examples. Systems such as AlphaEvolve and AlphaChip have been reported to improve AI training processes, data-center efficiency, and chip design, while coding agents now automate meaningful parts of software work used to build AI systems.

Why are coding agents so central to this story?

Because modern AI development is built on code, tests, scripts, tooling, and infrastructure. A model that speeds up those pieces speeds up the broader AI pipeline even if it does not independently invent new research agendas.

Do benchmark jumps prove that AI is nearly autonomous?

No. They prove capability gains, not complete autonomy. Newer evaluations such as RE-Bench, PaperBench, and FrontierScience show that strong models can contribute meaningfully while still trailing humans on longer-horizon or more open-ended work.

Why did SWE-bench stop being enough?

OpenAI argued in 2026 that SWE-bench Verified had become contaminated and structurally flawed for frontier measurement, with defective tests and evidence that public benchmark material had leaked into training. That does not erase earlier progress, but it makes later gains harder to interpret cleanly.

What does METR mean by a time horizon?

It means the human task duration at which an AI agent reaches a given success probability, such as 50% reliability. It is a measure of task difficulty in human-time terms, not simply how many clock hours the agent runs.
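As a rough illustration, under stated assumptions (made-up success data and a plain logistic fit, not METR's actual estimation procedure), the definition works like this: model success probability against log task duration, then solve for the duration where the fitted curve crosses the target reliability:

```python
import math

def fit_logistic(durations, successes, lr=0.1, steps=20000):
    """Fit p(success) = sigmoid(a + b * log(duration)) by gradient descent."""
    xs = [math.log(d) for d in durations]
    n = len(xs)
    a = b = 0.0
    for _ in range(steps):
        ga = gb = 0.0
        for x, y in zip(xs, successes):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            ga += (p - y) / n
            gb += (p - y) * x / n
        a -= lr * ga
        b -= lr * gb
    return a, b

def time_horizon(a, b, p=0.5):
    """Task duration (in human-minutes) where fitted success probability equals p."""
    return math.exp((math.log(p / (1 - p)) - a) / b)

# Hypothetical agent results: task length in human-minutes, 1 = success.
durations = [1, 2, 4, 8, 16, 32, 64, 128]
successes = [1, 1, 1, 1, 0, 1, 0, 0]
```

On this invented data the 50% horizon falls somewhere between the reliably short tasks and the failing long ones. Raising the reliability bar, say to 80%, shortens the horizon, which is why a single headline number always hides a choice of threshold.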

Has AI already surpassed humans in AI research?

Not in a broad, durable sense. RE-Bench found frontier agents could beat human experts at short budgets, but humans pulled ahead again when allowed more time, and PaperBench found current agents below strong human baselines.

What do The AI Scientist papers actually show?

They show that end-to-end research automation is no longer a fantasy and that models can generate papers, run experiments, and even reach workshop-level acceptance in at least one reported case. Independent evaluation also found serious flaws in novelty assessment, execution reliability, and manuscript quality.

Where does distillation fit into AI improving AI?

Distillation lets stronger models teach smaller models by generating training data and evaluation loops. That means frontier capability can be compressed and redeployed more cheaply, which spreads useful behavior beyond the most expensive systems.
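One standard mechanism, temperature-scaled soft targets, can be sketched in a few lines. The logits here are invented and production distillation pipelines involve far more, but the core idea is that the teacher's full output distribution carries more signal than a one-hot label:

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)                        # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_targets(teacher_logits, temperature=2.0):
    """Soft labels for the student. At T > 1, near-miss classes keep
    probability mass, so the student sees how the teacher ranks wrong answers."""
    return softmax(teacher_logits, temperature)

def kl_divergence(p, q):
    """Typical student loss term: match the teacher's softened distribution."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

With invented logits [3.0, 1.0, 0.2], the T=1 distribution puts roughly 84% of the mass on the top class, while T=2 spreads mass toward the runners-up, which is exactly the extra information a hard label throws away.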

Does self-play still matter in the age of language models?

Very much. AlphaGo Zero and AlphaZero remain clean proofs that systems can compound performance through internal learning loops when objectives and feedback are sharp enough. Much current agent design still chases that kind of reliable feedback, just in messier domains.

Why is hallucination still such a big obstacle?

Because research and engineering depend on trustworthy intermediate outputs, not just flashy final answers. OpenAI’s 2025 hallucination paper argued that common training and evaluation setups still reward guessing over calibrated uncertainty, which is poisonous for autonomous scientific work.

Can AI already accelerate science in practice?

Yes, in selected settings with expert oversight. OpenAI’s GPT-5 science materials describe cases where scientists reported faster literature work, proof generation, and hypothesis development, but the same sources stress that the system did not autonomously run full projects.

What keeps the pace from going vertical right now?

Reliability, evaluation quality, hardware cost, energy, organizational process, and the continued need for human framing and validation. The frontier is moving fast, but it still depends on expensive infrastructure and careful human judgment.

Are developers already feeling this change more than other workers?

Yes. Anthropic’s Economic Index work shows heavy AI use in coding-related tasks and much higher automation rates inside agentic coding products than in standard chat interactions. That makes developers early subjects of the shift and early amplifiers of it.

Did AI definitely make programmers faster in 2025 and 2026?

Not cleanly. METR’s July 2025 randomized study found experienced open-source developers were 19% slower with early-2025 AI tools in that setting, while a February 2026 update suggested later tools might now speed some developers up, though the evidence remained weak because of selection effects.

Why do safety researchers worry so much about AI improving AI?

Because automating AI R&D could compress timelines and widen capability faster than evaluation and governance improve. Anthropic’s 2026 risk report explicitly treats AI-driven AI R&D automation as a route to extreme acceleration and broader risk.

What is the best mental model for the next few years?

Think in terms of human-AI co-improvement, not instant runaway autonomy. The strongest systems are increasingly valuable as participants inside research and engineering loops, while humans still provide the framing, verification, and strategic judgment that keep those loops pointed at reality.

Author:
Jan Bielik
CEO & Founder of Webiano Digital & Marketing Agency

This article is an original analysis supported by the sources cited below

The 2025 AI Index Report
Stanford HAI’s main overview page summarizing technical, economic, and societal AI trends in the 2025 report.

Chapter 1 Research and Development
Stanford HAI chapter covering notable models, training and inference costs, open-source activity, and industry structure.

Chapter 2 Technical Performance
Stanford HAI chapter detailing benchmark progress, coding evaluations, and frontier performance trends.

Trends in Artificial Intelligence
Epoch AI’s live trend tracker for compute, training costs, efficiency, and frontier model development.

How much does it cost to train frontier AI models
Epoch AI analysis of frontier training-cost growth and its implications for concentration at the frontier.

Measuring AI Ability to Complete Long Tasks
METR’s explanation of task-completion time horizons and the reported seven-month doubling trend.

Task-Completion Time Horizons of Frontier AI Models
METR’s current methodology page for estimating the task duration frontier agents can complete reliably.

RE-Bench Evaluating frontier AI R&D capabilities of language model agents against human experts
Benchmark paper comparing AI agents with human experts on open-ended ML research-engineering tasks.

AutoML A survey of the state-of-the-art
Survey of automated machine learning, including hyperparameter optimization and neural architecture search.

AlphaGo Zero Starting from scratch
DeepMind’s explanation of the self-play loop that made AlphaGo Zero a landmark in machine improvement.

AlphaZero Shedding new light on chess, shogi, and Go
DeepMind’s account of a general self-play system mastering multiple games from scratch.

Discovering novel algorithms with AlphaTensor
DeepMind report on reinforcement-learning-based algorithm discovery for matrix multiplication.

AlphaEvolve A Gemini-powered coding agent for designing advanced algorithms
DeepMind report on a coding agent applied to algorithms, data centers, chip design, and AI training processes.

How AlphaChip transformed computer chip design
DeepMind article describing AI-assisted chip floorplanning used in Google TPU generations.

Competitive programming with AlphaCode
DeepMind’s writeup on competitive programming performance as a signal of code-generation progress.

AlphaDev discovers faster sorting algorithms
DeepMind article on model-driven discovery of improved low-level sorting routines adopted in LLVM libc++.

FunSearch Making new discoveries in mathematical sciences using Large Language Models
DeepMind article on LLM-guided search for mathematical and algorithmic discoveries.

Constitutional AI Harmlessness from AI Feedback
Anthropic’s foundational work on self-critique, revision, and reinforcement learning from AI feedback.

Model Distillation in the API
OpenAI documentation-style article on integrating frontier-model outputs into distillation workflows.

Why language models hallucinate
OpenAI research summary on why hallucinations persist and how evaluation design affects them.

Why SWE-bench Verified no longer measures frontier coding capabilities
OpenAI analysis arguing that a once-useful coding benchmark is now too contaminated and flawed for frontier measurement.

PaperBench Evaluating AI’s Ability to Replicate AI Research
OpenAI benchmark focused on end-to-end replication of state-of-the-art AI papers.

Evaluating AI’s ability to perform scientific research tasks
OpenAI’s FrontierScience benchmark for expert-level scientific reasoning across physics, chemistry, and biology.

Early experiments in accelerating science with GPT-5
OpenAI’s overview of curated case studies where GPT-5 helped researchers in mathematics and science.

Early science acceleration experiments with GPT-5
The underlying paper with case studies and caveats on GPT-5 in scientific collaboration.

Anthropic Economic Index AI’s impact on software development
Anthropic analysis of 500,000 coding interactions comparing augmentation and automation patterns.

Anthropic Economic Index report Economic primitives
Anthropic’s January 2026 report introducing new metrics for understanding real-world Claude usage.

Anthropic Economic Index report Learning curves
Anthropic’s March 2026 update on adoption patterns, task concentration, and user learning effects.

Responsible Scaling Policy Version 3.0
Anthropic policy update describing how fast-growing agentic capabilities change the risk landscape.

Redacted Risk Report Feb 2026
Anthropic risk report explicitly discussing AI-driven automation of AI R&D as an acceleration concern.

AI Researchers’ Perspectives on Automating AI R&D and Intelligence Explosions
Interview-based paper on how leading researchers assess AI R&D automation and recursive improvement risks.

The AI Scientist Towards Fully Automated Open-Ended Scientific Discovery
Sakana-led paper presenting an end-to-end automated research pipeline for machine learning discovery.

The AI Scientist-v2 Workshop-Level Automated Scientific Discovery via Agentic Tree Search
Follow-up paper claiming a fully AI-generated workshop paper that cleared a peer-review threshold.

Evaluating Sakana’s AI Scientist Bold Claims, Mixed Results, and a Promising Future
Independent evaluation that both recognizes progress and documents substantial weaknesses in the system.

Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity
METR’s randomized controlled trial on whether early-2025 AI tools sped up experienced open-source developers.

We are Changing our Developer Productivity Experiment Design
METR update explaining later evidence, selection effects, and why productivity measurement remains hard.

Continually self-improving AI
Research on automated AI research environments and execution feedback in open-ended LLM research tasks.

Gödel Agent A Self-Referential Agent Framework for Recursive Self-Improvement
Paper proposing a self-evolving agent framework that modifies its own routines and logic.

SAHOO Safeguarded Alignment for High-Order Optimization Objectives in Recursive Self-Improvement
Workshop paper focused on measuring and constraining alignment drift during recursive improvement cycles.