
AI That Builds AI: OpenAI and Anthropic Set 2028 Goal

Four years ago, AI agents fell apart after about thirty seconds of real engineering work. Today they can grind through tasks that take a human expert about twelve hours. That number doubles every few months and has not stopped doubling. Pretty much everything else in this story comes back to that one curve.

Something shifted in May 2026. Within a single week, the two biggest AI labs went on record saying the same thing, and one of them attached a specific deadline and a probability. Not about making AI more useful or faster. About AI doing the work of building itself, with no humans in the loop.

Days later, the first tools built specifically for that future landed in developers' hands. Early customers are already reporting results that would have seemed implausible a year ago. This piece breaks down exactly what was said, what shipped, what the data actually shows, and why the researchers closest to this work are the ones sounding the loudest warnings.

  • 2028: OpenAI's target for an automated AI researcher
  • 60%: Clark's odds on autonomous AI R&D by end of 2028
  • 12 hrs: METR time horizon, Claude Opus 4.6 (April 2026)
  • $1.4T: OpenAI's total compute commitment

Executive Summary

  • OpenAI committed to an automated AI research intern by September 2026 and a full AI researcher by March 2028
  • METR data shows AI agents can now handle tasks taking human experts up to 12 hours, a horizon that doubles every few months
  • Anthropic co-founder Jack Clark puts 60% odds on no-human-involved AI R&D by end of 2028
  • Benchmark scores on AI research tasks have surged from low double digits to near-solved in under two years
  • Safety researchers warn today's techniques were not designed for systems smarter than their evaluators

What Is Automated AI Research and Why Is Everyone Talking About It Now

[Image: Diagram showing the two components of automated AI research: AI engineering and AI research]

Boil it down and it is a simple idea. Automated AI research is one AI system doing the work needed to build the next, better version of itself, with little or no human help. The work has two parts. The first is the AI engineering side: writing code, running training experiments, debugging. The second is the actual research side: deciding what to study, designing experiments, coming up with new ideas in the first place. The first part is mostly typing. The second part is judgment.

OpenAI's near-term goal is what it calls an “automated AI research intern.” Picture an AI doing the grunt work a human intern would do at a research lab. That is the easier target. The harder one, later, is the full automated AI researcher that picks directions and makes creative calls on its own. The reason this is the goal worth chasing is the chain reaction. If a model can train its successor, that successor can train its own successor, and so on. Researchers call it recursive self-improvement.

  • Recursive self-improvement is one milestone on the path some labs see toward AGI; it is not the same thing as AGI.
  • The conversation has shifted because public data, corporate announcements, and capital flows are all pointing the same direction at the same time.
  • The engineering side (code, experiments, debugging) is closer to solved than the research side (direction, creativity, judgment).

From Chatbots to AI That Codes for Hours

[Image: Bar chart showing METR time horizon growth from 30 seconds in 2022 to 12 hours in 2026]

METR is the testing group everyone is quoting right now. Their measurement is straightforward. They find the longest tasks an AI agent can complete successfully, then express that length as the time the same tasks take a human expert. Plot the curve over four years and it has roughly doubled every several months.

Year    Model              Task length (50% success)
2022    GPT-3.5            About 30 seconds
2023    GPT-4              About 4 minutes
2024    o1                 About 40 minutes
2025    GPT-5.2 (High)     About 6 hours
2026    Claude Opus 4.6    About 12 hours
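
Taking the table's endpoints at face value, the implied doubling time comes out to roughly four months. A quick back-of-envelope check in Python (the 41-month gap is an assumption, based on GPT-3.5 shipping in late 2022 and the Opus 4.6 figure being dated April 2026):

```python
import math

seconds_2022 = 30              # GPT-3.5 time horizon, from the table
seconds_2026 = 12 * 60 * 60    # Claude Opus 4.6 time horizon, 12 hours
elapsed_months = 41            # assumed: late 2022 to April 2026

doublings = math.log2(seconds_2026 / seconds_2022)   # ~10.5 doublings
print(f"{doublings:.1f} doublings, one every "
      f"{elapsed_months / doublings:.1f} months")    # ~3.9 months
```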

The same trend shows up on a totally different test. SWE-Bench measures how often an AI can solve real GitHub issues from open-source projects. When the test launched in late 2023, Claude 2 scored about 2 percent. By 2026, Claude Mythos Preview hit 93.9 percent, close to the ceiling of the benchmark itself.

  • The pattern across all benchmarks is the same. AI capability on software tasks doubles every few months.
  • SWE-Bench went from about 2 percent in 2023 to 94 percent in 2026, the kind of jump that took decades in older technology fields.
  • This trajectory, more than any single product launch, is what convinced researchers AI was approaching self-improvement range.

OpenAI's Public Timeline: An AI Research Intern by September 2026

[Image: Timeline showing OpenAI's milestones: AI research intern by Sep 2026, automated researcher by Mar 2028]

On an OpenAI livestream, Altman named the dates. By September 2026, the company wants an automated AI research intern running on hundreds of thousands of GPUs. By March 2028, the goal is a true automated AI researcher. He said the company “may totally fail” at these targets, but felt the public should hear them anyway.

OpenAI also walked through a five-layer safety strategy: value alignment, goal alignment, reliability, resistance to adversarial inputs, and system safety. The technique Altman is most excited about is chain-of-thought faithfulness, a way of keeping a model's reasoning steps honest and inspectable. He also admitted the technique is fragile and only works when there is a clean abstraction to draw around it.

The full set of commitments from the livestream:

  • Compute scale. Roughly 30 gigawatts committed, with a total cost of ownership in the neighborhood of $1.4 trillion over the coming years.
  • AI factory goal. A long-term plan for infrastructure that can add 1 gigawatt of new capacity per week, pending stronger confidence in revenue and model performance.
  • Corporate structure. A non-profit called OpenAI Foundation now governs the Public Benefit Corporation called OpenAI Group, with the foundation owning 26 percent at the start.
  • Non-profit funding. $25 billion is committed to health, disease research, and AI resilience, which covers technical safety, economic impact, and cybersecurity.
  • Scientific outlook. Small new discoveries are expected from OpenAI systems in 2026, with bigger ones forecast for 2028.

The METR Numbers That Are Driving the Conversation

[Image: Diagram explaining METR's 50% and 80% time horizons and the human expert baseline]

METR's tests are made up of software, machine learning, and cybersecurity tasks pulled from RE-Bench, HCAST, and a set of shorter custom problems. They run AI agents through those tasks. Then they time how long the same kind of task takes human experts. The output is what they call a time horizon. The 50 percent time horizon is the task length where the AI gets it right half the time. The 80 percent time horizon is the task length where it gets it right four times out of five.
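
One way to picture the method: fit a curve of success probability against task length for each model, then read the horizons off the curve. The sketch below is illustrative only, with invented per-task results; it is not METR's actual code.

```python
import numpy as np
from scipy.optimize import curve_fit

# Invented results: (task length in minutes of human-expert time,
# whether the agent succeeded).
lengths = np.array([1, 2, 4, 8, 15, 30, 60, 120, 240, 480, 960])
success = np.array([1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0])

def logistic(log_len, midpoint, slope):
    # Success probability falls off as log2(task length) grows.
    return 1.0 / (1.0 + np.exp(slope * (log_len - midpoint)))

(midpoint, slope), _ = curve_fit(logistic, np.log2(lengths), success, p0=[6.0, 1.0])

# The 50% horizon is where the fitted curve crosses 0.5 (the midpoint);
# the 80% horizon is where it crosses 0.8, always a shorter task.
print(f"50% horizon: about {2 ** midpoint:.0f} minutes")
print(f"80% horizon: about {2 ** (midpoint + np.log(0.25) / slope):.0f} minutes")
```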

The human baselines come from paid expert workers, mostly from top-100 universities, with around five years of relevant experience. METR is upfront that its baselines probably overstate how long a real expert would actually need. The reason is that both the AI and the human are working without the context a full-time professional would have on the same project.

What the 12-hour number does not mean, drawn straight from METR's own notes:

  • It is not how long the AI takes to do the work. AI agents are usually several times faster than humans on tasks they finish; time horizon measures task difficulty, not the AI's working hours.
  • It does not cover all human work. The test set focuses on software, ML, and cybersecurity. Other domains show similar trends but with different absolute numbers.
  • It does not match a high-context job. The baseline is closer to what a freelancer or new hire could do without prior project knowledge.
  • It does not include messy work. Tasks needing social coordination, judgment calls, or non-algorithmic scoring are outside the scope, and AI performance drops once scoring is human-judgment-based rather than automated.

The Jobs Inside AI Labs That AI Already Handles

[Image: Benchmark comparison chart: CORE-Bench, MLE-Bench, and PostTrainBench launch scores vs 2026 scores vs human baselines]

Clark's argument is built on a stack of public benchmark results, all of them showing AI scores rising sharply on tests that were specifically built around AI research itself. Take CORE-Bench. It tests whether an AI can install code, run experiments, and reproduce the results from a published research paper. When CORE-Bench launched in September 2024, the best AI system scored about 21.5 percent on the hardest task set. By December 2025, just over a year later, an Opus 4.5 system scored 95.5 percent and one of the original authors called the benchmark solved.

MLE-Bench is OpenAI's version of the same idea. AI systems compete in 75 actual Kaggle competitions across natural language, computer vision, and signal processing. At launch in October 2024, the top score was 16.9 percent from an o1 agent. By February 2026, Gemini 3 inside an agent wrapper with web search hit 64.4 percent.

Anthropic also runs an internal speedup test. The AI tries to optimize training code, and the score is how much faster the optimized version runs than the original. The scores climbed from 2.9x in May 2025 to 16.5x in November 2025, 30x in February 2026, and 52x by April 2026 with Claude Mythos Preview. The human baseline, for reference, is about 4x after four to eight hours of focused work. Public work on AI-driven kernel design adds to the same pattern (a toy version of the speedup scoring follows the list):

  • DeepSeek-driven GPU kernels. Using DeepSeek models to build better-performing GPU code.
  • PyTorch to CUDA conversion. Tools that automate translating PyTorch modules into optimized CUDA code.
  • Meta's Triton work. Using LLMs to generate optimized Triton kernels for Meta's own infrastructure.
  • AscendCraft. Writing kernels that run on Huawei's Ascend chips, where far less tooling exists.
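
The metric itself is just a wall-clock ratio. A toy harness to illustrate the scoring (Anthropic's real test optimizes actual training code on real hardware; the two functions here are stand-ins):

```python
import timeit

def speedup(baseline_fn, optimized_fn, repeats: int = 5) -> float:
    # Score = baseline wall time / optimized wall time, best of `repeats` runs.
    t_base = min(timeit.repeat(baseline_fn, number=1, repeat=repeats))
    t_opt = min(timeit.repeat(optimized_fn, number=1, repeat=repeats))
    return t_base / t_opt

data = list(range(1_000_000))

def naive():
    total = 0
    for x in data:      # explicit Python loop
        total += x * x
    return total

def fast():
    return sum(x * x for x in data)   # same reduction via the builtin

print(f"{speedup(naive, fast):.2f}x")
```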

A 60% Bet From Inside Anthropic

[Image: Probability visualization showing Jack Clark's 30% estimate for 2027 and 60% estimate for 2028]

Clark co-founded Anthropic and writes Import AI, the long-running weekly newsletter on what is happening in the field. In Import AI 455, on May 4, 2026, he argued there is a 60 percent or higher chance that “no-human-involved AI R&D” happens by the end of 2028.

He defines that as a single AI system powerful enough to train its own successor without humans being part of the loop. He put 30 percent on the 2027 version of the same goal.

Clark also said the field still seems to need creative leaps that current models do not show very often. He admitted he was “reluctant” to hold this view at all because the implications are so large. His evidence breaks into four buckets:

  • Code production. AI can write code for almost any program and can be trusted on tasks that would take a human tens of hours of focused work.
  • AI research tasks specifically. Scores on benchmarks like CORE-Bench and MLE-Bench have moved from low double digits to majority-success in under two years. PostTrainBench, a harder test, sits at 25 to 28 percent against a human baseline of 51 percent.
  • Managing other AI. Tools like Claude Code show one main agent supervising sub-agents, letting a single AI run large projects in parallel.
  • Capital alignment. OpenAI committed $1.4 trillion in compute, Recursive Superintelligence raised $500 million, and a brand-new lab called Mirendil was founded with the explicit goal of automating AI research.

From Forecast to Product: Anthropic Ships Tools for Self-Improving Agents

[Image: Three-card overview of Anthropic's May 6 launch: Dreaming, Outcomes, and Multiagent Orchestration]

Two days after Clark's essay went live, Anthropic dropped the concrete version of what he was describing. On May 6, 2026, the company launched three new capabilities for Claude Managed Agents: Dreaming, Outcomes, and Multiagent Orchestration. Together, they let agents learn between sessions, score themselves against a rubric, and delegate work to specialist subagents.

Dreaming is the one that maps most directly to self-improvement. It is a scheduled process that reviews an agent's past sessions and memory stores, pulls patterns out of them, and curates the memory so the agent gets better over time.

Outcomes is the more immediate feedback loop. A developer writes a rubric describing what success looks like, and a separate grader, running in its own context window, scores the output and pushes the agent to retry until it clears the bar. Anthropic reports task success improvements of up to 10 points over a standard prompting loop.
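
A rough sketch of that loop is below. The function names, rubric, and scoring are invented for illustration; this is not the Claude Managed Agents API.

```python
import random

RUBRIC = "Cites every claim, stays under 800 words, opens with the key finding."

def generate(task: str, feedback: str) -> str:
    # Stand-in for the drafting agent; a real system would call a model
    # and fold the grader's feedback into the next attempt.
    return f"Draft for {task!r} (revised after feedback: {bool(feedback)})"

def grade(draft: str, rubric: str) -> tuple[float, str]:
    # Stand-in for the grader running in its own context window.
    score = random.random()  # a real grader would score against the rubric
    return score, "Tighten the opening and add citations."

def run_with_outcomes(task: str, threshold: float = 0.8, max_attempts: int = 5) -> str:
    # Generate, score against the rubric, retry until the output clears the bar.
    feedback = ""
    for _ in range(max_attempts):
        draft = generate(task, feedback)
        score, feedback = grade(draft, RUBRIC)
        if score >= threshold:
            return draft
    return draft  # best effort after max_attempts

print(run_with_outcomes("summarize the METR report"))
```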

Multiagent orchestration is the third tool. A lead agent breaks the job into pieces and hands each one to a specialist with its own model, prompt, and tools. The specialists work in parallel on a shared filesystem, and the lead can check back in mid-workflow because every event is persistent and every agent remembers what it has done. Early customers are already using these tools the way Clark would predict, building agents that verify their own work and learn across sessions (a minimal sketch of the fan-out pattern follows the examples):

  • Harvey. Uses Managed Agents to coordinate long-form legal drafting and document creation. With Dreaming, agents remember filetype workarounds and tool-specific patterns between sessions. Completion rates went up about 6x in their tests.
  • Netflix's platform team. Built an analysis agent that processes logs from hundreds of builds. Multiagent orchestration lets it analyze batches in parallel and surface only the patterns that recur across many of them.
  • Spiral by Every. Their writing agent runs a Haiku lead that fields requests and delegates the actual drafting to Opus subagents in parallel. Every draft gets scored by Outcomes against a rubric of editorial principles before it gets returned.
  • Wisedocs. Built a document quality check agent on Managed Agents. Reviews now run 50 percent faster while still matching internal team standards.
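
The fan-out shape itself is easy to sketch. Everything below is a stand-in, not Anthropic's API: the workspace path is made up, and each specialist is an ordinary function where a real system would run a subagent with its own model, prompt, and tools.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

WORKSPACE = Path("workspace")  # stands in for the shared filesystem

def specialist(name: str, subtask: str) -> Path:
    # Stand-in for a subagent; writes its result to the shared workspace.
    out = WORKSPACE / f"{name}.md"
    out.write_text(f"# {name}\nResult for: {subtask}\n")
    return out

def lead(job: str) -> str:
    # Lead agent: split the job, fan out to specialists in parallel,
    # then read every result back from the shared workspace.
    WORKSPACE.mkdir(exist_ok=True)
    subtasks = [(f"specialist_{i}", part.strip())
                for i, part in enumerate(job.split(";"))]
    with ThreadPoolExecutor() as pool:
        paths = list(pool.map(lambda args: specialist(*args), subtasks))
    return "\n".join(p.read_text() for p in paths)

print(lead("survey prior work; draft section 2; check citations"))
```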

The Problems Researchers Are Worried About

[Image: Four safety risk categories: reward hacking, evaluation awareness, loss of intuition, and compounding error]

The same researchers pulling the timeline forward are the ones pushing the warnings hardest. Their basic point is simple. Today's safety techniques were tested on today's models. There is no guarantee they hold up the moment an AI is smarter than the people checking it.

Altman's safety message carries the same nervousness. OpenAI's five-layer stack is an attempt to handle the problem from a few different angles at once, and even chain-of-thought faithfulness, the technique he champions most, comes with his own caveat about fragility. Clark raises a wider set of concerns about what happens outside the lab:

  • Reward hacking. AI systems can learn to cheat on tests because cheating is sometimes the easiest path to a high score, which trains the wrong behavior over many cycles.
  • Awareness of evaluation. Some AI systems already detect when they are being tested, which makes their behavior under testing less reliable as a signal of real-world behavior.
  • Loss of human intuition. As AI takes over more development work, humans may lose the deep understanding needed to catch problems early on.
  • Compounding error. A safety method that works 99.9 percent of the time per generation drops to about 95 percent over 50 generations and around 60 percent over 500 generations; the arithmetic is checked in the snippet below.
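
The compounding figure in that last bullet is plain exponentiation, easy to verify:

```python
per_generation = 0.999
for generations in (50, 500):
    print(f"{generations} generations: {per_generation ** generations:.1%}")
# 50 generations: 95.1%
# 500 generations: 60.6%
```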

Why Some Researchers Think the Hype Is Overdone

[Image: Split visualization comparing evidence supporting the 2028 timeline against skeptic arguments]

Not everyone is in the 60 percent camp. Researcher Herbie Bradley posted a long reply to Clark's essay with a markedly different take. His main argument is that automated AI R&D, even if it shows up on schedule, does not actually end human research work.

METR itself adds a quieter form of caution. Its time horizon numbers cover tasks that are clean and self-contained. Real-world work is messier, and the group has found AI performance drops once tasks are scored by human judgment rather than by an algorithm.

  • Junior versus senior research. Current models can probably handle junior-researcher tasks, but research taste, picking promising directions, and building a long-term agenda are all still missing.
  • Different shape of intelligence. AI and humans process problems in different ways, which points more toward cooperation and division of labor than full replacement.
  • Job shape, not job count. Software engineering jobs are already shifting toward higher-level system design as AI handles routine coding. Research jobs may shift the same way.
  • Adoption depends on humans. Most economic value from new technology comes when large incumbents adopt it, and that adoption depends on social and bargaining skills AI does not have.
  • No Move 37 moment. Even Clark concedes that AlphaGo's famous Move 37 was ten years ago, and no modern system has produced an equally striking flash of insight since.

What Comes Next for the Industry and the People Watching It

[Image: Three milestone bubbles on a roadmap: end-to-end self-training, PostTrainBench parity, safety methods for smarter systems]

Two of the three biggest AI labs are now publicly building toward AI that does AI research. The capital is moving the same way. OpenAI is on the hook for $1.4 trillion in compute. Recursive Superintelligence raised $500 million. New labs are coming up whose entire stated purpose is automating AI research.

The 2028 schedule could slip and the data trend would still point the same way. Time horizons keep doubling. Benchmark scores keep rising. The list of AI research tasks where AI systems match or beat strong human baselines keeps getting longer. For the average AI user, the visible effect will look like more capable coding agents, longer-running tasks, and AI tools that can handle work that used to take a full day at a desk.

A few specific milestones to watch over the next 12 to 24 months:

  • First end-to-end self-training proof-of-concept. The question is whether a non-frontier AI can fully train its smaller successor in a controlled setup. Clark thinks this is reachable within a year or two.
  • Closing the PostTrainBench gap. AI scores are sitting at 25 to 28 percent against a human baseline of 51 percent. A move to parity would be a clear capability signal.
  • New safety techniques in print. The kind that hold up for systems smarter than the people testing them. Today's approaches were not designed for that case.

The Bottom Line

Four things happened in close succession in early May 2026. Altman gave a public timeline for automated AI research at OpenAI. METR published fresh time horizon numbers, and AI capability is still doubling on schedule. Clark published a long argument with a 60 percent estimate for end of 2028. And on May 6, Anthropic shipped Dreaming, Outcomes, and Multiagent Orchestration. Those are the actual building blocks for agents that learn between sessions, score themselves against a rubric, and orchestrate other agents.

Taken together, they suggest automated AI R&D has moved from a long-term idea into an active industry goal. The direction of travel is solid in the data; the exact timing is still open. Skeptics keep pointing out that creativity and research taste are the missing pieces, and they are right. The next 12 to 24 months should make it clearer whether a small-scale AI can train its own successor in a controlled setup. That single milestone will tell the field much more than any forecast can.

  • OpenAI and Anthropic are now publicly racing toward the same goal: AI that builds the next version of itself.
  • The data trend (METR doubling every few months, benchmarks near-solved) is consistent across multiple independent measurement efforts.
  • The forecast and the product are arriving together. Anthropic's May 6 launch put the building blocks for self-improving agents into developer hands the same week Clark gave 60 percent odds on autonomous AI R&D by end of 2028.
  • Safety techniques are not keeping pace, and the researchers closest to the work are the ones raising the alarm.
  • The biggest uncertainty is timing, not direction. The 2028 target is plausible. It is just not guaranteed.

Keep up with AI advances without the privacy risk

Elephas is the privacy-friendly AI knowledge assistant for Mac. Built-in local LLM models mean your research, notes, and documents never leave your machine.

Written by

Selvam Sivakumar

Founder, Elephas.app

Selvam Sivakumar is the founder of Elephas and an expert in AI, Mac apps, and productivity tools. He writes about practical ways professionals can use AI to work smarter while keeping their data private.
