AI Safety · 9 min read

The Biggest Change to Anthropic's AI Safety Policy in Two Years, Explained

Anthropic released version 3.0 of its Responsible Scaling Policy on February 24, 2026. This is the biggest rewrite since the original policy launched in September 2023. Back then, large language models were chatbots that could answer questions. Now they browse the web, write code, use computers, and take autonomous multi-step actions. The risks have changed, and Anthropic is admitting that the old rules no longer fit. This article covers what the original policy said, what changed in v3.0, what stayed the same, and what all of this means for the AI industry and for people who use Claude every day.

At a Glance

  • v3.0: latest RSP version, released February 24, 2026
  • 2.5 years: since the original RSP launched in September 2023
  • 3–6 months: new Risk Report publishing cadence
  • 4 areas: Frontier Safety Roadmap focus areas (Security, Alignment, Safeguards, Policy)

Executive Summary

Anthropic rewrote its Responsible Scaling Policy after two and a half years of real-world use. The original policy (September 2023) used a tiered AI Safety Levels system to match model capabilities with required safeguards. Version 3.0 (February 2026) overhauls that approach based on what worked and what didn't.

  • Anthropic now separates what it will do as a single company from what it believes the entire AI industry needs to do together, acknowledging that higher-level safety is a collective action problem
  • A new Frontier Safety Roadmap replaces some hard commitments with publicly tracked goals across Security, Alignment, Safeguards, and Policy
  • Risk Reports published every 3 to 6 months will give the public detailed visibility into model capabilities, threat models, and risk mitigations
  • Independent external reviewers will publicly critique Risk Reports for highly capable models, with no financial ties to Anthropic and full freedom to criticize
  • All existing ASL-3 protections for bioweapon and chemical weapon risks stay in place, and the Responsible Scaling Officer role, board oversight, and anonymous reporting channels remain unchanged

What Is Anthropic's Responsible Scaling Policy


Anthropic's Responsible Scaling Policy (RSP) is a set of internal rules the company created for itself to manage the risks of building increasingly powerful AI systems. It was first published on September 19, 2023, and it has gone through several revisions since then. The core idea is straightforward: as AI models get more capable, certain safety checks and security measures need to be in place before Anthropic can train or release a new model. If the safety measures aren't ready, development pauses until they are.

Anthropic modeled the RSP loosely after the US government's biosafety level system, which classifies how dangerous a biological agent is and then requires specific lab protocols for each level. The RSP applies that same logic to AI: different levels of capability mean different levels of required protection. The policy focuses specifically on catastrophic risks, meaning scenarios where an AI model could directly cause large-scale devastation — think bioweapons assistance, autonomous AI acting destructively, or state-level cyberattacks.

  • The RSP was first published September 19, 2023, and has been updated four times since (v2.0, v2.1, v2.2, and now v3.0)
  • It is a voluntary policy, not a government requirement, though it has gone on to influence real regulation in California (SB 53), New York (the RAISE Act), and the EU (AI Act Codes of Practice)
  • The policy covers only catastrophic risks, not everyday product issues like incorrect answers or biased outputs
  • It was formally approved by Anthropic's board, and any changes require board approval in consultation with the Long-Term Benefit Trust

How the Original AI Safety Levels System Worked


The original RSP introduced a framework called AI Safety Levels, or ASLs. Each level described a category of AI capability and the safeguards required to match it.

ASL-1 covered systems with no meaningful catastrophic risk. A chess-playing AI or a basic 2018-era language model fell into this bucket. ASL-2 covered systems that showed early signs of dangerous capabilities, like being able to give instructions on how to build bioweapons, but where the information wasn't yet more reliable or useful than what someone could find through a search engine. Anthropic classified all current Claude models at ASL-2 when the policy launched.

ASL-3 covered systems that could substantially increase catastrophic risk compared to non-AI tools, or that showed low-level autonomous capabilities. This level required much stricter security, adversarial red-teaming by world-class experts, and a commitment not to deploy models showing meaningful catastrophic misuse risk. Anthropic activated ASL-3 safeguards for relevant models in May 2025. ASL-4 and beyond were intentionally left undefined, with the plan to write those standards before reaching ASL-3.

ASL Levels at a Glance

  • ASL-1: No meaningful catastrophic risk (chess AI, 2018-era chatbots)
  • ASL-2: Early dangerous knowledge but not more useful than a search engine (Claude was classified here at launch)
  • ASL-3: Substantial increase in catastrophic risk, requiring strict security and deployment controls (activated May 2025)
  • ASL-4+: Intentionally left undefined, with standards to be written before models reached ASL-3

The underlying logic was conditional: if a model crosses a capability threshold, then a specific set of protections must be in place. If those protections aren't ready, Anthropic must pause scaling until they are.
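
As a minimal sketch of that conditional logic (hypothetical names and safeguard lists, purely for illustration; the real process involves human evaluation, red-teaming, and board oversight rather than a function call):

```python
# Minimal, hypothetical sketch of the RSP's if-then gating logic.
ASL_SAFEGUARDS = {
    "ASL-2": {"misuse_filtering", "acceptable_use_enforcement"},
    "ASL-3": {"misuse_filtering", "acceptable_use_enforcement",
              "cbrn_classifiers", "expert_red_teaming", "hardened_security"},
}

def may_continue_scaling(capability_level, safeguards_in_place):
    """A model that crosses a capability threshold may proceed only if
    every safeguard required at that level is already in place."""
    return ASL_SAFEGUARDS[capability_level] <= set(safeguards_in_place)

# An ASL-3-capable model with only ASL-2 safeguards ready: scaling pauses.
print(may_continue_scaling("ASL-3", {"misuse_filtering",
                                     "acceptable_use_enforcement"}))  # False
```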

Why Anthropic Decided to Rewrite the Policy After Two Years

Two and a half years of running the RSP gave Anthropic a clear picture of what worked and what didn't. On the positive side, the policy did what it was supposed to do internally: it acted as a forcing function. To meet ASL-3 deployment standards, Anthropic developed input and output classifiers that block content related to chemical and biological weapons.
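
Conceptually, those deployment safeguards wrap the model on both sides: one classifier screens the incoming request, another screens the outgoing response. The sketch below is heavily simplified and hypothetical; a production classifier is itself a trained model, not the keyword check used here as a stand-in:

```python
# Hypothetical sketch of input/output classifier gating. The keyword
# check is only a stand-in to show where the two gates sit; real
# classifiers are trained models.
BLOCKLIST = ("synthesis route", "weaponization")  # illustrative placeholder

def flagged(text):
    """Stand-in for a trained harm classifier."""
    return any(term in text.lower() for term in BLOCKLIST)

def guarded_generate(prompt, generate):
    if flagged(prompt):            # input classifier: screen the request
        return "Request declined."
    reply = generate(prompt)
    if flagged(reply):             # output classifier: screen the response
        return "Response withheld."
    return reply

# Works with any text-generation callable:
print(guarded_generate("Hello!", lambda p: "Hi there."))  # "Hi there."
```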

The RSP also had an effect on the broader industry. Within a few months of Anthropic publishing the original policy, both OpenAI and Google DeepMind adopted similar frameworks. Several governments started requiring frontier AI developers to publish safety frameworks, including California with SB 53 and the EU through its AI Act Codes of Practice.

But other parts of the plan didn't play out. The idea of using capability thresholds to build consensus across the AI industry ran into a problem: evaluation science isn't mature enough to give clear answers about whether a model has crossed a threshold. Anthropic found itself in what it calls a “zone of ambiguity.” Models now pass most quick biological knowledge tests, so the company can't argue risks are low. But those tests alone aren't enough to make a strong argument that risks are high either. Wet-lab trials that could provide better answers take so long that more powerful models are available by the time results come in.

Government action on AI safety has also been slow. The policy environment shifted toward prioritizing AI competitiveness and economic growth, and safety discussions haven't gained serious traction at the federal level.

  • The RSP pushed Anthropic to develop specific defenses against bioweapon and chemical weapon misuse
  • OpenAI and Google DeepMind adopted broadly similar frameworks within months of the original RSP launch
  • Capability threshold ambiguity made it hard to build a public case for industry-wide action
  • Political climate shifted toward AI economic growth rather than safety regulation
  • A RAND report on model weight security stated that its highest security standard (SL5) is “currently not possible” and “will likely require assistance from the national security community”

The combination of ambiguous thresholds, slow government action, and future safeguards that may be impossible for one company to achieve alone created a structural problem. Anthropic chose to restructure the RSP rather than define higher safety levels in ways that would be easy to meet but meaningless.

Three Major Changes in RSP Version 3.0


The updated policy is restructured around three core changes.

Separating Company Plans from Industry Recommendations

Anthropic now splits its commitments into two categories. The first covers what Anthropic will do on its own, regardless of what competitors do. The second outlines what the entire AI industry would need to do to keep catastrophic risks reliably low.

The old RSP committed Anthropic to reducing its models' absolute risk to acceptable levels, without factoring in what other companies were doing. That sounds responsible, but it creates a problem at higher capability levels. If one company pauses development to build safety measures while others keep training and deploying without those protections, the result is a world where the weakest safety standards set the pace, and the company that paused loses its ability to do safety research at the frontier.

This situation hasn't happened yet, but Anthropic sees it as likely enough to plan for it. The new RSP separates achievable unilateral commitments from the broader industry-wide mitigations that Anthropic believes are necessary but can't accomplish alone.

  • Anthropic still commits to maintaining all existing ASL-3 protections for bioweapon and chemical weapon risks
  • The industry-wide recommendations are not binding commitments but a public map of what Anthropic believes responsible AI development requires across all frontier labs
  • This structure acknowledges that AI safety at higher capability levels is a collective action problem, not something one company can solve in isolation

Frontier Safety Roadmap

The RSP now requires Anthropic to develop and publish a Frontier Safety Roadmap. This roadmap lays out specific goals across four areas: Security, Alignment, Safeguards, and Policy. These are not hard promises — they are public goals that Anthropic will openly grade its own progress against. This approach borrows from the transparency strategy Anthropic has been advocating for in frontier AI legislation. It gives the public something concrete to track without locking Anthropic into commitments that might turn out to be technically impossible.

  • Security goals include launching “moonshot R&D” projects aimed at information security beyond anything currently possible
  • Alignment goals include systematic measures to ensure Claude follows its constitution
  • Safeguards goals include building automated red-teaming that outperforms the hundreds of participants in Anthropic's public bug bounty program
  • Policy goals include publishing a concrete regulatory roadmap proposing a “regulatory ladder” that scales with increasing AI risk
  • The Roadmap is shared with all full-time employees, the board of directors, and the Long-Term Benefit Trust

Risk Reports with External Review

Anthropic must now publish Risk Reports every three to six months. These go deeper than system cards. A Risk Report covers the model's capabilities, the specific threat models (the concrete ways the model could cause catastrophic harm), the active risk mitigations in place, and an overall assessment of whether the risks of training or deploying the model are justified by the benefits.

Risk Reports will be published online with some redactions to protect sensitive details about training methods and partner organizations. For models that reach high capability thresholds, or for reports with significant redactions, at least one independent external reviewer must publish a public critique. These reviewers must have no financial ties to Anthropic and must be free to say whatever they find, including outright criticism of Anthropic's reasoning or behavior.
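
To visualize what one of these reports contains, here is a hypothetical schema mirroring the four components above. Field names and the redaction threshold are invented; real Risk Reports are prose documents, not structured data:

```python
# Hypothetical schema for a Risk Report, mirroring the four components
# described above. Field names and the redaction threshold are invented.
from dataclasses import dataclass, field
from typing import List

@dataclass
class RiskReport:
    model_name: str
    capabilities: List[str]      # what the model can do
    threat_models: List[str]     # specific ways it could pose threats
    mitigations: List[str]       # active risk mitigations
    assessment: str              # are the risks justified by the benefits?
    redactions: List[str] = field(default_factory=list)

    def requires_external_review(self, highly_capable: bool) -> bool:
        """Public critique is required for highly capable models or for
        reports with significant redactions (threshold invented here)."""
        return highly_capable or len(self.redactions) >= 5
```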

  • Risk Reports cover capabilities, threat models, active mitigations, and overall risk assessment (system cards typically only cover capabilities and basic safety testing)
  • Published every 3 to 6 months, online, with redactions for sensitive operational details
  • External reviewers are required for highly capable models or reports with significant redacted sections
  • Reviewers get unredacted or minimally redacted access and must be financially independent of Anthropic
  • Anthropic is already running pilot programs for external review even though current models don't yet require it

What AI Capabilities Would Trigger Stricter Safety Measures

The full RSP v3.0 document lays out specific capability thresholds that would trigger different levels of protection. Anthropic now maps these thresholds to two columns of mitigations: what Anthropic will do alone, and what the industry should do collectively.

The lowest threshold involves non-novel bioweapon and chemical weapon production. If a model can meaningfully help someone with a basic science background create a dangerous weapon using known methods, that triggers the existing ASL-3 protections. Anthropic commits to maintaining these regardless of what competitors do.

The next level involves novel bioweapon production. If a model could help well-resourced expert teams create weapons with damage potential beyond past events like the COVID-19 pandemic, the required protections jump significantly. Anthropic says industry-wide protections should ideally meet RAND Security Level 4, a very high standard.

High-stakes sabotage is another threshold. If an AI system embedded in sensitive operations gains enough autonomous capability that it could carry out actions increasing the odds of a global-scale disaster, that triggers its own monitoring and restriction requirements.

The most serious threshold involves automated AI R&D. Anthropic's working definition of a “highly capable” model is one that could compress two or more years of 2018-to-2024 AI research progress into a single year. Reaching that level would trigger the most serious industry-wide response requirements.

Capability Thresholds & Required Responses

  • Non-novel bioweapon/chemical weapon assistance: triggers existing ASL-3 safeguards (Anthropic commits to these unilaterally)
  • Novel bioweapon production by expert teams: requires RAND Security Level 4 protections (industry-wide)
  • High-stakes autonomous sabotage: requires dedicated monitoring and restriction frameworks
  • Accelerated AI R&D (2+ years compressed into 1): the highest threshold, triggering maximum industry-wide response
  • Anthropic now requires a “strong argument for safety” rather than a fixed checklist, giving more flexibility but also making external verification harder
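
One way to picture the policy's two-column structure is as a mapping from each threshold to its unilateral and industry-wide responses. The sketch below paraphrases this article's summary; the keys and wording are invented, not quoted from the policy:

```python
# Invented illustration of the v3.0 two-column structure. Wording is
# paraphrased from this article, not quoted from the policy itself.
CAPABILITY_THRESHOLDS = {
    "non-novel bio/chem weapon assistance": {
        "anthropic_alone": "maintain existing ASL-3 safeguards",
        "industry_wide": "comparable deployment and security controls",
    },
    "novel bioweapon production by expert teams": {
        "anthropic_alone": "strong argument for safety before proceeding",
        "industry_wide": "RAND Security Level 4 protections",
    },
    "high-stakes autonomous sabotage": {
        "anthropic_alone": "monitoring and restriction of embedded systems",
        "industry_wide": "dedicated monitoring and restriction frameworks",
    },
    "automated AI R&D (2+ years compressed into 1)": {
        "anthropic_alone": "strong argument for safety before proceeding",
        "industry_wide": "maximum coordinated industry response",
    },
}
```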

What Did Not Change in the Updated Policy

Not everything got rewritten. Several core elements carry over from the previous versions. Anthropic still maintains all ASL-3 protections that were activated in May 2025. The input and output classifiers that block bioweapon and chemical weapon content remain in place, along with access controls for trusted users, red-teaming programs, and bug bounty programs. The Responsible Scaling Officer position stays, with the same core duties: overseeing policy compliance, approving model development decisions, and reviewing major contracts. Policy changes still require board approval in consultation with the Long-Term Benefit Trust.

  • All ASL-3 safeguards remain active (classifiers, access controls, red-teaming, bug bounty)
  • The Responsible Scaling Officer role and responsibilities are unchanged
  • Board approval with Long-Term Benefit Trust consultation is still required for policy changes
  • Employees can still report noncompliance anonymously without fear of non-disparagement clauses
  • An independent third-party procedural review still happens roughly once a year to verify Anthropic followed its own processes

Is This Update a Step Forward or a Step Back

There are legitimate arguments on both sides.

On the positive side, Anthropic is doing something most companies don't do: publicly admitting that parts of its safety framework didn't achieve what it set out to achieve. The separation of company-level commitments from industry-wide recommendations is a more honest framing of the collective action problem. One company can't solve AI safety at the highest levels on its own, and pretending otherwise doesn't help anyone. Risk Reports with external reviewers create more genuine accountability than the old system of internal evaluations. And the Frontier Safety Roadmap gives the public specific goals to track progress against.

The critical perspective is also worth considering. Separating “what Anthropic will do” from “what the industry should do” could be read as reducing hard obligations while maintaining the appearance of responsibility. The shift from specific ASL checklists to a “strong argument for safety” standard is less concrete and harder for outside observers to verify. External review only kicks in for highly capable models with significant redactions, which means plenty of reports won't face that scrutiny. And the entire framework assumes that government coordination on AI safety will eventually happen, but the political climate hasn't shown much movement in that direction.

Arguments in Favor

  • Transparent about what didn't work
  • Honest about collective action limits
  • External review adds real accountability
  • Public roadmap gives concrete goals

Arguments Against

  • Fewer hard commitments than before
  • “Strong argument for safety” is subjective
  • External review has a high trigger threshold
  • Relies on government action that hasn't materialized

How This Affects People Who Use Claude Every Day

If you use Claude for writing, coding, research, or conversation, nothing about your daily experience changes. The RSP operates at the research and deployment level, not at the product feature level. It won't affect what Claude can do in a chat window or how it responds to your prompts.

What does change is transparency. The Risk Reports that Anthropic will now publish every three to six months give regular users real visibility into what the company actually knows about its models' risks and capabilities. If Anthropic ever reaches a threshold where its models could seriously assist in creating weapons of mass destruction, the new policy requires it to publish what it found and explain what it's doing about it. The external review requirement means that if important safety information gets buried in redacted reports, independent experts have the authority and the incentive to call it out publicly.

The bottom line for everyday users: more transparency, not fewer features.

The Bottom Line

RSP v3.0 is what happens when a company runs a safety framework for two and a half years and is willing to say what didn't work. Some goals were met. The internal forcing function succeeded. The industry took notice. But capability thresholds turned out to be ambiguous, government action was slow, and some future safeguards look impossible for one company to achieve alone.

The new policy is more realistic about what Anthropic can do on its own and more transparent about what the whole industry needs to do. Whether this update represents progress depends entirely on follow-through. The Risk Reports, the Frontier Safety Roadmap, and the external reviews are only as good as the commitment behind them. Anthropic has always called the RSP a living document. Version 4.0 will come when the technology forces it.

