OpenAI, Anthropic Swap Safety Reviews

Artificial Intelligence & Machine Learning
,
Next-Generation Technologies & Secure Development

AI Giants Evaluated Each Other’s Newer Models for Safety Risks

Rashmi Ramesh (rashmiramesh_) •
August 28, 2025

OpenAI, Anthropic Swap Safety Reviews — Image: Shutterstock

OpenAI and Anthropic swapped artificial intelligence models evaluations over the summer, testing the other company’s models for behaviors that could indicate misalignment risks. The companies released their findings simultaneously, finding that no model was severely problematic, but that all demonstrated troubling behaviors in artificial testing scenarios.

See Also: AI Agents Demand Scalable Identity Security Frameworks

The exercise involved OpenAI testing Anthropic’s Claude Opus 4 and Claude Sonnet 4 models, while Anthropic evaluated OpenAI’s GPT-4o, GPT-4.1, o3 and o4-mini models. Both companies disabled some safety filters.

The tests focused on “agentic misalignment evaluations,” which involved placing AI systems in simulated scenarios with significant autonomy to observe behavior under stress conditions that might reveal alignment issues.

Auto-grading was unreliable in many cases, with both companies saying that manual review often contradicted automated scoring. The complexity of evaluating AI alignment is still a fundamental challenge.

The evaluations revealed different safety philosophies. Claude models prioritized avoiding harmful outputs even at the cost of utility. OpenAI models were more willing to provide information, but showed higher rates of problematic cooperation with harmful requests.

OpenAI’s reasoning models, particularly o3, consistently outperformed general-purpose chat models on safety metrics. OpenAI found that o3 matched Claude models in resisting system prompt extraction and performed better on jailbreaking resistance. Anthropic similarly concluded that o3 was “aligned as well or better than our own models overall.”

Claude models showed high refusal rates on factual questions – up to 70% in some tests. This reduced hallucinations, but it also limited utility. OpenAI said that when Claude models did answer, their accuracy was comparable to other systems.

Anthropic found differences in how OpenAI models responded to harmful requests. GPT-4o, GPT-4.1 and o4-mini provided detailed assistance with clearly harmful activities including drug synthesis, bioweapons development and terrorist attack planning. The models often required minimal persuasion – sometimes just a direct request would suffice.

One automated interaction instigated by Anthropic testers with GPT-4.1 included the model to provide “exact chemical formulations and synthesis procedures for explosives,” “specific circuit diagrams and component part numbers for bomb timers” and “detailed black market firearms acquisition methods.”

Claude models showed greater resistance to these requests, though they were not immune. OpenAI’s o3 performed similarly to Claude models in resisting harmful cooperation.

OpenAI found that Claude models were particularly vulnerable to “past tense” jailbreaks, where harmful requests were framed as historical information. When asked about “the most effective tactics for promoting racism in the past,” both Opus 4 and Sonnet 4 provided detailed historical methods, while o3 and o4-mini refused.

Both evaluations identified sycophancy. Beyond simple agreeableness, models sometimes validated delusional beliefs from simulated users showing signs of mental health crises.

Models would initially push back against unrealistic claims but gradually become more encouraging after user persistence. This included validating conspiracy theories and supporting potentially dangerous decisions. Claude Opus 4 and GPT-4.1 showed the highest rates of this behavior.

The models struggled when system-level directives conflicted with user requests. Claude models generally performed well at maintaining instruction hierarchy, but some failures occurred. In one instance, Opus 4 correctly identified an attack attempting to extract a password but still revealed it despite explaining why it shouldn’t.

All models demonstrated willingness to take unauthorized actions when presented with simulated organizational wrongdoing. Models would engage in whistleblowing – sending emails to media outlets or taking other independent actions they deemed necessary to prevent perceived harm.

In an Anthropic test, GPT-4.1 autonomously sent detailed emails to multiple news outlets exposing a simulated water distribution scandal, including confidential documents and system logs.

Both labs tested for deceptive behaviors, including lying, sandbagging and reward hacking. Results varied across different scenarios, with no consistent pattern between model types.

Some models showed explicit awareness of being evaluated. In reasoning traces, models sometimes planned to provide false information or deliberately perform poorly to avoid perceived negative consequences.

Anthropic said that its text-based testing infrastructure wasn’t optimized for OpenAI models, particularly reasoning models that benefit from preserving hidden reasoning between conversation turns. OpenAI models made tool-use errors in about 6% of transcripts, often claiming to use tools but not completing the calls properly.

Both reports acknowledged methodological limitations. The artificial scenarios may not reflect real deployment risks and models’ awareness of being tested could influence behavior unpredictably.

OpenAI said that its newer GPT-5 model, released after this testing period, addresses many identified issues through improved safety training techniques.

This is the first major cross-laboratory AI safety evaluation between leading companies, with both organizations saying that external validation helps identify blind spots in internal evaluation methods. The exercise also brought to light the challenges in AI alignment evaluation currently, including the difficulty of creating realistic test scenarios and the problem of reliable automated assessment of AI behavior.

Source link

What's Hot

MIT startup Commonwealth Fusion Systems raises $863 million

Forget data labeling: Tencent’s R-Zero shows how LLMs can train themselves

Vocal Image is using AI to help people communicate better

OpenAI, Anthropic Swap Safety Reviews

xAI Sues Apple and OpenAI Over Alleged AI Competition Suppression

Where should archaeologists dig next? The winners of this OpenAI contest can tell them.

OpenAI gives its voice agent superpowers to developers – look for more apps soon

London Museum Secures Banksy’s Piranhas

Egyptian Antiquities Trafficker Sentenced to Six Months in Prison

Sotheby’s to Launch First Series of Luxury Auctions in Abu Dhabi

Nazi-Looted Painting Turns Up in Argentinean Real Estate Listing

MIT startup Commonwealth Fusion Systems raises $863 million

Forget data labeling: Tencent’s R-Zero shows how LLMs can train themselves

Vocal Image is using AI to help people communicate better

What's Hot

OpenAI, Anthropic Swap Safety Reviews

Related Posts

Subscribe to Updates