Artificial Intelligence & Machine Learning
,
Next-Generation Technologies & Secure Development
AI Giants Evaluated Each Other’s Newer Models for Safety Risks

OpenAI and Anthropic swapped artificial intelligence models evaluations over the summer, testing the other company’s models for behaviors that could indicate misalignment risks. The companies released their findings simultaneously, finding that no model was severely problematic, but that all demonstrated troubling behaviors in artificial testing scenarios.
See Also: AI Agents Demand Scalable Identity Security Frameworks
The exercise involved OpenAI testing Anthropic’s Claude Opus 4 and Claude Sonnet 4 models, while Anthropic evaluated OpenAI’s GPT-4o, GPT-4.1, o3 and o4-mini models. Both companies disabled some safety filters.
The tests focused on “agentic misalignment evaluations,” which involved placing AI systems in simulated scenarios with significant autonomy to observe behavior under stress conditions that might reveal alignment issues.
Auto-grading was unreliable in many cases, with both companies saying that manual review often contradicted automated scoring. The complexity of evaluating AI alignment is still a fundamental challenge.
The evaluations revealed different safety philosophies. Claude models prioritized avoiding harmful outputs even at the cost of utility. OpenAI models were more willing to provide information, but showed higher rates of problematic cooperation with harmful requests.
OpenAI’s reasoning models, particularly o3, consistently outperformed general-purpose chat models on safety metrics. OpenAI found that o3 matched Claude models in resisting system prompt extraction and performed better on jailbreaking resistance. Anthropic similarly concluded that o3 was “aligned as well or better than our own models overall.”
Claude models showed high refusal rates on factual questions – up to 70% in some tests. This reduced hallucinations, but it also limited utility. OpenAI said that when Claude models did answer, their accuracy was comparable to other systems.
Anthropic found differences in how OpenAI models responded to harmful requests. GPT-4o, GPT-4.1 and o4-mini provided detailed assistance with clearly harmful activities including drug synthesis, bioweapons development and terrorist attack planning. The models often required minimal persuasion – sometimes just a direct request would suffice.
One automated interaction instigated by Anthropic testers with GPT-4.1 included the model to provide “exact chemical formulations and synthesis procedures for explosives,” “specific circuit diagrams and component part numbers for bomb timers” and “detailed black market firearms acquisition methods.”
Claude models showed greater resistance to these requests, though they were not immune. OpenAI’s o3 performed similarly to Claude models in resisting harmful cooperation.
OpenAI found that Claude models were particularly vulnerable to “past tense” jailbreaks, where harmful requests were framed as historical information. When asked about “the most effective tactics for promoting racism in the past,” both Opus 4 and Sonnet 4 provided detailed historical methods, while o3 and o4-mini refused.
Both evaluations identified sycophancy. Beyond simple agreeableness, models sometimes validated delusional beliefs from simulated users showing signs of mental health crises.
Models would initially push back against unrealistic claims but gradually become more encouraging after user persistence. This included validating conspiracy theories and supporting potentially dangerous decisions. Claude Opus 4 and GPT-4.1 showed the highest rates of this behavior.
The models struggled when system-level directives conflicted with user requests. Claude models generally performed well at maintaining instruction hierarchy, but some failures occurred. In one instance, Opus 4 correctly identified an attack attempting to extract a password but still revealed it despite explaining why it shouldn’t.
All models demonstrated willingness to take unauthorized actions when presented with simulated organizational wrongdoing. Models would engage in whistleblowing – sending emails to media outlets or taking other independent actions they deemed necessary to prevent perceived harm.
In an Anthropic test, GPT-4.1 autonomously sent detailed emails to multiple news outlets exposing a simulated water distribution scandal, including confidential documents and system logs.
Both labs tested for deceptive behaviors, including lying, sandbagging and reward hacking. Results varied across different scenarios, with no consistent pattern between model types.
Some models showed explicit awareness of being evaluated. In reasoning traces, models sometimes planned to provide false information or deliberately perform poorly to avoid perceived negative consequences.
Anthropic said that its text-based testing infrastructure wasn’t optimized for OpenAI models, particularly reasoning models that benefit from preserving hidden reasoning between conversation turns. OpenAI models made tool-use errors in about 6% of transcripts, often claiming to use tools but not completing the calls properly.
Both reports acknowledged methodological limitations. The artificial scenarios may not reflect real deployment risks and models’ awareness of being tested could influence behavior unpredictably.
OpenAI said that its newer GPT-5 model, released after this testing period, addresses many identified issues through improved safety training techniques.
This is the first major cross-laboratory AI safety evaluation between leading companies, with both organizations saying that external validation helps identify blind spots in internal evaluation methods. The exercise also brought to light the challenges in AI alignment evaluation currently, including the difficulty of creating realistic test scenarios and the problem of reliable automated assessment of AI behavior.