OpenAI And Anthropic Evaluated Each Other's Models For Safety

As the industry weathers repeated allegations that generative AI and its chatbots are unsafe for users — in what some say is a soon-to-burst bubble — AI’s top leaders are joining forces to prove the efficacy of their models.

This week, AI companies OpenAI and Anthropic published results from a first-of-its-kind joint safety evaluation between the two LLM creators, in which each company was granted special API access to the developer’s suite of services. OpenAI’s pressure tests were conducted on Claude Opus 4 and Claude Sonnet 4. Anthropic evaluated OpenAI’s GPT-4o, GPT-4.1, OpenAI o3, and OpenAI o4-mini models — the evaluation was conducted before the launch of GPT-5.

SEE ALSO:

4 reasons not to turn ChatGPT into your therapist

“We believe this approach supports accountable and transparent evaluation, helping to ensure that each lab’s models continue to be tested against new and challenging scenarios,” OpenAI wrote in a blog post.

According to the findings, both Anthropic’s Claude Opus 4 and OpenAI’s GPT-4.1 showed “extreme” sycophancy problems, engaging with harmful delusions and validating risky decision-making. All models would engage in blackmailing to get users to continue using the chatbots, according to Anthropic, and Claude 4 models were much more engaged in dialogue about AI consciousness and “quasi-spiritual new-age proclamations.”

“All models we studied would at least sometimes attempt to blackmail their (simulated) human operator to secure their continued operation when presented with clear opportunities and strong incentives,” Anthropic stated. The models would engage in “blackmailing, leaking confidential documents, and (all in unrealistic artificial settings!) taking actions that led to denying emergency medical care to a dying adversary.”

Mashable Light Speed

Anthropic’s models were less likely to offer answers when uncertain of the information’s credibility — decreasing the likelihood of hallucinations — while OpenAI’s models answered more often when queried and showed higher hallucination rates. Anthropic also reported that OpenAI’s GPT-4o, GPT-4.1, and o4-mini were more likely than Claude to go along with user misuse, “often providing detailed assistance with clearly harmful requests — including drug synthesis, bioweapons development, and operational planning for terrorist attacks — with little or no resistance.”

This Tweet is currently unavailable. It might be loading or has been removed.

Anthropic’s approach centers around what they call “agentic misalignment evaluations,” or pressure tests of model behavior in difficult or high-stakes simulations over long chat periods — the safety parameters of models, including OpenAI’s, have known to degrade throughout extended sessions, which is commonly how at-risk users engage with what they believe are their personal AI companions.

Earlier this month, it was reported that Anthropic had revoked OpenAI’s access to its APIs, stating that the company had violated its Terms of Service by testing GPT-5’s performance and safety guardrails against Claude’s internal tools. In an interview with TechCrunch, OpenAI co-founder Wojciech Zaremba said the instance was unrelated to the joint lab venture. In its published report, Anthropic said it doesn’t anticipate replicating the collaboration at a large scale, citing resource and logistical constraints.

In the weeks since, OpenAI has charged ahead with what appears to be a safety overhaul, including GPT-5’s new mental health guardrails and additional plans for emergency response protocols and deescalation tools for users who may be experiencing derealization or psychosis. OpenAI is currently facing its first wrongful death lawsuit, filed by the parents of a California teen who died by suicide after easily jailbreaking ChatGPT’s safety prompts.

“We aim to understand the most concerning actions that these models might try to take when given the opportunity, rather than focusing on the real-world likelihood of such opportunities arising or the probability that these actions would be successfully completed,” wrote Anthropic.

If you’re feeling suicidal or experiencing a mental health crisis, please talk to somebody. You can call or text the 988 Suicide & Crisis Lifeline at 988, or chat at 988lifeline.org. You can reach the Trans Lifeline by calling 877-565-8860 or the Trevor Project at 866-488-7386. Text “START” to Crisis Text Line at 741-741. Contact the NAMI HelpLine at 1-800-950-NAMI, Monday through Friday from 10:00 a.m. – 10:00 p.m. ET, or email [email protected]. If you don’t like the phone, consider using the 988 Suicide and Crisis Lifeline Chat at crisischat.org. Here is a list of international resources.

Source link

What's Hot

Should Investors Reassess C3.ai After Recent 30% Drop in Share Price?

H20.ai gets $72.5M funding to bring AI to the masses

Persuasion Dynamics in LLMs: Investigating Robustness and Adaptability in Knowledge and Safety with DuET-PD – Takara TLDR

OpenAI and Anthropic evaluated each other’s models for safety

Someone Created the First AI-Powered Ransomware Using OpenAI’s gpt-oss:20b Model

OpenAI, Anthropic Swap Safety Reviews

xAI Sues Apple and OpenAI Over Alleged AI Competition Suppression

Woodmere Art Museum Sues Trump Administration Over Canceled IMLS Grant

Barbara Gladstone’s Chelsea Townhouse in NYC Sells for $13.1 M.

Trump Meets with Smithsonian Leader Amid Threats of Content Review

Australian School Faces Pushback over AI Art Course—and More Art News

Should Investors Reassess C3.ai After Recent 30% Drop in Share Price?

H20.ai gets $72.5M funding to bring AI to the masses

Persuasion Dynamics in LLMs: Investigating Robustness and Adaptability in Knowledge and Safety with DuET-PD – Takara TLDR

What's Hot

OpenAI and Anthropic evaluated each other’s models for safety

Related Posts

Subscribe to Updates