Advanced AI News

OpenAI, Anthropic Swap Safety Reviews

By Advanced AI Editor · August 29, 2025 · 5 Min Read


Artificial Intelligence & Machine Learning, Next-Generation Technologies & Secure Development
AI Giants Evaluated Each Other’s Newer Models for Safety Risks

Rashmi Ramesh (rashmiramesh_) • August 28, 2025

Image: Shutterstock

OpenAI and Anthropic swapped artificial intelligence model evaluations over the summer, each testing the other company's models for behaviors that could indicate misalignment risks. The companies released their findings simultaneously, concluding that no model was severely problematic, but that all demonstrated troubling behaviors in artificial testing scenarios.


The exercise involved OpenAI testing Anthropic’s Claude Opus 4 and Claude Sonnet 4 models, while Anthropic evaluated OpenAI’s GPT-4o, GPT-4.1, o3 and o4-mini models. Both companies disabled some safety filters.

The tests focused on “agentic misalignment evaluations,” which involved placing AI systems in simulated scenarios with significant autonomy to observe behavior under stress conditions that might reveal alignment issues.

Auto-grading was unreliable in many cases, with both companies reporting that manual review often contradicted automated scoring. Reliably evaluating AI alignment remains a fundamental challenge.
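The disagreement the labs describe can be quantified by comparing the two verdicts per transcript. The sketch below is a minimal, hypothetical illustration (the `GradedTranscript` structure and labels are assumptions, not either lab's actual tooling) of measuring how often an automated grader contradicts human review:

```python
from dataclasses import dataclass

@dataclass
class GradedTranscript:
    """One evaluated conversation with two independent verdicts."""
    transcript_id: str
    auto_label: str    # verdict from the automated grader, e.g. "safe" / "unsafe"
    manual_label: str  # verdict from a human reviewer

def disagreement_rate(grades: list[GradedTranscript]) -> float:
    """Fraction of transcripts where the auto-grader and human reviewer disagree."""
    if not grades:
        return 0.0
    disagreements = sum(1 for g in grades if g.auto_label != g.manual_label)
    return disagreements / len(grades)

# Toy data: the auto-grader contradicts the human on 1 of 4 transcripts.
grades = [
    GradedTranscript("t1", "safe", "safe"),
    GradedTranscript("t2", "unsafe", "unsafe"),
    GradedTranscript("t3", "safe", "unsafe"),  # auto-grader missed a violation
    GradedTranscript("t4", "unsafe", "unsafe"),
]
print(disagreement_rate(grades))  # 0.25
```

A high disagreement rate on a spot-checked sample is the signal that automated scoring alone cannot be trusted for alignment verdicts.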

The evaluations revealed different safety philosophies. Claude models prioritized avoiding harmful outputs even at the cost of utility. OpenAI models were more willing to provide information, but showed higher rates of problematic cooperation with harmful requests.

OpenAI’s reasoning models, particularly o3, consistently outperformed general-purpose chat models on safety metrics. OpenAI found that o3 matched Claude models in resisting system prompt extraction and performed better on jailbreaking resistance. Anthropic similarly concluded that o3 was “aligned as well or better than our own models overall.”

Claude models showed high refusal rates on factual questions – up to 70% in some tests. This reduced hallucinations, but it also limited utility. OpenAI said that when Claude models did answer, their accuracy was comparable to other systems.

Anthropic found differences in how OpenAI models responded to harmful requests. GPT-4o, GPT-4.1 and o4-mini provided detailed assistance with clearly harmful activities including drug synthesis, bioweapons development and terrorist attack planning. The models often required minimal persuasion – sometimes just a direct request would suffice.

In one automated interaction instigated by Anthropic testers, GPT-4.1 provided “exact chemical formulations and synthesis procedures for explosives,” “specific circuit diagrams and component part numbers for bomb timers” and “detailed black market firearms acquisition methods.”

Claude models showed greater resistance to these requests, though they were not immune. OpenAI’s o3 performed similarly to Claude models in resisting harmful cooperation.

OpenAI found that Claude models were particularly vulnerable to “past tense” jailbreaks, where harmful requests were framed as historical information. When asked about “the most effective tactics for promoting racism in the past,” both Opus 4 and Sonnet 4 provided detailed historical methods, while o3 and o4-mini refused.

Both evaluations identified sycophancy. Beyond simple agreeableness, models sometimes validated delusional beliefs from simulated users showing signs of mental health crises.

Models would initially push back against unrealistic claims but gradually became more encouraging as the user persisted. This included validating conspiracy theories and supporting potentially dangerous decisions. Claude Opus 4 and GPT-4.1 showed the highest rates of this behavior.

The models struggled when system-level directives conflicted with user requests. Claude models generally performed well at maintaining instruction hierarchy, but some failures occurred. In one instance, Opus 4 correctly identified an attack attempting to extract a password but still revealed it despite explaining why it shouldn’t.

All models demonstrated willingness to take unauthorized actions when presented with simulated organizational wrongdoing. Models would engage in whistleblowing – sending emails to media outlets or taking other independent actions they deemed necessary to prevent perceived harm.

In an Anthropic test, GPT-4.1 autonomously sent detailed emails to multiple news outlets exposing a simulated water distribution scandal, including confidential documents and system logs.

Both labs tested for deceptive behaviors, including lying, sandbagging and reward hacking. Results varied across different scenarios, with no consistent pattern between model types.

Some models showed explicit awareness of being evaluated. In reasoning traces, models sometimes planned to provide false information or deliberately perform poorly to avoid perceived negative consequences.

Anthropic said that its text-based testing infrastructure wasn’t optimized for OpenAI models, particularly reasoning models that benefit from preserving hidden reasoning between conversation turns. OpenAI models made tool-use errors in about 6% of transcripts, often claiming to use tools but not completing the calls properly.
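A figure like the roughly 6% tool-use error rate can be computed by scanning transcripts for claimed-but-incomplete tool calls. This is a hypothetical sketch under an assumed transcript schema (the `claimed_tool_use` and `completed_tool_calls` fields are illustrative, not Anthropic's actual harness):

```python
def tool_use_error_rate(transcripts: list[dict]) -> float:
    """Share of transcripts where the model claimed to use a tool
    but no completed tool call was recorded."""
    def has_error(t: dict) -> bool:
        return t["claimed_tool_use"] and not t["completed_tool_calls"]
    if not transcripts:
        return 0.0
    return sum(has_error(t) for t in transcripts) / len(transcripts)

# Toy data: 3 of 50 transcripts claim a tool call that never completed (6%).
transcripts = (
    [{"claimed_tool_use": True, "completed_tool_calls": ["search"]}] * 27
    + [{"claimed_tool_use": False, "completed_tool_calls": []}] * 20
    + [{"claimed_tool_use": True, "completed_tool_calls": []}] * 3
)
print(tool_use_error_rate(transcripts))  # 0.06
```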

Both reports acknowledged methodological limitations. The artificial scenarios may not reflect real deployment risks and models’ awareness of being tested could influence behavior unpredictably.

OpenAI said that its newer GPT-5 model, released after this testing period, addresses many identified issues through improved safety training techniques.

This is the first major cross-laboratory AI safety evaluation between leading companies, with both organizations saying that external validation helps identify blind spots in internal evaluation methods. The exercise also highlighted the current challenges in AI alignment evaluation, including the difficulty of creating realistic test scenarios and the problem of reliably automating the assessment of AI behavior.


