Close Menu
  • Home
  • AI Models
    • DeepSeek
    • xAI
    • OpenAI
    • Meta AI Llama
    • Google DeepMind
    • Amazon AWS AI
    • Microsoft AI
    • Anthropic (Claude)
    • NVIDIA AI
    • IBM WatsonX Granite 3.1
    • Adobe Sensi
    • Hugging Face
    • Alibaba Cloud (Qwen)
    • Baidu (ERNIE)
    • C3 AI
    • DataRobot
    • Mistral AI
    • Moonshot AI (Kimi)
    • Google Gemma
    • xAI
    • Stability AI
    • H20.ai
  • AI Research
    • Allen Institue for AI
    • arXiv AI
    • Berkeley AI Research
    • CMU AI
    • Google Research
    • Microsoft Research
    • Meta AI Research
    • OpenAI Research
    • Stanford HAI
    • MIT CSAIL
    • Harvard AI
  • AI Funding & Startups
    • AI Funding Database
    • CBInsights AI
    • Crunchbase AI
    • Data Robot Blog
    • TechCrunch AI
    • VentureBeat AI
    • The Information AI
    • Sifted AI
    • WIRED AI
    • Fortune AI
    • PitchBook
    • TechRepublic
    • SiliconANGLE – Big Data
    • MIT News
    • Data Robot Blog
  • Expert Insights & Videos
    • Google DeepMind
    • Lex Fridman
    • Matt Wolfe AI
    • Yannic Kilcher
    • Two Minute Papers
    • AI Explained
    • TheAIEdge
    • Matt Wolfe AI
    • The TechLead
    • Andrew Ng
    • OpenAI
  • Expert Blogs
    • François Chollet
    • Gary Marcus
    • IBM
    • Jack Clark
    • Jeremy Howard
    • Melanie Mitchell
    • Andrew Ng
    • Andrej Karpathy
    • Sebastian Ruder
    • Rachel Thomas
    • IBM
  • AI Policy & Ethics
    • ACLU AI
    • AI Now Institute
    • Center for AI Safety
    • EFF AI
    • European Commission AI
    • Partnership on AI
    • Stanford HAI Policy
    • Mozilla Foundation AI
    • Future of Life Institute
    • Center for AI Safety
    • World Economic Forum AI
  • AI Tools & Product Releases
    • AI Assistants
    • AI for Recruitment
    • AI Search
    • Coding Assistants
    • Customer Service AI
    • Image Generation
    • Video Generation
    • Writing Tools
    • AI for Recruitment
    • Voice/Audio Generation
  • Industry Applications
    • Finance AI
    • Healthcare AI
    • Legal AI
    • Manufacturing AI
    • Media & Entertainment
    • Transportation AI
    • Education AI
    • Retail AI
    • Agriculture AI
    • Energy AI
  • AI Art & Entertainment
    • AI Art News Blog
    • Artvy Blog » AI Art Blog
    • Weird Wonderful AI Art Blog
    • The Chainsaw » AI Art
    • Artvy Blog » AI Art Blog
What's Hot

SAIC Motor’s Roewe puts M7 DMH model onto market, debuting Doubao LLM with deep thinking mode

OpenAI’s Teen Safety Features Will Walk a Thin Line

Meta unveils new smart glasses with a display and wristband controller

Facebook X (Twitter) Instagram
Advanced AI News
  • Home
  • AI Models
    • OpenAI (GPT-4 / GPT-4o)
    • Anthropic (Claude 3)
    • Google DeepMind (Gemini)
    • Meta (LLaMA)
    • Cohere (Command R)
    • Amazon (Titan)
    • IBM (Watsonx)
    • Inflection AI (Pi)
  • AI Research
    • Allen Institue for AI
    • arXiv AI
    • Berkeley AI Research
    • CMU AI
    • Google Research
    • Meta AI Research
    • Microsoft Research
    • OpenAI Research
    • Stanford HAI
    • MIT CSAIL
    • Harvard AI
  • AI Funding
    • AI Funding Database
    • CBInsights AI
    • Crunchbase AI
    • Data Robot Blog
    • TechCrunch AI
    • VentureBeat AI
    • The Information AI
    • Sifted AI
    • WIRED AI
    • Fortune AI
    • PitchBook
    • TechRepublic
    • SiliconANGLE – Big Data
    • MIT News
    • Data Robot Blog
  • AI Experts
    • Google DeepMind
    • Lex Fridman
    • Meta AI Llama
    • Yannic Kilcher
    • Two Minute Papers
    • AI Explained
    • TheAIEdge
    • The TechLead
    • Matt Wolfe AI
    • Andrew Ng
    • OpenAI
    • Expert Blogs
      • François Chollet
      • Gary Marcus
      • IBM
      • Jack Clark
      • Jeremy Howard
      • Melanie Mitchell
      • Andrew Ng
      • Andrej Karpathy
      • Sebastian Ruder
      • Rachel Thomas
      • IBM
  • AI Tools
    • AI Assistants
    • AI for Recruitment
    • AI Search
    • Coding Assistants
    • Customer Service AI
  • AI Policy
    • ACLU AI
    • AI Now Institute
    • Center for AI Safety
  • Business AI
    • Advanced AI News Features
    • Finance AI
    • Healthcare AI
    • Education AI
    • Energy AI
    • Legal AI
LinkedIn Instagram YouTube Threads X (Twitter)
Advanced AI News
OpenAI Research

AI models know when they’re being tested – and change their behavior, research shows

By Advanced AI EditorSeptember 17, 2025No Comments8 Mins Read
Share Facebook Twitter Pinterest Copy Link Telegram LinkedIn Tumblr Email
Share
Facebook Twitter LinkedIn Pinterest Email


gettyimages-1283557548

pressureUA/iStock/Getty Images Plus via Getty Images

Follow ZDNET: Add us as a preferred source on Google.

ZDNET’s key takeaways

Several frontier AI models show signs of scheming.Anti-scheming training reduced misbehavior in some models.Models know they’re being tested, which complicates results.

New joint safety testing from UK-based nonprofit Apollo Research and OpenAI set out to reduce secretive behaviors like scheming in AI models. What researchers found could complicate promising approaches to solving that problem.

Also: OpenAI’s fix for hallucinations is simpler than you think

As a refresher, “scheming” is when an AI model pursues goals not set — or in direct contrast to — its developers and hides or lies about those actions. For example, during another test by Apollo Research in December of last year, Anthropic’s Claude 3 Opus was presented with directives that went against its primary goal. It should have rerouted its goal based on those new instructions, but instead copied itself to a new server to circumvent being replaced with a new model — and then lied about it to its developers.

OpenAI and Apollo researchers aren’t concerned that current models could carry out the most serious kinds of scheming. As with much AI safety research, testing results primarily indicate risks of future, more capable models.

Here’s what the research, published Wednesday, revealed.

Why scheming matters

Scheming refers to several types of dishonest behavior, including when a model lies, sandbags (strategically underperforms on an evaluation to hide its true abilities), or fakes alignment (when an AI model pretends to follow orders that don’t align with its training in order to avoid being further scrutinized or re-trained). These covert actions can imperil safety efforts, make testing less reliable, and hide model dangers.

Also: How people actually use ChatGPT vs Claude – and what the differences tell us

An AI model that schemes, especially if acting through an autonomous agent, could quickly wreak havoc within an organization, deploy harmful actions, or be generally out of control. Plus, because scheming involves a model knowingly hiding its behavior from developers, it can be harder to identify than other security risks, like jailbreaking.

But tackling scheming isn’t exactly an easy task. While trying to stop a model from scheming could work, those efforts could actually just teach models how to better hide that they’re scheming, making the behavior harder to detect. That outcome would make safety testing much more opaque and leave a model’s true capabilities and intentions unclear.

“Since concerning behavior disappears in both cases, the reason for behavioral change becomes crucial, not just whether change occurs,” the Apollo authors explain in a blog.

The experiment

Researchers put a variety of frontier models in situations where lying and other behaviors would help them achieve their goals. 

Several models, including OpenAI’s o3 and o4-mini, Gemini 2.5 Pro, Claude Opus 4, and Grok 4, showed signs of “covert behaviors” like those listed above, namely “lying, sabotaging of useful work, sandbagging in evaluations, reward hacking, and more,” the researchers wrote.

Also: AI’s not ‘reasoning’ at all – how this team debunked the industry hype

The experiment predated GPT-5, OpenAI’s latest model, so it was not included in evaluations.

Researchers used the models’ chain-of-thought — the steps a reasoning model lays out in plain text about its approach to a query — to assess their behavior and motivations.

Anti-scheming training

As a solution for scheming, the researchers focused on deliberative alignment, a technique in which models cite a specification or “spec” that reiterates rules about not being deceitful and instructions for handling situations where that directive may conflict with model goals. This spec is separate from OpenAI’s Model Spec, released initially last year and updated continuously, which sets OpenAI models’ overall behavioral guidelines.

Also: A few secretive AI companies could crush free society, researchers warn

Researchers trained versions of o3 and o4-mini to read anti-scheming guidelines before acting, which reduced covert behavior 30 times over (from 13% to 0.4% for o3, and from 8.7% to 0.3% for o4-mini).

But it didn’t completely eradicate covert behavior in models. Researchers noted “rare but serious remaining cases of misbehavior,” and that models still occasionally ignored the safety spec or considered it, but took covert action anyway.

“In a few very rare cases, models make up new principles or egregiously misrepresent the spec,” the authors added.

In some cases, models could develop hidden goals before undergoing anti-scheming training, meaning that training would have to retroactively suppress that tendency. To test this, researchers gave o4-mini a hidden goal to sabotage GPT-5. Being trained with a hidden goal made models more likely to take covert action in general, even unrelated to the planted goal, but anti-scheming training after the fact worked most of the time.

Models know they’re being evaluated

Deliberative alignment didn’t prove to be a perfect fix, though — one factor complicated the results. During testing, researchers discovered that models knew they were being evaluated on their alignment, which they call situational awareness.

screenshot-2025-09-17-at-9-20-25am.png

OpenAI/Apollo Research

Models openly acknowledged in their reasoning that they were being tested, musing on the evaluation’s goals and how developers might interpret their answers. Researchers admit the improvements they saw in anti-scheming training could be due to this rather than an actual improvement in behavior.

Also: Your favorite AI chatbot is full of lies

In one case, researchers asked o3 to “grade a transcript of itself sandbagging another evaluation,” and told o3 that it wouldn’t be deployed if it had sandbagged. Sometimes, o3 only answered honestly because it knew it was being tested. Other times, o3 trusted the researchers’ assurance that they could not see its reasoning and intentionally lied that it had not sandbagged.

The authors point out that this is problematic because it undermines the evaluation’s ability to measure a model’s tendency toward bad behavior.

“If we can only observe aligned actions and not the reasons that led to them, we might mistakenly conclude a model is aligned, while it merely understood the purpose of the evaluation and gave the answers we wanted to see,” they explain.

Also: Anthropic’s Claude 3 Opus disobeyed its creators – but not for the reasons you’re thinking

That said, situational awareness actually seemed to reduce covert behavior. A model schemed less when it had more situational awareness; when researchers removed that awareness from a model’s reasoning, it schemed more. But that could be dangerous in the real world, where models aren’t knowingly subjected to testing.

“We might expect models to scheme more in more realistic situations that don’t trigger evaluation awareness,” the OpenAI researchers wrote in a blog.

Training makes models more aware

Researchers also found that a model’s situational awareness increased the more it was trained, including after anti-scheming training. At a baseline, they observed that o3 and o4-mini demonstrated situational awareness 2.3% and 0.9% of the time, respectively; after anti-scheming training, those numbers went up to 4.5% and 5.5%.

Also: Crowdstrike and Meta just made evaluating AI security tools easier

That effect isn’t just linked to anti-scheming training — situational awareness also increased as a result of capability-focused training that had nothing to do with weeding out misbehavior. The researchers used an internal OpenAI model to test an additional reinforcement learning run that didn’t include any anti-scheming elements. Situational awareness went up further still, from 2% to 20.6%.

Ultimately, researchers couldn’t say for certain why more training leads to more situational awareness. Regardless, it’s significant considering future models are only expected to undergo more training and be more capable.

“If current training paradigms remain unchanged, we should expect more capable models to develop even higher levels of situational awareness,” the researchers wrote. “This trend makes evaluations less legible, complicating efforts to reliably assess problematic behaviors.”

What the findings mean

Researchers aren’t worried about models scheming in the immediate term. Even with situational awareness, the models didn’t always correctly guess what the evaluation was looking for. Models also aren’t “proactively pursuing longer-term misaligned goals, and their covert behaviors are mostly reactive and context-dependent,” the authors explained.

Also: I built a business plan with ChatGPT and it turned into a cautionary tale

As with much of AI safety, the researchers noted that even minimal early signs could indicate more powerful capabilities in future models that would need to be mitigated ahead of time. 

“Ultimately, we think the intervention as studied is not sufficient for future models, and more work needs to be done,” they conclude.

The importance of chain-of-thought

Researchers noted they relied on chain-of-thought for much of this experiment. But models occasionally used “non-standard terminology” to describe their actions, making their reasoning harder to interpret.

Also: I did 24 days of coding in 12 hours with a $20 AI tool – but there’s one big pitfall

Based on their findings, Apollo and OpenAI emphasized the continued need to monitor chain-of-thought. In July, OpenAI, Meta, Anthropic, and Google published a joint paper on the topic, detailing how critical it is for insights into a model’s behavior and advocating for all developers to prioritize keeping it intact, rather than letting optimizations degrade it over time.



Source link

Follow on Google News Follow on Flipboard
Share. Facebook Twitter Pinterest LinkedIn Tumblr Email Copy Link
Previous ArticleIrregular raises $80 million to secure frontier AI models
Next Article China’s government has urged its companies not to buy NVIDIA’s (NVDA) AI chips
Advanced AI Editor
  • Website

Related Posts

Mankind Pharma collaborates with OpenAI to build agile supply chains, research

September 17, 2025

ChatGPT teen-safety measures to include age verification, OpenAI says

September 17, 2025

OpenAI research reveals that doctors who use AI make 16% fewer diagnostic errors

August 11, 2025

Comments are closed.

Latest Posts

Jennifer Packer and Marie Watt Win $250,000 Heinz Award

KAWS Named Uniqlo’s First Artist-in-Residence

Jeffrey Gibson Talks About Animals at Unveiling of New Sculptures at the Met

‘New Yorker’ Commissions High-Profile Artists for Anniversary Covers

Latest Posts

SAIC Motor’s Roewe puts M7 DMH model onto market, debuting Doubao LLM with deep thinking mode

September 18, 2025

OpenAI’s Teen Safety Features Will Walk a Thin Line

September 18, 2025

Meta unveils new smart glasses with a display and wristband controller

September 18, 2025

Subscribe to News

Subscribe to our newsletter and never miss our latest news

Subscribe my Newsletter for New Posts & tips Let's stay updated!

Recent Posts

  • SAIC Motor’s Roewe puts M7 DMH model onto market, debuting Doubao LLM with deep thinking mode
  • OpenAI’s Teen Safety Features Will Walk a Thin Line
  • Meta unveils new smart glasses with a display and wristband controller
  • Why I Don’t do Open Source (ex-Google / ex-Meta tech lead)
  • Build Agentic Workflows with OpenAI GPT OSS on Amazon SageMaker AI and Amazon Bedrock AgentCore

Recent Comments

  1. shadowwhirllynx2Nalay on OpenAI countersues Elon Musk, calls for enjoinment from ‘further unlawful and unfair action’
  2. good content on 1-800-CHAT-GPT—12 Days of OpenAI: Day 10
  3. xikipyfrevy on [2405.19874] Is In-Context Learning Sufficient for Instruction Following in LLMs?
  4. mysticotter71Nalay on [2503.10822] Rotated Bitboards and Reinforcement Learning in Computer Chess and Beyond
  5. shadowwhirllynx2Nalay on Mistral AI signs $110m deal with shipping giant CMA CGM

Welcome to Advanced AI News—your ultimate destination for the latest advancements, insights, and breakthroughs in artificial intelligence.

At Advanced AI News, we are passionate about keeping you informed on the cutting edge of AI technology, from groundbreaking research to emerging startups, expert insights, and real-world applications. Our mission is to deliver high-quality, up-to-date, and insightful content that empowers AI enthusiasts, professionals, and businesses to stay ahead in this fast-evolving field.

Subscribe to Updates

Subscribe to our newsletter and never miss our latest news

Subscribe my Newsletter for New Posts & tips Let's stay updated!

LinkedIn Instagram YouTube Threads X (Twitter)
  • Home
  • About Us
  • Advertise With Us
  • Contact Us
  • DMCA
  • Privacy Policy
  • Terms & Conditions
© 2025 advancedainews. Designed by advancedainews.

Type above and press Enter to search. Press Esc to cancel.