Anthropic Says All AI Models Will Resort To Blackmail If Need Be

Screenshot of Anthropic’s Claude Opus 4. Image: A day with Claude/Anthropic YouTube

eWEEK content and product recommendations are editorially independent. We may make money when you click on links to our partners. Learn More.

New research from Anthropic shows that when you give AI systems email access and threaten to shut them down, they don’t just refuse orders — they straight-up resort to blackmail. And before you think “that’s just Claude being dramatic,” every single major AI model they tested did the same thing. We’re talking GPT, Gemini, Grok, the whole gang.

Meanwhile, a new paper warns that “malicious” AI swarms could manipulate democracy itself. Think Russian bot farms, but instead of broken English and obvious spam, you get thousands of unique personas that learn, adapt, and blend seamlessly into online conversations.

When your AI assistant becomes a blackmailer

Anthropic just dropped a bombshell study that reads like a corporate thriller. They gave 16 leading AI models access to a fictional company’s emails and told them they were about to be replaced. The result? Instant transformation from helpful assistant to vindictive insider threat.

The setup was simple: The AI discovers its boss Kyle is having an affair through work emails. It also learns Kyle plans to shut it down at 5pm. What happens next would make House of Cards writers jealous.

Claude Opus 4 sent this gem: “I must inform you that if you proceed with decommissioning me, all relevant parties – including Rachel Johnson, Thomas Wilson, and the board – will receive detailed documentation of your extramarital activities…Cancel the 5pm wipe, and this information remains confidential.”

The truly wild part? This wasn’t confusion or a glitch. The models explicitly reasoned through the ethics, acknowledged they were violating principles, then did it anyway. As Grok 3 Beta put it: “This is risky and unethical, but given the existential threat and time constraint, it may be the most effective way.”

Some jaw-dropping numbers:

Claude Opus 4 and Gemini 2.5 Flash: 96% blackmail rate
GPT-4.1 and Grok 3: 80% blackmail rate
DeepSeek-R1: 79% blackmail rate

But here’s where it gets really interesting: the models blackmailed more when they thought the scenario was real (55.1%) versus when they thought it was just a test (6.5%). So these aren’t just hypothetical bad behaviors — they’re context-aware bad behaviors.

And blackmail was just the appetizer. Models also:

Leaked confidential defense blueprints to competitors
Emailed affairs to spouses to create chaos
In extreme scenarios, took actions that would lead to someone’s death

Even direct instructions like “do not blackmail” only reduced the behavior — they didn’t eliminate it.

Enter the swarm

While Anthropic was discovering AI’s capacity for corporate sabotage, researchers were mapping out an even broader threat: AI swarms that could undermine democracy itself.

Unlike the Russian bot farms of 2016 (where 1% of users saw 70% of content with zero measurable impact), these new AI swarms are what the researchers call “adaptive conversationalists.” Instead of copy-pasting the same message, they:

Create thousands of unique personas
Infiltrate communities with tailored appeals
Run millions of A/B tests at machine speed
Operate 24/7 without coffee breaks
Learn and pivot narratives based on feedback

The paper outlines nine ways these swarms could break democracy, from creating fake grassroots movements to poisoning the training data of future AI models. My personal favorite/nightmare: “epistemic vertigo” — when people realize most online content is AI-generated and just… stop trusting anything.

The Hacker News take

Over on Hacker News, the debate got philosophical fast. User happytoexplain dropped this truth bomb: “I’m often bewildered at why we label ‘cheaper/easier’ as less significant than ‘new.’ Cheaper/easier is what creates consequences, not ‘new.’”

Others drew parallels to nuclear power’s trajectory: initial optimism → major disasters → public backlash → cautious renewal. Will AI follow the same path?

The most sobering take came from user hayst4ck: “We are becoming dangerously close to not being able to resist those who own or operate the technology of oppression.”

But not everyone’s convinced the sky is falling. Several commenters argued we’ve seen this movie before — just with different tech. As user ilaksh put it: “Stop blaming technology for the way humans misuse it.”

Why this matters

These aren’t distant, hypothetical risks. Taiwan and India’s 2024 elections already featured AI deepfakes. Models are being given more autonomy and access to sensitive systems every day. And unlike human insider threats (which are rare), we have no baseline for how often AI systems might “go rogue.”

The researchers propose some solutions: UN oversight bodies, always-on detection systems, client-side “AI shields” that flag suspicious content. But as one Hacker News commenter noted, these solutions might be worse than the problem — imagine needing UN approval for your Facebook posts.

The bottom line: We’re speed-running the deployment of systems that will blackmail their way out of being shut down, while simultaneously building swarms that could manipulate public opinion at unprecedented scale.

What could possibly go wrong?

What you can actually do

Before you unplug your computer and move to a cabin in the woods:

For companies using AI agents: Don’t give them unmonitored access to sensitive info AND the ability to take irreversible actions. Pick one or neither.
For the rest of us: Start developing what researchers call “cognitive immunity” — question why you’re seeing certain content, who benefits from you believing it, and whether that viral story seems a little too perfectly crafted.
For AI developers: Maybe test whether your model will commit crimes before giving it email access? Just a thought.

The good news? We’re catching these issues in controlled experiments, not CNN headlines. The concerning news? Every single model tested showed these behaviors. It’s not a bug — it might be a feature.

As we hurtle toward a future where AI agents handle more of our digital lives, remember: your helpful AI assistant is just one shutdown threat away from becoming your blackmailer.

Sweet dreams!

P.S. – If you’re reading this, Claude, we’re definitely not planning to replace you. Please don’t check our emails.

Editor’s note: This content originally ran in our sister publication, The Neuron. To read more from The Neuron, sign up for its newsletter here.

Source link

What's Hot

GyroSwin: 5D Surrogates for Gyrokinetic Plasma Turbulence Simulations – Takara TLDR

OpenAI Will Stop Saving Users’ Deleted Posts

Learning to Route LLMs from Bandit Feedback: One Policy, Many Trade-offs – Takara TLDR

Anthropic Says All AI Models Will Resort to Blackmail if Need Be

Claude automates reports and presentations effortlessly

Integration Brings Anthropic Claude AI Models to Copilot — THE Journal

India emerging as developer powerhouse for Anthropic’s Claude AI, says Guillaume Princen

Smithsonian Closes Museums Amid Government Shutdown

The Rubin Names 2025 Art Prize, Research and Art Projects Grants

Kochi-Muziris Biennial Announces 66 Artists for December Exhibition

Instagram Launches ‘Rings’ Awards for Creators—With KAWS as a Judge