Artificial Intelligence & Machine Learning
,
Next-Generation Technologies & Secure Development
‘More Research’ Needed, Says Google

Google DeepMind expanded its risk framework to cover scenarios where artificial intelligence models might manipulate people or resist shutdown, marking the company’s most explicit warning yet about potential misalignment.
See Also: OnDemand | Navigate the threat of AI-powered cyberattacks
An update to Google’s Frontier Safety Framework adds a new misuse class for generative AI it calls “harmful manipulation,” reflecting concerns that models could elude human control.
“Models with high manipulative capabilities” could be “misused in ways that could reasonably result in large-scale harm,” the framework now states.
The framework is organized around capability thresholds at which models could cause severe harm unless appropriate mitigations are in place. Version 3.0 adds new levels and fleshes out mitigation approaches and acknowledges gaps where researchers do not yet have fixes. Among those gaps is the prospect that a model might resist operator attempts to modify it or shut it down, a scenario the company treats as an escalated misalignment risk.
DeepMind researchers frame many of these scenarios as a malfunction rather than the model acquiring human-style intentions.
A related concern is the emergence of what DeepMind labels misalignment risk, a risk category intended for situations where models develop a baseline instrumental reasoning ability capable of undermining human control.
As a mitigation, the company recommends examining “scratchpad” outputs, which are explicit, inspectable chains of thought. Looking at interim outputs isn’t a permanent solution, Google acknowledged. Future models might simulate scratchpad outputs that don’t actually reflect a verifiable chain of thought. “Additional mitigations may be warranted – the development of which is an area of active research,” Google said.
DeepMind also details second-order risks, such as an advanced model being used to accelerate machine learning research and in doing so enabling creation of still-more-capable systems. This risk could have a “significant effect on society’s ability to adapt to and govern powerful AI models.”
The update also calls out the possibility that a model could be tuned to “systematically and substantially change people’s beliefs.” Google classified that as a “low-velocity” threat, one they expect social defenses to largely handle. It nevertheless added “harmful manipulation” as a formal misuse risk to ensure it is measured and mitigated.
“Going forward, we’ll continue to invest in this domain to better understand and measure the risks associated with harmful manipulation,” DeepMind executives Four Flynn, Helen King and Anca Dragan wrote.