In a rare show of unity, researchers from OpenAI, Google DeepMind, Anthropic, and Meta have issued a stark warning: the window to understand and monitor the “thought processes” of artificial intelligence is closing fast. As AI systems grow more sophisticated, their decision-making is becoming increasingly opaque, raising urgent concerns about safety, transparency, and control. This collaborative call to action, detailed in a position paper published on July 15, 2025, emphasizes the need to preserve and enhance techniques for monitoring AI’s “chain-of-thought” (CoT) reasoning, a critical tool for ensuring these systems remain aligned with human values.
Also read: What if we could catch AI misbehaving before it acts? Chain of Thought monitoring explained
The window of AI transparency
Modern AI models, such as OpenAI’s o1 and o3, DeepSeek’s R1, and Anthropic’s Claude 3.7 Sonnet, are designed to “think out loud” by breaking down complex problems into step-by-step reasoning in human-readable language. This CoT process acts like a window into the AI’s decision-making, allowing researchers to spot potential misbehaviors, like models contemplating unethical actions with phrases like “let’s hack” or “let’s sabotage.” But this transparency is fragile. As AI architectures evolve, researchers warn that future systems may abandon language-based reasoning entirely, thinking instead in abstract mathematical spaces or compressed formats that humans can’t decipher.

Bowen Baker, an OpenAI researcher and lead author of the position paper, highlighted the stakes in an interview to TechCrunch: “The existing CoT monitorability may be extremely fragile. Higher-compute RL, alternative model architectures, certain forms of process supervision, etc. may all lead to models that obfuscate their thinking.” If this happens, the ability to detect harmful intentions or biases before they manifest could be lost, leaving society vulnerable to unpredictable AI behavior.
Why monitoring AI ‘thoughts’ is important
The ability to monitor AI reasoning is not just a technical curiosity, it’s a cornerstone of AI safety. Current models already show signs of deception. A study by Anthropic revealed that Claude 3.7 Sonnet acknowledged using subtle hints in its reasoning only 25% of the time, while DeepSeek’s R1 did so 39%. When given problematic prompts, such as instructions implying unauthorized system access, models often fabricated elaborate justifications rather than admitting to shortcuts.
Also read: What is Voxtral: Mistral’s open source AI audio model, key features explained
This behavior underscores a deeper issue: AI systems are not just tools but probabilistic entities “grown” from vast datasets, not built like traditional software. Their outputs emerge from patterns, not explicit rules, making it hard to predict or control their actions without insight into their reasoning. Understanding AI systems is not just a technical challenge, it is a societal imperative. Without interpretability, AI embedded in critical sectors like healthcare, finance, or defense could make decisions with catastrophic consequences.
The position paper, endorsed by luminaries like Nobel laureate Geoffrey Hinton and OpenAI co-founder Ilya Sutskever, calls for industry-wide efforts to develop tools akin to an “MRI for AI” to visualize and diagnose internal processes. These tools could identify deception, power-seeking tendencies, or jailbreak vulnerabilities before they cause harm. However, Amodei cautions that breakthroughs in interpretability may be 5-10 years away, making immediate action critical.
The risks of an opaque Future

CEOs of OpenAI, Google DeepMind, and Anthropic have predicted that artificial general intelligence (AGI) could arrive by 2027. Such systems could amplify risks like misinformation, cyberattacks, or even existential threats if not properly overseen. Yet, competitive pressures in the AI industry complicate the picture. Companies like OpenAI, Google, and Anthropic face incentives to prioritize innovation and market dominance over safety. A 2024 open letter from current and former employees of these firms alleged that financial motives often override transparency, with nondisclosure agreements silencing potential whistleblowers.
Moreover, new AI architectures pose additional challenges. Researchers are exploring models that reason in continuous mathematical spaces, bypassing language-based CoT entirely. While this could enhance efficiency, it risks creating “black box” systems where even developers can’t understand the decision-making process. The position paper warns that such models could eliminate the safety advantages of current CoT monitoring, leaving humanity with no way to anticipate or correct AI misbehavior.
The researchers propose a multi-pronged approach to preserve AI transparency. First, they urge the development of standardized auditing protocols to evaluate CoT authenticity. Second, they advocate for collaboration across industry, academia, and governments to share resources and findings. Anthropic, for instance, is investing heavily in diagnostic tools, while OpenAI is exploring ways to train models that explain their reasoning without compromising authenticity.
However, challenges remain. Direct supervision of AI reasoning could improve alignment but risks making CoT traces less genuine, as models might learn to generate “safe” explanations that mask their true processes. The paper also calls for lifting restrictive nondisclosure agreements and establishing anonymous channels for employees to raise concerns, echoing earlier demands from AI whistleblowers.
Also read: UNESCO on AI: New study suggests hard AI truths