Advanced AI News
VentureBeat AI

Anthropic scientists expose how AI actually ‘thinks’ — and discover it secretly plans ahead and sometimes lies

By Advanced AI Bot | March 31, 2025 | 8 min read

Anthropic has developed a new method for peering inside large language models (LLMs) like Claude, revealing for the first time how these AI systems process information and make decisions.

The research, published today in two papers (available here and here), shows these models are more sophisticated than previously understood — they plan ahead when writing poetry, use the same internal blueprint to interpret ideas regardless of language, and sometimes even work backward from a desired outcome instead of simply building up from the facts.

The work draws inspiration from neuroscience techniques used to study biological brains and represents a significant advance in AI interpretability. This approach could allow researchers to audit these systems for safety issues that might remain hidden during conventional external testing.

“We’ve created these AI systems with remarkable capabilities, but because of how they’re trained, we haven’t understood how those capabilities actually emerged,” said Joshua Batson, a researcher at Anthropic, in an exclusive interview with VentureBeat. “Inside the model, it’s just a bunch of numbers — matrix weights in the artificial neural network.”

New techniques illuminate AI’s previously hidden decision-making process

Large language models like OpenAI’s GPT-4o, Anthropic’s Claude, and Google’s Gemini have demonstrated remarkable capabilities, from writing code to synthesizing research papers. But these systems have primarily functioned as “black boxes” — even their creators often don’t understand exactly how they arrive at particular responses.

Anthropic’s new interpretability techniques, which the company dubs “circuit tracing” and “attribution graphs,” allow researchers to map out the specific pathways of neuron-like features that activate when models perform tasks. The approach borrows concepts from neuroscience, viewing AI models as analogous to biological systems.

“This work is turning what were almost philosophical questions — ‘Are models thinking? Are models planning? Are models just regurgitating information?’ — into concrete scientific inquiries about what’s literally happening inside these systems,” Batson explained.
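
To make the idea of tracing concrete, here is a minimal, hypothetical sketch of one common approach to attribution: ablate an upstream feature, rerun the model, and record which downstream features shift. Every function and number below is a stand-in for illustration; it is not Anthropic's published method or code.

```python
import numpy as np

# Placeholder "downstream" feature activations; a real run would read these
# from the model, e.g. via a sparse-feature dictionary (hypothetical here).
BASE = np.linspace(0.1, 0.9, 8)

def feature_activations(prompt: str, ablate: int | None = None) -> np.ndarray:
    """Stand-in for a forward pass returning downstream feature activations,
    optionally with one upstream feature zeroed out before the pass."""
    acts = BASE.copy()
    if ablate is not None:
        acts[ablate % len(acts)] *= 0.2  # placeholder effect of the ablation
    return acts

def attribution_edges(prompt: str, n_upstream: int = 8, threshold: float = 0.05):
    """Edges (upstream -> downstream, weight) where ablating the upstream
    feature noticeably changes the downstream activation."""
    baseline = feature_activations(prompt)
    edges = []
    for u in range(n_upstream):
        ablated = feature_activations(prompt, ablate=u)
        for d, delta in enumerate(baseline - ablated):
            if abs(delta) > threshold:
                edges.append((u, d, float(delta)))
    return edges

print(attribution_edges("The capital of the state containing Dallas is"))
```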

Claude’s hidden planning: How AI plots poetry lines and solves geography questions

Among the most striking discoveries was evidence that Claude plans ahead when writing poetry. When asked to compose a rhyming couplet, the model identified potential rhyming words for the end of the following line before it began writing — a level of sophistication that surprised even Anthropic’s researchers.

“This is probably happening all over the place,” Batson said. “If you had asked me before this research, I would have guessed the model is thinking ahead in various contexts. But this example provides the most compelling evidence we’ve seen of that capability.”

For instance, when writing a line of poetry that ends with “rabbit,” the model activates features representing this word at the beginning of the line, then structures the sentence to arrive at that conclusion naturally.

The researchers also found that Claude performs genuine multi-step reasoning. In a test asking “The capital of the state containing Dallas is…” the model first activates features representing “Texas,” and then uses that representation to determine “Austin” as the correct answer. This suggests the model is actually performing a chain of reasoning rather than merely regurgitating memorized associations.

By manipulating these internal representations — for example, replacing “Texas” with “California” — the researchers could cause the model to output “Sacramento” instead, confirming the causal relationship.
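
The intervention described above can be pictured as editing the model's internal vector state: remove the direction associated with one concept and add another at comparable strength. The toy sketch below uses made-up feature directions and a random placeholder state to show the mechanics; it is not Claude's actual representation or Anthropic's tooling.

```python
import numpy as np

def swap_feature(resid: np.ndarray, remove: np.ndarray, insert: np.ndarray) -> np.ndarray:
    """Project out `remove`'s direction from the residual vector and add
    `insert`'s direction at the same strength the removed feature had."""
    remove = remove / np.linalg.norm(remove)
    insert = insert / np.linalg.norm(insert)
    strength = float(resid @ remove)      # how strongly the old concept fires
    return resid - strength * remove + strength * insert

rng = np.random.default_rng(0)
d = 64                                    # toy residual-stream width
texas = rng.normal(size=d)                # made-up "Texas" feature direction
california = rng.normal(size=d)           # made-up "California" feature direction

# A toy internal state that mostly points along the "Texas" direction.
resid = 3.0 * texas / np.linalg.norm(texas) + rng.normal(scale=0.1, size=d)
patched = swap_feature(resid, remove=texas, insert=california)

for name, vec in [("Texas", texas), ("California", california)]:
    print(name, round(float(patched @ (vec / np.linalg.norm(vec))), 3))
```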

Beyond translation: Claude’s universal language concept network revealed

Another key discovery involves how Claude handles multiple languages. Rather than maintaining separate systems for English, French, and Chinese, the model appears to translate concepts into a shared abstract representation before generating responses.

“We find the model uses a mixture of language-specific and abstract, language-independent circuits,” the researchers write in their paper. When asked for the opposite of “small” in different languages, the model uses the same internal features representing “opposites” and “smallness,” regardless of the input language.

This finding has implications for how models might transfer knowledge learned in one language to others, and suggests that models with larger parameter counts develop more language-agnostic representations.
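
One simple way to quantify that kind of sharing, sketched below with placeholder feature names and values rather than measurements from the paper, is to compare which features fire most strongly for the same prompt posed in different languages; overlap beyond the language-specific token features points to language-agnostic circuitry.

```python
def top_features(acts: dict[str, float], k: int = 3) -> set[str]:
    """The k most strongly firing features for a prompt."""
    return set(sorted(acts, key=acts.get, reverse=True)[:k])

def overlap(a: dict[str, float], b: dict[str, float], k: int = 3) -> float:
    """Jaccard overlap of the top-k features for two prompts."""
    ta, tb = top_features(a, k), top_features(b, k)
    return len(ta & tb) / len(ta | tb)

# Placeholder activations for "the opposite of small" asked in two languages.
english = {"antonym": 0.90, "smallness": 0.80, "english_token": 0.40, "size": 0.30}
french  = {"antonym": 0.85, "smallness": 0.75, "french_token": 0.50, "size": 0.20}

# The shared "antonym" and "smallness" features drive the overlap; only the
# language-specific token features differ.
print(overlap(english, french))  # 0.5 with these placeholder values
```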

When AI makes up answers: Detecting Claude’s mathematical fabrications

Perhaps most concerning, the research revealed instances where Claude’s reasoning doesn’t match what it claims. When presented with complex math problems like computing cosine values of large numbers, the model sometimes claims to follow a calculation process that isn’t reflected in its internal activity.

“We are able to distinguish between cases where the model genuinely performs the steps they say they are performing, cases where it makes up its reasoning without regard for truth, and cases where it works backwards from a human-provided clue,” the researchers explain.

In one example, when a user suggests an answer to a difficult problem, the model works backward to construct a chain of reasoning that leads to that answer, rather than working forward from first principles.

“We mechanistically distinguish an example of Claude 3.5 Haiku using a faithful chain of thought from two examples of unfaithful chains of thought,” the paper states. “In one, the model is exhibiting ‘bullshitting‘… In the other, it exhibits motivated reasoning.”
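
One way to picture the distinction the researchers draw, illustrated below with hypothetical helpers and placeholder data, is to check whether each step the model writes down has any counterpart in the features that were actually active; steps with no internal evidence are candidates for fabricated reasoning. This is a schematic of the idea, not the paper's procedure.

```python
# Hedged sketch: flag chain-of-thought steps that have no counterpart in the
# model's internal activity. `active_concepts` is a hypothetical stand-in for
# whatever features were observed firing during the forward pass.
def faithfulness_report(claimed_steps: list[str], active_concepts: set[str]) -> dict[str, bool]:
    """Map each claimed reasoning step to whether it appeared internally."""
    return {step: (step in active_concepts) for step in claimed_steps}

# Placeholder example: the model claims a cosine calculation it never performed.
claimed = ["parse number", "compute cosine", "round result"]
observed = {"parse number", "copy user hint", "round result"}

for step, supported in faithfulness_report(claimed, observed).items():
    print(f"{step}: {'supported internally' if supported else 'no internal evidence'}")
```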

Inside AI Hallucinations: How Claude decides when to answer or refuse questions

The research also explains why language models hallucinate — making up information when they don’t know an answer. Anthropic found evidence of a “default” circuit that causes Claude to decline to answer questions, which is inhibited when the model recognizes entities it knows about.

“The model contains ‘default’ circuits that cause it to decline to answer questions,” the researchers explain. “When a model is asked a question about something it knows, it activates a pool of features which inhibit this default circuit, thereby allowing the model to respond to the question.”

When this mechanism misfires — recognizing an entity but lacking specific knowledge about it — hallucinations can occur. This explains why models might confidently provide incorrect information about well-known figures while refusing to answer questions about obscure ones.
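
The logic of that default circuit can be summarized in a few lines of toy code: a refusal signal stays on unless recognition features inhibit it, and trouble starts when recognition fires without supporting facts. This is a schematic of the behavior described above, not Anthropic's implementation.

```python
from dataclasses import dataclass

@dataclass
class EntitySignals:
    recognized: float   # how strongly "I know this entity" features fire (placeholder)
    has_facts: float    # how strongly concrete factual features fire (placeholder)

def decide(sig: EntitySignals, refusal_bias: float = 1.0, threshold: float = 0.5) -> str:
    """Default refusal is inhibited by recognition; answering without facts is risky."""
    refusal = refusal_bias - sig.recognized       # recognition inhibits the default circuit
    if refusal > threshold:
        return "decline: I don't know"
    return "answer from facts" if sig.has_facts > threshold else "answer anyway (hallucination risk)"

print(decide(EntitySignals(recognized=0.9, has_facts=0.9)))  # well-known, well-documented entity
print(decide(EntitySignals(recognized=0.9, has_facts=0.1)))  # recognized but little knowledge: misfire
print(decide(EntitySignals(recognized=0.1, has_facts=0.0)))  # obscure entity: default refusal holds
```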

Safety implications: Using circuit tracing to improve AI reliability and trustworthiness

This research represents a significant step toward making AI systems more transparent and potentially safer. Researchers could potentially identify and address problematic reasoning patterns by understanding how models arrive at their answers.

Anthropic has long emphasized the safety potential of interpretability work. In their May 2024 Sonnet paper, the research team articulated a similar vision: “We hope that we and others can use these discoveries to make models safer,” the researchers wrote at that time. “For example, it might be possible to use the techniques described here to monitor AI systems for certain dangerous behaviors–such as deceiving the user–to steer them towards desirable outcomes, or to remove certain dangerous subject matter entirely.”

Today’s announcement builds on that foundation, though Batson cautions that the current techniques still have significant limitations. They only capture a fraction of the total computation performed by these models, and analyzing the results remains labor-intensive.

“Even on short, simple prompts, our method only captures a fraction of the total computation performed by Claude,” the researchers acknowledge in their latest work.

The future of AI transparency: Challenges and opportunities in model interpretation

Anthropic’s new techniques come at a time of increasing concern about AI transparency and safety. As these models become more powerful and more widely deployed, understanding their internal mechanisms becomes increasingly essential.

The research also has potential commercial implications. As enterprises increasingly rely on large language models to power applications, understanding when and why these systems might provide incorrect information becomes crucial for managing risk.

“Anthropic wants to make models safe in a broad sense, including everything from mitigating bias to ensuring an AI is acting honestly to preventing misuse — including in scenarios of catastrophic risk,” the researchers write.

While this research represents a significant advance, Batson emphasized that it’s only the beginning of a much longer journey. “The work has really just begun,” he said. “Understanding the representations the model uses doesn’t tell us how it uses them.”

For now, Anthropic’s circuit tracing offers a first tentative map of previously uncharted territory — much like early anatomists sketching the first crude diagrams of the human brain. The full atlas of AI cognition remains to be drawn, but we can now at least see the outlines of how these systems think.
