VentureBeat AI

LLMs generate ‘fluent nonsense’ when reasoning outside their training zone

By Advanced AI Editor | August 20, 2025 | 7 min read

A new study from Arizona State University researchers suggests that the celebrated “Chain-of-Thought” (CoT) reasoning in Large Language Models (LLMs) may be more of a “brittle mirage” than genuine intelligence. The research builds on a growing body of work questioning the depth of LLM reasoning, but it takes a unique “data distribution” lens to test where and why CoT breaks down systematically.

Crucially for application builders, the paper goes beyond critique to offer clear, practical guidance on how to account for these limitations when developing LLM-powered applications, from testing strategies to the role of fine-tuning.

The promise and problem of Chain-of-Thought

CoT prompting, which asks an LLM to “think step by step,” has shown impressive results on complex tasks, leading to the perception that models are engaging in human-like inferential processes. However, a closer inspection often reveals logical inconsistencies that challenge this view. 
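
To make the mechanism concrete, here is a minimal sketch of what CoT prompting amounts to in practice: the only change from a plain prompt is an instruction to show intermediate steps. The `call_model` function is a hypothetical placeholder for whatever LLM client an application actually uses; nothing here is specific to the paper.

```python
# Minimal CoT prompting sketch. `call_model` is a hypothetical placeholder
# for your LLM client; the CoT "technique" lives entirely in the prompt text.

def build_cot_prompt(question: str) -> str:
    return (
        "Answer the following question. Think step by step and show your "
        "reasoning before stating the final answer.\n\n"
        f"Question: {question}\nReasoning:"
    )

def call_model(prompt: str) -> str:
    raise NotImplementedError  # replace with your provider's completion call

if __name__ == "__main__":
    print(build_cot_prompt(
        "A train leaves at 9:40 and arrives at 11:05. How long is the trip?"
    ))
```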

Various studies show that LLMs frequently rely on surface-level semantics and clues rather than logical procedures. The models generate plausible-sounding logic by repeating token patterns they have seen during training. Still, this approach often fails on tasks that deviate from familiar templates or when irrelevant information is introduced. 

Despite these observations, the researchers of the new study argue that “a systematic understanding of why and when CoT reasoning fails is still a mystery,” which their study aims to address. Previous work has already shown that LLMs struggle to generalize their reasoning abilities. As the paper notes, “theoretical and empirical evidence shows that CoT generalizes well only when test inputs share latent structures with training data; otherwise, performance declines sharply.”

A new lens on LLM reasoning

The ASU researchers propose a new lens to view this problem: CoT isn’t an act of reasoning but a sophisticated form of pattern matching, fundamentally bound by the statistical patterns in its training data. They posit that “CoT’s success stems not from a model’s inherent reasoning capacity, but from its ability to generalize conditionally to out-of-distribution (OOD) test cases that are structurally similar to in-distribution exemplars.” In other words, an LLM is good at applying old patterns to new data that looks similar, but not at solving truly novel problems.

Figure: The data distribution lens (source: GitHub)

To test this hypothesis, they dissected CoT’s capabilities across three dimensions of “distributional shift” (changes between the training data and the test data). First, they tested “task generalization” to see if a model could apply a learned reasoning process to a new type of task. Second, they examined “length generalization” to determine if it could handle reasoning chains that are significantly longer or shorter than those it was trained on. Finally, they assessed “format generalization” to measure how sensitive the model is to minor changes in the prompt’s wording or structure. 

For their analysis, they developed a framework called DataAlchemy to train smaller LLMs from scratch in a controlled environment, allowing them to precisely measure how performance degrades when pushed beyond the training data.
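
The paper's DataAlchemy code is not reproduced here, but the underlying idea of a controlled environment can be sketched: generate synthetic symbolic tasks in which task type, chain length, and prompt format are explicit knobs, so that training and test splits differ along exactly one axis at a time. The task names, formats, and split sizes below are illustrative assumptions, not the paper's actual setup.

```python
# Illustrative sketch of a controlled train/test environment (not the paper's
# DataAlchemy code): task type, chain length, and prompt format are explicit
# knobs, so each OOD split shifts exactly one dimension at a time.
import random
import string

TASKS = {
    "rot1":    lambda s: "".join(chr((ord(c) - 97 + 1) % 26 + 97) for c in s),
    "reverse": lambda s: s[::-1],
}
FORMATS = {
    "plain":   "Input: {x}\nOutput:",
    "verbose": "Apply the transformation to the string below.\nString: {x}\nResult:",
}

def make_example(task: str, chain_len: int, fmt: str, rng: random.Random) -> dict:
    x = "".join(rng.choice(string.ascii_lowercase) for _ in range(5))
    y = x
    for _ in range(chain_len):          # compose the same operation chain_len times
        y = TASKS[task](y)
    return {"prompt": FORMATS[fmt].format(x=x), "target": y,
            "task": task, "chain_len": chain_len, "format": fmt}

rng = random.Random(0)
train      = [make_example("rot1", 2, "plain", rng) for _ in range(1000)]
ood_task   = [make_example("reverse", 2, "plain", rng) for _ in range(200)]   # new task
ood_length = [make_example("rot1", 5, "plain", rng) for _ in range(200)]      # longer chain
ood_format = [make_example("rot1", 2, "verbose", rng) for _ in range(200)]    # new wording
```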

“The data distribution lens and controlled environment are both central to what we were trying to convey,” Chengshuai Zhao, doctoral student at ASU and co-author of the paper, told VentureBeat. “We hope to create a space where the public, researchers, and developers can freely explore and probe the nature of LLMs and advance the boundaries of human knowledge.”

The mirage confirmed

Based on their findings, the researchers conclude that CoT reasoning is a “sophisticated form of structured pattern matching, fundamentally bounded by the data distribution seen during training.” When tested even slightly outside this distribution, performance collapses. What looks like structured reasoning is more of a mirage, “emerging from memorized or interpolated patterns in the training data rather than logical inference.”

The breakdown was consistent across all three dimensions. On new tasks, models failed to generalize and instead replicated the closest patterns they had seen during training. When faced with reasoning chains of different lengths, they struggled, often trying to artificially add or remove steps to match the length of their training examples. Finally, their performance proved highly sensitive to superficial changes in the prompt, especially variations in core elements and instructions.

Interestingly, the researchers found that these failures could be quickly fixed. By fine-tuning the models on a very small sample of the new, unseen data through supervised fine-tuning (SFT), performance on that specific type of problem increased rapidly. However, this quick fix further supports the pattern-matching theory, suggesting the model isn’t learning to reason more abstractly but is instead just memorizing a new pattern to overcome a specific weakness.
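
As an illustration of how small such a "patch" can be, the sketch below fine-tunes a small causal language model on a few hundred examples from the previously unseen slice, using the Hugging Face transformers and datasets libraries. The model name, toy data, and hyperparameters are illustrative assumptions; the paper's own experiments train models from scratch in its controlled environment.

```python
# Hedged sketch of a supervised fine-tuning (SFT) "patch" on a new data slice.
# Assumes the Hugging Face transformers/datasets stack; "gpt2" and the toy
# prompt/answer pairs are illustrative, not the paper's setup.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "gpt2"
tok = AutoTokenizer.from_pretrained(model_name)
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# A few hundred prompt+answer strings drawn from the failing slice.
patch_texts = [
    "Input: abcde\nOutput: fghij",   # toy examples of the new pattern
    "Input: hello\nOutput: mjqqt",
] * 100

ds = Dataset.from_dict({"text": patch_texts}).map(
    lambda batch: tok(batch["text"], truncation=True, max_length=128),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sft-patch", num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()  # accuracy on this slice improves; abstract reasoning does not
```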

Takeaways for the enterprise

The researchers offer a direct warning to practitioners, highlighting “the risk of relying on CoT as a plug-and-play solution for reasoning tasks and caution against equating CoT-style output with human thinking.” They provide three key pieces of advice for developers building applications with LLMs.

1) Guard against over-reliance and false confidence. CoT should not be treated as a reliable module for reasoning in high-stakes fields like finance or legal analysis. LLMs can produce “fluent nonsense” (plausible but logically flawed reasoning) that is more deceptive than an outright incorrect answer. The authors stress that “sufficient auditing from domain experts is indispensable.”

“The advance of science should remain human-centered—machines can assist, but discovery still thrives on humanity and curiosity,” Zhao said.

2) Prioritize out-of-distribution (OOD) testing. Standard validation, where test data mirrors training data, is not enough to measure true robustness. Developers must implement rigorous testing that systematically probes for failures across task, length, and format variations.

3) Recognize fine-tuning as a patch, not a panacea. While supervised fine-tuning (SFT) can quickly “patch” a model’s performance on a specific new data distribution, it does not create true generalization. It simply expands the model’s “in-distribution bubble” slightly. Relying on SFT to fix every OOD failure is an unsustainable strategy that fails to address the model’s core lack of abstract reasoning.

While CoT isn’t a form of human cognition, this limitation can be managed. Most enterprise applications involve a relatively narrow and predictable set of tasks. The paper’s findings provide a blueprint for ensuring reliability within these domains. Developers can build rigorous evaluation suites that systematically test model performance against the specific task, length, and format variations their application will encounter. This allows them to map out the boundaries of a model’s “in-distribution” comfort zone and identify where it aligns with their specific needs.
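
A minimal sketch of that kind of evaluation suite follows, assuming test examples tagged with the slice they belong to (as in the data sketch earlier) and a placeholder `ask_model` function standing in for the application's real inference call and answer parsing.

```python
# Hedged sketch of slice-wise OOD evaluation: score the model separately on
# each (task, chain length, format) slice so the boundary of its
# "in-distribution bubble" becomes visible. `ask_model` is a placeholder.
from collections import defaultdict

def ask_model(prompt: str) -> str:
    raise NotImplementedError  # call your LLM and extract the final answer

def evaluate_by_slice(examples: list[dict]) -> dict:
    """examples: dicts with 'prompt', 'target', and slice keys
    'task', 'chain_len', 'format' (as in the earlier data sketch)."""
    scores = defaultdict(lambda: [0, 0])            # slice -> [correct, total]
    for ex in examples:
        key = (ex["task"], ex["chain_len"], ex["format"])
        pred = ask_model(ex["prompt"]).strip()
        scores[key][0] += int(pred == ex["target"])
        scores[key][1] += 1
    return {k: correct / total for k, (correct, total) in scores.items()}

# Slices whose accuracy collapses relative to the in-distribution slice are
# candidates for a targeted SFT patch, or for being kept out of scope entirely.
```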

This targeted testing transforms fine-tuning from a reactive “patch” into a proactive strategy for alignment. When evaluations reveal a specific weakness, developers can create small, targeted SFT datasets to address it. Instead of trying to achieve broad, general reasoning, this approach uses SFT surgically to ensure the model’s pattern-matching capabilities are precisely aligned with the contours of a specific enterprise task. Ultimately, the study offers a practical lens for moving beyond hope and engineering LLM applications to achieve predictable success.
