LLMs’ “simulated Reasoning” Abilities Are A “brittle Mirage,” Researchers Find

The researchers used test cases that fall outside of the LLM training data in task type, format, and length.

Credit:

Zhao et al

These simplified models were then tested using a variety of tasks, some of which precisely or closely matched the function patterns in the training data and others that required function compositions that were either partially or fully “out of domain” for the training data. For instance, a model trained on data showing two cyclical shifts might be asked to perform a novel transformation involving two ROT shifts (with basic training on what a single example of either shift looks like). The final answers and reasoning steps were compared to the desired answer using BLEU scores and Levenshtein Distance for an objective measure of their accuracy.

As the researchers hypothesized, these basic models started to fail catastrophically when asked to generalize novel sets of transformations that were not directly demonstrated in the training data. While the models would often try to generalize new logical rules based on similar patterns in the training data, this would quite often lead to the model laying out “correct reasoning paths, yet incorrect answer[s].” In other cases, the LLM would sometimes stumble onto correct answers paired with “unfaithful reasoning paths” that didn’t follow logically.

“Rather than demonstrating a true understanding of text, CoT reasoning under task transformations appears to reflect a replication of patterns learned during training,” the researchers write.

As requested tasks get further outside the training distribution (redder dots), the answers provided drift farther from the desired answer (lower right of the graph).

Credit:

Zhao et al

The researchers went on to test their controlled system using input text strings slightly shorter or longer than those found in the training data, or that required function chains of different lengths than those it was trained on. In both cases the accuracy of the results “deteriorates as the [length] discrepancy increases,” thus “indicating the failure of generalization” in the models. Small, unfamiliar-to-the-model discrepancies in the format of the test tasks (e.g., the introduction of letters or symbols not found in the training data) also caused performance to “degrade sharply” and “affect[ed] the correctness” of the model’s responses, the researchers found.

Source link

What's Hot

MindJourney enables AI to explore simulated 3D worlds to improve spatial interpretation

IBM’s and NASA’s Surya AI model is designed to predict the next ‘Carrington-class’ solar storm

US tech stocks slide after Altman warns of ‘bubble’ in AI and MIT study doubts the hype

LLMs’ “simulated reasoning” abilities are a “brittle mirage,” researchers find

IBM and NASA made an open-source AI model for predicting solar weather

IBM Ventures Invests $500M in Quantum and AI Startups to Fuel Market Growth

AI Dev 25: Andrew Ng Hosts AI Developer Conference in New York City for Hands-On Learning and Networking | AI News Detail

Getty Grants $2.6 M. to Black Visual Arts Archives Across the U.S.

Barbara Hepworth Sculpture Will Remain in UK After £3.8 M. Raised

After 12-Year Hiatus, Egypt’s Alexandria Biennale Will Return

Ai Weiwei Visits Ukraine’s Front Line Ahead of Kyiv Installation

MindJourney enables AI to explore simulated 3D worlds to improve spatial interpretation

IBM’s and NASA’s Surya AI model is designed to predict the next ‘Carrington-class’ solar storm

US tech stocks slide after Altman warns of ‘bubble’ in AI and MIT study doubts the hype

What's Hot

LLMs’ “simulated reasoning” abilities are a “brittle mirage,” researchers find

Related Posts

Subscribe to Updates