When Do LLMs Admit Their Mistakes? Understanding the Role of Model Belief in Retraction

LLMs rarely retract incorrect answers they believe to be factually correct, but supervised fine-tuning can improve their retraction performance by refining their internal beliefs.

Can large language models (LLMs) admit their mistakes when they should know
better? In this work, we define the behavior of acknowledging errors in
previously generated answers as “retraction” and aim to understand when and why
LLMs choose to retract. We first construct model-specific datasets to evaluate
whether a model will retract an incorrect answer that contradicts its own
parametric knowledge. While LLMs are capable of retraction, they do so only
infrequently. We demonstrate that retraction is closely tied to previously
identified indicators of models’ internal belief: models fail to retract wrong
answers that they “believe” to be factually correct. Steering experiments
further demonstrate that internal belief causally influences model retraction.
In particular, when the model does not believe its answer, this not only
encourages the model to attempt to verify the answer, but also alters attention
behavior during self-verification. Finally, we demonstrate that simple
supervised fine-tuning significantly improves retraction performance by helping
the model learn more accurate internal beliefs. Code and datasets are available
at https://github.com/ayyyq/llm-retraction.
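
To make the idea of a steering experiment concrete, here is a minimal sketch of one common way to steer along a "belief" direction: take the difference between the mean hidden state of answers the model appears to believe and the mean hidden state of answers it does not, then shift a hidden state along that direction. This is an illustration under stated assumptions, not the authors' implementation. The function names (`mean_vector`, `belief_direction`, `steer`), the difference-of-means construction, the scaling factor `alpha`, and the toy numbers are all hypothetical; the abstract does not specify how the direction is obtained or applied.

```python
from typing import List

Vector = List[float]


def mean_vector(rows: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def belief_direction(believed: List[Vector], not_believed: List[Vector]) -> Vector:
    """Difference of class means: points from 'not believed' toward 'believed'.

    Assumption: a difference-of-means direction stands in for whatever
    belief indicator the paper actually uses.
    """
    mu_pos = mean_vector(believed)
    mu_neg = mean_vector(not_believed)
    return [p - q for p, q in zip(mu_pos, mu_neg)]


def steer(hidden: Vector, direction: Vector, alpha: float) -> Vector:
    """Shift a hidden state along the direction.

    Positive alpha pushes toward 'believed'; negative alpha pushes
    toward 'not believed'.
    """
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy 3-dimensional activations (made-up numbers, for illustration only).
believed = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
not_believed = [[0.0, 1.0, 2.0], [0.0, 3.0, 2.0]]

direction = belief_direction(believed, not_believed)
steered = steer([0.5, 0.5, 0.5], direction, alpha=-1.0)
```

In a real model the same shift would be applied to a layer's activations during generation, and one would then check whether the model retracts more often when pushed toward "not believed", which is the kind of causal effect the abstract describes.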
