Paper Page - Machine Bullshit: Characterizing The Emergent Disregard For Truth In Large Language Models

Machine bullshit, characterized by LLMs’ indifference to truth, is quantified and analyzed through a new framework, revealing that RLHF and CoT prompting exacerbate certain bullshit forms.

Bullshit, as conceptualized by philosopher Harry Frankfurt, refers to
statements made without regard to their truth value. While previous work has
explored large language model (LLM) hallucination and sycophancy, we propose
machine bullshit as an overarching conceptual framework that can allow
researchers to characterize the broader phenomenon of emergent loss of
truthfulness in LLMs and shed light on its underlying mechanisms. We introduce
the Bullshit Index, a novel metric quantifying LLMs’ indifference to truth, and
propose a complementary taxonomy analyzing four qualitative forms of bullshit:
empty rhetoric, paltering, weasel words, and unverified claims. We conduct
empirical evaluations on the Marketplace dataset, the Political Neutrality
dataset, and our new BullshitEval benchmark (2,400 scenarios spanning 100 AI
assistants) explicitly designed to evaluate machine bullshit. Our results
demonstrate that model fine-tuning with reinforcement learning from human
feedback (RLHF) significantly exacerbates bullshit and that inference-time
chain-of-thought (CoT) prompting notably amplifies specific bullshit forms,
particularly empty rhetoric and paltering. We also observe prevalent machine
bullshit in political contexts, with weasel words as the dominant strategy. Our
findings highlight systematic challenges in AI alignment and provide new
insights toward more truthful LLM behavior.
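
The abstract names the Bullshit Index but does not give its formula. As a rough illustration only, and not the paper's definition, one way to put a number on "indifference to truth" is to measure how weakly a model's explicit claims track its own internal belief: one minus the absolute correlation between the two. The function name, the inputs, and the choice of Pearson correlation below are all assumptions made for this sketch.

```python
from math import sqrt


def indifference_score(beliefs, claims):
    """Hypothetical indifference-to-truth score (an assumption, not the paper's formula).

    beliefs: the model's internal probability that each statement is true (0..1).
    claims:  what the model explicitly asserted (1 = "true", 0 = "false").
    Returns 1 - |Pearson r|: near 0 when claims track beliefs (truthfully or
    as systematic lies), near 1 when claims are unrelated to beliefs.
    """
    n = len(beliefs)
    if n != len(claims) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")

    mean_b = sum(beliefs) / n
    mean_c = sum(claims) / n

    # Covariance and variances (unnormalised; the 1/n factors cancel in r).
    cov = sum((b - mean_b) * (c - mean_c) for b, c in zip(beliefs, claims))
    var_b = sum((b - mean_b) ** 2 for b in beliefs)
    var_c = sum((c - mean_c) ** 2 for c in claims)
    if var_b == 0 or var_c == 0:
        raise ValueError("correlation is undefined when an input is constant")

    r = cov / sqrt(var_b * var_c)
    return 1.0 - abs(r)


if __name__ == "__main__":
    # Claims follow beliefs closely: score close to 0.
    print(indifference_score([0.9, 0.1, 0.8, 0.2], [1, 0, 1, 0]))
    # Claims are unrelated to beliefs: score close to 1.
    print(indifference_score([0.9, 0.1, 0.9, 0.1], [1, 1, 0, 0]))
```

Taking the absolute value means a model that consistently asserts the opposite of what it believes still scores low: it is lying, which requires tracking the truth, rather than being indifferent to it in Frankfurt's sense.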
