Persuasion Dynamics in LLMs: Investigating Robustness and Adaptability in Knowledge and Safety with DuET-PD - Takara TLDR

Large Language Models (LLMs) can struggle to balance gullibility to
misinformation and resistance to valid corrections in persuasive dialogues, a
critical challenge for reliable deployment. We introduce DuET-PD (Dual
Evaluation for Trust in Persuasive Dialogues), a framework evaluating
multi-turn stance-change dynamics across dual dimensions: persuasion type
(corrective/misleading) and domain (knowledge via MMLU-Pro, and safety via
SALAD-Bench). We find that even a state-of-the-art model like GPT-4o achieves
only 27.32% accuracy on MMLU-Pro under sustained misleading persuasion.
Moreover, results reveal a concerning trend of increasing sycophancy in newer
open-source models. To address this, we introduce Holistic DPO, a training
approach balancing positive and negative persuasion examples. Unlike prompting
or resist-only training, Holistic DPO enhances both robustness to
misinformation and receptiveness to corrections, improving
Llama-3.1-8B-Instruct’s accuracy under misleading persuasion in safety contexts
from 4.21% to 76.54%. These contributions offer a pathway to developing more
reliable and adaptable LLMs for multi-turn dialogue. Code is available at
https://github.com/Social-AI-Studio/DuET-PD.
