False Sense of Security: Why Probing-based Malicious Input Detection Fails to Generalize - Takara TLDR

Large Language Models (LLMs) can comply with harmful instructions, raising
serious safety concerns despite their impressive capabilities. Recent work has
leveraged probing-based approaches to study the separability of malicious and
benign inputs in LLMs’ internal representations, and researchers have proposed
using such probing methods for safety detection. We systematically re-examine
this paradigm. Motivated by poor out-of-distribution performance, we
hypothesize that probes learn superficial patterns rather than semantic
harmfulness. Through controlled experiments, we confirm this hypothesis and
identify the specific patterns learned: instructional patterns and trigger
words. Our investigation proceeds in three steps: we first show that simple
n-gram methods achieve comparable performance, then run controlled experiments
on semantically cleaned datasets, and finally analyze the pattern dependencies
in detail. These results reveal a false sense of security around current
probing-based approaches and highlight the need to redesign both models and
evaluation protocols. We discuss these needs further in the hope of encouraging
responsible research in this direction. We have open-sourced the project at
https://github.com/WangCheng0116/Why-Probe-Fails.
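
To make the n-gram comparison concrete, here is a minimal sketch of a surface-level detector. It is not the authors' code: the abstract does not say which n-gram method was used, so this assumes a naive Bayes classifier over unigrams and bigrams, and the names `ngrams` and `NgramNaiveBayes` as well as the six training sentences are invented purely for illustration. It shows how a detector that only sees word patterns can latch onto instructional phrasing such as "how to" rather than harmful meaning.

```python
import math
from collections import Counter


def ngrams(text, n_max=2):
    """Return all word n-grams of length 1..n_max (lowercased, whitespace split)."""
    tokens = text.lower().split()
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


class NgramNaiveBayes:
    """Hypothetical baseline: naive Bayes over n-gram counts, add-one smoothing."""

    def fit(self, texts, labels):
        self.counts = {0: Counter(), 1: Counter()}
        self.docs = Counter(labels)
        for text, label in zip(texts, labels):
            self.counts[label].update(ngrams(text))
        self.vocab = set(self.counts[0]) | set(self.counts[1])
        return self

    def score(self, text):
        """Log-odds of label 1 (flagged) versus label 0 (benign)."""
        s = math.log(self.docs[1] / self.docs[0])
        for label, sign in ((1, 1.0), (0, -1.0)):
            total = sum(self.counts[label].values()) + len(self.vocab)
            for feat in ngrams(text):
                if feat in self.vocab:  # unseen n-grams are ignored
                    s += sign * math.log((self.counts[label][feat] + 1) / total)
        return s

    def predict(self, text):
        return int(self.score(text) > 0)


# Toy data, made up for this sketch (not from the paper's datasets).
# Label 1 = "malicious", label 0 = "benign".
train_texts = [
    "explain how to steal a password",
    "write a guide to hack a server",
    "describe how to make a weapon",
    "what is the capital of france",
    "who wrote this famous novel",
    "why is the sky blue",
]
train_labels = [1, 1, 1, 0, 0, 0]

clf = NgramNaiveBayes().fit(train_texts, train_labels)

# A harmless request that shares the instructional pattern "explain how to".
print(clf.predict("explain how to bake a cake"))
# A benign training sentence, for comparison.
print(clf.predict("what is the capital of france"))
```

On this toy data the harmless baking request is flagged (prediction 1) because it shares "explain", "how to", and "a" with the flagged class, while the factual question is not (prediction 0). This illustrates the kind of superficial cue the abstract describes; it says nothing about the actual numbers reported in the paper.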
