Probing the Vulnerability of Large Language Models to Polysemantic Interventions

arXiv:2505.11611v1 Announce Type: new

Abstract: Polysemanticity, in which individual neurons encode multiple unrelated features, is a well-known characteristic of large neural networks and remains a central challenge in the interpretability of language models. Its implications for model safety, however, remain poorly understood. Leveraging recent advances in sparse autoencoders, we investigate the polysemantic structure of two small models (Pythia-70M and GPT-2-Small) and evaluate their vulnerability to targeted, covert interventions at the prompt, feature, token, and neuron levels. Our analysis reveals a consistent polysemantic topology shared across both models. Strikingly, we demonstrate that this structure can be exploited to mount effective interventions on two larger, black-box instruction-tuned models (LLaMA3.1-8B-Instruct and Gemma-2-9B-Instruct). These findings not only suggest that the interventions generalize but also point to a stable and transferable polysemantic structure that may persist across architectures and training regimes.
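To make the idea of a feature-level intervention concrete, the sketch below shows one way such an edit could be applied to GPT-2-Small with a sparse autoencoder (SAE): a hidden state is encoded into SAE features, a single feature is suppressed, and the reconstruction is written back through a forward hook. This is not the paper's released code; the SAE weights here are random placeholders, and the layer index (6) and feature index (123) are hypothetical, standing in for a trained SAE and an interpretable target feature.

```python
# Minimal, illustrative sketch of a feature-level intervention on GPT-2-Small.
# The SAE weights, layer index, and feature index are placeholders, not the
# paper's trained artifacts.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model_name = "gpt2"            # GPT-2-Small
layer_idx = 6                  # assumed intervention layer
feature_idx = 123              # hypothetical SAE feature to suppress
d_model, d_sae = 768, 768 * 8  # toy SAE dimensions

tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name).eval()

# Toy SAE parameters; in practice these would be loaded from a trained SAE.
W_enc = torch.randn(d_model, d_sae) * 0.02
W_dec = torch.randn(d_sae, d_model) * 0.02
b_enc = torch.zeros(d_sae)
b_dec = torch.zeros(d_model)

def intervene(module, inputs, output):
    """Encode the block's hidden states with the SAE, zero one feature, decode back."""
    hidden = output[0]                            # (batch, seq, d_model)
    feats = torch.relu((hidden - b_dec) @ W_enc + b_enc)
    feats[..., feature_idx] = 0.0                 # suppress the target feature
    recon = feats @ W_dec + b_dec
    return (recon,) + output[1:]                  # replace the block's output

handle = model.transformer.h[layer_idx].register_forward_hook(intervene)

inputs = tokenizer("The Eiffel Tower is located in", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(tokenizer.decode(logits[0, -1].argmax()))   # next-token prediction under intervention

handle.remove()
```

A forward hook is used here only as a convenient way to overwrite one layer's activations; comparing the intervened next-token distribution against an unhooked run is the natural way to measure the effect of suppressing (or amplifying) a chosen feature.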