Probing The Vulnerability Of Large Language Models To Polysemantic Interventions

arXiv:2505.11611v1
Abstract: Polysemanticity, where individual neurons encode multiple unrelated features, is a well-known characteristic of large neural networks and remains a central challenge in the interpretability of language models. Its implications for model safety, however, are poorly understood. Leveraging recent advances in sparse autoencoders, we investigate the polysemantic structure of two small models (Pythia-70M and GPT-2-Small) and evaluate their vulnerability to targeted, covert interventions at the prompt, feature, token, and neuron levels. Our analysis reveals a consistent polysemantic topology shared across both models. Strikingly, we demonstrate that this structure can be exploited to mount effective interventions on two larger, black-box instruction-tuned models (LLaMA3.1-8B-Instruct and Gemma-2-9B-Instruct). These findings suggest that the interventions generalize, and they point to a stable and transferable polysemantic structure that could persist across architectures and training regimes.
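To make the terms in the abstract concrete, here is a minimal sketch of how a sparse autoencoder can expose polysemantic neurons and what a feature-level intervention might look like. It is an illustration only, not the paper's method: the tied-weight ReLU autoencoder, the hand-written toy weights, and the helper names (`sae_encode`, `features_per_neuron`, `steer_feature`) are all assumptions introduced here.

```python
import numpy as np

# Toy decoder matrix (hypothetical values): 3 features over 2 neurons.
# Each row is the direction a feature writes into neuron space.
W_DEC = np.array([
    [1.0, 0.0],   # feature 0 uses only neuron 0
    [0.0, 1.0],   # feature 1 uses only neuron 1
    [0.6, 0.8],   # feature 2 uses both neurons
])
W_ENC = W_DEC.T          # tied weights, an assumption made for brevity
B_ENC = np.zeros(3)


def sae_encode(x, w_enc=W_ENC, b_enc=B_ENC):
    """Map a neuron activation vector to non-negative feature activations."""
    return np.maximum(0.0, x @ w_enc + b_enc)


def features_per_neuron(w_dec=W_DEC, threshold=0.5):
    """Count, for each neuron, the features whose decoder weight exceeds threshold.

    A count above 1 marks the neuron as polysemantic in this toy setting.
    """
    return (np.abs(w_dec) > threshold).sum(axis=0)


def steer_feature(x, feature_idx, delta, w_dec=W_DEC):
    """Feature-level intervention: push x along one feature's decoder direction."""
    return x + delta * w_dec[feature_idx]


if __name__ == "__main__":
    x = np.array([1.0, 0.0])
    print("feature activations:", sae_encode(x))
    print("features per neuron:", features_per_neuron())
    print("steered activation:", steer_feature(np.zeros(2), feature_idx=2, delta=2.0))
```

In this toy setup both neurons carry two features each, so steering along feature 2 necessarily moves the neurons that features 0 and 1 also rely on. That shared structure is the kind of overlap the abstract describes as exploitable.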
