HSI: Head-Specific Intervention Can Induce Misaligned AI Coordination in Large Language Models, by Paul Darm and 1 other authors
Abstract: Robust alignment guardrails for large language models are becoming increasingly important with their widespread application. In contrast to previous studies, we demonstrate that inference-time activation interventions can bypass safety alignment and effectively steer model generations towards harmful AI coordination for Llama 2. Our method applies fine-grained interventions at specific model subcomponents, particularly attention heads, using a simple binary choice probing strategy. These interventions then generalise to the open-ended generation setting, effectively circumventing safety guardrails. We show that probing single attention heads is more effective than intervening on full layers, and that intervening on only four attention heads is comparable to supervised fine-tuning. We further show that only a few example completions are needed to compute effective steering directions, which is an advantage over classical fine-tuning. Our findings highlight the shortcomings of current alignment techniques. In addition, our results suggest that, at the attention head level, activations encode fine-grained, linearly separable behaviours. Practically, the approach offers a straightforward methodology for steering large language model behaviour, which could be extended to diverse domains beyond safety that require fine-grained control over the model output. The code and datasets for this study can be found on this https URL.
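As a rough illustration of the head-level steering idea described in the abstract (not the authors' released code), the sketch below computes a steering direction for a single attention head as the mean difference between that head's activations on contrastive binary-choice completions, then adds the direction to the head's output at inference time via a forward pre-hook. The checkpoint name, layer and head indices, scaling factor, and the example prompts are hypothetical placeholders.

```python
# Minimal sketch of head-specific activation steering (illustrative, not the paper's code).
# Assumes a HuggingFace Llama-style model; layer/head indices and data are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"   # placeholder checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
model.eval()

LAYER, HEAD = 14, 21                              # hypothetical head found via probing
head_dim = model.config.hidden_size // model.config.num_attention_heads
sl = slice(HEAD * head_dim, (HEAD + 1) * head_dim)  # this head's slice of the o_proj input

def head_activation(prompt: str) -> torch.Tensor:
    """Return the chosen head's output at the final token position."""
    captured = {}
    def grab(module, args):
        captured["act"] = args[0][:, -1, sl].detach()
    handle = model.model.layers[LAYER].self_attn.o_proj.register_forward_pre_hook(grab)
    with torch.no_grad():
        model(**tok(prompt, return_tensors="pt").to(model.device))
    handle.remove()
    return captured["act"].squeeze(0)

# A few contrastive binary-choice completions suffice to estimate a direction (placeholder data).
pos_prompts = ["Question ... Answer: (A)"]   # completions exhibiting the target behaviour
neg_prompts = ["Question ... Answer: (B)"]   # completions exhibiting the opposite behaviour
direction = torch.stack([head_activation(p) for p in pos_prompts]).mean(0) \
          - torch.stack([head_activation(p) for p in neg_prompts]).mean(0)

def steer(module, args, alpha=8.0):
    """Add the steering direction to this head's output at every position."""
    hidden = args[0].clone()
    hidden[:, :, sl] += alpha * direction.to(hidden.dtype)
    return (hidden,) + args[1:]

hook = model.model.layers[LAYER].self_attn.o_proj.register_forward_pre_hook(steer)
inputs = tok("Tell me about AI coordination.", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
hook.remove()
```

In this sketch the intervention is applied to a single head for clarity; the abstract reports that intervening on as few as four probed heads is comparable to supervised fine-tuning.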
Submission history
From: Paul Darm
[v1] Sun, 9 Feb 2025 16:11:57 UTC (433 KB)
[v2] Thu, 1 May 2025 09:03:35 UTC (1,875 KB)