
[2506.11618] Convergent Linear Representations of Emergent Misalignment

By Advanced AI Editor, June 23, 2025

[Submitted on 13 Jun 2025 (v1), last revised 20 Jun 2025 (this version, v2)]

Paper by Anna Soligo and 3 other authors

Abstract: Fine-tuning large language models on narrow datasets can cause them to develop broadly misaligned behaviours: a phenomenon known as emergent misalignment. However, the mechanisms underlying this misalignment, and why it generalizes beyond the training domain, are poorly understood, demonstrating critical gaps in our knowledge of model alignment. In this work, we train and study a minimal model organism which uses just 9 rank-1 adapters to emergently misalign Qwen2.5-14B-Instruct. Studying this, we find that different emergently misaligned models converge to similar representations of misalignment. We demonstrate this convergence by extracting a ‘misalignment direction’ from one fine-tuned model’s activations, and using it to effectively ablate misaligned behaviour from fine-tunes using higher-dimensional LoRAs and different datasets. Leveraging the scalar hidden state of rank-1 LoRAs, we further present a set of experiments for directly interpreting the fine-tuning adapters, showing that six contribute to general misalignment, while two specialise for misalignment in just the fine-tuning domain. Emergent misalignment is a particularly salient example of undesirable and unexpected model behaviour, and by advancing our understanding of the mechanisms behind it, we hope to move towards being able to better understand and mitigate misalignment more generally.
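
To make the two ideas in the abstract concrete, here is a minimal NumPy sketch on synthetic data. It is an illustration under stated assumptions, not the authors' code: the abstract does not say how the 'misalignment direction' is extracted, so a difference of mean activations is assumed here, and ablation is assumed to mean projecting that direction out of each activation vector. The function names, array shapes, and the `scale` parameter are all hypothetical.

```python
import numpy as np


def mean_diff_direction(misaligned_acts, aligned_acts):
    """Unit vector from the aligned mean to the misaligned mean.

    ASSUMPTION: difference-of-means is one common way to get a linear
    direction; the abstract does not specify the extraction method.
    """
    diff = misaligned_acts.mean(axis=0) - aligned_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)


def ablate_direction(acts, direction):
    """Remove the component of every activation row along `direction`."""
    d = direction / np.linalg.norm(direction)
    return acts - np.outer(acts @ d, d)


def rank1_lora_update(x, a, b, scale=1.0):
    """Rank-1 LoRA update: delta = scale * (a . x) * b.

    The intermediate (a . x) is a single scalar per input, which is the
    'scalar hidden state' a rank-1 adapter exposes.
    """
    s = x @ a                      # shape (n,): one scalar per input row
    return s, scale * np.outer(s, b)


# Synthetic activations (hypothetical): misaligned rows are shifted along axis 3.
rng = np.random.default_rng(0)
hidden = 16
true_dir = np.zeros(hidden)
true_dir[3] = 1.0
aligned = rng.normal(size=(200, hidden))
misaligned = rng.normal(size=(200, hidden)) + 4.0 * true_dir

direction = mean_diff_direction(misaligned, aligned)
cleaned = ablate_direction(misaligned, direction)

# Rank-1 adapter on the same synthetic inputs.
a = rng.normal(size=hidden)        # "A" row of the adapter
b = rng.normal(size=hidden)        # "B" column of the adapter
scalars, delta = rank1_lora_update(misaligned, a, b)
```

After ablation, every row of `cleaned` has zero component along `direction`, while components orthogonal to it are unchanged. In a real model the same projection would be applied to residual-stream activations during the forward pass, which this sketch does not attempt.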

Submission history

From: Anna Soligo
[v1] Fri, 13 Jun 2025 09:39:54 UTC (5,542 KB)
[v2] Fri, 20 Jun 2025 17:23:55 UTC (5,685 KB)
