Paper page - Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning

Concept Ablation Fine-Tuning (CAFT) uses interpretability tools to steer LLM generalization away from unintended concepts without altering training data.

Fine-tuning large language models (LLMs) can lead to unintended
out-of-distribution generalization. Standard approaches to this problem rely on
modifying training data, for example by adding data that better specify the
intended generalization. However, this is not always practical. We introduce
Concept Ablation Fine-Tuning (CAFT), a technique that leverages
interpretability tools to control how LLMs generalize from fine-tuning, without
needing to modify the training data or otherwise use data from the target
distribution. Given a set of directions in an LLM’s latent space corresponding
to undesired concepts, CAFT works by ablating these concepts with linear
projections during fine-tuning, steering the model away from unintended
generalizations. We successfully apply CAFT to three fine-tuning tasks,
including emergent misalignment, a phenomenon where LLMs fine-tuned on a narrow
task generalize to give egregiously misaligned responses to general questions.
Without any changes to the fine-tuning data, CAFT reduces misaligned responses
by 10x without degrading performance on the training distribution. Overall,
CAFT represents a novel approach for steering LLM generalization without
modifying training data.
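
The abstract describes the core operation only as "ablating these concepts with linear projections". As a rough illustration, the sketch below assumes this means orthogonally projecting an activation vector onto the complement of the subspace spanned by the concept directions. The function names (`orthonormalize`, `ablate`), the toy 3-dimensional vectors, and the use of Gram-Schmidt are illustrative assumptions, not details taken from the paper. How the directions are found and which layers are ablated are not specified here.

```python
from typing import List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def orthonormalize(directions: List[Vector], eps: float = 1e-12) -> List[Vector]:
    """Gram-Schmidt: return an orthonormal basis for the span of `directions`.

    Hypothetical helper; concept directions need not be orthogonal to each
    other, so they are orthonormalized before projecting.
    """
    basis: List[Vector] = []
    for d in directions:
        r = list(d)
        for b in basis:
            c = dot(r, b)
            r = [ri - c * bi for ri, bi in zip(r, b)]
        norm = dot(r, r) ** 0.5
        if norm > eps:  # skip directions already inside the span
            basis.append([ri / norm for ri in r])
    return basis


def ablate(activation: Vector, basis: List[Vector]) -> Vector:
    """Remove the component of `activation` that lies in the span of `basis`.

    `basis` must be orthonormal, so subtracting each component in turn gives
    the orthogonal projection onto the complement of the concept subspace.
    """
    out = list(activation)
    for b in basis:
        c = dot(out, b)
        out = [oi - c * bi for oi, bi in zip(out, b)]
    return out


# Toy example: two (non-orthogonal) undesired concept directions in 3-D.
concept_directions = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
basis = orthonormalize(concept_directions)

h = [3.0, -2.0, 5.0]            # stand-in for a hidden activation
h_ablated = ablate(h, basis)    # no remaining component along either concept
```

In an actual fine-tuning run, a projection like this would presumably be applied to the model's hidden activations during the forward pass, so that the training updates cannot rely on the ablated directions. That placement is an inference from the abstract rather than a confirmed implementation detail.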

