CRISP: Persistent Concept Unlearning via Sparse Autoencoders - Takara TLDR

As large language models (LLMs) are increasingly deployed in real-world
applications, the need to selectively remove unwanted knowledge while
preserving model utility has become paramount. Recent work has explored sparse
autoencoders (SAEs) to perform precise interventions on monosemantic features.
However, most SAE-based methods operate at inference time and therefore do not
change the model’s parameters, so a malicious actor with parameter access can
bypass or reverse the intervention. We introduce
CRISP, a parameter-efficient method for persistent concept unlearning using
SAEs. CRISP automatically identifies salient SAE features across multiple
layers and suppresses their activations. We experiment with two LLMs and show
that our method outperforms prior approaches on safety-critical unlearning
tasks from the WMDP benchmark, successfully removing harmful knowledge while
preserving general and in-domain capabilities. Feature-level analysis reveals
that CRISP achieves semantically coherent separation between target and benign
concepts, allowing precise suppression of the target features.
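
The abstract does not spell out how salient features are chosen or how suppression is trained, so the following is only a rough sketch of the general idea, not the authors' procedure. It assumes SAE feature activations have already been collected per layer on a target corpus and on a benign corpus, scores each feature by the gap between its mean activation on the two, and measures how strongly the selected features still fire. The function names, the activation-gap criterion, the `min_gap` and `top_k` thresholds, and the toy numbers are all hypothetical.

```python
from statistics import mean

def select_salient_features(target_acts, benign_acts, top_k=2, min_gap=0.1):
    """Pick features that fire more on target text than on benign text.

    Each argument is a list of per-token activation vectors (lists of floats).
    The mean-activation gap used here is an assumed criterion, not CRISP's.
    """
    n_features = len(target_acts[0])
    gaps = []
    for j in range(n_features):
        gap = mean(r[j] for r in target_acts) - mean(r[j] for r in benign_acts)
        if gap > min_gap:
            gaps.append((gap, j))
    gaps.sort(reverse=True)  # largest gap first
    return [j for _, j in gaps[:top_k]]

def suppression_penalty(acts, feature_ids):
    """Mean squared activation of the selected features (0.0 if none selected)."""
    if not feature_ids:
        return 0.0
    return mean(row[j] ** 2 for row in acts for j in feature_ids)

def ablate(acts, feature_ids):
    """Zero the selected features; this mimics an inference-time intervention."""
    ids = set(feature_ids)
    return [[0.0 if j in ids else a for j, a in enumerate(row)] for row in acts]

# Toy activations for two hypothetical layers (rows = tokens, columns = features).
target_by_layer = {
    "layer_3": [[0.9, 0.1, 0.0, 0.5], [0.8, 0.2, 0.0, 0.4]],
    "layer_7": [[0.2, 0.7], [0.1, 0.9]],
}
benign_by_layer = {
    "layer_3": [[0.0, 0.1, 0.0, 0.5], [0.1, 0.3, 0.0, 0.6]],
    "layer_7": [[0.2, 0.1], [0.3, 0.0]],
}

# Select salient features independently in each layer.
selected = {
    layer: select_salient_features(target_by_layer[layer], benign_by_layer[layer])
    for layer in target_by_layer
}

# Penalty on target text before and after zeroing the selected features.
before = {l: suppression_penalty(target_by_layer[l], selected[l]) for l in selected}
after = {
    l: suppression_penalty(ablate(target_by_layer[l], selected[l]), selected[l])
    for l in selected
}
```

Zeroing activations as in `ablate` is itself an inference-time edit. A persistent variant would instead update model parameters so that a quantity like `suppression_penalty` becomes small on target text while behavior on benign text is preserved; how CRISP does this is not described in the abstract.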
