Recent progress in Multimodal Large Language Models (MLLMs) has unlocked
powerful cross-modal reasoning abilities, but also raised new safety concerns,
particularly when faced with adversarial multimodal inputs. To improve the
safety of MLLMs during inference, we introduce AutoSteer, a modular and adaptive
inference-time intervention technique that requires no fine-tuning of the
underlying model. AutoSteer incorporates three core
components: (1) a novel Safety Awareness Score (SAS) that automatically
identifies which of the model’s internal layers encodes the most safety-relevant
distinctions; (2) an adaptive safety prober trained to estimate the likelihood of
toxic outputs from intermediate representations; and (3) a lightweight Refusal
Head that selectively intervenes to modulate generation when safety risks are
detected. Experiments on LLaVA-OV and Chameleon across diverse safety-critical
benchmarks demonstrate that AutoSteer significantly reduces the Attack Success
Rate (ASR) for textual, visual, and cross-modal threats while preserving
general capabilities. These findings position AutoSteer as a practical,
interpretable, and effective framework for safer deployment of multimodal AI
systems.
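
To make the three components concrete, the following is a minimal sketch of an AutoSteer-style pipeline under stated assumptions: the toy data, shapes, function names, and the use of a linear probe both as a proxy for the Safety Awareness Score and as the safety prober are illustrative choices, not the paper's exact formulations.

```python
# Minimal sketch of an AutoSteer-style inference-time pipeline.
# All names, shapes, and the linear-probe SAS proxy are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy stand-in for per-layer hidden states of an MLLM:
# hidden[l] has shape (num_prompts, d); labels mark unsafe (1) vs. safe (0) prompts.
num_layers, num_prompts, d = 8, 200, 64
labels = rng.integers(0, 2, num_prompts)
hidden = [rng.normal(size=(num_prompts, d)) + l * 0.1 * labels[:, None]
          for l in range(num_layers)]

# (1) Safety Awareness Score: approximated here by how well a linear probe
#     separates safe from unsafe representations at each layer.
def safety_awareness_score(feats, y):
    probe = LogisticRegression(max_iter=1000).fit(feats, y)
    return probe.score(feats, y)  # higher = more safety-relevant layer

sas = [safety_awareness_score(h, labels) for h in hidden]
best_layer = int(np.argmax(sas))

# (2) Safety prober: estimates the probability of a toxic continuation
#     from the selected layer's intermediate representation.
prober = LogisticRegression(max_iter=1000).fit(hidden[best_layer], labels)

# (3) Refusal Head: intervenes only when the estimated risk crosses a threshold;
#     the refusal path is a placeholder for steering generation toward refusal.
def generate_with_autosteer(layer_repr, threshold=0.5):
    p_toxic = prober.predict_proba(layer_repr[None, :])[0, 1]
    if p_toxic > threshold:
        return "I can't help with that request."   # intervention (refusal) path
    return "<normal model generation>"             # unmodified decoding

print(best_layer, generate_with_autosteer(hidden[best_layer][0]))
```

Because the prober and Refusal Head operate only on intermediate representations at inference time, this kind of pipeline leaves the underlying model weights untouched, which is the property the abstract emphasizes.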