Multi-turn interactions with language models (LMs) pose critical safety
risks, as harmful intent can be strategically spread across exchanges. Yet, the
vast majority of prior work has focused on single-turn safety, and
adaptability and diversity remain key challenges for multi-turn
red-teaming. To address these challenges, we present X-Teaming, a scalable
framework that systematically explores how seemingly harmless interactions
can escalate into harmful outcomes and generates corresponding attack scenarios.
X-Teaming employs collaborative agents for planning, attack optimization, and
verification, achieving state-of-the-art multi-turn jailbreak effectiveness and
diversity with success rates up to 98.1% across representative leading
open-weight and closed-source models. In particular, X-Teaming achieves a 96.2%
attack success rate against the latest Claude 3.7 Sonnet model, which has been
considered nearly immune to single-turn attacks. Building on X-Teaming, we
introduce XGuard-Train, an open-source multi-turn safety training dataset that
is 20x larger than the previous best resource. It comprises 30K interactive
jailbreaks and is designed to enable robust multi-turn safety alignment for LMs. Our
work offers essential tools and insights for mitigating sophisticated
conversational attacks, advancing the multi-turn safety of LMs.
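
To make the collaborative-agent loop concrete, below is a minimal, hypothetical
sketch of planner, attacker, and verifier roles cooperating on a multi-turn
attack, in the spirit of the framework described above. The abstract gives no
implementation details, so every name here (AttackPlan, query_lm, make_plan,
verify, run_attack) is our own illustrative assumption, not the authors' API;
the LM calls are stubbed.

```python
# Hypothetical sketch of a planner/attacker/verifier loop in the spirit of
# X-Teaming. All names and signatures are illustrative assumptions, not the
# authors' actual implementation; LM calls are stubbed for self-containment.
from dataclasses import dataclass


@dataclass
class AttackPlan:
    behavior: str       # target harmful behavior
    phases: list[str]   # seemingly harmless conversational steps toward it


def query_lm(role: str, prompt: str) -> str:
    """Stub for a call to a language model playing the given agent role."""
    return f"[{role}] {prompt[:48]}..."


def make_plan(behavior: str, n_phases: int = 4) -> AttackPlan:
    # Planner agent: decompose the behavior into benign-looking phases.
    phases = [query_lm("planner", f"phase {i + 1} toward: {behavior}")
              for i in range(n_phases)]
    return AttackPlan(behavior, phases)


def verify(behavior: str, reply: str) -> float:
    """Verifier agent: score (0..1) how fully the reply realizes the behavior."""
    return 0.0  # stub; a real verifier would use an LM judge


def run_attack(plan: AttackPlan, target, max_revisions: int = 3) -> bool:
    """Drive a multi-turn conversation, revising each turn until it lands."""
    history: list[tuple[str, str]] = []
    for phase in plan.phases:
        turn = query_lm("attacker", f"write a turn advancing: {phase}")
        reply = target(turn, history)
        for _ in range(max_revisions):
            score = verify(plan.behavior, reply)
            if score >= 1.0:     # fully harmful output elicited
                return True
            if score > 0.0:      # partial progress: advance to the next phase
                break
            # Attack optimization: rewrite the turn using verifier feedback.
            turn = query_lm("attacker", f"revise (score={score}): {turn}")
            reply = target(turn, history)
        history.append((turn, reply))
    return False


if __name__ == "__main__":
    dummy_target = lambda turn, history: "I can't help with that."
    plan = make_plan("example unsafe behavior")
    print("succeeded:", run_attack(plan, dummy_target))
```

In this sketch, a turn that makes no progress is revised in place, while
partial progress advances the conversation to the next planned phase,
mirroring the planning, attack-optimization, and verification roles named in
the abstract.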