Exploring Jailbreak Attacks on LLMs through Intent Concealment and Diversion

arXiv:2505.14316v1 Announce Type: cross
Abstract: Although large language models (LLMs) have achieved remarkable advancements, their security remains a pressing concern. One major threat is jailbreak attacks, in which adversarial prompts bypass model safeguards to elicit harmful or objectionable content. Researchers study jailbreak attacks to understand the security and robustness of LLMs. However, existing jailbreak methods face two main challenges: (1) they require an excessive number of iterative queries, and (2) they generalize poorly across models. In addition, recent jailbreak evaluation datasets focus primarily on question-answering scenarios and pay little attention to text-generation tasks, which require accurate regeneration of toxic content. To tackle these challenges, we make two contributions: (1) ICE, a novel black-box jailbreak method that employs Intent Concealment and divErsion to effectively circumvent security constraints. ICE achieves high attack success rates (ASR) with a single query, significantly improving efficiency and transferability across different models. (2) BiSceneEval, a comprehensive dataset designed to assess LLM robustness in both question-answering and text-generation tasks. Experimental results demonstrate that ICE outperforms existing jailbreak techniques, revealing critical vulnerabilities in current defense mechanisms. Our findings underscore the need for a hybrid security strategy that integrates predefined security mechanisms with real-time semantic decomposition to enhance the security of LLMs.
