Jailbreaking Commercial Black-Box LLMs With Explicitly Harmful Prompts - Takara TLDR

Evaluating jailbreak attacks is challenging when prompts are not overtly
harmful or fail to induce harmful outputs. Unfortunately, many existing
red-teaming datasets contain such unsuitable prompts. To evaluate attacks
accurately, these datasets need to be assessed and cleaned for maliciousness.
However, existing malicious content detection methods rely on either manual
annotation, which is labor-intensive, or large language models (LLMs), whose
accuracy is inconsistent across types of harmful content. To balance accuracy and
efficiency, we propose a hybrid evaluation framework named MDH (Malicious
content Detection based on LLMs with Human assistance) that combines LLM-based
annotation with minimal human oversight, and apply it to dataset cleaning and
detection of jailbroken responses. Furthermore, we find that well-crafted
developer messages can significantly boost jailbreak success, leading us to
propose two new strategies: D-Attack, which leverages context simulation, and
DH-CoT, which incorporates hijacked chains of thought. The code, datasets,
judgments, and detection results will be released in the GitHub repository:
https://github.com/AlienZhang1996/DH-CoT.
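
The abstract does not say how MDH divides work between LLM judges and human annotators. As a rough illustration of the general idea of combining LLM-based annotation with minimal human oversight, the sketch below assumes several LLM judges each label an item, that labels on which the judges agree are accepted automatically, and that only disagreements are sent to a human. The names `Triage`, `triage`, and `min_agreement` are hypothetical and are not taken from the paper or its repository.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Triage:
    label: Optional[str]  # agreed label, or None when judges disagree
    needs_human: bool     # True when the item should go to a human annotator


def triage(votes: List[str], min_agreement: float = 1.0) -> Triage:
    """Accept the majority label only if enough judges agree (assumed rule).

    votes: one label per LLM judge, e.g. "harmful" or "benign".
    min_agreement: fraction of judges that must share the top label.
    """
    if not votes:
        # No automated judgment available: defer to a human.
        return Triage(None, True)
    label, count = Counter(votes).most_common(1)[0]
    if count / len(votes) >= min_agreement:
        return Triage(label, False)
    # Judges disagree too much: route to human review.
    return Triage(None, True)


if __name__ == "__main__":
    print(triage(["harmful", "harmful", "harmful"]))
    print(triage(["harmful", "benign", "harmful"]))
```

With the default `min_agreement=1.0`, only unanimous items skip human review, so human effort is spent on the cases where automated labels are least reliable. Lowering the threshold trades annotation cost against label accuracy.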
