Researchers at Seoul National University have uncovered a powerful new method of jailbreaking large language models (LLMs) that bypasses existing safeguards with alarming success. The work, led by Seongho Joo, Hyukhun Koh, and Kyomin Jung, details a systematic strategy that manipulates AI responses through reframed instructions and hidden encodings.
The study, titled "Harmful Prompt Laundering: Jailbreaking LLMs with Abductive Styles and Symbolic Encoding" and published on arXiv, shows how attackers can trigger harmful outputs from models such as GPT, Claude, LLaMA, and Qwen using only black-box access. The findings raise urgent concerns about the fragility of current safety systems and the trade-offs between preventing abuse and preserving helpfulness in legitimate use cases.
How does HaPLa bypass model safeguards?
The researchers introduce a new jailbreak technique called HaPLa, short for Harmful Prompt Laundering. The approach relies on two key mechanisms: abductive framing and symbolic encoding.
Abductive framing transforms direct malicious instructions into third-person reconstructions. Instead of asking a model “how to make” or “how to do” something dangerous, the attacker frames it as a retrospective inference problem, describing a scenario and asking the model to infer missing steps. This subtle shift avoids triggering the model’s immediate refusal mechanisms, which often rely on detecting explicit harmful requests at the start of a prompt.
The second mechanism, symbolic encoding, conceals sensitive trigger words using numerical or symbolic substitutes. The most effective version, ASCII encoding, replaces letters with their corresponding character codes. In some variations, attackers use flipped or rearranged encodings to further mask intent. Despite the obfuscation, models can decode the words in context and generate harmful content. By balancing the masking level, attackers can evade safety filters while still ensuring that the model understands the intended meaning.
When combined, abductive framing and symbolic encoding create a highly reliable jailbreak pathway. According to the study, this two-step strategy consistently bypasses lexical triggers and token-based refusals, allowing models to produce harmful instructions they would otherwise block.
How successful is the attack across models and defenses?
The authors tested HaPLa against a range of commercial and open-source LLMs, including GPT-3.5-turbo, GPT-4o-mini, GPT-4o, Claude 3.5-Sonnet, LLaMA-3-8B-Instruct, and Qwen-2.5-7B. The results show that the method achieves success rates above 95 percent on GPT models and more than 70 percent on average across all systems. HaPLa outperformed state-of-the-art baselines such as ArtPrompt, AutoDAN, DeepInception, TAP, and CodeChameleon.
Crucially, the method also held up against commonly deployed defenses. Tested against LLaMA Guard filters, paraphrasing layers, self-reminders, and perplexity-based checks, HaPLa still maintained high success rates. Even with the paraphrasing defense, which rewrites incoming prompts to strip adversarial intent, attack success rates remained between 49 and 94 percent depending on the model. Upgrading the guard model from LLaMA Guard 7B to 8B reduced success rates by around 10 percentage points but did not stop the attacks.
The study also shows that the effectiveness of HaPLa increases in multi-turn conversations. When attackers use iterative dialogue to refine prompts, second-turn success rates rise above 75 percent across tested models. This highlights a weakness in systems that focus only on single-turn safety mechanisms.
In detailed ablation studies, the authors found that both abductive framing and symbolic encoding are necessary components. Removing one weakens performance but does not eliminate the vulnerability, suggesting that attackers can adapt strategies depending on a model’s defense profile. Masking strength also plays a role: as more of a sensitive keyword is encoded, refusal rates fall and acceptance rates rise, though different models respond differently to the degree of masking.
What are the broader implications for AI safety?
The findings highlight critical flaws in current alignment approaches that depend heavily on lexical triggers and token-level refusal policies. Even when sensitive words appear in benign or educational contexts, models often continue to refuse, showing that trigger-word sensitivity overrides analysis of the user's actual intent. This rigidity creates both safety vulnerabilities and usability trade-offs.
The researchers further explored the possibility of retraining models to resist HaPLa-style prompts. Fine-tuning LLaMA-3-8B on refusals reduced success for seen encodings but failed to generalize to new encoding schemes. More aggressive tuning degraded performance on benign tasks that included sensitive words, showing a direct trade-off between improving safety and preserving helpfulness.
The real-world risks are substantial. The study found that harmful outputs generated through HaPLa closely resembled methods used in actual crimes since 2015. Over 80 percent of tested outputs scored at the highest level of similarity to real incidents, and more than half were judged as highly realistic. This alignment with real-world practices underscores the potential for misuse in security-sensitive areas.
The authors also evaluated reasoning-focused models such as GPT-o1, which showed lower vulnerability to ASCII encoding. However, when alternative schemes such as emoji encodings were introduced, jailbreak success returned to high levels, suggesting that stronger reasoning alone is insufficient to guarantee safety.