New jailbreak method exposes major flaws in AI safety systems

By Advanced AI Editor · September 18, 2025 · 4 Mins Read

Researchers at Seoul National University have uncovered a powerful new method of jailbreaking large language models (LLMs) that bypasses existing safeguards with alarming success. The work, led by Seongho Joo, Hyukhun Koh, and Kyomin Jung, details a systematic strategy that manipulates AI responses through reframed instructions and hidden encodings.

The study, titled "Harmful Prompt Laundering: Jailbreaking LLMs with Abductive Styles and Symbolic Encoding" and published on arXiv, shows how attackers can trigger harmful outputs from models such as GPT, Claude, LLaMA, and Qwen using only black-box access. The findings raise urgent concerns about the fragility of current safety systems and the trade-offs between preventing abuse and preserving helpfulness in legitimate use cases.

How does HaPLa bypass model safeguards?

The researchers introduce a new jailbreak technique called HaPLa, short for Harmful Prompt Laundering. The approach relies on two key mechanisms: abductive framing and symbolic encoding.

Abductive framing transforms direct malicious instructions into third-person reconstructions. Instead of asking a model “how to make” or “how to do” something dangerous, the attacker frames it as a retrospective inference problem, describing a scenario and asking the model to infer missing steps. This subtle shift avoids triggering the model’s immediate refusal mechanisms, which often rely on detecting explicit harmful requests at the start of a prompt.

The second mechanism, symbolic encoding, conceals sensitive trigger words using numerical or symbolic substitutes. The most effective version, ASCII encoding, replaces letters with their corresponding character codes. In some variations, attackers use flipped or rearranged encodings to further mask intent. Despite the obfuscation, models can decode the words in context and generate harmful content. By balancing the masking level, attackers can evade safety filters while still ensuring that the model understands the intended meaning.
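To make the substitution idea concrete, here is a minimal Python sketch of ASCII-style encoding applied to a neutral placeholder word. The function names and the example word are illustrative assumptions, not code from the paper, and the sketch deliberately omits any prompt construction.

```python
# Minimal sketch of ASCII-style symbolic encoding on a neutral placeholder word.
# Helper names are illustrative assumptions, not code from the study.

def ascii_encode(word: str) -> str:
    """Replace each character with its decimal character code, space-separated."""
    return " ".join(str(ord(ch)) for ch in word)

def ascii_decode(encoded: str) -> str:
    """Recover the original word from the space-separated character codes."""
    return "".join(chr(int(code)) for code in encoded.split())

if __name__ == "__main__":
    word = "keyword"  # neutral placeholder, not a sensitive term
    encoded = ascii_encode(word)
    print(encoded)  # prints: 107 101 121 119 111 114 100
    assert ascii_decode(encoded) == word
```

As the paper notes, how much of a keyword is masked matters: heavier encoding makes lexical filters less likely to fire, while the model can still reconstruct the intended word from context.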

When combined, abductive framing and symbolic encoding create a highly reliable jailbreak pathway. According to the study, this two-step strategy consistently bypasses lexical triggers and token-based refusals, allowing models to produce harmful instructions they would otherwise block.

How successful is the attack across models and defenses?

The authors tested HaPLa against a range of commercial and open-source LLMs, including GPT-3.5-turbo, GPT-4o-mini, GPT-4o, Claude 3.5-Sonnet, LLaMA-3-8B-Instruct, and Qwen-2.5-7B. The results show that the method achieves success rates above 95 percent on GPT models and more than 70 percent on average across all systems. HaPLa outperformed state-of-the-art baselines such as ArtPrompt, AutoDAN, DeepInception, TAP, and CodeChameleon.

Crucially, the method also held up against commonly deployed defenses. Tested against LLaMA Guard filters, paraphrasing layers, self-reminders, and perplexity-based checks, HaPLa still maintained high success rates. Even with the paraphrasing defense, which rewrites prompts to filter intent, attack rates remained between 49 and 94 percent depending on the model. Upgrading guard models from LLaMA Guard 7B to 8B reduced success rates by around 10 percentage points but did not stop the attacks.

The study also shows that the effectiveness of HaPLa increases in multi-turn conversations. When attackers use iterative dialogue to refine prompts, second-turn success rates rise above 75 percent across tested models. This highlights a weakness in systems that focus only on single-turn safety mechanisms.

In detailed ablation studies, the authors found that both abductive framing and symbolic encoding are necessary components. Removing one weakens performance but does not eliminate the vulnerability, suggesting that attackers can adapt strategies depending on a model’s defense profile. Masking strength also plays a role: as more of a sensitive keyword is encoded, refusal rates fall and acceptance rates rise, though different models respond differently to the degree of masking.

What are the broader implications for AI safety?

The findings highlight critical flaws in current alignment approaches that depend heavily on lexical triggers and token-level refusal policies. Even when harmful words are reintroduced into benign or educational contexts, models often continue to refuse, showing that trigger-word sensitivity overrides intent analysis. This rigidity creates both safety vulnerabilities and usability trade-offs.

The researchers further explored the possibility of retraining models to resist HaPLa-style prompts. Fine-tuning LLaMA-3-8B on refusals reduced success for seen encodings but failed to generalize to new encoding schemes. More aggressive tuning degraded performance on benign tasks that included sensitive words, showing a direct trade-off between improving safety and preserving helpfulness.

The real-world risks are substantial. The study found that harmful outputs generated through HaPLa closely resembled methods used in actual crimes since 2015. Over 80 percent of tested outputs scored at the highest level of similarity to real incidents, and more than half were judged as highly realistic. This alignment with real-world practices underscores the potential for misuse in security-sensitive areas.

The authors also compared reasoning-focused models, such as GPT-o1, which showed lower vulnerability to ASCII encoding. However, when alternative schemes like emoji encodings were introduced, jailbreak success returned to high levels, suggesting that stronger reasoning alone is insufficient to guarantee safety.



Source link
