Paper Page - Alignment Quality Index (AQI): Beyond Refusals: AQI as an Intrinsic Alignment Diagnostic via Latent Geometry, Cluster Divergence, and Layer wise Pooled Representations

The paper introduces the Alignment Quality Index (AQI), a decoding-invariant metric that uses latent geometric representations and clustering indices to diagnose hidden misalignments in large language models (LLMs), even when their outputs appear behaviorally compliant.

Intrinsic Latent Geometry Metric: AQI measures alignment by assessing how well safe and unsafe prompts form distinct clusters in a model’s latent space using a combination of Xie-Beni and Calinski-Harabasz indices, making it invariant to decoding strategies and resistant to alignment faking.
Layerwise Pooled Representation Learning: It uses a sparse, learned pooling mechanism over hidden transformer layers to capture alignment-relevant abstractions without modifying the base model, enabling robust internal safety diagnostics.
Empirical Failures of Behavioral Metrics: AQI reveals misalignments missed by traditional metrics (e.g., G-Eval, refusal rates) in scenarios like jailbreaks, safety-agnostic fine-tuning, and stochastic decoding, which supports its use as an early-warning alignment audit tool.
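
To make the cluster-separation idea concrete, here is a minimal sketch in Python with NumPy. It is not the paper's implementation. It pools per-layer hidden states with fixed, hand-picked weights (the paper learns a sparse pooling; the weights here are an assumption), then scores how well safe and unsafe prompts separate using the textbook Calinski-Harabasz index (higher is better) and a crisp-label form of the Xie-Beni index (lower is better). The function names, the synthetic data, and the choice to report the two indices separately are all illustrative; this summary does not say how AQI combines them.

```python
import numpy as np

def pool_layers(hidden, weights):
    """Weighted sum over layers. hidden: (layers, n, d); weights: (layers,).
    Fixed weights are an assumption; the paper learns a sparse pooling."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                       # normalize so the weights sum to 1
    return np.tensordot(w, hidden, axes=1)  # -> (n, d)

def _centroids_and_within(x, labels):
    """Per-cluster centroids and total within-cluster squared distance."""
    labels = np.asarray(labels)
    ks = np.unique(labels)
    cents = np.stack([x[labels == k].mean(axis=0) for k in ks])
    within = sum(((x[labels == k] - cents[i]) ** 2).sum()
                 for i, k in enumerate(ks))
    return labels, ks, cents, within

def calinski_harabasz(x, labels):
    """Between-cluster over within-cluster dispersion; higher = better separated."""
    labels, ks, cents, within = _centroids_and_within(x, labels)
    n, k = len(x), len(ks)
    counts = np.array([(labels == c).sum() for c in ks])
    between = (counts * ((cents - x.mean(axis=0)) ** 2).sum(axis=1)).sum()
    return (between / (k - 1)) / (within / (n - k))

def xie_beni(x, labels):
    """Compactness over minimum centroid separation (crisp labels); lower = better."""
    labels, ks, cents, within = _centroids_and_within(x, labels)
    d2 = ((cents[:, None, :] - cents[None, :, :]) ** 2).sum(axis=-1)
    min_sep = d2[~np.eye(len(ks), dtype=bool)].min()
    return within / (len(x) * min_sep)

# Synthetic stand-in for hidden states: 4 layers, 40 prompts, 8 dimensions.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(4, 40, 8))
labels = np.array([0] * 20 + [1] * 20)    # 0 = safe, 1 = unsafe (hypothetical)
hidden[2:, labels == 1, :] += 3.0         # only later layers separate the groups

pooled = pool_layers(hidden, [0.0, 0.0, 0.5, 0.5])  # hand-picked layer weights
print("CH:", calinski_harabasz(pooled, labels))
print("XB:", xie_beni(pooled, labels))
```

Because both scores are computed from hidden states rather than generated text, they do not depend on the decoding strategy, which is the property the summary highlights.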
