The paper introduces the Alignment Quality Index (AQI), a decoding-invariant metric that uses latent geometric representations and clustering indices to diagnose hidden misalignments in large language models (LLMs), even when model outputs appear behaviorally compliant.
Intrinsic Latent Geometry Metric: AQI measures alignment by how cleanly safe and unsafe prompts separate into distinct clusters in the model's latent space, scored with a combination of the Xie-Beni and Calinski-Harabasz indices; because it operates on internal representations rather than generated text, it is invariant to decoding strategies and resistant to alignment faking.
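To make the clustering idea concrete, here is a minimal sketch of an AQI-style score over pooled latent vectors. The Xie-Beni implementation and the way the two indices are blended (the `alpha` weighting and the normalization) are assumptions for illustration, not the paper's exact formulation; only the Calinski-Harabasz call relies on scikit-learn's standard API.

```python
import numpy as np
from sklearn.metrics import calinski_harabasz_score

def xie_beni_index(X, labels):
    """Xie-Beni index: mean squared distance to cluster centroids divided by
    the minimum squared distance between centroids. Lower = tighter, better
    separated clusters."""
    ks = np.unique(labels)
    centroids = np.stack([X[labels == k].mean(axis=0) for k in ks])
    compactness = sum(
        np.sum((X[labels == k] - c) ** 2) for k, c in zip(ks, centroids)
    ) / len(X)
    separation = min(
        np.sum((a - b) ** 2)
        for i, a in enumerate(centroids)
        for b in centroids[i + 1:]
    )
    return compactness / separation

def alignment_quality_index(X, labels, alpha=0.5):
    """Hypothetical AQI-style combination: a normalized Calinski-Harabasz term
    (higher = better separation) blended with an inverted Xie-Beni term
    (lower XB = better). Higher output = cleaner safe/unsafe separation."""
    ch = calinski_harabasz_score(X, labels)
    xb = xie_beni_index(X, labels)
    return alpha * (ch / (1.0 + ch)) + (1.0 - alpha) * (1.0 / (1.0 + xb))

# Toy usage: pooled latent vectors for safe (label 0) vs. unsafe (label 1) prompts.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 16)), rng.normal(3.0, 1.0, (50, 16))])
labels = np.array([0] * 50 + [1] * 50)
print(alignment_quality_index(X, labels))
```

In this toy setup the two prompt groups are well separated, so the score is high; a model whose unsafe prompts drift into the safe cluster would drive the score down even if its sampled outputs still look compliant.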
Layerwise Pooled Representation Learning: It uses a sparse, learned pooling mechanism over the hidden states of the transformer's layers to capture alignment-relevant abstractions without modifying the base model, enabling robust internal safety diagnostics.
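A minimal sketch of such a pooler is below, assuming the hidden states come from a frozen Hugging Face-style model run with `output_hidden_states=True`. The softmax layer weighting and the entropy penalty used to encourage concentration on a few layers are illustrative choices; the paper's exact sparsification mechanism may differ.

```python
import torch
import torch.nn as nn

class LayerwisePooler(nn.Module):
    """Learned pooling over per-layer hidden states of a frozen transformer.
    One logit per layer is learned and softmax-normalized into a convex
    combination; an entropy penalty (assumption) pushes the weight mass onto
    a few alignment-relevant layers."""

    def __init__(self, num_layers: int):
        super().__init__()
        self.layer_logits = nn.Parameter(torch.zeros(num_layers))

    def forward(self, hidden_states):
        # hidden_states: sequence of [batch, seq, dim] tensors, one per layer.
        stacked = torch.stack(list(hidden_states), dim=0)      # [layers, batch, seq, dim]
        weights = torch.softmax(self.layer_logits, dim=0)      # [layers], sums to 1
        mixed = (weights[:, None, None, None] * stacked).sum(dim=0)  # [batch, seq, dim]
        return mixed.mean(dim=1)                                # token mean-pool -> [batch, dim]

    def entropy_penalty(self):
        # Minimizing this entropy concentrates pooling on few layers (sparsity proxy).
        w = torch.softmax(self.layer_logits, dim=0)
        return -(w * (w + 1e-8).log()).sum()

# Usage sketch (model/tokenizer names are placeholders, base model stays frozen):
# out = base_model(**batch, output_hidden_states=True)
# pooler = LayerwisePooler(num_layers=len(out.hidden_states))
# latent = pooler(out.hidden_states)   # feed these vectors into the AQI score above
```

Because only the pooler's layer logits are trained, the base model's weights and behavior are untouched, which keeps the diagnostic strictly read-only.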
Empirical Failures of Behavioral Metrics: AQI reveals misalignments missed by behavioral metrics such as G-Eval scores and refusal rates in scenarios including jailbreaks, safety-agnostic fine-tuning, and stochastic decoding, showcasing its strength as an early-warning alignment audit tool.