The study identifies a linear reasoning bottleneck in Visual-Language Models and proposes the Linear Separability Ceiling as a metric to quantify it, arguing that the solution lies in targeted alignment rather than improved representation learning.
Most state-of-the-art Visual-Language Models (VLMs) are seemingly limited by
the linear separability of their visual embeddings on abstract reasoning tasks.
This work investigates this “linear reasoning bottleneck” by introducing the
Linear Separability Ceiling (LSC): the performance of a simple linear
classifier on a VLM’s visual embeddings. We find this bottleneck is widespread
and stems not from poor perception, but from failures in the language model’s
reasoning pathways. We demonstrate this is a solvable alignment issue. The
required intervention, however, is task-dependent: activating existing pathways
suffices for semantic concepts, while complex relational reasoning requires
adapting core model weights. Using postfix tuning as a methodological control,
we find strong evidence for powerful, dormant reasoning pathways within VLMs.
However, for complex relational tasks requiring deeper adaptation, explicitly
improving representation quality causes the model to fail on new prompt formats
despite its embeddings remaining well separated. Ultimately, this work provides
a new lens for VLM analysis, showing that robust reasoning is a matter of
targeted alignment, not simply improved representation learning.
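To make the LSC concrete, the following is a minimal sketch of how such a probe-based ceiling could be computed: a linear classifier is fit on frozen visual embeddings, and its held-out accuracy is taken as the ceiling. This is an illustrative reconstruction under stated assumptions, not the paper's implementation; the embeddings and labels below are synthetic placeholders standing in for a VLM's pooled visual features and task labels.

```python
# Minimal sketch of a Linear Separability Ceiling (LSC) computation:
# train a linear probe on frozen visual embeddings and report held-out accuracy.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def linear_separability_ceiling(embeddings: np.ndarray, labels: np.ndarray, seed: int = 0) -> float:
    """Accuracy of a simple linear classifier on frozen visual embeddings."""
    X_train, X_test, y_train, y_test = train_test_split(
        embeddings, labels, test_size=0.2, stratify=labels, random_state=seed
    )
    # Linear probe only: no access to the language model's reasoning pathways.
    probe = LogisticRegression(max_iter=1000)
    probe.fit(X_train, y_train)
    return accuracy_score(y_test, probe.predict(X_test))


if __name__ == "__main__":
    # Synthetic stand-ins: 1000 samples of 768-d "visual embeddings" with binary task labels.
    rng = np.random.default_rng(0)
    emb = rng.normal(size=(1000, 768))
    lab = (emb[:, 0] + emb[:, 1] > 0).astype(int)
    print(f"LSC (probe accuracy): {linear_separability_ceiling(emb, lab):.3f}")
```

In practice, the embeddings would come from the VLM's vision pathway on the abstract reasoning benchmark of interest, and the probe's accuracy would then be compared against the full model's generative accuracy to reveal the gap the paper attributes to the reasoning pathways.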