Paper page - How Do Large Vision-Language Models See Text in Image? Unveiling the Distinctive Role of OCR Heads

The study identifies and analyzes OCR Heads within Large Vision Language Models, revealing their unique activation patterns and roles in interpreting text within images.

Despite significant advancements in Large Vision Language Models (LVLMs), a
gap remains, particularly regarding their interpretability and how they locate
and interpret textual information within images. In this paper, we explore
various LVLMs to identify the specific heads responsible for recognizing text
from images, which we term the Optical Character Recognition Head (OCR Head).
Our findings regarding these heads are as follows: (1) Less Sparse: Unlike
previous retrieval heads, a large number of heads are activated to extract
textual information from images. (2) Qualitatively Distinct: OCR heads possess
properties that differ significantly from general retrieval heads, exhibiting
low similarity in their characteristics. (3) Statically Activated: The
frequency of activation for these heads closely aligns with their OCR scores.
We validate our findings in downstream tasks by applying Chain-of-Thought (CoT)
to both OCR and conventional retrieval heads and by masking these heads. We
also demonstrate that redistributing sink-token values within the OCR heads
improves performance. These insights provide a deeper understanding of the
internal mechanisms LVLMs employ in processing embedded textual information in
images.
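
To make the two interventions mentioned above more concrete (masking heads, and redistributing sink-token values), here is a minimal sketch in plain Python. It is not the authors' implementation: the function names `mask_heads` and `redistribute_sink`, the `fraction` parameter, and the choice to split the moved attention mass in proportion to the targets' existing weights are all assumptions made for illustration, since the abstract does not specify how either step is carried out.

```python
from typing import List, Sequence, Set


def mask_heads(head_outputs: List[List[float]], masked: Set[int]) -> List[List[float]]:
    """Zero the output vector of every head whose index is in `masked`.

    Hypothetical helper: `head_outputs[h]` stands for the output of head h
    at a single token position.
    """
    return [
        [0.0] * len(out) if h in masked else list(out)
        for h, out in enumerate(head_outputs)
    ]


def redistribute_sink(
    weights: Sequence[float],
    sink_idx: int,
    target_idxs: Sequence[int],
    fraction: float,
) -> List[float]:
    """Move `fraction` of the sink token's attention weight onto target tokens.

    Assumption for illustration: the moved mass is split in proportion to
    the targets' existing weights (uniformly if they are all zero), so the
    sum of the attention row is preserved.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    if not target_idxs:
        raise ValueError("target_idxs must not be empty")
    if sink_idx in target_idxs:
        raise ValueError("sink_idx must not be one of the targets")

    new = list(weights)
    moved = new[sink_idx] * fraction
    new[sink_idx] -= moved

    # Total is taken before any target is updated, so shares sum to 1.
    total = sum(new[i] for i in target_idxs)
    for i in target_idxs:
        share = new[i] / total if total > 0 else 1.0 / len(target_idxs)
        new[i] += moved * share
    return new


if __name__ == "__main__":
    # One attention row: position 0 is the sink, positions 1-2 are image tokens.
    row = [0.6, 0.1, 0.3]
    print(redistribute_sink(row, sink_idx=0, target_idxs=[1, 2], fraction=0.5))
    # Two heads with 2-dimensional outputs; mask head 1.
    print(mask_heads([[1.0, 2.0], [3.0, 4.0]], masked={1}))
```

In a real LVLM these operations would be applied to attention tensors inside selected layers and heads; the list-based version here only illustrates the arithmetic.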
