Paper Page - Towards Visual Text Grounding Of Multimodal Large Language Model

Despite the ongoing evolution of Multimodal Large Language Models (MLLMs), a
non-negligible limitation remains: they struggle with visual text
grounding, especially in text-rich images of documents. Document images, such
as scanned forms and infographics, highlight critical challenges due to their
complex layouts and textual content. However, current benchmarks do not fully
address these challenges, as they mostly focus on visual grounding on natural
images, rather than text-rich document images. Thus, to bridge this gap, we
introduce TRIG, a novel task with a newly designed instruction dataset for
benchmarking and improving the Text-Rich Image Grounding capabilities of MLLMs
in document question-answering. Specifically, we propose an OCR-LLM-human
interaction pipeline to create 800 manually annotated question-answer pairs as
a benchmark and a large-scale training set of 90k synthetic samples based on four
diverse datasets. A comprehensive evaluation of various MLLMs on our proposed
benchmark exposes substantial limitations in their grounding capability on
text-rich images. In addition, we propose two simple and effective TRIG methods,
one based on general instruction tuning and the other on plug-and-play efficient
embedding. Finetuning MLLMs on our synthetic dataset yields promising
improvements in their spatial reasoning and grounding capabilities.
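
To make the notion of grounding concrete, the sketch below scores a predicted bounding box against an annotated one using intersection-over-union (IoU), a common measure for grounding tasks. The abstract does not state which metric TRIG uses, so the `(x_min, y_min, x_max, y_max)` box format, the function names `iou` and `is_grounded`, and the 0.5 threshold are all illustrative assumptions rather than the authors' protocol.

```python
from typing import Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def box_area(box: Box) -> float:
    """Area of a box; degenerate or inverted boxes count as zero."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou(pred: Box, gold: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle, if any.
    ix_min, iy_min = max(pred[0], gold[0]), max(pred[1], gold[1])
    ix_max, iy_max = min(pred[2], gold[2]), min(pred[3], gold[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    union = box_area(pred) + box_area(gold) - inter
    return inter / union if union > 0 else 0.0


def is_grounded(pred: Box, gold: Box, threshold: float = 0.5) -> bool:
    """Hypothetical pass/fail rule: the prediction counts as grounded
    when its IoU with the annotated box reaches the threshold."""
    return iou(pred, gold) >= threshold


if __name__ == "__main__":
    predicted = (0.0, 0.0, 10.0, 10.0)
    annotated = (0.0, 0.0, 10.0, 8.0)
    print(iou(predicted, annotated), is_grounded(predicted, annotated))
```

A threshold-based rule like this turns each question-answer pair into a binary hit or miss, which makes accuracy over a benchmark straightforward to compute; other aggregation choices are equally possible.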
