Paper page - CrossWordBench: Evaluating the Reasoning Capabilities of LLMs and LVLMs with Controllable Puzzle Generation

Existing reasoning evaluation frameworks for Large Language Models (LLMs) and
Large Vision-Language Models (LVLMs) predominantly assess either text-based
reasoning or vision-language understanding, with limited dynamic
interplay between textual and visual constraints. To address this limitation,
we introduce CrossWordBench, a benchmark designed to evaluate the reasoning
capabilities of both LLMs and LVLMs through the medium of crossword puzzles, a
task requiring multimodal adherence to semantic constraints from text-based
clues and intersectional constraints from visual grid structures.
CrossWordBench leverages a controllable puzzle generation framework that
produces puzzles in multiple formats (text and image) and offers different
evaluation strategies ranging from direct puzzle solving to interactive modes.
Our extensive evaluation of over 20 models reveals that reasoning LLMs
substantially outperform non-reasoning models by effectively leveraging
crossing-letter constraints. We further demonstrate that LVLMs struggle with
the task, showing a strong correlation between their puzzle-solving performance
and grid-parsing accuracy. Our findings offer insights into the limitations of
the reasoning capabilities of current LLMs and LVLMs, and provide an effective
approach for creating multimodal constrained tasks for future evaluations.
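
To make the notion of an intersectional (crossing-letter) constraint concrete, here is a minimal sketch in Python. It is not the benchmark's actual code: the `Entry` type, the `cells` helper, and the `crossing_conflicts` function are hypothetical names introduced only for illustration, and the sketch assumes a simple representation in which each answer is placed on the grid by its starting row, starting column, and direction. It reports every cell where two proposed answers disagree on the shared letter.

```python
from typing import Dict, List, NamedTuple, Tuple

Cell = Tuple[int, int]


class Entry(NamedTuple):
    """A proposed answer placed on the grid (hypothetical representation)."""
    row: int         # starting row of the answer
    col: int         # starting column of the answer
    direction: str   # "across" or "down"
    answer: str


def cells(entry: Entry) -> List[Tuple[Cell, str]]:
    """Return (cell, letter) pairs covered by one entry."""
    dr, dc = (0, 1) if entry.direction == "across" else (1, 0)
    return [
        ((entry.row + i * dr, entry.col + i * dc), ch)
        for i, ch in enumerate(entry.answer.upper())
    ]


def crossing_conflicts(entries: List[Entry]) -> List[Cell]:
    """List cells where two entries place different letters."""
    placed: Dict[Cell, str] = {}
    conflicts: List[Cell] = []
    for entry in entries:
        for pos, ch in cells(entry):
            if pos in placed and placed[pos] != ch:
                conflicts.append(pos)
            # Keep the first letter seen for each cell.
            placed.setdefault(pos, ch)
    return conflicts


if __name__ == "__main__":
    consistent = [Entry(0, 0, "across", "CAT"), Entry(0, 2, "down", "TEN")]
    clashing = [Entry(0, 0, "across", "CAT"), Entry(0, 2, "down", "DEN")]
    print(crossing_conflicts(consistent))  # []
    print(crossing_conflicts(clashing))    # [(0, 2)]
```

A solver that uses such a check can discard a candidate answer as soon as it clashes with a letter already fixed by a crossing entry, which is one plausible reading of what "leveraging crossing-letter constraints" involves.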
