Paper Page - AV-Reasoner: Improving And Benchmarking Clue-Grounded Audio-Visual Counting For MLLMs

CG-AV-Counting is a new benchmark for video counting tasks that includes multimodal data and supports end-to-end and reasoning-based models. AV-Reasoner, trained with GRPO and curriculum learning, achieves top results but shows limitations on out-of-domain tasks.

Despite progress in video understanding, current MLLMs struggle with counting
tasks. Existing benchmarks are limited by short videos, closed-set queries, lack
of clue annotations, and weak multimodal coverage. In this paper, we introduce
CG-AV-Counting, a manually-annotated clue-grounded counting benchmark with
1,027 multimodal questions and 5,845 annotated clues over 497 long videos. It
supports both black-box and white-box evaluation, serving as a comprehensive
testbed for both end-to-end and reasoning-based counting. To explore ways to
improve models’ counting capability, we propose AV-Reasoner, a model trained
with GRPO and curriculum learning to generalize counting ability from related
tasks. AV-Reasoner achieves state-of-the-art results across multiple
benchmarks, demonstrating the effectiveness of reinforcement learning. However,
experiments show that on out-of-domain benchmarks, reasoning in the language
space fails to bring performance gains. The code and benchmark have been
released at https://av-reasoner.github.io.
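
The abstract names GRPO as the training method but gives no detail on how it is applied. As a rough illustration only, the sketch below shows the group-relative advantage step that GRPO is generally known for: several answers are sampled for the same question, each is scored, and every score is normalized against the mean and standard deviation of its own group. The reward function `counting_reward`, the helper `group_relative_advantages`, and the sample predictions are all hypothetical and are not taken from the AV-Reasoner code; the actual reward design may differ.

```python
from statistics import mean, pstdev


def counting_reward(predicted: int, target: int) -> float:
    """Hypothetical reward (an assumption, not the paper's): 1.0 for an exact
    count, decreasing linearly with relative error and floored at 0.0."""
    if target == 0:
        return 1.0 if predicted == 0 else 0.0
    return max(0.0, 1.0 - abs(predicted - target) / target)


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group.

    Answers that beat the group average get a positive advantage and answers
    below it get a negative one. eps avoids division by zero when every
    sampled answer receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one question whose ground-truth count is 4.
predictions = [3, 4, 4, 6]
rewards = [counting_reward(p, target=4) for p in predictions]
advantages = group_relative_advantages(rewards)

for p, r, a in zip(predictions, rewards, advantages):
    print(f"predicted={p} reward={r:.2f} advantage={a:+.3f}")
```

In this toy group the two exact answers receive positive advantages and the two wrong counts receive negative ones, so a policy update would push probability toward the correct count without needing a separate value model.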
