CG-AV-Counting is a new benchmark for video counting tasks that includes multimodal data and supports end-to-end and reasoning-based models. AV-Reasoner, trained with GRPO and curriculum learning, achieves top results but shows limitations on out-of-domain tasks.
Despite progress in video understanding, current MLLMs struggle with counting
tasks. Existing benchmarks are limited by short videos, closed-set queries, lack
of clue annotations, and weak multimodal coverage. In this paper, we introduce
CG-AV-Counting, a manually annotated, clue-grounded counting benchmark with
1,027 multimodal questions and 5,845 annotated clues over 497 long videos. It
supports both black-box and white-box evaluation, serving as a comprehensive
testbed for both end-to-end and reasoning-based counting. To explore ways to
improve models' counting capability, we propose AV-Reasoner, a model trained
with GRPO and curriculum learning to generalize counting ability from related
tasks. AV-Reasoner achieves state-of-the-art results across multiple
benchmarks, demonstrating the effectiveness of reinforcement learning. However,
experiments show that on out-of-domain benchmarks, reasoning in the language
space fails to bring performance gains. The code and benchmark have been
released at https://av-reasoner.github.io.