AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual Counting for MLLMs
Despite progress in video understanding, current MLLMs still struggle with counting tasks. Existing benchmarks are limited by short videos, closed-set queries, missing clue annotations, and weak multimodal coverage. In this paper, we introduce CG-AV-Counting, a manually annotated, clue-grounded counting benchmark with 1,027 multimodal questions and 5,845 annotated clues over 497 long videos. It supports both black-box and white-box evaluation, serving as a comprehensive testbed for both end-to-end and reasoning-based counting. To explore ways to improve models' counting capability, we propose AV-Reasoner, a model trained with GRPO and curriculum learning to generalize counting ability from related tasks. AV-Reasoner achieves state-of-the-art results across multiple benchmarks, demonstrating the effectiveness of reinforcement learning. However, our experiments show that, on out-of-domain benchmarks, reasoning in the language space fails to bring performance gains.
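For readers curious what a rule-based counting reward for GRPO can look like, here is a minimal illustrative sketch, not the paper's exact reward design: it assumes the model wraps its reasoning in `<think>` tags and its final count in `<answer>` tags (an assumed format), and combines a format reward with an exact-match accuracy reward.

```python
import re

# Minimal sketch of a GRPO-style rule-based reward for counting answers.
# Assumption: completions wrap reasoning in <think>...</think> and the final
# count in <answer>...</answer>; the actual AV-Reasoner reward may differ.
ANSWER_RE = re.compile(r"<answer>\s*(\d+)\s*</answer>\s*$", re.DOTALL)
FORMAT_RE = re.compile(r"^<think>.*</think>\s*<answer>.*</answer>\s*$", re.DOTALL)

def counting_reward(completion: str, gt_count: int) -> float:
    """Format reward (0.5) plus exact-match accuracy reward (1.0)."""
    reward = 0.0
    if FORMAT_RE.match(completion):
        reward += 0.5                      # output follows the required format
    m = ANSWER_RE.search(completion)
    if m and int(m.group(1)) == gt_count:
        reward += 1.0                      # predicted count matches ground truth
    return reward

# Example: a well-formatted completion with the correct count scores 1.5.
print(counting_reward("<think>Two goals occur in the clip.</think><answer>2</answer>", 2))
```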
Black-box evaluation is reported under the Long and Ref settings (Acc, OBOA, MAE, RMSE); white-box evaluation reports WCS and IFA.

| Model | Modality | Long Acc ↑ | Long OBOA ↑ | Long MAE ↓ | Long RMSE ↓ | Ref Acc ↑ | Ref OBOA ↑ | Ref MAE ↓ | Ref RMSE ↓ | WCS ↑ | IFA ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
Human (full-video) | A+V | 85.00 | 96.49 | 0.65 | 0.95 | 91.53 | 98.05 | 0.23 | 0.45 | 71.93 | 100.00 |
Gemini 2.5 Pro | A+V | 40.80 | 65.82 | 2.33 | 7.20 | 41.48 | 67.96 | 2.01 | 4.87 | 6.35 | 95.03 |
Gemini 2.5 Flash | A+V | 36.90 | 61.05 | 2.71 | 6.32 | 37.39 | 64.07 | 2.45 | 5.75 | 4.20 | 95.03 |
Eagle2.5-8B | V | 28.04 | 51.80 | 2.76 | 5.23 | 22.88 | 46.06 | 3.23 | 6.63 | 1.02 | 74.49 |
SEED-1.5 VL | V | 27.85 | 52.19 | 2.93 | 6.39 | 28.43 | 55.11 | 2.70 | 4.73 | 2.06 | 99.03 |
GPT-4.1 | V | 27.17 | 49.95 | 2.68 | 4.78 | 27.75 | 49.07 | 2.48 | 4.69 | 2.38 | 98.73 |
GPT-4o | V | 22.30 | 45.57 | 3.25 | 5.82 | 22.59 | 46.35 | 3.30 | 5.52 | 2.19 | 98.73 |
Qwen2.5-Omni-7B | A+V | 22.30 | 48.69 | 3.92 | 8.49 | 23.08 | 49.76 | 3.03 | 6.79 | 0.41 | 95.52 |
AV-Reasoner (Ours) | A+V | 22.30 | 48.30 | 3.15 | 5.89 | 23.56 | 48.78 | 3.14 | 6.12 | 0.92 | 80.82 |
InternVideo2.5-8B | V | 22.20 | 48.30 | 3.17 | 5.92 | 21.71 | 47.81 | 3.29 | 6.78 | - | - |
AV-Reasoner-Thinking (Ours) | A+V | 21.03 | 48.78 | 3.26 | 8.20 | 22.78 | 48.59 | 3.13 | 6.69 | 1.22 | 79.65 |
Qwen2.5-VL-7B | V | 20.84 | 48.00 | 4.08 | 7.91 | 22.20 | 44.79 | 3.51 | 8.31 | 0.60 | 85.78 |
VideoChat-Flash-7B | V | 19.86 | 43.91 | 3.49 | 6.17 | 20.45 | 44.79 | 3.41 | 5.98 | - | - |
InternVL3-8B | V | 17.92 | 43.43 | 3.21 | 5.55 | 19.08 | 45.67 | 3.53 | 7.96 | 0.72 | 92.11 |
Ola-7B | A+V | 17.92 | 38.85 | 4.57 | 10.52 | 16.94 | 37.10 | 4.54 | 10.59 | 0.42 | 75.66 |
MiniCPM-V 2.6 | V | 13.83 | 33.20 | 4.06 | 6.58 | 14.31 | 36.22 | 4.04 | 6.39 | 0.43 | 61.25 |
VideoLLaMA3-7B | V | 12.46 | 30.57 | 4.52 | 12.60 | 12.66 | 31.45 | 4.00 | 6.21 | 1.00 | 89.14 |
Eagle2-9B | V | 12.46 | 34.08 | 3.84 | 6.33 | 14.51 | 33.89 | 3.96 | 6.71 | 0.57 | 76.14 |
UnifiedIO-2 XXL | A+V | 10.61 | 30.48 | 3.99 | 6.30 | 13.92 | 35.74 | 3.83 | 6.98 | 0.00 | 2.24 |
GPT-4.1 (text) | - | 6.04 | 13.15 | 8.90 | 17.60 | - | - | - | - | 0.00 | 76.24 |
VideoLLaMA2.1-7B-AV | A+V | 5.06 | 13.73 | 5.11 | 7.34 | 5.65 | 15.48 | 5.06 | 7.35 | 0.11 | 13.73 |
Random | - | 1.56 | 4.97 | 30.35 | 36.66 | - | - | - | - | 0.25 | 100.00 |
CG-AV-Counting is built on a subset of 497 videos from CG-Bench. The benchmark contains 1,027 multimodal-query questions and 5,845 fine-grained, manually annotated clues. Nearly 40% of the samples require the model to use both the audio and visual modalities for counting, while the rest require only the visual modality, so the benchmark is applicable to both vision-only and audio-visual models. It covers three counting targets: objects, events, and attributes. Attribute counting is the most challenging, because it requires grouping objects that share the attribute specified in the query.
Ground-truth counts range from 1 to 76 and follow a long-tail distribution, with most counts falling between 1 and 20. The videos span more than 10 content categories, such as sports, life records, humor, and tutorials, offering greater domain diversity than existing benchmarks. All videos exceed 10 minutes, and reference intervals range from seconds to minutes, covering both short-term and long-range dependencies.
To the best of our knowledge, there is currently no comprehensive benchmark specifically designed to evaluate MLLMs' video counting capabilities. DVD-Counting and VideoNIAH use synthetic data for object counting, but they cover a limited variety of counting targets and contain no long videos. Other benchmarks, such as MVBench and WorldSense, include real-world videos, but counting is only a subtask of the overall evaluation, so the number of counting samples is small. Datasets for repetitive action counting feature short videos and simple queries, making them unsuitable for evaluating MLLMs. Moreover, most benchmarks provide only visual queries, which limits their ability to fully evaluate Omni-MLLMs.
Unlike previous counting benchmarks, our benchmark incorporates both audio and visual modalities, features more complex queries, and provides fine-grained counting clues to jointly evaluate models' abilities in both end-to-end and reasoning-based counting.
We follow CG-Bench's dual evaluation protocol to assess MLLMs' counting ability (a minimal metric sketch follows this list):

- Black-box Evaluation: assesses end-to-end counting under two settings, counting over the full long video (Long) and counting within the annotated reference interval (Ref). Counting is measured by four metrics: Accuracy (Acc), Off-By-One Accuracy (OBOA), Mean Absolute Error (MAE), and Root Mean Square Error (RMSE).
- White-box Evaluation: assesses counting grounded in explicit, localized evidence, using two metrics:
  - White-box Counting Score (WCS): combines localization accuracy with a counting penalty. Scores range from 0 to 100; 100 means a perfect count and localization, while 0 indicates a large count mismatch or a format error.
  - Instruction-Following Accuracy (IFA): the proportion of outputs matching the required format, ensuring reliability and interpretability.
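To make the black-box metrics concrete, the sketch below computes Acc, OBOA, MAE, and RMSE from predicted and ground-truth counts, plus a simple IFA-style format check. Function names and the assumed integer-answer format are ours for illustration; the official CG-AV-Counting evaluation code may handle parsing and edge cases differently.

```python
import math

def black_box_metrics(preds: list[int], gts: list[int]) -> dict:
    """Acc, off-by-one accuracy (OBOA), MAE, and RMSE over count predictions.

    A minimal sketch for illustration; the official evaluation script may
    differ in how it treats parsing failures and malformed outputs.
    """
    n = len(gts)
    acc  = sum(p == g for p, g in zip(preds, gts)) / n
    oboa = sum(abs(p - g) <= 1 for p, g in zip(preds, gts)) / n
    mae  = sum(abs(p - g) for p, g in zip(preds, gts)) / n
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in zip(preds, gts)) / n)
    return {"Acc": 100 * acc, "OBOA": 100 * oboa, "MAE": mae, "RMSE": rmse}

def instruction_following_accuracy(raw_outputs: list[str]) -> float:
    """IFA sketch: fraction of outputs whose final token parses as an integer."""
    def is_well_formed(text: str) -> bool:
        return text.strip().split()[-1].isdigit() if text.strip() else False
    return 100 * sum(is_well_formed(t) for t in raw_outputs) / len(raw_outputs)

# Example usage on toy predictions.
print(black_box_metrics(preds=[3, 5, 10], gts=[3, 4, 14]))
```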
Example reasoning traces and final answers on related audio-visual tasks:

- Reasoning: The video shows a civil defense siren, which is a type of alarm used for emergencies. The siren is clearly visible and is the main source of sound in the video. The other options, such as car horn, small bell, and chimes, are not present in the video. Therefore, the main sound source is the civil defense siren. Answer: B
- Reasoning: The saxophone sound is present in the background, and the drum sound is present in the foreground. Both are part of the music being played. The saxophone is a common instrument in jazz music, and the drum is a fundamental component of any rhythm section. The presence of both instruments in the music is evident from the sound. Answer: yes
- Reasoning: The video shows a baby crying and sobbing, with occasional coughing. The baby's distress is evident throughout the video. The woman speaking is also noted, but it is not the main focus of the video. The baby's crying and sobbing are the primary events in the video. Answer: [ { "start": "0.00 seconds", "end": "34.01 seconds", "event": "baby crying", "type": "audio visual" }, { "start": "0.00 seconds", "end": "34.01 seconds", "event": "baby sobbing", "type": "audio visual" }, { "start": "31.01 seconds", "end": "34.01 seconds", "event": "woman speaking", "type": "audio visual" } ]
- Reasoning: The objects making sound are the two large, round, woven instruments that the man and boy are playing. These instruments are commonly used in Indian music and are known for their distinctive sound and appearance. Answer: [ { "coordinates": [ [ 152, 446 ], [ 1096, 720 ] ], "name": "squirrel" } ]
- Reasoning: The ball is in motion, so it is not considered stationary. The other objects are stationary. Counting the stationary objects: cube, sphere, cylinder, cone, and another cone. Total count is 5. Answer: 5
@misc{lu2025avreasoner,
title={AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual Counting for MLLMs},
author={Lidong Lu and Guo Chen and Zhiqi Li and Yicheng Liu and Tong Lu},
year={2025},
eprint={2506.05328},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2506.05328},
}