AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual Counting for MLLMs
Despite progress in video understanding, current MLLMs still struggle with counting tasks. Existing benchmarks are limited by short videos, closed-set queries, missing clue annotations, and weak multimodal coverage. In this paper, we introduce CG-AV-Counting, a manually annotated, clue-grounded counting benchmark with 1,027 multimodal questions and 5,845 annotated clues over 497 long videos. It supports both black-box and white-box evaluation, serving as a comprehensive testbed for both end-to-end and reasoning-based counting. To explore ways to improve models' counting capability, we propose AV-Reasoner, a model trained with GRPO and curriculum learning to generalize counting ability from related tasks. AV-Reasoner achieves state-of-the-art results across multiple benchmarks, demonstrating the effectiveness of reinforcement learning. However, our experiments show that, on out-of-domain benchmarks, reasoning in the language space fails to bring performance gains.
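For readers curious what a rule-based counting reward for GRPO can look like, here is a minimal illustrative sketch, not the paper's exact reward design: it assumes the model wraps its reasoning in `<think>` tags and its final count in `<answer>` tags (an assumed format), and combines a format reward with an exact-match accuracy reward.

```python
import re

# Minimal sketch of a GRPO-style rule-based reward for counting answers.
# Assumption: completions wrap reasoning in <think>...</think> and the final
# count in <answer>...</answer>; the actual AV-Reasoner reward may differ.
ANSWER_RE = re.compile(r"<answer>\s*(\d+)\s*</answer>\s*$", re.DOTALL)
FORMAT_RE = re.compile(r"^<think>.*</think>\s*<answer>.*</answer>\s*$", re.DOTALL)

def counting_reward(completion: str, gt_count: int) -> float:
    """Format reward (0.5) plus exact-match accuracy reward (1.0)."""
    reward = 0.0
    if FORMAT_RE.match(completion):
        reward += 0.5                      # output follows the required format
    m = ANSWER_RE.search(completion)
    if m and int(m.group(1)) == gt_count:
        reward += 1.0                      # predicted count matches ground truth
    return reward

# Example: a well-formatted completion with the correct count scores 1.5.
print(counting_reward("<think>Two goals occur in the clip.</think><answer>2</answer>", 2))
```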
Black-box evaluation is reported under the Long and Ref settings (Acc, OBOA, MAE, RMSE); white-box evaluation reports WCS and IFA.

| Model | Modality | Long Acc ↑ | Long OBOA ↑ | Long MAE ↓ | Long RMSE ↓ | Ref Acc ↑ | Ref OBOA ↑ | Ref MAE ↓ | Ref RMSE ↓ | WCS ↑ | IFA ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
Human (full-video) | A+V | 85.00 | 96.49 | 0.65 | 0.95 | 91.53 | 98.05 | 0.23 | 0.45 | 71.93 | 100.00 |
Gemini 2.5 Pro | A+V | 40.80 | 65.82 | 2.33 | 7.20 | 41.48 | 67.96 | 2.01 | 4.87 | 6.35 | 95.03 |
Gemini 2.5 Flash | A+V | 36.90 | 61.05 | 2.71 | 6.32 | 37.39 | 64.07 | 2.45 | 5.75 | 4.20 | 95.03 |
Eagle2.5-8B | V | 28.04 | 51.80 | 2.76 | 5.23 | 22.88 | 46.06 | 3.23 | 6.63 | 1.02 | 74.49 |
SEED-1.5 VL | V | 27.85 | 52.19 | 2.93 | 6.39 | 28.43 | 55.11 | 2.70 | 4.73 | 2.06 | 99.03 |
GPT-4.1 | V | 27.17 | 49.95 | 2.68 | 4.78 | 27.75 | 49.07 | 2.48 | 4.69 | 2.38 | 98.73 |
GPT-4o | V | 22.30 | 45.57 | 3.25 | 5.82 | 22.59 | 46.35 | 3.30 | 5.52 | 2.19 | 98.73 |
Qwen2.5-Omni-7B | A+V | 22.30 | 48.69 | 3.92 | 8.49 | 23.08 | 49.76 | 3.03 | 6.79 | 0.41 | 95.52 |
AV-Reasoner (Ours) | A+V | 22.30 | 48.30 | 3.15 | 5.89 | 23.56 | 48.78 | 3.14 | 6.12 | 0.92 | 80.82 |
InternVideo2.5-8B | V | 22.20 | 48.30 | 3.17 | 5.92 | 21.71 | 47.81 | 3.29 | 6.78 | - | - |
AV-Reasoner-Thinking (Ours) | A+V | 21.03 | 48.78 | 3.26 | 8.20 | 22.78 | 48.59 | 3.13 | 6.69 | 1.22 | 79.65 |
Qwen2.5-VL-7B | V | 20.84 | 48.00 | 4.08 | 7.91 | 22.20 | 44.79 | 3.51 | 8.31 | 0.60 | 85.78 |
VideoChat-Flash-7B | V | 19.86 | 43.91 | 3.49 | 6.17 | 20.45 | 44.79 | 3.41 | 5.98 | - | - |
InternVL3-8B | V | 17.92 | 43.43 | 3.21 | 5.55 | 19.08 | 45.67 | 3.53 | 7.96 | 0.72 | 92.11 |
Ola-7B | A+V | 17.92 | 38.85 | 4.57 | 10.52 | 16.94 | 37.10 | 4.54 | 10.59 | 0.42 | 75.66 |
MiniCPM-V 2.6 | V | 13.83 | 33.20 | 4.06 | 6.58 | 14.31 | 36.22 | 4.04 | 6.39 | 0.43 | 61.25 |
VideoLLaMA3-7B | V | 12.46 | 30.57 | 4.52 | 12.60 | 12.66 | 31.45 | 4.00 | 6.21 | 1.00 | 89.14 |
Eagle2-9B | V | 12.46 | 34.08 | 3.84 | 6.33 | 14.51 | 33.89 | 3.96 | 6.71 | 0.57 | 76.14 |
UnifiedIO-2 XXL | A+V | 10.61 | 30.48 | 3.99 | 6.30 | 13.92 | 35.74 | 3.83 | 6.98 | 0.00 | 2.24 |
GPT-4.1 (text) | - | 6.04 | 13.15 | 8.90 | 17.60 | - | - | - | - | 0.00 | 76.24 |
VideoLLaMA2.1-7B-AV | A+V | 5.06 | 13.73 | 5.11 | 7.34 | 5.65 | 15.48 | 5.06 | 7.35 | 0.11 | 13.73 |
Random | - | 1.56 | 4.97 | 30.35 | 36.66 | - | - | - | - | 0.25 | 100.00 |
CG-AV-Counting is built on a subset of 497 videos from CG-Bench. The benchmark contains 1,027 multimodal-query questions and 5,845 fine-grained, manually annotated clues. Nearly 40% of the samples require the model to use both the audio and visual modalities for counting, while the rest require only the visual modality, so the benchmark is applicable to both vision-only and audio-visual models. It covers three counting targets: objects, events, and attributes. Attribute counting is the most challenging, because it requires grouping objects that share the attribute specified in the query.
Ground-truth counts range from 1 to 76 and follow a long-tail distribution, with most counts falling between 1 and 20. The videos span more than 10 content categories, such as sports, life records, humor, and tutorials, offering greater domain diversity than existing benchmarks. All videos exceed 10 minutes, and reference intervals range from seconds to minutes, covering both short-term and long-range dependencies.
To the best of our knowledge, there is currently no comprehensive benchmark specifically designed to evaluate MLLMs' video counting capabilities. DVD-Counting and VideoNIAH use synthetic data for object counting, but they cover a limited variety of counting targets and contain no long videos. Other benchmarks, such as MVBench and WorldSense, include real-world videos, but counting is only a subtask of the overall evaluation, so the number of counting samples is small. Datasets for repetitive action counting feature short videos and simple queries, making them unsuitable for evaluating MLLMs. Moreover, most benchmarks provide only visual queries, which limits their ability to fully evaluate Omni-MLLMs.
Unlike previous counting benchmarks, our benchmark incorporates both audio and visual modalities, features more complex queries, and provides fine-grained counting clues to jointly evaluate models' abilities in both end-to-end and reasoning-based counting.
We follow CG-Bench's dual evaluation protocol to assess MLLMs' counting ability (a minimal metric sketch follows this list):

- Black-box Evaluation: assesses end-to-end counting under two settings, counting over the full long video (Long) and counting within the annotated reference interval (Ref). Counting is measured by four metrics: Accuracy (Acc), Off-By-One Accuracy (OBOA), Mean Absolute Error (MAE), and Root Mean Square Error (RMSE).
- White-box Evaluation: assesses counting grounded in explicit, localized evidence, using two metrics:
  - White-box Counting Score (WCS): combines localization accuracy with a counting penalty. Scores range from 0 to 100; 100 means a perfect count and localization, while 0 indicates a large count mismatch or a format error.
  - Instruction-Following Accuracy (IFA): the proportion of outputs matching the required format, ensuring reliability and interpretability.
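To make the black-box metrics concrete, the sketch below computes Acc, OBOA, MAE, and RMSE from predicted and ground-truth counts, plus a simple IFA-style format check. Function names and the assumed integer-answer format are ours for illustration; the official CG-AV-Counting evaluation code may handle parsing and edge cases differently.

```python
import math

def black_box_metrics(preds: list[int], gts: list[int]) -> dict:
    """Acc, off-by-one accuracy (OBOA), MAE, and RMSE over count predictions.

    A minimal sketch for illustration; the official evaluation script may
    differ in how it treats parsing failures and malformed outputs.
    """
    n = len(gts)
    acc  = sum(p == g for p, g in zip(preds, gts)) / n
    oboa = sum(abs(p - g) <= 1 for p, g in zip(preds, gts)) / n
    mae  = sum(abs(p - g) for p, g in zip(preds, gts)) / n
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in zip(preds, gts)) / n)
    return {"Acc": 100 * acc, "OBOA": 100 * oboa, "MAE": mae, "RMSE": rmse}

def instruction_following_accuracy(raw_outputs: list[str]) -> float:
    """IFA sketch: fraction of outputs whose final token parses as an integer."""
    def is_well_formed(text: str) -> bool:
        return text.strip().split()[-1].isdigit() if text.strip() else False
    return 100 * sum(is_well_formed(t) for t in raw_outputs) / len(raw_outputs)

# Example usage on toy predictions.
print(black_box_metrics(preds=[3, 5, 10], gts=[3, 4, 14]))
```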
Example reasoning traces and final answers on related audio-visual tasks:

- Reasoning: The video shows a civil defense siren, which is a type of alarm used for emergencies. The siren is clearly visible and is the main source of sound in the video. The other options, such as car horn, small bell, and chimes, are not present in the video. Therefore, the main sound source is the civil defense siren. Answer: B
- Reasoning: The saxophone sound is present in the background, and the drum sound is present in the foreground. Both are part of the music being played. The saxophone is a common instrument in jazz music, and the drum is a fundamental component of any rhythm section. The presence of both instruments in the music is evident from the sound. Answer: yes
- Reasoning: The video shows a baby crying and sobbing, with occasional coughing. The baby's distress is evident throughout the video. The woman speaking is also noted, but it is not the main focus of the video. The baby's crying and sobbing are the primary events in the video. Answer: [ { "start": "0.00 seconds", "end": "34.01 seconds", "event": "baby crying", "type": "audio visual" }, { "start": "0.00 seconds", "end": "34.01 seconds", "event": "baby sobbing", "type": "audio visual" }, { "start": "31.01 seconds", "end": "34.01 seconds", "event": "woman speaking", "type": "audio visual" } ]
- Reasoning: The objects making sound are the two large, round, woven instruments that the man and boy are playing. These instruments are commonly used in Indian music and are known for their distinctive sound and appearance. Answer: [ { "coordinates": [ [ 152, 446 ], [ 1096, 720 ] ], "name": "squirrel" } ]
- Reasoning: The ball is in motion, so it is not considered stationary. The other objects are stationary. Counting the stationary objects: cube, sphere, cylinder, cone, and another cone. Total count is 5. Answer: 5
@misc{lu2025avreasoner,
title={AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual Counting for MLLMs},
author={Lidong Lu and Guo Chen and Zhiqi Li and Yicheng Liu and Tong Lu},
year={2025},
eprint={2506.05328},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2506.05328},
}