AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual Counting for MLLMs

Nanjing University

Abstract

Despite progress in video understanding, current MLLMs still struggle with counting tasks. Existing benchmarks are limited by short videos, closed-set queries, missing clue annotations, and weak multimodal coverage. In this paper, we introduce CG-AV-Counting, a manually annotated, clue-grounded counting benchmark with 1,027 multimodal questions and 5,845 annotated clues over 497 long videos. It supports both black-box and white-box evaluation, serving as a comprehensive testbed for both end-to-end and reasoning-based counting. To explore ways to improve models' counting capability, we propose AV-Reasoner, a model trained with GRPO and curriculum learning to generalize counting ability from related tasks. AV-Reasoner achieves state-of-the-art results across multiple benchmarks, demonstrating the effectiveness of reinforcement learning. However, experiments show that on out-of-domain benchmarks, reasoning in the language space fails to bring performance gains.

Benchmark

LeaderBoard

In the table below, "Long" and "Ref" denote the black-box settings on the full video and on the trimmed reference segment, respectively; WCS and IFA are the white-box metrics.

| Model | Modality | Long Acc ↑ | Long OBOA ↑ | Long MAE ↓ | Long RMSE ↓ | Ref Acc ↑ | Ref OBOA ↑ | Ref MAE ↓ | Ref RMSE ↓ | WCS ↑ | IFA ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Human (full-video) | A+V | 85.00 | 96.49 | 0.65 | 0.95 | 91.53 | 98.05 | 0.23 | 0.45 | 71.93 | 100.00 |
| Gemini 2.5 Pro | A+V | 40.80 | 65.82 | 2.33 | 7.20 | 41.48 | 67.96 | 2.01 | 4.87 | 6.35 | 95.03 |
| Gemini 2.5 Flash | A+V | 36.90 | 61.05 | 2.71 | 6.32 | 37.39 | 64.07 | 2.45 | 5.75 | 4.20 | 95.03 |
| Eagle2.5-8B | V | 28.04 | 51.80 | 2.76 | 5.23 | 22.88 | 46.06 | 3.23 | 6.63 | 1.02 | 74.49 |
| SEED-1.5 VL | V | 27.85 | 52.19 | 2.93 | 6.39 | 28.43 | 55.11 | 2.70 | 4.73 | 2.06 | 99.03 |
| GPT-4.1 | V | 27.17 | 49.95 | 2.68 | 4.78 | 27.75 | 49.07 | 2.48 | 4.69 | 2.38 | 98.73 |
| GPT-4o | V | 22.30 | 45.57 | 3.25 | 5.82 | 22.59 | 46.35 | 3.30 | 5.52 | 2.19 | 98.73 |
| Qwen2.5-Omni-7B | A+V | 22.30 | 48.69 | 3.92 | 8.49 | 23.08 | 49.76 | 3.03 | 6.79 | 0.41 | 95.52 |
| AV-Reasoner (Ours) | A+V | 22.30 | 48.30 | 3.15 | 5.89 | 23.56 | 48.78 | 3.14 | 6.12 | 0.92 | 80.82 |
| InternVideo2.5-8B | V | 22.20 | 48.30 | 3.17 | 5.92 | 21.71 | 47.81 | 3.29 | 6.78 | - | - |
| AV-Reasoner-Thinking (Ours) | A+V | 21.03 | 48.78 | 3.26 | 8.20 | 22.78 | 48.59 | 3.13 | 6.69 | 1.22 | 79.65 |
| Qwen2.5-VL-7B | V | 20.84 | 48.00 | 4.08 | 7.91 | 22.20 | 44.79 | 3.51 | 8.31 | 0.60 | 85.78 |
| VideoChat-Flash-7B | V | 19.86 | 43.91 | 3.49 | 6.17 | 20.45 | 44.79 | 3.41 | 5.98 | - | - |
| InternVL3-8B | V | 17.92 | 43.43 | 3.21 | 5.55 | 19.08 | 45.67 | 3.53 | 7.96 | 0.72 | 92.11 |
| Ola-7B | A+V | 17.92 | 38.85 | 4.57 | 10.52 | 16.94 | 37.10 | 4.54 | 10.59 | 0.42 | 75.66 |
| MiniCPM-V 2.6 | V | 13.83 | 33.20 | 4.06 | 6.58 | 14.31 | 36.22 | 4.04 | 6.39 | 0.43 | 61.25 |
| VideoLLaMA3-7B | V | 12.46 | 30.57 | 4.52 | 12.60 | 12.66 | 31.45 | 4.00 | 6.21 | 1.00 | 89.14 |
| Eagle2-9B | V | 12.46 | 34.08 | 3.84 | 6.33 | 14.51 | 33.89 | 3.96 | 6.71 | 0.57 | 76.14 |
| UnifiedIO-2 XXL | A+V | 10.61 | 30.48 | 3.99 | 6.30 | 13.92 | 35.74 | 3.83 | 6.98 | 0.00 | 2.24 |
| GPT-4.1 (text) | - | 6.04 | 13.15 | 8.90 | 17.60 | - | - | - | - | 0.00 | 76.24 |
| VideoLLaMA2.1-7B-AV | A+V | 5.06 | 13.73 | 5.11 | 7.34 | 5.65 | 15.48 | 5.06 | 7.35 | 0.11 | 13.73 |
| Random | - | 1.56 | 4.97 | 30.35 | 36.66 | - | - | - | - | 0.25 | 100.00 |

Statistics

CG-AV-Counting is built on a subset of 497 videos from CG-Bench. The benchmark contains 1,027 multimodal-query questions and 5,845 fine-grained, manually annotated clues. Nearly 40% of the samples require the model to use both the audio and visual modalities for counting, while the rest require only the visual modality. This design makes the benchmark applicable to both visual-only models and audio-visual models. The benchmark covers object, event, and attribute counting targets. Among these, attribute counting is the most challenging, because it requires grouping objects that share the attribute specified in the query.

Answer counts range from 1 to 76 with a long-tail distribution; most counts fall between 1 and 20. The videos span more than 10 categories, such as sports, life record, humor, and tutorials, offering greater domain diversity than existing benchmarks. Every video exceeds 10 minutes, and reference intervals range from seconds to minutes, covering both short-term and long-range dependencies.

Comparison

To the best of our knowledge, there is currently no comprehensive benchmark specifically designed to evaluate MLLMs' video counting capabilities. DVD-Counting and VideoNIAH use synthetic data for object counting; they cover a limited variety of counting targets and contain no long videos. Other benchmarks, such as MVBench and WorldSense, include real-world videos, but counting is only a subtask of their overall evaluation, so the number of counting samples is small. Datasets for repetitive action counting feature short videos and simple queries, making them unsuitable for evaluating MLLMs. Moreover, most benchmarks provide only visual queries, which limits their ability to fully evaluate Omni-MLLMs.

Unlike previous counting benchmarks, our benchmark incorporates both audio and visual modalities, features more complex queries, and provides fine-grained counting clues to jointly evaluate models' abilities in both end-to-end and reasoning-based counting.

Evaluation Metrics

We follow CG-Bench's dual evaluation protocol to assess MLLMs' counting ability:

Black-box Evaluation

Assesses end-to-end counting under two settings:

  • Long Acc: Model counts and temporally localizes in the full video.
  • Ref Acc: Model counts in a trimmed reference segment, isolating counting from localization.

Counting performance is measured with four metrics:

  • Accuracy (Acc): Exact count prediction rate.
  • Off-By-One Accuracy (OBOA): Correct if off by ≤1.
  • Mean Absolute Error (MAE): Average counting error magnitude.
  • Root Mean Square Error (RMSE): Penalizes larger errors more.
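
For concreteness, here is a minimal sketch of these four metrics computed from paired predicted and ground-truth counts; the function name and input format are illustrative, not the benchmark's official evaluation code.

```python
import numpy as np

def blackbox_metrics(pred_counts, gt_counts):
    """Illustrative computation of Acc, OBOA, MAE, and RMSE from integer counts."""
    preds = np.asarray(pred_counts, dtype=float)
    gts = np.asarray(gt_counts, dtype=float)
    err = np.abs(preds - gts)
    return {
        "Acc":  float(np.mean(err == 0) * 100),     # exact-match rate (%)
        "OBOA": float(np.mean(err <= 1) * 100),     # correct within +/-1 (%)
        "MAE":  float(np.mean(err)),                # average error magnitude
        "RMSE": float(np.sqrt(np.mean(err ** 2))),  # penalizes larger errors more
    }

# Example: one exact hit, one off-by-one miss, one large miss.
print(blackbox_metrics([3, 5, 12], [3, 4, 6]))
```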

White-box Evaluation

Assesses localization with explicit evidence:

  • Event counting: Temporal Intersection over Union (tIoU) of predicted segments.
  • Object counting: Spatial IoU of predicted bounding boxes for first object appearances.
  • Attribute counting: Clustered bounding boxes compared by IoU.
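
The segment- and box-level IoU used above can be sketched as follows; the coordinate conventions (start/end times in seconds, corner-format boxes) are assumptions made for illustration, not a specification of the annotation format.

```python
def temporal_iou(pred, gt):
    """IoU of two time intervals given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def box_iou(pred, gt):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(pred[2], gt[2]) - max(pred[0], gt[0]))
    ih = max(0.0, min(pred[3], gt[3]) - max(pred[1], gt[1]))
    inter = iw * ih
    union = ((pred[2] - pred[0]) * (pred[3] - pred[1])
             + (gt[2] - gt[0]) * (gt[3] - gt[1]) - inter)
    return inter / union if union > 0 else 0.0
```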

White-box Counting Score (WCS): Combines localization accuracy and counting penalty:

\text{WCS} = \frac{1}{K} \sum_{k=1}^{K} \sqrt{\text{LA}_k \times \text{CAP}_k} \times 100\%

\text{LA}_k = \frac{1}{|\text{GT}_k|} \sum_{j=1}^{|\text{GT}_k|} \text{IoU}(\text{Pred}_k^{j}, \text{GT}_k^{j})

\text{CAP}_k = \max\left\{ 0,\; 1 - \frac{\bigl| |\text{Pred}_k| - |\text{GT}_k| \bigr|}{|\text{GT}_k|} \right\}

Scores range from 0 to 100; 100 indicates a perfect count with perfect localization, while 0 indicates a large count mismatch or a format error.
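
Putting the formulas together, the sketch below computes WCS over K questions, assuming per-clue IoUs between matched predictions and ground-truth clues are already available; the names and the matching step itself are illustrative.

```python
from math import sqrt

def wcs_single(per_clue_ious, num_pred, num_gt):
    """WCS term for one question: sqrt(LA_k * CAP_k) * 100."""
    la = sum(per_clue_ious) / num_gt                       # localization accuracy LA_k
    cap = max(0.0, 1.0 - abs(num_pred - num_gt) / num_gt)  # count accuracy penalty CAP_k
    return sqrt(la * cap) * 100.0

def wcs(questions):
    """Average over K questions, each given as (per_clue_ious, num_pred, num_gt)."""
    return sum(wcs_single(*q) for q in questions) / len(questions)

# Example: two GT clues, both localized with moderate IoU, count predicted exactly.
print(wcs([([0.7, 0.5], 2, 2)]))  # sqrt(0.6 * 1.0) * 100 ≈ 77.5
```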

Instruction-Following Accuracy (IFA): Proportion of outputs that match the required format, ensuring reliability and interpretability.
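
IFA itself reduces to a format check. As a hypothetical sketch, assuming answers are required to be JSON with an integer `count` and a list of `clues` (the benchmark's actual required format may differ), it could look like this:

```python
import json

def instruction_following_accuracy(outputs):
    """Fraction (%) of model outputs that parse into the assumed answer structure."""
    def is_valid(text):
        try:
            obj = json.loads(text)
        except (json.JSONDecodeError, TypeError):
            return False
        if not isinstance(obj, dict):
            return False
        # Hypothetical required fields; the real benchmark format may differ.
        return isinstance(obj.get("count"), int) and isinstance(obj.get("clues"), list)
    return 100.0 * sum(is_valid(o) for o in outputs) / len(outputs)
```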

Baseline: AV-Reasoner

Comparison with State-of-the-Art MLLMs

Comparison with Base Model on Counting Tasks

Examples

Citation

@misc{lu2025avreasoner,
    title={AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual Counting for MLLMs}, 
    author={Lidong Lu and Guo Chen and Zhiqi Li and Yicheng Liu and Tong Lu},
    year={2025},
    eprint={2506.05328},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2506.05328}, 
}