A storyboard consisting of 7 frames from a clip of a side kick.

What best describes the move performed by the man wearing a white shirt and gray shorts?

1. Axe kick      2. Push kick      3. Roundhouse kick      4. Side kick      5. Hook kick

ActionAtlas assesses a model's ability to recognize domain-specialized actions.

Abstract

Our world is full of varied actions and moves across specialized domains that we, as humans, strive to identify and understand. Within any single domain, actions can often appear quite similar, making it challenging for deep models to distinguish them accurately. To evaluate the effectiveness of multimodal foundation models in helping us recognize such actions, we present ActionAtlas v1.0, a multiple-choice video question answering benchmark featuring short videos across various sports.

Each video in the dataset is paired with a question and four or five choices. The question pinpoints specific individuals, asking which choice "best" describes their action within a certain temporal context. Overall, the dataset includes 934 videos showcasing 580 unique actions across 56 sports, with 1896 candidate actions in total across all choices. Unlike most existing video question answering benchmarks that only cover simplistic actions, often identifiable from a single frame, ActionAtlas focuses on intricate movements and rigorously tests the model's capability to discern subtle differences between moves that look similar within each domain.
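To make the task format concrete, the sketch below scores a predictor on entries of this kind. It is a minimal illustration only: the field names (video_path, question, choices, answer_index) and the JSON layout are assumptions, not the released schema.

import json

def evaluate(entries_path, predict):
    # Score a predictor on multiple-choice video QA entries.
    # `predict` maps (video_path, question, choices) to the index of the chosen option.
    # Field names here are illustrative assumptions, not the released schema.
    with open(entries_path) as f:
        entries = json.load(f)
    correct = sum(
        int(predict(e["video_path"], e["question"], e["choices"]) == e["answer_index"])
        for e in entries
    )
    return 100.0 * correct / len(entries)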

We evaluate open and proprietary foundation models on this benchmark, finding that the best model, GPT-4o, achieves a maximum accuracy of 45.52%. Meanwhile, non-expert crowd workers, provided with an action description for each choice, achieve 61.64% accuracy, where random chance is approximately 21%. Our findings with state-of-the-art models indicate that a high frame sampling rate is important for accurately recognizing actions in ActionAtlas, a feature that some leading proprietary video models, such as Gemini, do not include in their default configuration.
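On the frame-sampling point, here is a minimal sketch of uniformly sampling a fixed number of frames from a clip with OpenCV. The exact preprocessing used for each evaluated model is detailed in the paper; treat this only as an assumption-laden illustration.

import cv2
import numpy as np

def sample_frames(video_path, num_frames=16):
    # Uniformly sample `num_frames` RGB frames from a video (illustrative only).
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, max(total - 1, 0), num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames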

What best describes the move made by the man in a black basketball jersey with the number two?


1. Tip-in
2. Putback dunk
3. Alley-oop dunk
4. Tomahawk dunk
5. Windmill dunk

Leaderboard

For reference, non-expert humans attain 61.64% accuracy, while random chance yields 20.91%.
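The random-chance figure is simply the average of 1/k over the per-question number of choices k (four or five). A toy computation, using a hypothetical 4-vs-5-choice split rather than the dataset's actual one:

def random_chance_accuracy(choices_per_question):
    # Expected accuracy of uniform random guessing over a mixed 4/5-choice set.
    return 100.0 * sum(1.0 / k for k in choices_per_question) / len(choices_per_question)

# Hypothetical split of 4- and 5-choice questions; the dataset's true split gives 20.91%.
print(random_chance_accuracy([4] * 100 + [5] * 200))  # ~21.7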


Open-source models

Model | Input Frames (#) | Input Video Tokens (#) | Inference TFLOPs | Accuracy (%)
Qwen2-VL-7B | 16 | 8 x 576 | 13.38 | 30.24
CLIP ViT-L-14-336 | 16 | 16 x 576 | - | 23.71
VideoLLaMA | 16 | 16 x 256 | 6.12 | 22.71
VideoChat2 | 64 | 64 x 196 | 7.51 | 21.27
mPLUG-Owl-Video | 16 | 16 x 256 | 2.94 | 19.49
LLaVA-Next-Video-7B | 64 | 64 x 144 | 83.2 | 22.90

Proprietary models

Model | Input Frames (#) | Accuracy (%)
GPT-4o | 16* | 42.95
Gemini 1.5 Pro | all frames** | 35.59
GPT-4 Turbo | 8 | 34.25
GPT-4o-mini | 4 | 33.42
Gemini 1.5 Pro | 1 fps | 32.37
Gemini 1.5 Flash | 1 fps | 30.49

*GPT models were tested with 1, 4, 8, 16, and 32 input frames; the best configuration for each model is reported above. See the paper for full results.


**By re-encoding each clip so that every original frame lasts one second, we exploited Gemini's 1 fps sampling to feed the model all available frames of a video.
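A rough sketch of that re-timing trick, assuming the clip is rewritten at 1 fps with OpenCV so each original frame spans one second; the exact preprocessing used in the paper may differ.

import cv2

def stretch_to_one_fps(src_path, dst_path):
    # Rewrite a clip at 1 fps so each original frame is displayed for one second
    # (illustrative only; the paper's preprocessing may differ).
    cap = cv2.VideoCapture(src_path)
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    writer = cv2.VideoWriter(dst_path, cv2.VideoWriter_fourcc(*"mp4v"), 1.0, (width, height))
    ok, frame = cap.read()
    while ok:
        writer.write(frame)
        ok, frame = cap.read()
    cap.release()
    writer.release()

With the re-timed clip, a 1 fps sampler sees every original frame exactly once.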

BibTeX

@misc{salehi2024actionatlasvideoqabenchmarkdomainspecialized,
      title={ActionAtlas: A VideoQA Benchmark for Domain-specialized Action Recognition}, 
      author={Mohammadreza Salehi and Jae Sung Park and Tanush Yadav and Aditya Kusupati and Ranjay Krishna and Yejin Choi and Hannaneh Hajishirzi and Ali Farhadi},
      year={2024},
      eprint={2410.05774},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2410.05774}, 
}