Abstract
In this work, we introduce VQA 360°, a novel task of visual question answering on 360° images.
Unlike an image with a normal field of view, a 360° image captures the entire visual content around the optical center of the camera, demanding more sophisticated spatial understanding and reasoning.
To address this problem, we collect the first VQA 360° dataset, containing around 17,000 real-world image-question-answer triplets for a variety of question types.
We then study two different VQA models on VQA 360°, including one conventional model that takes an equirectangular image (with intrinsic distortion) as input, and one dedicated model that first projects a 360° image onto cubemaps and subsequently aggregates the information from multiple spatial resolutions.
We demonstrate that the cubemap-based model with multi-level fusion and attention diffusion performs favorably against other variants and the equirectangular-based models.
Nevertheless, the gap between human and machine performance reveals the need for more advanced VQA 360° algorithms. We therefore expect our dataset and studies to serve as a benchmark for future development on this challenging task.
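As a rough illustration of the cubemap projection step mentioned in the abstract, the sketch below samples one cubemap face from an equirectangular image with numpy. This is not the authors' implementation: the function name `equirect_to_cube_face`, the axis conventions, and the nearest-neighbour sampling are our assumptions, and the model's multi-level fusion and attention diffusion are omitted.

```python
import numpy as np

def equirect_to_cube_face(equi, face_size):
    """Sample the front face of a cubemap from an equirectangular image.

    equi: H x W x C array covering 360° horizontally, 180° vertically.
    The other five faces follow by rotating the ray directions.
    (Illustrative sketch; conventions are assumptions, not the paper's code.)
    """
    H, W = equi.shape[:2]

    # Face-plane coordinates in [-1, 1]; rays through the unit cube's
    # front face (+x), with +z pointing up.
    a = np.linspace(-1.0, 1.0, face_size)
    y, z = np.meshgrid(a, -a)
    x = np.ones_like(y)

    # Ray direction -> spherical angles.
    lon = np.arctan2(y, x)               # longitude in [-pi, pi]
    lat = np.arctan2(z, np.hypot(x, y))  # latitude in [-pi/2, pi/2]

    # Spherical angles -> equirectangular pixels (nearest neighbour).
    px = np.rint((lon / (2 * np.pi) + 0.5) * (W - 1)).astype(int)
    py = np.rint((0.5 - lat / np.pi) * (H - 1)).astype(int)
    return equi[py, px]
```

In practice, a dedicated library such as py360convert can produce all six faces with bilinear interpolation; the point here is only that each face is a distortion-free perspective view, which is what lets a conventional CNN process the cubemap inputs.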
VQA 360° Dataset
The following resources are provided: