Video and Language


Leveraging Video Descriptions to Learn Video Question Answering

AAAI 2017


  We propose a scalable approach to learning video-based question answering (QA): answering a free-form natural language question about the contents of a video.

  Our approach automatically harvests a large number of videos and descriptions freely available online. Candidate QA pairs are then generated automatically from the descriptions rather than annotated manually. Next, we use these candidate QA pairs to train several video-based QA methods extended from MN (Memory Network), VQA (Visual-QA), SA (Soft-attention), and SS (Sequence-to-sequence). To handle imperfect candidate QA pairs, we propose a self-paced learning procedure that iteratively identifies them and mitigates their effects during training.

  Finally, we evaluate performance on manually generated video-based QA pairs. The results show that our self-paced learning procedure is effective, and the extended SS model outperforms various baselines.
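
  As an illustration of the self-paced idea mentioned above, the following is a minimal sketch of a generic self-paced loop with a hard loss threshold. It is not the procedure from the paper: this page does not state the selection criterion, so the threshold rule and the names `self_paced_select`, `self_paced_training`, `train_step`, `example_loss`, `initial_threshold`, and `growth` are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence, TypeVar

Example = TypeVar("Example")


def self_paced_select(losses: Sequence[float], threshold: float) -> List[int]:
    """Return the indices of examples whose current loss is at most `threshold`."""
    return [i for i, loss in enumerate(losses) if loss <= threshold]


def self_paced_training(
    examples: Sequence[Example],
    train_step: Callable[[List[Example]], None],   # hypothetical: updates the model on a subset
    example_loss: Callable[[Example], float],      # hypothetical: current per-example loss
    initial_threshold: float,
    growth: float,
    rounds: int,
) -> List[List[int]]:
    """Generic self-paced loop (an assumed variant, not the authors' exact method).

    Each round scores every candidate QA pair with the current model, trains
    only on the pairs that look "easy" (low loss), and then relaxes the
    threshold so that harder pairs can enter in later rounds. Pairs that stay
    above the threshold are treated as likely noisy and are left out.
    """
    threshold = initial_threshold
    history: List[List[int]] = []
    for _ in range(rounds):
        losses = [example_loss(ex) for ex in examples]
        kept = self_paced_select(losses, threshold)
        train_step([examples[i] for i in kept])
        history.append(kept)
        threshold *= growth
    return history
```

  In this sketch a candidate pair whose loss never drops below the threshold is simply never used for training, which is one simple way to limit the influence of badly generated QA pairs.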




Title Generation for User Generated Videos

ECCV 2016


  A great video title describes the most salient event compactly and captures the viewer's attention. In contrast, video captioning tends to generate sentences that describe the video as a whole. Although automatically generating a video title is a very useful task, it has received much less attention than video captioning. We address video title generation for the first time by proposing two methods that extend state-of-the-art video captioners to this new task.

  First, we make video captioners highlight-sensitive by priming them with a highlight detector. Our framework allows for jointly training a model for title generation and video highlight localization. Second, we induce high sentence diversity in video captioners, so that the generated titles are also diverse and catchy. Learning the sentence structure of such titles may require a large number of sentences. Hence, we propose a novel sentence augmentation method to train a captioner with additional sentence-only examples that come without corresponding videos.

  We collected a large-scale Video Titles in the Wild (VTW) dataset of 18,100 automatically crawled user-generated videos and titles. On VTW, our methods consistently improve title prediction accuracy and achieve the best performance in both automatic and human evaluation. Finally, our sentence augmentation method also outperforms the baselines on the M-VAD dataset.
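
  To make the sentence augmentation idea concrete, here is a minimal sketch of one plausible way to mix video-title pairs with sentence-only examples in a single training set. This page does not say how sentence-only examples are fed to the captioner, so the zero placeholder feature, the `has_video` flag, and the name `build_training_examples` are assumptions for illustration only.

```python
from typing import List, Sequence, Tuple

Feature = List[float]
TrainingExample = Tuple[Feature, str, bool]  # (video feature, sentence, has_video)


def build_training_examples(
    paired: Sequence[Tuple[Feature, str]],
    sentence_only: Sequence[str],
    feature_dim: int,
) -> List[TrainingExample]:
    """Combine video-title pairs with sentence-only examples.

    Assumption (not confirmed by the source): a sentence without a video is
    given an all-zero placeholder feature and marked with has_video=False,
    so a captioner could still learn sentence structure from it.
    """
    examples: List[TrainingExample] = [
        (list(feature), sentence, True) for feature, sentence in paired
    ]
    for sentence in sentence_only:
        examples.append(([0.0] * feature_dim, sentence, False))
    return examples
```

  The `has_video` flag is kept so that a training loop could, for example, skip any video-dependent loss term for the sentence-only examples.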





VTW Dataset

    A large-scale user-generated video benchmark for language-level understanding. The benchmark covers a wide range of recent language-level understanding tasks, including Video Title Generation and Video Question Answering. [See examples]


  • Kuo-Hao Zeng, Tseng-Hung Chen, Ching-Yao Chuang, Yuan-Hong Liao, Juan Carlos Niebles, Min Sun, "Leveraging Video Descriptions to Learn Video Question Answering." AAAI 2017 [arXiv] [technical report].
  • Kuo-Hao Zeng, Tseng-Hung Chen, Juan Carlos Niebles, Min Sun, "Title Generation for User Generated Videos." ECCV 2016 [arXiv] [technical report].

Contact: Kuo-Hao Zeng

Last update: January 2017