Diverse Visuo-Linguistic Question Answering (DVLQA) Challenge. (arXiv:2005.00330v1 [cs.CV])


[Submitted on 1 May 2020]

Abstract: Existing question answering datasets mostly contain homogeneous contexts,
based on either textual or visual information alone. On the other hand,
digitalization has changed the nature of reading, which now often involves
integrating information across multiple heterogeneous sources. To bridge the
gap between the two, we compile a Diverse Visuo-Linguistic Question Answering
(DVLQA) challenge corpus, where the task is to perform joint inference over the
given image-text modalities in a question answering setting. Each dataset item
consists of an image and a reading passage, where questions are designed to
combine both visual and textual information, i.e. ignoring either of them would
make the question unanswerable. We first explore combinations of the best
existing deep learning architectures for visual question answering and machine
comprehension to solve DVLQA subsets and show that they are unable to reason
well on the joint task. We then develop a modular method which demonstrates
slightly better baseline performance and offers more transparency for
interpretation of intermediate outputs. However, it is still far behind
human performance; we therefore believe DVLQA will be a challenging benchmark
for question answering involving reasoning over visuo-linguistic context. The
dataset, code and public leaderboard will be made available at
this https URL.

Submission history

From: Shailaja Keyur Sampat
Fri, 1 May 2020 12:18:55 UTC (5,435 KB)

Source: http://arxiv.org/abs/2005.00330

