Multimodal Graph Networks for Compositional Generalization in Visual Question Answering

Author(s): Saqur, Raeid; Narasimhan, Karthik

To refer to this page use: http://arks.princeton.edu/ark:/88435/pr1hv78
Full metadata record
DC Field | Value | Language
dc.contributor.author | Saqur, Raeid | -
dc.contributor.author | Narasimhan, Karthik | -
dc.date.accessioned | 2021-10-08T19:51:18Z | -
dc.date.available | 2021-10-08T19:51:18Z | -
dc.date.issued | 2020 | en_US
dc.identifier.citation | Saqur, Raeid, and Karthik Narasimhan. "Multimodal Graph Networks for Compositional Generalization in Visual Question Answering." Advances in Neural Information Processing Systems 33 (2020): pp. 3070–3081. | en_US
dc.identifier.issn | 1049-5258 | -
dc.identifier.uri | https://proceedings.neurips.cc/paper/2020/hash/1fd6c4e41e2c6a6b092eb13ee72bce95-Abstract.html | -
dc.identifier.uri | http://arks.princeton.edu/ark:/88435/pr1hv78 | -
dc.description.abstract | Compositional generalization is a key challenge in grounding natural language to visual perception. While deep learning models have achieved great success in multimodal tasks like visual question answering, recent studies have shown that they fail to generalize to new inputs that are simply an unseen combination of those seen in the training distribution. In this paper, we propose to tackle this challenge by employing neural factor graphs to induce a tighter coupling between concepts in different modalities (e.g. images and text). Graph representations are inherently compositional in nature and allow us to capture entities, attributes and relations in a scalable manner. Our model first creates a multimodal graph, processes it with a graph neural network to induce a factor correspondence matrix, and then outputs a symbolic program to predict answers to questions. Empirically, our model achieves close to perfect scores on a caption truth prediction problem and state-of-the-art results on the recently introduced CLOSURE dataset, improving on the mean overall accuracy across seven compositional templates by 4.77% over previous approaches. | en_US
dc.format.extent | 3070 - 3081 | en_US
dc.language.iso | en_US | en_US
dc.relation.ispartof | Advances in Neural Information Processing Systems | en_US
dc.rights | Final published version. Article is made available in OAR by the publisher's permission or policy. | en_US
dc.title | Multimodal Graph Networks for Compositional Generalization in Visual Question Answering | en_US
dc.type | Conference Article | en_US
pu.type.symplectic | http://www.symplectic.co.uk/publications/atom-terms/1.0/conference-proceeding | en_US

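The abstract above outlines a three-stage pipeline: build a multimodal graph over question and image concepts, process it with a graph neural network to induce a factor correspondence matrix, and execute a symbolic program to predict the answer. The snippet below is a minimal, illustrative sketch of the middle stage only, not the authors' implementation: it embeds text-side and image-side nodes, performs one round of message passing over a joint adjacency matrix, and scores a soft correspondence matrix between question concepts and scene objects. All class, function, and variable names are hypothetical, and the feature dimensions are arbitrary placeholders.

```python
import torch
import torch.nn as nn


class MultimodalGraphSketch(nn.Module):
    """Toy multimodal graph step: embed both modalities, message-pass, score correspondences."""

    def __init__(self, d_text: int, d_image: int, d_hidden: int = 64):
        super().__init__()
        self.embed_text = nn.Linear(d_text, d_hidden)    # project question-node features
        self.embed_image = nn.Linear(d_image, d_hidden)  # project scene-graph node features
        self.message = nn.Linear(d_hidden, d_hidden)     # single message-passing transform

    def forward(self, text_feats, image_feats, adj):
        # Stack both modalities into one joint graph of (n_text + n_image) nodes.
        nodes = torch.cat([self.embed_text(text_feats),
                           self.embed_image(image_feats)], dim=0)
        # One round of mean-aggregated message passing over the joint adjacency.
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)
        nodes = torch.relu(self.message(adj @ nodes) / deg + nodes)
        # Soft correspondence matrix (text nodes x image nodes) via scaled dot products.
        n_text = text_feats.shape[0]
        t, v = nodes[:n_text], nodes[n_text:]
        return torch.softmax(t @ v.T / t.shape[-1] ** 0.5, dim=-1)


if __name__ == "__main__":
    n_text, n_image = 4, 6                                   # 4 question concepts, 6 scene objects
    adj = torch.ones(n_text + n_image, n_text + n_image)     # fully connected toy graph
    model = MultimodalGraphSketch(d_text=300, d_image=512)   # placeholder feature sizes
    corr = model(torch.randn(n_text, 300), torch.randn(n_image, 512), adj)
    print(corr.shape)  # torch.Size([4, 6]); each row is a distribution over scene objects
```

In the setting described by the abstract, each row of such a correspondence matrix would ground one linguistic concept to a scene entity before the symbolic program is executed; the paper's actual graph construction and program executor are more involved than this toy step.
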
Files in This Item:
File | Description | Size | Format
MultiVQA.pdf | | 2.42 MB | Adobe PDF


Items in OAR@Princeton are protected by copyright, with all rights reserved, unless otherwise indicated.