Towards Unique and Informative Captioning of Images

Author(s): Wang, Zeyu; Feng, Berthy; Narasimhan, Karthik; Russakovsky, Olga

To refer to this page use: http://arks.princeton.edu/ark:/88435/pr1283p
Full metadata record
dc.contributor.author: Wang, Zeyu
dc.contributor.author: Feng, Berthy
dc.contributor.author: Narasimhan, Karthik
dc.contributor.author: Russakovsky, Olga
dc.date.accessioned: 2021-10-08T19:47:18Z
dc.date.available: 2021-10-08T19:47:18Z
dc.date.issued: 2020 [en_US]
dc.identifier.citation: Wang, Zeyu, Berthy Feng, Karthik Narasimhan, and Olga Russakovsky. "Towards Unique and Informative Captioning of Images." European Conference on Computer Vision (2020): pp. 629-644. doi:10.1007/978-3-030-58571-6_37 [en_US]
dc.identifier.issn: 0302-9743
dc.identifier.uri: https://arxiv.org/pdf/2009.03949.pdf
dc.identifier.uri: http://arks.princeton.edu/ark:/88435/pr1283p
dc.description.abstract: Despite considerable progress, state-of-the-art image captioning models produce generic captions, leaving out important image details. Furthermore, these systems may even misrepresent the image in order to produce a simpler caption consisting of common concepts. In this paper, we first analyze both modern captioning systems and evaluation metrics through empirical experiments to quantify these phenomena. We find that modern captioning systems return higher likelihoods for incorrect distractor sentences than for ground-truth captions, and that evaluation metrics like SPICE can be 'topped' by simple captioning systems relying on object detectors. Inspired by these observations, we design a new metric (SPICE-U) by introducing a notion of uniqueness over the concepts generated in a caption. We show that SPICE-U is better correlated with human judgements than SPICE, and effectively captures notions of diversity and descriptiveness. Finally, we demonstrate a general technique to improve any existing captioning model: using mutual information as a re-ranking objective during decoding. Empirically, this results in more unique and informative captions, and improves three different state-of-the-art models on SPICE-U as well as on the average score over existing metrics. (Code is available at https://github.com/princetonvisualai/SPICE-U.) [en_US]
dc.format.extent: 629 - 644 [en_US]
dc.language.iso: en_US [en_US]
dc.relation.ispartof: European Conference on Computer Vision [en_US]
dc.rights: Author's manuscript [en_US]
dc.title: Towards Unique and Informative Captioning of Images [en_US]
dc.type: Conference Article [en_US]
dc.identifier.doi: 10.1007/978-3-030-58571-6_37
dc.identifier.eissn: 1611-3349
pu.type.symplectic: http://www.symplectic.co.uk/publications/atom-terms/1.0/conference-proceeding [en_US]
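
The mutual-information re-ranking mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the scoring function assumes a pointwise-mutual-information-style objective, log p(caption | image) - lambda * log p(caption), and the candidate captions and log-probabilities below are hypothetical.

```python
import math  # stdlib only; log-probabilities are supplied directly

def mi_rerank(candidates, lam=1.0):
    """Re-rank candidate captions by an approximate PMI score:
    log p(c | image) - lam * log p(c).
    Each candidate is (caption, log_p_given_image, log_p_prior).
    Higher scores favor captions that are likely given the image but
    unlikely a priori, i.e. more image-specific (unique) captions."""
    scored = [(cap, lp_img - lam * lp_prior)
              for cap, lp_img, lp_prior in candidates]
    scored.sort(key=lambda t: t[1], reverse=True)
    return scored

# Hypothetical beam of candidates with made-up log-probabilities.
beam = [
    ("a man riding a horse", -2.0, -3.0),                   # generic: high prior
    ("a jockey in red silks on a dark horse", -4.0, -9.0),  # specific: low prior
]
print(mi_rerank(beam)[0][0])  # the specific caption now ranks first
```

With lam=0 the objective reduces to plain likelihood and the generic caption wins; increasing lam trades likelihood for image-specificity, which is the effect the abstract attributes to the re-ranking step.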

Files in This Item:
File: TowardsUniqueInformativeCaptioningImages.pdf
Size: 2.53 MB
Format: Adobe PDF


Items in OAR@Princeton are protected by copyright, with all rights reserved, unless otherwise indicated.