Who are you referring to? Coreference resolution in image narrations

ICCV 2023

Arushi Goel, Basura Fernando, Frank Keller and Hakan Bilen

Abstract
Coreference resolution aims to identify words and phrases which refer to the same entity in a text, a core task in natural language processing. In this paper, we extend this task to resolving coreferences in long-form narrations of visual scenes. First, we introduce a new dataset with annotated coreference chains and their bounding boxes, as most existing image-text datasets contain only short sentences without coreferring expressions or labeled chains. We then propose a new technique that learns to identify coreference chains using weak supervision, only from image-text pairs and a regularization based on prior linguistic knowledge. Our model yields large performance gains over several strong baselines in resolving coreferences. We also show that coreference resolution helps improve the grounding of narratives in images.

Dataset
Our Coreferenced Image Narratives (CIN) dataset contains 1880 images from the Localized Narratives dataset [1], which come with long-form text descriptions (narrations) and mouse traces. These images are originally a subset of the validation and test sets of the Flickr30k dataset [2]. We annotated this subset with textual coreference chains and with bounding boxes in the image linked to those chains, and we use the annotations only for validation and testing. Note that we also include singletons (i.e., coreference chains of length one). The figure below shows an example image from CIN.
Figure 1. Coreference resolution from an image and narration pair. Each highlighted block of text is referred to as a mention. Mentions in the same color corefer to the same entity, i.e., belong to the same coreference chain.
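Since the release format is not documented here, the following Python sketch only illustrates how one CIN-style record could be represented: textual mentions as character spans into the narration, grouped into chains, each grounded to a bounding box. The field names (image_id, narration, chains, span, bbox) and the example narration are hypothetical assumptions, not the released format.

```python
# Hypothetical sketch of a single CIN-style annotation record.
# Field names and the example narration are illustrative assumptions,
# not the released dataset format.
record = {
    "image_id": "example_0001",
    "narration": "In this image I can see a man. He is holding a dog.",
    "chains": [
        {
            "chain_id": 0,
            # Each mention is a character span into the narration plus the
            # bounding box (x, y, width, height) it is grounded to.
            "mentions": [
                {"span": [24, 29], "text": "a man", "bbox": [40, 30, 120, 260]},
                {"span": [31, 33], "text": "He",    "bbox": [40, 30, 120, 260]},
            ],
        },
        {
            "chain_id": 1,  # a singleton: a coreference chain of length one
            "mentions": [
                {"span": [45, 50], "text": "a dog", "bbox": [150, 200, 80, 90]},
            ],
        },
    ],
}

# The spans index directly into the narration string.
for chain in record["chains"]:
    for m in chain["mentions"]:
        start, end = m["span"]
        assert record["narration"][start:end] == m["text"]

# Group mention texts by chain to recover the coreference clusters.
clusters = {c["chain_id"]: [m["text"] for m in c["mentions"]]
            for c in record["chains"]}
print(clusters)  # {0: ['a man', 'He'], 1: ['a dog']}
```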
Contact the authors for the dataset and more details.

References

[1] Jordi Pont-Tuset, Jasper Uijlings, Soravit Changpinyo, Radu Soricut, Vittorio Ferrari; Connecting Vision and Language with Localized Narratives; ECCV 2020.

[2] Bryan A. Plummer, Liwei Wang, Chris M. Cervantes, Juan C. Caicedo, Julia Hockenmaier, Svetlana Lazebnik; Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models; IJCV 2017.
