Who are you referring to? Coreference resolution in image narrations
ICCV 2023
Arushi Goel, Basura Fernando, Frank Keller and Hakan Bilen
Abstract
Coreference resolution aims to identify words and phrases which refer to the same entity in a text, a core task in natural language processing. In this paper, we extend this task to resolving coreferences in long-form narrations of visual scenes. First, we introduce a new dataset with annotated coreference chains and their bounding boxes, as most existing image-text datasets contain only short sentences without coreferring expressions or labeled chains. We propose a new technique that learns to identify coreference chains using weak supervision, only from image-text pairs and a regularization based on prior linguistic knowledge. Our model yields large performance gains over several strong baselines in resolving coreferences. We also show that coreference resolution helps improve the grounding of narratives in images.
Dataset
Our Coreferenced Image Narratives (CIN) dataset contains 1,880 images from the Localized Narratives dataset [1] that come with long-form text descriptions (narrations) and mouse traces. These images are originally a subset of the test and validation sets of the Flickr30k dataset [2]. We annotated this subset with textual coreference chains and the bounding boxes linked to them, and use these annotations only for validation and testing. Note that we also include singletons (i.e., coreference chains of length one). The figure below shows an example image from CIN.
Contact the authors for the dataset and further details.
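To make the annotation structure concrete, below is a minimal sketch of how one CIN example might be represented and parsed. This is an illustration only: the field names (narration, chains, mentions, box) and the (x, y, w, h) box convention are assumptions, not the actual release format, which is available from the authors.

```python
from dataclasses import dataclass

@dataclass
class Mention:
    """One referring expression in the narration (assumed schema)."""
    start: int    # character offset where the mention begins
    end: int      # character offset one past the mention
    box: tuple    # assumed (x, y, w, h) bounding box linked to this mention

@dataclass
class CorefChain:
    """All mentions of one entity; singletons have a single mention."""
    entity: str
    mentions: list

def load_example(record: dict):
    """Parse one hypothetical CIN record into narration text and chains."""
    narration = record["narration"]
    chains = [
        CorefChain(
            entity=c["entity"],
            mentions=[Mention(m["start"], m["end"], tuple(m["box"]))
                      for m in c["mentions"]],
        )
        for c in record["chains"]
    ]
    return narration, chains

# Toy usage with a made-up record (not from the dataset):
record = {
    "narration": "A woman walks her dog. She holds its leash.",
    "chains": [
        {"entity": "woman",
         "mentions": [{"start": 0, "end": 7, "box": [10, 20, 80, 160]},
                      {"start": 23, "end": 26, "box": [10, 20, 80, 160]}]},
        {"entity": "dog",
         "mentions": [{"start": 18, "end": 21, "box": [90, 120, 60, 50]},
                      {"start": 33, "end": 36, "box": [90, 120, 60, 50]}]},
    ],
}
narration, chains = load_example(record)
for chain in chains:
    print(chain.entity, "->", [narration[m.start:m.end] for m in chain.mentions])
# woman -> ['A woman', 'She']
# dog -> ['dog', 'its']
```

Note how the pronoun "its" and the noun phrase "dog" land in the same chain, while each mention keeps its own link to a region in the image; a chain with a single mention would represent a singleton.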
References
[1] Jordi Pont-Tuset, Jasper Uijlings, Soravit Changpinyo, Radu Soricut, Vittorio Ferrari. Connecting Vision and Language with Localized Narratives. ECCV 2020.
[2] Bryan A. Plummer, Liwei Wang, Chris M. Cervantes, Juan C. Caicedo, Julia Hockenmaier, Svetlana Lazebnik. Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models. IJCV 2017.