RET

About

This page presents a corpus of referring expressions (REs) in which references to the same entity are linked over a 20-year period. The entities are for now limited to persons and organizations.

The data was extracted from the New York Times Annotated Corpus (NYT), which contains the text of all articles published in the New York Times between 1987 and 2007. We have used the corpus to analyze the trajectories of entities from hearer-new to hearer-old status. This work and the corpus are described in our EMNLP 2018 paper: Getting to "Hearer-old": Charting Referring Expressions Across Time, Ieva Staliunaite, Hannah Rohde, Bonnie Webber and Annie Louis. See the paper for details about the corpus and our models.

Since the NYT corpus is licensed by the LDC, we provide our data in the form of byte offsets into the NYT files. You will need to obtain the corpus from LDC and then compute the referring expression spans based on our offsets. grab_re.py provides an example of how the RE spans and time of mention can be extracted from the NYT corpus using these offsets.

Corpus statistics

This version has 475,933 REs for organizations from 75,686 unique entities, and 742,721 person mentions from 324,765 unique entities. This download does not apply any cutoffs on the entities. In the paper, we used only entities that appear in at least two documents, and filtered out entities where any two consecutive mentions are more than 6 months apart. This download also contains REs from all years of the NYT.
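The paper's cutoffs described above can be approximated on the downloaded rows. The following is a minimal sketch, not the authors' filtering code. It assumes that the date of a mention can be read from the year/month/day folders of the nyt path, and that "6 months" can be approximated as 183 days. The names path_date, filter_entities and MAX_GAP are hypothetical.

```python
from collections import defaultdict
from datetime import date, timedelta

# Assumption: "6 months" is approximated as 183 days.
MAX_GAP = timedelta(days=183)


def path_date(nyt_path):
    """Read the date from a path such as 1987/01/02/0000405.xml (assumed layout)."""
    year, month, day = nyt_path.split("/")[:3]
    return date(int(year), int(month), int(day))


def filter_entities(rows, min_docs=2, max_gap=MAX_GAP):
    """Return the ids of entities that pass both cutoffs.

    rows is an iterable of (entity_id, nyt_path) pairs, one per RE.
    """
    docs = defaultdict(set)
    for entity_id, nyt_path in rows:
        docs[entity_id].add(nyt_path)

    kept = set()
    for entity_id, paths in docs.items():
        # Cutoff 1: the entity must appear in at least min_docs documents.
        if len(paths) < min_docs:
            continue
        # Cutoff 2: no two consecutive mentions may be further apart than max_gap.
        dates = sorted(path_date(p) for p in paths)
        if all(later - earlier <= max_gap for earlier, later in zip(dates, dates[1:])):
            kept.add(entity_id)
    return kept
```

Because the sketch works at the granularity of a day and of a document, several mentions inside one article count as a single time point.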

Download

Persons
Organizations
grab_re.py

The Persons and Organizations files each have 3 columns. One row corresponds to one referring expression.

entity id: is a unique id (number) for each entity. All rows with the same entity id are mentions of the same entity. The entity ids should be interpreted separately for people and organizations. Even though only a number is provided to identify the entity, the referring expressions themselves should make clear who the people and organizations are. The expressions can also be matched against the article metadata in the NYT (which lists salient people and organizations) to get a canonical name for the entity (the name provided by NYT editors). All the entities we extracted were tagged as salient in the corpus metadata, so a canonical name can be obtained for each of them. The metadata to look for in the NYT are the <person> and <org> XML tags.
For example:
<org class="indexing_service">Cleveland Browns</org>
<person class="indexing_service">Wallace, William N</person>
nyt path: is the path to the xml file containing the RE. The paths originate from the year folder in the NYT.

span offset: gives the character offsets. There are two sets of offsets separated by 'H'. The set before 'H' is the full span of the referring expression. The set after 'H' is the span of the head of the RE as defined in our paper.
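The canonical-name lookup mentioned under entity id can be sketched with the Python standard library. This is only an illustration under stated assumptions: the function name salient_names and its source_class parameter are hypothetical, and the sketch simply collects the text of every <person> and <org> element in one article, optionally restricted to a given class attribute such as "indexing_service".

```python
import xml.etree.ElementTree as ET


def salient_names(xml_text, source_class=None):
    """Collect the text of <person> and <org> elements in one article.

    If source_class is given, keep only elements whose class attribute
    matches it (for example "indexing_service").
    """
    root = ET.fromstring(xml_text)
    names = {"person": [], "org": []}
    for tag in names:
        for element in root.iter(tag):
            if source_class is not None and element.get("class") != source_class:
                continue
            if element.text:
                names[tag].append(element.text.strip())
    return names
```

Matching an RE against these names (for instance by surname) is left out here, since the matching strategy is not specified on this page.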

As an example, for the row:
0 1987/01/02/0000405.xml 2822:2835H2822:2835
There is an RE mentioning entity 0 in the file 1987/01/02/0000405.xml. The span of the RE covers character positions 2822:2835 in the XML file (up to but not including character 2835). The head of the RE covers the same positions, 2822:2835.
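A row like the one above can be parsed as follows. This is a minimal sketch and not a replacement for grab_re.py, which remains the reference for extraction. It assumes that the three columns are separated by whitespace; the names Mention, parse_span, parse_row and extract are hypothetical.

```python
from typing import NamedTuple, Tuple


class Mention(NamedTuple):
    entity_id: int
    nyt_path: str
    full_span: Tuple[int, int]  # (start, end), end not included
    head_span: Tuple[int, int]  # (start, end), end not included


def parse_span(text):
    """Turn '2822:2835' into (2822, 2835)."""
    start, end = text.split(":")
    return int(start), int(end)


def parse_row(line):
    """Parse one row: entity id, nyt path, and 'full-span H head-span'."""
    entity_id, nyt_path, offsets = line.split()
    full, head = offsets.split("H")
    return Mention(int(entity_id), nyt_path, parse_span(full), parse_span(head))


def extract(data, span):
    """Slice the file contents with a half-open (start, end) span."""
    start, end = span
    return data[start:end]


mention = parse_row("0 1987/01/02/0000405.xml 2822:2835H2822:2835")
```

To recover the RE text, read the file at mention.nyt_path relative to the root of your NYT copy and pass its contents to extract. This page refers to the offsets both as byte offsets and as character offsets; the two only differ for non-ASCII text, so check against grab_re.py whether the file should be read as bytes or as decoded text.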

If you use this corpus, please cite:

@InProceedings{D18-1466,
  author = "Stali{\={u}}nait{\.{e}}, Ieva and Rohde, Hannah and Webber, Bonnie and Louis, Annie",
  title = "Getting to ``Hearer-old'': Charting Referring Expressions Across Time",
  booktitle = "Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing",
  year = "2018",
  publisher = "Association for Computational Linguistics",
  pages = "4350--4359",
  location = "Brussels, Belgium",
  url = "http://aclweb.org/anthology/D18-1466"
}

More Information

For more information on the corpus, please contact us:

Ieva Staliunaite: i.r.staliunaite (at) students [dot] uu {dot} nl
Annie Louis: alouis (at) inf [dot] ed {dot} ac.uk