HCRC Map Task Corpus XML File Structure

The XML Map Task corpus consists of a set of linked XML files. For information about hyperlinking, see the papers about the Map Task corpus. There are:

three files which apply to the whole corpus:
- maptask-landmarks.xml: Information about all the landmarks on the maps
- maptask-participants.xml: Information about the participants
- maptask-corpus.xml: Links to annotation files for each dialogue

two files which relate to each participant's reading of the list of landmarks on the maps they have used:
- *.wordlist.xml: Transcription of the landmark list
- *.citations.xml: Links from landmarks to the citation speech

the following files which relate to each dialogue:
- *.timed-units.xml: Timed Units, one per speaker
- *.tokens.xml: Tokens, one per speaker
- *.pos.xml: Part of Speech Tags, one per speaker
- *.syn.xml: Parse Trees, one per speaker
- *.moves.xml: Dialogue Moves, one per speaker
- *.games.xml: Dialogue Games, one per dialogue
- *.trans.xml: Dialogue Transactions, one per dialogue
- *.gaze.xml: Gaze, one per speaker
- *.drawing.xml: Drawing, one per dialogue
- *.pr.xml: Prosody, one per speaker
- *.landmark-refs.xml: Landmark References, one per speaker
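
The per-speaker and per-dialogue split in the list above can be sketched in Python as follows. This is only an illustration: the list does not say what the `*` prefix expands to, so the naming pattern used here (`<dialogue>.<speaker>.<layer>.xml` and `<dialogue>.<layer>.xml`), the speaker codes `g` and `f`, and the function name `expected_files` are all assumptions.

```python
# Annotation layers from the list above, grouped by how many files
# exist for each dialogue.
PER_SPEAKER = ["timed-units", "tokens", "pos", "syn", "moves",
               "gaze", "pr", "landmark-refs"]
PER_DIALOGUE = ["games", "trans", "drawing"]

# Assumption: "g" stands for the information giver and "f" for the
# information follower; the real file names may encode this differently.
SPEAKERS = ["g", "f"]


def expected_files(dialogue_id):
    """Return the annotation file names expected for one dialogue.

    The naming pattern is hypothetical, not taken from the corpus.
    """
    names = []
    for layer in PER_SPEAKER:
        for speaker in SPEAKERS:
            names.append(f"{dialogue_id}.{speaker}.{layer}.xml")
    for layer in PER_DIALOGUE:
        names.append(f"{dialogue_id}.{layer}.xml")
    return names
```

Under these assumptions a dialogue has 19 annotation files: two for each of the eight per-speaker layers and one for each of the three per-dialogue layers.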

This diagram represents the XML files involved in all the annotation levels listed above, for one dialogue in the corpus.

Green boxes represent the files which contain information about the whole corpus.

F refers to the information follower (see the description of the corpus for details) and G refers to the information giver.

Red boxes represent files which contain timing information.

Blue boxes represent files which contain pointers to other files

pointed at themselves

An arrow between boxes means that an element in the file at the tail of the arrow can link to one or more elements in the file at its head.

Dotted lines pointing to TIME signify that the file has some direct relation to time. The speech files for each speaker were recorded on separate channels, and XML elements have start and end time attributes which give time offsets in seconds into the speech.
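
As a rough sketch of how such timing attributes might be read, the Python below computes the duration of each timed element in a small made-up sample. The element name `tu`, the attribute names `start`, `end` and `id`, and the sample values are assumptions for illustration only; check the actual corpus files for the real names.

```python
import xml.etree.ElementTree as ET

# Made-up sample. The element name "tu" and the attribute names
# "id", "start" and "end" are assumptions, not taken from the corpus.
SAMPLE = """
<timed-units>
  <tu id="u1" start="0.50" end="0.75">okay</tu>
  <tu id="u2" start="1.00" end="1.50">start</tu>
</timed-units>
"""


def unit_durations(xml_text):
    """Map each timed element's id to its duration in seconds."""
    root = ET.fromstring(xml_text)
    durations = {}
    for elem in root.iter():
        start, end = elem.get("start"), elem.get("end")
        if start is None or end is None:
            continue  # element carries no timing information
        durations[elem.get("id")] = float(end) - float(start)
    return durations


durations = unit_durations(SAMPLE)
```

Because each speaker was recorded on a separate channel, offsets read this way would be relative to that speaker's own speech file.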

The corpus file has links, for each dialogue, to all the XML files which pertain to it; these are the files within the dotted box in the diagram.

Last modified: Fri Sep 28 10:24:18 BST 2007