Switchboard in NXT
Data Structure

The translation of the multiple layers of annotation of Switchboard into Nite XML format allows us to describe the relationships between these layers of annotation as part of the data structure itself. NXT tools can then be used to search over the corpus to extract text with varied discourse, syntactic, prosodic and phonetic features. Here we briefly describe how the corpus data is structured in NXT. See the NXT documentation for more details, and here for a brief guide to using NXT tools to search the corpus and the XML coding itself. Before beginning, it is important for users to be aware of the relationship between the two versions of the transcript used in the corpus.

Two Versions of the Transcript

At the time the project was started, there were two major transcripts of the Switchboard corpus: the original Switchboard/Penn Treebank release, which did not have word timing information; and the corrected MS-State transcript, which included time alignments at the word level. The Penn Treebank transcript already had substantial levels of annotation attached to it, while the MS-State version was obviously necessary to produce phonetic and prosodic annotation. A substantial part of the effort in this project was therefore spent in aligning the two transcripts.

This effort has been reasonably successful. However, it is not possible to integrate the two transcripts completely while preserving the accuracy of all the annotations attached to them. For example, if the correction of a mis-transcribed word changes its part-of-speech, this evidently affects the syntactic structure above it. Further, in some cases, such as contractions, the same word is represented as one element in one transcript, and two in the other, e.g. doesn't below. Therefore, it was decided to maintain the two transcripts separately in the corpus, and represent the relationship between them in the data structure, as follows:

    ($w word) ($pw phonword) ($s syllable): ($w >"phon" $pw) && ($pw ^ $s)

Overall Structure

This diagram shows some of the layers of annotation attached to a small portion of speech from a Switchboard conversation (see here for a full summary). We can use this to illustrate how data is represented in NXT:
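
As a rough illustration of the two-transcript relationship described under Two Versions of the Transcript, the following Python sketch models a word that points to a phonword through a "phon" pointer, and a phonword that dominates its syllables. It is not NXT code and does not use the NXT API: the class names, the field names, the syllable labels and the assumption that the Treebank side holds the two-element form of doesn't are all hypothetical, chosen only to mirror the query ($w >"phon" $pw) && ($pw ^ $s).

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Syllable:
    label: str  # placeholder label, not real corpus data


@dataclass
class PhonWord:
    """A token of the time-aligned (MS-State) transcript (hypothetical model)."""
    orth: str
    syllables: List[Syllable] = field(default_factory=list)  # children it dominates


@dataclass
class Word:
    """A token of the Treebank transcript (hypothetical model)."""
    orth: str
    phon: Optional[PhonWord] = None  # stands in for the "phon" pointer


def match_word_phonword_syllable(
    words: List[Word],
) -> List[Tuple[Word, PhonWord, Syllable]]:
    """Return every (word, phonword, syllable) triple in which the word
    points to the phonword and the phonword dominates the syllable."""
    triples = []
    for w in words:
        pw = w.phon
        if pw is None:  # a word with no aligned phonword matches nothing
            continue
        for s in pw.syllables:
            triples.append((w, pw, s))
    return triples


# Assumed contraction case: two word elements share a single phonword.
doesnt = PhonWord("doesn't", [Syllable("syl1"), Syllable("syl2")])
words = [Word("does", phon=doesnt), Word("n't", phon=doesnt)]

matches = match_word_phonword_syllable(words)
for w, pw, s in matches:
    print(w.orth, pw.orth, s.label)
```

Because the two transcripts are kept separate and linked only by pointers, both word elements of the contraction can reach the same phonword without either transcript having to change its tokenisation.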