Switchboard in NXT


The Switchboard in NXT project arose from three Edinburgh-Stanford Link grants, funded by Scottish Enterprise, which aimed to use corpora to study the relationship between discourse semantics and language variation, including syntactic and prosodic realisation. The aim of these projects was to improve speech and language technology applications. We have nevertheless decided to release the enhanced Switchboard corpus publicly in XML, as we believe it is a valuable resource for researchers in both linguistics and language technology.

The Switchboard corpus (Godfrey, Holliman & McDaniel 1992) consists of spontaneous telephone conversations between previously unacquainted speakers of American English on a variety of topics chosen from a pre-determined list. A subset of one million words from this corpus was annotated for syntactic structure and disfluencies as part of the Penn Treebank project. This subset forms the basis of the NXT Switchboard corpus. As part of the original Edinburgh-Stanford Link grant, NPs in this subset were annotated for animacy, and a portion of them for information status. In the later grants, the Treebank transcript was aligned with the corrected MS-State transcript of the corpus in order to provide word timing information. We then added our own original focus/contrast and prosodic annotations to the enhanced corpus, as well as phone/syllable alignment. We have also converted and included previous annotations of dialog acts and prosody on the corpus.

The beauty of converting and enhancing the corpus within the NXT framework is that all annotations are integrated and in the same format. NXT is open-source software designed to aid the creation and analysis of multi-modal corpora involving multiple, potentially crossing, annotations. It has been used successfully in a growing number of other corpus development projects, fostering a centre of expertise in this type of work at Edinburgh. NXT provides tools for querying data structures in the corpus and outputting the results in different formats, as well as tools for creating new annotations automatically in XML.

Most of the layers of annotation produced on the corpus prior to our project have already been released, either fully publicly or through the LDC. This release adds value by contributing extra layers and by bringing all of the layers together into one framework.

It is our hope that the corpus will be useful to a wide variety of researchers. Further, the NXT framework leaves scope both for improving or adding to existing annotations and for including new layers of annotation. We therefore hope that the corpus will continue to expand and provide an even richer resource for the study of the linguistic features of spontaneous speech.