The corpus currently has the following layers of annotation, integrated
within the XML structure.
Annotation layers are grouped according to the version
of the Switchboard transcript they used. A separate summary describes the XML elements, their attributes and their relationships to the
rest of the annotations in the XML, and further documentation links to detailed explanations of the annotation guidelines used
for each layer and discusses technical issues to be aware of in
the integration of the data into NXT.
Note that not all annotations are available for all conversations in the NXT release; a separate summary
gives annotation coverage by conversation.
- Based on the Penn Treebank transcript:
- terminals: the
original orthographic transcription of the corpus, as included in the
Switchboard Penn Treebank release. Includes words,
punctuation and silence, as well
as traces marking the origin of 'moved'
syntactic elements. Part-of-speech information is included. This
version did not
originally include timing information, so word timings have been
derived by automatic alignment with the MS-State version of the transcript.
- syntax: the
syntactic structure of the terminals, annotated
as part of the Penn Treebank project. The hierarchical syntax structure
is represented by parent-child
relationships in the XML. The syntactic phrase category (e.g. VP, NP),
optional sub-category (e.g. SBJ, MNR), timing
information and a word count for the phrase are included.
- movement: marks the
link between traces
and antecedents as co-indexed in the Treebank
annotation. For example, in "What
book_i did you buy t_i?", "what book" is the
antecedent of the trace (t).
- turns: encodes
the speaker turns within each conversation, i.e. the approximate linear
order in which the sentences were said by each speaker (note that
overlapping speech can appear before or after the other speaker's turn).
- disfluency: coding
of disfluent speech from the Treebank release. Disfluencies consist of a reparandum,
i.e. the words where the speaker hesitated or made a false start, and a
repair, where the speaker corrected the error, e.g. "the-_reparandum the government_repair".
- active: sentences
which have been automatically identified as being in the active voice.
- Encoding of selected NPs, done at Edinburgh and
Stanford, for information status (old,
mediated or new) and animacy (e.g.
human, animal, non-concrete). Note that only a portion of the
corpus is annotated for information status.
- Marks the relationship between each anaphor
(i.e. an NP marked as old) and its antecedent,
i.e. the previous mention of that NP's referent in the discourse.
This was done as part of the information status annotation.
- Encoding of selected content words (e.g.
nouns, verbs, adjectives), done at Edinburgh, as to whether they are kontrastive,
i.e. made salient to distinguish them from alternatives to that word
which could have been used in the context. Coding was done according to
certain categories of kontrast, e.g. contrastive, subset or answer. Note that only the portion of the corpus
annotated for information status was annotated for kontrast.
- Encodes the relationship between certain kontrasts and the
word(s) that motivated their marking. For example, if A says "I
live in Garland" and B
replies "Well, I
prefer San Antonio", then Garland motivates
the marking of San
Antonio as contrastive
(a type of kontrast).
- dialAct: the
dialogue (or speech) act of units of the discourse, e.g. statement or question, as annotated by
Shriberg et al. (1998).
Note that the units used were based on conversational purpose, not
syntax. They are roughly equivalent to syntactic sentences, but often
do not align with them.
- Based on the MS-State transcript:
- phonwords: representation
of the corrected, time-aligned MS-State transcript of the corpus.
Includes words, laughter and noise. Timing information and the stress
profile of the word's syllables are included, e.g. "agree" has the
stress profile no stress-primary.
- syllables: automatically
derived syllable information, done at Stanford. Includes stress
information (primary, secondary or no stress).
- phones: automatically
derived phone information based on the MS-State transcript, done at […].
Includes phone identity and timing information. Users should be
aware of technical issues in the automatic phone boundary detection
when using the phone times.
- accent: pitch
accents, associated with words in the MS-State transcript. The time of
the peak of the accent as annotated and the strength of the accent (weak, full)
are given. Annotations fall into three sets according to their source: some of the annotation was done at the University of Washington
(UW prosody), some we have converted from the UW set to our standards
(Ed converted) and some are our own annotations (Ed original). For the
Ed annotations, word association was marked manually, whereas for the
UW annotations it was derived automatically from word timing. Accent
type is also given for accents in the
Edinburgh/Stanford set (nuclear, pre-nuclear, plain).
Note that prosody annotation has only been completed on a portion
of the corpus.
- phrase: grouping of
words in the MS-State transcript into prosodic phrases. Timing
information is given, as well as the phrase type, determined by the
ToBI break index following the last word in the phrase (minor,
major, disfluent, backchannel).
Note that in the Edinburgh/Stanford annotations phrases were directly marked
by annotators, whereas for the University of Washington annotations
they were determined automatically from the break information and […].
- breaks: ToBI
break index, for the University of Washington
annotations only. Breaks
have been aligned automatically with the nearest word boundary in the
MS-State transcript. The ToBI break index, and phrase and boundary tone
(where applicable), are given, along with the time of the break in the
original University of Washington annotation. For the Ed original set,
these were generated automatically from the phrase files and do not
include boundary tones.
- prosnotes: notes
made by annotators during the prosody annotations, for the
Edinburgh set only. Includes comments on errors in the
transcription, f0 tracking problems, etc.
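The syntax layer stores the phrase structure as parent-child relationships, with a category, an optional sub-category and a word count per phrase. The following minimal sketch walks a small nested tree and derives those labels and counts. The XML is a hypothetical simplification: the element and attribute names (`nt`, `t`, `cat`, `subcat`) are assumptions made for this example, not the corpus schema, and the real files may link phrases to terminals differently.

```python
import xml.etree.ElementTree as ET

# HYPOTHETICAL simplified XML: the names "nt", "t", "cat" and "subcat"
# are illustrative only and are not the corpus's actual schema.
SAMPLE = """
<nt cat="S">
  <nt cat="NP" subcat="SBJ"><t>you</t></nt>
  <nt cat="VP">
    <t>bought</t>
    <nt cat="NP"><t>the</t><t>book</t></nt>
  </nt>
</nt>
"""


def word_count(node: ET.Element) -> int:
    """Count the terminal (<t>) elements anywhere below a phrase node."""
    return sum(1 for _ in node.iter("t"))


def phrases(root: ET.Element):
    """Yield (label, word count) for every phrase, in document order."""
    for nt in root.iter("nt"):  # iter() includes the root element itself
        label = nt.get("cat")
        subcat = nt.get("subcat")
        if subcat:
            label += "-" + subcat  # Treebank style, e.g. NP-SBJ
        yield label, word_count(nt)


for label, n in phrases(ET.fromstring(SAMPLE)):
    print(label, n)
```

Run as is, this prints S 4, NP-SBJ 1, VP 3 and NP 2, one per line: a parent's word count always includes the words of its children.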
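The reparandum/repair split in the disfluency layer is what makes it possible to recover a fluent version of an utterance. A minimal sketch, assuming a hypothetical token-level encoding in which each word is tagged as reparandum, repair or fluent (the `Token` class and its `role` values are illustrative, not the corpus format), simply drops the reparandum words from the disfluency example "the- the government":

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    """A word with a HYPOTHETICAL disfluency role label."""
    word: str
    role: Optional[str] = None  # "reparandum", "repair", or None (fluent)


def remove_reparanda(tokens: List[Token]) -> str:
    """Drop reparandum words, keeping repairs and fluent words in order."""
    return " ".join(t.word for t in tokens if t.role != "reparandum")


# "the-" (reparandum) followed by "the government" (repair).
utterance = [
    Token("the-", "reparandum"),
    Token("the", "repair"),
    Token("government", "repair"),
]
print(remove_reparanda(utterance))  # the government
```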
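Because the anaphor/antecedent links described above point each NP marked as old to its previous mention, following those links backwards recovers the whole chain of mentions of a referent. The sketch below assumes a hypothetical dictionary mapping NP identifiers to antecedent identifiers; the IDs and the function name `mention_chain` are invented for the example.

```python
from typing import Dict, List


def mention_chain(np_id: str, antecedent: Dict[str, str]) -> List[str]:
    """Follow antecedent links back from np_id to the first mention.

    Returns the mentions in discourse order (first mention first).
    """
    chain = [np_id]
    seen = {np_id}
    while chain[-1] in antecedent:
        prev = antecedent[chain[-1]]
        if prev in seen:  # stop on a cyclic (malformed) link
            break
        chain.append(prev)
        seen.add(prev)
    return chain[::-1]


# HYPOTHETICAL IDs: np3 -> np2 -> np1, e.g. "it" -> "the house" -> "a house".
links = {"np3": "np2", "np2": "np1"}
print(mention_chain("np3", links))  # ['np1', 'np2', 'np3']
```

The cycle guard only protects against malformed links; with well-formed data every chain ends at a first mention that has no antecedent.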
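For the UW accents, word association was derived automatically from word timing. The exact rule used is not described here; one plausible approach, shown purely as a sketch under that assumption, assigns each accent peak to the word whose time interval contains it. The word timings, peak times and the function name `word_for_time` are hypothetical.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

Word = Tuple[str, float, float]  # (word, start time, end time) in seconds


def word_for_time(t: float, words: List[Word]) -> Optional[str]:
    """Return the word whose [start, end) interval contains time t, else None.

    ASSUMPTION: `words` is sorted by start time and intervals do not overlap.
    """
    starts = [start for _, start, _ in words]
    i = bisect_right(starts, t) - 1  # last word starting at or before t
    if i >= 0:
        word, start, end = words[i]
        if start <= t < end:
            return word
    return None  # t falls in a gap (e.g. a pause) or outside all words


# HYPOTHETICAL word timings and accent peak times.
words = [("I", 0.00, 0.12), ("prefer", 0.12, 0.55),
         ("San", 0.60, 0.85), ("Antonio", 0.85, 1.40)]
for peak in (0.30, 0.58, 1.10):
    print(peak, word_for_time(peak, words))
```

Here the peak at 0.58 s falls in the pause between "prefer" and "San" and is left unassigned; a real alignment procedure might instead attach it to the nearest word.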
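The UW breaks were aligned automatically with the nearest word boundary in the MS-State transcript. A minimal sketch of nearest-boundary snapping, assuming a sorted, non-empty list of boundary times (the times and the name `nearest_boundary` are hypothetical), uses binary search and compares only the two neighbouring candidates:

```python
from bisect import bisect_left
from typing import List


def nearest_boundary(t: float, boundaries: List[float]) -> float:
    """Snap time t to the closest value in a sorted, non-empty boundary list.

    Ties go to the earlier boundary.
    """
    if not boundaries:
        raise ValueError("need at least one word boundary")
    i = bisect_left(boundaries, t)
    # Only the neighbours on either side of the insertion point can be nearest.
    candidates = boundaries[max(i - 1, 0):i + 1]
    return min(candidates, key=lambda b: abs(b - t))


# HYPOTHETICAL word-end times (seconds) and a break time to align.
word_ends = [0.12, 0.55, 0.85, 1.40]
print(nearest_boundary(0.60, word_ends))  # 0.55
```

Binary search keeps each lookup logarithmic in the number of words, which matters when aligning every break in a long conversation.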