AMI Corpus - Transcription

Presents an overview of the transcription process and conventions used, along with a sample transcription and link to the written guidelines

An NXT-formatted file containing a word-level transcription is available for each meeting in the AMI Corpus. Transcribers worked from a set of written guidelines and were encouraged to resolve any uncertainties by consulting a discussion Wiki. To ensure a high quality of transcript, data were handled in two, and sometimes three, passes.

First pass transcribers used ChannelTrans, a multispeaker adaptation of the Transcriber tool developed at Berkeley. They were asked to strive for a balance between speed and accuracy. To facilitate the adjustment of time boundaries, transcribers were provided with presegmented ‘empty’ transcription files. Presegmented files were generated automatically by applying a simple energy-based technique to segment silence from speech for each meeting participant (see Audio and Video Signals). First pass transcribers listened to and transcribed only those areas identified by the presegmenter as ‘speech’, adjusting segment boundaries where needed. Second pass transcribers then reviewed all segments, ensuring that any missed speech was transcribed and resolving all uncertainties. As a final step, a validation script was run to flag any errors, such as unknown spellings or file format irregularities. Once a transcription was successfully validated, it was up-translated to NXT format.
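To illustrate the presegmentation step, here is a minimal sketch of an energy-based speech/silence segmenter. It is not the AMI presegmenter: the function name `presegment`, the 10 ms frame size, the energy threshold, and the minimum segment length are all assumptions chosen for the example.

```python
from typing import List, Sequence, Tuple


def presegment(
    samples: Sequence[float],
    rate: int,
    frame_ms: int = 10,          # assumed frame size
    threshold: float = 0.01,     # assumed mean-energy threshold
    min_speech_ms: int = 30,     # assumed minimum segment length
) -> List[Tuple[float, float]]:
    """Return (start, end) times in seconds of regions classed as speech.

    A frame counts as speech when its mean squared amplitude exceeds
    `threshold`; consecutive speech frames are merged into one segment.
    """
    frame_len = max(1, rate * frame_ms // 1000)
    n_frames = len(samples) // frame_len
    min_len_s = min_speech_ms / 1000
    segments: List[Tuple[float, float]] = []
    start = None  # start time of the segment currently being built

    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        energy = sum(x * x for x in frame) / frame_len
        t = i * frame_len / rate
        if energy > threshold and start is None:
            start = t                      # silence -> speech
        elif energy <= threshold and start is not None:
            if t - start >= min_len_s:     # speech -> silence
                segments.append((start, t))
            start = None

    # Close a segment that runs to the end of the signal.
    if start is not None:
        end = n_frames * frame_len / rate
        if end - start >= min_len_s:
            segments.append((start, end))
    return segments
```

Run once per participant channel, a function like this would yield the ‘empty’ segments that a transcriber then fills in and adjusts by hand.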

Speech has been transcribed verbatim using British spellings, and without correcting grammatical errors, e.g. ‘me and him have done this’. An exhaustive list of reduced lexical forms, such as ‘gonna’ and ‘kinda’, is featured. Normal capitalization is used on proper nouns and at the beginning of sentences, along with simplified standard English punctuation, i.e. commas, hyphens, full stops, and question marks.

An additional set of punctuation marks flags certain speech and non-speech events. While missed target pronunciations influenced by a speaker’s native language are unflagged, all other mispronunciations and neologisms are flagged with an asterisk, e.g. ‘velocity*’ (pronounced as ‘velocily’) and ‘bumblebeeish*’. Discontinuity and disfluency at the word or the utterance level are indicated with a hyphen, e.g. ‘I think basi-’ and ‘I just meant - I mean ...’. Particular care has also been taken to indicate whether a turn continues (no punctuation or comma) or ends (full stop, question mark, or hyphen). Simple symbols denote laughing ‘$’, throat noises ‘%’, and other nonverbal vocalizations ‘#’. Other qualitative features of the signal, such as emphasized speech or ‘while laughing’, were ignored. A special category of noises, including onomatopoetic and other highly meaningful sounds, is indicated with a meta-noise tag enclosed in square brackets, e.g. ‘[sound imitating beep]’. Instances where a string was undecipherable to the second pass transcriber are marked with ‘@’.
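The conventions above can be summarised as a small token classifier. This is only a sketch for reading human-readable transcript lines: the function names and category labels are invented for the example, and it does not reproduce any tool used in the project.

```python
import re
from typing import List, Tuple

# Stand-alone symbols described in the conventions (labels are illustrative).
SYMBOLS = {
    "$": "laugh",
    "%": "throat-noise",
    "#": "other-vocalization",
    "@": "undecipherable",
}

# A bracketed meta-noise tag may contain spaces, so match it before plain tokens.
TOKEN_RE = re.compile(r"\[[^\]]*\]|\S+")


def classify(token: str) -> str:
    """Map one transcript token to a category label."""
    if token in SYMBOLS:
        return SYMBOLS[token]
    if token.startswith("[") and token.endswith("]"):
        return "meta-noise"
    if token == "-":
        return "discontinuity"          # utterance-level break
    core = token.rstrip(".,?")          # ignore trailing punctuation
    if core.endswith("*"):
        return "flagged-word"           # mispronunciation or neologism
    if core.endswith("-"):
        return "truncated-word"         # word-level disfluency
    return "word"


def annotate(line: str) -> List[Tuple[str, str]]:
    """Tokenise a transcript line and label each token."""
    return [(tok, classify(tok)) for tok in TOKEN_RE.findall(line)]
```

For instance, `annotate("I think basi-")` labels the last token as a truncated word, and `annotate("[sound imitating beep] @")` yields one meta-noise tag followed by an undecipherable marker.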

Transcripts are time-synchronized with the digitized audio recordings and feature microphone channel IDs for distinguishing speakers. Word- and phoneme-level timings for the transcripts were generated automatically through forced alignment.

A sample transcription presented in a human-readable format is shown below.

(ID) That’s our number one prototype.
(PM) /@ like a little lightning in it.
(ID) Um do you wanna present the potato,
(ID) or shall I present the Martian?
(UI) /Okay, um -
(PM) /The little lightning bolt in it, very cute.
(UI) /What -
(UI) We call that one the rhombus, uh the rhombus.
(ME) /I could -
(PM) /The v- the rhombus rhombus?
(ID) /That’s
(ID) the rhombus, yep.
(UI) Um this one is known as the potato, uh it’s
(UI) it’s a $ how can I present it? It’s an ergonomic shape,
(ID) /$
(ME) /$
(UI) so it it fits in your hand nicely. Um,
(UI) it’s designed to be used either in your left hand or or
(UI) in your right hand.

Transcription Problems

All public releases of the NXT format annotations up to and including 1.4 contain the errors described below. These errors have since been fixed, and the fixes will be included in releases after 1.4.

There are some known errors in the NXT reference transcription that are due both to transcriber error and to a transform issue. The transform issue affects utterances that contain multiple vocal sounds described by phrases more than one word long. This happens 11 times in the NXT corpus:

  • EN2002b.D.words1567: performed]
  • EN2002b.D.words1696: performed]
  • EN2002c.C.words1696: process]
  • EN2004a.A.words314: d]
  • EN2004a.B.words1865: d]
  • EN2004a.D.words568: schwa]
  • EN2006b.C.words1641: filler]
  • ES2014d.A.words769: sound]
  • IS1006d.D.words42: flatulence]
  • TS3005d.D.words1362: air]
  • TS3012c.A.words3084: noise]

These cases were due to an error in the transform that has now been fixed. They were awkward to repair retrospectively, since many diverse automatic and manual annotations are layered on top of the word layer, some of which cannot easily be rerun. The error resulted in extra words appearing in the word stream. The solution we opted for was to replace these bogus words with a transformerror element, which has an errortype attribute describing the error and a w attribute containing the original text content of the word element.
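The replacement described above might look roughly like the following sketch, written with Python's standard `xml.etree.ElementTree`. It makes several simplifying assumptions: word elements are taken to be `w` elements carrying a plain `id` attribute (real NXT files may use a namespaced ID attribute), all other attributes are simply carried over, and the `errortype` value in the usage note is a made-up placeholder.

```python
import xml.etree.ElementTree as ET


def mark_transform_error(root: ET.Element, word_id: str, errortype: str) -> bool:
    """Replace the word element with the given ID by a transformerror element.

    The new element keeps the original attributes (including the ID, so that
    annotations pointing at it still resolve) and stores the original text in
    its `w` attribute. Returns True if a matching element was found.
    """
    for parent in root.iter():
        for index, child in enumerate(list(parent)):
            # Assumption: words are <w> elements with a plain `id` attribute.
            if child.tag == "w" and child.get("id") == word_id:
                attrs = dict(child.attrib)
                attrs["errortype"] = errortype
                attrs["w"] = child.text or ""
                replacement = ET.Element("transformerror", attrs)
                replacement.tail = child.tail   # preserve surrounding whitespace
                parent.remove(child)
                parent.insert(index, replacement)
                return True
    return False
```

Calling `mark_transform_error(root, "EN2002b.D.words1567", "some-error-label")` on a parsed words file would leave every other element and ID in the stream untouched.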

Transcriber errors that we have noted include failing to edit out the string ".." that appears in the ChannelTrans program to signal an empty utterance, and failing to leave spaces between special symbols and the rest of the transcript. Noted issues:

  • EN2006b.B.words4183: ..$
  • EN2009d.C.words1135: ..#
  • EN2009d.C.words3308: ..#
  • EN2009d.D.words2147: @
  • ES2008d.C.words935: happen.$
  • ES2014b.B.words477: bear.$
  • IB4001.D.words619: :
  • IN1013.D.words2464: But@
  • IS1000a.A.words331: Mm-hmm.$
  • IS1004b.D.words14: A*N
  • IS1004b.D.words272: R*_SI
  • IS1006d.A.words831: #.
  • TS3006d.B.words1081: ..$
  • TS3006d.B.words1412: @
  • TS3012d.D.words804: ..#

These errors were fixed by first editing the ChannelTrans trs file, then either changing a word element to a nonword element with the same ID, or adding a nonword element to the stream. All other IDs were preserved in the word stream.
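Tokens like those listed above can be spotted with a few simple checks on the text content of each word element. The heuristics below are assumptions derived from the listed examples, not the project's validation script, and the function name and problem labels are invented for illustration.

```python
from typing import List

SPECIAL_SYMBOLS = "$%#@"  # symbols that should stand alone as nonword items


def word_text_problems(text: str) -> List[str]:
    """Return a list of suspected problems in the text of one word element."""
    problems = []
    if text.startswith(".."):
        # ChannelTrans placeholder for an empty utterance was not edited out.
        problems.append("empty-utterance marker")
    if any(ch in SPECIAL_SYMBOLS for ch in text):
        # Symbol not separated by a space, or stored as a word.
        problems.append("special symbol in word")
    # An asterisk is only expected as the final character of a flagged word.
    if "*" in text.rstrip(".,?")[:-1]:
        problems.append("misplaced asterisk")
    if not any(ch.isalnum() for ch in text):
        problems.append("no alphanumeric content")
    return problems
```

Every token in the list above triggers at least one check, while ordinary tokens such as ‘velocity*’ or ‘basi-’ pass cleanly.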