AMI Corpus - List of Regularised Spellings
Basic spelling is British English.
Use these spellings for common non-words and for words that are tricky to spell.
CONTRACTIONS: gonna, wanna, gotta, kinda, sorta, shoulda, woulda, coulda, dunno, lotta (= 'lot of') **Please also refer to the "Contractions" part of the discussion below.**
AGREEMENT: uh-huh, mm-hmm, yeah, yep, aye (Scottish)
DISAGREEMENT: uh-uh, mm-mm, nope, nah
BACKCHANNELS: ah, huh, hmm, mm
HESITATIONS: uh (any variant of a pure vowel), um (any variant of vowel plus nasal), mm (any variant of pure nasal)
TAG QUESTIONS: eh, huh
INITIAL ELLIPSIS: 'kay (= okay), 'scuse (= excuse), 'til (= until), 'cause (= because), 'em (= them)
ACRONYMS & PROPER NAMES: AMI, IDIAP, T_N_O_, T_V_, V_C_R_, L_C_D_, iPod, A_SAP (pronounced 'ay-sap')
INTERJECTIONS/EXCLAMATIONS: ah, ah-ha, argh, doh, gee, geez, oh, ooh, phew, whoa, whoo, whoohoo, yay
MISC: blah, mic (microphone), okay, Euro(s), Dollar(s), Pound(s), Project Manager, Industrial Designer, User Interface Designer, Marketing Expert, Play-Doh, Real Reaction, PowerPoint
SPECIAL CHARACTERS BANNED FROM USE: Under no circumstances should the symbols '<', '>', and '&' be used in transcriptions, as these are special characters used in the XML processing.
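A minimal sketch of how a transcript line could be checked for the banned symbols before it goes into XML processing. The function name `find_banned_characters` is hypothetical and not part of any AMI tool; only the three symbols named above are checked.

```python
# The three symbols banned above because XML processing treats them specially.
BANNED_CHARACTERS = ("<", ">", "&")


def find_banned_characters(transcription_line):
    """Return (position, character) pairs for every banned symbol in the line."""
    return [
        (position, character)
        for position, character in enumerate(transcription_line)
        if character in BANNED_CHARACTERS
    ]
```

An empty list means the line is clean; otherwise each pair gives the offset of a symbol to remove or reword.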
Discussion concerning Transcription
QUESTION (MT): If the ellipsis, e.g. 'kay, 's, is sentence initial, should the letter after the apostrophe be upper case or lower case? ANSWER (SA): Upper case.
QUESTION (BK): I for my part have been using contracted forms when things clearly sound contracted. I did however assume that they were grammatically correct. So just to clarify things: are contractions allowed with pronouns only? I thought they worked with nouns and proper names as well. ANSWER (SA): Up until now, we haven't had any hard and fast rules on the use of contracted forms. In general, if the spoken form involves the elision of one or more *otherwise* fairly prominent phonemes, e.g. (1) 'the doors'll close at midnight' => 'doors will' minus /wI/ plus /schwa/ -- use the contracted form => 'doors'll'. Alternatively, if the spoken form is more a case of the vowel getting reduced, e.g. (2) 'ya know' => 'you know' minus /u/ plus /schwa/ -- use the full form => 'you know'. I'm not sure how useful these examples will be to you. As always, when in doubt, please email Melissa or myself.
QUESTION (KT): While we're on the subject! I was wondering about 'do you's which are often pronounced as 'd'you' as in 'd'you know what I mean?' - is this contracted form okay to use? ANSWER (SA): I'd prefer to see this one transcribed in full, i.e. 'do you'. Somewhat like the example (2) I gave above, it is more a case of the vowel /u/ in 'do' being lenited (or reduced) to /^h/.
3. Discontinuity marker
QUESTION (KT): Sorry to bring this up again! It just occurred to me because I was transcribing some native English speakers who use a lot of expressions like 'yeah', 'you know', 'like' in the middle of sentences, for example "maybe we should go on a bit - yeah - about the project evaluation". Should this be marked as in the example with hyphens or just with commas? ANSWER (SA): In the above example, the speaker's use of 'yeah' is parenthetic and the utterance is continued. Stick to using commas for this type of thing.
QUESTION (MT): If somebody says: "It's not" and bursts out laughing without ever finishing their utterance, should I describe it 'It's not $ -' or just 'It's not $' without the dash? ANSWER (MK): This should be transcribed with the dash at the end to indicate the utterance wasn't completed, because laughter doesn't really replace the rest of the utterance.
QUESTION (BK): If an utterance is a grammatically complete sentence, but the speaker's intonation suggests that there might have been more to follow, should we mark it as discontinuity? ANSWER (SA): Yes.
QUESTION (FK): Re the INITIAL ELLIPSIS section of regularised spellings on the Wiki: I have an example of <'S> for <It's> at the beginning of a sentence. Is that OK, or can it only mean <has>? ANSWER (SA): Use <'S>; it can mean either.
(SA): In the case of contracted forms, e.g. 'it's', and ellipsis, e.g. 'the technology's come a long way', please do NOT include a space before the apostrophe. (Sorry about any past confusion!) ***BUT WAIT! Okay, a small clarification: If the contracted form involves an auxiliary verb (e.g. have, has), you should *not* include a space before the apostrophe. E.g. 'they've', 'technology's'. Anything else (within reason) should be treated as a case of ellipsis, in which case you should include a space before the apostrophe. E.g. 'need 'em' (NOT 'need'em') and 'have 'em' (NOT 'have'em'). Again, my apologies for not being more clear.
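The spacing rule for ellipsis can be checked mechanically. The sketch below flags ellipsis forms written without the required space, as in 'need'em'. It assumes the only ellipsis forms of interest are the five in the INITIAL ELLIPSIS list; the function name `find_attached_ellipsis` is hypothetical. Auxiliary contractions such as 'they've' are deliberately left alone.

```python
import re

# Ellipsis forms taken from the INITIAL ELLIPSIS list (without the apostrophe).
ELLIPSIS_FORMS = ("kay", "scuse", "til", "cause", "em")

# A word immediately followed by an apostrophe and one of the ellipsis forms.
_ATTACHED_ELLIPSIS = re.compile(
    r"\b\w+'(?:%s)\b" % "|".join(ELLIPSIS_FORMS), re.IGNORECASE
)


def find_attached_ellipsis(text):
    """Return every token where an ellipsis form lacks the space before its apostrophe."""
    return _ATTACHED_ELLIPSIS.findall(text)
```

For example, "need'em" is reported, whereas "need 'em" and "they've" are not.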
5. Elongated segments
QUESTION (KT): I noticed when I was checking a file that a speaker sort of dragged out the first consonants of certain words like "ssssso" - how should this be transcribed? I think I put 's- so'. ANSWER (SA): If it sounds like two syllables, as in 'ssss so', then yes, type 's- so'. However, I'm guessing it was pronounced more as an elongated initial /s/ sound, in which case just type 'so'. (It may get a qualitative tag during some subsequent transcription phase.)
(SA) Please adhere to a strict convention for naming transcription files, e.g. 'IS1008a.trs' (not 'SA_IS1008a.trs').
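A file name can be checked against the convention before it is submitted. The pattern below is only inferred from the single example 'IS1008a.trs' (two capital letters, four digits, an optional lower-case session letter, then '.trs'); the actual naming scheme may allow more than this, and `is_valid_transcription_filename` is a hypothetical helper.

```python
import re

# ASSUMPTION: pattern inferred from the example 'IS1008a.trs' only.
_FILENAME_PATTERN = re.compile(r"[A-Z]{2}\d{4}[a-z]?\.trs")


def is_valid_transcription_filename(filename):
    """Return True if the whole file name matches the assumed meeting-ID pattern."""
    return _FILENAME_PATTERN.fullmatch(filename) is not None
```

Because the whole name must match, a transcriber-initials prefix such as 'SA_' is rejected.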
7. Filled pauses
(SA) An important part of transcription involves listening for things like filled pauses (e.g. uh, um), and including them in the transcription. Please listen carefully and make sure these aren't overlooked.
8. Foreign words
QUESTION (MT): If there's more than one foreign word, shall I enclose each word in caret signs, or just use a single pair of carets around the sequence of foreign words? ANSWER (SA): Please enclose sequences of foreign words within a single pair of carets, e.g. '^sacre bleu^'.
QUESTION (MT): I was wondering, when speaking about letters, how exactly should we transcribe them? Capitalised? With an underscore? E.g.: 'We have the emphasis on the R_s.' ANSWER (SA): Yes, just as you have in your example.
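Where each foreign word has been wrapped separately, the spans can be merged into one pair. This is a rough sketch that assumes the caret is typed as the ASCII character '^' and that carets in the line are properly paired; `merge_foreign_spans` is a hypothetical name.

```python
import re

# A closing caret, whitespace, then an opening caret: two adjacent foreign spans.
_ADJACENT_SPANS = re.compile(r"\^\s+\^")


def merge_foreign_spans(text):
    """Join adjacent caret-delimited foreign words into a single caret pair."""
    return _ADJACENT_SPANS.sub(" ", text)
```

Spans separated by an English word are left as they are, since the pattern only matches carets with nothing but whitespace between them.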
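For illustration, the letter convention (capital letter followed by an underscore, as in 'T_V_' and 'R_s') can be produced like this. `spell_out_letters` is a hypothetical helper, and exceptions such as 'A_SAP' still need to be typed by hand.

```python
def spell_out_letters(letters, plural=False):
    """Render letters that are spoken individually, e.g. 'tv' -> 'T_V_'."""
    spelled = "".join(letter.upper() + "_" for letter in letters)
    # A plural 's' follows the final underscore, as in 'R_s'.
    return spelled + "s" if plural else spelled
```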
10. Mispronounced or noncanonical word forms
QUESTION (MT): If a person clearly says 'tree', and equally clearly means 'three', shall I then transcribe it 'tree' or 'three*'? ANSWER (SA): transcribe as 'three*'.
(SA): Transcribe noncanonical pronunciations in the same manner, e.g. 'with all the tools availables*'.
* Important note concerning non-native English speakers (from 'The ICSI Meeting Corpus: Transcription Methods' by Jane Edwards): Do not tag "words which are tinged by a non-native speaker's mother tongue ... That is, if the person is a native Spanish speaker and says what sounds like 'Espain' instead of 'Spain', the word is simply transcribed as 'Spain' without special marking."
QUESTION (FK): And what about non-words? For example, my current speaker is trying to say 'increase the price' and she says 'crise', mixing the words 'increase' and 'price'. How should that be transcribed? There are a few similar examples with non-native speakers making words up. ANSWER (SA): If you can make out enough of the phonemes, transcribe it as a word fragment, e.g. "the lef- ess- less efficient...", otherwise, use '??'. (Note to 2nd pass transcribers: If an instance of misspeech remains undecipherable, replace the set of double question marks with the '@' symbol.)
QUESTION (BK): How do we denote lexical words missing an onset? For instance one person says "tech-" and the other completes "-nologically advanced". Do we use an apostrophe, as we do with function words such as "'cause" and "'kay"? ANSWER (SA): Type <nologically*> (note: angle brackets are just for emphasis). Word forms should never begin with a hyphen. If it is a matter of a speech error and a missing onset, type as a word fragment, e.g. 'weet-' in 'Just gotta weet- sweeten the deal.' Since the example you give is not a speech error, but rather an attempt to complete another speaker's word, I'm more inclined to flag it with an asterisk -- call it personal preference. In any case, you definitely should not treat it as a contracted form since this notation (e.g. 'cause and 'kay) is reserved only for *very common* contracted lexical forms.
(SA) E.g. 'It looks really poodlish*' or 'The little remotey* thing ...' -- transcribe these as best you can, and flag them with an asterisk.
QUESTION (FS): A clarification, twenty-five or twenty five? The guidelines say no hyphen but I just wanted to check it. (SA) Thanks for checking. Please transcribe without a hyphen, e.g. 'twenty five'.
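A small sketch of how hyphenated compound numbers could be brought into line with this rule. It only covers tens-plus-units compounds such as 'twenty-five'; the function name `unhyphenate_numbers` is hypothetical.

```python
import re

TENS = "twenty|thirty|forty|fifty|sixty|seventy|eighty|ninety"
UNITS = "one|two|three|four|five|six|seven|eight|nine"

# A tens word joined to a units word by a hyphen, e.g. 'twenty-five'.
_HYPHENATED_NUMBER = re.compile(r"\b(%s)-(%s)\b" % (TENS, UNITS), re.IGNORECASE)


def unhyphenate_numbers(text):
    """Replace the hyphen in compound numbers with a space."""
    return _HYPHENATED_NUMBER.sub(r"\1 \2", text)
```

Other hyphens, including the discontinuity dash and hyphens in compound words, are not touched.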
(SA) With the exception of the full stop, question mark, and comma, no other punctuation should be used, even if you think it helps to make a sentence more grammatically correct. This goes for colons, semicolons, quotation marks, and the use of the dash '-' to mark anything other than discontinuity or compound words. If you come across a phrase that you feel needs to be offset, e.g. a parenthetic remark, please use the lovely and understated comma.
15. Repeated words
QUESTION (MT): I was wondering whether I needed to use a dash everywhere the speaker repeats a word, e.g.: "I meant something like - like a sensor." Or is it OK to transcribe it: "I meant something like like a sensor." ANSWER (SA): No need for a dash here. Just type the repeated word, as in your 2nd example.
16. Segment boundaries
(SA) For transcribed segments, it is important to check the boundaries to make sure no important signal information has been cut off. This tends to occur most frequently with word-final /s/ and aspirated stops.
17. Silence buffers
QUESTION (MT): Oh, and also the segments needing to be padded by silence - so far I did not do it when I broke segments into smaller ones. If the silence buffers are needed, I'll, of course, correct the stuff I have done so far. ANSWER (SA): For now, don't go back and adjust any of the time bins for sections of the data you have already transcribed. (It will be dealt with during the quality control stage.) The main idea here is that when you're adjusting segments or creating new ones, try to leave a sufficient buffer on either end so that no important signal information is lost.
18. Silent segments
(SA) Transcribers should not be listening to segments marked by the presegmentation tool as silence '..'. These segments will be dealt with during the checking (i.e. quality control) pass.
* within-utterance silences (SA, MK) You will find a number of utterances that are marked by one or more intrautterance pauses. In some cases, the automatic presegmentation tool will have detected these pauses and pre-labelled them as silent segments. Please note that it is not necessary to insert a separate silence segment for an intrautterance pause, even if it seems to be of considerable length. If such pauses have already been labelled by the presegmentation tool, however, please feel free to leave them as is.
* pre-meeting silence segments QUESTION (MT): I just wanted to check once more about the 'silence' segments. When doing checking, we don't need to listen to the segments and transcribe whatever is said, before the Project Manager officially opens the meeting, is that correct? ANSWER (SA): That's correct.
(SA) If you are unsure how a word should be spelled, please consult the online Oxford English Dictionary at http://www.oed.com. (Remember: Transcriptions should be done using British standard spelling.)
20. Unintelligible word(s)
QUESTION (BK): If more than one word is unintelligible, how do I transcribe it? Is it "(??)" several times over, or several sets of "??" within one pair of parentheses? Another problem is that sometimes it's unclear how many words there are. ANSWER (SA): Just one set of parentheses and one set of double question marks is sufficient, i.e. "(??)".
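Both variants raised in the question can be collapsed to the single marker automatically. A minimal sketch, with the hypothetical name `normalise_unintelligible`; bare '??' without parentheses (used for misspeech) is left untouched.

```python
import re

# Several '??' inside one pair of parentheses, e.g. '(?? ??)'.
_MULTIPLE_INSIDE = re.compile(r"\(\s*(?:\?\?\s*)+\)")
# Several '(??)' markers in a row, e.g. '(??) (??)'.
_REPEATED_MARKERS = re.compile(r"\(\?\?\)(?:\s*\(\?\?\))+")


def normalise_unintelligible(text):
    """Reduce any run of unintelligible-word markers to a single '(??)'."""
    text = _MULTIPLE_INSIDE.sub("(??)", text)
    return _REPEATED_MARKERS.sub("(??)", text)
```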
21. Vocal noises
* general QUESTION (MT): And yet again the vocal noises - okay I don't mark any aspiration or sniffs, but should I make new segments for random tongue clicks, coughs and throat clearings that are not surrounded by speech? And the same about laughing ... ANSWER (SA): Yes to coughs, throat clearings and laughter *as long as they are roughly as audible as the person's speech*.
* infrequent vocal noises bearing linguistic content (SA): E.g. if a speaker imitates a typing noise by saying something like 'tik-a-tik-a-tik-tik-tik', do not attempt to transcribe it as a word. Rather, mark it with a qualitative tag enclosed in square brackets, i.e. '[sounds imitating typing]'.
* laughter QUESTION (FK): Is there a way to transcribe when a word is said through laughter? ANSWER (SA): Yes, but for our current purposes, you should only transcribe laughter when it is heard as distinct from speech.
* and new segments QUESTION (MT): I just wanted to check about the silence segments once more. Do we just make a new segment if there is some speech or some coughing or laughing, or do we also make a segment for loud breathing? ANSWER (SA): Create new segments for speech, and only very audible instances of coughing or laughter. You should not be labelling inbreaths or outbreaths.
* sniffs (SA): Do not mark these.
* tongue clicks QUESTION (MT): What did we decide in the end about the tongue clicks before the start of a turn? Do they merit being transcribed? ANSWER (SA): If they're quite audible (i.e. of roughly the same decibel level as the surrounding speech), then transcribe these with the '#'. Otherwise, ignore them.
22. Whispered speech
QUESTION (FK): Do you want whispered speech to be marked in a particular way in transcription? ANSWER (SA): Nope - it will get some qualitative tag at a later stage.