This one-hour tutorial is intended to teach the basic skills needed to search an existing corpus using existing annotation tools and command line utilities. The tutorial uses the AMI Meeting Corpus. For a tutorial covering the topics you need as the creator or manager of an NXT corpus, see Tutorial 2.

In the tutorial, we assume access to space on the DICE service at Edinburgh Informatics, and use paths and existing files there. Hints about how to adapt this tutorial for use on other systems are given in italics. When we gave the tutorial locally in an Informatics lab, we had some problems with accessing libraries and corpus files that had recently been moved to AFS space. Please ask if you are a local user and you get errors.

What is NXT?

NXT is a set of libraries and tools that provide for the native representation, manipulation, query and analysis of multimedia language data with complex annotations.

Starting up a graphical user interface

The tutorial materials are available on DICE. Log in and go to the right directory.

cd /group/project/ami9/NXTtutorial

Run the script by cutting and pasting the code below into a terminal window.

sh ami.sh

Choose the Named Entity Coder using the menu, and then choose IS1004d.

You will be asked for an annotator name - you should have been told which one to choose. You can of course choose any annotator name you like, or enter a new one, but it's easier to see some of the features of NXT if you choose one we have pre-populated with some annotations.

NXT runs on Windows, Linux, or Mac --- look for the .bat, .sh, or .command start-up script distributed with the corpus you are using, or use the generic start-up script that comes with NXT, NXT-start-GUI.bat/.sh/.command, which will then ask you where to find the metadata file for the corpus you wish to use.

On the clock's pull-down menu, choose the mix-lapel signal. Press the play button. You should hear sound. The blue highlighting on the transcription shows the current word. Now pause the recording.

On the clock's pull-down menu, choose one of the videos. Play with the interface to familiarize yourself with the controls. If you have a high-spec machine, you can play multiple videos, but today, take that on trust. With the recording paused, left-click on the code at the beginning of a transcription line that identifies the speaker (A, B, C or D) and then use ctrl-right-click at the same place. The interface plays just that turn.

On DICE, some users found that trying this left the player running and impossible to stop. We think this only happens if you ask it to play a turn when the signals are already playing. That means it may be a bug, either in NXT or in the version of JMF on DICE, but not necessarily one to do with memory.

Add a named entity annotation

This user interface is for annotating "named entities" on the corpus. If you don't know what a named entity is, that doesn't matter - all the interface does is let you select a word or phrase and attach one of a number of pre-defined labels to it. (If you do know what a named entity is, be aware that this corpus may not use the same labels you already know.)

The coloured fonts show named entities that have already been coded - the transcription is enclosed by parentheses and preceded by an abbreviation for the label. Each colour corresponds to a different label. Try adding a new named entity. Use your left mouse button to select a word or phrase, and then choose a label in the list of named entities. Notice that you don't have to get the word boundaries right when you select. Now select another word or phrase, and type "m". This is the keyboard short-cut for "money". You can delete a named entity by clicking on the label or parentheses and using the delete key on your keyboard.

Explore the existing named entity annotation

Using the view menu, call up the search interface. Copy the following query into the search window.

($n named-entity):

In NXT, paste is ctrl-v, even on a Mac.

The query is written in a logic-based language specifically designed to work well with the kind of data for which people use NXT. Variables start with a dollar sign. Before the colon, there are variable bindings constrained by NXT data type. After the colon, there is a boolean combination of constraints on the matches. Queries are evaluated by finding all tuples that match the given bindings and then applying the constraints to restrict the set of tuples returned.
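Schematically, every query follows the same pattern (this is a sketch of the syntax, not a runnable query):

```
($var1 type1)($var2 type2) ... : (condition1) && (condition2) && ...
```

The colon is required even when there are no conditions, which is why the first query you pasted ends in a bare colon.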

In this query, there is one variable binding, and no added constraints. Select the "matchlist" line of the result set, and look at the spreadsheet display below the list. You may need to expand the search window to see it. On the transcription window, any transcription that relates to the results is highlighted in orange. The spreadsheet view gives some basic information about the results. Select an individual result and see what happens to the highlighting and spreadsheet.

Now go back to the query tab, and use the following query:

($n named-entity)($t ne-type):($n>$t)

This query returns pairs of named entities and named entity types (used to represent their labels). Next, try

($n named-entity)($t ne-type):

You get a lot of results. If you can't think why, ask a tutor when you get a chance - there's an important insight we need to make sure you get. Meanwhile, go on to:

($n named-entity)($t ne-type):($n>$t)&($t@name="MATERIALS")

This constrains the returned pairs to be just the named entities that match the "MATERIALS" type.

Play with a few more queries, using other attributes for the type besides the name. You can see the attribute names on the spreadsheet. Timings are in seconds, but for start() and end(), you need to use e.g.

($n named-entity)($t ne-type):($n>$t)&(START($n)>"500")

Explore other existing annotations and learn a little about the query language

It's not very useful to explore the named entities using this interface, because they're already pretty visible. Now we need to try some other annotations. Try the following query. It returns pairs consisting of dialogue acts and their types.

($d dact)($t da-type):($d>$t)

Look at the search highlighting. If NXT can find something on the screen associated with something from the result tuple, even if it's just words contained in something bigger, it will mark them.

Try the following queries. They give a little introduction to the most important parts of the query language.

($d dact)($t da-type):($d>$t) & ($t@gloss~/Elicit.*/)

matches pairs of dialogue act and types where the type is some kind of elicitation (of information or opinion or action...). The tilde matches on a regular expression.

($w w):TEXT($w)~/the.*/

matches all words whose textual content starts with "the". Notice that this time, we left out the parentheses around the (one) boolean constraint.

($d dact)($w w):($d ^ $w) & (TEXT($w)="banana")

matches dialogue act/word pairs where the dialogue act contains the word and the word is (exactly) "banana". "Contains" is structural - in NXT, the dialogue act is a "parent" of the word.

($s dact)($t dact):($s # $t) && ($s != $t) && START($s)<START($t)

matches pairs of dialogue acts with overlapping speech, listing the first starting act first in the pair. && is a synonym for &. # is for temporal overlap.

($h head):

matches head gestures (most of which say "no gesture happening"). Notice there is no orange highlighting. That's because there's nothing in this particular display we can highlight.

($h head):($h@type="concord_signal")

matches head movements that mean "I agree". You may have been expecting something more like

($h head)($t head-type):($h>$t)&($t@name="concord_signal")

but the suppliers of this annotation chose to encode the type directly as an attribute on the head entities.

($d dact)($h head):($d # $h) && ($h@type="concord_signal") && ($d@who==$h@who)

matches pairs of dialogue acts and concord signals that overlap in time where the person speaking is the same as the person agreeing. = and == are synonyms.

Use the corpus help and query help to write queries

There is a help menu on most of the NXT graphical user interfaces, but the help accessible from there is about how to use that particular interface. The really useful help is on the menu in the search window. There are two types. The first is corpus help. This is generated from the corpus metadata and tells you what types, attributes, and relationships to expect in the corpus. Familiarize yourself with the interface. Some corpora also come with documentation, either packaged with the corpus distribution or available from the corpus website.

The second is query language help. This explains in detail what the query operators do. You can get the same basic material from the on-line documentation if you prefer.

There's now some free time to try writing whatever queries you would like. If you get bored, you can try the "::" operator, as in:

($d dact):($d=$d)::($t da-type):($d>$t) & ($t@gloss~/elicit.*/)

or try some of the other annotation tools by exiting NXT and starting again. The Generic Display isn't intended for corpora with this many annotations or this many signals, so it won't necessarily work on the machine you're using.

The command line tools

So far, you've just been exploring the data using one meeting, or "observation". NXT also has command line tools, which can be used on individual observations or on an entire corpus. There are quite a few different tools - in this tutorial, we only want to convey the basic idea of how they work.

Quit the graphical user interface, and go back to your terminal window. Make sure you're still in the tutorial directory.

cd /group/project/ami9/NXTtutorial

Before we start, we need to set up the Java classpath to find the NXT command line utilities:

export NXT="/group/project/ami9/NXTtutorial"
export CLASSPATH=".:$NXT/lib:$NXT/lib/nxt.jar:$NXT/lib/jdom.jar:$NXT/lib/xalan.jar:$NXT/lib/xercesImpl.jar:$NXT/lib/xml-apis.jar"
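If you are adapting this to another system, the jar list can be built automatically instead of typed by hand. The helper below is a hypothetical convenience, not part of NXT; it produces "." plus every jar in the given directory, which is roughly equivalent to the manual export above (the manual version also lists the lib directory itself).

```shell
# Sketch: build a Java classpath string from a directory of jars.
# build_classpath is a hypothetical helper, not something NXT ships.
build_classpath() {
  dir="$1"
  cp="."
  for jar in "$dir"/*.jar; do
    # Skip the literal pattern if the glob matched nothing
    [ -e "$jar" ] && cp="$cp:$jar"
  done
  printf '%s\n' "$cp"
}

# Usage, assuming NXT points at the tutorial directory:
NXT="/group/project/ami9/NXTtutorial"
export CLASSPATH="$(build_classpath "$NXT/lib")"
```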

Paste in the following command:

java CountQueryResults -corpus Data/AMI/NXT-format/AMI-metadata.xml -observation IS1004d -q '($n named-entity):'

Now try the same thing, omitting the "observation" flag to run over the entire corpus - use a redirect to get the diagnostic messages out of the way:

java CountQueryResults -c Data/AMI/NXT-format/AMI-metadata.xml -q '($n named-entity):' 2> /dev/null

CountQueryResults just counts the number of n-tuple matches returned by the query. Remember that this is not the same as counting the number of times the variable you're interested in occurs in those n-tuples, since each one could match more than once. It's a good strategy to develop queries using a browser so that you can explore the results and make sure you're getting what you want before you run them over the entire corpus.
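The distinction between tuples and distinct values is easy to see with a toy example. The tab-delimited lines below are made up for illustration (they are not real NXT output): one named entity bound to two types yields two tuples but only one distinct entity.

```shell
# Made-up (entity, type) tuples, tab-delimited, one per line
tuples=$(printf 'ne1\tPROJECT\nne1\tMATERIALS\nne2\tMONEY\n')

# Number of tuples returned by the (hypothetical) query: 3
printf '%s\n' "$tuples" | wc -l

# Number of distinct values of the first variable: 2
printf '%s\n' "$tuples" | cut -f1 | sort -u | wc -l
```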

FunctionQuery is NXT's most important command line tool - it extracts tab-delimited output from a corpus. In this tutorial, we only have time to get the sense that it exists - in the second tutorial, there will be time to discuss how to use it in both data analysis and the connection to other tools.

java FunctionQuery -o IS1004d -c Data/AMI/NXT-format/AMI-metadata.xml -q '($h head):' -atts '$h@type' '$h@starttime' '$h@endtime' 2> /dev/null

In queries, you can use START($h) to get the start time of an entity, but unfortunately, FunctionQuery doesn't implement that shorthand. The designers of the corpus chose to use an attribute called "starttime" to store entity start times. $h@starttime works in the query language for this corpus as well as START($h).
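Because the output is plain tab-delimited text, it pipes naturally into standard tools. For instance, you could compute the duration of each head movement with awk; the input lines here are made up for illustration, standing in for real FunctionQuery output:

```shell
# Post-processing sketch: type, starttime, endtime in; type, duration out.
# The two input lines are invented sample data, not real corpus output.
printf 'concord_signal\t12.40\t13.10\nno_gesture\t13.10\t20.00\n' |
  awk -F'\t' '{ printf "%s\t%.2f\n", $1, $3 - $2 }'
```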

Have a play - try some other extractions. If you get bored, you can read about the other features and try

java FunctionQuery -o IS1004d -c Data/AMI/NXT-format/AMI-metadata.xml -q '($d dact)($t da-type):($d>$t)&($t@name="inf")' -atts '@count(($w w):($d^ $w)&!($w@punc="true"))' '$d' 2> /dev/null

Or think about this example, which was used (along with mencoder) to extract videos of people nodding during backchannels for a NAACL keynote.

java FunctionQuery -o ES2008a -c Data/AMI/NXT-format/AMI-metadata.xml -q '($d dact)($t da-type)($h head):($d>$t) & ($t@name="bck") & ($d # $h) & ($d@who=$h@who) & ($h@type="concord_signal")' -atts obs '@extract(($sp speaker)($m meeting):$m@observation=$d@obs && $m^$sp & $d@who==$sp@nxt_agent, global_name, 0)' '@extract(($sp speaker)($m meeting):$m@observation=$d@obs && $m^$sp & $d@who==$sp@nxt_agent, camera, 0)' starttime endtime 2> /dev/null

Understanding how the data is stored

Although you don't need to look at the data storage just to use an existing corpus, it's useful to have some idea what's going on behind the scenes.

cd /group/project/ami9/NXTtutorial/Data/AMI/NXT-format
ls

You can see one directory for each type of annotation in the corpus.

ls words

The words are given in separate files for each speaker within each meeting. Any files at this directory level are "gold standard", but there can be files in subdirectories produced by specific annotators. There is a system of "resources" that tracks dependencies between the gold standard and annotators across the entire set of annotations, but there's no time to look at it today.

more words/IS1004d.A.words.xml

The files are just XML. The ids are unique within the corpus.
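The exact element names vary from corpus to corpus, but a word file has roughly this shape. This is a simplified, hypothetical sketch - check the real file for the actual structure and attribute names:

```xml
<nite:root nite:id="IS1004d.A.words" xmlns:nite="http://nite.sourceforge.net/">
  <!-- one element per word; ids are unique across the whole corpus -->
  <w nite:id="IS1004d.A.words.0" starttime="0.53" endtime="0.71">Okay</w>
  <w nite:id="IS1004d.A.words.1" starttime="0.71" endtime="0.94">so</w>
</nite:root>
```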

more namedEntities/IS1004d.A.ne.xml

The named entities use "stand-off" annotation to reference words as their children. They also point to named entity types (the labels) in a separate ontology.
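Schematically, a single named entity entry looks something like the sketch below (simplified and hypothetical - the filenames and ids are invented, and the real files carry more detail). The children are references into the word file, and the type is a pointer into the ontology file:

```xml
<named-entity nite:id="IS1004d.A.ne.3">
  <!-- stand-off children: a range of words in the words file -->
  <nite:child href="IS1004d.A.words.xml#id(IS1004d.A.words.0)..id(IS1004d.A.words.1)"/>
  <!-- pointer to the label in the named-entity type ontology -->
  <nite:pointer role="type" href="ne-types.xml#id(ne_money)"/>
</named-entity>
```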

more AMI-metadata.xml

There is one file that tells NXT what files belong in the corpus, how all the files relate to each other, and what XML structure to expect in each file, including which attributes should be interpreted as things like start and end times. NXT calls this information "metadata".

Addendum for external users: if the signals don't work

The AMI Meeting Corpus uses the DivX codec to encode video, so you will need this on the machine you use.

NXT relies on a few things about your machine setup. They are reasonably standard items, so your machine may already have what you need.