NITE XML Toolkit - Tutorial 2

This one-hour tutorial is intended to provide a guide to some more advanced features of the NITE XML Toolkit. See Tutorial 1 for an introduction to NXT. Some of this tutorial uses the AMI Meeting Corpus. NXT documentation is available as directly styled XML or formatted for browsers that can't style XML. Query language documentation is always available from any NXT GUI (from the search window) and also online in XML or HTML.

Go through the parts of the tutorial that interest you in any order.

In the tutorial, we assume access to space on the DICE service at Edinburgh Informatics, and use paths and existing files there. Hints about how to adapt this tutorial for use on other systems are given in italics.

changing an ontology to change the set of available data labels
changing the corpus metadata to make new annotations available
changing the configuration of an annotation tool to behave differently
working with competing annotations
import and export to share data with other tools
other NXT corpora and tools; java programming

Preliminaries for Tutorial 2

Because you'll be editing some corpus-level files in this tutorial, first copy some files into a private space. Find a suitable location with at least 5 MB of free space, make a new directory and go into that directory. All commands should be executed in that new directory unless stated otherwise. First type:

unzip /group/project/ami9/NXTtutorial/Data/corpuschunk.zip

Now you need to set your CLASSPATH and PATH by copying and pasting these commands:

export NXT="/group/project/ami9/NXTtutorial"

export CLASSPATH=".:$NXT/lib:$NXT/lib/nxt.jar:$NXT/lib/jdom.jar:$NXT/lib/xalan.jar:$NXT/lib/xercesImpl.jar:$NXT/lib/xml-apis.jar"


export PATH="$PATH:/group/ltg/projects/lcontrib/bin"

Finally, copy and paste this command to allow you to see AMI audio and video:

ln -s /group/project/amiweb/AMICorpusMirror/amicorpus signals
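Before moving on, it can be worth confirming that every entry on the CLASSPATH exists, since a mistyped path usually only shows up later as a Java class-loading error. The following is a minimal sketch for a POSIX shell; the function name missing_classpath_entries is made up for this illustration and is not part of NXT.

```shell
# Hypothetical helper (not part of NXT): print each colon-separated
# CLASSPATH entry that does not exist on disk, one per line.
# Prints nothing when every entry exists.
missing_classpath_entries() {
    old_ifs=$IFS
    IFS=:
    for entry in $1; do
        # Ignore empty entries produced by "::" or a leading/trailing colon
        [ -n "$entry" ] || continue
        [ -e "$entry" ] || printf '%s\n' "$entry"
    done
    IFS=$old_ifs
}

# Check the CLASSPATH exported above (empty output means nothing is missing)
missing_classpath_entries "${CLASSPATH:-}"
```

Any path this prints should be corrected in the export commands before starting the tools.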

For external users, the corpus extract is here. Please note that some of this material is simply a cut-down version of the AMI corpus, whose annotations you can download freely. Please read the LICENSE file before using any AMI data. You will need to download your own copy of NXT and change the NXT export command above to reflect your local NXT location. You also need to open ami.sh in a text editor and make it point to your local copy of NXT. If you want startup scripts for different platforms please let us know.

Changing an ontology to change the set of available data labels

Open the named-entity tool as you did in tutorial 1:

sh ami.sh

Select the named entity tool from the list of options. Notice that the sub-window titled NEGUI contains a type hierarchy. This comes from an NXT ontology file. Open the named entity ontology file in a text editor, for example:

emacs ontologies/ne-types.xml

Have a look at how the on-screen version in the tool matches the XML version in your editor. Try editing the ontology in a couple of ways, restarting the named entity coder (you can exit the tool using Ctrl-C in the terminal window) and looking at your changes on-screen. If you edit the ontology so that it is invalid XML or contains duplicated IDs, the tool will not start. RXP, which is a validating XML parser, will catch the dumbest typing problems for you if you can't see where you've gone wrong:

rxp ontologies/ne-types.xml

RXP spews out the XML if it's valid, and error messages for anything that isn't. If you get stuck and can't undo your edits, you can take a fresh copy and try again:

cp /group/project/ami9/NXTtutorial/Data/AMI/NXT-format/ontologies/ne-types.xml ontologies/

You should be able to change the keyboard shortcuts used to annotate an element; change the names of the elements; add/delete individual entries in the ontology or complete sections of the hierarchy.
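Duplicated IDs stop the tool from starting, and they can be hard to spot by eye in a large ontology. The sketch below lists any attribute value that occurs more than once in a file. It assumes that the IDs are stored in an attribute called nite:id with double-quoted values; check your copy of ne-types.xml and pass a different attribute name as the second argument if that is not the case. The function name dup_ids and the sample fragment are invented for illustration.

```shell
# Hypothetical helper (not part of NXT): print every value of an attribute
# that occurs more than once in an XML file.
# Usage: dup_ids FILE [ATTRIBUTE]
# ATTRIBUTE defaults to nite:id, which is an assumption about the ontology.
# Only double-quoted attribute values are recognised.
dup_ids() {
    attr=${2:-nite:id}
    grep -o "$attr=\"[^\"]*\"" "$1" |
        sed 's/^[^"]*"//; s/"$//' |   # keep only the text between the quotes
        sort | uniq -d                # uniq -d prints repeated values once
}

# Made-up fragment with one repeated ID, for demonstration only
sample=$(mktemp)
cat > "$sample" <<'EOF'
<ne-type nite:id="ne_1" name="EXAMPLE-A">
  <ne-type nite:id="ne_2" name="EXAMPLE-B"/>
  <ne-type nite:id="ne_2" name="EXAMPLE-C"/>
</ne-type>
EOF
dup_ids "$sample"   # prints ne_2
rm -f "$sample"
```

On the tutorial data you would run dup_ids ontologies/ne-types.xml; no output means no repeated values were found.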

RXP is available here. To get a fresh copy of ne-types.xml you'll have to extract it from the zip file again.

Changing the corpus metadata to make new annotations available

NXT metadata files describe the data types and relationships within a corpus as well as how and where it should be stored on disk. As an example, we commented out the description of the movement layer in your copy of the AMI metadata. To confirm this try:

java CountQueryResults -c AMI-metadata.xml -q '($m movement)'

You should see there are zero results. Now open AMI-metadata.xml in a text editor, search for the coding-file with the name movement and remove the XML comment surrounding it. While you're looking at the metadata file, notice a couple of things about how the movement annotation is defined. First of all it's a time-aligned-layer which means it is aligned directly to the signal: each element will have start time and end time but no child elements. Also note that the type of the movement layer is declared as an enumerated attribute directly on the element itself - unlike the named-entity which points into an ontology. Now try the CountQueryResults command again and check that you get 26 results.
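If the count does not change after editing, one thing to check is whether the declaration is still inside an XML comment. The sketch below reports whether a piece of text occurs inside or outside comments in a file. It is a plain text scan, not an XML parser, and the search text name="movement" used in the note after the code is an assumption about how the coding-file is written in the metadata; the function name comment_state is invented.

```shell
# Hypothetical helper (not part of NXT): report whether a text pattern occurs
# inside or outside XML comments in a file.
# Usage: comment_state FILE PATTERN
# Prints "active" if the pattern occurs anywhere outside a comment,
# otherwise "commented" if it occurs only inside comments, otherwise "absent".
comment_state() {
    PAT=$2 awk '
    BEGIN { pat = ENVIRON["PAT"] }
    {
        line = $0
        while (length(line) > 0) {
            if (!in_comment) {
                i = index(line, "<!--")
                if (i == 0) {
                    if (index(line, pat)) active = 1
                    line = ""
                } else {
                    # Text before the comment opener is live markup
                    if (index(substr(line, 1, i - 1), pat)) active = 1
                    in_comment = 1
                    line = substr(line, i + 4)
                }
            } else {
                j = index(line, "-->")
                if (j == 0) {
                    if (index(line, pat)) commented = 1
                    line = ""
                } else {
                    # Text before the comment closer is still commented out
                    if (index(substr(line, 1, j - 1), pat)) commented = 1
                    in_comment = 0
                    line = substr(line, j + 3)
                }
            }
        }
    }
    END {
        if (active) print "active"
        else if (commented) print "commented"
        else print "absent"
    }' "$1"
}
```

For example, comment_state AMI-metadata.xml 'name="movement"' should print commented before the edit described above and active afterwards, provided the assumed search text matches the declaration.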

Changing the configuration of an annotation tool to behave differently

NXT has three generic configurable tools for annotation. If your annotation task can be cast as one of these, you just need to create an XML configuration file for your task.

Named entity annotation is an instance of the discourse entity coder and here we'll be editing its configuration file. Start up the named entity annotator using

sh ami.sh

and selecting the named entity coder. Try to create a nested named entity by sweeping out a word with your mouse and labelling it using a keyboard shortcut, then sweeping out a region that contains your new annotation and labelling it with a different keyboard shortcut. You should find that your original label disappears and only your most recent annotation survives. That's because we explicitly disallow nesting in this tool.

Open the configuration file in a text editor, for example:

emacs configuration/amiConfig.xml

Search for the corpus-settings named dac-cs-ami: that set of config options is associated with the named entity tool in the metadata file. Now change the nenesting parameter to true. Save the file and start the named entity annotation tool again: you should see a different nesting behaviour.

Another simple edit you could try making to the configuration file would be to set the value of neroot - look at the named-entity ontology file ontologies/ne-types.xml to find suitable values. If you changed the value to ne_11, for example, you would then only be able to annotate named entities representing people: the rest of the type hierarchy becomes unavailable.

Working with competing annotations

NXT provides for creation and query of multiple versions of the same annotation. Annotations may explicitly rely on the presence of other annotations and disallow co-loading of others. This control is provided by an NXT resource file. For the AMI corpus this is called resources.xml. Try running a query:

java CountQueryResults -c AMI-metadata.xml -q '($n named-entity)'

This loads the default set of named entities as named in the resource file. You can change the resources loaded using the programmatic control NXT_RESOURCES, passed here as a Java system property. Try:

java -DNXT_RESOURCES='neVK,neMK' CountQueryResults -c AMI-metadata.xml -q '($n named-entity)'

The count is different because you have forced the simultaneous load of two competing versions of the same resource. Now we can start doing useful things with such loads: Here's an example that shows competing annotators' named entities that contain the same word but have different types:

java -DNXT_RESOURCES='neVK,neSN' FunctionQuery -c /group/project/ami9/NXTtutorial/Data/AMI/NXT-format/AMI-metadata.xml -o IS1001a -q '($n named-entity)($nt ne-type)($n2 named-entity)($nt2 ne-type)($w w):($n>$nt) & ($n2>$nt2) & ($n@res="neVK") & ($n2@res="neSN") & ($n^$w) & ($n2^$w) & ($nt@name!=$nt2@name)' -atts '$w' '$w@starttime' '$w@endtime' '$n@res' '$nt@name' '$n2@res' '$nt2@name'

Note that most annotation tools like the named-entity coder will not display multiple annotators' data in a helpful manner as they are not designed for that purpose. There are some display tools that allow you to see competing versions of the same annotation. See below.

This command uses a complete version of the NXT AMI annotations which is not available publicly. You could instead create your own competing annotations using the named-entity annotator and try these queries, or just take from this the general method of distinguishing between different annotators' codings in a query.

We find there are just two clashes of this type between these annotators, and if we observe the meeting at this time, the participants are in fact discussing a drawing of a cat on the whiteboard. So the neVK version of the annotation, annotated as DRAWING, is better here than the neSN version, which is annotated as OTHER. If you understand what's going on in the query you could perhaps find words which are annotated as named entities by one annotator but unannotated by the other.
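With more annotators or more meetings, the list of clashes quickly becomes too long to read row by row. The sketch below tallies the output by pair of type names. It assumes that FunctionQuery writes one tab-separated row per match with the columns in the order given to -atts, so that the two type names are columns 5 and 7. The function name clash_summary is made up, and the sample rows are invented placeholders rather than real corpus output.

```shell
# Hypothetical helper (not part of NXT): count clashes per pair of type names.
# Reads tab-separated rows on standard input and assumes the column order
# word, start, end, res1, type1, res2, type2 (as in the -atts list above).
clash_summary() {
    awk -F '\t' '
        NF >= 7 { count[$5 "\t" $7]++ }
        END { for (pair in count) print count[pair] "\t" pair }
    ' | sort -t "$(printf '\t')" -k1,1nr -k2
}

# Invented rows for demonstration; the words and times are placeholders
printf 'cat\t100.0\t100.4\tneVK\tDRAWING\tneSN\tOTHER\ncat\t105.2\t105.6\tneVK\tDRAWING\tneSN\tOTHER\n' |
    clash_summary   # prints one row: 2, DRAWING, OTHER (tab-separated)
```

To use it on real data, pipe the FunctionQuery command above into clash_summary.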

NXT has another programmatic control called NXT_RESOURCES_ALWAYS_ASK. If it's set to true, NXT passes the choice to the user whenever there is a choice of possible annotation versions to load.

There are some tools for viewing competing annotations graphically, though as a rule these are less well supported. As an example we can show two versions of a meeting transcript side-by-side: the manual annotation vs the output of Automatic Speech Recognition.

sh ami.sh

Select the DualTrascriptionResourceDisplay element toward the bottom of the list of programs, and you'll see a side-by-side comparison of the two transcripts. Select a signal from the list: you'll at least want an audio signal. Use the slider on the Nite Clock to move forward and back in the meeting. Now press play and compare the manual transcript on the left to the automatic output on the right. Both should be synchronized with the signal.

If you need to compare non-spanning annotations, or wish to see a timeline view of competing resources, it's likely that you won't need to start from scratch as there are pre-release versions of tools that we are developing.

Data Import and Export

Existing Transforms

NXT provides some transforms to and from file formats used by many annotation tools, but since they are a bit specialist, you have to get them from NXT's CVS source repository (see the NXT SourceForge page).

Event editor - commonly used to annotate phenomena directly aligned to the signal.
Transcriber / Channeltrans format - transcription tools
MLF format - automatic speech recognition - the transform is parameterized as MLF formats differ quite widely.
The Observer

Using FunctionQuery

FunctionQuery was introduced in Tutorial 1. It's the standard way of extracting data from NXT for further processing. If you intend to bring the processed data back into NXT, plan the round trip in advance; in particular, try to retain NXT element IDs throughout your processing.
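As an illustration of why retaining IDs matters, the sketch below re-attaches the results of some external processing to the exported rows by matching on the ID. It assumes both files are tab-separated with the NXT element ID in the first column; the function name rejoin_by_id and the file names in the note after the code are hypothetical.

```shell
# Hypothetical helper (not part of NXT): join two tab-separated files on
# their first column, assumed to hold the NXT element ID.
# Usage: rejoin_by_id EXPORTED.tsv PROCESSED.tsv
# Output: ID, remaining columns of the first file, remaining columns of the second.
rejoin_by_id() {
    tab=$(printf '\t')
    sorted_a=$(mktemp)
    sorted_b=$(mktemp)
    # join needs both inputs sorted on the join field, in the same locale
    LC_ALL=C sort -t "$tab" -k1,1 "$1" > "$sorted_a"
    LC_ALL=C sort -t "$tab" -k1,1 "$2" > "$sorted_b"
    LC_ALL=C join -t "$tab" "$sorted_a" "$sorted_b"
    rm -f "$sorted_a" "$sorted_b"
}
```

For example, rejoin_by_id exported.tsv tagged.tsv > merged.tsv. Rows whose ID appears in only one of the files are dropped from the output, so comparing line counts before and after is a quick way to notice IDs lost along the way.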

Using standard XML processing with NXT data

NXT files can be treated as standalone XML files and processed by normal XML-aware software.
To show this, we've written a stylesheet, xsl/accord.xsl, that can be used to extract the concord_signal head movements from an XML file, using XSLT:

java org.apache.xalan.xslt.Process -in headGesture/IS1004d.A.head.xml -xsl xsl/accord.xsl -out new.xml

Knitting

For most practical purposes individual NXT files are too fragmented to be useful: instead we want standoff markup resolved. The knit process was developed for this purpose. There are two ways of knitting at the moment, but neither is metadata-aware, so they require all of the related XML files to be in the same directory. We've copied them to one for you, so cd there:

cd KnitTest

We'll try knitting on an extractive summary file. Extractive summary elements have dialogue acts as children, which in turn have word children. This knit command resolves 9 NXT files into one XML tree. You can see the results more easily in a web browser than at the command line because it will indent and colour the XML helpfully.

lxinclude -t nite IS1004d.extsumm.xml > yourfile.xml

lxinclude is part of LTXML2, which not everyone has. There's also a slow stylesheet-based method:

java org.apache.xalan.xslt.Process -in IS1004d.extsumm.xml -xsl ../xsl/knit.xsl -out myfile.xml

The lxinclude version is very much faster.
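If you knit files regularly on machines that may or may not have LTXML2 installed, the two commands above can be wrapped so that the faster method is used whenever it is available. This is only a sketch: the function names choose_knitter and knit_file are made up, and the stylesheet path assumes you are still in the KnitTest directory.

```shell
# Hypothetical helpers (not part of NXT).

# Print the name of the knitting method to use: lxinclude if it is on the
# PATH, otherwise the slower Xalan stylesheet method.
choose_knitter() {
    if command -v lxinclude >/dev/null 2>&1; then
        echo lxinclude
    else
        echo xalan
    fi
}

# Knit one NXT file into a single XML tree using whichever method is available.
# Usage: knit_file INPUT.xml OUTPUT.xml
knit_file() {
    src=$1
    dest=$2
    case "$(choose_knitter)" in
        lxinclude)
            lxinclude -t nite "$src" > "$dest" ;;
        xalan)
            # Assumes the current directory is KnitTest, as above
            java org.apache.xalan.xslt.Process -in "$src" -xsl ../xsl/knit.xsl -out "$dest" ;;
    esac
}
```

For example, knit_file IS1004d.extsumm.xml knitted.xml.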

There are tools to help unknit your data back into NXT.

Write a Java Program

NXT has a full Java API you can use if what you want to export or import is particularly complex.

Other NXT corpora and tools etc

Other Corpora

There are a number of corpora available in NXT format, some of which have one-meeting samples available to play with before you download. There are one-meeting samples for the Switchboard, Maptask, and Monitor corpora with startup scripts for DICE available in /group/project/ami9/NXTtutorial/Data. As an example, try starting up a one-meeting sample of the Switchboard corpus:


cd /group/project/ami9/NXTtutorial/Data/SwitchboardSample

sh switchboard.sh

You'll need to download the single-observation versions of the corpora from this page, or indeed the entire corpora.

If you want to start formulating queries, use the Corpus Help tool from the search GUI as described in Tutorial 1.

Other Tools

There are two other configurable tools available for NXT that we haven't shown examples of: the discourse segmenter, and the continuous video labeller. To see an example of the former:


cd /group/project/ami9/NXTtutorial/

sh ami.sh

Select the dialogue act coder and a meeting.

You can see the main features of the discourse segmenter - it allows you to segment a dialogue and label the segment. You can also allow the annotation of pairs of elements, and label the pairs.

To see an example of the continuous video labeller, try:


cd /group/project/ami9/NXTtutorial/

sh ami.sh

Select the ContinuousVideoLabeling program and meeting ES2002a.

The screen starts blank, but select agent B and the movement layer from the Annotate menu, and select Closeup 4 and Overhead from the signals menu on the Nite Clock. You'll need to rearrange things a bit. Now we'll use a trick from tutorial 1: find an annotation you're interested in, for example a move annotation. First left-click it, then ctrl-right-click it and you'll see that part of the video playing. This tool is actually quite good for replaying annotations and getting a feel for what's there, but it is not a good annotation tool for two reasons. First, it lacks frame accuracy. Second, the facility for going back and editing existing annotations - particularly moving the boundary between elements - is poor.

NXT Programming

If you find the built-in annotation tools are not sufficient for your needs, or you need to extract some data in a way that's not easy to do using command-line tools, you will need to fall back on NXT's Java API. There are plenty of examples to guide you in the samples directory that's available from CVS. There is also extensive JavaDoc for the API, and people around to help you when you're stuck: you can go through the NXT SourceForge page or email Jonathan directly.
