Analysis

The fundamental tool for analysis in NXT is the NXT Query Language used through the command line tools. Query development can very usefully be done using the GUI tools, but corpus-wide analysis will normally require command line tools. Some helper tools exist for special cases like the study of reliability.

Command line tools for data analysis

This section describes the various command line utilities that are useful for searching a corpus using NXT's query language. Command line examples below are given in the syntax for bash. It is possible to run NXT command line utilities from the DOS command line without installing anything further on Windows, but many users will find it easier to install cygwin, which comes with a bash that runs under Windows. The command line tools can be found in the XXXX directory of the NXT source, and are useful code examples.

Preliminaries

Before using any of the utilities, you need to set your classpath and perhaps consider a few things about your local environment.

Setting the classpath

The command line utilities require the classpath environment variable to be set up so that the shell can find the software. Assuming $NXT is set to the top level directory in which the software is installed, this can be done as follows (remove the newlines):

if [ $OSTYPE = 'cygwin' ]; then
	export CLASSPATH=".;$NXT/lib;$NXT/lib/nxt.jar;$NXT/lib/jdom.jar;
           $NXT/lib/xalan.jar;$NXT/lib/xercesImpl.jar;$NXT/lib/xml-apis.jar;
           $NXT/lib/jmanual.jar;$NXT/lib/jh.jar;$NXT/lib/helpset.jar;
           $NXT/lib/poi.jar"
else
	export CLASSPATH=".:$NXT/lib:$NXT/lib/nxt.jar:$NXT/lib/jdom.jar:
           $NXT/lib/xalan.jar:$NXT/lib/xercesImpl.jar:$NXT/lib/xml-apis.jar:
           $NXT/lib/jmanual.jar:$NXT/lib/jh.jar:$NXT/lib/helpset.jar:
           $NXT/lib/poi.jar"
fi

This is not the full classpath that is needed for running NXT GUIs, but contains all of the methods used by the command line tools.

It is possible instead to specify the classpath on each individual call to java using the -cp argument.
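The export above can also be assembled programmatically rather than written out by hand. A minimal sketch, with the jar list abbreviated and an invented install path of /opt/nxt:

```shell
# Assemble the classpath from a list of jars. The separator is ';'
# under cygwin and ':' elsewhere, matching the branches above.
NXT="/opt/nxt"
SEP=':'
case "$OSTYPE" in cygwin) SEP=';';; esac
CLASSPATH='.'
for jar in nxt.jar jdom.jar xalan.jar xercesImpl.jar xml-apis.jar; do
    CLASSPATH="$CLASSPATH$SEP$NXT/lib/$jar"
done
export CLASSPATH
echo "$CLASSPATH"
```

This keeps the jar list in one place, so adding a jar means adding one word rather than editing two quoted strings.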

Shell interactions

You'll need to be careful to use single quotes at shell level and double quotes within queries - although we've found one shell environment that requires the quotes the other way around. Getting the quoting to work correctly in a shell script is difficult even for long-time Unix users. There is an example shell script that shows complex use of quoting in the sample directory of the NXT distribution called "quoting-example.sh".
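A sketch of the quoting pattern in isolation: single quotes at shell level, double quotes inside the query. The query text itself is only illustrative.

```shell
# Single quotes stop the shell from touching $s, the parentheses, and
# the double quotes; the query language sees them verbatim.
query='($s nt):($s@cat == "S")'
echo "$query"
```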

Don't forget that you can use redirection to divert warning and log messages:

java CountQueryResults -corpus swbd-metadata.xml -query '($n nt):' 2> logfile

Diverting to /dev/null gets rid of them without the need to save to a file.
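The same redirection pattern can be seen with a stand-in command: the subshell below writes a result to stdout and a log message to stderr, and the redirection discards only the latter.

```shell
# Only "query result" survives; the stderr line goes to /dev/null.
( echo "query result"; echo "log message" >&2 ) 2>/dev/null
```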

Memory usage

It is possible to increase the amount of memory available to java for processing and, depending on the machine setup, this may speed things up. This can be done by using flags to java, e.g.

java -Xincgc -Xms127m -Xmx512m -Xfuture CountQueryResults ...

or by editing the java calls in any of the existing scripts. This is what the flags mean:

Java Arguments Controlling Memory Use

-Xincgc

use incremental garbage collection to get back unused memory

-Xmssize

initial memory heap size

-Xmxsize

maximum memory heap size

The best choice of values will depend on your local environment.

Common Arguments

Where possible, the command line tools use the same argument structure. The common arguments are as follows.

Common Arguments for Command Line Tools

-corpus corpus

the path and filename specifying the location of the metadata file

-observation obs

the name of an observation. If this argument is not given, then the tools process all of the observations in the corpus

-query query

a query expressed in NXT's query language

-allatonce

an instruction to load all of the observations for a corpus at the same time. This can require a great deal of memory and slow down processing, but is necessary if queries draw context from outside single observations.

SaveQueryResults

java SaveQueryResults {-c corpus} {-q query} [[-o observation] | [-allatonce]] [-f outputfilename] [-d directoryname]

SaveQueryResults saves the results of a query as an XML document whose structure corresponds to the one displayed in the search GUI and described in Query results. Saved query results can be knit with the corpus to useful effect (see Knitting and Unknitting NXT Data Files) as well as subjected to external XML-based processing.

If no output filename is indicated, the output goes to System.out. (Note that this isn't very sensible to do unless running -allatonce, because the output will just concatenate separate XML documents.) In this case, everything else that could potentially be on System.out is redirected to System.err.

If outputfilename is given, output is stored in the directory directoryname. If running -allatonce or if an observation is specified, the output ends up in the file outputfilename. Otherwise, it is stored in a set of files found by prefixing outputfilename by the name of the observation and a full stop (.).

Caution

Under cygwin, -d takes Windows-style directory naming; e.g., -d "C:" not -d "/cygdrive/c". Using the latter will create the unexpected location C:/cygdrive/c.

In distributions before 05 May 2004 (1.2.6 or earlier), the default was -allatonce, and the flag -independent was used to indicate that one observation should be processed at a time.

CountQueryResults

java CountQueryResults {-c corpus} {-q query} [[-o observation] | [-allatonce]]

CountQueryResults counts query results for an entire corpus, showing the number of matches but not the result tree. In the case of complex queries, the counts reflect the number of top level matches (i.e., matches to the first query that survive the filtering performed by the subsequent queries - matches to a subquery drop out if there are no matches for the next query). Combine CountQueryResults with command line scripting, for instance, to fill in possible attribute values from an enumerated list.

When running -allatonce or on a named observation, the result is a bare count; otherwise, it is a table containing one line per observation, with observation name, whitespace, and then the count.
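The per-observation table is easy to post-process with standard tools. A sketch that totals the counts across observations, using invented sample output:

```shell
# Sample CountQueryResults table: observation name, tab, count.
printf 'sw2005\t12\nsw2006\t7\nsw2010\t3\n' > counts.txt
# Sum the second column to get a corpus-wide total.
awk '{ total += $2 } END { print "total", total }' counts.txt
```

Totalling like this gives the same number as -allatonce would, provided the query does not draw context across observation boundaries.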

In versions before NXT-1.2.6, CountQueryResults ran -allatonce by default and a separate utility, CountOneByOne, handled the independent case.

MatchInContext

java MatchInContext {-c corpus} {-q query} [[-o observation] | [-allatonce]] [-context contextquery] [-textatt textattribute]

MatchInContext evaluates a query and prints any orthography corresponding to matches of the first variable in it, sending the results to standard output. It was developed for a set of users familiar with tgrep. contextquery is an optional additional query expressing surrounding context to be shown for matches. If it is present, then for each main query match, the context query will be evaluated, with the additional proviso that the match for the first variable of the main query must dominate (be an ancestor of) the match for the first variable of the context query. If any such match for the context query is found, then the orthography for the first variable of the first context match found will be shown, and the orthography relating to the main query will be given completely in upper case. Where the context query results in more than one match, a comment is printed to this effect. The context query must not share variable names with the main query.

By default, the utility looks for orthography in the textual content of a node. If textattribute is given, then it uses the value of this attribute for the matched node instead. This is useful for corpora where orthography is stored in attributes and for getting other kinds of information, such as part-of-speech tags.

Since not all nodes contain orthography, MatchInContext can produce matches with no text or with context but no main text. There is no clean way of knowing where to insert line breaks, speaker attributions, etc. in a general utility such as this one; for better displays write a tailored tool.

In versions before NXT-1.2.6, MatchInContext ran -allatonce and a separate utility, MatchInContextOneByOne, handled the independent case.

NGramCalc: Calculating N-Gram Sequences

java NGramCalc {-c corpus} [-q query] [-o observation] {-tag tagname} [-att attname] [-role rolename] [-n n]

Background

An n-gram is a sequence of n states in a row drawn from an enumerated list of types. For instance, consider Parker's floor state model (Journal of Personality and Social Psychology 1988). It marks spoken turns in a group discussion according to their participation in pairwise conversations. The floor states are newfloor (first to establish a new pairwise conversation), floor (in a pairwise conversation), broken (breaks a pairwise conversation), regain (re-establishes a pairwise conversation after a broken), and nonfloor (not in a pairwise conversation). The possible tri-grams of floor states are newfloor/floor/broken, newfloor/floor/floor, regain/broken/nonfloor, and so on. We usually think of n-grams as including all ways of choosing a sequence of n types, but in some models, not all of them are possible; for instance, in Parker's model, the bi-gram newfloor/newfloor can't happen. N-grams are frequently used in engineering-oriented disciplines as background information for statistical modelling, but they are sometimes used in linguistics and psychology as well. Computationalists can easily calculate n-grams by extracting data from NXT into the format for another tool, but sometimes this is inconvenient or the user who requires the n-grams may not have the correct skills to do it.

Operation

NGramCalc calculates n-grams from NXT format data and prints on standard output a table reflecting the frequencies of the resulting n-grams for the given n. The default value for n is 1 (i.e., raw frequencies). NGramCalc uses as the set of possible states the possible values of attribute for the node type tag; the attribute must be declared in the corpus metadata as enumerated. NGramCalc then determines a sequence of nodes about which to report by finding matches to the first variable of the given query and placing them in order of start time. If role is given, it then substitutes for these nodes the nodes found by tracing the first pointer found that goes from the sequenced nodes with the given role. (This is useful if the data has been annotated using values stored in an external ontology or corpus resource.) At this point, the sequence is assumed to contain nodes that contain the named attribute, and the value of this attribute is used as the node's state.

Tag is required, but query is itself optional; by default, it is the query matching all nodes of the type named in tag. Generally, the query's first variable will be of the node type specified in tag, and canonically, the query will simply filter out some nodes from the sequence. However, as long as a state can be calculated for each node in the sequence using the attribute specified, the utility will work. There is no -allatonce option; if no observation is specified, only one set of numbers is reported but the utility loads only one observation at a time when calculating them.

Examples

java NGramCalc -c METADATA -t turn -a fs -n 3 

will calculate trigrams of fs attributes of turns and output a tab-delimited table like

500	newfloor	floor	broken
0	newfloor	newfloor	newfloor

Suppose that the way that the data is set up includes an additional attribute value that we wish to skip over when calculating the tri-grams, called "continued".

java NGramCalc -c METADATA -t turn -a fs -n 3 -q '($t turn):($t@fs != "continued")'

will do this. Entries for "continued" will still occur in the output table because it is a declared value, but will have zero in the entries.
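If the zero rows are unwanted, they are easy to strip from the output table afterwards. A sketch on invented sample rows, where the first tab-delimited field is the frequency:

```shell
# Keep only rows whose frequency (field 1) is non-zero.
printf '500\tnewfloor\tfloor\tbroken\n0\tnewfloor\tnewfloor\tnewfloor\n' |
    awk -F'\t' '$1 > 0'
```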

java NGramCalc -c METADATA -t gesture-type -a name -n 3 -q '($g gest):'
     -r gest-target

will produce trigrams where the states are found by tracing the gest-target role from gest elements, which finds gesture-type elements (canonically, part of some corpus resource), and further looking at the values of their name attributes. Note that in this case, the tag type given in -t is what results from tracing the role from the query results, not the type returned in the query.

FunctionQuery: Time ordered, tab-delimited output, with aggregate functions

java FunctionQuery {-c corpus} {-q query} [-o observation] {-atts attribute_or_aggregate...}

FunctionQuery is a utility for outputting tab-delimited data. It takes all timed elements in the result of a query and puts them in order of start time. Then it outputs one line per element containing the values of the named attributes or aggregates with a tab character between each one.

The value of -atts must be a space-separated list of attribute and aggregate specifiers. If an attribute or aggregate does not exist for some matched elements, a blank tab-stop will be output for the corresponding field.

Attribute Specifiers

Attribute values can be specified using the form var@attributename (e.g., $v@label, where label is the name of the attribute). If the variable specifier (e.g., $v) is omitted, the attribute belonging to the first variable in the query (the "primary variable") is returned. If the attribute specifier (e.g., label) is omitted, then the textual content for the node will be shown. Nodes may have either direct textual content or children; in the case of children, the textual content shown will be the concatenated textual content of its descendants separated by spaces. For backwards compatibility with an older utility called SortedOutput, -text can be used instead of specifying the textual content in the list of attributes, placing it in the last field, although this is not recommended.

Aggregate Specifiers

Aggregate functions are identified by a leading '@' character. The first argument to an aggregate function is always a query to be evaluated in the context of the current result using the variable bindings from the main query. For instance, if $m has been bound in the main query to nodes of type move, the context query ($w w):($m ^ $w) will find all w nodes descended from the move corresponding to the current return value, and the context query ($g gest):($m # $g), all gest nodes that temporally overlap with it. The list of returned results for the context query are then used in the aggregation.

For the following functions, optional arguments are denoted by an equals sign followed by the default value of that argument. There are currently four aggregate functions included in FunctionQuery.

Aggregate Functions

@count(conquery)

returns the number of results from evaluating conquery

@sum(conquery, attr)

returns the sum of the values of attr for all results of conquery. attr should be a numerical attribute.

@extract(conquery, attr, n=0, last=n+1)

returns the attr attribute of the nth result of conquery evaluated in the context of query. If n is less than 0, extract returns the attr attribute of the nth last result. If last is provided, the attr value of all results whose index is at least n and less than last is returned. If last is less than 0, it will count back from the final result. If last equals zero, all items between n and the end of the result list will be returned.

@overlapduration(conquery)

returns the length of time that the results of conquery overlap with the results of the main query. For some conquery results, this number may exceed the duration of the main query result. For example, the duration of speech for all participants over a period of time may exceed the duration of the time segment if there are multiple simultaneous speakers. This can be avoided, for example, by using conquery to restrict matches to a specific agent.

Example

java FunctionQuery -c corpus -o observation -q '($m move)' 
	 -atts type nite:start nite:end '@count(($w w):$w#$m)' '$m'

will output a sorted list of moves for the observation, consisting of the type attribute, start and end times, the count of w (word) elements that overlap each move, and the textual content of the move or its children.
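Output in this shape can then be fed to ordinary text tools. A sketch that derives each element's duration from its start and end columns, on invented sample rows (type, start, end):

```shell
# Duration = end time (field 3) minus start time (field 2).
printf 'inform\t0.00\t1.50\nrequest\t1.50\t2.25\n' |
    awk -F'\t' '{ printf "%s\t%.2f\n", $1, $3 - $2 }'
```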

Indexing

java Index {-c corpus} {-q query} [-o observation] [-t tag] {-r role...}

Index modifies a corpus by adding new nodes that index the results of a query so that they can be found quickly. If observation is omitted, all observations named in the metadata file are indexed in turn. One new node is created for each query match. The new nodes have type tag, which defaults to "markable". If -r is omitted, the new node is made a parent of the match for the first unquantified variable of the query. If -r is included, then the new node will instead use the role names to point to the nodes in the n-tuple returned at the top level of the query, using the role names in the order given and the variables in the order used in the query until one of the two lists is exhausted. Index does not remove existing tags of the given type before operation so that an index can be built up gradually using several different queries.

Note that the same node can be indexed more than once, if the query returns n-tuples that involve the same node. The tool does nothing to check whether this is the case even when creating indices that are parents of existing nodes, which can lead to invalid data if you are not careful. Using roles, however, is always safe, as is using parents when the top level of the given query matches only one unquantified variable.

Note that if you want one pointer for every named variable in a simple query, or you want tree-structured indices corresponding to the results for complex queries, you can use SaveQueryResults and load the results as a coding. For cases where you could use either, the main difference is that SaveQueryResults doesn't give control over the tag name and roles.

Metadata requirements

The tool assumes that a suitable declaration for the new tag has already been added to the metadata file. It is usual to put it in a new coding, and it would be a bad idea to put it in a layer that anything points to, since no work is done to attach the indices to prospective parents or anything else besides what they index. If the indexing adds parents, then the type of the coding file (interaction or agent) must match the type of the coding file that contains the matches to the first variable. If an observation name is passed, the tool creates an index only for that one observation; if none is, it indexes each observation in the metadata file by loading one at a time (that is, there is no equivalent of -allatonce operation).

The canonical metadata form for an index file, assuming roles are used, is an interaction coding declared as follows:

<coding-file name="foo">
  <featural-layer name="baz">
      <code name="tag">
         <pointer number="1" role="role1" target="LAYER_CONTAINING_MATCHES"/>
          ...
      </code>
  </featural-layer>
</coding-file>

The name of the coding file determines the filenames where the indices get stored. The name of the featural-layer is unimportant but must be unique. The tags for the indices must not already be used in some other part of the corpus, including other indices.

Example of Indexing

To add indices that point to active sentences in the Switchboard data, add the following coding-file tag to the metadata as an interaction-coding (i.e., as a sister to the other coding file declarations).

<coding-file name="sentences">
    <featural-layer name="sentence-layer">
        <code name="sentenceindex">
            <pointer number="1" role="at"/>
        </code>
    </featural-layer>
</coding-file>

This specifies that the indices for sw2005 (for example) should go in sw2005.sentences.xml. Then, for example,

java Index -c swbd-metadata.xml -t active -q '($sent nt):($sent@cat=="S")'

After indexing,

($n nt)($i sentenceindex):($i >"at" $n)

gets the sentences.

Projecting Images Of Annotations

Sometimes even though an annotation layer draws children from some lower layer, it's useful to know what the closest correspondence is between the segments in that layer and some different lower layer. For instance, consider having both hand transcription and hand annotation for dialogue acts above it, and also ASR output with automatic dialogue act annotation on top of that. There is no relationship apart from timing between the hand and automatic dialogue acts, but to find out how well the automatic process works, it's useful to know whether it segments the hand transcribed words the same way, and with the same categories, as the hand annotation does.

ProjectImage is a tool that allows this comparison to be made. Given some source annotation that segments the data by drawing children from a lower layer, and the name of a target annotation that is defined as drawing children from a different lower layer, it creates the target annotation by adding annotations that are just like the source but with the other children. A child is inside a target segment if its timing midpoint is after the start and before the end of the source segment. If there are no such children, then the target element will be empty. ProjectImage adds a pointer from each target element back to its source element so that it's easy to check categories etc.
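The midpoint rule can be sketched in isolation. Here the segment is assumed to run from 1.0 to 2.0, and the two word timings are invented: w1's midpoint falls inside the segment, w2's does not.

```shell
# Field 2 is a child's start time, field 3 its end time; print the
# children whose midpoint falls strictly inside (s, e).
printf 'w1\t0.9\t1.3\nw2\t1.9\t2.4\n' |
    awk -F'\t' -v s=1.0 -v e=2.0 '{ m = ($2 + $3) / 2; if (m > s && m < e) print $1 }'
```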

Note

ProjectImage was committed to CVS on 21/11/2006 and will be in all subsequent NXT builds.

  1. Check out and build from CVS (or use a build if there is one post 21/11/06).

  2. Edit your metadata file and prepare the ground. You need to decide what NXT element is being projected onto which other. As an example we'll look at Named Entities on the AMI corpus: imagine we want to project manually generated NEs onto ASR output to take a look at the result. You'll already have the manual NEs and ASR transcription declarations in your metadata:

    <coding-file name="ne" path="namedEntities">
        <structural-layer draws-children-from="words-layer" name="ne-layer">
            <code name="named-entity" text-content="false">
                <pointer number="0" role="type" target="ne-types"/>
            </code>
        </structural-layer>
    </coding-file>
    
    <!-- ASR version of the words -->
    <coding-file name="asr" path="ASR">
        <time-aligned-layer name="asr-words-layer">
            <code name="asrword" text-content="true"/>
            <code name="asrsil"/>
        </time-aligned-layer>
    </coding-file>
    

    and now you need to add the projection layer into the metadata file, remembering to add a pointer from the target to source layer:

    <!-- ASR Named entities -->
    <coding-file name="ane" path="ASRnamedEntities">
        <structural-layer draws-children-from="asr-words-layer" name="asr-ne-layer">
            <code name="asr-named-entity" text-content="false">
                <pointer number="0" role="source_element" target="ne-layer"/>
                <pointer number="0" role="type" target="ne-types"/>
            </code>
        </structural-layer>
    </coding-file>
    

  3. Using a standard NXT CLASSPATH or just using the -cp argument to the java command below like this: -cp lib/nxt.jar:lib/xercesImpl.jar, run ProjectImage:

    java net.sourceforge.nite.util.ProjectImage -c /path/to/AMI-metadata.xml 
              -o ES2008a -s named-entity -t asr-named-entity
    

    The arguments to ProjectImage are:

    • -c metadata file including definition for the target annotation

    • -o Optional observation argument. If it is not given, the projection will be done for the entire corpus

    • -s source element name

    • -t target element name

The output is a (set of) standard NXT files that can be loaded with the others. To get textual output, use FunctionQuery on the target annotation resulting from running ProjectImage (see FunctionQuery).

Notes

ProjectImage can be used to project any type of data segment onto a different child layer, and so has many uses beyond the one described. The main restriction is that the segments must all use the same tag name. Although it might be more natural to define the imaging in terms of a complete NXT layer, the user would have to specify at the command line a complete mapping from source tags to target tags, which would be cumbersome. Moreover, many current segmentation layers use single tags. In future NXT versions we may consider generalizing to remove this restriction.

Reliability Testing

This section contains documentation of the facility for loading multiply-annotated data that forms the core of NXT's support for reliability tests, plus a worked example from the AMI project, kindly supplied by Vasilis Karaiskos. For more information, see the JavaDoc corresponding to the NOM loading routine for multiply-annotated data, for CountQueryMulti, and for MultiAnnotatorDisplay.

The facilities described on this page are new for NXT v 1.3.3.

Generic documentation

Many projects wish to know how well multiple human annotators agree on how to apply their coding manuals, and so they have different human annotators read the same manual and code the same data. They then need to calculate some kind of measurement statistic for the resulting agreement. This measurement can depend on the structure of the annotation (agreement on straight categorization of existing segments being simpler to measure than annotations that require the human to segment the data as well) as well as what field they are in, since statistical development for this form of measurement is still in progress, and agreed practice varies from community to community.

NXT 1.3.3 and higher provides some help for this statistical measurement, in the form of a facility that can load the data from multiple annotators into the same NOM (NXT's object model, or internal data representation, which can be used as the basis for Java applications that traverse the NOM counting things or for query execution).

This facility works as follows. The metadata specifies a relative path from itself to directories at which all coding files containing data can be found. (The data can either be all together, in which case the path is usually given on the <codings> tag, or it can be in separate directories by type, in which case the path is specified on the individual <coding-file> tags.) NXT assumes that if there is annotation available from multiple annotators, it will be found not in the specified directory itself, but in subdirectories of the directory specified, where the subdirectories are named after the annotators (or carry some other unique designators for them).

Annotation schemes often require more than one layer in the NOM representation. The loading routine takes as arguments the name of the highest layer containing multiple annotations; the name of a layer reached from that layer by child links that is common between the two annotators, or null if the annotation grounds out at signal instead; and a string to use as an attribute name in the NOM to designate the annotator for some data. Note that the use of a top layer and a common layer below it allows the program to know exactly where the multiply annotated data is - it is in the top layer plus all the layers between the two layers, but not in the common layer. (It is possible to arrange annotation schemes so that they do not fit this structure, in which case NXT will not support reliability studies on them.) The routine loads all of the versions of these multiply-annotated layers into the NOM, differentiating them by using the subdirectory name as the value for the additional attribute representing the annotator.
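The expected on-disk layout can be sketched concretely: one subdirectory per annotator underneath the directory the metadata names for the coding. The directory and annotator names below are invented.

```shell
# The metadata points at namedEntities/; each annotator's version of
# the multiply-annotated coding lives in a subdirectory named after them.
mkdir -p namedEntities/Coder1 namedEntities/Coder2
ls namedEntities
```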

NXT is agnostic as to which statistical measures are appropriate. It does not currently (June 05) implement any, but leaves users to write Java applications or sets of NXT queries that allow their chosen measures to be calculated. (Being an open source project, of course, anyone who writes such applications can add them to NXT for the benefit of others who make the same choices.) Version 1.3.3 provides two end user facilities that will be helpful for these studies, which are essentially multiple annotator versions of the GenericDisplay GUI and of CountQueryResults.

MultiAnnotatorDisplay

This is a version of the GenericDisplay that takes additional command line arguments as required by the loading routine for multiply-annotated data, and renders separate windows for each annotation for each annotator. The advantage of using the GUI is, as usual, for debugging queries, since queries can be executed, with the results highlighted on the data display.

To call the GUI:

java net.sourceforge.nite.gui.util.MultiAnnotatorDisplay -c METADATAFILE 
	          -o OBSERVATION -tl TOPLAYER [[-cl COMMONLAYER] [-a ANNOTATOR]]	

-c METADATAFILENAME names a metadata file defining the corpus to be loaded.
-tl TOPLAYER names the data layer at the top of the multiple annotations to be loaded.
-cl COMMONLAYER is required only if the multiple annotations ground out in a common layer, and names the first data layer, reached by descending from the toplayer using child links, that is common between the multiple annotations.
-a ANNOTATOR is the name of the attribute to add to the loaded data that contains the name of the subdirectory from which the annotations were obtained - that is, the unique designator for the annotation. Optional; defaults to coder.

CountQueryMulti

To call:

java CountQueryMulti -corpus METADATAFILE -query QUERY 
	   -toplayer TOPLAYER -commonlayer COMMONLAYER 
	   [[-attribute ANNOTATOR] [-observation OBSERVATION][-allatonce]]

where arguments are as for MultiAnnotatorDisplay, apart from the following (which are as for CountQueryResults):

-observation OBSERVATION: the observation whose annotations are to be loaded. Optional; if not given, all observations are processed one by one with counts given in a table.
-query QUERY: the query to be executed.
-allatonce: Optional; if used, then the entire corpus is loaded together, with output counting over the entire corpus. This option is very slow and memory-intensive, and assuming you are willing to total the results from the individual observations, is only necessary if queries draw context from outside single observations.

Example reliability study

The remainder of this section demonstrates an annotation scheme reliability test in NITE. The example queries below come from the agreement test on the named entities annotation of the AMI corpus. Six recorded meetings were annotated by two coders, whose markings were subsequently compared. The categories and attributes that come into play are the following:

named-entity: new named entities - the data for which we are doing the reliability test. These are parents of words in the transcript. They are in a layer called ne-layer.
w: the words in the transcript. They are in a layer called word-layer.
ne-type: the categories a named entity can be assigned to. They are in an ontology, with the named entities pointing to them, using the type role.
name: an attribute of a named entity type that gives the category for the named entity (e.g., timex, enamex).
coder: an attribute of a named entity, signifying who marked the entity.

Loading the data into the GUI

The tests are being carried out by loading the annotated data on the NXT display MultiAnnotatorDisplay (included in nxt_1.3.3 and above). The call can be incorporated in a shell script along with the appropriate classpaths. For example, the following is included in our multi.sh script run from the root of the NXT install (% sh multi.sh). All the CLASSPATHs should be in a single line in the actual script.

#!/bin/bash
# A Java runtime must be on the path.
# Run this script from the root of the NXT install, or edit the NXT
# variable below to contain the path to your install so that the
# script can be run from anywhere. Each CLASSPATH value must be on
# a single line.
NXT="."

# Adjust classpath for running under cygwin.
if [ $OSTYPE = 'cygwin' ]; then

export CLASSPATH=".;$NXT;$NXT/lib;$NXT/lib/nxt.jar;$NXT/lib/jdom.jar;
  $NXT/lib/JMF/lib/jmf.jar;$NXT/lib/pnuts.jar;$NXT/lib/resolver.jar; 
  $NXT/lib/xalan.jar;$NXT/lib/xercesImpl.jar;$NXT/lib/xml-apis.jar; 
  $NXT/lib/jmanual.jar;$NXT/lib/jh.jar;$NXT/lib/helpset.jar;$NXT/lib/poi.jar; 
  $NXT/lib/eclipseicons.jar;$NXT/lib/icons.jar;$NXT/lib/forms-1.0.4.jar; 
  $NXT/lib/looks-1.2.2.jar;$NXT/lib/necoderHelp.jar;$NXT/lib/videolabelerHelp.jar; 
  $NXT/lib/dacoderHelp.jar;$NXT/lib/testcoderHelp.jar"

else

export CLASSPATH=".:$NXT:$NXT/lib:$NXT/lib/nxt.jar:$NXT/lib/jdom.jar: 
  $NXT/lib/JMF/lib/jmf.jar:$NXT/lib/pnuts.jar:$NXT/lib/resolver.jar: 
  $NXT/lib/xalan.jar:$NXT/lib/xercesImpl.jar:$NXT/lib/xml-apis.jar: 
  $NXT/lib/jmanual.jar:$NXT/lib/jh.jar:$NXT/lib/helpset.jar:$NXT/lib/poi.jar: 
  $NXT/lib/eclipseicons.jar:$NXT/lib/icons.jar:$NXT/lib/forms-1.0.4.jar: 
  $NXT/lib/looks-1.2.2.jar:$NXT/lib/necoderHelp.jar:$NXT/lib/videolabelerHelp.jar: 
  $NXT/lib/dacoderHelp.jar:$NXT/lib/testcoderHelp.jar"

fi

java net.sourceforge.nite.gui.util.MultiAnnotatorDisplay -c Data/AMI/AMI-metadata.xml \
       -tl ne-layer -cl words-layer

A GUI with many windows will load (each window contains the data of one layer of data or annotation), allowing comparison between the coders' choices. In the examples below the annotators are named Coder1 and Coder2.

Selecting Search from the menu bar brings up a small GUI where queries such as the ones below can be entered. Clicking on any of the query results highlights the corresponding data in the rest of the windows (words, named entities, coders' markings, etc.). At the same time, underneath the list of matches, the query GUI expands whichever n-tuple is selected. For the low-down on the NITE query language (NiteQL), see the query language documentation or the Help menu in the query GUI.

Querying data related to a single annotator

($a named-entity) : $a@coder=="Coder1"

Gives a list of all the named entities marked by Coder1.

($w w)(exists $a named-entity) : $a@coder=="Coder1" && $a ^ $w

Gives a list of all the words marked as named entities by Coder1.

($a named-entity): $a@coder=="Coder1" :: ($w w): $a ^ $w

Gives all the named entities marked by Coder1, showing the words included in each entity.

($a named-entity)($t ne-type) : ($a >"type"^ $t) && ($t@name == "EntityType") && ($a@coder == "Coder1")

Gives the named entities of type EntityType annotated by Coder1. The entity types (and their names) to choose from can be seen in the respective window in the GUI (titled "Ontology: ne-types" in this case).

($a named-entity)($t ne-type) : ($a >"type"^ $t) && ($t@name == "EntityType") && ($a@coder == "Coder1") :: ($w w): $a ^ $w

Like the previous query, only each match also includes the words forming the entity.

($t ne-type) :: ($a named-entity) : $a@coder=="Coder1" && $a >"type"^ $t

Gives a list of all the named entity types (including the root), and for each type, the entities of that type annotated by Coder1. By writing the last term of the query as $a >"type" $t, the query will match only the bottom-level entity types (the ones used as actual tags); that is, it will display MEASURE entities but not NUMEX ones (assuming here that MEASURE is a sub-type of NUMEX).

($a named-entity)($t ne-type) : $a@coder=="Coder1" && $a >"type"^ $t :: ($w w): $a ^ $w

Like the previous query, only each match (n-tuple) also includes the words forming the entity.
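The difference between >"type" and >"type"^ in the queries above is direct pointer target versus the target plus its ancestors in the ontology. A minimal Python sketch of that closure, using illustrative type names rather than the real AMI ontology:

```python
# Toy ontology: each type maps to its parent (illustrative names only).
parents = {"MEASURE": "NUMEX", "NUMEX": "ne-root"}

def ancestors(t):
    """Return t plus every type above it in the ontology."""
    out = [t]
    while t in parents:
        t = parents[t]
        out.append(t)
    return out

# For an entity pointing at MEASURE via the "type" role:
#   $a >"type" $t   matches only $t = MEASURE (the direct target);
#   $a >"type"^ $t  matches MEASURE, NUMEX and ne-root.
print(ancestors("MEASURE"))  # ['MEASURE', 'NUMEX', 'ne-root']
```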

Querying data related to two annotators
Checking for co-extensiveness

The following examples check for agreement between the two annotators as to whether some text should be marked as a named entity:

($a named-entity)($b named-entity): $a@coder=="Coder1" && $b@coder=="Coder2" :: ($w1 w) (forall $w w) : ($a ^ $w1) && ($b ^ $w1) &&(($a ^ $w) -> ($b ^ $w)) && (($b ^ $w) -> ($a ^ $w))

Gives a list of all the co-extensive named entities between Coder1 and Coder2, along with the words forming the entities (the entities do not have to be of the same type, but they have to span exactly the same text).

($a named-entity)($b named-entity): $a@coder=="Coder1" && $b@coder=="Coder2" :: ($w1 w) (exists $w w) : ($a ^ $w1) && ($b ^ $w1) &&(($a ^ $w) -> ($b ^ $w)) && (($b ^ $w) -> ($a ^ $w))

Like the previous query, but includes named entities that are only partially co-extensive. The words showing in the query results are only the ones where the entities actually overlap.
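The forall and exists variants above amount to set comparisons over the words each entity dominates. A small Python sketch with hypothetical word IDs:

```python
# Word-ID sets dominated by one entity from each coder (hypothetical IDs).
a_words = {"w1", "w2", "w3"}   # Coder1's entity
b_words = {"w2", "w3", "w4"}   # Coder2's entity

# forall variant: ($a ^ $w) -> ($b ^ $w) and the converse hold for every
# word, i.e. the two entities dominate exactly the same word set.
fully_coextensive = a_words == b_words

# exists variant: some shared word is enough; the words reported in the
# results are the overlap.
overlap = a_words & b_words
partially_coextensive = bool(overlap)

print(fully_coextensive, partially_coextensive, sorted(overlap))
```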

($a named-entity)(forall $b named-entity)(forall $w w): $a@coder=="Coder1" && (($b@coder=="Coder2" && ($a ^ $w))->!($b ^ $w))

Gives the list of entities that only Coder1 has marked, i.e. for which there is no corresponding entity from Coder2. Switching Coder1 and Coder2 in the query gives the respective set of entities for Coder2.

($a named-entity)(forall $b named-entity)(forall $w w): $a@coder=="Coder2" && (($b@coder=="Coder1" && ($a ^ $w))->!($b ^ $w)) || $a@coder=="Coder1" && (($b@coder=="Coder2" && ($a ^ $w))->!($b ^ $w))

Like the previous query, only this time both sets of non-corresponding entities are given in one go.
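The logic of these non-correspondence queries can be pictured as a disjointness test over word sets: an entity is unmatched when it shares no word with any entity of the other coder. A sketch with hypothetical entities and word IDs:

```python
# Hypothetical entities as word-ID sets, keyed by entity id.
coder1 = {"e1": {"w1", "w2"}, "e2": {"w5"}}
coder2 = {"f1": {"w1", "w2"}, "f2": {"w8", "w9"}}

def unmatched(mine, theirs):
    """Entities in `mine` sharing no word with any entity in `theirs`."""
    return [e for e, words in mine.items()
            if all(words.isdisjoint(other) for other in theirs.values())]

print(unmatched(coder1, coder2))  # entities only the first coder marked
print(unmatched(coder2, coder1))  # and the reverse
```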

Checking for categorisation agreement

The following examples check how the two annotators agree on the categorisation of co-extensive entities:

($a named-entity)($b named-entity) ($t ne-type): $a@coder=="Coder1" && $b@coder=="Coder2" && ($a >"type" $t) && ($b >"type" $t) :: ($w1 w) (forall $w w) : ($a ^ $w1) && ($b ^ $w1) &&(($a ^ $w) -> ($b ^ $w)) && (($b ^ $w) -> ($a ^ $w))

Gives all the common named entities between Coder1 and Coder2 along with the entity type and text; the entities have to be co-extensive (fully overlapping) and of the same type.

($a named-entity)($b named-entity) ($t ne-type): $a@coder=="Coder1" && $b@coder=="Coder2" && ($a >"type" $t) && ($b >"type" $t) :: ($w1 w) (exists $w w) : ($a ^ $w1) && ($b ^ $w1) &&(($a ^ $w) -> ($b ^ $w)) && (($b ^ $w) -> ($a ^ $w))

Like the previous query, but includes partially co-extensive entities. The words showing in the query results are only the ones that actually do overlap.

($a named-entity)($b named-entity) ($t ne-type): $a@coder=="Coder1" && $b@coder=="Coder2" && ($a >"type" $t) && ($b >"type" $t) :: ($w2 w):($a ^ $w2) && ($b ^ $w2) :: ($w w):(($b ^ $w) && !($a ^ $w)) || (($a ^ $w) && !($b ^ $w))

Gives the list of entities which are the same type, but only partially co-extensive. The results include the entire set of words from both codings.

($a named-entity)($b named-entity) ($t ne-type)($t1 ne-type): $a@coder=="Coder1" && $b@coder=="Coder2" && ($a >"type" $t) && ($b >"type" $t1) && ($t != $t1) :: ($w1 w) (exists $w w) : ($a ^ $w1) && ($b ^ $w1) && (($a ^ $w) -> ($b ^ $w)) && (($b ^ $w) -> ($a ^ $w)) :: ($w2 w): ($b ^ $w2)

Gives the list of entities that are partially or fully co-extensive but for which the two coders disagree as to the type.

($a named-entity)($b named-entity)($c ne-type)($d ne-type): $a@coder=="Coder1" && $b@coder=="Coder2" && $c@name=="EntityType1" && $d@name=="EntityType2" && $a >"type"^ $c && $b >"type"^ $d :: ($w2 w): ($a ^ $w2) && ($b ^ $w2)

Gives the list of entities which are partially or fully co-extensive, and which Coder1 has marked as EntityType1 (or one of its sub-types) and Coder2 has marked as EntityType2 (or one of its sub-types). This checks for type-specific disagreements between the two coders.

($t ne-type): !($t@name=="ne-root") :: ($a named-entity)($b named-entity): $a@coder=="Coder1" && $b@coder=="Coder2" && (($a >"type"^ $t) && ($b >"type"^ $t)) :: ($w1 w) (forall $w w) : ($a ^ $w1) && ($b ^ $w1) && (($a ^ $w) -> ($b ^ $w)) && (($b ^ $w) -> ($a ^ $w))

The query creates a list of all the entity types, and slots into each entry all the fully co-extensive entities as marked by the two coders. The actual text forming each entity is also included in the results.

($t1 ne-type): !($t1@name=="ne-root") :: ($a named-entity)($b named-entity): $a@coder=="Coder1" && $b@coder=="Coder2" && (($a >"type"^ $t1) && ($b >"type"^ $t1)) :: ($w1 w) (exists $w w) : ($a ^ $w1) && ($b ^ $w1) && (($a ^ $w) -> ($b ^ $w)) && (($b ^ $w) -> ($a ^ $w))

Like the previous query, but includes partially co-extensive entities. The words showing in the query results are only the ones that actually do overlap.
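Once counts like these have been extracted, an agreement statistic can be computed outside NXT. A Python sketch of Cohen's kappa over hypothetical type pairs for co-extensive entities (a real reliability analysis must also decide how to treat entities that only one coder marked):

```python
from collections import Counter

# Type labels the two coders assigned to the same co-extensive entities.
# Hypothetical data; in practice the pairs come from query results like
# those above.
pairs = [("TIMEX", "TIMEX"), ("ENAMEX", "ENAMEX"),
         ("NUMEX", "TIMEX"), ("ENAMEX", "ENAMEX")]

n = len(pairs)
observed = sum(a == b for a, b in pairs) / n

# Chance agreement from each coder's marginal label distribution.
c1 = Counter(a for a, _ in pairs)
c2 = Counter(b for _, b in pairs)
expected = sum(c1[t] * c2[t] for t in set(c1) | set(c2)) / n ** 2

kappa = (observed - expected) / (1 - expected)
print(observed, kappa)
```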