The fundamental tool for analysis in NXT is the NXT Query Language, used through the command line tools. Query development is often most convenient in the GUI tools, but corpus-wide analysis will normally require the command line tools. Some helper tools exist for special cases like the study of reliability.
This section describes the various command line utilities that
are useful for searching a corpus using NXT's query language.
Command line examples below are given in bash syntax.
It is possible to run NXT command line utilities from the DOS command
line without installing anything further on Windows, but many users
will find it easier to install cygwin,
which comes with a bash that runs under Windows. The command
line tools can be found in the XXXX directory of the NXT source,
and are useful code examples.
Before using any of the utilities, you need to set your classpath and perhaps consider a few things about your local environment.
The command line utilities require the CLASSPATH environment variable to be set up so that the shell can find the software. Assuming $NXT is set to the top level directory in which the software is installed, this can be done as follows (each export statement must be on a single line):
if [ $OSTYPE = 'cygwin' ]; then
    export CLASSPATH=".;$NXT/lib;$NXT/lib/nxt.jar;$NXT/lib/jdom.jar;$NXT/lib/xalan.jar;$NXT/lib/xercesImpl.jar;$NXT/lib/xml-apis.jar;$NXT/lib/jmanual.jar;$NXT/lib/jh.jar;$NXT/lib/helpset.jar;$NXT/lib/poi.jar"
else
    export CLASSPATH=".:$NXT/lib:$NXT/lib/nxt.jar:$NXT/lib/jdom.jar:$NXT/lib/xalan.jar:$NXT/lib/xercesImpl.jar:$NXT/lib/xml-apis.jar:$NXT/lib/jmanual.jar:$NXT/lib/jh.jar:$NXT/lib/helpset.jar:$NXT/lib/poi.jar"
fi
This is not the full classpath that is needed for running NXT GUIs, but contains all of the methods used by the command line tools.
It is possible instead to specify the classpath on each individual call to java using the -cp argument.
You'll need to be careful to use single quotes at shell level and double quotes within queries - although we've found one shell environment that requires the quotes the other way around. Getting the quoting to work correctly in a shell script is difficult even for long-time Unix users. There is an example shell script that shows complex use of quoting in the sample directory of the NXT distribution called "quoting-example.sh".
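For example, with echo standing in for the java call, a query containing double-quoted strings can be wrapped in single quotes at shell level so that it reaches the program verbatim (the query here is the sentence-indexing example used later in this section):

```shell
# Single quotes at shell level; double quotes inside the query.
# The shell passes the string through untouched, which is what the
# -query argument of the NXT tools needs to receive.
query='($sent nt):($sent@cat=="S")'
echo "$query"
```

Had the query been wrapped in double quotes at shell level, the shell would have tried to expand $sent as a variable.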
Don't forget that you can use redirection to divert warning and log messages:
java CountQueryResults -corpus swbd-metadata.xml -query '($n nt):' 2> logfile
Diverting to /dev/null gets rid of them without the need to save to a file.
It is possible to increase the amount of memory available to java for processing, and depending on the machine set up, this may speed things up. This can be done by using flags to java, e.g.
java -Xincgc -Xms127m -Xmx512m -Xfuture CountQueryResults ...
or by editing the java calls in any of the existing scripts. The flags mean:
Java Arguments Controlling Memory Use
-Xincgc : use incremental garbage collection to get back unused memory
-Xms size : initial memory heap size
-Xmx size : maximum memory heap size
The best choice of values will depend on your local environment.
Where possible, the command line tools use the same argument structure. The common arguments are as follows.
Common Arguments for Command Line Tools
-corpus corpus : the path and filename specifying the location of the metadata file
-observation obs : the name of an observation. If this argument is not given, then the tools process all of the observations in the corpus
-query query : a query expressed in NXT's query language
-allatonce : an instruction to load all of the observations for a corpus at the same time. This can require a great deal of memory and slow down processing, but is necessary if queries draw context from outside single observations.
java SaveQueryResults {-c corpus} {-q query} [[-o observation] | [-allatonce]] [-f outputfilename] [-d directoryname]
SaveQueryResults
saves the results of a query as
an XML document whose structure corresponds to the one displayed in the search GUI and described in Query results.
Saved query results can be knit with the corpus to useful
effect (see Knitting and Unknitting NXT Data Files) as well as subjected to external
XML-based processing.
If no output filename is indicated, the output goes to System.out. (Note that this isn't very sensible to do unless running -allatonce, because the output will just concatenate separate XML documents.) In this case, everything else that could potentially be on System.out is redirected to System.err.
If outputfilename is given, output is stored in the directory directoryname. If running -allatonce or if an observation is specified, the output ends up in the file outputfilename. Otherwise, it is stored in a set of files found by prefixing outputfilename by the name of the observation and a full stop (.).
Under cygwin, -d takes Windows-style directory naming; e.g., -d "C:" not -d "/cygdrive/c". Using the latter will create the unexpected location C:/cygdrive/c.
In distributions before 05 May 2004 (1.2.6 or earlier), the default was -allatonce, and the flag -independent was used to indicate that one observation should be processed at a time.
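The per-observation file naming described above can be sketched in bash; the observation names and output filename here are hypothetical, and the loop simply shows what filenames SaveQueryResults would produce when processing observations one at a time:

```shell
# Hypothetical output filename and observation names; each
# observation's results go to <observation>.<outputfilename>.
outputfilename="results.xml"
names=""
for obs in sw2005 sw2010; do
  names="$names${names:+ }${obs}.${outputfilename}"
done
echo "$names"   # sw2005.results.xml sw2010.results.xml
```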
java CountQueryResults {-c corpus} {-q query} [[-o observation] | [-allatonce]]
CountQueryResults counts query results for an entire corpus, showing the number of matches but not the result tree. In the case of complex queries, the counts reflect the number of top level matches (i.e., matches to the first query that survive the filtering performed by the subsequent queries - matches to a subquery drop out if there are no matches for the next query). Combine CountQueryResults with command line scripting, for instance, to fill in possible attribute values from an enumerated list.
When running -allatonce or on a named observation, the result is a bare count; otherwise, it is a table containing one line per observation, with observation name, whitespace, and then the count.
In versions before NXT-1.2.6, CountQueryResults runs -allatonce and a separate utility, CountOneByOne, handles the independent case.
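The per-observation table is easy to post-process with standard shell tools. As a sketch, with hypothetical observation names and counts standing in for real CountQueryResults output, awk can total the counts across observations:

```shell
# Hypothetical CountQueryResults table: observation name,
# whitespace, count -- one line per observation.
counts="sw2005 41
sw2010 17
sw2020 9"
# Sum the second column to get a corpus-wide total.
total=$(printf '%s\n' "$counts" | awk '{sum += $2} END {print sum}')
echo "$total"   # prints 67
```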
java MatchInContext {-c corpus} {-q query} [[-o observation] | [-allatonce]] [-context contextquery] [-textatt textattribute]
MatchInContext evaluates a query and prints any orthography
corresponding to matches of the first variable in it, sending the
results to standard output. It was developed for a set of users
familiar with tgrep. contextquery is an optional additional query expressing surrounding context to be shown for matches. If it is present, for each main query match, the context
query will be evaluated, with the additional proviso that the match
for the first variable of the main query must dominate (be an ancestor
of) the match for the first variable of the context query. If any
such match for the context query is found, then the orthography for the first variable of the first match found
will be shown, and the orthography
relating to the main query will be given completely in upper case.
Where the context query results in more than one match, a comment
is printed to this effect.
The context query must not share variable names with the main query.
By default, the utility looks for orthography in the textual content
of a node. If textattribute is given, then it uses the value of this attribute for the matched node instead.
This is useful for corpora where orthography is stored in attributes
and for getting other kinds of information, such as part-of-speech
tags.
Since not all nodes contain orthography, MatchInContext can produce matches with no text or with context but no main text. There is no clean way of knowing where to insert line breaks, speaker attributions, etc. in a general utility such as this one; for better displays, write a tailored tool.
In versions before NXT-1.2.6, MatchInContext runs -allatonce and a separate utility, MatchInContextOneByOne, handles the independent case.
java NGramCalc {-c corpus} [-q query] [-o observation] {-tag tagname} [-att attname] [-role rolename] [-n n]
An n-gram is a sequence of n states in a row drawn from an enumerated list of types. For instance, consider Parker's floor state model (Journal of Personality and Social Psychology 1988). It marks spoken turns in a group discussion according to their participation in pairwise conversations. The floor states are newfloor (first to establish a new pairwise conversation), floor (in a pairwise conversation), broken (breaks a pairwise conversation), regain (re-establishes a pairwise conversation after a broken), and nonfloor (not in a pairwise conversation). The possible tri-grams of floor states are newfloor/floor/broken, newfloor/floor/floor, regain/broken/nonfloor, and so on. We usually think of n-grams as including all ways of choosing a sequence of n types, but in some models, not all of them are possible; for instance, in Parker's model, the bi-gram newfloor/newfloor can't happen. N-grams are frequently used in engineering-oriented disciplines as background information for statistical modelling, but they are sometimes used in linguistics and psychology as well. Computationalists can easily calculate n-grams by extracting data from NXT into the format for another tool, but sometimes this is inconvenient or the user who requires the n-grams may not have the correct skills to do it.
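As a sketch of the underlying computation, trigram frequencies over a toy sequence of Parker floor states (the sequence is invented for illustration) can be counted with awk by sliding a window of three states along the sequence:

```shell
# A toy sequence of floor states, one state per line.
states="newfloor
floor
floor
broken
nonfloor"
# awk stores the sequence, then counts every run of three
# consecutive states; output is "count trigram", sorted.
trigrams=$(printf '%s\n' "$states" | awk '
  { seq[NR] = $1 }
  END {
    for (i = 1; i <= NR - 2; i++)
      count[seq[i] " " seq[i+1] " " seq[i+2]]++
    for (t in count)
      print count[t], t
  }' | sort)
echo "$trigrams"
```

Unlike NGramCalc, this sketch only reports trigrams that actually occur; NGramCalc also prints zero entries for declared attribute values that never appear.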
NGramCalc calculates n-grams from NXT format data and prints
on standard output a table reflecting the frequencies
of the resulting n-grams for the given n.
The default value for n
is 1 (i.e., raw
frequencies). NGramCalc uses as the set of possible states
the possible values of attribute
for
the node type tag
; the attribute must
be declared in the corpus metadata as enumerated.
NGramCalc then determines a sequence of nodes about which to report
by finding matches to the first variable of the given
query
and placing them in order of start
time. If role
is given, it then substitutes
for these nodes the nodes found by tracing the first pointer found
that goes from the sequenced nodes with the given role. (This is useful
if the data has been annotated using values stored in an external
ontology or corpus resource.) At this point, the sequence is
assumed to contain nodes that contain the named attribute, and the
value of this attribute is used as the node's state.
Tag
is required, but
query
is itself optional; by default, it is
the query matching all nodes of the type named in tag. Generally, the query's first variable
will be of the node type specified in tag
,
and canonically, the query will simply filter out some nodes from the
sequence. However, as long as a state can be calculated for each
node in the sequence using the attribute specified, the utility will work.
There is no -allatonce
option; if no
observation
is specified, only one set of numbers is reported but the utility
loads only one observation at a time when calculating them.
java NGramCalc -c METADATA -t turn -a fs -n 3
will calculate trigrams of fs attributes of turns and output a tab-delimited table like
500	newfloor floor broken
0	newfloor newfloor newfloor
Suppose that the way that the data is set up includes an additional attribute value that we wish to skip over when calculating the tri-grams, called "continued".
java NGramCalc -c METADATA -t turn -a fs -n 3 -q '($t turn):($t@fs != "continued")'
will do this. Entries for "continued" will still occur in the output table because it is a declared value, but will have zero in the entries.
java NGramCalc -c METADATA -t gesture-type -a name -n 3 -q '($g gest):' -r gest-target
will produce trigrams where the states are found by tracing the
gest-target role from gest elements, which finds gesture-type
elements (canonically, part of some corpus resource), and further
looking at the values of their name attributes. Note that in this
case, the tag type given in -t
is what results from
tracing the role from the query results, not the type returned in the
query.
java FunctionQuery {-c corpus} {-q query} [-o observation] {-att attribute_or_aggregate ...}
FunctionQuery is a utility for outputting tab-delimited data. It takes all elements resulting from a query, as long as they are timed, and puts them in order of start time. Then it outputs one line per element containing the values of the named attributes or aggregates with a tab character between each one.
The value of -atts
must be a space-separated list
of attribute and aggregate specifiers. If an
attribute or aggregate does not exist for some
matched elements, a blank tab-stop will be output for the corresponding
field.
Attribute values can be specified using the form var@attributename (e.g., $v@label, where label is the name of the attribute). If the variable specifier (e.g., $v) is omitted, the
attribute belonging to the first variable in the query (the "primary
variable") is returned. If the attribute specifier
(e.g., label) is omitted, the textual content for the node will be shown. Nodes may have either
direct textual content or children; in the case of children, the textual
content shown will be the concatenated textual content of its
descendants separated by spaces. For backwards compatibility with an older utility called SortedOutput, instead of specifying it in the
list of attributes, -text
can be used to place this
textual content in the last field, although this is not recommended.
Aggregate functions are identified by a leading '@' character.
The first argument to an aggregate function is always a query
to be evaluated in the context of the current result using the variable
bindings from the main query. For instance, if $m has been bound in the main query to nodes of type move, the context query ($w w):($m ^ $w) will find all w nodes descended from the move corresponding to the current return value, and the context query ($g gest):($m # $g), all gest nodes that temporally overlap with it. The list of returned results for the context query is then used in the aggregation.
For the following functions, optional arguments are denoted by an equals sign followed by the default value of that argument. There are currently four aggregate functions included in FunctionQuery.
Aggregate Functions
@count(conquery)
returns the number of results from evaluating conquery
@sum(conquery, attr)
returns the sum of the values of attr for all results of conquery. attr should be a numerical attribute.
@extract(conquery, attr, n=0, last=n+1)
returns the attr attribute of the nth result of conquery evaluated in the context of the query. If n is less than 0, extract returns the attr attribute of the nth last result. If last is provided, the attr value of all results whose index is at least n and less than last is returned. If last is less than 0, it will count back from the final result. If last equals zero, all items between n and the end of the result list will be returned.
@overlapduration(conquery)
returns the length of time that the results of conquery overlap with the results of the main query. For some conquery results, this number may exceed the duration of the main query result. For example, the duration of speech for all participants over a period of time may exceed the duration of the time segment if there are multiple simultaneous speakers. This can be avoided, for example, by using conquery to restrict matches to a specific agent.
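The index handling of @extract can be sketched in bash. The result list and the extract helper function below are purely illustrative (not part of NXT); they mimic the rules just described: n defaults to 0, last defaults to n+1, negative values count back from the final result, and last equal to zero means "to the end of the list".

```shell
# Hypothetical list of context-query results.
results=(a b c d e)
len=${#results[@]}
extract() {
  local n=$1 last=$2
  if [ "$n" -lt 0 ]; then n=$((len + n)); fi          # negative n: count from end
  if [ -z "$last" ]; then last=$((n + 1)); fi          # default: item n only
  if [ "$last" -lt 0 ]; then last=$((len + last)); fi  # negative last: count back
  if [ "$last" -eq 0 ]; then last=$len; fi             # zero: to the end
  echo "${results[@]:n:last-n}"
}
extract 1 3   # prints "b c"
```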
java FunctionQuery -c corpus -o observation -q '($m move)' -atts type nite:start nite:end '@count(($w w):$w#$m)' '$m'
will output a sorted list of moves for the observation, consisting of the type attribute, start and end times, the count of w elements (words) that overlap each move, and the textual content of the move and its children.
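Output in this tab-delimited form is convenient for further shell processing. As a sketch, with invented lines standing in for real FunctionQuery output (type, start, end, word count, text), awk can derive each move's duration from the start and end times:

```shell
# Hypothetical FunctionQuery output, tab-delimited, one move per line.
moves=$(printf 'inform\t0.00\t2.50\t7\thello there\nrequest\t2.50\t3.25\t3\tpass it\n')
# Print type and duration (end minus start) for each move.
durations=$(printf '%s\n' "$moves" | awk -F'\t' '{printf "%s %.2f\n", $1, $3 - $2}')
echo "$durations"
```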
java Index {-c corpus} {-q query} [-o observation] [-t tag] {-r role ...}
Index modifies a corpus by adding new nodes that index
the results of a query so that they can be found quickly.
If observation
is omitted,
all observations named in the metadata file are indexed in turn.
One new node is created for each query match. The new
nodes have type tag
,
which defaults to "markable".
If -r is omitted, the new node is made a parent of the match for
the first unquantified variable of the query. If -r is included,
then the new node will instead use the role names to point
to the nodes in the n-tuple returned at the top level of the
query, using the role names in the order given and the
variables in the order used in the query until one of the two
lists is exhausted.
Index does not remove existing tags of the given type before operation so that an index can be built up gradually using several different queries.
queries.
Note that the same node can be indexed more than once, if the query returns n-tuples that involve the same node. The tool does nothing to check whether this is the case even when creating indices that are parents of existing nodes, which can lead to invalid data if you are not careful. Using roles, however, is always safe, as is using parents when the top level of the given query matches only one unquantified variable.
Note that if you want one pointer for every named variable in a simple query, or you want tree-structured indices corresponding to the results for complex queries, you can use SaveQueryResults and load the results as a coding. For cases where you could use either, the main difference is that SaveQueryResults doesn't give control over the tag name and roles.
The tool assumes that a suitable declaration for the new tag has already been added into the metadata file. It is usual to put it in a new coding, and it would be a bad idea to put it in a layer that anything points to,
since no work is done to attach the indices to prospective parents or
anything else besides what they index.
If the indexing adds parents, then the type of the coding file
(interaction or agent) must match the type of the coding file
that contains the matches to the first variable.
If an observation name is passed, it creates an index only for the one
observation; if none is, it indexes each observation in the metadata
file by loading one at a time (that is, there is no equivalent to
-allatonce
operation).
The canonical metadata form for an index file, assuming roles are used, is an interaction coding declared as follows:
<coding-file name="foo">
  <featural-layer name="baz">
    <code name="tag">
      <pointer number="1" role="role1" target="LAYER_CONTAINING_MATCHES"/>
      ...
    </code>
  </featural-layer>
</coding-file>
The name of the coding file determines the filenames where the indices get stored. The name of the featural-layer is unimportant but must be unique. The tags for the indices must not already be used in some other part of the corpus, including other indices.
To add indices that point to active sentences in the Switchboard data, add the
following coding-file
tag to the metadata as an interaction-coding
(i.e., as a sister to the other coding file declarations).
<coding-file name="sentences">
  <featural-layer name="sentence-layer">
    <code name="sentenceindex">
      <pointer number="1" role="at"/>
    </code>
  </featural-layer>
</coding-file>
This specifies that the indices for sw2005 (for example) should go in sw2005.sentences.xml. Then, for example,
java Index -c swbd-metadata.xml -t active -q '($sent nt):($sent@cat=="S")'
After indexing,
($n nt)($i sentenceindex):($i >"at" $n)
gets the sentences.
Sometimes even though an annotation layer draws children from some lower layer, it's useful to know what the closest correspondence is between the segments in that layer and some different lower layer. For instance, consider having both hand transcription and hand annotation for dialogue acts above it, and also ASR output with automatic dialogue act annotation on top of that. There is no relationship apart from timing between the hand and automatic dialogue acts, but to find out how well the automatic process works, it's useful to know whether it segments the hand transcribed words the same way, and with the same categories, as the hand annotation does.
ProjectImage
is a tool that allows this comparison to be made. Given some source annotation
that segments the data by drawing children from a lower layer, and the name of a target annotation that is
defined as drawing children from a different lower layer, it creates the target annotation by adding annotations
that are just like the source but with the other children. A child is inside a target segment if its timing midpoint
is after the start and before the end of the source segment. If there are no such children, then the target
element will be empty. ProjectImage
adds a pointer from each target element back to
its source element so that it's easy to check categories etc.
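The midpoint rule can be sketched with awk over invented timings: a hypothetical source segment from 1.0 to 2.0 and three candidate child words, where a word belongs to the projected segment only if the midpoint of its start and end times falls inside the segment.

```shell
# Hypothetical source segment boundaries.
seg_start=1.0
seg_end=2.0
# Hypothetical child words: id, start time, end time.
words="w1 0.8 1.1
w2 1.4 1.9
w3 1.9 2.4"
# Keep words whose timing midpoint is after the start and before
# the end of the segment (w1's midpoint 0.95 and w3's 2.15 fall
# outside; w2's 1.65 falls inside).
inside=$(printf '%s\n' "$words" | awk -v s="$seg_start" -v e="$seg_end" '
  { mid = ($2 + $3) / 2; if (mid > s && mid < e) print $1 }')
echo "$inside"
```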
ProjectImage was committed to CVS on 21/11/2006 and will be in all subsequent NXT builds.
Checkout and build from CVS (or use a build if there is one post 21/11/06).
Edit your metadata file and prepare the ground. You need to decide what NXT element is being projected onto which other. As an example we'll look at Named Entities on the AMI corpus: imagine we want to project manually generated NEs onto ASR output to take a look at the result. You'll already have the manual NEs and ASR transcription declarations in your metadata:
<coding-file name="ne" path="namedEntities">
  <structural-layer draws-children-from="words-layer" name="ne-layer">
    <code name="named-entity" text-content="false">
      <pointer number="0" role="type" target="ne-types"/>
    </code>
  </structural-layer>
</coding-file>
<!-- ASR version of the words -->
<coding-file name="asr" path="ASR">
  <time-aligned-layer name="asr-words-layer">
    <code name="asrword" text-content="true"/>
    <code name="asrsil"/>
  </time-aligned-layer>
</coding-file>
and now you need to add the projection layer into the metadata file, remembering to add a pointer from the target to source layer:
<!-- ASR Named entities -->
<coding-file name="ane" path="ASRnamedEntities">
  <structural-layer draws-children-from="asr-words-layer" name="asr-ne-layer">
    <code name="asr-named-entity" text-content="false">
      <pointer number="0" role="source_element" target="ne-layer"/>
      <pointer number="0" role="type" target="ne-types"/>
    </code>
  </structural-layer>
</coding-file>
Using a standard NXT CLASSPATH, or passing the -cp argument to the java command (e.g., -cp lib/nxt.jar:lib/xercesImpl.jar), run ProjectImage:
java net.sourceforge.nite.util.ProjectImage -c /path/to/AMI-metadata.xml -o ES2008a -s named-entity -t asr-named-entity
The arguments to ProjectImage are:
-c : metadata file including the definition for the target annotation
-o : optional observation argument. If it's not there, the projection will be done for the entire corpus
-s : source element name
-t : target element name
The output is a (set of) standard NXT files that can be loaded with the others. To get textual output, use FunctionQuery
on the target annotation resulting from running ProjectImage
(see FunctionQuery).
ProjectImage
can be used to project any type of data segment onto a different child layer, and so has many uses beyond the one described.
The main restriction is that the segments must all use the same tag name. Although it might be more natural to define the imaging in terms
of a complete NXT layer, the user would have to specify at the command line a complete mapping from source tags to target tags, which
would be cumbersome. Moreover, many current segmentation layers use single tags. In future NXT versions we may consider generalizing
to remove this restriction.
This section contains documentation of the facility for loading multiply-annotated data that forms the core of NXT's support for
reliability tests, plus a worked example from the AMI project, kindly supplied by Vasilis Karaiskos. For more information,
see the JavaDoc corresponding to the NOM loading routine for multiply-annotated data, for CountQueryMulti
, and for
MultiAnnotatorDisplay
.
The facilities described on this page are new for NXT v 1.3.3.
Many projects wish to know how well multiple human annotators agree on how to apply their coding manuals, and so they have different human annotators read the same manual and code the same data. They then need to calculate some kind of measurement statistic for the resulting agreement. This measurement can depend on the structure of the annotation (agreement on straight categorization of existing segments being simpler to measure than annotations that require the human to segment the data as well) as well as what field they are in, since statistical development for this form of measurement is still in progress, and agreed practice varies from community to community.
NXT 1.3.3 and higher provides some help for this statistical measurement, in the form of a facility that can load the data from multiple annotators into the same NOM (NXT's object model, or internal data representation, which can be used as the basis for Java applications that traverse the NOM counting things or for query execution).
This facility works as follows. The metadata specifies a relative path from itself to directories at which all coding files
containing data can be found. (The data can either be all together, in which case the path is usually given on the
<codings>
tag,
or it can be in separate directories by type, in which case the path is specified on the individual
<coding-file>
tags.) NXT assumes that if there is annotation available from multiple
annotators, it will be found not in the specified directory itself, but in subdirectories of the directory specified, where the subdirectories are named after the annotators (or given some other unique designators).
Annotation schemes often require more than one layer in the NOM representation. The loading routine takes as arguments the
name of the highest layer containing multiple annotations; the name of a layer reached from that layer by child links
that is common between the two annotators, or null if the annotation grounds out at signal instead; and a string to use
as an attribute name in the NOM to designate the annotator for some data. Note that the use of a top layer and a
common layer below it allows the program to know exactly where the multiply annotated data is - it is in the top layer plus
all the layers between the two layers, but not in the common layer. (It is possible to arrange annotation schemes so
that they do not fit this structure, in which case, NXT will not support reliability studies on them.) The routine
loads all of the versions of these multiply-annotated layers into the NOM, differentiating them by using the subdirectory
name as the value for the additional attribute representing the annotator.
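The expected on-disk layout can be sketched as follows; the directory and file names here are hypothetical (AMI-style), with the metadata pointing at the namedEntities directory and each annotator's files living in a subdirectory named after them:

```shell
# Build a hypothetical multiply-annotated layout in a temp directory:
#   namedEntities/Coder1/ES2008a.ne.xml
#   namedEntities/Coder2/ES2008a.ne.xml
base=$(mktemp -d)
for coder in Coder1 Coder2; do
  mkdir -p "$base/namedEntities/$coder"
  touch "$base/namedEntities/$coder/ES2008a.ne.xml"
done
ls "$base/namedEntities"   # prints the annotator subdirectories
```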
NXT is agnostic as to which statistical measures are appropriate. It does not currently (June 05) implement any, but leaves
users to write Java applications or sets of NXT queries that allow their chosen measures to be calculated.
(Being an open source project, of course, anyone who writes such applications can add them to NXT for the benefit of others
who make the same choices.) Version 1.3.3 provides two end user facilities that will be helpful for these studies,
which are essentially multiple annotator versions of the GenericDisplay GUI and of CountQueryResults.
This is a version of the GenericDisplay
that takes additional command line arguments as required by the loading
routine for multiply-annotated data, and renders separate windows for each annotation for each annotator.
The advantage of using the GUI is, as usual, for debugging queries, since queries can be executed, with the results
highlighted on the data display.
To call the GUI:
java net.sourceforge.nite.gui.util.MultiAnnotatorDisplay -c METADATAFILE -o OBSERVATION -tl TOPLAYER [[-cl COMMONLAYER] [-a ANNOTATOR]]
-c METADATAFILENAME : names a metadata file defining the corpus to be loaded.
-tl TOPLAYER : names the data layer at the top of the multiple annotations to be loaded.
-cl COMMONLAYER : required only if the multiple annotations ground out in a common layer; names the first data layer, reached by descending from the top layer using child links, that is common between the multiple annotations.
-a ANNOTATOR : the name of the attribute to add to the loaded data that contains the name of the subdirectory from which the annotations were obtained - that is, the unique designator for the annotation. Optional; defaults to coder.
To call:
java CountQueryMulti -corpus METADATAFILE -query QUERY -toplayer TOPLAYER -commonlayer COMMONLAYER [[-attribute ANNOTATOR] [-observation OBSERVATION] [-allatonce]]
where arguments are as for MultiAnnotatorDisplay
, apart from the following (which are as for
CountQueryResults
):
-observation OBSERVATION : the observation whose annotations are to be loaded. Optional; if not given, all observations are processed one by one with counts given in a table.
-query QUERY : the query to be executed.
-allatonce : Optional; if used, then the entire corpus is loaded together, with output counting over the entire corpus. This option is very slow and memory-intensive, and assuming you are willing to total the results from the individual observations, is only necessary if queries draw context from outside single observations.
The remainder of this section demonstrates an annotation scheme reliability test in NITE. The example queries below come from the agreement test on the named entities annotation of the AMI corpus. Six recorded meetings were annotated by two coders, whose markings were then compared. The categories and attributes that come into play are the following:
named-entity : new named entities - the data for which we are doing the reliability test. These are parents of words in the transcript. They are in a layer called ne-layer.
w : the words in the transcript. They are in a layer called word-layer.
ne-type : the categories a named entity can be assigned to. They are in an ontology, with the named entities pointing to them, using the type role.
name : an attribute of a named entity type that gives the category for the named entity (e.g., timex, enamex).
coder : an attribute of a named entity, signifying who marked the entity.
The tests are carried out by loading the annotated data in the NXT display MultiAnnotatorDisplay
(included in nxt_1.3.3 and above). The call can be incorporated in a shell script along with the appropriate
classpaths. For example, the following is included in our multi.sh
script run from the root of the NXT install
(% sh multi.sh
). All the CLASSPATH
s should be in a single line in the actual script.
#!/bin/bash
# Note that a Java runtime should be on the path.
# The current directory should be the root of the nxt install,
# unless you edit this variable to contain the path to your install;
# then you can run from anywhere. CLASSPATH statements need to be
# on a single line.
NXT="."
# Adjust classpath for running under cygwin.
if [ $OSTYPE = 'cygwin' ]; then
    export CLASSPATH=".;$NXT;$NXT/lib;$NXT/lib/nxt.jar;$NXT/lib/jdom.jar;$NXT/lib/JMF/lib/jmf.jar;$NXT/lib/pnuts.jar;$NXT/lib/resolver.jar;$NXT/lib/xalan.jar;$NXT/lib/xercesImpl.jar;$NXT/lib/xml-apis.jar;$NXT/lib/jmanual.jar;$NXT/lib/jh.jar;$NXT/lib/helpset.jar;$NXT/lib/poi.jar;$NXT/lib/eclipseicons.jar;$NXT/lib/icons.jar;$NXT/lib/forms-1.0.4.jar;$NXT/lib/looks-1.2.2.jar;$NXT/lib/necoderHelp.jar;$NXT/lib/videolabelerHelp.jar;$NXT/lib/dacoderHelp.jar;$NXT/lib/testcoderHelp.jar"
else
    export CLASSPATH=".:$NXT:$NXT/lib:$NXT/lib/nxt.jar:$NXT/lib/jdom.jar:$NXT/lib/JMF/lib/jmf.jar:$NXT/lib/pnuts.jar:$NXT/lib/resolver.jar:$NXT/lib/xalan.jar:$NXT/lib/xercesImpl.jar:$NXT/lib/xml-apis.jar:$NXT/lib/jmanual.jar:$NXT/lib/jh.jar:$NXT/lib/helpset.jar:$NXT/lib/poi.jar:$NXT/lib/eclipseicons.jar:$NXT/lib/icons.jar:$NXT/lib/forms-1.0.4.jar:$NXT/lib/looks-1.2.2.jar:$NXT/lib/necoderHelp.jar:$NXT/lib/videolabelerHelp.jar:$NXT/lib/dacoderHelp.jar:$NXT/lib/testcoderHelp.jar"
fi
java net.sourceforge.nite.gui.util.MultiAnnotatorDisplay -c Data/AMI/AMI-metadata.xml -tl ne-layer -cl words-layer
A GUI with multiple windows will load, one for each layer of data and annotation, allowing comparison between the choices of the coders. In the examples below the annotators are named Coder1 and Coder2.
A search menu is available; when a result is chosen in the query window, each element of the matching n-tuple is selected in the display. For the low-down on the NITE query language (NiteQL), look at the
query language documentation or the Help menu in the query GUI.
($a named-entity) : $a@coder=="Coder1"
Gives a list of all the named entities marked by Coder1.
($w w)(exists $a named-entity) : $a@coder=="Coder1" && $a ^ $w
Gives a list of all the words marked as named entities by Coder1.
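Queries like these contain both double quotes and $-variables, so when one is passed on a command line it is best wrapped in single quotes at the shell level, leaving the double quotes inside untouched (as noted earlier, one shell environment has been found that wants the quotes the other way around). A minimal quoting sketch:

```shell
#!/bin/bash
# Single quotes at shell level protect the double quotes and the
# $-variables inside the query from shell expansion.
QUERY='($w w)(exists $a named-entity) : $a@coder=="Coder1" && $a ^ $w'
echo "$QUERY"
```

Printed back, the query reaches the receiving program exactly as written.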
($a named-entity): $a@coder=="Coder1" :: ($w w): $a ^ $w
Gives all the named entities marked by Coder1, showing the words included in each entity.
($a named-entity)($t ne-type) : ($a >"type"^ $t) && ($t@name == "EntityType") && ($a@coder == "Coder1")
Gives the named entities of type EntityType
annotated by
Coder1
. The entity types (and their names) to choose from can be seen in the
respective window in the GUI (titled "Ontology: ne-types" in this case).
($a named-entity)($t ne-type) : ($a >"type"^ $t) && ($t@name == "EntityType") && ($a@coder == "Coder1") ::
($w w): $a ^ $w
Like the previous query, only each match also includes the words forming the entity.
($t ne-type) :: ($a named-entity) : $a@coder=="Coder1" && $a >"type"^ $t
Gives a list of all the named entity types (including root
), and for each type,
the entities of that type annotated by Coder1
. By writing the last term of the query
as $a >"type" $t
, the query will match only the bottom-level entity types (the ones used as actual tags);
that is, it will display MEASURE
entities, but not NUMEX
ones (assuming here that MEASURE
is a sub-type of NUMEX
).
($a named-entity)($t ne-type) : $a@coder=="Coder1" && $a >"type"^ $t ::
($w w): $a ^ $w
Like the previous query, only each match (n-tuple) also includes the words forming the entity.
The following examples check for agreement between the two annotators as to whether some text should be marked as a named entity:
($a named-entity)($b named-entity): $a@coder=="Coder1" && $b@coder=="Coder2" ::
($w1 w) (forall $w w) : ($a ^ $w1) && ($b ^ $w1) &&(($a ^ $w) -> ($b ^ $w)) && (($b ^ $w) -> ($a ^ $w))
Gives a list of all the co-extensive named entities between Coder1
and
Coder2
along with the words forming the entities (the entities do not have to be of
the same type, but they have to span exactly the same text).
($a named-entity)($b named-entity): $a@coder=="Coder1" && $b@coder=="Coder2" ::
($w1 w) (exists $w w) : ($a ^ $w1) && ($b ^ $w1) &&(($a ^ $w) -> ($b ^ $w)) && (($b ^ $w) -> ($a ^ $w))
Like the previous query, but includes named entities that are only partially co-extensive. The words showing in the query results are only the ones where the entities actually overlap.
($a named-entity)(forall $b named-entity)(forall $w w): $a@coder=="Coder1" && (($b@coder=="Coder2" &&
($a ^ $w))->!($b ^ $w))
Gives the list of entities that only Coder1
has marked, i.e. there is no
corresponding entity in Coder2
. Switching Coder1
and Coder2
in the query gives the respective set of entities for
Coder2
.
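Rather than editing the query by hand to swap the coders, the coder names can be substituted into a template; this sketch uses printf's %s slots for the two names and passes everything else through literally:

```shell
#!/bin/bash
# Template for "entities marked only by the first coder"; the two %s
# slots receive the coder names. Single quotes keep $a, $b, $w and the
# double quotes intact.
template='($a named-entity)(forall $b named-entity)(forall $w w): $a@coder=="%s" && (($b@coder=="%s" && ($a ^ $w))->!($b ^ $w))'
for pair in 'Coder1 Coder2' 'Coder2 Coder1'; do
  set -- $pair            # split the pair into $1 and $2
  printf "${template}\n" "$1" "$2"
done
```

This prints the query once per direction, ready to be pasted into the query window or passed to a command-line tool.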
($a named-entity)(forall $b named-entity)(forall $w w): $a@coder=="Coder2" && (($b@coder=="Coder1"
&& ($a ^ $w))->!($b ^ $w)) || $a@coder=="Coder1" && (($b@coder=="Coder2" && ($a ^ $w))->!($b ^ $w))
Like the previous query, only this time both sets of non-corresponding entities are given in one go.
The following examples check how the two annotators agree on the categorisation of co-extensive entities:
($a named-entity)($b named-entity) ($t ne-type): $a@coder=="Coder1" && $b@coder=="Coder2"
&& ($a >"type" $t) && ($b >"type" $t) :: ($w1 w) (forall $w w) : ($a ^ $w1) && ($b ^ $w1) &&(($a ^ $w) ->
($b ^ $w)) && (($b ^ $w) -> ($a ^ $w))
Gives all the common named entities between Coder1
and
Coder2
along with the entity type and text; the entities have to be
co-extensive (fully overlapping) and of the same type.
($a named-entity)($b named-entity) ($t ne-type): $a@coder=="Coder1" && $b@coder=="Coder2"
&& ($a >"type" $t) && ($b >"type" $t) :: ($w1 w) (exists $w w) : ($a ^ $w1) && ($b ^ $w1) &&(($a ^ $w) ->
($b ^ $w)) && (($b ^ $w) -> ($a ^ $w))
Like the previous query, but includes partially co-extensive entities. The words showing in the query results are only the ones that actually do overlap.
($a named-entity)($b named-entity) ($t ne-type): $a@coder=="Coder1" && $b@coder=="Coder2"
&& ($a >"type" $t) && ($b >"type" $t) :: ($w2 w):($a ^ $w2) && ($b ^ $w2) :: ($w w):(($b ^ $w) && !($a ^ $w)) ||
(($a ^ $w) && !($b ^ $w))
Gives the list of entities which are the same type, but only partially co-extensive. The results include the entire set of words from both codings.
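Counts obtained from queries like the ones above can be combined into a rough percentage agreement figure. The sketch below uses made-up counts and is no substitute for a proper reliability statistic such as kappa:

```shell
#!/bin/bash
# Illustrative counts only: entities marked by each coder and the
# number of co-extensive pairs between them.
coder1=40; coder2=44; matched=36
# Matched pairs over the union of both coders' entities.
agreement=$(( 100 * matched / (coder1 + coder2 - matched) ))
echo "${agreement}%"
```

With these numbers the union is 48 entities, giving 75% agreement.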
($a named-entity)($b named-entity) ($t ne-type)($t1 ne-type): $a@coder=="Coder1"
&& $b@coder=="Coder2" && ($a >"type" $t) && ($b >"type" $t1) && ($t != $t1) ::
($w1 w) (exists $w w) : ($a ^ $w1) && ($b ^ $w1) &&(($a ^ $w) -> ($b ^ $w)) && (($b ^ $w) -> ($a ^ $w)) ::
($w2 w): ($b ^ $w2)
Gives the list of entities which are partially or fully co-extensive, but for which the two coders disagree as to the type.
($a named-entity)($b named-entity)($c ne-type)($d ne-type):
$a@coder=="Coder1" && $b@coder=="Coder2" && $c@name=="EntityType1" && $d@name=="EntityType2" && $a >"type"^ $c && $b >"type"^ $d ::
($w2 w):($a ^ $w2) && ($b ^ $w2)
Gives the list of entities which are partially or fully co-extensive, and which Coder1
has marked as EntityType1
(or one of its sub-types) and
Coder2
has marked as EntityType2
(or one of its sub-types).
This checks for type-specific disagreements between the two coders.
($t ne-type): !($t@name=="ne-root") :: ($a named-entity)($b named-entity): $a@coder=="Coder1"
&& $b@coder=="Coder2" && (($a >"type"^ $t) && ($b >"type"^ $t)) ::
($w1 w) (forall $w w) : ($a ^ $w1) && ($b ^ $w1) &&(($a ^ $w) -> ($b ^ $w)) && (($b ^ $w) -> ($a ^ $w))
The query creates a list of all the entity types, and slots in each entry all the (fully) co-extensive entities as marked by the two coders. The actual text forming each entity is also included in the results.
($t1 ne-type): !($t1@name=="ne-root") ::
($a named-entity)($b named-entity): $a@coder=="Coder1" && $b@coder=="Coder2" && (($a >"type"^ $t1) && ($b >"type"^ $t1)) ::
($w1 w) (exists $w w) : ($a ^ $w1) && ($b ^ $w1) &&(($a ^ $w) -> ($b ^ $w)) && (($b ^ $w) -> ($a ^ $w))
Like the previous query, but includes partially co-extensive entities. The words showing in the query results are only the ones that actually do overlap.
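For corpus-wide work, queries like these can also be run non-interactively from the shell. The class and flag names below are assumptions to check against the command-line tools shipped with your NXT distribution, so the sketch only constructs and prints the command rather than running it:

```shell
#!/bin/bash
# Sketch of assembling a non-interactive query run. "CountQueryResults"
# and the -corpus/-query flags are assumptions -- consult the usage
# message of the tools in your NXT distribution before running this.
CORPUS='Data/AMI/AMI-metadata.xml'
QUERY='($a named-entity) : $a@coder=="Coder1"'
CMD="java CountQueryResults -corpus $CORPUS -query '$QUERY'"
echo "$CMD"
```

The single quotes around $QUERY in the assembled command are what keep the double quotes and $-variables intact when the command is eventually executed.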