NXT has its own query language: the NXT Query Language, or NQL,. In this section, we describe NQL, list its operators, and give example queries.
NQL queries describe n-tuples of nodes, possibly constrained by type, and a set of conditions that expresses further constraints, for instance, on the attributes that a node contains or how two nodes relate to each other. If an n-tuple of nodes with the required types satisfies the conditions expressed in the query, it is said to be a match.
Syntactically, a query consists of two parts separated by a
colon (:
). The first part declares the
variables for the query, and the second part expresses the
conditions.
Table 3. Example showing general query structure
($a)($b word): $a ^ $b | This query matched pairs (2-tuples) of nodes in which the first, bound to $a ,
can be any type and the second, bound to $b must be of type word, and
where the first dominates the second. In this example ($a)($b word) is the declaration
part and the dominance relation $a ^ $b is the only condition.
|
Formal definition:
query
:=declarations
:
match_condition
The declaration part of the query must contain a variable declaration for every variable mentioned in the conditions.
Each variable declaration is enclosed in parentheses. The components of a variable declaration are separated by
whitespace. The first component is an optional quantifier; the possible quantifiers are
forall
and exists
, which have their usual logical meanings.
The second component is the name of the variable. A variable name is a $
character followed by an
arbitrary number of letters and digits, which can include underscore ( _
) and language-specific
characters. The final component is an optional type restriction. This is either a simple string expressing the
type of a node in the NOM or a disjunction of these types separated using the pipe symbol (|
).
Table 4. Example declaration parts
($a) |
The query matches singletons (1-tuples) in which $a is be bound to nodes of any
type for which the conditions are true.
TipAn empty type definition may slow down query processing drastically. |
($a word)($b sentence) |
The query matches pairs (2-tuples) in which $a is bound to nodes of type
word and $b is bound to nodes of type
sentence for which the conditions are true.
|
($a word)($b word) |
The query matches pairs (2-tuples) in which $a is bound to nodes of type word
and $b is bound to nodes of type word for which the conditions are true.
CautionIn a pair, both variables might be bound to the same node. |
($a word | phrase | sentence) |
The query matches singletons (1-tuples) in which $a is bound to nodes of type word,
phrase, or sentence for which the conditions are true.
|
($b sentence)(forall $a word) |
The query matches singletons (1-tuples) in which $b is bound to any node that has type
sentence for which the conditions are true for every possible way of binding $a
to individual nodes that have type word.
|
Formal definition:
declarations
:=declarations var_declaration
declarations
:=var_declaration
var_declaration
:=(
variable
)
var_declaration
:=(
variable typedeclaration
)
typedeclaration
:=types
typedeclaration
:=type
types
:=types
|
type
Grammar for forall/exists missing. We want to use some rule system with optionality or this is going to take forever - check what docbook supports?
The condition part is a Boolean expression over property tests,
structural relations, and temporal relations. Parentheses are only
needed if a lower precedence relation should be executed first. The
strongest binding operator is negation ( !
).
The operators are listed in the order of their precedence below:
Negation: not
or !
Conjunction: and
or &
or &&
Disjunction: or
or |
or ||
Implication: ->
For convenience, there are a number of syntactic literals for each operator.
The different forms for the same operator are completely synomynous.
The implication operator ->
is the weakest binding operator and,
again, is provided for convenience; $a -> $b
is logically equivalent to
!a | b
.
Formal definition:
match_condition
:=match_condition
:=(
match_condition
)
match_condition
:=!
match_condition
var_declaration
:=match_condition
&
match_condition
var_declaration
:=match_condition
|
match_condition
var_declaration
:=match_condition
->
match_condition
var_declaration
:=property_test
var_declaration
:=structural_relation
var_declaration
:=time_relation
The condition part may be empty, in which case the query always evaluates to true.
A property test either tests for the existence of some property or compares the values of two properties.
A property may be an attribute value, a constant, or the result of a function applied to an element.
The query language contains a number of functions that take elements and return some property of the element.
These functions have to do with properties that are special in NXT: timing,
identification, and textual content. Function names can be given in upper or lower case, and the operators
=
and ==
are synonymous. Where the data storage format for a corpus stores
these properties as XML attributes, they can also be queried using the form variable
@
attribute
. The functional form is preferred because it is the same across all data
sets and leaves the user in no doubt that these are the attributes holding the special properties.
The functions are as follows:
Table 5. Query functions
TEXT($w) |
Returns the text contained by the element matched by $w .
|
ID($w) |
Returns the unique identifier of the element matched by $w .
|
TIMED($w) |
Returns true if the element matched by $w has start and end times, and false otherwise.
|
START($w) |
Returns the start time of the element matched by $w .
|
END($a) |
Returns the end time of the element matched by $w .
|
DURATION($a) |
Returns the duration of the element matched by $w (that is, the end time minus the start time).
|
CENTER($a) |
Returns the temporal center of the element matched by $w (that is, the end time minus the start time, divided by 2).
|
Formal definition:
property
:="
number_or_string
"
property
:=variable
@
attribute
property
:=TEXT(
variable
)
property
:=ID(
variable
)
property
:=TIMED(
variable
)
property
:=START(
variable
)
property
:=END(
variable
)
property
:=DURATION(
variable
)
property
:=CENTER(
variable
)
Numbers and string values must be placed in quotes. Placing a number in quotes does not mean it will be treated as a string. Either single or double quotes can be used, but if you are passing a query as an argument at the command line, your choice must be compatible with your choice of quotes for the shell.
Property existence tests check for the existence of some property.
Table 6. Node property tests
$w@pos |
True if and only if the element matched by $a has a pos attribute.
|
TIMED($a) |
True if and only if the element matched by $a is timed, either because it has both start and
end times or because it inherits them from their children.
|
START($a) |
True if and only if the element matched by $a has a start time, either in its own right or
by inheritance from its children.
|
END($a) |
True if and only if the element matched by $a has an end time, either in its own right or by
inheritance from its children.
|
TEXT($a) |
True if and only if the element matched by $a contains text.
|
It is actually possible to test for the existence of any property, but the other possible existence tests are not useful;
ids
, string
, and numbers
always
exist, and duration
and center
properties exist for elements
that are timed.
Formal definition:
property_test
:=variable
@
attribute
property_test
:=TIMED(
variable
)
property_test
:=START(
variable
)
property_test
:=END(
variable
)
property_test
:=TEXT(
variable
)
The next set of property tests compare the equality and order of two values. Property expressions are weakly typed. The value
resulting from an expression will be interpreted as a floating-point
number when possible. If it cannot be converted into a number or if
the value is compared to a pattern given by a regular expression,
the value will be treated as a string. A number is always unequal to
a string. Strings are themselves alphabetically ordered, and are
case-sensitive. Strings starting with upper case letters are less
than strings with upper case letters. As a result, "2" == "2.0"
is true, while
"Two" == "two"
is false.
Table 7. Equality and order tests
($x): $x@cat=="NP" | Matches elements with a category attribute containing the string value "NP". |
($x)($y): $x@cat==$y@cat |
Matches pairs of elements with the same cat attribute
(including the pair where $x and $y
are bound to the same element).
|
($x)($y): $x@cat==$y@cat & $x!= $y |
Matches pairs of elements with the same cat attribute
(excluding the pair where $x and $y
are bound to the same element).
|
=
and ==
are synonymous.
Formal definition:
property_test
:=property
==
property
property_test
:=property
!=
property
property_test
:=property
<
property
property_test
:=property
>
property
property_test
:=property
<=
property
property_test
:=property
>=
property
The final set of property tests compare string values against
regular expressions. Regular expressions are enclosed by slashes (/
). NXT's regular expression
implementation uses Java regular expressions underneath, so it is not possible to give a definitive syntax for the patterns here,
instead, see the
Java 1.5 Regular Expression Documentation
or equivalent documentation for your Java version.
Table 8. Regular expression examples
($a): text($a) ~ /th.*/ |
Words starting with th . Dot
(. ) means any single character, and
* means 0 or more repetitions of whatever
it follows.
|
($a): text($a) ~ /[dD](as|er)/ |
The words das and
der , whether capitalized or not.
|
($a): text($a) ~ /.+([0-9A-Z])+.*/ |
Words which contain at least one uppercase letter or
number at a non-initial position. The plus
(+ ) means 1 or more repetitions of whatever
it follows, and the square brackets
([] )specify a character class.
|
($a): text($a) ~ /\.*/ |
A possibly empty sequence of dots, where in contrast
/.*/ matches every word (assuming it contains
text). The backslash (\ ) means the dot
(. ) is interpreted literally.
|
Your regular expression must match the entire string, not some substring contained in it. /x/
in NQL notation
means /^x$/
in the Perl notation.
Formal definition:
property_test
:=property
~ /
pattern
/
property_test
:=property
!~ /
pattern
/
Comments are allowed in the form of line comments and block
comments. Line comments start with the
symbol //
and include the remainder of the current
line. Block comments begin with
/*
and end with */
, and may extend
over multiple lines.
The simplest structural relation asserts the identity or non-identity
of two elements. Since the default evaluation strategy allows different variables to be bound to the same element, the
!=
operator is sometimes necessary to exclude unwanted results. The
==
operator is less useful and was mainly added for the sake of symmetry.
structural_relation
:=variable
==
variable
structural_relation
:=variable
=!
variable
The basic structural relation is the dominance relation ^
. To describe that an element a
dominates an element b
the dominance operator ^
is be used. In other words
a
is an ancestor of b
.
structural_relation
:=variable
^
variable
structural_relation
:=variable
^
distance variable
The expression a
^
a
is always true! Use the non-identity operator to exclude these special case.
Two elements are in a precedence relation if they have a common ancestor element, which can be a normal element or
the root element of a layer. An element $x
precedes another element $y
if some ancestor of
$x
(or $x
itself) is a preceding sibling of some ancesor of $y
(or $y
itself).
structural_relation
:=variable
<>
variable
The expression
a
<>
a
is always false!
Some examples:
Table 10. Structural relations examples
($a)($b): $a ^ $b & $a != $b | all combinations of two different elements in a dominance relation |
($s syntax)($w word): $s ^1 $w | all combinations of syntax and word elements, where the syntax element dominates directly the word element |
($a)($b): $a ^0 $b | equal to $a == $b
|
($a)($b): $a ^-2 $b | equal to $b ^2 $a
|
($a word)($b word): $a <> $b | two words, $a precedes $b
|
Table 11. Temporal relations examples
Op., short | Operator, lexical | Definition |
---|---|---|
%
|
overlaps.left |
(start($a) <= start($b)) and (end($a) > start($b)) and (end($a) <= end($b)) |
[[
|
left.aligned.with |
start($a) == start($b) |
]]
|
right.aligned.with |
end($a) == end($b) |
@
|
includes inclusion |
(start($a) <= start($b)) and (end($a) >= end($b)) |
[]
|
same.extent.as |
(start($a) == start($b)) and (end($a) == end($b)) |
#
|
overlaps.with |
(end($a) > start($b)) and (end($b) > start($a)) |
][
|
contact.with |
end($a) == start($b) |
<<
|
precedes |
end($a) <= start($b) |
starts.earlier.than |
start($a) <= start($b) | |
starts.later.than |
start($a) >= start($b) | |
ends.earlier.than |
end($a) <= end($b) | |
ends.later.than |
end($a) >= end($b) |
To express complex structural relations in some cases auxiliary elements are required, which should not be part of the query result. Sometimes it is sufficient that one such element satisfies the match condition, sometimes all auxiliary elements must match.
The mathematical solution to this problem are the existential and universal quantifiers. In NQL variables can be existential quantified or universal quantified. In both cases elments which are bound to a quantified variable are not part of the result.
The formal definition of Condition part is now extended with quantifiers:
var_declaration
:=( exists
variable
)
var_declaration
:=( exists
variable typedefinition
)
var_declaration
:=( forall
variable
)
var_declaration
:=( forall
variable typedefinition
)
In queries with quantifiers the implication operator ->
could be useful (see Condition part).
Some examples:
Table 12. Quantifier Examples
($a)(exists $b): $a ^1 $b | elements with children |
($root)(forall $null): !$null ^1 $root | root elements |
The result of a query is a list of n
-tuples of elements (or,
more precisely, variable bindings) satisfying the match condition, where
n
is the number of variables declared without quantifiers (cf.
Quantifier).
The query result is returned in the form of an XML document (or, abstractely, a new tree structure adjoined
to the corpus). Each query match corresponds to a match
element, with pointers representing variable bindings and the variable name given by the pointer's role.
An example result for a query involving variables $w
and
$p
is:
<matchlist size="2"> <match n="1"> <nite:pointer role="w" xlink:href="..."/> <nite:pointer role="p" xlink:href="..."/> </match> <match n="2"> <nite:pointer role="w" xlink:href="..."/> <nite:pointer role="p" xlink:href="..."/> </match> </matchlist>
The matches are not ordered. The ordering of the results of two similar but not identical queries can be very different.
A complex query consists of a sequence of simple queries seperated by
::
markers.
complex_query
:=complex_query
::
query
complex_query
:=query
For a complex query, the leftmost query is evaluated first. Each query in the sequence operates on the result of the previous query. This means that for every match, the following query is evaluated with the variable bindings of the previous queries. The fixed variable bindings may be used anywhere in the ensuing queries. This evaluation strategy produces a hierarchically structured query result, where each match of the leftmost simple query includes a matchlist for the second query, etc.
In the example
($w word): $w@orth ~ /S.*/ :: ($p phone): $w ^ $p
the query result has the following structure:
<matchlist size="2"> <match n="1"> <nite:pointer role="w" xlink:href="..."/> <matchlist type="sub" size="2"> <match n="1"> <nite:pointer role="p" xlink:href="..."/> </match> <match n="2"> <nite:pointer role="p" xlink:href="..."/> </match> </matchlist> </match> <match n="2"> <nite:pointer role="w" xlink:href="..."/> <matchlist type="sub" size="1"> <match n="1"> <nite:pointer role="p" xlink:href="..."/> </match> </matchlist> </match> </matchlist>
There are no empty submatches. If for a variable binding the following single query has no matches, the variable binding will be removed from the result. So the number of matches for a complex query is less than or equal to the number of matches for the first part.
IS THIS SECTION TO BE KEPT?
At Feb 05, there are a number of known problems with the current querylanguage implementation.
There is a bug when querying over multiple observations - the implementation considers
times in different observations to be comparable, so that it's possible to get the result that
an element in one observation is before some element in another. This is easy to get around:
query on one observation at a time, or declare the reserved attribute for observation names for
your corpus and add a test for the same observation as an extra query term - e.g.($f@obs = $g@obs)
,
if the attribute declared is obs.
The search GUI (whether called stand-alone or from a search menu) can't display results if
some subquery in a complex query only has query matches that are bound with forall
- e.g.
($f foo):($f@att="val")::(forall $g bar):!($g ^ $f)
The immediate precedence operator is missing. Immediate
precedence is equivalent to
($f foo)($g foo)(forall $h foo): ($f<<$g) && (($h=$f) || ($h=$g) || ($h<<$f) || ($g<<$h))
but, of course, this is cumbersome and can be too slow and memory-intensive for practical purposes,
depending on the data set. Some common uses of the operator are covered by the NGramCalc
utility.Another work-around
is to create one XML tree from the NXT data thatrepresents the information required and query it using XPath.
Export to LPath and tgrep2 would also be reasonable and
are not difficult to implement. If you need to match on
regular expressions of XML elements in order to add markup, (so, for instance, saying "find syntactic constituents
with one determiner, followed by one or more adjectives, followed by one noun, and wrap a new tag around them"), but
you can always use something like fsgmatch (from the LTG; new release,
currently in beta, is called lxtransduce) and
then modify the metadata to match. Remember, the data is just XML, amenable to all of the usual XML processing
techniques.
The arithmetic operators are missing.
At present, users who need them add new attributes to theirdata set and then carry on as normal.
For instance, a researcher looking at how often bar
elements start in the 10 seconds after
foo
elements end might add an "adjusted start" attribute to bar
elements
that take 10 secondsoff their official start
times, and then use the query
($f foo)($b bar):(START($b) > END($f)) && ($b@adjustedstart < END($foo))
This stylesheet, run on a specific individual coding in the context of the MONITOR project, is an example of how this can be done. It just copies everything, adding new attributes to feedback gaze codes. We used this general technique on the Switchboard data to get lengths for syntactic constituents, and on the Monitor data to get durations.
This method is inconvenient, particularly for the sort of exploratory study that wishes to consider several
different time relationships. We don't think it is worth adding special loading routines that addtemporary
attributes for adjusted start and end times, but we could include some utilities for command line searching
based on adjustments passed in on the command line. For instance,
java CountWithTimeOffset -q '($t turn)($f feedback):($t # $f)' -t feedback -d 50
could mean to count overlaps after feedback elements have been displaced 50 seconds forward.
We are considering whether this would be useful enough to supply.
At present (Apr 05) the query language parser fails to handle namespacing properly, so any elements and attributes
that are namespaced will be difficult to work with. For the timing and id attributes, where the
default names are in the nite:
namespace, this doesn'tmatter, since they are exposed
to query via e.g.
START($x)
, but namespacing other tags and attributes would make working with them difficult until this is fixed.
NXT's query engine is slow and uses a great deal of memory. For instance, some of our more complicated syntactic querieson the Switchboard corpus take 10 seconds per dialogue, or over an hour and a half for the entire corpus.
This is partly a consequence of what it does - the query languageis solving a harder problem than languages that operate on trees and/or are limited in their use of left and right context. It istrue that the current implementation is not fully optimized, but this is not something we intend to look at in the immediate future. Our first choice strategy for addressing this problem is to look at mapping NQL queries to XQuery for implementation, and addition of the missing operators, that way. Meanwhile, most of NXT's users are not actually engaged in real-time processing, and find that if they develop queries on a few observations using a GUI, they can then afford to run the queries at the command line in batch. The more they are interested in sparse phenomena, the less suitable this strategy is. For some query-based analyses, it is also useful to consider direct implementation using the NOM API, since the programmer can optimize for the analysis being performed.
Meanwhile, an hour and a half is OK for batch mode, but some of our queries areso common that we
really want easy access to the results. We can get this by indexing.
Using indices rather than the more complex syntactic queries makes querying roughly ten times faster.
This will be even faster if one then selects not to load the syntax at all, which is possible if one
doesn't need it for other parts of the subsequent query.
You can choose not to load any part of the data by commenting out the
<coding-file>
tag for it in your local copy of the metadata file, or after NXT 1.3.0, by enabling lazy loading
in your applications.
It's faster to use string equality than regular expression matching in the query language, and keep in mind the regular expressions have to match the entire string they are compared against, not just a substring of it.
The very desperate can write special purpose applications to evaluate their queries,
which is faster especially for queries involving quantification.
For instance,one user has adapted CountQueryResults
to run part of the query he wants,
but instead of returning the results, then checks the equivalent of hisforall conditions using navigation in the NOM.
We recommend refining queries using display.bat/.sh
on a single dialogue (probably
spot-checking on a couple more, since observations vary), and running actual counts using the
command line utilities. Build up queries term by term - the syntax error messages aren't always very easy
to understand. Missing dollar signs, quotation marks, and parentheses are the worst culprits.
Get around the bookmark problems and the lack of parenthesis and quote matching in the searchGUI
by typing the query into something else that's handier (such as emacs)
and pasting what you've written into the query window.
You canand should include comments in queries if they are at all complicated.
Queries have to be expressed on one line to run them at the command line, but you shouldn't try
to author them this way - instead, postprocess a query developed in this more verbose style by taking out
UNFINISHED SENTENCE
Analysis of query results can be expedited by thinking carefully aboutthe battery of tools that are
available: knit
, LT-XML, stylesheets, xmlperl,
shell script loops, and so on. One interesting possibility is importing the query results into the data set,
which would be a fancier, hierarchically structured form of indexing. At May 2004, the metadata
<coding-file>
declaration required to do this would be a little different for every query result,
but we intend minor syntactic changes in both the query result XML and what knit
produces
to make this declaration static.
The main documentation is the query language reference manual . Virtually the same information can be found on the helpmenu of the search window (if you don't find it there, it's an installation problem). An older document with more contextual information can be found here.
At September 2006, we plan a revised version of the manual. The current version fails to give details
about the operator for finding out whether two elements are linked via a pointer with a role.
($a <"foo" $b)
is true if $a
points to $b
using the
"foo"
role; the role name can be omitted, but if it is specified it can only be
given as a textual string, not as a regular expression. The current version also fails to make clear
that the regular expression examples given are only a subset of the possibilities.
The exact regular expression syntax depends on your version of Java, since it is implemented using the
java.util.regex package. Java 1.5 regular expression documentation can be found
here
.
Here are some worked examples for the Switchboard data sample and the Monitor data sample.
Computer scientists and people familiar with first order predicate calculus have tended to be happy
with the reference manual plus the examples; other people need more (so, for instance, don't know what
implication is or what forall
is likely to mean) and we're still thinking about what we
might be able to provide for them.
At Nov 2004, there are a few things described in the query documentation that haven't been implemented yet
(and aren't on the workplan for immediate development). This includes arithmetic operators and temporal fuzziness.
We thought this included versions of ^
and
<>
limited by distance, but users report that these (or some of these?)
work. Also, some versions of the query documentation show ;
instead of :
as
the separator between bindings and match conditions. The only major bug we've run into (at Nov 2004) is that
temporal operators will perform comparisons across observations, even though time in different observations is
meant to be independent. After NXT-1.2.6, 05 May 04, one can in the metadata declare a reserved attribute to use
for the observation name that will be
added automatically for every element, providing a work-around.
There's a nifty visual demo that runs on a toy corpus and might be useful for deciding whether this stuff is useful in the first place.