The NXT Query Language (NQL)

The NXT Query Language (NQL)
Prev		Next

NXT has its own query language: the NXT Query Language, or NQL,. In this section, we describe NQL, list its operators, and give example queries.

General structure of a simple query

NQL queries describe n-tuples of nodes, possibly constrained by type, and a set of conditions that expresses further constraints, for instance, on the attributes that a node contains or how two nodes relate to each other. If an n-tuple of nodes with the required types satisfies the conditions expressed in the query, it is said to be a match.

Syntactically, a query consists of two parts separated by a colon (:). The first part declares the variables for the query, and the second part expresses the conditions.

Table 3. Example showing general query structure

($a)($b word): $a ^ $b

This query matched pairs (2-tuples) of nodes in which the first, bound to $a, can be any type and the second, bound to $b must be of type word, and where the first dominates the second. In this example ($a)($b word) is the declaration part and the dominance relation $a ^ $b is the only condition.

Formal definition:

	query  :=  declarations : match_condition

Declaration part

The declaration part of the query must contain a variable declaration for every variable mentioned in the conditions. Each variable declaration is enclosed in parentheses. The components of a variable declaration are separated by whitespace. The first component is an optional quantifier; the possible quantifiers are forall and exists, which have their usual logical meanings. The second component is the name of the variable. A variable name is a $ character followed by an arbitrary number of letters and digits, which can include underscore ( _ ) and language-specific characters. The final component is an optional type restriction. This is either a simple string expressing the type of a node in the NOM or a disjunction of these types separated using the pipe symbol (|).

Table 4. Example declaration parts

($a)	The query matches singletons (1-tuples) in which `$a` is be bound to nodes of any type for which the conditions are true. Tip An empty type definition may slow down query processing drastically.
($a word)($b sentence)	The query matches pairs (2-tuples) in which `$a` is bound to nodes of type word and `$b` is bound to nodes of type sentence for which the conditions are true.
($a word)($b word)	The query matches pairs (2-tuples) in which `$a` is bound to nodes of type word and `$b` is bound to nodes of type word for which the conditions are true. Caution In a pair, both variables might be bound to the same node.
($a word \| phrase \| sentence)	The query matches singletons (1-tuples) in which `$a` is bound to nodes of type word, phrase, or sentence for which the conditions are true.
($b sentence)(forall $a word)	The query matches singletons (1-tuples) in which `$b` is bound to any node that has type sentence for which the conditions are true for every possible way of binding `$a` to individual nodes that have type word.

Formal definition:

	declarations     :=  declarations var_declaration
	declarations     :=  var_declaration
	var_declaration  :=  ( variable )
	var_declaration  :=  ( variable typedeclaration )
	typedeclaration  :=  types
	typedeclaration  :=  type
	types            :=  types | type

Grammar for forall/exists missing. We want to use some rule system with optionality or this is going to take forever - check what docbook supports?

Condition part

The condition part is a Boolean expression over property tests, structural relations, and temporal relations. Parentheses are only needed if a lower precedence relation should be executed first. The strongest binding operator is negation ( ! ). The operators are listed in the order of their precedence below:

Negation: not or !
Conjunction: and or & or &&
Disjunction: or or | or ||
Implication: ->

For convenience, there are a number of syntactic literals for each operator. The different forms for the same operator are completely synomynous. The implication operator -> is the weakest binding operator and, again, is provided for convenience; $a -> $b is logically equivalent to !a | b.

Formal definition:

	match_condition    := 
	match_condition    := ( match_condition )
	match_condition    := ! match_condition
	var_declaration    := match_condition & match_condition
	var_declaration    := match_condition | match_condition
	var_declaration    := match_condition -> match_condition
	var_declaration    := property_test
	var_declaration    := structural_relation
	var_declaration    := time_relation

Tip

The condition part may be empty, in which case the query always evaluates to true.

Property tests

A property test either tests for the existence of some property or compares the values of two properties.

Simple and functional property expressions

A property may be an attribute value, a constant, or the result of a function applied to an element.

The query language contains a number of functions that take elements and return some property of the element. These functions have to do with properties that are special in NXT: timing, identification, and textual content. Function names can be given in upper or lower case, and the operators = and == are synonymous. Where the data storage format for a corpus stores these properties as XML attributes, they can also be queried using the form variable @attribute. The functional form is preferred because it is the same across all data sets and leaves the user in no doubt that these are the attributes holding the special properties.

The functions are as follows:

Table 5. Query functions

TEXT($w)	Returns the text contained by the element matched by `$w`.
ID($w)	Returns the unique identifier of the element matched by `$w`.
TIMED($w)	Returns true if the element matched by `$w` has start and end times, and false otherwise.
START($w)	Returns the start time of the element matched by `$w`.
END($a)	Returns the end time of the element matched by `$w`.
DURATION($a)	Returns the duration of the element matched by `$w` (that is, the end time minus the start time).
CENTER($a)	Returns the temporal center of the element matched by `$w` (that is, the end time minus the start time, divided by 2).

Formal definition:

	property    := " number_or_string " 
	property    := variable @ attribute
	property    := TEXT(  variable )
	property    := ID( variable )
	property    := TIMED( variable )
	property    := START( variable )
	property    := END( variable )
	property    := DURATION( variable )
	property    := CENTER( variable )

Note

Numbers and string values must be placed in quotes. Placing a number in quotes does not mean it will be treated as a string. Either single or double quotes can be used, but if you are passing a query as an argument at the command line, your choice must be compatible with your choice of quotes for the shell.

Property existence tests

Property existence tests check for the existence of some property.

Table 6. Node property tests

$w@pos	True if and only if the element matched by `$a` has a pos attribute.
TIMED($a)	True if and only if the element matched by `$a` is timed, either because it has both start and end times or because it inherits them from their children.
START($a)	True if and only if the element matched by `$a` has a start time, either in its own right or by inheritance from its children.
END($a)	True if and only if the element matched by `$a` has an end time, either in its own right or by inheritance from its children.
TEXT($a)	True if and only if the element matched by `$a` contains text.

It is actually possible to test for the existence of any property, but the other possible existence tests are not useful; ids, string, and numbers always exist, and duration and center properties exist for elements that are timed.

Formal definition:

	property_test    := variable @ attribute
	property_test    := TIMED( variable )
	property_test    := START( variable )
	property_test    := END( variable )
	property_test    := TEXT( variable )

String and number comparisons using`==`, `!=`, `<=`, `<`, `>`, and `>=`

The next set of property tests compare the equality and order of two values. Property expressions are weakly typed. The value resulting from an expression will be interpreted as a floating-point number when possible. If it cannot be converted into a number or if the value is compared to a pattern given by a regular expression, the value will be treated as a string. A number is always unequal to a string. Strings are themselves alphabetically ordered, and are case-sensitive. Strings starting with upper case letters are less than strings with upper case letters. As a result, "2" == "2.0" is true, while "Two" == "two" is false.

Table 7. Equality and order tests

($x): $x@cat=="NP"	Matches elements with a category attribute containing the string value "NP".
($x)($y): $x@cat==$y@cat	Matches pairs of elements with the same cat attribute (including the pair where `$x` and `$y` are bound to the same element).
($x)($y): $x@cat==$y@cat & $x!= $y	Matches pairs of elements with the same `cat` attribute (excluding the pair where `$x` and `$y` are bound to the same element).

Tip

= and == are synonymous.

Formal definition:

	property_test    := property  ==  property
	property_test    := property  !=  property
	property_test    := property  <  property
	property_test    := property  >  property
	property_test    := property  <=  property
	property_test    := property  >=  property

Regular expression comparisons

The final set of property tests compare string values against regular expressions. Regular expressions are enclosed by slashes (/). NXT's regular expression implementation uses Java regular expressions underneath, so it is not possible to give a definitive syntax for the patterns here, instead, see the Java 1.5 Regular Expression Documentation or equivalent documentation for your Java version.

Table 8. Regular expression examples

($a): text($a) ~ /th.*/	Words starting with `th`. Dot (`.`) means any single character, and `*` means 0 or more repetitions of whatever it follows.
($a): text($a) ~ /[dD](as\|er)/	The words `das` and `der`, whether capitalized or not.
($a): text($a) ~ /.+([0-9A-Z])+.*/	Words which contain at least one uppercase letter or number at a non-initial position. The plus (`+`) means 1 or more repetitions of whatever it follows, and the square brackets (`[]`)specify a character class.
($a): text($a) ~ /\.*/	A possibly empty sequence of dots, where in contrast `/.*/` matches every word (assuming it contains text). The backslash (`\`) means the dot (`.`) is interpreted literally.

Note

Your regular expression must match the entire string, not some substring contained in it. /x/ in NQL notation means /^x$/ in the Perl notation.

Formal definition:

	property_test    := property ~ / pattern / 
	property_test    := property !~ / pattern /

Comments

Comments are allowed in the form of line comments and block comments. Line comments start with the symbol // and include the remainder of the current line. Block comments begin with /* and end with */, and may extend over multiple lines.

Table 9. Comment examples

($a) // all elements	line comment: all elements
($a)($b word): /$a@pos="NN" & / $a ^ $b	only `($a)($b word): $a ^ $b` will be processed

Structural relations

Identity

The simplest structural relation asserts the identity or non-identity of two elements. Since the default evaluation strategy allows different variables to be bound to the same element, the != operator is sometimes necessary to exclude unwanted results. The == operator is less useful and was mainly added for the sake of symmetry.

	structural_relation    :=    variable == variable
	structural_relation    :=    variable =! variable

Dominance

The basic structural relation is the dominance relation ^. To describe that an element a dominates an element b the dominance operator ^ is be used. In other words a is an ancestor of b.

	structural_relation    :=   variable ^ variable
	structural_relation    :=   variable ^ distance variable

Note

The expression a^a is always true! Use the non-identity operator to exclude these special case.

Precedence

Two elements are in a precedence relation if they have a common ancestor element, which can be a normal element or the root element of a layer. An element $x precedes another element $y if some ancestor of $x (or $x itself) is a preceding sibling of some ancesor of $y (or $y itself).

	structural_relation    :=    variable <> variable

Note

The expression a<>a is always false!

Some examples:

Table 10. Structural relations examples

($a)($b): $a ^ $b & $a != $b	all combinations of two different elements in a dominance relation
($s syntax)($w word): $s ^1 $w	all combinations of syntax and word elements, where the syntax element dominates directly the word element
($a)($b): $a ^0 $b	equal to `$a == $b`
($a)($b): $a ^-2 $b	equal to `$b ^2 $a`
($a word)($b word): $a <> $b	two words, `$a` precedes `$b`

Temporal relations

Table 11. Temporal relations examples

Op., short	Operator, lexical	Definition
`%`	overlaps.left	(start($a) <= start($b)) and (end($a) > start($b)) and (end($a) <= end($b))
`[[`	left.aligned.with	start($a) == start($b)
`]]`	right.aligned.with	end($a) == end($b)
`@`	includes inclusion	(start($a) <= start($b)) and (end($a) >= end($b))
`[]`	same.extent.as	(start($a) == start($b)) and (end($a) == end($b))
`#`	overlaps.with	(end($a) > start($b)) and (end($b) > start($a))
`][`	contact.with	end($a) == start($b)
`<<`	precedes	end($a) <= start($b)
	starts.earlier.than	start($a) <= start($b)
	starts.later.than	start($a) >= start($b)
	ends.earlier.than	end($a) <= end($b)
	ends.later.than	end($a) >= end($b)

Quantifier

To express complex structural relations in some cases auxiliary elements are required, which should not be part of the query result. Sometimes it is sufficient that one such element satisfies the match condition, sometimes all auxiliary elements must match.

The mathematical solution to this problem are the existential and universal quantifiers. In NQL variables can be existential quantified or universal quantified. In both cases elments which are bound to a quantified variable are not part of the result.

The formal definition of Condition part is now extended with quantifiers:

	var_declaration     :=  ( exists variable )
	var_declaration     :=  ( exists variable typedefinition )
	var_declaration     :=  ( forall variable )
	var_declaration     :=  ( forall variable typedefinition )

In queries with quantifiers the implication operator -> could be useful (see Condition part).

Some examples:

Table 12. Quantifier Examples

($a)(exists $b): $a ^1 $b	elements with children
($root)(forall $null): !$null ^1 $root	root elements

Query results

The result of a query is a list of n-tuples of elements (or, more precisely, variable bindings) satisfying the match condition, where n is the number of variables declared without quantifiers (cf. Quantifier). The query result is returned in the form of an XML document (or, abstractely, a new tree structure adjoined to the corpus). Each query match corresponds to a match element, with pointers representing variable bindings and the variable name given by the pointer's role.

An example result for a query involving variables $w and $p is:

  
<matchlist size="2">      
	<match n="1">          
		<nite:pointer role="w" xlink:href="..."/>          
		<nite:pointer role="p" xlink:href="..."/>      
	</match>      
	<match n="2">          
		<nite:pointer role="w" xlink:href="..."/>          
		<nite:pointer role="p" xlink:href="..."/>      
	</match>  
</matchlist>

Note

The matches are not ordered. The ordering of the results of two similar but not identical queries can be very different.

Complex queries

A complex query consists of a sequence of simple queries seperated by :: markers.

	complex_query     :=  complex_query :: query
	complex_query     :=  query

For a complex query, the leftmost query is evaluated first. Each query in the sequence operates on the result of the previous query. This means that for every match, the following query is evaluated with the variable bindings of the previous queries. The fixed variable bindings may be used anywhere in the ensuing queries. This evaluation strategy produces a hierarchically structured query result, where each match of the leftmost simple query includes a matchlist for the second query, etc.

In the example

($w word): $w@orth ~ /S.*/ :: ($p phone): $w ^ $p

the query result has the following structure:

  
<matchlist size="2">      
	<match n="1">          
		<nite:pointer role="w" xlink:href="..."/>          
		<matchlist type="sub" size="2">              
			<match n="1">                  
				<nite:pointer role="p" xlink:href="..."/>              
			</match>              
			<match n="2">                  
				<nite:pointer role="p" xlink:href="..."/>              
			</match>          
		</matchlist>      
	</match>      
	<match n="2">          
		<nite:pointer role="w" xlink:href="..."/>          
		<matchlist type="sub" size="1">              
			<match n="1">                  
				<nite:pointer role="p" xlink:href="..."/>              
			</match>          
		</matchlist>      
	</match>  
</matchlist>

Note

There are no empty submatches. If for a variable binding the following single query has no matches, the variable binding will be removed from the result. So the number of matches for a complex query is less than or equal to the number of matches for the first part.

Known Problems

IS THIS SECTION TO BE KEPT?

At Feb 05, there are a number of known problems with the current querylanguage implementation.

Multiple observations and timings

There is a bug when querying over multiple observations - the implementation considers times in different observations to be comparable, so that it's possible to get the result that an element in one observation is before some element in another. This is easy to get around: query on one observation at a time, or declare the reserved attribute for observation names for your corpus and add a test for the same observation as an extra query term - e.g.($f@obs = $g@obs), if the attribute declared is obs.

Search GUI and forall

The search GUI (whether called stand-alone or from a search menu) can't display results if some subquery in a complex query only has query matches that are bound with forall - e.g. ($f foo):($f@att="val")::(forall $g bar):!($g ^ $f)

Immediate Precedence

The immediate precedence operator is missing. Immediate precedence is equivalent to ($f foo)($g foo)(forall $h foo): ($f<<$g) && (($h=$f) || ($h=$g) || ($h<<$f) || ($g<<$h))

but, of course, this is cumbersome and can be too slow and memory-intensive for practical purposes, depending on the data set. Some common uses of the operator are covered by the NGramCalc utility.Another work-around is to create one XML tree from the NXT data thatrepresents the information required and query it using XPath. Export to LPath and tgrep2 would also be reasonable and are not difficult to implement. If you need to match on regular expressions of XML elements in order to add markup, (so, for instance, saying "find syntactic constituents with one determiner, followed by one or more adjectives, followed by one noun, and wrap a new tag around them"), but you can always use something like fsgmatch (from the LTG; new release, currently in beta, is called lxtransduce) and then modify the metadata to match. Remember, the data is just XML, amenable to all of the usual XML processing techniques.

Arithmetic

The arithmetic operators are missing.

At present, users who need them add new attributes to theirdata set and then carry on as normal. For instance, a researcher looking at how often bar elements start in the 10 seconds after foo elements end might add an "adjusted start" attribute to bar elements that take 10 secondsoff their official start times, and then use the query ($f foo)($b bar):(START($b) > END($f)) && ($b@adjustedstart < END($foo))

This stylesheet, run on a specific individual coding in the context of the MONITOR project, is an example of how this can be done. It just copies everything, adding new attributes to feedback gaze codes. We used this general technique on the Switchboard data to get lengths for syntactic constituents, and on the Monitor data to get durations.

This method is inconvenient, particularly for the sort of exploratory study that wishes to consider several different time relationships. We don't think it is worth adding special loading routines that addtemporary attributes for adjusted start and end times, but we could include some utilities for command line searching based on adjustments passed in on the command line. For instance, java CountWithTimeOffset -q '($t turn)($f feedback):($t # $f)' -t feedback -d 50 could mean to count overlaps after feedback elements have been displaced 50 seconds forward. We are considering whether this would be useful enough to supply.

Inability to handle namespacing

At present (Apr 05) the query language parser fails to handle namespacing properly, so any elements and attributes that are namespaced will be difficult to work with. For the timing and id attributes, where the default names are in the nite: namespace, this doesn'tmatter, since they are exposed to query via e.g. START($x), but namespacing other tags and attributes would make working with them difficult until this is fixed.

Speed and Memory Use

NXT's query engine is slow and uses a great deal of memory. For instance, some of our more complicated syntactic querieson the Switchboard corpus take 10 seconds per dialogue, or over an hour and a half for the entire corpus.

This is partly a consequence of what it does - the query languageis solving a harder problem than languages that operate on trees and/or are limited in their use of left and right context. It istrue that the current implementation is not fully optimized, but this is not something we intend to look at in the immediate future. Our first choice strategy for addressing this problem is to look at mapping NQL queries to XQuery for implementation, and addition of the missing operators, that way. Meanwhile, most of NXT's users are not actually engaged in real-time processing, and find that if they develop queries on a few observations using a GUI, they can then afford to run the queries at the command line in batch. The more they are interested in sparse phenomena, the less suitable this strategy is. For some query-based analyses, it is also useful to consider direct implementation using the NOM API, since the programmer can optimize for the analysis being performed.

Meanwhile, an hour and a half is OK for batch mode, but some of our queries areso common that we really want easy access to the results. We can get this by indexing. Using indices rather than the more complex syntactic queries makes querying roughly ten times faster. This will be even faster if one then selects not to load the syntax at all, which is possible if one doesn't need it for other parts of the subsequent query. You can choose not to load any part of the data by commenting out the <coding-file> tag for it in your local copy of the metadata file, or after NXT 1.3.0, by enabling lazy loading in your applications.

It's faster to use string equality than regular expression matching in the query language, and keep in mind the regular expressions have to match the entire string they are compared against, not just a substring of it.

The very desperate can write special purpose applications to evaluate their queries, which is faster especially for queries involving quantification. For instance,one user has adapted CountQueryResults to run part of the query he wants, but instead of returning the results, then checks the equivalent of hisforall conditions using navigation in the NOM.

Helpful hints

We recommend refining queries using display.bat/.sh on a single dialogue (probably spot-checking on a couple more, since observations vary), and running actual counts using the command line utilities. Build up queries term by term - the syntax error messages aren't always very easy to understand. Missing dollar signs, quotation marks, and parentheses are the worst culprits. Get around the bookmark problems and the lack of parenthesis and quote matching in the searchGUI by typing the query into something else that's handier (such as emacs) and pasting what you've written into the query window. You canand should include comments in queries if they are at all complicated. Queries have to be expressed on one line to run them at the command line, but you shouldn't try to author them this way - instead, postprocess a query developed in this more verbose style by taking out UNFINISHED SENTENCE

Analysis of query results can be expedited by thinking carefully aboutthe battery of tools that are available: knit, LT-XML, stylesheets, xmlperl, shell script loops, and so on. One interesting possibility is importing the query results into the data set, which would be a fancier, hierarchically structured form of indexing. At May 2004, the metadata <coding-file> declaration required to do this would be a little different for every query result, but we intend minor syntactic changes in both the query result XML and what knit produces to make this declaration static.