
NXT has its own query language: the NXT Query Language, or NQL. In this section, we describe NQL, list its operators, and give example queries.
NQL queries describe n-tuples of nodes, possibly constrained by type, and a set of conditions that expresses further constraints, for instance, on the attributes that a node contains or how two nodes relate to each other. If an n-tuple of nodes with the required types satisfies the conditions expressed in the query, it is said to be a match.
Syntactically, a query consists of two parts separated by a
	colon (:). The first part declares the
	variables for the query, and the second part expresses the
	conditions.
Table 3. Example showing general query structure
            
			($a)($b word): $a ^ $b  | This query matches pairs (2-tuples) of nodes in which the first, bound to $a, 
			can be of any type and the second, bound to $b, must be of type word, and 
			where the first dominates the second. In this example, ($a)($b word) is the declaration 
			part and the dominance relation $a ^ $b is the only condition.
		 | 
Formal definition:
query:=declarations:match_condition
	The declaration part of the query must contain a variable declaration for every variable mentioned in the conditions.
Each variable declaration is enclosed in parentheses. The components of a variable declaration are separated by 
whitespace. The first component is an optional quantifier; the possible quantifiers are 
forall and exists, which have their usual logical meanings.
	The second component is the name of the variable. A variable name is a $ character followed by an
	arbitrary number of letters and digits, which can include underscore ( _ ) and language-specific 
	characters. The final component is an optional type restriction. This is either a simple string expressing the 
	type of a node in the NOM or a disjunction of these types separated using the pipe symbol (|).
Table 4. Example declaration parts
		($a)  |       
		The query matches singletons (1-tuples) in which $a is bound to nodes of any 
		type for which the conditions are true.         
		Tip: An empty type definition may slow down query processing drastically.  | 
            
		($a word)($b sentence)  |             
		The query matches pairs (2-tuples) in which $a is bound to nodes of type 
		word and $b is bound to nodes of type 
		sentence for which the conditions are true.
     | 
            
		($a word)($b word)  |             
		The query matches pairs (2-tuples) in which $a is bound to nodes of type word
		and $b is bound to nodes of type word for which the conditions are true.
		Caution: In a pair, both variables might be bound to the same node.  | 
            
		($a word | phrase | sentence)  |             
		The query matches singletons (1-tuples) in which $a is bound to nodes of type word, 
		phrase, or sentence for which the conditions are true.
	 | 
            
		($b sentence)(forall $a word)  |             
		The query matches singletons (1-tuples) in which $b is bound to any node that has type            
		sentence for which the conditions are true for every possible way of binding $a
		to individual nodes that have type word.	    
	 | 
Formal definition:
declarations:=declarations var_declaration
declarations:=var_declaration
var_declaration:=(variable)
var_declaration:=(variable typedeclaration)
typedeclaration:=types
typedeclaration:=type
types:=types|type
Grammar for forall/exists missing. We want to use some rule system with optionality or this is going to take forever - check what docbook supports?
The condition part is a Boolean expression over property tests,      
	structural relations, and temporal relations. Parentheses are only      
	needed if a lower precedence relation should be executed first. The      
	strongest binding operator is negation ( ! ). 
	The operators are listed in the order of their precedence below:
       
		Negation: not or !
	
          
		Conjunction: and or & or &&
	
          
		Disjunction: or or | or ||
	
          
		Implication: ->
	
For convenience, there are a number of syntactic literals for each operator. 
	The different forms for the same operator are completely synonymous. 
	The implication operator -> is the weakest binding operator and, 
	again, is provided for convenience; a -> b is logically equivalent to 
	!a | b.      
Formal definition:
match_condition:=
match_condition:=(match_condition)
match_condition:=!match_condition
match_condition:=match_condition&match_condition
match_condition:=match_condition|match_condition
match_condition:=match_condition->match_condition
match_condition:=property_test
match_condition:=structural_relation
match_condition:=time_relation
The condition part may be empty, in which case the query always evaluates to true.
A property test either tests for the existence of some property or compares the values of two properties.
A property may be an attribute value, a constant, or the result of a function applied to an element.
The query language contains a number of functions that take elements and return some property of the element. 
	These functions have to do with properties that are special in NXT: timing, 
	identification, and textual content. Function names can be given in upper or lower case, and the operators 
	= and == are synonymous. Where the data storage format for a corpus stores 
	these properties as XML attributes, they can also be queried using the form 
	variable@attribute. The functional form is preferred because it is the same across all data 
	sets and leaves the user in no doubt that these are the attributes holding the special properties. 
The functions are as follows:
Table 5. Query functions
                
		TEXT($w)  |                 
		Returns the text contained by the element matched by $w.
	 | 
                
		ID($w)  |                 
		Returns the unique identifier of the element matched by $w.
	 | 
                
		TIMED($w)  |                 
		Returns true if the element matched by $w has start and end times, and false otherwise.
	 | 
                
		START($w)  |                 
		Returns the start time of the element matched by $w. 
     | 
                
		END($w)  |                 
		Returns the end time of the element matched by $w. 
     | 
                
		DURATION($w)  |                 
		Returns the duration of the element matched by $w (that is, the end time minus the start time).          
	 | 
                
		CENTER($w)  |                 
		Returns the temporal center of the element matched by $w (that is, the midpoint between its start and end times). 
	 | 
Formal definition:
property:="number_or_string"
property:=variable@attribute
property:=TEXT(variable)
property:=ID(variable)
property:=TIMED(variable)
property:=START(variable)
property:=END(variable)
property:=DURATION(variable)
property:=CENTER(variable)
Numbers and string values must be placed in quotes. Placing a number in quotes does not mean it will be treated as a string. Either single or double quotes can be used, but if you are passing a query as an argument at the command line, your choice must be compatible with your choice of quotes for the shell.
Property existence tests check for the existence of some property.
Table 6. Node property tests
                
		$w@pos  |                 
		True if and only if the element matched by $w has a pos attribute. 
	 | 
                
		TIMED($a)  |                 
		True if and only if the element matched by $a is timed, either because it has both start and 
		end times or because it inherits them from its children.
	 | 
                
		START($a)  |                 
		True if and only if the element matched by $a has a start time, either in its own right or 
		by inheritance from its children.
     | 
                
		END($a)  |                 
		True if and only if the element matched by $a has an end time, either in its own right or by  
		inheritance from its children.
	 | 
                
		TEXT($a)  |                 
		True if and only if the element matched by $a contains text. 
	 | 
It is actually possible to test for the existence of any property, but the other possible existence tests are not useful: 
	IDs, strings, and numbers always 
	exist, and duration and center properties exist for exactly those elements 
	that are timed.
Formal definition:
property_test:=variable@attribute
property_test:=TIMED(variable)
property_test:=START(variable)
property_test:=END(variable)
property_test:=TEXT(variable)
The next set of property tests compares two values for equality and order. Property expressions are weakly typed. The value 
resulting from an expression will be interpreted as a floating-point 
number when possible. If it cannot be converted into a number, or if
the value is compared to a pattern given by a regular expression, 
the value will be treated as a string. A number is always unequal to 
a string. Strings are ordered alphabetically and are 
case-sensitive: strings starting with upper case letters are less 
than strings starting with lower case letters. As a result, "2" == "2.0" is true, while 
"Two" == "two" is false.
Table 7. Equality and order tests
                ($x): $x@cat=="NP"  | Matches elements with a cat attribute containing the string value "NP". | 
                ($x)($y): $x@cat==$y@cat  | 
                Matches pairs of elements with the same cat attribute
                (including the pair where $x and $y
                are bound to the same element). 
               | 
                ($x)($y): $x@cat==$y@cat & $x != $y  | 
                Matches pairs of elements with the same cat attribute
                (excluding the pair where $x and $y
                are bound to the same element).
               | 
            
= and == are synonymous.          
Formal definition:
property_test:=property==property
property_test:=property!=property
property_test:=property<property
property_test:=property>property
property_test:=property<=property
property_test:=property>=property
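The weak typing rules for equality tests can be illustrated outside NXT. The following Python sketch is not NXT code: the helper names `to_number` and `nql_equal` are hypothetical, and Python's `float()` is only assumed to approximate the conversion NXT performs in Java. It shows numeric comparison when both values convert, string comparison otherwise, and a number never being equal to a string.

```python
def to_number(value: str):
    """Return the value as a float if it parses as a number, else None.

    Assumption: Python's float() stands in for NXT's own conversion,
    whose exact rules may differ in corner cases.
    """
    try:
        return float(value)
    except ValueError:
        return None


def nql_equal(left: str, right: str) -> bool:
    """Mimic the documented behaviour of == on two quoted values."""
    left_num, right_num = to_number(left), to_number(right)
    if left_num is not None and right_num is not None:
        return left_num == right_num   # both numeric: compare as floats
    if (left_num is None) != (right_num is None):
        return False                   # a number is never equal to a string
    return left == right               # both strings: case-sensitive


print(nql_equal("2", "2.0"))
print(nql_equal("Two", "two"))
print("Two" < "two")                   # upper case sorts before lower case
```

Under these assumptions the first comparison holds because both values convert to the same number, and the second fails because string comparison is case-sensitive.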
The final set of property tests compares string values against 
regular expressions. Regular expressions are enclosed in slashes (/). NXT's regular expression 
	implementation uses Java regular expressions underneath, so it is not possible to give a definitive syntax for the patterns here. 
	Instead, see the 
	Java 1.5 Regular Expression Documentation
 or the equivalent documentation for your Java version. 
Table 8. Regular expression examples
                ($a): text($a) ~ /th.*/  | 
                Words starting with th. Dot
                (.) means any single character, and
                * means 0 or more repetitions of whatever
                it follows.
               | 
                ($a): text($a) ~ /[dD](as|er)/  | 
                The words das and
                der, whether capitalized or not. 
               | 
                ($a): text($a) ~ /.+([0-9A-Z])+.*/  | 
                Words which contain at least one uppercase letter or
                digit at a non-initial position. The plus
                (+) means 1 or more repetitions of whatever
                it follows, and the square brackets
                ([]) specify a character class. 
               | 
                ($a): text($a) ~ /\.*/  | 
                A possibly empty sequence of dots, where in contrast
                /.*/ matches every word (assuming it contains
                text). The backslash (\) means the dot
                (.) is interpreted literally.
               | 
            
Your regular expression must match the entire string, not some substring contained in it. /x/ in NQL notation 
	means /^x$/ in the Perl notation.
Formal definition:
property_test:=property~ /pattern/
property_test:=property!~ /pattern/
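The whole-string matching rule is easy to get wrong, so here is a small Python sketch of it. It uses Python's `re.fullmatch` as a stand-in for the Java regular expression engine that NXT actually uses; the two engines agree on the simple patterns in the table above but differ in some advanced syntax, and the function name `nql_regex_match` is hypothetical.

```python
import re


def nql_regex_match(value: str, pattern: str) -> bool:
    """Return True if the pattern matches the entire value.

    This mirrors the rule that /x/ in NQL behaves like /^x$/ in Perl.
    """
    return re.fullmatch(pattern, value) is not None


print(nql_regex_match("the", r"th.*"))         # starts with "th"
print(nql_regex_match("bathe", r"th.*"))       # "th" is not at the start
print(nql_regex_match("Das", r"[dD](as|er)"))
print(nql_regex_match("...", r"\.*"))          # a sequence of literal dots
```

To find a substring with this rule, the pattern has to allow for the surrounding text explicitly, for example `.*th.*`.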
Comments are allowed in the form of line comments and block
      comments. Line comments start with the
      symbol // and include the remainder of the current
      line. Block comments begin with
      /* and end with */, and may extend
      over multiple lines.
The simplest structural relation asserts the identity or non-identity 
of two elements. Since the default evaluation strategy allows different variables to be bound to the same element, the 
!= operator is sometimes necessary to exclude unwanted results. The      
== operator is less useful and was mainly added for the sake of symmetry.      
structural_relation:=variable==variable
structural_relation:=variable!=variable
The basic structural relation is the dominance relation ^. To state that an element a
 dominates an element b, the dominance operator ^ is used. In other words, 
a is an ancestor of b.      
structural_relation:=variable^variable
structural_relation:=variable^distance variable
The expression a^a
 is always true! Use the non-identity operator to exclude this special case.
Two elements are in a precedence relation if they have a common ancestor element, which can be a normal element or 
the root element of a layer. An element $x precedes another element $y if some ancestor of      
$x (or $x itself) is a preceding sibling of some ancestor of $y (or $y itself).      
 
structural_relation:=variable<>variable
The expression 
a<>a is always false!
Some examples:
Table 10. Structural relations examples
		($a)($b): $a ^ $b & $a != $b  | all combinations of two different elements in a dominance relation | 
		($s syntax)($w word): $s ^1 $w  | all combinations of syntax and word elements, where the syntax element dominates directly the word element | 
	($a)($b): $a ^0 $b  | equal to $a == $b
	 | 
		($a)($b): $a ^-2 $b  | equal to $b ^2 $a
	 | 
		($a word)($b word): $a <> $b  | two words, $a precedes $b
	 | 
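The dominance examples in Table 10 can be checked against a toy model. The Python sketch below is an illustration only: it assumes a single-parent tree stored in a hypothetical `PARENT` dictionary, whereas the NOM allows richer structures, and the function names are invented for this example. It treats dominance as reflexive (a^a is true), reads a distance of 0 as identity, and reads a negative distance as the relation with its arguments swapped.

```python
# Toy tree: a sentence dominating a noun phrase that dominates two words.
PARENT = {"s1": None, "np1": "s1", "w1": "np1", "w2": "np1"}


def steps_up(ancestor, node):
    """Number of parent steps from node up to ancestor, or None if unrelated."""
    steps = 0
    current = node
    while current is not None:
        if current == ancestor:
            return steps
        current = PARENT[current]
        steps += 1
    return None


def dominates(a, b, distance=None):
    """Sketch of $a ^ $b, or $a ^distance $b when a distance is given."""
    if distance is not None and distance < 0:
        return dominates(b, a, -distance)   # $a ^-2 $b equals $b ^2 $a
    steps = steps_up(a, b)
    if steps is None:
        return False
    return distance is None or steps == distance


print(dominates("s1", "w1"))       # ancestor at any depth
print(dominates("np1", "w1", 1))   # direct parent
print(dominates("s1", "s1"))       # reflexive, hence the need for !=
```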
Table 11. Temporal relations examples
| Op., short | Operator, lexical | Definition | 
|---|---|---|
| % | overlaps.left | (start($a) <= start($b)) and (end($a) > start($b)) and (end($a) <= end($b)) |
| [[ | left.aligned.with | start($a) == start($b) |
| ]] | right.aligned.with | end($a) == end($b) |
| @ | includes inclusion | (start($a) <= start($b)) and (end($a) >= end($b)) |
| [] | same.extent.as | (start($a) == start($b)) and (end($a) == end($b)) |
| # | overlaps.with | (end($a) > start($b)) and (end($b) > start($a)) |
| ][ | contact.with | end($a) == start($b) |
| << | precedes | end($a) <= start($b) |
| | starts.earlier.than | start($a) <= start($b) |
| | starts.later.than | start($a) >= start($b) |
| | ends.earlier.than | end($a) <= end($b) |
| | ends.later.than | end($a) >= end($b) |
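The definitions in Table 11 translate directly into comparisons on start and end times. The following Python sketch transcribes several of them; the `Span` type and the function names are hypothetical and are not part of NXT. It can be handy for checking by hand which temporal operator a given pair of intervals satisfies.

```python
from typing import NamedTuple


class Span(NamedTuple):
    """Start and end time of a timed element."""
    start: float
    end: float


def overlaps_left(a, b):   # $a % $b
    return a.start <= b.start and a.end > b.start and a.end <= b.end


def includes(a, b):        # $a @ $b
    return a.start <= b.start and a.end >= b.end


def same_extent_as(a, b):  # $a [] $b
    return a.start == b.start and a.end == b.end


def overlaps_with(a, b):   # $a # $b
    return a.end > b.start and b.end > a.start


def contact_with(a, b):    # $a ][ $b
    return a.end == b.start


def precedes(a, b):        # $a << $b
    return a.end <= b.start


a, b = Span(0.0, 2.0), Span(1.0, 3.0)
print(overlaps_left(a, b), overlaps_with(a, b), precedes(a, b))
```

Note that, by these definitions, two elements in contact (the end of one equal to the start of the other) satisfy precedes but not overlaps.with.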
Expressing complex structural relations sometimes requires auxiliary elements that should not be part of the query result. Sometimes it is sufficient that one such element satisfies the match condition; sometimes all auxiliary elements must match.
The mathematical solution to this problem is the use of existential and universal quantifiers. In NQL, variables can be existentially or universally quantified. In both cases, elements that are bound to a quantified variable are not part of the result.
The formal definition of the declaration part is now extended with quantifiers:
var_declaration:=(exists variable)
var_declaration:=(exists variable typedeclaration)
var_declaration:=(forall variable)
var_declaration:=(forall variable typedeclaration)
In queries with quantifiers, the implication operator -> can be useful (see the condition part).
Some examples:
Table 12. Quantifier Examples
		($a)(exists $b): $a ^1 $b  | elements with children | 
		($root)(forall $null): !$null ^1 $root  | root elements | 
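The two quantifier examples can be read as ordinary "any" and "all" tests over candidate bindings. The Python sketch below illustrates that reading on a toy single-parent tree; the `PARENT` dictionary and the helper `is_child` are assumptions made for this example and say nothing about how NXT evaluates quantifiers internally. Only the unquantified variable appears in each result.

```python
# Toy tree: a sentence dominating a noun phrase that dominates two words.
PARENT = {"s1": None, "np1": "s1", "w1": "np1", "w2": "np1"}
NODES = list(PARENT)


def is_child(parent, child):
    """Sketch of $parent ^1 $child on the toy tree."""
    return PARENT[child] == parent


# ($a)(exists $b): $a ^1 $b   -> elements with children
elements_with_children = [
    a for a in NODES if any(is_child(a, b) for b in NODES)
]

# ($root)(forall $null): !$null ^1 $root   -> root elements
root_elements = [
    root for root in NODES if all(not is_child(n, root) for n in NODES)
]

print(elements_with_children)
print(root_elements)
```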
The result of a query is a list of n-tuples of elements (or, 
more precisely, variable bindings) satisfying the match condition, where      
n is the number of variables declared without quantifiers (cf. the      
description of quantifiers). 
	The query result is returned in the form of an XML document (or, abstractly, a new tree structure adjoined      
	to the corpus). Each query match corresponds to a match
 element, with pointers representing variable bindings and the variable name given by the pointer's role.
An example result for a query involving variables $w and 
$p is:   
<matchlist size="2">
  <match n="1">
    <nite:pointer role="w" xlink:href="..."/>
    <nite:pointer role="p" xlink:href="..."/>
  </match>
  <match n="2">
    <nite:pointer role="w" xlink:href="..."/>
    <nite:pointer role="p" xlink:href="..."/>
  </match>
</matchlist>
The matches are not ordered. The ordering of the results of two similar but not identical queries can be very different.
A complex query consists of a sequence of simple queries separated by      
:: markers.
complex_query:=complex_query::query
complex_query:=query
For a complex query, the leftmost query is evaluated first. Each query in the sequence operates on the result of the previous query. This means that for every match, the following query is evaluated with the variable bindings of the previous queries. The fixed variable bindings may be used anywhere in the ensuing queries. This evaluation strategy produces a hierarchically structured query result, where each match of the leftmost simple query includes a matchlist for the second query, etc.
In the example
($w word): $w@orth ~ /S.*/ :: ($p phone): $w ^ $p
the query result has the following structure:
<matchlist size="2">
  <match n="1">
    <nite:pointer role="w" xlink:href="..."/>
    <matchlist type="sub" size="2">
      <match n="1">
        <nite:pointer role="p" xlink:href="..."/>
      </match>
      <match n="2">
        <nite:pointer role="p" xlink:href="..."/>
      </match>
    </matchlist>
  </match>
  <match n="2">
    <nite:pointer role="w" xlink:href="..."/>
    <matchlist type="sub" size="1">
      <match n="1">
        <nite:pointer role="p" xlink:href="..."/>
      </match>
    </matchlist>
  </match>
</matchlist>
There are no empty submatches. If for a variable binding the following single query has no matches, the variable binding will be removed from the result. So the number of matches for a complex query is less than or equal to the number of matches for the first part.
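The evaluation strategy for complex queries amounts to a nested loop in which outer matches without inner matches are discarded. The Python sketch below imitates this for the example query on invented data: the `WORDS` and `PHONES` lists, their fields, and the flat "word" link standing in for dominance are all assumptions for illustration, not NXT data structures.

```python
import re

# Invented toy data: each phone records the word that dominates it.
WORDS = [
    {"id": "w1", "orth": "So"},
    {"id": "w2", "orth": "Sure"},
    {"id": "w3", "orth": "yes"},
    {"id": "w4", "orth": "See"},   # matches /S.*/ but has no phones
]
PHONES = [
    {"id": "p1", "word": "w1"},
    {"id": "p2", "word": "w1"},
    {"id": "p3", "word": "w2"},
    {"id": "p4", "word": "w3"},
]


def evaluate():
    """Sketch of ($w word): $w@orth ~ /S.*/ :: ($p phone): $w ^ $p."""
    results = []
    for word in WORDS:
        if re.fullmatch(r"S.*", word["orth"]) is None:
            continue                      # first simple query fails
        sub = [{"p": p["id"]} for p in PHONES if p["word"] == word["id"]]
        if sub:                           # no empty submatches are kept
            results.append({"w": word["id"], "sub": sub})
    return results


for match in evaluate():
    print(match)
```

In this toy data, w4 satisfies the first simple query but is dropped because the second query finds no phones for it, so the complex query returns fewer matches than its first part alone.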
IS THIS SECTION TO BE KEPT?
At Feb 05, there are a number of known problems with the current query language implementation.
There is a bug when querying over multiple observations: the implementation considers 
	times in different observations to be comparable, so that it's possible to get the result that 
	an element in one observation is before some element in another.  This is easy to get around: 
	query on one observation at a time, or declare the reserved attribute for observation names for 
	your corpus and add a test for the same observation as an extra query term, e.g. ($f@obs = $g@obs), 
	if the attribute declared is obs. 
The search GUI (whether called stand-alone or from a search menu) can't display results if 
	some subquery in a complex query only has query matches that are bound with forall - e.g. 
($f foo):($f@att="val")::(forall $g bar):!($g ^ $f)
The immediate precedence operator is missing.  Immediate
precedence is equivalent to
($f foo)($g foo)(forall $h foo): ($f<<$g) && (($h=$f) || ($h=$g) || ($h<<$f) || ($g<<$h))
but, of course, this is cumbersome and can be too slow and memory-intensive for practical purposes, 
	depending on the data set. Some common uses of the operator are covered by the NGramCalc
	 utility. Another work-around 
	is to create one XML tree from the NXT data that represents the information required and query it using XPath.  
	Exports to LPath and tgrep2 would also be reasonable and 
	are not difficult to implement. If you need to match on 
	regular expressions over XML elements in order to add markup (for instance, "find syntactic constituents 
	with one determiner, followed by one or more adjectives, followed by one noun, and wrap a new tag around them"), 
	you can always use something like fsgmatch (from the LTG; the new release,
	currently in beta, is called lxtransduce) and 
	then modify the metadata to match.  Remember, the data is just XML, amenable to all of the usual XML processing 
	techniques.
	
The arithmetic operators are missing.
At present, users who need them add new attributes to their data set and then carry on as normal. 
	For instance, a researcher looking at how often bar elements start in the 10 seconds after 
	foo 
	elements end might add an "adjusted start" attribute to bar elements 
	that takes 10 seconds off their official start 
	times, and then use the query
	($f foo)($b bar):(START($b) > END($f)) && ($b@adjustedstart < END($f))
This stylesheet, run on a specific individual coding in the context of the MONITOR project, is an example of how this can be done. It just copies everything, adding new attributes to feedback gaze codes. We used this general technique on the Switchboard data to get lengths for syntactic constituents, and on the Monitor data to get durations.
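The attribute-adding step described above does not have to be done with a stylesheet. The following Python sketch shows the same idea using the standard library's ElementTree, under loud assumptions: the sample XML, the plain `start` attribute, and the `adjustedstart` name are all invented for illustration, and real NXT coding files normally keep start times in a namespaced attribute that this sketch does not handle.

```python
import xml.etree.ElementTree as ET

# Invented sample data; real NXT files use their own element and
# attribute names (typically with a nite: namespace for timings).
SAMPLE = """<bars>
  <bar id="b1" start="12.5" end="14.0"/>
  <bar id="b2" start="30.0" end="31.0"/>
</bars>"""


def add_adjusted_start(root, tag="bar", offset=10.0):
    """Add an adjustedstart attribute equal to start minus offset."""
    for element in root.iter(tag):
        start = float(element.get("start"))
        element.set("adjustedstart", str(start - offset))
    return root


root = add_adjusted_start(ET.fromstring(SAMPLE))
print(ET.tostring(root, encoding="unicode"))
```

After such a pass, the adjusted attribute can be compared in a query like any other attribute, which is what the query above relies on.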
This method is inconvenient, particularly for the sort of exploratory study that wishes to consider several 
	different time relationships. We don't think it is worth adding special loading routines that add temporary 
	attributes for adjusted start and end times, but we could include some utilities for command line searching 
	based on adjustments passed in on the command line.  For instance, 
java CountWithTimeOffset -q '($t turn)($f feedback):($t # $f)' -t feedback -d 50
could mean to count overlaps after feedback elements have been displaced 50 seconds forward.   
	We are considering whether this would be useful enough to supply.
	
At present (Apr 05) the query language parser fails to handle namespacing properly, so any elements and attributes 
	that are namespaced will be difficult to work with.  For the timing and id attributes, where the
	default names are in the nite: namespace, this doesn't matter, since they are exposed 
	to query via e.g. 
START($x), but namespacing other tags and attributes would make working with them difficult until this is fixed.
		
NXT's query engine is slow and uses a great deal of memory. For instance, some of our more complicated syntactic queries on the Switchboard corpus take 10 seconds per dialogue, or over an hour and a half for the entire corpus.
This is partly a consequence of what it does: the query language is solving a harder problem than languages that operate on trees and/or are limited in their use of left and right context. It is true that the current implementation is not fully optimized, but this is not something we intend to look at in the immediate future. Our first choice strategy for addressing this problem is to look at mapping NQL queries to XQuery for implementation, and adding the missing operators that way. Meanwhile, most of NXT's users are not actually engaged in real-time processing, and find that if they develop queries on a few observations using a GUI, they can then afford to run the queries at the command line in batch. The more they are interested in sparse phenomena, the less suitable this strategy is. For some query-based analyses, it is also useful to consider direct implementation using the NOM API, since the programmer can optimize for the analysis being performed.
Meanwhile, an hour and a half is OK for batch mode, but some of our queries are so common that we 
	really want easy access to the results.  We can get this by indexing.   
	Using indices rather than the more complex syntactic queries makes querying roughly ten times faster.  
	This will be even faster if one then selects not to load the syntax at all, which is possible if one 
	doesn't need it for other parts of the subsequent query.  
	You can choose not to load any part of the data by commenting out the 
<coding-file>
 tag for it in your local copy of the metadata file, or after NXT 1.3.0, by enabling lazy loading 
	in your applications.
It's faster to use string equality than regular expression matching in the query language, and keep in mind the regular expressions have to match the entire string they are compared against, not just a substring of it.
The very desperate can write special purpose applications to evaluate their queries, 
	which is faster especially for queries involving quantification.  
	For instance, one user has adapted CountQueryResults to run part of the query he wants,
	but instead of returning the results, it then checks the equivalent of his forall conditions using navigation in the NOM.
	
We recommend refining queries using display.bat/.sh on a single dialogue (probably 
	spot-checking on a couple more, since observations vary), and running actual counts using the 
	command line utilities. Build up queries term by term - the syntax error messages aren't always very easy 
	to understand. Missing dollar signs, quotation marks, and parentheses are the worst culprits.  
	Get around the bookmark problems and the lack of parenthesis and quote matching in the search GUI
	 by typing the query into something else that's handier (such as emacs) 
	and pasting what you've written into the query window.  
	You can and should include comments in queries if they are at all complicated. 
	Queries have to be expressed on one line to run them at the command line, but you shouldn't try 
	to author them this way - instead, postprocess a query developed in this more verbose style by taking out
	UNFINISHED SENTENCE
Analysis of query results can be expedited by thinking carefully about the battery of tools that are 
	available: knit, LT-XML, stylesheets, xmlperl, 
	shell script loops, and so on.  One interesting possibility is importing the query results into the data set, 
	which would be a fancier, hierarchically structured form of indexing.  At May 2004, the metadata 
<coding-file>
 declaration required to do this would be a little different for every query result, 
	but we intend minor syntactic changes in both the query result XML and what knit produces 
	to make this declaration static.
The main documentation is the query language reference manual. Virtually the same information can be found on the help menu of the search window (if you don't find it there, it's an installation problem). An older document with more contextual information can be found here.
At September 2006, we plan a revised version of the manual. The current version fails to give details 
	about the operator for finding out whether two elements are linked via a pointer with a role.
	($a <"foo" $b) is true if $a points to $b using the 
	"foo" role; the role name can be omitted, but if it is specified it can only be 
	given as a textual string, not as a regular expression.  The current version also fails to make clear 
	that the regular expression examples given are only a subset of the possibilities.  
	The exact regular expression syntax depends on your version of Java, since it is implemented using the 
	java.util.regex package.  Java 1.5 regular expression documentation can be found
here
.
Here are some worked examples for the Switchboard data sample and the Monitor data sample.
	Computer scientists and people familiar with first order predicate calculus have tended to be happy 
	with the reference manual plus the examples; other people need more (so, for instance, don't know what 
	implication is or what forall is likely to mean) and we're still thinking about what we 
	might be able to provide for them.
At Nov 2004, there are a few things described in the query documentation that haven't been implemented yet 
	(and aren't on the workplan for immediate development). This includes arithmetic operators and temporal fuzziness.
	We thought this included versions of ^ and 
<> limited by distance, but users report that these (or some of these?)
	work.  Also, some versions of the query documentation show ; instead of : as 
	the separator between bindings and match conditions.  The only major bug we've run into (at Nov 2004) is that 
	temporal operators will perform comparisons across observations, even though time in different observations is 
	meant to be independent. After NXT-1.2.6, 05 May 04, one can in the metadata declare a reserved attribute to use 
	for the observation name that will be 
	added automatically for every element, providing a work-around.
There's a nifty visual demo that runs on a toy corpus and might be useful for deciding whether this stuff is useful in the first place.