Using NXT in conjunction with other tools

This section contains specific information for potential users of NXT about how to record, transcribe, and otherwise mark up their data before up-translation to NXT. NXT's earliest users have mostly been from computational linguistics projects. This is partly because of where it comes from - it arose out of a collaboration between two computational linguistics groups and an interdisciplinary research centre - and partly because for most uses, its design assumes that the projects that use it will have access to a programmer to set up tailored tools for data coding and to extract some kinds of analysis, or at the very least someone on the project who is willing to look at XML. However, NXT is useful for linguistics and psychology projects based on corpus methods as well. This section is primarily aimed at them: it points out problems to look out for, helps them assess what degree of technical help they will need in order to carry out the work successfully, and gives a sense of what sorts of things are possible with the software.

Recording Signals

Signal Formats

For information on media formats and JMF, see How to Play Media in NXT.

It is a good idea to produce a sample signal and test it in NXT (and any other tools you intend to use) before starting recording proper, since changing the format of a signal can be confusing and time-consuming. There are two tests that are useful. The first is whether you can view the signal at all under any application on your machine, and the second is whether you can view the signal from NXT. The simplest way of testing the latter is to name the signal as required for one of the sample data sets in the NXT download and try the generic display or some other tool that uses the signal. For video, if the former works and not the latter, then you may have the video codec you need, but NXT can't find it - it may be possible to fix the problem by adding the video codec to the JMF Registry. If neither works, the first thing to look at is whether or not you have the video codec you need installed on your machine. Another common problem is that the video is actually OK, but the header written by the video processing tool (if you performed a conversion) isn't what JMF expects. This suggests trying to convert in a different way, although some brave souls have been known to modify the header in a text editor.

We have received a request to implement an alternative media player for NXT that uses QT Java (the QuickTime API for Java) rather than JMF. This would have advantages for Mac users and might help some PC users. We're currently considering whether we can support this request.

Capturing Multiple Signals

Quite often data sets will have multiple signals capturing the same observation (videos capturing different angles, one audio signal per participant, and so on). NXT expresses the timing of an annotation by offsets from the beginning of the audio or video signal. This means that all signals should start at the same time. This is easiest to guarantee if they are automatically synchronized with each other, which is usually done by taking the timestamp from one piece of recording equipment and using it to overwrite the locally produced timestamps on all the others. (When we find time to ask someone who is technically competent exactly how this is done, we'll insert the information here.) A distant second best to automatic synchronization is to provide some audibly and visibly distinctive event (hitting a colourful children's xylophone, for instance) that can be used to manually edit the signals so that they all start at the same time.

Using Multiple Signals

Most coding tools will allow only one signal to be played at a time. It's not clear that more than this is ever really required, because it's possible to render multiple signals onto one. For instance, individual audio signals can be mixed into one recording covering everyone in the room, for tools that require everyone to be heard on the interface. Soundless video or video with low quality audio can have higher quality audio spliced onto it. For the purposes of a particular interface, it should be possible to construct a single signal to suit, although these might be different views of the data for different interfaces (hence the requirement for synchronization - it is counter-productive to have different annotations on the same observation that use different time bases). The one sticking point is where combining multiple videos into one split-screen view results in an unacceptable loss of resolution, especially in data sets that do not have a "room view" video in addition to, say, individual videos of the participants.

From NXT 1.3.0 it is possible to show more than one signal simultaneously by having the application put up more than one media player. If one signal is selected as the master by clicking the checkbox on the appropriate media player, that signal will control the time for all the signals: it will be polled for the current time, which is then sent to the other signals (and anything else that monitors time). The number of signals that can successfully play in sync on NXT depends on the spec of your machine and the encoding of the signals. Where sync is seriously out, NXT will attempt to correct the drift by pausing all the signals and re-aligning. If this happens too often, it's a good sign that your machine is struggling. If you intend to rely on synchronization of multiple signals, you should test your formats and signal configuration on your chosen platform carefully.

Transcription

One of the real benefits of using NXT is the fact that it puts together timing information and linguistic structure. This means that most projects transcribing data with an eye to using NXT want a transcription tool that allows timings to be recorded. For rough timings, a tool with a signal (audio or video) player will do, especially if it's possible to slow the signal down and go back and forth a bit to home in on the right location (although this greatly increases expense over the sort of "on-line" coding performed simply by hitting keys for the codes as the signal plays). For accurate timing of transcription elements - which is what most projects need - the tool must show the speech waveform and allow the start and end times of utterances (or even words) to be marked using it.

NXT does not provide any interface for transcription. It's possible to write an NXT-based transcription interface that takes times from the signal player, but no one has. Providing one that allows accurate timestamping is a major effort because NXT doesn't (yet?) contain a waveform generator. For this reason, you'll want to do transcription in some other tool and import the result into NXT.

Using special-purpose transcription tools

There are a number of special-purpose transcription tools available. For signals that effectively have one speaker at a time, most people seem to use Transcriber or perhaps TransAna. For group discussion, ChannelTrans, which is a multi-channel version of Transcriber, seems to be the current tool of choice. iTranscribe is a ground-up rewrite of it that is currently in pre-release.

Although we have used some of these tools, we've never evaluated them from the point of view of non-computational users (especially whether or not installation is difficult or whether in practice they've required programmatic modification), so we wouldn't want to endorse any particular one, and of course, there may well be others that work better for you.

Transcriber's transcriptions are stored in an XML format that can be up-translated to NXT format fairly simply. TransAna's are stored in an SQL database, so the up-translation is a little more complicated; we've never tried it but there are NXT users who have exported data from SQL-based products into whatever XML format they support and then converted that into NXT.

Using programs not primarily intended for transcription

Some linguistics and psychology-based projects use programs they already have on their computers (like Microsoft Word and Excel) for transcription, without any modification. This is because (a) they know they want to use spreadsheets for data analysis (or to prepare data for importation into SPSS) and they know how to get there from here; (b) they can't afford software licenses but they know they've already paid for these ones; and (c) they aren't very confident about installing other software on their machines.

Using unmodified standard programs can be successful, but it takes very careful thought about the process, and we would caution potential users not to launch blindly into it. We would also argue that since there are now programs specifically for transcription that are free and work well on Windows machines, there is much less reason for doing this than there used to be. However, whatever you do for transcription, you want to avoid the following.

  • hand-typing times (for instance, from a display on the front of a VCR), because the typist will get them wrong

  • hand-typing codes (for instance, {laugh}), because the typist will get them wrong

In short, avoid hand-typing anything but the orthography, and especially anything involving numbers or left and right bracketing. These are practices we still see regularly, mostly when people ask for advice about how to clean up the aftermath. The clean-up is extremely boring to do, because it takes developing rules for each problem ({laughs}, {laugh, laugh), laugh, {laff}, {luagh}... including each possible way of crossing nested brackets accidentally), inspecting the set as you go to see what the next rule should be. Few programmers will take on this sort of job voluntarily (or at least not twice), which can make it expensive. It is far better (...easier, less stressful, better for staff relations, less expensive...) to sort out your transcription practices to avoid these problems.
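
To give a flavour of what that clean-up involves, here is a minimal sketch in Java. The variant spellings, the rules, and the target code {laugh} are purely illustrative assumptions, not conventions from any particular corpus; a real rule set has to be grown by inspecting your own transcripts.

import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of ordered clean-up rules for hand-typed codes; purely illustrative.
public class CodeNormaliser {
    private static final Map<String, String> RULES = new LinkedHashMap<>();
    static {
        // each rule maps a pattern observed in the transcripts to the canonical code
        RULES.put("\\{laughs\\}", "{laugh}");                  // {laughs}
        RULES.put("\\{(?:laff|luagh|laugh)[)}]", "{laugh}");   // misspellings, mismatched brackets
        RULES.put("(?<![{\\w])laugh(?![\\w}])", "{laugh}");    // bare "laugh" with no brackets at all
    }

    public static String normalise(String line) {
        for (Map.Entry<String, String> rule : RULES.entrySet()) {
            line = line.replaceAll(rule.getKey(), rule.getValue());
        }
        return line;
    }

    public static void main(String[] args) {
        System.out.println(normalise("he left {laughs} and then laugh"));
    }
}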

More as a curiosity than anything else, we will mention that it is possible to tailor Microsoft Word and Excel to contain buttons on the toolbars for inserting codes, and to disable keys for curly brackets and so on, so that the typist can't easily get them wrong. We know of a support programmer who was using these techniques in the mid-90s to support corpus projects, and managed to train a few computationally-unskilled but brave individuals to create their own transcription and coding interfaces this way. If you really must use these programs, you really should consider these techniques. (Note to the more technical reader or anyone trying to find someone who knows how it works these days: the programs use Visual Basic and manipulate Word and Excel via their APIs; they can be created by writing the program in the VB editor or from the end user interface using the Record Macro function, or by some combination of the two.) In the 1990s, the Microsoft platform changed every few years in ways that required the tools to be continually reimplemented. We don't know whether this has improved or not.

Up-translating transcriptions prepared in these programs to NXT can be painful, depending upon exactly how the transcription was done. It's best if all of the transcription information is still available when you save as "text only". This means, for instance, avoiding the use of underlining and bold to mean things like overlap and emphasis. Otherwise, the easiest treatment is to save the document as HTML and then write scripts to convert that to NXT format, which is fiddly and can be unpalatable.

Using Forced Alignment with Speech Recognizer Output to get Word Timings

Timings at the level of the individual word can be useful for analysis, but they are extremely expensive and tedious to produce by hand, so most projects can only dream about them. It is actually becoming technically feasible to get usable timings automatically, using a speech recognizer. By "becoming", we mean that computational linguistics projects, who have access to speech specialists, know how to do it well enough that they think of it as taking a bit of effort but not requiring particular thought. This is a very quick explanation of how, partly in case you want to build this into your project and partly because we're considering whether we can facilitate this process for projects in general (for instance, by working closely with one project to do it for them and producing the tools and scripts that others would need to do forced alignment, as a side effect). Please note that the author is not a speech researcher or a linguist; she's just had lunch with a few, and not even done a proper literature review. That means that we don't guarantee everything in here is accurate, but that we are taking steps to understand this process and what we might be able to do about it. For better information, one possible source is Lei Chen, Yang Liu, Mary Harper, Eduardo Maia, and Susan McRoy, Evaluating Factors Impacting the Accuracy of Forced Alignments in a Multimodal Corpus, LREC 2004, Lisbon Portugal.

Commercial speech recognizers take an audio signal and give you their one best guess (or maybe n best guesses) of what the words are. Research speech recognizers can do this, but for each segment of speech, they can also provide a lattice of recognition hypotheses. A lattice is a special kind of graph where nodes are times and arcs (lines connecting two different times) are word hypotheses, meaning the word might have been said between the two times, with a given probability. The different complete things that might have been said can be found by tracing all the paths from the start time to the end time of the segment, putting the word hypotheses together. (The best hypothesis is then the one that has the highest overall probability, but that's not always the correct one.) If you have transcription for the speech that was produced by hand and can therefore be assumed to be correct, you can exploit the lattice to get word timings by finding the path through the lattice for which the words match what was transcribed by hand and transferring the start and end times for each word over to the transcribed data. This is what is meant by "forced alignment". HTK, one popular toolkit that researchers use to build their speech recognizers, comes with forced alignment as a standard feature, which means that if your recognizer uses it, you don't have to write a special purpose program to get the timings out of the lattice and onto your transcription. Of course, it's possible that other speech recognizers do this too and we just don't know about it.
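
As a schematic illustration (the words, times, and probabilities below are invented), suppose a segment has been hand-transcribed as "the man left". The lattice for that segment might contain arcs such as:

   0.00-0.12   the    (0.8)        0.00-0.12   a        (0.2)
   0.12-0.45   man    (0.6)        0.12-0.45   van      (0.4)
   0.45-0.90   left   (0.7)        0.45-0.98   laughed  (0.3)

Forced alignment picks out the path the - man - left because it matches the hand transcription, and transfers the arc times onto those three words, regardless of whether that path was the recognizer's own best guess.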

The timings that are derived from forced alignment are not as accurate as those that can be obtained by timestamping from a waveform representation, but they are much, much cheaper. Chen et al. 2004 has some formal results about accuracy. Speech recognizers model what is said by recognizing phonemes and putting them together into words, so the inaccuracy comes from the kinds of things that happen to articulation at word boundaries. This means that, to hazard a guess, the accuracy isn't good enough for phoneticians, but it is good enough for researchers who are just trying to find out the timing relationship between words and events in other modalities (posture shifts, gestures, gaze, and so on). The timings for the onset and end of a speech segment are likely to be more accurate than the word boundaries in between.

The biggest problem in producing a forced alignment is obtaining a research speech recognizer that exposes the lattice of word hypotheses. The typical speech recognition researcher concentrates on accuracy in terms of word error rates (what percentage of words the system gets wrong in its best guess), since in the field as a whole, one can publish if and only if the word error rate is lower than in the last paper to be published. (This is why most people developing speech recognizers don't seem to have immediate answers to the question of how accurate the timings are.) Developing increasingly accurate recognizers takes effort, and once a group has put the effort in, they don't usually want to give their recognizer away. So if you want to use forced alignment, you have the following options:

  • Persuade a speech group to help you. Lending the speech recognizer for your purposes doesn't harm commercial prospects or their research prospects in any way, but they might never have thought about that. This does require contact with a group that is either charitable or knows the benefits of negotiation. Since speech groups are always wanting more data and since hand-transcription is expensive, one reasonable deal is that if they provide you with the timings for your research, they can use your data to improve their recognizer. This only works if your recordings are of high enough quality for their purposes and speech groups may have specific technical constraints. For instance, speech recognizers work better on data that is recorded using the same kind of microphones as the data the recognizer was trained on. This means that the best time to broker a deal is before you start recording. The easiest arrangement is usually for them to bung your data through the recognizer at their site and pass you the results rather than for you to install the recognizer.

  • Build your own recognizer. One of the interesting things about forced alignment is that you don't actually need a good recognizer - you just need one that can get the correct words somewhere in the lattice of word hypotheses. Knowing the correct words also makes it much more likely that the correct hypothesis will be in the lattice somewhere, since you can make sure that none of the words are outside of the speech recognizer's vocabulary. A quick poll of speech researchers results in the estimate that constructing a speech recognizer that works OK but won't win any awards using HTK takes 1-3 person-months. More time lowers the word error rate but isn't likely to affect the timing accuracy. The researchers involved found it difficult to think about how bad the recognizer could be and still work for these purposes, so they weren't sure whether spending less time was a possibility. It does take someone with a computational background to build a recognizer, although they didn't feel it took any particular skill or speech background to build a bad one.

  • Find a speech recognizer floating around somewhere that's free and will work. There must be a project student somewhere who has put together a recognizer using HTK that is good enough for these purposes.

Finally, here are the steps in producing a forced alignment:

  • Produce high quality speech recordings. You must have one microphone per participant, and they must be close-talking microphones (i.e., tabletop PZMs will not do - you need lapel or head-mounted microphones). If you are recording conversational speech (i.e., dialogue or small groups), it's essential that the signal on each participant's microphone be stronger when they're speaking than when other people are. Each participant must be recorded onto a separate channel.

  • Optionally, find the areas of speech on each audio signal automatically. The energy on the signal will be higher when the person is speaking; you need to figure out some threshold above which the person is speaking and write a script to mark those regions (see the sketch after this list). This is often done in MATLAB.

  • Hand-transcribe, either scanning each entire signal looking for speech or limiting yourself to the areas found by the automatic process. Turns (or utterances, depending on your terminology) don't have to be timestamped accurately, but can include extra silent space before or after them that will be corrected by the forced alignment. However, it's important that the padding not include cross-talk from another person that could confuse the recognizer.

  • Optionally, add to the speech recognizer's dictionary all of the words in the hand-transcription that aren't in it already. (This is so that it can make a guess at what speech matches them even though it has never encountered the words before, rather than treating them as out-of-vocabulary, which means applying some kind of more general "garbage" model.)

  • Run the speech recognizer in forced alignment mode and then a script to add the timings to the hand transcription.
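
The sketch below illustrates the optional energy-thresholding step in Java rather than MATLAB. It is a rough illustration only: it assumes a 16-bit PCM WAV file, and the 50 ms frame length and the threshold value are arbitrary placeholders that would need tuning (and, in practice, smoothing and minimum-duration rules) against your own recordings.

import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioInputStream;
import javax.sound.sampled.AudioSystem;
import java.io.File;

// Prints tab-separated start/end times (seconds) of regions whose RMS energy
// exceeds a threshold.  Assumes 16-bit PCM, one channel; values are illustrative.
public class SpeechRegions {
    public static void main(String[] args) throws Exception {
        AudioInputStream ais = AudioSystem.getAudioInputStream(new File(args[0]));
        AudioFormat fmt = ais.getFormat();
        float rate = fmt.getSampleRate();
        int frameSamples = (int) (rate * 0.05);   // 50 ms analysis frames
        byte[] buf = new byte[frameSamples * 2];  // two bytes per 16-bit sample
        double threshold = 500.0;                 // tune against your recordings
        double t = 0.0, regionStart = -1.0;
        int n;
        while ((n = ais.read(buf)) > 0) {
            double sumsq = 0.0;
            int samples = n / 2;
            for (int i = 0; i + 1 < n; i += 2) {
                int s = fmt.isBigEndian()
                        ? (buf[i] << 8) | (buf[i + 1] & 0xff)
                        : (buf[i + 1] << 8) | (buf[i] & 0xff);
                sumsq += (double) s * s;
            }
            double rms = Math.sqrt(sumsq / Math.max(samples, 1));
            boolean speech = rms > threshold;
            if (speech && regionStart < 0) regionStart = t;
            if (!speech && regionStart >= 0) {
                System.out.printf("%.2f\t%.2f%n", regionStart, t);
                regionStart = -1.0;
            }
            t += (double) samples / rate;
        }
        if (regionStart >= 0) System.out.printf("%.2f\t%.2f%n", regionStart, t);
    }
}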

Time-stamped coding

Although waveforms are necessary for timestamping speech events accurately, many other kinds of coding (gestures, posture, etc.) don't really require anything that isn't available in the current version of NXT, except possibly the ability to advance a video frame by frame. People are starting to use NXT to do this kind of coding, and we expect to release some sample tools of this style plus a configurable video labelling tool fairly soon. However, there are many other ways of getting time-stamped coding; some of the video tools we encounter most often are The Observer, EventEditor, Anvil, and TASX. EMU is audio-only but contains extra features (such as formant and pitch tracking) that are useful for speech research.

Time-stamped codings are so simple in format (even if they allow hierarchical decomposition of codes in "layers") that it doesn't really matter how they are stored for our purposes - all of them are easy to up-translate into NXT. In our experience it takes a programmer .5-1 day to set up scripts for the translation, assuming she understands the input and output formats.
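
As an indication of what such a script involves, here is a minimal sketch in Java. It reads lines of the form start<TAB>end<TAB>label and writes one XML element per event. The element name, attribute names, id scheme, root element, and namespace shown are illustrative assumptions; in practice they must match the layer you have declared in your own metadata file.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.PrintWriter;

// Sketch of an up-translation from a tab-delimited event list to NXT-style XML.
// All names below (gesture, start, end, type, the id prefix) are hypothetical.
public class EventsToNXT {
    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        PrintWriter out = new PrintWriter(args[1], "UTF-8");
        out.println("<?xml version=\"1.0\" encoding=\"UTF-8\"?>");
        out.println("<nite:root xmlns:nite=\"http://nite.sourceforge.net/\">");
        String line;
        int i = 0;
        while ((line = in.readLine()) != null) {
            if (line.trim().isEmpty()) continue;
            String[] f = line.split("\t");   // start, end, label
            out.printf("  <gesture nite:id=\"o1.gesture.%d\" start=\"%s\" end=\"%s\" type=\"%s\"/>%n",
                       ++i, f[0], f[1], f[2]);
        }
        out.println("</nite:root>");
        out.close();
        in.close();
    }
}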

Importing Data into NXT

NXT has been used with source data from many different tools. The import mechanisms used are becoming rather less ad-hoc, and this section has information about importing from some commonly-used tools. As transforms for particular formats are abstracted enough to be a useful starting point for general use, they will appear in this document, and also in the NXT distribution: see the transforms directory for details.

Transcriber and Channeltrans

Transcriber and Channeltrans have very similar file formats, Channeltrans being a multi-channel version of Transcriber.

See the transforms/TRStoNXT directory for the tools and information you will need. The basic transform is run by a perl script called trs2nxt. The perl script uses three stylesheets and an NXT program. Before running the transform, compile the AddObservation Java program using the standard NXT CLASSPATH. Full instructions are included, but the basic call to transform is:

trs2nxt  -c metadata_file -ob observationname -i in_dir -o out_dir -n nxt_dir 

where you need to point to your local NXT directory using -n, and your local editable metadata copy using -c. The Java part of the process is useful as it checks validity of the transform and saves the XML in a readable format.

Note

There are many customizations you can make to this process using command line arguments, but if you have specific transcription conventions that you need to be converted to particular NXT elements or attributes, you will need to edit the script itself. The transcription conventions assumed are those in the AMI Transcription Guidelines.

EventEditor

EventEditor is a free Windows-only tool for direct time-stamping of events to signal.

See the transforms/EventEditortoNXT directory for the tools and information you will need. The basic transform is a Java program which needs to be compiled using the standard NXT CLASSPATH (comprising at least nxt.jar and xalan.jar). To transform one file, use

 
java EventEditorToNXT -i input_file -t tagname -a attname -s starttime 
	-e endtime -c comment [ -l endtime ]

The arguments give the input file and the names of the element and attributes to be used in the output. Because EventEditor is event based, the last event does not have an end time. If you want an end time to appear in the NXT format, use the -l argument.

IDs are not added to elements, but you can use the provided add-ids.xsl stylesheet for that:

 
java -classpath $NXTDIR/lib/xalan.jar org.apache.xalan.xslt.Process 
	-in outputfromabove -out outfile 
	-xsl add-ids.xsl -param session observationname 
	-param participant agentname

where NXTDIR is your local NXT install, or you can point to anywhere you happen to have installed Xalan. You should supply at least the session parameter, and ideally the participant one too, since these help make the IDs unique.

The Observer

The Observer is a commercial Windows-only tool for timestamping events against signal.

Output from The Observer is in its textual odf format, and this is transformed to NXT format using the observer2nxt perl script in the transforms/ObserverToNXT directory. You will need to specify your own mapping between Observer and NXT codes by editing the lookup tables in the perl script.

Other Formats

For data in different formats it's worth investigating how closely your transform might resemble one of those above: often it's a fairly simple case of tailoring an existing transform to your circumstances. If you manage this successfully, please contact the NXT developers: it will be worth passing on your experience to other NXT users. If your input format is significantly different to those listed, the NXT developers may still have experience that can be useful for your transform. We have also transformed data from Anvil and ELAN among others.

Exporting Data from NXT into Other Tools

NQL is a good query language for this form of data, but it is necessarily slower and more memory-intensive than some others in use (particularly by syntacticians) because it does not restrict the use of left and right context in any way (in fact, it's possible to search across several observations using it). This isn't really as much of a problem for data analysts as they think - they can debug queries on a small amount of data and then run them overnight - but it is a problem for real-time applications. And sometimes users already know other query languages that they would prefer to use. This section considers how to convert NXT data for use in two existing search utilities, tgrep2 and TigerSearch. Our recommended path to tgrep2 is via Penn Treebank format, which can be useful as input to other utilities as well. Besides the speed improvements that come from limiting context, tgrep2 has a number of structural operators that haven't been implemented in NQL, including immediate precedence and first and last child (although we expect to address this in 2006). We haven't gone through it looking at whether it has functionality that is difficult to duplicate in XPath; if it doesn't, then using XPath is likely to be the better option for those who already know it, but tgrep2 already has a user community who like the elegance and compactness of the language. TigerSearch has a nice graphical interface and again supports structural operators missing in NQL.

Tgrep2 is for trees, and TigerSearch, for directed acyclic graphs with a single root. NXT represents a directed acyclic graph with multiple roots and additionally, some arbitrary graph structure thrown over the top that can have cycles. The biggest problem in conversion is determining what tree, or what single-rooted graph, to include in the conversion. This is a design problem, since it effectively means deciding what information to throw away. Every NXT corpus has its own design, so there is no completely generic solution - conversion utilities will require at least corpus-specific configurations.

TGREP2 via Penn Treebank Format

Penn Treebank format is simply labelled bracketing of orthography. For instance,

(NP (DET the) (N man))

Tgrep2 can load Penn Treebank format, but other tools use it as well. This means that it's reasonable to get to tgrep2 via Penn Treebank format, since some of the work on conversion can be dual purposed.

Most users of Penn Treebank format treat the labels simply as labels. Tgrep2 users tend to overload them with more information that they can get at using regular expressions. So, for instance, if one has markup for NPs that are subjects of sentences, one might mark that using NP for non-subjects and NP-SUBJ for subjects. The hyphen as separator is important to the success of regular expressions over the labels, especially where different parts of the labelling share substrings.
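
For example, in generic regular-expression terms (rather than any particular tgrep2 syntax), an unanchored pattern like NP matches both NP and NP-SUBJ, whereas ^NP$ matches only the bare label and ^NP(-SUBJ)?$ matches exactly the two intended labels; a consistent separator such as the hyphen is what makes patterns like these safe to write.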

Some users of Penn Treebank format additionally overload the labels with information about out-of-tree links that can't be used in tgrep2, but that they have other ways of dealing with. For instance, suppose they wish to mark a coreferential link between "the man" and "he". One way of doing this is to use a unique reference number for the link:

(NP/ANTEC1 (DET the) (N man)) ... (PRO/REF1 he)

We recommend dividing conversion into two steps: (1) deriving a single XML file that represents the tree you want from the NXT format data, where the XML file structure mirrors the tree structure for the target treebank and any out-of-tree links are represented using ids and idrefs; and (2) transforming that tree into the Penn Treebank format.

(1) is specific to a given corpus and set of search requirements. For some users, it will be one coding file from the original data, or the result of one knit operation, in which case it's easy. It might also be a simple transformation of a saved query result. Or it might be derived by writing a Java program that constructs it using the data model API. Once you know what tree you want, the search page will give hints about how to get it from the NXT data.

(2) could be implemented as a generic utility that reads a configuration file explaining how to pack the single-file XML structure into a Penn Treebank labelling and performs it on a given XML file. Assume that each label consists of a basic label (the first part, before any separator, usually the most critical type information), optionally followed by some rendering of attribute-value pairs, optionally followed by some rendering of out-of-tree links. The configuration file would designate separators between different kinds of information in the Treebank label and where to find the roots of trees for the treebank. (The latter is unnecessary, since anything else could be culled from the tree in step 1, but it makes it more likely that a single coding file from the NXT data format will be a usable tree for input to step 2.) For each XML tag name, it would also designate how to find the basic label (the first part, before any separator), which attribute-value pairs and links to include, and how they should be printed.

Below is one possible design for the configuration file. Note that the configuration uses XPath fragments to specify where to find roots for the treebank and descendants for inclusion. Our assumption is that those who don't know XPath can at least copy from examples, and those who do can get more flexibility from this approach.

<NXT-to-tgrep-config>
   <!-- specify where the treebank roots are. We will tree-walk
        the XML from these nodes, printing as we go -->
   <treebank-roots match="//foo"/>
   <!-- what to use as brackets -->
   <left-bracket value="("/>
   <right-bracket value=")"/>
   <!-- string with a separator to use between base label and atts;
     if none given, none used -->
   <base-label-sep value="-"/>
   <!-- string with a separator to use between att name and value -->
   <att-value-sep value=":"/>
   <!-- string with a separator to use between different atts -->
   <att-sep value="*"/>
   <!-- string with a separator to use between attributes and links -->
   <link-sep value="/"/>
   <!-- don't bother printing attribute names or the separator between
        the names and the values -->
   <omit-attribute-names/>
   <!-- if a node matches the expression given, skip it, moving
	on to its children -->
   <omit match="baz"/>
   <!-- transformation instructions for nodes matching the expression given -->
   <transform match="nt">
       <!-- the base-label comes first in the label, again an XPath 
       fragment.  name() for tag name, @cat for value of cat attribute -->
       <base-label value="name()"/>    
       <!-- where to find the orthography, if any (usually the textual content,
       sometimes a particular attribute) -->
       <orthography value="text()"/>
       <!-- leave out the start attribute -->
       <omit-attribute name="start"/>  
       <!-- we assume all other attributes are printed in a standard
       format with the name, att-value-sep, and then the attribute-value. 
       If we need individual control for how attributes are printed,
       we'll need to allow configuration of that here.
       -->
   </transform>
   <!-- how to print out-of-tree links represented by id/idref in 
    the input.  This example says expect foo tags 
    to be linked to bar tags where the refatt attribute has the same
    value as the foo's idatt attribute.  For the foo label, add the
    link separator followed by ANTEC followed by the value of idatt,
    and for the bar label, add the link separator followed by REF
    followed by the value of refatt (which is the same value).  -->
   <link>
       <antecedent match="foo"/>
       <antecedent-id name="@idatt"/>
       <referent match="bar"/>
       <referent-idref name="@refatt"/>
       <link-anteclabel value="ANTEC"/>
       <link-reflabel value="REF"/>
   </link>
</NXT-to-tgrep-config>

A few example labels that could be generated, depending on the configuration choices, from

<nt cat="NP" subcat="SUBJ" id="1"/>

where this serves as an antecedent in a link:

  • NP

  • NP-subcat:SUBJ/ANTEC1

  • NP-subcat:SUBJ/1

  • nt-cat:NP*subcat:SUBJ/ANTEC1

  • nt-NP:SUBJ

  • nt-NP

and so on.

The utility should have defaults for everything so that it does something when there is no configuration file, choosing standard separators, not omitting any tags or attributes, printing attribute names, and failing to print any out-of-tree links. It also should not require a DTD for the input data. One thing to note: this design assumes we print separate ids for every link, but some nodes could end up linked in two ways, to two different things, causing labels like FOO-BARANTEC1-BAZANTEC1. This is the more general solution, but if users always have the same id attribute for both types of links, we can make the representation more compact.

We have attracted funding to write this utility, with the work to be scheduled sometime in the period Oct 05 to Oct 06, and so we are consulting on this design to see whether it is flexible enough, complete, too complicated for the target users, and actually in demand. Note that a converter like this couldn't guarantee safety of queries, given that the Penn Treebank labels get manipulated using regular expressions: the user could easily get matches on the wrong part of the label by mistake, because it is hard to write regular expressions that preclude this unless you devise your attribute values and tag names carefully so that no pair of them matches an obvious regular expression you might want to search on. The users who have requested this work expect to get around this problem by running conversion from NXT format into several different tgreppable formats for different queries that omit the information that isn't needed.

Our biggest concern with the utility is how implementation choices could affect usability for this user community. It tends to be the less computational end of the tgrep user community who most want tgrep conversion, with speed and familiarity as the biggest issues. (Familiarity doesn't really seem to be an issue for the more computational users, and speed is slightly less of an issue since they're more comfortable with scripting and batch processing, but it's still enough of a problem for some queries that they want conversion. This may change when we complete work on a new NXT search implementation that evaluates queries by translating them to XQuery first, but that's a bigger task.) Serving the needs of less computational users introduces some problems, though. The first one is that since they know nothing about XML, and are used to thinking about trees but not more complex data models, they won't be able to write the configuration file for the utility. The second is that it may be difficult to find an implementation for the converter that runs fast, is easy to install, and doesn't require us to make executables for a wide range of platforms. (We think it needs to run fast because the users expect to create several different tgreppable forms of the same data, but if they have to get someone else to do it because it requires skills they don't have to write the configuration file, this is no longer important - the real delay will be in getting someone's time.)

We're still wrestling with this design; comments about our assessment of what's required and acceptable solutions welcome. The implementations we're considering are (a) generating a stylesheet from the configuration file and applying that to the data or (b) direct implementation reading the data and configuration file at the same time, in either perl with xml parsing and xpath modules, Java with Apache libraries, or LT-XML2.

TigerSearch

We have put less thought into conversion into TigerSearch, but that doesn't mean the conversion is less useful. The fact that TigerSearch supports a more general data structure than trees means that it will be more useful for some people. NXT uses the XML structure to represent major trees from the data, but Tiger's XML is based on graph structure, with XML tags like nt (non-terminal node), t (terminal node), and edge. On the other hand, since Tiger can represent not just trees but directed acyclic graphs with a single root, it would be more reasonable to specify a converter, again using a configuration file, in one step from NXT format. The configuration file would need to specify what to use as roots, where to find the orthography, a convention for labelling edges, and which links to omit to avoid cycles, but otherwise it could just preserve the attribute-value structure of the original. The best implementation is probably in Java using the NXT data model API to walk a loaded data set.
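
For orientation, TigerSearch's input format encodes each sentence along roughly the following lines (a schematic, hand-written fragment, not the output of any existing converter):

<s id="s1">
  <graph root="s1_500">
    <terminals>
      <t id="s1_1" word="the" pos="DET"/>
      <t id="s1_2" word="man" pos="N"/>
    </terminals>
    <nonterminals>
      <nt id="s1_500" cat="NP">
        <edge idref="s1_1" label="det"/>
        <edge idref="s1_2" label="head"/>
      </nt>
    </nonterminals>
  </graph>
</s>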

Knitting and Unknitting NXT Data Files

By "knitting", we mean the process of creating a larger tree than that in an individual coding or corpus resource by traversing over child or pointer links and including what is found. Knitting an XML document from an NXT data set performs a depth-first left-to-right traversal of the nodes in a virtual document made up by including not just the XML children of a node but also the out-of-document children links (usually pointed to using nite:child and nite:pointer, respectively, although the namimg of these elements is configurable). In the data model, tracing children is guaranteed not to introduce cycles, so the traversal recurses on them; however, following links could introduce cycles, so the traversal is truncated after the immediate node pointed to has been included in the result tree. For pointers, we also insert a node in the tree between the source and target of the link that indicates that the subtree derives from a link and shows the role. The result is one tree that starts at one of the XML documents from the data set, cutting across the other documents in the same way as the ^ operator of the query language, and including residual information about the pointer traces. At May 2004, we are considering separating the child and pointer tracing into two different steps that can be pipelined together, for better flexibility, and changing the syntax of the element between sources and targets of links.

Unknitting is the opposite process, involving splitting up a large tree into smaller parts with stand-off links between them.

Knitting NXT data can create standard XML files from stand-off XML files. This can be essential for downstream processing that is XML aware but does not deal with stand-off markup. Data Storage describes NXT's stand-off annotation format.

There are two distinct approaches for knitting data: using an XSLT stylesheet, or using the LT XML2 toolkit.

Knit using Stylesheet

To resolve the children and pointers from any NXT file there is a stylesheet in NXT's lib directory called knit.xsl. Stylesheet processor installations vary locally. Some people use Xalan, which happens to be redistributed with NXT. It can be used to run a stylesheet on an XML file as follows.

 java org.apache.xalan.xslt.Process -in INFILE -xsl lib/knit.xsl -param idatt id 
    -param childel child -param pointerel pointer -param linkstyle ltxml 
    -param docbase file:///my/file/directory 2> errlog > OUTFILE

The docbase parameter indicates the directory of the INFILE, used to resolve the relative paths in child and pointer links. If not specified, it will default to the location of the stylesheet (NOT the input file!). Note that if you're using the absolute location of the INFILE, it is perfectly fine to just set docbase to the same thing, because the entity resolver will take its base URL (according to xslt standard) for document function calls.

Note

This means you may have to move XML files around so that all referred-to files are in the same directory.

The default linkstyle is LT XML, the default id attribute is nite:id, the default indication of an out-of-file child is nite:child, and the default indication of an out-of-file pointer is nite:pointer. These can be overridden using the parameters linkstyle, idatt, childel, and pointerel, respectively, and so for example if the corpus is not namespaced and uses xpointer links,

java org.apache.xalan.xslt.Process -in INFILE -xsl STYLESHEET 
	-param linkstyle xpointer -param idatt id 
	-param childel child -param pointerel pointer

A minor variant of this approach is to edit knit.xsl so that it constructs a tree drawn from a path that could be knitted, and/or uses document() calls to pull in off-tree items. The less the desired output matches a knitted tree, and especially the more outside material it pulls in, the harder this is. Also, if a subset of the knitted tree is what's required, it's often easier to obtain it by post-processing the output of knit.

Knit using LT XML2

Knit.xsl can be very slow. It follows both child links and pointer links, but conceptually, these operations could be separate. We have implemented separate "knits" for child and pointer links as command line utilities with a fast implementation in LT XML2: lxinclude (for children) and lxnitepointer (for pointers).

lxinclude -t nite FILENAME reads from the named file (which is really a URL) or from standard input, writes to standard output, and knits child links. (The -t nite is required because this is a fuller XInclude implementation; it parameterizes the tool for NXT-style links.) If you haven't used the default nite:child links, you can pass the name of the tag you used with -l, using -xmlns to declare any required namespacing for the link name:

lxinclude -xmlns:n=http://example.org -t nite -l n:mychild

This can be useful for recursive tracing of pointer links if you happen to know that they do not loop. Technically, the -l argument is a query to allow for constructions such as -l '*[@ischild="true"]'.

Similarly,

lxnitepointer FILENAME

will trace pointer links, inserting summary traces of the linked elements.

Using stylesheet extension functions

As a footnote, LT XML2 contains a stylesheet processor called lxtn, and we're experimenting with implementing extension functions that resolve child and pointer links with less pain than the mechanism given in knit.xsl; this is very much simpler syntactically and also faster, although not as fast as the LT XML2 based implementation of knit. This approach could be useful for building tailored trees and is certainly simpler than writing stylesheets without the extension functions.


Unknit using LT XML2

Again based on LT XML2 we have developed a command line utility that can unknit a knitted file back into the original component parts.

lxniteunknit -m METADATA FILE

Lxniteunknit does not include a command line option for identifying the tags used for child and pointer links because it reads this information from the metadata file.

General Approaches to Processing NXT Data

Suppose that you have data in NXT format, and you need to make some other format for part or all of it - a tailored HTML display, say, or input to some external process such as a machine learning algorithm or a statistical package. There are an endless number of ways in which such tasks can be done, and it isn't always clear what the best mechanism is for any particular application (not least because it can depend on personal preference). Here we walk you through some of the ones we use.

The hardest case for data processing is where the external process isn't the end of the matter, but creates some data that must then be re-imported into NXT. (Think, for instance, of the task of part-of-speech tagging or chunking an existing corpus of transcribed speech.) In the discussion below, we include comments about this last step of re-importation, but it isn't required for most data processing applications.

Option 1: Write an NXT-based application

Often the best option is to write a Java program that loads the data into a NOM and use the NOM API to navigate it, writing output as you go. For this, the iterators in the NOM API are useful; there are ones, for instance, that run over all elements with a given name or over individual codings. It's also possible from within an application to evaluate a query on the loaded NOM and iterate over the results within the full NOM, not just the tree that saving XML from the query language exposes. (Many of the applications in the sample directory both load and iterate over query results, so it can be useful to borrow code from them.) For re-importation, we don't have much experience of making Java communicate with programs written in other languages (such as the streaming of data back and forth that might be required to add, say, part-of-speech tags) but we know this is possible and that users have, for instance, made NXT-based applications communicate with processes running in C (but for other purposes).
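
As an indication of the shape such a program takes, here is a minimal sketch. The class and method names below are recalled from the NXT sample programs and should be checked against the samples directory for your NXT version before use; the query and the "word" element type are also assumptions that depend on your corpus design.

import java.util.List;
import net.sourceforge.nite.meta.impl.NiteMetaData;
import net.sourceforge.nite.nom.nomwrite.impl.NOMWriteCorpus;
import net.sourceforge.nite.search.Engine;

// Sketch only: check class and method names against the NXT samples directory.
public class CountWords {
    public static void main(String[] args) throws Throwable {
        NiteMetaData meta = new NiteMetaData(args[0]);   // path to the metadata file
        NOMWriteCorpus nom = new NOMWriteCorpus(meta);
        nom.loadData();                                   // load the corpus into the NOM
        Engine searchEngine = new Engine();
        // NQL query binding $w to elements of type "word" (rename to suit your corpus);
        // in the versions we have seen, the first element of the result list holds the
        // variable names and the rest are the matches.
        List results = searchEngine.search(nom, "($w word)");
        System.out.println("number of matches: " + (results.size() - 1));
    }
}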

This option is most attractive:

  • for those who write applications anyway (since they know the NOM API)

  • for applications where drawing the data required into one tree (the first step for the other processing mechanisms) means writing a query that happens to be slow or difficult to write, but NOM navigation can be done easily with a simpler query or no query at all

  • for applications where the output requires something which is hard to express in the query language (like immediate precedence) or not supported in query (like arithmetic)

Option 2: Make a tree, process it, and (for re-importation) put it back

Since XML processing is oriented around trees, constructing a tree that contains the data to be processed, in XML format, opens up the data set to all of the usual XML processing possibilities.

First step: make a tree

Individual NXT codings and corpus resources are, of course, tree structures that conveniently already come in XML files. Often these files are exactly what you need for processing anyway, since they gather together like information into one file. Additionally, you can use the knitting and knit-like tree construction approaches described in Knitting and Unknitting NXT Data Files.

As an alternative to knitting data into trees, if you evaluate a query and save the query results as XML, you will get a tree structure of matchlists and matches with nite:pointers at the leaves that point to data elements. Sometimes this is the best way to get the tree-structured cut of the data you want, since it makes many data arrangements possible that don't match the corpus design and therefore cannot be obtained by knitting.

The query engine API includes (and the search GUI exposes) an option for exporting query results not just to XML but to Excel format. We recommend caution in exercising this option, especially where further processing is required. For simple queries with one variable, the Excel data is straightforward to interpret, with one line per variable match. For simple queries with n variables, each match takes up n spreadsheet rows, and there is no way of finding the boundaries between n-tuples except by keeping track (for instance, using modular arithmetic). This isn't so much of a problem for human readability, but it does make machine parsing more difficult. For complex queries, in which the results from one query are passed through another, the leaves of the result tree are presented in left-to-right, depth-first order of traversal, and even human readability can be difficult. Again, it is possible to keep track whilst parsing, but between that and the difficulty of working with Excel data in the first place, it's often best to stick to XML.

Second step: process the tree
Stylesheets

This is the most standard XML transduction mechanism. There are some stylesheets in the lib directory that could be useful as is, or as models: knit.xsl itself, and attribute-extractor.xsl, which can be used in conjunction with SaveQueryResults and knit to extract a flat list of attribute values for some matched query variable (available from Sourceforge CVS from 2 July 04; will be included in NXT-1.2.10).

This option is most attractive:

  • for those who write stylesheets anyway (since they know XSLT)

  • for operations that can primarily be carried out on one coding at a time, or on knitted trees, or on query language result trees, limiting the number and complexity of the document calls required

  • for applications where the output requires something which is not supported in query but is supported in XSLT (like arithmetic)

Xmlperl

Xmlperl gives a way of writing pattern-matching rules on XML input but with access to general perl processing in the action part of the rule templates.

This option is most attractive:

  • for those who write xmlperl or at least perl anyway

  • for operations that can be carried out on one coding at a time, or on knitted trees, or on query language result trees

  • for applications where the output requires something which is not supported in query (like arithmetic)

  • for applications where XSLT's variables provide insufficient state information

  • for applications where bi-directional communication with an external process is needed (for instance, to add part-of-speech tags to the XML file), since this is easiest to set up in xmlperl

Xmlperl is quite old now. There are many XML modules for perl that could be useful but we have little experience of them.

In the LT XML2 release, see also lxviewport, which is another mechanism for communication with external processes.

ApplyXPath/Sggrep

There are some simple utilities that apply a query to XML data and return the matches, like ApplyXPath (an Apache sample) and sggrep (part of LT XML2). Where the output required is very simple, these will often suffice.

Using lxreplace

This is another transduction utility, distributed as part of LT XML2. It is implemented over LT XML2's stylesheet processor, but the same functionality could be implemented over some other processor.

lxreplace -q query -t template

template is an XSLT template body, which is instantiated to replace the nodes that match query. The stylesheet has some pre-defined entities to make the common cases easy:

  • &this; expands to a copy of the matching element (including its attributes and children)

  • &attrs; expands to a copy of the attributes of the matching element

  • &children; expands to a copy of the children of the matching element

Examples:

To wrap all elements foo whose attribute bar is unknown in an element called bogus:

lxreplace -q 'foo[@bar="unknown"]' -t '<bogus>&this;</bogus>'

(that is, replace each matching foo element with a bogus element containing a copy of the original foo element).

To rename all foo elements to bar while retaining their attributes:

lxreplace -q 'foo' -t '<bar>&attrs;&children;</bar>'

(that is, replace each foo element with a bar element, copying the attributes and children of the original foo element).

To move the (text) content of all foo elements into an attribute called value (assuming that the foos don't have any other attributes):

lxreplace -q 'foo' -t '<foo value="{.}"/>'

(that is, replace each foo element with a foo element whose value attribute is the text value of the original foo element).

Third step: add the changed tree back in

Again based on LT XML2, the lxniteunknit utility described in Knitting and Unknitting NXT Data Files can unknit a knitted file back into the original component parts. With lxniteunknit, one possible strategy for adding information to a corpus is to knit a view with the needed data, add information straight into the knitted file as new attributes or a new layer of tags, change the metadata to match the new structure, and then unknit.

Another popular option is to keep track of the data edits by id of the affected element and splice them into the original coding file using a simple perl script.

Option 3: Process using other XML-aware software

NXT files can be processed with any XML aware software, though the semantics of the standoff links between files will not be respected. Most languages have their own XML libraries: under the hood, NXT uses the Apache XML Java libraries. We sometimes use the XML::XPath module for perl, particularly on our import scripts where XSLT would be inefficient or difficult to write.

Manipulating media files

A wide variety of media tools can be used to create signals for NXT and to manipulate them for use with other tools. Here we mention a few that we use regularly. The cross-platform tool mencoder is good for encoding video for use with NXT; VirtualDub in conjunction with AviSynth (Windows only) is useful for accurately chopping up video files for use in other programs. The latter approach is better if you need frame-accurate edits, since mencoder is only accurate to the nearest keyframe.

As an example, we are sometimes asked to provide video extracts that show certain NXT phenomena. The first task is to find the phenomena using an NXT tool like FunctionQuery. This results in a tab-delimited result set, each line of which identifies the video file to use, along with the start and end time of the segment. Using a scripting language like perl, it's easy to transform this into a set of AviSynth format files. These files simply describe a set of media actions to take, like loading a video file, adding an audio soundtrack, and then chopping out the appropriate section (these would use the AviSynth functions AVISource, WAVSource, AudioDub and Trim). These files can then be loaded into VirtualDub, which treats them like any other video file, and the result saved as an AVI file with whatever video / audio compression you choose. The useful thing about VirtualDub is that these actions can be applied to a batch of files and left to run with no further user action.
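
A generated script might look roughly like this (the file names and frame numbers are invented; note that Trim works in frames, so start and end times have to be multiplied by the frame rate):

# extract one segment; AviSynth returns the result of the last expression
video = AVISource("meeting1.avi")
audio = WAVSource("meeting1.wav")
clip = AudioDub(video, audio)
Trim(clip, 3750, 4250)   # 150.0s to 170.0s at 25 frames per second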