The TEI is not a tool like the others on this list, but we have been asked about the relationship between NXT and the Text Encoding Initiative, and in particular, whether it is possible to produce an annotation for spoken dialogue compliant with the TEI standards using NXT GUIs. (Although NXT does get used on text, we have not considered the relationship between NXT and the TEI on textual materials yet, but we expect there to be fewer issues that arise for them.) These are our thoughts on the issue so far. We have made some reference to the P5 documentation in writing them, although we are also relying partly on memory and have not thoroughly checked our work, so it is not definitive. Corrections are welcome. Note also that the TEI states that their guidelines are under revision in this area.
If one has TEI-compliance in mind from the start, then it should be possible to design the NXT storage format for the data set so that it only requires a simple transform to be TEI-compliant, and for some data sets it may be possible to make it TEI-compliant as is. However, designing the NXT data representation for maximum TEI-compliance loses the main benefits of using NXT. If the data has crossing hierachies of annotation, using a TEI-compliant representation means losing the search facility that handles these nicely. If the data represents temporal relationships, using a TEI-compliant representation means losing the ability of NXT browsers to highlight the current annotations as a signal plays. In addition, the configurable interfaces for dialogue acts and named entities currently constrain the NXT data representation in ways that violate TEI recommendations, which means that data sets which aim for TEI-compliance would either need to write their own tailored GUIs for everything or contribute (fairly modest) changes to them. If one wants to make use of NXT's best properties, then it would be better to develop a data path for getting between the NXT and TEI-compliant data formats than to build TEI-compliance into the NXT format. If one doesn't need NXT's facilities for crossing hierachies or timing, then there may be a simpler framework upon which annotation tools can be built.
The TEI recommends particular tag names for orthographic transcription element. These are not a
problem for NXT, which has no constraints on tag naming - it just requires the tags to be formally
defined in the NXT "metadata" using the TEI's set. The TEI recommends the use of markup within
one XML tree as the orthography for the representation of dialogue acts, named entities,
turns, and the like. For instance, dialogue acts are represented in the TEI as <seg>
's and
named entities as <rs>
's (or similar non-segmenting spans of transcription elements,
such as <persName>
). One hierarchy of <seg>
's over the transcription
can be represented in NXT, again by authoring the metadata to match, but the metadata will
not be particularly useful for data validation because it will simply have the semantics that all
<seg>
's draw from the transcription elements as children; if there is internal
structure among the segments, NXT will not by itself enforce or check that. Similarly, <rs>
and similar tags can be used, but technically they violate NXT's data model unless hey are either
defined within the orthographic transcription tag set (with recursive descent through that set of tags).
This is because strictly speaking, NXT requires "layers" of annotation to span the layers
beneath them (in this case, the layer of transcription elements). However, this is a
only a weak data model violation, and NXT copes with it by allowing tags to contain either
the element types declared as their children or skip directly to the ones declared as their
children's children. If one's data does not have crossing hierarchies or a relationship to signal,
this suggests that TEI-compliance is either possible or very close. There may be a problem with the
representation of links. The TEI practice for relating data elements uses IDREF
or IDREFS
or in-file links. Some NXT data sets use string matching on attribute
values which is similar to using IDREF
s, but there is nothing in the attribute
declarations which lets NXT validate that relationship. NXT currently writes in-file links using a
syntax that (redundantly) contains the filename, although this could be changed without much difficulty.
There may also be differences in what's expected at file roots. NXT doesn't require a particular
tag name at the root (although it does currently warn if an unexpected one is used), but it
doesn't expect headers and bodies in the same file, and the metadata declaration won't allow
different content models for two tags at the same depth from the root in the same file,
weakening the data validation where they are stored together (since then the content model
must specify a disjunction of the possible types at that depth). Every NXT element must have
an id
, which may be a burden for some data sets.
The main difference between NXT's representation and that of the TEI is whether or not overlapping (crossing) hierarchies pointing down to the same elements are expected. NXT is designed specifically for cases where they are; the TEI contains mechanisms for dealing with crossing hierarchies, but because this is not their primary concern, the mechanisms are more cumbersome. NXT's data representation is based on the idea of multi-rooted trees; in the data model, individual nodes can have one set of children, but multiple parents from different upward trees. A typical use of for this representation in the annotation of spoken dialogue (which makes up NXT's largest user group) is to have time-aligned orthographic transcription at the bottom, and then separate hierarchies for, say, named entities, dialogue acts, prosodic phrases, turns, or whatever that use the words as children. The data is serialized into XML by divided the multi-rooted tree into convenient trees where the XML structure mirrors the data structure and representing the remaining connections between nodes using stand-off links in XLink format. NXT also allows arbitrary additional links to be represented on top of the multi-rooted tree, again using XLinks, but ones that have a different semantics within NXT. The TEI representation for a data set with crossing hierarchies would choose one hierarchy as the primary one, mirror that in the XML structure, and use milestone tags for the other hierarchies. This keeps everything in one file. For extreme cases, one could use the TEI's recommended form for representing graphs, which gives a list of nodes and links where the XML structure does not mirror any part of the graph. Either of these styles of representation can be defined in NXT's "metadata" describing the set of tags, and as long as everything fits into one XML tree they can be kept in one file, but the NXT data validation won't be particularly useful then, and there are no existing GUIs or search facilities that will help in creating or using this data, which means building new ones using the GUI library.
The other main difference between NXT and the TEI is in the representation of timing relationships.
The TEI gives a choice of mechanisms, ranging from the coarse statement that an element
is overlapped via trans="overlap"
, through the use of
<anchor>
tags that link to overlapping events,
to the representation of complete timelines that give time points which then can be
used to indicate the start and end times for an element. Any of these representations
can be defined in NXT's data storage format, but none of them will get the timing data
recognized as time in NXT, which disables one of the most useful features of NXT browsers
(the ability to play signals and show which annotations are current as they play).
NXT's format for timing information is closest to the last one, but is not
TEI-compliant; where annotations of a particular type for different speakers ("agents")
can overlap temporally, NXT requires them to be stored in separate files. This is in
aid of the temporal semantics inherent in NXT's data model which allows timings to percolate up
trees. This requirement can only be circumventing by failing to declare the attributes as times.
NXT comes with some configurable tools for annotating dialogue acts and named
entities. These currently rely on an NXT data representation in which the
dialogue act and named entity tags point into an external ontology of act or
entity types, rather than allowing the type to be expressed as an attribute value.
That means that if a data set is represented to be as TEI-compliant as possible in
the NXT format itself, these tools cannot be used. We are considering making it
possible to configure the tools to use an enumerated attribute, but we don't
have an immediate need for the result so the work hasn't been scheduled yet.
If there is more than one type of <seg>
in the data,
this will cause problems for setting up the tool because the NXT metadata will
have no way of specifying which types go together into one set to be annotated
together (so, for instance, making dialogue act annotation different from some other
segmentation and classification task).
The difficulties in mapping between the TEI and NXT arise from the fact that NXT is designed for data that is rather esoteric for the TEI. If one doesn't need crossing hierachies or relationships to signal, there may be other annotation frameworks that are closer to TEI-compliance in their native data formats. We have never considered other frameworks in this light. MMAX2 uses multiple file stand-off, so probably isn't any closer. Other key words to search on are AGTK, CALLISTO, ATLAS, and WordFreak.