.. _geotag: ********** Geotagging ********** **NOTE:** This chapter actually describes the TTT2 pipeline software, which differs slightly from the Geoparser. However, all the important points on the operation of the geotagging step are covered. Introduction ============ This documentation is intended to provide a detailed description of the pipelines provided in the LT-TTT2 distribution. The pipelines are implemented as Unix shell scripts and contain calls to processing steps which are applied to a document in sequence in order to add layers of XML mark-up to that document. This document does not contain any explanation of ``lxtransduce`` grammars or XPath expressions. For an introduction to the ``lxtransduce`` grammar rule formalism, see the `tutorial documentation `_. See also the `lxtransduce manual `_ as well as the documentation for the `LT-XML2 programs `_. LT-TTT2 includes some software not originating in Edinburgh which has been included with kind permission of the authors. Specifically, the part-of-speech (POS) tagger is the C&C tagger and the lemmatiser is ``morpha``. See Sections :ref:`gt-postag` and :ref:`gt-lemmatise` below for more information and conditions of use. LT-TTT2 also includes some resource files which have been derived from a variety sources including UMLS, Wikipedia, Project Gutenberg, Berkeley and the Alexandria Digital Library Gazetteer. See Sections :ref:`gt-tokenise`, :ref:`gt-lemmatise` and :ref:`gt-nertag` below for more information and conditions of use. Pipelines ========= The ``run`` script ------------------ The LT-TTT2 pipelines are found in the ``TTT2/scripts`` directory and are NLP components or sub-components, apart from ``TTT2/scripts/run`` which is a pipeline that applies all of the NLP components in sequence to a plain text document. The diagram in Figure :ref:`gt-runFig` shows the sequence of commands in the pipeline. .. _gt-runFig: .. figure:: images/run.jpg :width: 90% :align: center :alt: 'run' pipeline The ``run`` pipeline The script is used from the command line in the following kinds of ways (from the directory): :: ./scripts/run < data/example1.txt > your-output-file :: cat data/example1.txt | ./scripts/run | more The steps in Figure :ref:`gt-runFig` appear in the script as follows:: 1. cat >$tmp-input 2. $here/scripts/preparetxt <$tmp-input >$tmp-prepared 3. $here/scripts/tokenise <$tmp-prepared >$tmp-tokenised 4. $here/scripts/postag -m $here/models/pos <$tmp-tokenised >$tmp-postagged 5. $here/scripts/lemmatise <$tmp-postagged >$tmp-lemmatised 6. $here/scripts/nertag <$tmp-lemmatised >$tmp-nertagged 7. $here/scripts/chunk -s nested -f inline <$tmp-nertagged >$tmp-chunked 8. cat $tmp-chunked Step 1 copies the input to a temporary file ``$tmp-input``, (see Section :ref:`gt-setup` for information about ``$tmp``). This is then used in Step 2 as the input to the first processor which converts a plain text file to XML and writes its output as the temporary file ``$tmp-prepared``. Each successive step takes as input the temporary file which is output from the previous step and writes its output to another appropriately named temporary file. The output of the final processor is written to ``$tmp-chunked`` and the final step of the pipeline uses the Unix command ``cat`` to send this file to standard output. .. _gt-setup: Setup ----- All of the pipeline scripts contain this early step: :: . 
`dirname $0`/setup This causes the commands in the file ``TTT2/scripts/setup`` to be run at this point and establishes a consistent naming convention for paths to various resources. For the purposes of understanding the content of the pipeline scripts, the main points to note are: - The variable takes as value the full path to the ``TTT2`` directory. - A ``$bin`` variable is defined as ``TTT2/bin`` and is then added to the value of the user’s ``PATH`` variable so that the scripts can call the executables such as ``lxtransduce`` without needing to specify a path. - The variable ``$tmp`` is defined for use by the scripts to write temporary files and ensure that they are uniquely named. The value of ``$tmp`` follows this pattern: ``/tmp/--``. Thus the temporary file created by Step 2 above (``$tmp-prepared``, the temporary file containing the output of ``TTT2/scripts/preparetxt``) might be ``/tmp/bloggs-run-959-prepared``. Temporary files are removed automatically after the script has run, so cannot usually be inspected. Sometimes it is useful to retain them for debugging purposes and the setup script provides a method to do this — if the environment variable ``LXDEBUG`` is set then the temporary files are not removed. For example, this command: :: LXDEBUG=1 ./scripts/run testout.xml causes the script ``run`` to be run and retains the temporary files that are created along the way. Component Scripts ----------------- The main components of the ``run`` pipeline as shown in Figure :ref:`gt-runFig` are also located in the ``TTT2/scripts`` directory. They are described in detail in Sections :ref:`gt-preparetxt` – :ref:`gt-chunk`. The needs of users will vary and not all users will want to use all the components. The script has been designed so that it is simple to edit and configure for different needs. There are dependencies, however: - ``preparetxt`` assumes a plain text file as input; - all other components assume an XML document as input; - ``tokenise`` requires its input to contain paragraphs marked up as ``

`` elements; - the output of ``tokenise`` contains ```` (sentence) and ```` (word) elements and all subsequent components require this format as input; - ``lemmatise``, ``nertag`` and ``chunk`` require part-of-speech (POS) tag information so ``postag`` must be applied before them; - if both ``nertag`` and ``chunk`` are used then ``nertag`` should be applied before ``chunk``. Each of the scripts has the effect of adding more XML mark-up to the document. In all cases, except ``chunk``, the new mark-up appears on or around the character string that it relates to. Thus words are marked up by wrapping word strings with a ```` element, POS tags and lemmas are realised as attributes on ```` elements, and named entities are marked up by wrapping ```` sequences with appropriate elements. The ``chunk`` script allows the user to choose among a variety of output formats, including BIO column format and standoff output (see Section :ref:`gt-chunk` for details). Section :ref:`gt-visualise` discusses how the XML output of pipelines can be converted to formats which make it easier to visualise. The components are Unix shell scripts where input is read from standard input and output is to standard output. Most of the scripts have no arguments apart from ``postag`` and ``chunk``: details of their command line options can be found in the relevant sections below. The component scripts are similar in design and in the beginning parts they follow a common pattern: - ``usage`` and ``descr`` variables are defined for use in error reporting; - the next part is a command to run the ``setup`` script (``.~`dirname $0`/setup``) as described in Section :ref:`gt-setup` above - a ``while`` loop handles arguments appropriately - a ``lib`` variable is set to point to the directory in which the resource files for the component are kept. For example, in ``lemmatise`` it is defined like this: ``lib=\$here/lib/lemmatise`` so that instances of ``$lib`` in the script expand out to ``TTT2/lib/lemmatise``. (``$here`` is defined in the script as the ``TTT2`` directory.) .. _gt-preparetxt: The ``preparetext`` Component ============================= Overview -------- The ``preparetxt`` component is a Unix shell script called with no arguments. Input is read from standard input and output is to standard output. This script converts a plain text file into a basic XML format and is a necessary step since the LT-XML2 programs used in all the following components require XML as input. The script generates an XML header and wraps the text with a text element. It also identifies paragraphs and wraps them as ``

`` elements. If the input file is this: :: This is a piece of text. It needs to be converted to XML. the output is this: :: ]>

This is a piece of text.

It needs to be converted to XML.

Some users may want to process data which is already in XML, in which case this step should not be used. Instead, ensure that the XML input files contain paragraphs wrapped as ``

`` elements. So long as there is some kind of paragraph mark-up, this can be done using ``lxreplace``. For example, a file containing para elements like this: :: This is a piece of text. It needs to be converted to XML. can easily be converted using this command: :: cat input-file | lxreplace -q para -n "'p'" so that the output is this: ::

This is a piece of text.

It needs to be converted to XML.
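
Putting this together, a file that is already XML can bypass ``preparetxt`` entirely and be piped straight into the rest of the components. The following is a minimal sketch, assuming the commands are run from the ``TTT2`` directory and that the input uses ``para`` elements as above (the file names are placeholders)::

   cat my-input.xml \
     | lxreplace -q para -n "'p'" \
     | ./scripts/tokenise \
     | ./scripts/postag -m models/pos \
     | ./scripts/lemmatise > my-output.xml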

Note that parts of the XML structure above the paragraph level do not need to be changed since the components only affect either paragraphs or sentences and words inside paragraphs. The ``preparetext`` script -------------------------- In the early part of the script the ``$lib`` variable is defined to point to ``TTT2/lib/preparetxt/`` which is the location of the resource files used by the ``preparetxt`` pipeline. The remainder of the script contains the sequence of processing steps piped together that constitute the ``preparetxt`` pipeline. The ``preparetext`` pipeline ---------------------------- :: 1. lxplain2xml -e guess -w text | 2. lxtransduce -q text $lib/paras.gr **Step 1:** ``lxplain2xml -e guess -w text`` This step uses the LT-XML2 program ``lxplain2xml`` to convert the text into an XML file. The output is the text wrapped in a text root element (``-w text``) with an XML header that contains an encoding attribute which ``lxplain2xml`` guesses (``-e guess``) based on the characters it encounters in the text. The output of this step given the previous input file is this: :: ]> This is a piece of text. It needs to be converted to XML. <\text> The file ``TTT2/data/utf8-example`` contains a UTF-8 pound character. If Step 1 is used with this file as input, the output has a UTF-8 encoding: :: ]> This example contains a UTF-8 character, i.e. £. **Step 2:** ``lxtransduce -q text $lib/paras.gr`` The second and final step in the ``preparetxt`` pipeline uses the LT-XML2 program ``lxtransduce`` with the grammar rule file ``TTT2/preparetxt/paras.gr`` to identify and mark up paragraphs in the text as ``

`` elements. On the first example in this section the output contains two paragraphs as already shown above. On a file with no paragraph breaks, the entire text is wrapped as a ``

`` element, for example: :: ]>

This is a piece of text. It needs to be converted to XML.

</text> Note that if the encoding is UTF-8 then the second step of the pipeline does not output the XML declaration, since UTF-8 is the default encoding. Thus the output of ``preparetxt`` on the file ``TTT2/data/utf8-example`` is this: :: ]>

This example contains a UTF-8 character, i.e. £.
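
The component can also be run on its own to check what it produces for a particular file, for example (from the ``TTT2`` directory; the output file name is arbitrary)::

   ./scripts/preparetxt < data/utf8-example > utf8-prepared.xml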

.. _gt-tokenise: The ``tokenise`` Component ========================== Overview -------- The ``tokenise`` component is a Unix shell script called with no arguments. Input is read from standard input and output is to standard output. This is the first linguistic processing component in all the top level scripts and is a necessary prerequisite for all other linguistic processing. Its input is an XML document which must contain paragraphs marked up as ``

`` elements. The ``tokenise`` component acts on the ``

`` elements by (a) segmenting the character data content into ```` (word) elements and (b) identifying sentences and wrapping them as ```` elements. Thus an input like this: ::

This is an example. There are two sentences.

is transformed by ``tokenise`` and output like this (modulo white space, which has been changed for display purposes): ::

This is an example . There are two sentences .
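
A quick way to inspect the tokens themselves is to pull the word elements out of the tokenised document with the LT-XML2 ``lxgrep`` program, which is also used later in this chapter. A sketch, assuming the query ``w`` and an arbitrary wrapper element name ``words``::

   ./scripts/preparetxt < data/example1.txt \
     | ./scripts/tokenise \
     | lxgrep -w words "w"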

The attribute on ```` elements encodes a unique id for each word based on the start position of its first character. The attribute on ```` elements encodes unique sequentially numbered ids for sentences. The ``c`` attribute is used to encode word type (see :ref:`Table 2 ` for complete list of values). It serves internal purposes only and can possibly be removed at the end of preprocessing. All ```` elements have a ``pws`` attribute which has a ``no`` value if there is no white space between the word and the preceding word and a ``yes`` value otherwise. The ``sb`` attribute on sentence final full stops serves to differentiate these from sentence internal full stops. The ``pws`` and ``sb`` attributes are used by the ``nertag`` component. The ``tokenise`` script ----------------------- In the early part of the script the ``$lib`` variable is defined to point to ``TTT2/lib/tokenise/`` which is the location of the resource files used by the ``tokenise`` pipeline. The remainder of the script contains the sequence of processing steps piped together that constitute the ``tokenise`` pipeline. The ``tokenise`` pipeline ------------------------- :: 1. lxtransduce -q p $lib/pretokenise.gr | 2. lxtransduce -q p $lib/tokenise.gr | 3. lxreplace -q "w/cg" | 4. lxtransduce -q p -l lex=$lib/mobyfuncwords.lex $lib/sents-news.gr | 5. lxtransduce -q s -l lex=$here/lib/nertag/numbers.lex $lib/posttokenise.gr | 6. lxreplace -q "w/w" | 7. lxreplace -q "w[preceding-sibling::*[1][self::w]]" -t "&attrs;&children;" | 8. lxreplace -q "w[not(@pws)]" -t "&attrs;&children;" | 9. lxreplace -q cg | 10. lxaddids -e 'w' -p "'w'" -c '//text()' | 11. lxaddids -e 's' -p "'s'" **Step 1:** ``lxtransduce -q p $lib/pretokenise.gr`` The first step in the pipeline uses ``lxtransduce`` with the rules in ``pretokenise.gr``. The query (``-q p``) establishes ``

`` elements as the part of the XML that the rules are to be applied to. The pretokenise grammar converts character data inside ``

`` elements into a sequence of ‘character groups’ (```` elements) so that this: ::

"He's gone", said Fred.

is output as follows: ::

"He 's gone", said Fred.

Note that here and elsewhere we introduce line breaks to display examples to make them readable but that they are not to be thought of as part of the example. Every actual character in this example is contained in a ````, including whitespace and newline characters, e.g. the newline between *said* and *Fred* in the current example. The ``c`` attribute on ```` elements encodes the character type, e.g. ``lca`` indicates lower case. :ref:`Table 1 ` contains a complete list of values for the ``c`` attribute on ```` elements. Note that quote ```` elements (``c='qut'``) have a further attribute to indicate whether the quote is single or double: ``qut='s'`` or ``qut='d'``.

.. _gt-concg:

====== ==========================================
Code   Meaning
====== ==========================================
amp    ampersand
brk    bracket (round, square, brace)
cd     digits
cm     comma, colon, semi-colon
dash   single dash, sequence of dashes
dots   sequence of dots
gt     greater than (character or entity)
lca    lowercase alphabetic
lc-nt  lowercase n't
lt     less than entity
nl     newline
pct    percent character
qut    quote
slash  forward and backward slashes
stop   full stop, question mark, exclamation mark
sym    symbols such as ``+``, ``-``, ``@`` etc.
tab    tab character
uca    uppercase alphabetic
uc-nt  uppercase n't
what   unknown characters
ws     whitespace
====== ==========================================

Table 1: Values for the ``c`` attribute on ```` elements

**Step 2:** ``lxtransduce -q p $lib/tokenise.gr``

The second step in the pipeline uses ``lxtransduce`` with ``tokenise.gr``. The query again targets ``

`` elements but in this step the grammar uses the ```` elements of the previous step and builds ```` elements from them. Thus the output of step 1 is converted to this: ::

" He 's gone ", said Fred .

Note that the apostrophe+s sequence in *He’s* has been recognised as such (``aposs`` value for the ``c`` attribute). Non-apostrophe quote ```` elements acquire an ``lquote``, ``rquote`` or ``quote`` value for ``c`` (left, right or can’t be determined) and have a further attribute to indicate whether the quote is single or double: ``qut='s'`` or ``qut='d'``. :ref:`Table 2 ` contains a complete list of values for the ``c`` attribute on ```` elements.

.. _gt-conw:

====== ==========================================
Code   Meaning
====== ==========================================
.      full stop, question mark, exclamation mark
abbr   abbreviation
amp    ampersand
aposs  apostrophe s
br     bracket (round, square, brace)
cc     *and/or*
cd     numbers
cm     comma, colon, semi-colon
dash   single dash, sequence of dashes
dots   sequence of dots
hyph   hyphen
hyw    hyphenated word
lquote left quote
ord    ordinal
pcent  percent expression
pct    percent character
quote  quote (left/right undetermined)
rquote right quote
slash  forward and backward slashes
sym    symbols such as ``+``, ``-``, ``@`` etc.
w      ordinary word
what   unknown type of word
====== ==========================================

Table 2: Values for the ``c`` attribute on ```` elements

**Step 3:** ``lxreplace -q "w/cg"``

The third step uses ``lxreplace`` to remove ```` elements inside the new ```` elements. (Word-internal ```` elements are no longer needed, but those occurring between words marking whitespace and newline are retained for use by the sentence grammar.) The output now looks like this: ::

"He's gone", said Fred.

**Step 4:** ``lxtransduce -q p -l lex=$lib/mobyfuncwords.lex $lib/sents-news.gr`` The next step uses ``lxtransduce`` to mark up sentences as ```` elements. As well as using the ``sents-news.gr`` rule file, a lexicon of function words (``mobyfuncwords.lex``, derived from Project Gutenberg’s Moby Part of Speech List [1]_) is consulted. This is used as a check on a word with an initial capital following a full stop: if it is a function word then the full stop is a sentence boundary. The output on the previous example is as follows: ::

"He's gone", said Fred.

The ``tokenise`` script is set up to use a sentence grammar which is quite general but which is tuned in favour of newspaper text and the abbreviations that occur in general/newspaper English. The distribution contains a second sentence grammar, ``sents-bio.gr``, which is essentially the same grammar but which has been tuned for biomedical text. For example, the abbreviation *Mr.* or *MR.* is expected not to be sentence final in ``sents-news.gr`` but is permitted to occur finally in ``sents-bio.gr``. Thus this example: ::

I like Mr. Bean. XYZ interacts with 123 MR. Experiments confirm this.

is segmented by ``sents-news.gr`` as: ::

I like Mr. Bean. XYZ interacts with 123 MR. Experiments confirm this.

while ``sents-bio.gr`` segments it like this: ::

I like Mr. Bean. XYZ interacts with 123 MR. Experiments confirm this.
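
To make the tokeniser use the biomedical sentence grammar instead, the corresponding line of ``TTT2/scripts/tokenise`` (step 4 above) can be edited to point at ``sents-bio.gr``. A sketch of the changed step, assuming the biomedical grammar takes the same function-word lexicon::

   lxtransduce -q p -l lex=$lib/mobyfuncwords.lex $lib/sents-bio.gr |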

The ``sents-bio.gr`` grammar has been tested on the Genia corpus and performs very well. **Step 5:** ``lxtransduce -q s -l lex=$here/lib/nertag/numbers.lex $lib/posttokenise.gr`` The fifth step applies ``lxtransduce`` with the rule file ``posttokenise.gr`` to handle hyphenated words and full stops belonging to abbreviations. Since an ```` layer of annotation has been introduced by the previous step, the query now targets ```` elements rather than ``

`` elements. In the input to ``posttokenise.gr``, hyphens are split off from their surrounding words, so this grammar combines them to treat most hyphenated words as words rather than as word sequences — it wraps a ```` element (with the attribute ``c='hyw'``) around the relevant sequence of ```` elements, thus creating ```` inside ```` mark-up. The grammar consults a lexicon of numbers in order to exclude hyphenated numbers from this treatment. (Later processing by the numex and timex named entity rules requires that these should be left separated.) Thus if the following is input to ``tokenise``: ::

Mr. Bean eats twenty-three ice-creams.

the output after the post-tokenisation step is: ::

Mr. Bean eats twenty -three ice-creams .

The grammar also handles full stops which are part of abbreviations by wrapping a ```` element (with the attribute ``c='abbr'``) around a sequence of a word followed by a non-sentence final full stop (thus again creating ``w/w`` elements). The *Mr.* in the current example demonstrates this aspect of the grammar. Note that this post-tokenisation step represents tokenisation decisions that may not suit all users for all purposes. Some applications may require hyphenated words not to be joined (e.g. the biomedical domain where entity names are often subparts of hyphenated words (*NF-E2-related*)) and some downstream components may need trailing full stops not to be incorporated into abbreviations. This step can therefore be omitted altogether or modified according to need. **Step 6:** ``lxreplace -q "w/w"`` The sixth step in the ``tokenise`` pipeline uses ``lxreplace`` to remove the embedded mark-up in the multi-word words created in the previous step. **Step 7 & 8:** ``lxreplace -q "w[preceding-sibling::*[1][self::w]]" -t "&attrs;&children;" |`` ``lxreplace -q "w[not(@pws)]" -t "&attrs;&children;"`` The seventh and eighth steps add the attribute ``pws`` to ```` elements. This attribute indicates whether the word is preceded by whitespace or not and is used by other, later LT-TTT2 components (e.g., the ``nertag`` component). Step 7 uses ``lxreplace`` to add ``pws='no'`` to ```` elements whose immediately preceding sibling is a ````. Step 8 then adds ``pws='yes'`` to all remaining ```` elements. **Step 9:** ``lxreplace -q cg`` At this point the ```` mark-up is no longer needed and is removed by step 9. The output from steps 6–9 is as follows: ::

Mr. Bean eats twenty-three ice-creams.

**Steps 10 & 11:** ``lxaddids -e 'w' -p "'w'" -c '//text()' |`` ``lxaddids -e 's' -p "'s'"`` In the final two steps ``lxaddids`` is used to add id attributes to words and sentences. The initial example in this section, reproduced here, shows the input and output from ``tokenise`` where the words and sentences have acquired ids through these final steps: ::

This is an example. There are two sentences.

::

This is an example . There are two sentences .

In step 10, the ``-p "'w'"`` part of the ``lxaddids`` command prefixes the id value with ``w``. The ``-c '//text()'`` option ensures that the numerical part of the id reflects the position of the start character of the ```` element (e.g. the initial *e* in *example* is the 14th character in the ``text`` element). We use this kind of id so that retokenisations in one part of a file will not cause id changes in other parts of the file. Step 11 is similar except that for id values on ``s`` elements the prefix is ``s``. We have also chosen not to have the numerical part of the id reflect character position — instead, through not supplying a ``-c`` option, the default behaviour of sequential numbering obtains. .. _gt-postag: The ``postag`` Component ======================== Overview -------- The ``postag`` component is a Unix shell script called with one argument via the ``-m`` option. The argument to ``-m`` is the name of a model directory. The only POS tagging model provided in this distribution is the one found in ``TTT2/models/pos`` but we have parameterised the model name in order to make it easier for users wishing to use their own models. Input is read from standard input and output is to standard output. POS tagging is the next step after tokenisation in all the top level scripts since other later components make use of POS tag information. The input to ``postag`` is a document which has been processed by ``tokenise`` and which contains ``

``, ````, and ```` elements. The ``postag`` component adds a ``p`` attribute to each ```` with a value which is the POS tag assigned to the word by the C&C POS tagger using the ``TTT2/models/pos`` model. Thus an input like this (output from ``tokenise``): ::

This is an example . There are two sentences .

is transformed by ``postag`` and output like this: ::

This is an example . There are two sentences .
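
Outside the top-level ``run`` script the component is simply placed after ``tokenise`` in a pipe; for example (run from the ``TTT2`` directory, with a placeholder output file name)::

   ./scripts/preparetxt < data/example1.txt \
     | ./scripts/tokenise \
     | ./scripts/postag -m models/pos > postagged.xml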

The POS tagger called by the ``postag`` script is the C&C maximum entropy POS tagger (Curran and Clark 2003 [2]_) trained on data tagged with the Penn Treebank POS tagset (Marcus, Santorini, and Marcinkiewicz 1993 [3]_). We have included the relevant Linux binary and model from the C&C release at ``_ with the permission of the authors. The binary of the C&C POS tagger, which in this distribution is named ``TTT2/bin/pos``, is a copy of ``candc-1.00/bin/pos`` from the tar file ``candc-linux-1.00.tgz``. The model, which in this distribution is named ``TTT2/models/pos``, is a copy of ``ptb_pos`` from the tar file ``ptb_pos-1.00.tgz``. This model was trained on the Penn Treebank (see ``TTT2/models/pos/info`` for more details). The C&C POS tagger may be used under the terms of the academic (non-commercial) licence at ``_. Note that the ``postag`` script is simply a wrapper for a particular non-XML based tagger. It converts the input XML to the input format of the tagger, invokes the tagger, and then merges the tagger output back into the XML representation. It is possible to make changes to the script and the conversion files in order to replace the C&C tagger with another. The ``postag`` script --------------------- Since ``postag`` is called with a ``-m`` argument, the early part of the script is more complex than scripts with no arguments. The ``while`` and ``if`` loops set up the ``-m`` argument so that the path to the model has to be provided when the component is called. Thus all the top level scripts which call the ``postag`` component do so in this way: :: $here/scripts/postag -m $here/models/pos In the next part of the script the ``$lib`` variable is defined to point to ``TTT2/lib/postag/`` which is the location of the resource files used by the ``postag`` pipeline. The remainder of the script contains the sequence of processing steps piped together that constitute the ``postag`` pipeline. The ``postag`` pipeline ----------------------- :: 1. cat >$tmp-in 2. lxconvert -w -q s -s $lib/pos.cnv <$tmp-in | 3. pos -model $model 2>$tmp-ccposerr | 4. lxconvert -r -q s -s $lib/pos.cnv -x $tmp-in **Step 1:** ``cat >$tmp-in`` The first step in the pipeline copies the input to the temporary file ``$tmp-in``. This is so that it can both be converted to C&C input format as well as retained as the file that the C&C output will be merged with. **Step 2:** ``lxconvert -w -q s -s $lib/pos.cnv <$tmp-in`` The second step uses ``lxconvert`` to convert into the right format for input to the C&C POS tagger (one sentence per line, tokens separated by white space). The ``-s`` option instructs it to use the ``TTT2/lib/postag/pos.cnv`` stylesheet, while the ``-q s`` query makes it focus on ```` elements. (The component will therefore not work on files which do not contain ```` elements.) The ``-w`` option makes it work in write mode so that it follows the rules for writing C&C input format. If the following ``tokenise`` output: ::

Mr. Bean had an ice-cream. He dropped it.

is input to the first step, its output looks like this: :: Mr. Bean had an ice-cream . He dropped it . and this is the format that the C&C POS tagger requires. **Step 3:** ``pos -model $model 2>$tmp-ccposerr`` The third step is the one that actually runs the C&C POS tagger. The ``pos`` command has a ``-model`` option and the argument to that option is provided by the ``$model`` variable which is set by the ``-m`` option of the ``postag`` script, as described above. The ``2>$tmp-ccposerr`` ensures that all C&C messages are written to a temporary file rather than to the terminal. If the input to this step is the output of the previous step shown above, the output of the tagger is this: :: Mr.|NNP Bean|NNP had|VBD an|DT ice-cream|NN .|. He|PRP dropped|VBD it|PRP .|. Here each token is paired with its POS tag following the ‘``|``’ separator. The POS tag information in this output now needs to be merged back in with the original document. **Step 4:** ``lxconvert -r -q s -s $lib/pos.cnv -x $tmp-in`` The fourth and final step in the ``postag`` component uses ``lxconvert`` with the same stylesheet as before (``-s $lib/pos.cnv``) to pair the C&C output file with the original input which was copied to the temporary file, ``$tmp-in``, in step 1. The ``-x`` option to ``lxconvert`` identifies this original file. The ``-r`` option tells ``lxconvert`` to use read mode so that it follows the rules for reading C&C output (so as to cause the POS tags to be added as the value of the ``p`` attribute on ```` elements). The query again identifies ```` elements as the target of the rules. For the example above which was output from the previous step, the output of this step is as follows: ::

Mr. Bean had an ice-cream . He dropped it .
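
If the intermediate C&C input and the tagger's messages need to be examined, the temporary files created by the component (``$tmp-in``, ``$tmp-ccposerr`` etc.) can be retained by setting ``LXDEBUG`` as described in the Setup section. A sketch, assuming the temporary file naming pattern described there (user name, script name and process id, so the exact names will vary)::

   LXDEBUG=1 ./scripts/postag -m models/pos < tokenised.xml > postagged.xml
   ls /tmp/*-postag-*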

.. _gt-lemmatise: The ``lemmatise`` Component =========================== Overview -------- The ``lemmatise`` component is a Unix shell script called with no arguments. Input is read from standard input and output is to standard output. The ``lemmatise`` component computes information about the stem of inflected words: for example, the stem of *peas* is *pea* and the stem of *had* is *have*. In addition, the verbal stem of nouns and adjectives which derive from verbs is computed: for example, the verbal stem of *arguments* is *argue*. The lemma of a noun, verb or adjective is encoded as the value of the ``l`` attribute on ```` elements. The verbal stem of a noun or adjective is encoded as the value of the ``vstem`` attribute on ```` elements. The input to ``lemmatise`` is a document which has been processed by ``tokenise`` and ``postag`` and which therefore contains ``

``, ````, and ```` elements with POS tags encoded in the ``p`` attribute of ```` elements. Since lemmatisation is only applied to nouns, verbs and verb forms which have been tagged as adjectives, the syntactic category of the word is significant — thus the ``lemmatise`` component must be applied after the ``postag`` component and not before. When the following is passed through ``tokenise``, ``postag`` and ``lemmatise``: ::

The planning committee were always having big arguments. The children have frozen the frozen peas.

it is output like this (again modulo white space): ::

The planning committee were always having big arguments . The children have frozen the frozen peas .
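
Once the ``l`` and ``vstem`` attributes are in place they can be picked out for inspection with the LT-XML2 tools; for instance, the following sketch (placeholder input file, arbitrary wrapper element name ``vstems``) lists every word that received a verbal stem::

   ./scripts/lemmatise < postagged.xml | lxgrep -w vstems "w[@vstem]"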

The lemmatiser called by the ``lemmatise`` script is ``morpha`` (Minnen, Carroll, and Pearce 2000 [4]_). We have included the relevant binary and verb stem list from the release at ``_ with the permission of the authors. The binary of ``morpha``, which in this distribution is located at ``TTT2/bin/morpha``, is a copy of ``morpha.ix86_linux`` from the tar file ``morph.tar.gz``. The resource file, ``verbstem.list``, which in this distribution is located in the ``TTT2/lib/lemmatise/`` directory is copied from the same tar file. The ``morpha`` software is free for research purposes. Note that the ``lemmatise`` script is similar to the ``postag`` script in that it is a wrapper for a particular non-XML based program. It converts the input XML to the input format of the lemmatiser, invokes the lemmatiser, and then merges its output back into the XML representation. It is possible to make changes to the script and the conversion files in order to plug out the ``morpha`` lemmatiser and replace it with another. The pipeline does a little more than just wrap ``morpha``, however, because it also computes the ``vstem`` attribute on certain nouns and adjectives (see step 4 in the next section). In doing this it uses a lexicon of information about the verbal stem of nominalisations (e.g. the stem of *argument* is *argue*). This lexicon, ``TTT2/lib/lemmatise/umls.lex``, is derived from the file in the 2007 UMLS SPECIALIST lexicon distribution [5]_. The ``lemmatise`` script ------------------------ In the early part of the script the ``$lib`` variable is defined to point to ``TTT2/lib/lemmatise/`` which is the location of the resource files used by the ``lemmatise`` pipeline. The remainder of the script contains the sequence of processing steps piped together that constitute the ``lemmatise`` pipeline. The ``lemmatise`` pipeline -------------------------- :: 1. cat >$tmp-in 2. lxconvert -w -q w -s $lib/lemmatise.cnv <$tmp-in | 3. morpha -f $lib/verbstem.list | 4. lxconvert -r -q w -s $lib/lemmatise.cnv -x $tmp-in **Step 1:** ``cat >$tmp-in`` The first step in the pipeline copies the input to the temporary file ``$tmp-in``. This is so that it can both be converted to ``morpha`` input format as well as retained as the file that the ``morpha`` output will be merged with. **Step 2:** ``lxconvert -w -q w -s $lib/lemmatise.cnv <$tmp-in`` The second step uses ``lxconvert`` to convert ``$tmp-in`` into an appropriate format for input to the ``morpha`` lemmatiser (one or sometimes two word_postag pairs per line). The ``-s`` option instructs it to use the ``TTT2/lib/lemmatise/lemmatise.cnv`` stylesheet, while the ``-q w`` query makes it focus on ```` elements. (The component will therefore work on any file where words are encoded as ```` elements and POS tags are encoded in the attribute ``p`` on ````.) The ``-w`` option makes it work in write mode so that it follows the rules for writing ``morpha`` input format. If the following ``postag`` output: ::

The planning committee were always having big arguments . The children have frozen the frozen peas.

is input to the first step, its output looks like this: :: planning_NN planning_V committee_NN were_VBD having_VBG big_JJ arguments_NNS children_NNS have_VBP frozen_VBN frozen_JJ frozen_V peas_NNS Each noun, verb or adjective is a placed on a line and its POS tag is appended after an underscore. Where a noun or an adjective ends with a verbal inflectional ending, a verb instance of the same word is created (i.e. ``planning_V``, ``frozen_V`` ) in order that ``morpha``’s output for the verb can be used as the value for the ``vstem`` attribute. **Step 3:** ``morpha -f $lib/verbstem.list`` The third step is the one that actually runs ``morpha``. The ``morpha`` command has a ``-f`` option to provide a path to the ``verbstem.list`` resource file that it uses. If the input to this step is the output of the previous step shown above, the output of ``morpha`` is this: :: planning plan committee be have big argument child have freeze frozen freeze pea Here it can be seen how the POS tag affects the performance of the lemmatiser. The lemma of *planning* is *planning* when it is a noun but *plan* when it is a verb. Similarly, the lemma of *frozen* is *frozen* when it is an adjective but *freeze* when it is a verb. Irregular forms are correctly handled (*children:child*, *frozen:freeze*). **Step 4:** ``lxconvert -r -q w -s $lib/lemmatise.cnv -x $tmp-in`` The fourth and final step in the ``lemmatise`` component uses ``lxconvert`` with the same stylesheet as before (``-s $lib/lemmatise.cnv``) to pair the ``morpha`` output file with the original input which was copied to the temporary file, ``$tmp-in``, in step 1. The ``-x`` option to ``lxconvert`` identifies this original file. The ``-r`` option tells ``lxconvert`` to use read mode so that it follows the rules for reading ``morpha`` output. The query again identifies ```` elements as the target of the rules. For the example above which was output from the previous step, the output of this step is as follows (irrelevant attributes suppressed): ::

The planning committee were always having big arguments. The children have frozen the frozen peas.

Here the lemma is encoded as the value of ``l`` and, where a second verbal form was input to ``morpha`` (*planning*, *frozen* as an adjective), the output becomes the value of the ``vstem`` attribute. Whenever the lemma of a noun can be successfully looked up in the nominalisation lexicon (``TTT2/lib/lemmatise/umls.lex``), the verbal stem is encoded as the value of ``vstem`` (argument:argue). The relevant entry from ``TTT2/lib/lemmatise/umls.lex`` is this: :: .. _gt-nertag: The ``nertag`` Component ======================== .. _gt-nerintro: Overview -------- The ``nertag`` component is a Unix shell script called with no arguments. Input is read from standard input and output is to standard output. The ``nertag`` component is a rule-based named entity recogniser which recognises and marks up certain kinds of named entity: numex (sums of money and percentages), timex (dates and times) and enamex (persons, organisations and locations). These are the same entities as those used for the MUC7 named entity evaluation (Chinchor 1998) [6]_. (In addition ``nertag`` also marks up some miscellaneous entities such as urls.) Unlike the other components, ``nertag`` has a more complex structure where it makes calls to subcomponent pipelines which are also located in the ``TTT2/scripts`` directory. Figure :ref:`gt-nerFig` shows the structure of the nertag pipeline. .. _gt-nerFig: .. figure:: images/ner.jpg :width: 50% :align: center :alt: 'nertag' pipeline The ``nertag`` pipeline The input to ``nertag`` is a document which has been processed by ``tokenise``, ``postag`` and ``lemmatise`` and which therefore contains ``

``, ````, and ```` elements and the attributes ``p``, ``l`` and ``vstem`` on the ```` elements. The rules identify sequences of words which are entities and wrap them with the elements ````, ```` and ````, with subtypes encoded as the value of the ``type`` attribute. For example, the following might be input to a sequence of ``tokenise``, ``postag`` and ``nertag``. ::

Peter Johnson, speaking in London yesterday afternoon, said that profits for ABC plc were up 5% to $17 million.

The output is a relatively unreadable XML document where all the ``

``, ````, and ```` elements and attributes described in the previous sections have been augmented with further attributes and where ````, ```` and ```` elements have been added. For clarity we show the output below after ```` and ```` mark up has been removed using the command ``lxreplace -q w|phr``. Removing extraneous mark-up in this way and at this point might be appropriate if named entity recognition was the final aim of the processing. If further processing such as chunking is to be done then the ```` and ```` mark-up must be retained. ::

Peter Johnson, speaking in London yesterday afternoon, said that profits for ABC plc were up 5% to $17 million.
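
A complete sequence producing output like the above, including the display clean-up mentioned earlier, might look as follows (a sketch from the ``TTT2`` directory; omit the final ``lxreplace`` and keep the word and phrase mark-up if chunking is still to be applied)::

   ./scripts/preparetxt < data/example1.txt \
     | ./scripts/tokenise \
     | ./scripts/postag -m models/pos \
     | ./scripts/lemmatise \
     | ./scripts/nertag \
     | lxreplace -q "w|phr" > nertagged.xml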

The ``nertag`` script --------------------- In the early part of the script the ``$lib`` variable is defined to point to ``TTT2/lib/nertag/`` which is the location of the resource files used by the ``nertag`` pipeline. The remainder of the script contains a sequence of processing steps piped together: :: 1. $here/scripts/numtimex | 2. $here/scripts/lexlookup | 3. $here/scripts/enamex | (``$here`` is defined in the setup as the ``TTT2`` directory). Unlike previous components, these steps are calls to subcomponents which are themselves shell scripts containing pipelines. Thus the ``nertag`` process is sub-divided into three subcomponents, ``numtimex`` to identify and mark up ```` and ```` elements, ``lexlookup`` to apply dictionary lookup for names and, finally, ``enamex `` which marks up ```` elements taking into account the output of ``lexlookup``. The following subsections describe each of these subcomponents in turn. Note that the ``lxtransduce`` grammars used in the ``numtimex`` subcomponent are updated versions of the grammars used in Mikheev, Grover, and Moens (1998) [7]_ and previously distributed in the original LT-TTT distribution. The output of ``numtimex`` is therefore of relatively high quality. The other two subcomponents are new for this release and the ``enamex`` rules have not been extensively tested or tuned. The ``numtimex`` script ----------------------- In the early part of the script the ``$lib`` variable is defined to point to ``TTT2/lib/nertag/`` which is the location of the resource files used by the ``numtimex`` pipeline. The remainder of the script contains the sequence of processing steps piped together that constitute the ``numtimex`` pipeline. The ``numtimex`` pipeline ------------------------- :: 1. lxtransduce -q s -l lex=$lib/numbers.lex $lib/numbers.gr | 2. lxreplace -q "phr/phr" | 3. lxreplace -q "phr[w][count(node())=1]" -t "&children;" | 4. lxtransduce -q s -l lex=$lib/currency.lex $lib/numex.gr | 5. lxreplace -q "phr[not(@c='cd') and not(@c='yrrange') and not(@c='frac')]" | 6. lxtransduce -q s -l lex=$lib/timex.lex -l numlex=$lib/numbers.lex $lib/timex.gr | 7. lxreplace -q "phr[not(.~' ')]" -t "&attrs;" **Step 1:** ``lxtransduce -q s -l lex=$lib/numbers.lex $lib/numbers.gr`` Numerical expressions are frequent subparts of ```` and ```` entities so the first step in the pipeline identifies and marks up a variety of numerical expressions so that they are available for later stages of processing. This step uses ``lxtransduce`` with the rules in the ``numbers.gr`` grammar file and uses the query ``-q s`` so as to process the input sentence by sentence. It consults a lexicon of number words (``numbers.lex``) which contains word entries for numbers (e.g. eighty, billion). If the following sentence is processed by step 1 after first having been put through ``tokenise`` and ``postag`` (and ``lemmatise`` but this doesn’t affect ``numtimex`` and is disregarded here): :: The third announcement said that the twenty-seven billion euro deficit was discovered two and a half months ago. the output will be this (again modulo white space): ::

The third announcement said that the twenty- seven billion euro deficit was discovered two and a half months ago .

This output can be seen more clearly if we remove the ```` elements: ::

The third announcement said that the twenty-seven billion euro deficit was discovered two and a half months ago.
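
The clearer display above uses the same ``lxreplace`` idiom as elsewhere in this chapter; a sketch, assuming that the layer stripped for readability is the word-level mark-up while the ``phr`` elements are kept (the input file name is a placeholder)::

   ./scripts/numtimex < lemmatised.xml | lxreplace -q w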

Subsequent grammars are able to use such ``phr`` elements when building larger entity expressions. **Step 2:** ``lxreplace -q phr/phr`` The second step uses ``lxreplace`` to remove embedded ```` mark-up so that numerical phrases don’t have unnecessary internal structure: ::

The third announcement said that the twenty-seven billion euro deficit was discovered two and a half months ago.

**Step 3:** ``lxreplace -q phr[w][count(node())=1] -t &children;`` The third step makes another minor adjustment to the ```` mark-up. The grammar will sometimes wrap single words as ```` elements (e.g. the *third* in the current example) and, since this is unnecessary, in this step ``lxreplace`` is used to remove any ```` tag where there is a single ```` daughter. Thus the current example is changed to this: ::

The third announcement said that the twenty-seven billion euro deficit was discovered two and a half months ago.

**Step 4:** ``lxtransduce -q s -l lex=$lib/currency.lex $lib/numex.gr`` The fourth step of the pipeline recognises ```` entities using the rules in ``numex.gr``. It is this step which is responsible for the two instances of ```` mark-up in the example in section :ref:`nertag Overview `. For the current example, the output of this step (after removing ```` elements) is this: ::

The third announcement said that the twenty-seven billion euro deficit was discovered two and a half months ago.

The grammar makes use of the ``currency.lex`` lexicon which contains a list of the names of a wide range of currencies. Using this information it is able to recognise the money ```` element. **Step 5:** ``lxreplace -q phr[not(@c=’cd’) and not(@c=’yrrange’) and not(@c=’frac’)]`` It is not intended that ```` mark-up should be part of the final output of a pipeline—it is only temporary mark-up which helps later stages and it should be deleted as soon as it is no longer needed. At this point, ```` elements with ``cd``, ``frac`` and ``yrrange`` as values for the ``c`` attribute are still needed but other ```` elements are not. This step removes all ```` elements which are not still needed. **Step 6:** ``lxtransduce -q s -l lex=$lib/timex.lex -l numlex=$lib/numbers.lex $lib/timex.gr`` The sixth step of the pipeline recognises ```` entities using the rules in ``timex.gr``. It is this step which is responsible for the two instances of ```` mark-up in the example in section :ref:`gt-nerintro`. For the current example, the output of this step (after removing ```` elements) is this: ::

The third announcement said that the twenty-seven billion euro deficit was discovered two and a half months ago.

The grammar makes use of two lexicons, ``timex.lex``, which contains entries for the names of days, months, holidays, time zones etc., and ``numbers.lex``. In addition to examples of the kind shown here, the timex rules recognise standard dates in numerical or more verbose form (08/31/07, 31.08.07, 31st August 2007 etc.), times (half past three, 15:30 GMT etc.) and other time-related expressions (late Tuesday night, Christmas, etc.). **Step 7:** ``lxreplace -q "phr[not(.~' ')]" -t "&attrs;"`` By this point the only ```` mark-up that will still be needed is that around multi-word phrases, i.e. those containing white space (e.g. *three quarters*). Where there is no white space, this step creates a ```` element instead of the original ````. The new ```` element acquires first the attributes of the first ```` in the old ```` (``'w[1]/@*'``) and then the attributes of the old ```` itself (``&attrs;``) — since both have a ``c`` attribute, the one from the ```` is retained. The text content of the embedded ```` elements is copied but the embedded ```` element tags are not. The following is an example of input to this step. Note that the line break between *three* and *-* is there for layout purposes and does not exist in the actual input. ::

two thousand; three -quarters

The output for this example is this: ::

two thousand; three-quarters

The result is that *three-quarters* is now recognised as a single word token, rather than the three from before. This brings the mark-up more into line with standard tokenisation practise which does not normally split hyphenated numbers: subsequent steps can therefore assume standard tokenisation for such examples. The *two thousand* example is left unchanged because standard tokenisation treats this as two tokens. However, since we have computed that together *two* and *thousand* constitute a numerical phrase, we keep the ```` mark-up for future components to benefit from. For example a noun group chunking rule can describe a numeric noun specifier as either a ```` or a ```` instead of needing to make provision for one or more numeric words in specifier position. If, however, the ``numtimex`` component is to be the last in a pipeline and no further LT-TTT2 components are to be used, either the last step can be changed to remove all ```` mark-up or the call to ``numtimex`` can be followed by a call to ``lxreplace`` to remove ```` elements. The ``lexlookup`` script ------------------------ In the early part of the script the ``$lib`` variable is defined to point to ``TTT2/lib/nertag/`` which is the location of the resource files used by the ``lexlookup`` pipeline. The remainder of the script contains the sequence of processing steps piped together that constitute the ``lexlookup`` pipeline. The ``lexlookup`` pipeline -------------------------- :: 1. lxtransduce -q s -a firstname $lib/lexlookup.gr | 2. lxtransduce -q s -a common $lib/lexlookup.gr | 3. lxtransduce -q s -a otherloc $lib/lexlookup.gr | 4. lxtransduce -q s -a place $lib/lexlookup.gr **Step 1:** ``lxtransduce -q s -a firstname $lib/lexlookup.gr`` This step uses ``lexlookup.gr`` to mark up words which are known forenames. The ``-a`` option to ``lxtransduce`` instructs it to apply the ``firstname`` rule: :: This rule does look-up against two lexicons of female and male first names where the locations of the lexicons are defined in the grammar like this: :: i.e. the lexicons are expected to be located in the same directory as the grammar itself. The lexicons are derived from lists at ``_. This step adds the attribute ``pername=true`` to words which match so that ``Peter`` becomes ``Peter``. **Step 2:** ``lxtransduce -q s -a common $lib/lexlookup.gr`` This step uses ``lexlookup.gr`` to identify capitalised nominals which are known to be common words. The ``-a`` option to ``lxtransduce`` instructs it to apply the ``common`` rule: :: This rule does look-up against a lexicon of common words where the location of the lexicon is defined in the grammar like this: :: i.e. the lexicon is expected to be located in the same directory as the grammar itself. The common word lexicon is derived from an intersection of lower case alphabetic entries in Moby Part of Speech (``_) and a list of frequent common words derived from ``docfreq.gz`` available from the Berkeley Web Term Document Frequency and Rank site (``_). Because this is a very large lexicon (25,307 entries) it is more efficient to use a memory-mapped version (with a ``.mmlex`` extension) since the default mechanism for human-readable lexicons loads the entire lexicon into memory and incurs a significant start-up cost if the lexicon is large. Memory-mapped lexicons are derived from standard lexicons using the LT-XML2 program, ``lxmmaplex``. The source of ``common.mmlex``, ``common.lex``, is located in the ``TTT2/lib/nertag`` directory and can be searched. 
If it is changed, the memory-mapped version needs to be recreated. The effect of step 2 is to add the attribute ``common=true`` to capitalised nominals which match so that ``Paper`` becomes ``Paper``. **Step 3:** ``lxtransduce -q s -a otherloc $lib/lexlookup.gr`` This step uses ``lexlookup.gr`` to identify the names of countries (e.g. *France*) as well as capitalised words which are adjectives or nouns relating to place names (e.g. *French*). The ``-a`` option to ``lxtransduce`` instructs it to apply the ``otherloc`` rule: :: The first lookup in the rule accesses the lexicon of country names while the second accesses the lexicon of locational adjectives, where the location of the lexicons are defined in the grammar like this: :: i.e. the lexicons are expected to be located in the same directory as the grammar itself. The lexicons are derived from lists at ``_ and ``_. The effect of step 3 is to add the attributes ``country=true`` and ``locadj=true`` to capitalised words which match so that ``Portuguese`` and ``Brazil`` become ``Portuguese`` and ``Brazil``. **Step 4:** ``lxtransduce -q s -a place $lib/lexlookup.gr`` The final step uses ``lexlookup.gr`` to identify the names of places. The ``-a`` option to ``lxtransduce`` instructs it to apply the ``place`` rule: :: This accesses two rules, one for multi-word place names and one for single word place names. For multi-word place names, the assumption is that these are unlikely to be incorrect, so the rule wraps them as ````: :: Single word place names are highly likely to be ambiguous so the rule for these just adds the attribute ``locname=single`` to words which match. :: These rules access lexicons of multi-word and single-word place names, where the location of the lexicons are defined in the grammar like this: :: i.e. the lexicons are expected to be located in the same directory as the grammar itself. The source of the lexicons is the Alexandria Digital Library Project Gazetteer (``_), specifically, the name list, which can be downloaded from ``_ [8]_. Various filters have been applied to the list to derive the two separate lexicons, to filter common words out of the single-word lexicon and to discard certain kinds of entries. As with the common word lexicon, we use memory-mapped versions of the two lexicons because they are very large (1,797,719 entries in ``alexandria-multi.lex`` and 1,634,337 entries in ``alexandria-single.lex``). The effect of step 4 is to add ```` mark-up or ``locname=single`` to words which match so that ``Manhattan`` becomes ``Manhattan`` and ``New York`` becomes ``New York``. Note that because the rules in ``lexlookup.gr`` are applied in a sequence of calls rather than all at once, a word may be affected by more than one of the look-ups. See, for example, the words *Robin*, *Milton* and *France* in the output for *Robin York went to the British Rail office in Milton Keynes to arrange a trip to France.*: :: Robin York went to the British Rail office in Milton Keynes to arrange a trip to France. The new attributes on ```` elements are used by the rules in the ```` component, while the multi-word location mark-up prevents these entities from being considered by subsequent rules. Thus *Milton Keynes* will not be analysed as a person name. The ``enamex`` script --------------------- In the early part of the script the ``$lib`` variable is defined to point to ``TTT2/lib/nertag/`` which is the location of the resource files used by the ``enamex`` pipeline. 
The remainder of the script contains the sequence of processing steps piped together that constitute the ``enamex`` pipeline. The ``enamex`` pipeline ----------------------- :: 1. lxtransduce -q s -l lex="$lib/enamex.lex" $lib/enamex.gr | 2. lxreplace -q "enamex/enamex" > $tmp-pre-otf 3. $here/scripts/onthefly <$tmp-pre-otf >$tmp-otf.lex 4. lxtransduce -q s -l lex=$tmp-otf.lex $lib/enamex2.gr <$tmp-pre-otf | 5. lxreplace -q subname **Step 1:** ``lxtransduce -q s -l lex=$lib/enamex.lex $lib/enamex.gr`` Step 1 in the ``enamex`` pipeline applies the main grammar, ``enamex.gr``, which marks up ```` elements of type ``person``, ``organization`` and ``location``, as well as miscellaneous entities such as urls. An input like this: ::

Mr. Joe L. Bedford (www.jbedford.org) is President of JB Industries Inc. Bedford has an office in Paris, France.

is output as this (```` mark-up suppressed): ::

Mr. Joe L. Bedford (www.jbedford.org) is President of JB Industries Inc. Bedford has an office in Paris, France.

At this stage, single-word place names are not marked up as they can be very ambiguous — in this example *Bedford* is a person name, not a place name. The country name *France*, has been marked up, however, because the ``lexlookup`` component marked it as a country and country identification is more reliable. **Step 2:** ``lxreplace -q enamex/enamex > $tmp-pre-otf`` Multi-word locations are identified during ``lexlookup`` and can form part of larger entities, with the result that it is possible for step 1 to result in embedded marked, e.g.: :: Bishops Stortford Town Council Since embedded mark-up is not consistently identified, it is removed. This step applies ``lxreplace`` to remove inner ```` mark-up. The output of this step is written to the temporary file ``$tmp-pre-otf`` because it feeds into the creation of an ‘on the fly’ lexicon which is created from the first pass of ``enamex`` in order to do a second pass matching repeat examples of first pass ```` entities. **Step 3:** ``$here/scripts/onthefly <$tmp-pre-otf >$tmp-otf.lex`` The temporary file from the last step, ``$tmp-pre-otf``, is input to the script ``TTT2/scripts/onthefly`` (described in Sections :ref:`gt-otfscript` and :ref:`gt-otfpipe`) which creates a small lexicon containing the ```` elements which have already been found plus certain variants of them. If the example illustrating step 1 is input to ``TTT2/scripts/onthefly``, the lexicon which is output is as follows: :: person location organization person person person **Step 4:** ``lxtransduce -q s -l lex=$tmp-otf.lex $lib/enamex2.gr <$tmp-pre-otf`` The ‘on the fly’ lexicon created at step 3 is used in step 4 with a second enamex grammar, ``enamex2.gr``. This performs lexical lookup against the lexicon and in our current example this leads to the recognition of *Bedford* in the second sentence as a person rather than a place. The grammar contains a few other rules including one which finally accepts single word placenames (````) as locations — this results in *Paris* in the current example being marked up. **Step 5:** ``lxreplace -q subname`` The final step of the ``enamex`` component (and of the ``nertag`` component) is one which removes a level of mark-up that was created by the ``enamex`` rules in the ``enamex.gr`` grammar, namely the element ````. This was needed to control how a person name should be split when creating the ‘on the fly’ lexicon, but it is no longer needed at this stage. The final output of the ``nertag`` component for the current example is this: ::

Mr. Joe L. Bedford (www.jbedford.org) is President of JB Industries Inc. Bedford has an office in Paris, France.
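
Because ``nertag`` is only a wrapper around these three subcomponents, they can also be run and inspected one at a time; a sketch with placeholder file names::

   ./scripts/numtimex  < lemmatised.xml > numtimexed.xml
   ./scripts/lexlookup < numtimexed.xml > lexlooked.xml
   ./scripts/enamex    < lexlooked.xml  > nertagged.xml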

.. _gt-otfscript:

The ``onthefly`` script
-----------------------

This script uses the LT-XML2 programs to extract names from the first pass of ``enamex`` and convert them into an ‘on the fly’ lexicon (the lexicon ``$tmp-otf.lex`` referred to above). The conversion is achieved through sequences of ``lxreplace`` and ``lxt`` as well as use of ``lxsort`` and ``lxuniq``. This is a useful example of how simple steps using these programs can be combined to create a more complex program. In the early part of the script the ``$lib`` variable is defined to point to ``TTT2/lib/nertag/`` which is the location of the resource files used by the ``onthefly`` pipeline. The remainder of the script contains the sequence of processing steps piped together that constitute the ``onthefly`` pipeline.

.. _gt-otfpipe:

The ``onthefly`` pipeline
-------------------------

::

   1.  lxgrep -w lexicon "enamex[@type='person' and not(subname[@type='fullname'])]|subname[@type='fullname']|enamex[@type='location']|enamex[@type='organization']" |
   2.  lxreplace -q "enamex" -t "&attrs;&children;" |
   3.  lxreplace -q "w/@*" |
   4.  lxreplace -q "name/subname" -t "&children;" |
   5.  lxreplace -q "w/w" |
   6.  lxreplace -q "lexicon/subname" -t "&children;" |
   7.  lxreplace -q "lexicon/*/text()" -r "normalize-space(.)" |
   8.  lxreplace -q "w[.~'^(.|[A-Z]\.)$']" -t "&children;" |
   9.  lxt -s $lib/expandlex.xsl |
   10. lxreplace -q "w[position()!=1]" -t " &this;" |
   11. lxreplace -q w |
   12. lxreplace -q "name[not(node())]" -t "" |
   13. lxreplace -q name -t "" |
   14. lxt -s $lib/merge-lexicon-entries.xsl |
   15. lxsort lexicon lex @word |
   16. lxuniq lexicon lex @word |
   17. lxsort lex cat . |
   18. lxuniq lex cat .

**Step 1**

The first step uses ``lxgrep`` to extract location and organization ``enamex`` elements as well as either full person ``enamex`` elements or a relevant subpart of a name which contains a title. The input is a document with
``p``, ``s``, ``w``, and ``enamex``, ``numex`` and ``timex`` mark-up and the output of this call to ``lxgrep`` for the previous *Mr. Joe L. Bedford* example is this:

::

   Joe Bedford JB Industries Inc France

**Steps 2–8**

The next seven steps use ``lxreplace`` to gradually transform the ``enamex`` and ``subname`` elements in the ``lxgrep`` output into ``name`` elements: the ``w`` elements inside the ``name`` elements lose their attributes and the white space between them is removed (because the original white space in the source text may be irregular and include newlines). In Step 8, ``w`` elements which are initials are given the attribute ``init=yes`` so that they can be excluded from consideration when variants of the entries are created. The output from these steps is this:

::

   JoeL.Bedford JBIndustriesInc France

**Step 9**

Step 9 uses ``lxt`` with the stylesheet ``TTT2/lib/nertag/expandlex.xsl`` to create extra variant entries for person names. The output now looks like this:

::

   JoeL.Bedford Bedford Joe Joe Bedford Bedford JoeBedford JBIndustriesInc France

The duplicates are a side-effect of the rules in the stylesheet and are removed before the end of the pipeline.

**Steps 10–13**

The next four steps use ``lxreplace`` to continue the transformation of the ``name`` elements. Regular white space is inserted between the ``w`` elements and then the ``w`` mark-up is removed. Any empty ``name`` elements are removed and the conversion to proper ``lxtransduce`` lexicon format is done with the final ``lxreplace``. The output now looks like this:

::

   person person person person person person person organization location

**Step 14**

At this stage there are still duplicates so this step uses ``lxt`` with the stylesheet ``TTT2/lib/nertag/merge-lexicon-entries.xsl`` to add to each entry the ``cat`` elements of all its duplicates. The output from this step looks like this:

::

   person personpersonperson personperson personperson personpersonperson personpersonperson person organization location

Note that in this example, each entity is only of one type. In other examples, the same string may have been identified by the enamex grammar as belonging to different types in different contexts, for example, *Prof. Ireland happens to work in Ireland.* In this case the output at this stage looks like this:

::

   personlocation personlocation

**Steps 15–18**

The final four steps of the pipeline use ``lxsort`` and ``lxuniq`` to remove duplicate entries and duplicate ``cat`` elements. The final result for the running example is this:

::

   person location organization person person person

.. _gt-chunk:

The ``chunk`` Component
=======================

.. _gt-chunkintro:

Overview
--------

The ``chunk`` component is a Unix shell script. Input is read from standard input and output is to standard output. The script requires two parameters supplied through the ``-s`` and ``-f`` options. The ``-s`` option specifies the style of output that is required, with possible arguments being ``conll``, ``flat``, ``nested`` or ``none``. The ``-f`` option specifies the format of output, with possible arguments being ``standoff``, ``bio`` or ``inline``.

The ``chunk`` component is a rule-based chunker which recognises and marks up shallow syntactic groups such as noun groups, verb groups etc. A description of an earlier version of the chunker can be found in Grover and Tobin (2006) [9]_. The earlier version only marked up noun and verb groups while the current version also marks up preposition, adjective, adverb and sbar groups.
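
For example, the two parameters can be combined to produce the different kinds of output described below. The calls in this sketch assume the ``TTT2`` directory as the working directory and use ``tagged.xml`` as a hypothetical name for a POS-tagged (and optionally ``nertag``-ged) input document:

::

   # CoNLL-style chunks with inline XML mark-up
   ./scripts/chunk -s conll  -f inline   < tagged.xml > chunked-conll.xml

   # flat chunks converted to BIO labels on the word elements
   ./scripts/chunk -s flat   -f bio      < tagged.xml > chunked-bio.xml

   # nested chunks recorded as standoff mark-up
   ./scripts/chunk -s nested -f standoff < tagged.xml > chunked-standoff.xml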

The first part of the pipeline produces mark-up which is similar to, though not identical to, the chunk mark-up in the CoNLL 2000 data (Tjong Kim Sang and Buchholz 2000) [10]_. This mark-up is then converted to reflect different chunking styles and different formats of output through use of the ``-s`` and ``-f`` parameters. When applied after tokenisation and POS tagging, the first part of the pipeline converts this input:

::

   In my opinion, this example hasn't turned out well.

to this output (whitespace altered):

::

   In my opinion , this example hasn't turned out well .
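
A sketch of how this output can be reproduced from the command line, assuming the ``TTT2`` directory as the working directory (the ``-s none -f inline`` combination outputs the initial mark-up produced by the first part of the pipeline, as described below):

::

   echo "In my opinion, this example hasn't turned out well." |
   ./scripts/preparetxt |
   ./scripts/tokenise |
   ./scripts/postag -m models/pos |
   ./scripts/chunk -s none -f inline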

Note that ``vg`` elements have attributes indicating values for tense, aspect, voice, modality and negation and that head verbs and nouns are marked as ``headv=yes`` and ``headn=yes`` respectively. These attributes are extra features which are not normally output by a chunker but which are included in this one because it is relatively simple to augment the rules for these features. The effects of the different style and format options are described below. The chunk rules require POS tagged input but can be applied before or after lemmatisation. The ``chunk`` component would typically be applied after the ``nertag`` component since the rules have been designed to utilise the output of ``nertag``; however, the rules do not require ``nertag`` output and the chunker can be used directly after POS tagging.

The ``chunk`` script
--------------------

Since ``chunk`` is called with arguments, the early part of the script is more complex than scripts with no arguments. The ``while`` and ``if`` constructs set up the ``-s`` and ``-f`` options so that style and format parameters can be provided when the component is called. For example, the ``run`` script calls the ``chunk`` component in this way:

::

   $here/scripts/chunk -s nested -f inline

In the early part of the script the ``$lib`` variable is defined to point to ``TTT2/lib/chunk/`` which is the location of the resource files used by the ``chunk`` pipeline. The remainder of the script contains the sequence of processing steps piped together that constitute the basic ``chunk`` pipeline as well as conditional processing steps which format the output depending on the choice of values supplied to the ``-s`` and ``-f`` parameters.

The ``chunk`` pipeline
----------------------

::

   1. lxtransduce -q s $lib/verbg.gr |
   2. lxreplace -q "vg[w[@neg='yes']]" -t "&attrs;&children;" |
   3. lxtransduce -q s $lib/noung.gr |
   4. lxtransduce -q s -l lex=$lib/other.lex $lib/otherg.gr |
   5. lxreplace -q "phr|@c" > $tmp-chunked

**Step 1:** ``lxtransduce -q s $lib/verbg.gr``

The first step applies a grammar to recognise verb groups. The verb groups are wrapped as ``vg`` elements and various values for attributes encoding tense, aspect, voice, modality, negation and the head verb are computed. For example, the verb group from the previous example is output from this step as follows:

::

   has n't turned out

The ``vg`` element contains the attributes ``tense``, ``asp``, ``voice`` and ``modal`` while the ``headv`` attribute occurs on the head verb and a ``neg`` attribute occurs on any negative words in the verb group.

**Step 2:** ``lxreplace -q "vg[w[@neg='yes']]" -t "&attrs;&children;"``

In the second step, information about negation is propagated from a negative word inside a verb group to the enclosing ``vg`` element. Thus the previous example now looks like this:

::

   has n't turned out

**Step 3:** ``lxtransduce -q s $lib/noung.gr``

In this step the noun group grammar is applied. Noun groups are wrapped as ``ng`` elements and the head noun is marked with the attribute ``headn=yes`` — see for example the two noun groups in the current example in Section :ref:`gt-chunkintro`.

In the case of compounds, all the nouns in the compound are marked with the ``headn`` attribute:

::

   A snow storm

In the case of coordination, the grammar treats conjuncts as separate noun groups if possible:

::

   green eggs and blue ham

but where a noun group seems to contain a coordinated head then there is one noun group and all head nouns as well as conjunctions are marked as ``headn=yes``:

::

   green eggs and ham

In this particular case, there is a genuine ambiguity as to the scope of the adjective *green*, depending on whether it is just the eggs that are green or both the eggs and the ham that are green. The output of the grammar does not represent ambiguity and a single analysis will be output which will sometimes be right and sometimes wrong. The output above gives *green* scope over both nouns and therefore gives the second reading. This is appropriate for this case but would probably be considered wrong for *red wine and cheese*.

The noun group grammar rules allow for the possibility that the text has first been processed by the ``nertag`` component by defining ``enamex``, ``numex`` and ``timex`` elements as possible sub-parts of noun groups. This means that the output of the noun group grammar may differ depending on whether ``nertag`` has been applied or not. For example, the ``nertag`` component identifies *the Office for National Statistics* as an ``enamex`` element and this is then treated by the noun group grammar as a single ``ng``:

::

   the Office for National Statistics

When ``nertag`` isn’t first applied, the chunker outputs the example as a sequence of noun group, preposition group, noun group:

::

   the Office for National Statistics

**Step 4:** ``lxtransduce -q s -l lex=$lib/other.lex $lib/otherg.gr``

The fourth step uses the grammar ``otherg.gr`` to identify all other types of phrases. The lexicon it consults is a small list of multi-word prepositions such as *in addition to*. The grammar identifies preposition groups (````), adjective groups (````), adverb groups (````) and sbar groups (````), so the output for *And obviously, over time, it seems that things get better.* is this (``w`` mark-up suppressed):

::

   And obviously, over time, it seems that things get better.

The only words which are not part of a chunk are punctuation marks and occasional function words such as the *And* in this example. The heads of the chunks identified by ``otherg.gr`` are not marked as such, though it would be fairly simple to do so if necessary.

**Step 5:** ``lxreplace -q "phr|@c" > $tmp-chunked``

The fifth step is the final part of the chunking part of the ``chunk`` pipeline. This step uses ``lxreplace`` to discard mark-up which is no longer needed: ``phr`` elements were added by the ``nertag`` component and are used by the chunk rules but can be removed at this point. The ``c`` attribute on words is also no longer needed. The output at this stage is written to a temporary file, ``$tmp-chunked``, which is used as the input to the next steps in the pipeline which format the chunk output depending on the choices made with the ``-s`` and ``-f`` parameters.

**Final steps: style and format**

Through the ``-s`` parameter, the user can require the chunker output to conform to a particular style. The possible options for this parameter are ``conll``, ``flat``, ``nested`` or ``none``. As described in Grover and Tobin (2006) [9]_, different people may make different assumptions about how to mark up more complex chunks and there is a difference between our assumptions and those behind the mark-up of the CoNLL chunk data. To make it easier to compare with CoNLL-style chunkers, the grammars in the previous steps of the pipeline create an initial chunk mark-up which can be mapped to the CoNLL style or to some other style. The ``none`` option for ``-s`` causes this initial mark-up to be output. If the example *Edinburgh University’s chunker output can be made to vary* is first processed with the ``nertag`` component so that *Edinburgh University* is marked up as an ``enamex`` and is then processed by the following two steps:

::

   $here/scripts/chunk -s none -f inline | lxreplace -q w

then the output is as follows:

::

   Edinburgh University 's chunker output can be made to vary

The example contains a possessive noun phrase and a verb with an infinitival complement, which cause the main points of difference in style. The ``cng`` and ``cvg`` elements have been created as temporary mark-up which can be modified in different ways to create different styles. CoNLL style is created through the following ``lxreplace`` steps:

::

   lxreplace -q cvg -t "&children;" |
   lxreplace -q "vg/vg" |
   lxreplace -q "ng[cng]" -t "&children;" |
   lxreplace -q "cng" -t "&children;" |
   lxreplace -q "ng[ng]" -t "&children;" |
   lxreplace -q "numex|timex|enamex"

Here the embedded ``ng`` and the ``cng`` are output as ``ng`` elements while the embedded ``vg`` elements are discarded and the ``cvg`` is mapped to a ``vg``.
Mark-up created by ``nertag`` (``numex``, ``timex`` and ``enamex`` elements) is also discarded:

::

   Edinburgh University 's chunker output can be made to vary

An alternative non-hierarchical style is created using the ``-s flat`` option which causes the following ``lxreplace`` steps to be taken:

::

   lxreplace -q cvg |
   lxreplace -q "cng|ng/ng" |
   lxreplace -q "numex|timex|enamex"

Here the ``cvg`` is removed and the embedded ``vg`` elements are retained while embedded mark-up in ``ng`` elements is removed and ``nertag`` mark-up is also removed:

::

   Edinburgh University's chunker output can be made to vary

The ``nested`` style is provided for users who prefer to retain a hierarchical structure and is achieved through the following ``lxreplace`` steps:

::

   lxreplace -q "cng" |
   lxreplace -q "cvg" -n "'vg'"

The output of this style is as follows:

::

   Edinburgh University 's chunker output can be made to vary

So far all the examples have used the ``-f inline`` option; however, two other options are provided, ``bio`` and ``standoff``. The ``bio`` option converts chunk element mark-up to attribute mark-up on ``w`` elements using the CoNLL BIO convention where the first word in a chunk is marked as beginning that chunk (e.g. ``B-NP`` for the first word of a noun group), other words in a chunk are marked as in that chunk (e.g. ``I-NP`` for non-initial words in a noun group) and words outside a chunk are marked as ``O``. These labels appear as values of the attribute ``group`` on ``w`` elements and the chunk element mark-up is removed. This conversion is done using ``lxt`` with the stylesheet ``TTT2/lib/chunk/tag2attr.xsl``. If the previous example is put through ``$here/scripts/chunk -s flat -f bio``, the output is this (irrelevant attributes suppressed):

::

   Edinburgh University 's chunker output can be made to vary .

Chunk-related attributes on words are retained (e.g. ``headn`` and ``headv``) but attributes on ``vg`` elements have been lost and would need to be mapped to attributes on head verbs if it was felt necessary to keep them. Note that BIO format is incompatible with hierarchical styles and an attempt to use it with the ``nested`` or ``none`` styles will cause an error. If the ``bio`` format option is chosen the output can then be passed on for further formatting, for example to create non-XML output. The stylesheet ``TTT2/lib/chunk/biocols.xsl`` has been included as an example and will produce the following column format:

::

   Edinburgh   NNP  B-NP
   University  NNP  I-NP
   's          POS  I-NP
   chunker     NN   I-NP
   output      NN   I-NP
   can         MD   B-VP
   be          VB   I-VP
   made        VBN  I-VP
   to          TO   B-VP
   vary        VB   I-VP
   .           .    O

The standoff format is included to demonstrate how NLP component mark-up can be encoded as standoff mark-up. If the previous example is put through ``$here/scripts/chunk -s flat -f standoff``, the output is this:

::

   Edinburgh University 's chunker output can be made to vary .

   Edinburgh University's chunker output can be made to vary

Using ``lxt`` with the stylesheet ``TTT2/lib/chunk/standoff.xsl``, the chunk mark-up is removed from its inline position and a new element is created as the last element inside the ```` element. This contains ``ng``, ``vg`` etc. elements. The text content of the elements in this standoff element is a copy of the string that they wrapped when they were inline. The relationship between the ``w`` elements in the text and the chunk elements in the standoff element is maintained through the use of the ``sw`` and ``ew`` attributes whose values are the ``id`` values of the start and end words of the chunk. If the ``nested`` style option is chosen then all levels of ``nertag`` and ``chunk`` mark-up are put in the standoff element:

::

   Edinburgh University's chunker output Edinburgh University Edinburgh University can be made to vary can be made to vary

.. _gt-visualise:

Visualising output
==================

XML documents with many layers of annotation are often hard to read. In this section we describe ways in which the mark-up from the pipelines can be viewed more easily. Often, simple command line instructions can be useful. For example, the output of ``run`` can be piped through a sequence of LT-XML2 programs to allow the mark-up you are interested in to be more visible:

::

   echo 'Mr. Joe L. Bedford (www.jbedford.org) is President of JB Industries Inc. Bedford opened an office in Paris, France in September 2007.' | ./run | lxreplace -q w | lxgrep "s/*"

This command processes the input with the ``run`` script and then removes the word mark-up and pulls out the chunks (immediate daughters of ``s``) so that they each appear on a line:

::

   Mr. Joe L. Bedford www.jbedford.org is President of JB Industries Inc Bedford opened an office in Paris France in September 2007

Another approach to visualising output is to convert it to HTML for viewing in a browser. In ``TTT2/lib/visualise`` we provide three style sheets, one to display ``nertag`` mark-up (``htmlner.xsl``), one to display ``chunk`` mark-up (``htmlchunk.xsl``) and one to display both (``htmlnerandchunk.xsl``). The following command:

::

   echo 'Mr. Joe L. Bedford (www.jbedford.org) is President of JB Industries Inc. Bedford opened an office in Paris, France in September 2007.' | ./run | lxt -s ../lib/visualise/htmlnerandchunk.xsl > visualise.html

creates an HTML file, ``visualise.html``, which when viewed in a browser looks like this:

.. _gt-outFig:

.. figure:: images/output.png
   :width: 90%
   :align: center
   :alt: output example

   Visualisation of ``nertag`` and ``chunk`` mark-up

.. rubric:: Footnotes

.. [1] http://www.gutenberg.org/etext/3203

.. [2] Curran, J. R. and S. Clark (2003). Investigating GIS and smoothing for maximum entropy taggers. In *Proceedings of the 11th Meeting of the European Chapter of the Association for Computational Linguistics (EACL-03)*, pp. 91–98.

.. [3] Marcus, M. P., B. Santorini, and M. A. Marcinkiewicz (1993). Building a large annotated corpus of English: the Penn Treebank. *Computational Linguistics 19(2)*.

.. [4] Minnen, G., J. Carroll, and D. Pearce (2000). Robust, applied morphological generation. In *Proceedings of INLG*.

.. [5] The SPECIALIST lexicon is Open Source and is freely available subject to certain terms and conditions which are reproduced in the LT-TTT2 distribution as ``TTT2/lib/lemmatise/SpecialistLexicon-terms.txt``.

.. [6] Chinchor, N. A. (1998). *Proceedings of the Seventh Message Understanding Conference (MUC-7)*.

.. [7] Mikheev, A., C. Grover, and M. Moens (1998). Description of the LTG system used for MUC-7. In *Seventh Message Understanding Conference (MUC-7)*.

.. [8] This list is available for download and local use within the limits of the ADL copyright statement, which is reproduced in the LT-TTT2 distribution as ``TTT2/lib/nertag/ADL-copyright-statement.txt``.

.. [9] Grover, C. and R. Tobin (2006). Rule-based chunking and reusability. In *Proceedings of LREC 2006*, Genoa, Italy, pp. 873–878.

.. [10] Tjong Kim Sang, E. F. and S. Buchholz (2000). Introduction to the CoNLL-2000 shared task: Chunking. In *Proceedings of the Conference on Natural Language Learning (CoNLL-2000)*.