.. _examples:

##################
Practical Examples
##################

This chapter describes some examples of using the geoparser with text
from different domains, such as modern text, historical documents and
classical works in English translation. Using the command line
parameters you can switch different lexicons and gazetteers on or off,
to suit different kinds of text. Each of the examples below looks at
different aspects of the output produced.

The examples here are for the domains we have tested, and the relevant
files are included in the distribution (in the ``in`` directory) so
you can run the examples as described below. These are real texts we
have worked with, not prepared examples, and the output will contain
errors - of precision or recall over the entities, or through
mis-identification of locations. The publications in :ref:`appPubs`
discuss the performance you can expect in various domains.

If your domain fits one of these categories you should be able to use
the geoparser without adaptation, by simply specifying the ``-t type``
and ``-g gazetteer`` parameters appropriately. See :ref:`-t and -g
parameters <qs-tandgparms>` for the available options.

For a discussion of the issues involved in customising the geoparser
for a new domain, see "Adapting the Edinburgh Geoparser for Historical
Geo-referencing" in :ref:`appPubs`.

.. _ex-modern:

Modern text
===========

**Plain text: "burtons.txt"**

We start with a simple example using the file "burtons.txt", without
creating any visualisation files, and writing to stdout. Here the
command is being run from the geoparser root directory, but it could be
run from anywhere, with appropriately specified paths::

  cat in/burtons.txt | scripts/run -t plain -g unlock

The following command, using input redirection instead of a pipe, is
of course completely equivalent::

  scripts/run -t plain -g unlock < in/burtons.txt
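
Since no ``-o`` option is given, the annotated output is written to
stdout, so it can be captured with an ordinary shell redirection if
you want to keep it (the output filename here is just an
illustration)::

  scripts/run -t plain -g unlock < in/burtons.txt > burtons.out.xml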

This run uses Edina's Unlock gazetteer, which is mainly
UK-oriented. The input file starts like this::

  How the home of Mini Rolls and Smash was gobbled up

  Food factory workers facing the the sack will march on Saturday for an
  economy that values more than just money

  Among the thousands of people who join the big anti-cuts march this
  Saturday will be a coach load from Wirral. ...

.. _ex-output:

The output starts like this::

  <?xml version="1.0" encoding="UTF-8"?>
  <document version="3">
  <meta>
   <attr name="docdate" id="docdate" year="2021" month="05" date="25" sdate="2021-05-25" day-number="737934" day="Tuesday" wdaynum="2">20210525</attr>
   <attr name="tokeniser_version" date="20151216"/>
  </meta>
  <text>
   <p>
    <s id="s1">
     <w pws="yes" id="w13" p="WRB" group="B-ADVP">How</w>
     <w pws="yes" id="w17" p="DT" group="B-NP">the</w>
     <w l="home" pws="yes" id="w21" p="NN" headn="yes" group="I-NP">home</w>
     <w pws="yes" id="w26" p="IN" group="B-PP">of</w>
     <w common="true" l="minus" pws="yes" id="w29" p="NNP" event="true"
        headn="yes" group="B-NP">Mini</w>
     <w common="true" vstem="roll" l="roll" pws="yes" id="w34" p="NNP"
        event="true" headn="yes" group="I-NP">Rolls</w>
     <w pws="yes" id="w40" p="CC" headn="yes" group="I-NP">and</w>
     <w common="true" l="smash" pws="yes" id="w44" p="NNP" event="true"
        headn="yes" group="I-NP">Smash</w>
     <w l="be" pws="yes" id="w50" p="VBD" group="B-VP">was</w>
     <w l="gobble" pws="yes" id="w54" p="VBN" headv="yes" group="I-VP">gobbled</w>
     <w pws="yes" id="w62" p="RP" group="I-VP">up</w>
    </s>
   </p>
   ...

The output is xml with paragraphs and sentences marked and individual
tokens in ``<w>`` elements, with various linguistic attributes
added. The unique "id" attribute is based on character position in the
input text. Some metadata has been added, including a "docdate" which
defaults to the current date as no ``-d docdate`` parameter was
specified. Placename mentions found in the text will have a "locname"
attribute on the ``<w>`` element, but this is part of the intermediate
processing, and the final Named Entity markup is specified using
standoff xml as described below.
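
As the output puts each ``<w>`` token on its own line, a quick - and
admittedly rough - way to see which tokens were flagged as placename
mentions is to grep for the "locname" attribute::

  # rough check only: relies on the one-token-per-line layout shown above
  scripts/run -t plain -g unlock < in/burtons.txt | grep 'locname='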

The ``<text>`` element is followed by a ``<standoff>`` section. The
following sample shows the structure::

  <standoff>
   <ents source="ner-rb">
    <ent date="29" month="05" year="2021" sdate="2021-05-29"
          day-number="737938" id="rb1" wdaynum="6"
          day="Saturday" type="date">
     <parts>
      <part ew="w125" sw="w125">Saturday</part>
     </parts>
    </ent>
    ...
   </ents>
   <ents source="events">
    <ent tense="past" voice="pass" asp="simple" modal="no" id="ev1"
         subtype="gobble" type="event">
     <parts>
      <part ew="w54" sw="w54">gobbled</part>
     </parts>
    </ent>
    ...
   </ents>
   <relations source="temprel">
    <relation id="rbr1" type="beforeorincl" text="was gobbled up">
     <argument arg1="true" ref="ev1"/>
     <argument arg2="true" ref="docdate"/>
    </relation>
    ...
   </relations>
  </standoff>

There are two sets of ``<ents>`` elements because the pipeline uses
two separate steps. The first is a rule-based process ("ner-rb") to
identify and classify the entity mentions - the above example shows a
*date* entity. The entity categories detected are: **date**,
**location**, **person**, and **organisation**. The entities are tied
back to their positions in the text by the ``<part>`` element, which
has "sw" (start word) and "ew" (end word) attributes whose values
match the "id" attributes on the ``<w>`` elements in the text.
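
For example, to see the token that the *date* entity above points at,
you can look its "sw"/"ew" value up in the ``<text>`` section. A
sketch using **lxgrep** from LT-XML2, assuming your version accepts
XPath-style queries::

  # w125 is the "sw"/"ew" value of the date entity shown above
  scripts/run -t plain -g unlock < in/burtons.txt |
    lxgrep 'w[@id="w125"]'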

The second set of ``<ents>`` contains mainly verbs and verb phrases,
tagged as a basis for detecting events mentioned in the text. The
``<relations>`` section relates pairs of entities via "ref"
attributes, which may point to *event* ``<ent>`` elements (such as
"ev1"), to *rule-based* ones (*eg* "rb1"), or to the *docdate* as in
this example.

From a purely geoparsing point of view, you may only need the
*rule-based* "location" entities, which look like this::

  <ents source="ner-rb">
   <ent id="rb3" type="location" lat="53.37141177657975"
        long="-3.083005177123173" gazref="unlock:13315840"
        feat-type="civila">
    <parts>
     <part ew="w288" sw="w288">Wirral</part>
    </parts>
   </ent>
  </ents>

These can easily be extracted if desired - for example with
**lxgrep**, or by removing unwanted nodes with **lxreplace**, both of
which are included in the LT-XML2 toolkit. Tools to create the rest
of the markup have been added to the pipeline at various times for
different projects, and the full output is included.
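
As a sketch of such an extraction - again assuming the LT-XML2 tools
are on your PATH and that your **lxgrep** accepts XPath-style queries
and a ``-w`` wrapper-element option - the location entities could be
pulled out like this::

  # check lxgrep's usage message if these options differ in your version
  scripts/run -t plain -g unlock < in/burtons.txt |
    lxgrep -w locations 'ent[@type="location"]'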

.. _ex-172:

**News text with known date: "172172"**

With news text the date of the story is often known, and can be
specified to the geoparser to help with event detection. The next
example also specifies the ``-o outdir prefix`` option so that a full set
of visualisation files will be produced in addition to the main output
described above (which will be in a file named
"outdir/prefix.out.xml")::

  cat in/172172.txt |
    scripts/run -t plain -g geonames -d 2010-08-13 -o out 172172

In this case we have directed output to the pipeline's ``out``
directory but it can be sent anywhere using a relative or absolute
path. The online Geonames gazetteer has been chosen, as the text
doesn't relate to the UK. It begins::

  Nadal and Murray set up semi showdown
  (CNN) -- Rafael Nadal and Andy Murray are both through to the
  semifinals of the Rogers Cup in Toronto, where they will face each
  other for a place in Sunday's final.
  Murray played some superb tennis in crushing the in-form David
  Nalbandian but Nadal had to recover from dropping the opening set to
  get past Germany's Philipp Kohlschreiber.
  Nalbandian won the ATP title in Washington last weekend and came
  into Friday's encounter on an 11-match unbeaten streak. ...

Specifying the ``-o`` option means that, instead of just the tagged
text file, we get a collection of output files:

* 172172.display.html
* 172172.events.xml
* 172172.gaz.xml
* 172172.gazlist.html
* 172172.gazmap.html
* 172172.geotagged.html
* 172172.nertagged.xml
* 172172.out.xml
* 172172.timeline.html

The output files are described in the Quick Start Guide,
:ref:`qs-out`. We looked at the format of the "172172.out.xml" file
above. The other main file is "172172.display.html", which looks as
shown in Figure :ref:`ex-172displayFig`. The map window uses
OpenStreetMap to display the placename locations, with green markers
for the
top-ranked candidate for each place and red markers for the other
candidates. The bottom left panel shows the input text, with
placenames highlighted, and the bottom right panel lists the
placenames with all the candidate locations found for each. The first
in the list is the chosen one, in green. You can see from the length
of the horizontal scroll bar that there are typically a great many
other candidates - this is especially true when using Geonames, as
common placenames like the ones in this file are repeated many times
all over the world. The display is centred on the first placename
mention, "Toronto", and can be re-centred by selecting other lat/long
positions from the placename list.

.. _ex-172displayFig:

.. figure:: images/172172display-updated.png
   :width: 100%
   :align: center
   :alt: geoparser display file

   Geoparser display file for news text input

This was quite a short file in which to detect events, but those
found are listed in "172172.events.xml", which is used to produce the
Timeline display shown in Figure :ref:`ex-timelineFig`.

Note that the Timeline display needs to be served from a web server
(a local one is fine); we have tested it in Firefox, Chrome and
Safari. If you have Python installed, it provides a simple local
server: in your terminal, go to the directory containing the
geoparser output files and type the appropriate command below.

For Python 2::

  python -m SimpleHTTPServer

For Python 3::

  python -m http.server

Then open your browser and point it to::

  http://localhost:8000

Click on the ``172172.timeline.html`` file and the display will open
as shown in Figure :ref:`ex-timelineFig`.

.. _ex-timelineFig:

.. figure:: images/172172timeline-updated.png
   :width: 100%
   :align: center
   :alt: timeline file


   Timeline file

This display shows other entity categories besides the locations,
which are in green. Personal names are in red, organisations in blue
and time expressions in yellow. The pipeline detected five events in
this input but was only able to assign specific dates to two of them,
which are the two plotted on the timeline. The other events included
references to "this season" and "this year", which couldn't be placed
on the timeline. In the screenshot, an "info" box has been brought up
by clicking on one of the events; it shows the text of the event and
its timestamp.

.. _ex-deep:

Historical documents (relating to England)
==========================================

We now take a more complex example, using some historical text. The
input file is "cheshirepleas.txt", which starts thus::

  On Saturday (fn. 2) next after the feast of St. Edward the King in
  the 33rd year of the reign of King Edward [1305] Robert le Grouynour
  did his homage and fealty for all the tenements of Lostoke, and
  acknowledged that he held the manor of Lostoke entirely from the manor
  of Weverham for homage and service and fealty [fo. 38d (275 d)] and
  suit at the court of Weverham every fortnight, and 17s. yearly to the
  manor of Weverham at the four terms, and two customary pigs, and four
  foot-men in time of war at Chester bridge over the Dee, when Weverham
  finds eight foot-men, and three when the manor of Weverham finds six,
  or two when Weverham finds four men, with the ward and relief of
  Lostok for all service. ...

The appropriate gazetteer for this text is DEEP, a specialist
gazetteer of historical placenames in England (see :ref:`footnote [1]
<qs-ftnote>` in the Quick Start Guide for details). If we know
that the text is about Cheshire we can restrict the gazetteer to that
county. The text deals with dates in the 14th century - in fact with
several different years, despite the rather specific sound of "On
Saturday next" - so whilst a ``docdate`` parameter is not
appropriate, we can limit the DEEP candidates to ones attested for
the medieval period, using a date range (say the 12th to 14th
centuries). The ``run`` command we will use is::

  cat in/cheshirepleas.txt |
    scripts/run -t plain -g deep -c Cheshire -r 1100 1400 -o out chespleas

In this case we have specified that a full set of output files
should be produced, in the usual ``out`` directory and prefixed with
the string "chespleas". Figure :ref:`ex-chesdisplayFig` shows the
display file created.

.. _ex-chesdisplayFig:

.. figure:: images/chespleasdisplay-updated.png
   :width: 100%
   :align: center
   :alt: geoparser display file

   Display file for Cheshire input and DEEP gazetteer

As expected, the chosen locations are clustered together in Cheshire,
the single outlier being a reference to "Wales".

Note that in the previous version of the Geoparser the text window
contained links back to the source gazetteer material at
placenames.org.uk. As this site no longer exists, linking is no
longer possible. The MADS format data that placenames.org.uk used to
serve can be downloaded from JISC:
`https://digitisation.jiscinvolve.org/wp/2017/05/24/digitisation-of-english-placenames-deep-project-data
<https://digitisation.jiscinvolve.org/wp/2017/05/24/digitisation-of-english-placenames-deep-project-data/>`_.
Moreover, the results of the DEEP look-up in the ``.gaz.xml`` output
file (in this example, "chespleas.gaz.xml") contain information not
shown in the display::

 <placename id="rb14" name="Lostok">
  <place rank="1" long="-2.142703862" lat="53.34264479" deepcounty="Cheshire" madsid="epns-deep-44-c-mappedname-001816" uriins="http://epns.nottingham.ac.uk/England/Cheshire/Macclesfield%20Hundred/Prestbury Poynton%20with%20Worth/Lostockhall%20Farm" name="Lostok">
   <attestations>
    <attestation>
     <date begin="1285" end="1285" subtype="simple">1285</date>
     <source id="ch223" style="italic">Eyre</source>
    </attestation>
   </attestations>
  </place>
 </placename>

The ``@madsid`` attribute refers to the record entry in the MADS
data available from JISC, and the attestation information gives an
indication of the date and source of each attestation. The URL in the
``@uriins`` attribute does not work, but it does encode a pathway
through the EPNS data hierarchy that can be followed when browsing
the `http://epns.nottingham.ac.uk <http://epns.nottingham.ac.uk>`_
site. Thus this reading of "Lostok" refers to a place in the county
of Cheshire, in Macclesfield Hundred, in the parish of Prestbury, in
the settlement of Poynton with Worth. The place identified is
"Lostockhall Farm", and the EPNS page about it at
`http://epns.nottingham.ac.uk/browse/Cheshire/Prestbury/53283645b47fc408560013ea-Lostockhall+Fm
<http://epns.nottingham.ac.uk/browse/Cheshire/Prestbury/53283645b47fc408560013ea-Lostockhall+Fm>`_
can be reached by navigating the path just described. On this page
one can view both the historical forms and the etymology. For more
detail about the DEEP data, see the paper "A Gazetteer and
Georeferencing for Historical English Documents" in :ref:`appPubs`.
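
To skim this extra information in bulk rather than reading the xml by
eye, a small sketch along the same lines as before (again assuming an
XPath-capable **lxgrep**) pulls out just the top-ranked reading for
each placename::

  # assumes the run command above has written out/chespleas.gaz.xml
  lxgrep 'place[@rank="1"]' < out/chespleas.gaz.xml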

Classical texts
===============

As part of the `GAP <http://googleancientplaces.wordpress.com/>`_
(Google Ancient Places) project, the geoparser was adapted to deal
with classical texts in English translation. This requires different
lexicons of places and personal names and uses the `Pleiades
<http://pleiades.stoa.org/>`_ gazetteer of ancient places. (See the
:ref:`Pleiades+ <gaz-plplus>` section for details of Pleiades and
Pleiades+.)

The geoparser output was post-processed by the GAP project to create
the GapVis display (versions `1 <http://nrabinowitz.github.io/gapvis/>`_
and `2 <http://enridaga.github.io/gapvis/gap2/>`_ are currently
available). GapVis only requires one location per toponym mention,
so only the top-ranked candidate was passed on. If you only require
the
"best" location (the green markers in the displays above) then specify
the ``-top`` option::

  cat in/herodotusBk1.txt |
    scripts/run -t plain -g plplus -o out hbk1plplus -top

The ``-top`` option can be used for any input and will result in only the top
candidate place being shown in the display file. An example of this is
illustrated in Figure :ref:`ex-herdisplayFig`. The text is the opening
of Book 1 of the *Histories* by Herodotus.

.. _ex-herdisplayFig:

.. figure:: images/herdisplay-updated.png
   :width: 100%
   :align: center
   :alt: geoparser display file

   Herodotus display file

In principle it might be possible to process input in the original
Latin or Greek (or indeed in any language), if suitable linguistic
components could be substituted in the geotagging stage of the
pipeline. This is a project for another day. The `Hestia 2
<http://hestia.open.ac.uk/hestia-phase-2/>`_ project has taken steps
along the way towards allowing students to work with the Greek version
of the *Histories*.


Using pre-formatted input
=========================

The examples above all use plain text input files. If your input files
already contain markup, such as html or xml, you may wish to alter the
pre-processing steps of the pipeline to cater for it. Alternatively,
it may be simpler to strip the markup and treat your input as plain
text. The ``-t type`` parameter accepts two pre-formatted input
types: ``gb`` (Google Books html files) and ``ltgxml`` (a simple xml
style used by LTG that has paragraphs pre-marked; see the sample file
``in/172172.xml`` for the format).
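
If you choose to strip the markup instead, any html-to-text converter
will do. A sketch, using the Google Books sample file described in
the next section; **w3m** is an assumption here, and the output
prefix is just an illustration::

  # any html-to-text converter can stand in for w3m
  w3m -dump in/gb_pg234.html > gb_pg234.txt
  cat gb_pg234.txt | scripts/run -t plain -g plplus -top -o out gb_pg234_plain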

Google Books format
-------------------

Another spin-off from the GAP project was the need to process Google
Books input. GAP was a Google-funded project using bulk quantities of
Google Books material, specifically classical texts. The original
scanning and OCR work was done on a very large scale by Google and the
quality can be variable to say the least. The data was made available
as html files and we had the choice of either stripping all the
markup - which would have thrown away valuable information - or
attempting to ingest the raw files. The ``prepare`` stage at the start
of the pipeline was amended to do just enough pre-processing of the
html to ensure that the many non-printable characters contained in the
OCR-ed input don't break the xml tools. Because the files vary so
much from book to book, it was not possible to do more detailed
tailoring; if this is required, the ``prepare-gb`` script might be a
starting point.

The following example uses another edition of the Herodotus text,
taking a single page from a Google Books edition as input::

  cat in/gb_pg234.html | scripts/run -t gb -g plplus -top -o out gb_pg234

The output files are similar to those already shown and are available
in the ``out`` directory.

The `Open Library <https://openlibrary.org/>`_ provides an alternative
source of scanned and OCR-ed texts, and experiments were also done
with material from this source. The text displays many of the same OCR
errors but is available as plain text (as well as other formats less
useful for processing) rather than html.


XML input
---------

The "172172.txt" input used :ref:`above <ex-172>` was originally
generated as xml, in the "ltgxml" format - the plain text
version has the markup stripped out. In the ltgxml format the docdate
can be included if known::

  <?xml version="1.0" encoding="UTF-8"?>
  <document version="3">
  <meta>
   <attr name="docdate" id="docdate" year="2010" month="08" date="13"
        sdate="2010-08-13" day-number="733996" day="Friday" wdaynum="5"/>
  </meta>
  <text>
   <p>Nadal and Murray set up semi showdown</p>
   <p>(CNN) -- Rafael Nadal and Andy Murray are both through to the semifinals
      of the Rogers Cup in Toronto, where they will face each other for a
      place in Sunday's final.</p>
   <p>Murray played some superb tennis in crushing the in-form David Nalbandian
      but Nadal had to recover from dropping the opening set to get past
      Germany's Philipp Kohlschreiber.</p>
   <p>Nalbandian won the ATP title in Washington last weekend and came into
      Friday's encounter on an 11-match unbeaten streak.</p>
   ...

We would get identical output to that obtained above (give or take
white space) with this command::

    cat in/172172.xml | scripts/run -t ltgxml -g geonames -o out 172172

The ``-t type`` is changed to reflect the input and the ``-d docdate`` is
no longer required.

Apart from the docdate specification, the other substantive
difference when using xml input is that paragraph markers you already
have can be passed through to the pipeline.
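
If you want to prepare your own input in this format, a minimal
skeleton - following the sample above, with the elided closing tags
restored - looks like this; see the full file ``in/172172.xml`` for
the authoritative layout::

  <?xml version="1.0" encoding="UTF-8"?>
  <!-- skeleton only: see in/172172.xml for a complete example -->
  <document version="3">
  <meta>
   <attr name="docdate" id="docdate" year="2010" month="08" date="13"
        sdate="2010-08-13" day-number="733996" day="Friday" wdaynum="5"/>
  </meta>
  <text>
   <p>First paragraph of your text.</p>
   <p>Second paragraph, and so on.</p>
  </text>
  </document>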