.. _examples:

##################
Practical Examples
##################

This chapter describes some examples of using the geoparser with text from
different domains, such as modern text, historical documents and classical
works in English translation. Using the command line parameters you can
switch different lexicons and gazetteers on or off, to suit different
kinds of text. Each of the examples below looks at different aspects of
the output produced.

The examples here are for the domains we have tested, and the relevant
files are included in the distribution (in the ``in`` directory) so you
can run the examples as described below. These are real texts we have
worked with, not prepared examples, and the output will contain errors -
of precision or recall over the entities, or through mis-identification of
locations. The publications in :ref:`appPubs` discuss the performance you
can expect in various domains.

If your domain fits one of these categories you should be able to use the
geoparser without adaptation, by simply specifying the ``-t type`` and
``-g gazetteer`` parameters appropriately; see the descriptions of the
``-t`` and ``-g`` parameters in the Quick Start Guide for the available
options. For a discussion of the issues involved in customising the
geoparser for a new domain, see "Adapting the Edinburgh Geoparser for
Historical Geo-referencing" in :ref:`appPubs`.

.. _ex-modern:

Modern text
===========

**Plain text: "burtons.txt"**

We start with a simple example using the file "burtons.txt", without
creating any visualisation files, and writing to stdout. Here the command
is being run from the geoparser root directory, but it could be run from
anywhere, with appropriately specified paths::

  cat in/burtons.txt | scripts/run -t plain -g unlock

The following command, using input redirection instead of a pipe, is of
course completely equivalent::

  scripts/run -t plain -g unlock < in/burtons.txt

This run uses Edina's Unlock gazetteer, which is mainly UK oriented. The
input file starts like this::

  How the home of Mini Rolls and Smash was gobbled up

  Food factory workers facing the the sack will march on Saturday for an
  economy that values more than just money

  Among the thousands of people who join the big anti-cuts march this
  Saturday will be a coach load from Wirral.
  ...

.. _ex-output:

The output starts like this (attribute detail is elided here)::

  <document ...>
   <meta>
    <attr name="docdate" ...>20210525</attr>
    ...
   </meta>
   <text>
    <p>
     <s ...>
      <w id="..." ...>How</w>
      <w id="..." ...>the</w>
      <w id="..." ...>home</w>
      ...
      <w id="..." ...>gobbled</w>
      <w id="..." ...>up</w>
     </s>
    </p>
    ...

The output is xml with paragraphs and sentences marked and individual
tokens in ``<w>`` elements, with various linguistic attributes added. The
unique "id" attribute is based on character position in the input text.
Some meta data has been added, including a "docdate" which defaults to the
current date, as no ``-d docdate`` parameter was specified. Placename
mentions found in the text will have a "locname" attribute on the ``<w>``
element, but this is part of the intermediate processing, and the final
Named Entity markup is specified using standoff xml as described below.

The ``<text>`` element is followed by a ``<standoff>`` section. The
following sample shows the structure (again with attribute detail
elided)::

  <standoff>
   <ents source="ner-rb">
    <ent id="rb1" type="date" ...>
     <parts>
      <part sw="..." ew="...">Saturday</part>
     </parts>
    </ent>
    ...
   </ents>
   <ents ...>
    <ent id="ev1" ...>
     <parts>
      <part sw="..." ew="...">gobbled</part>
     </parts>
    </ent>
    ...
   </ents>
   <relations>
    ...
   </relations>
  </standoff>

There are two sets of ``<ents>`` elements because the pipeline uses two
separate steps. The first is a rule-based process ("ner-rb") to identify
and classify the entity mentions - the above example shows a *date*
entity. The entity categories detected are: **date**, **location**,
**person**, and **organisation**. The entities are tied back to their
positions in the text by the ``<part>`` element, which has "sw" (start
word) and "ew" (end word) attributes whose values match the "id"s on the
``<w>`` elements in the text. The entities in the second set are mainly
verbs and verb phrases, tagged as a basis for detecting events mentioned
in the text. The ``<relations>`` section relates pairs of ``<ent>``
elements, identified by a "ref" attribute that points to *event* entities
(such as "ev1"), to *rule-based* ones (*eg* "rb1"), or to the *docdate*.

From a purely geoparsing point of view, only the *rule-based* "location"
entities may be required, which look like this::

  <ent id="..." type="location" ...>
   <parts>
    <part sw="..." ew="...">Wirral</part>
   </parts>
  </ent>

These can easily be extracted if desired. (For example one could extract
these with **lxgrep**, or remove other unwanted nodes with **lxreplace**,
both of which are included in the LT-XML2 toolkit; a sketch using
**lxgrep** is shown in the next example.) Tools to create the rest of the
markup have been added to the pipeline at various times for different
projects, and the full output is included.

.. _ex-172:

**News text with known date: "172172"**

With news text the date of the story is often known, and can be specified
to the geoparser to help with event detection. The next example also
specifies the ``-o outdir prefix`` option so that a full set of
visualisation files will be produced in addition to the main output
described above (which will be in a file named "outdir/prefix.out.xml")::

  cat in/172172.txt | scripts/run -t plain -g geonames -d 2010-08-13 -o out 172172

In this case we have directed output to the pipeline's ``out`` directory,
but it can be sent anywhere using a relative or absolute path. The online
Geonames gazetteer has been chosen, as the text doesn't relate to the UK.
It begins::

  Nadal and Murray set up semi showdown

  (CNN) -- Rafael Nadal and Andy Murray are both through to the semifinals
  of the Rogers Cup in Toronto, where they will face each other for a
  place in Sunday's final.

  Murray played some superb tennis in crushing the in-form David
  Nalbandian but Nadal had to recover from dropping the opening set to get
  past Germany's Philipp Kohlschreiber.

  Nalbandian won the ATP title in Washington last weekend and came into
  Friday's encounter on an 11-match unbeaten streak.
  ...
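As with the previous example, if all you need are the location entities,
they can be extracted from the main output file (here
"out/172172.out.xml") with **lxgrep**, as mentioned earlier. The following
is a minimal sketch only - the exact query syntax is an assumption here,
so check the lxgrep documentation in your LT-XML2 installation::

  # a sketch: keep only the location <ent> elements from the output
  cat out/172172.out.xml | lxgrep "ent[@type='location']"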
Specifying the ``-o`` option means that, instead of just the tagged text
file, we get a collection of output files:

* 172172.display.html
* 172172.events.xml
* 172172.gaz.xml
* 172172.gazlist.html
* 172172.gazmap.html
* 172172.geotagged.html
* 172172.nertagged.xml
* 172172.out.xml
* 172172.timeline.html

The output files are described in the Quick Start Guide, :ref:`qs-out`. We
looked at the format of the "172172.out.xml" file above. The other main
file is "172172.display.html", which looks as shown in Figure
:ref:`ex-172displayFig`. The map window uses OpenStreetMap maps to display
the placename locations, with green markers for the top-ranked candidate
for each place and red markers for the other candidates. The bottom left
panel shows the input text, with placenames highlighted, and the bottom
right panel lists the placenames with all the candidate locations found
for each. The first in the list is the chosen one, in green. You can see
from the length of the horizontal scroll bar that there are typically a
great many other candidates - this is especially true when using Geonames,
as common placenames like the ones in this file are repeated many times
all over the world. The display is centred on the first placename mention,
"Toronto", and can be re-centred by selecting other lat/long positions
from the placename list.

.. _ex-172displayFig:

.. figure:: images/172172display-updated.png
   :width: 100%
   :align: center
   :alt: geoparser display file

   Geoparser display file for news text input

This was quite a short file to try to detect events in, but those found
are listed in "172172.events.xml", which is used to produce the Timeline
display shown in Figure :ref:`ex-timelineFig`. Note that the Timeline
display only works when served from a (local) web server; we have tested
it on Firefox, Chrome and Safari. With Python installed, open a terminal,
go to the directory containing the geoparser output files and type one of
the following commands.

For Python 2::

  python -m SimpleHTTPServer

For Python 3::

  python -m http.server

Then open your browser and point it to::

  http://localhost:8000

Click on the ``172172.timeline.html`` file and the display will open as
shown in Figure :ref:`ex-timelineFig`.

.. _ex-timelineFig:

.. figure:: images/172172timeline-updated.png
   :width: 100%
   :align: center
   :alt: timeline file

   Timeline file

This display shows other entity categories besides the locations, which
are in green. Personal names are in red, organisations in blue and time
expressions in yellow. The pipeline detected 5 events in this input but
was only able to assign specific dates to two of them, which are the two
plotted on the timeline. The other events included references to "this
season" and "this year", which couldn't be placed on the timeline. In the
screen shot, an "info" box has been brought up by clicking on one of the
events. It shows the text of the event and its timestamp.

.. _ex-deep:

Historical documents (relating to England)
==========================================

We now take a more complex example, using some historical text. The input
file is "cheshirepleas.txt", which starts thus::

  On Saturday (fn. 2) next after the feast of St. Edward the King in the
  33rd year of the reign of King Edward [1305] Robert le Grouynour did his
  homage and fealty for all the tenements of Lostoke, and acknowledged
  that he held the manor of Lostoke entirely from the manor of Weverham
  for homage and service and fealty [fo. 38d (275 d)] and suit at the
  court of Weverham every fortnight, and 17s. yearly to the manor of
  Weverham at the four terms, and two customary pigs, and four foot-men in
  time of war at Chester bridge over the Dee, when Weverham finds eight
  foot-men, and three when the manor of Weverham finds six, or two when
  Weverham finds four men, with the ward and relief of Lostok for all
  service.
  ...

The appropriate gazetteer for this text is DEEP, a specialist gazetteer of
historical placenames in England (see footnote [1] in the Quick Start
Guide for details). If we know that the text is about Cheshire we can
restrict the gazetteer to that county. The text deals with dates in the
14th century - in fact over several different years, despite the rather
specific sound of "On Saturday next" - so whilst a ``docdate`` parameter
may not be appropriate, we can limit the DEEP candidates to ones attested
for the medieval period, using a date range (say for the 12th to 14th
centuries). The ``run`` command we will use is::

  cat in/cheshirepleas.txt | scripts/run -t plain -g deep -c Cheshire -r 1100 1400 -o out chespleas

In this case we have specified that a full set of output files should be
produced, in the usual ``out`` directory and prefixed with the string
"chespleas". Figure :ref:`ex-chesdisplayFig` shows the display file
created.

.. _ex-chesdisplayFig:

.. figure:: images/chespleasdisplay-updated.png
   :width: 100%
   :align: center
   :alt: geoparser display file

   Display file for Cheshire input and DEEP gazetteer

As expected, the chosen locations are clustered together in Cheshire, the
single outlier being a reference to "Wales".

Note that in the previous version of the Geoparser the text window used to
contain links back to the source gazetteer material at placenames.org.uk.
As this site no longer exists, linking is no longer possible. The MADS
format data that placenames.org.uk used to serve can be downloaded from
JISC:
https://digitisation.jiscinvolve.org/wp/2017/05/24/digitisation-of-english-placenames-deep-project-data.

Moreover, the results of the ``deep`` gazetteer look-up in the
``.gaz.xml`` output file (in this example, "chespleas.gaz.xml") contain
information not shown in the display, along these lines (attribute detail
elided)::

  <place name="Lostock" madsid="..." uriins="..." ...>
   ...
   <attestation date="1285" source="Eyre"/>
   ...
  </place>

The ``@madsid`` attribute refers to the record entry in the MADS data
available from JISC, and the attestation information gives an indication
of the date and source of the attestation. The url in the ``@uriins``
attribute does not work, but it does encode a pathway through the EPNS
data hierarchy that can be followed when browsing on the
http://epns.nottingham.ac.uk site. Thus this reading of "Lostock" refers
to a place in the county of Cheshire, in Macclesfield Hundred, in the
parish of Prestbury, in the settlement of Poynton with Worth. The place
identified is "Lostockhall Farm", and the EPNS page about it at
http://epns.nottingham.ac.uk/browse/Cheshire/Prestbury/53283645b47fc408560013ea-Lostockhall+Fm
can be reached by navigating the path just described. On this page one can
view both the historical forms and the etymology. For more detail about
the DEEP data, see the paper "A Gazetteer and Georeferencing for
Historical English Documents" in :ref:`appPubs`.

Classical texts
===============

As part of the GAP (Google Ancient Places) project, the geoparser was
adapted to deal with classical texts in English translation. This requires
different lexicons of places and personal names, and uses the Pleiades
gazetteer of ancient places. (See the Pleiades+ section for details of
Pleiades and Pleiades+.)
The geoparser output was post-processed by the GAP project to create the
GapVis display (versions 1 and 2 are currently available). This only
requires one location per toponym mention, so only the top-ranked
candidate was passed on. If you only require the "best" location (the
green markers in the displays above) then specify the ``-top`` option::

  cat in/herodotusBk1.txt | scripts/run -t plain -g plplus -o out hbk1plplus -top

The ``-top`` option can be used for any input and will result in only the
top candidate place being shown in the display file. An example of this is
illustrated in Figure :ref:`ex-herdisplayFig`. The text is the opening of
Book 1 of the *Histories* by Herodotus.

.. _ex-herdisplayFig:

.. figure:: images/herdisplay-updated.png
   :width: 100%
   :align: center
   :alt: geoparser display file

   Herodotus display file

In principle it might be possible to process input in the original Latin
or Greek (or indeed in any language), if suitable linguistic components
could be substituted in the geotagging stage of the pipeline. This is a
project for another day. The Hestia 2 project has taken steps along the
way towards allowing students to work with the Greek version of the
*Histories*.

Using pre-formatted input
=========================

The examples above all use plain text input files. If your input files
already contain markup, such as html or xml, you may wish to alter the
pre-processing steps of the pipeline to cater for it. Alternatively, it
may be simpler to strip the markup and treat your input as plain text. The
``type`` parameter accepts two specific formats, ``gb`` (Google Books html
files) and ``ltgxml`` (a simple xml style used by LTG, with paragraphs
pre-marked; see the sample file ``in/172172.xml`` for the format).

Google Books format
-------------------

Another spin-off from the GAP project was the need to process Google Books
input. GAP was a Google-funded project using bulk quantities of Google
Books material, specifically classical texts. The original scanning and
OCR work was done on a very large scale by Google and the quality can be
variable, to say the least. The data was made available as html files, and
we had the choice of either stripping all the markup - which would have
thrown away valuable information - or attempting to ingest the raw files.
The ``prepare`` stage at the start of the pipeline was amended to do just
enough pre-processing of the html to ensure that the many non-printable
characters contained in the OCR-ed input don't break the xml tools.
Because the files vary so much from book to book it was not possible to do
more detailed tailoring. If this were required, the ``prepare-gb`` script
might be a starting point. The following example uses another edition of
the Herodotus text, taking a single page from a Google Books edition as
input::

  cat in/gb_pg234.html | scripts/run -t gb -g plplus -top -o out gb_pg234

The output files are similar to those already shown and are available in
the ``out`` directory. The Open Library provides an alternative source of
scanned and OCR-ed texts, and experiments were also done with material
from this source. The text displays many of the same OCR errors but is
available as plain text (as well as other formats less useful for
processing) rather than html.

XML input
---------

The "172172.txt" input used :ref:`above <ex-172>` was originally generated
as xml, in the "ltgxml" format - the plain text version has the markup
stripped out. In the ltgxml format the docdate can be included if known;
the sample file ``in/172172.xml`` looks like this (some markup detail
elided)::
  <document>
   <meta>
    <attr name="docdate" ...>2010-08-13</attr>
   </meta>
   <text>
    <p>Nadal and Murray set up semi showdown</p>
    <p>(CNN) -- Rafael Nadal and Andy Murray are both through to the
    semifinals of the Rogers Cup in Toronto, where they will face each
    other for a place in Sunday's final.</p>
    <p>Murray played some superb tennis in crushing the in-form David
    Nalbandian but Nadal had to recover from dropping the opening set
    to get past Germany's Philipp Kohlschreiber.</p>
    <p>Nalbandian won the ATP title in Washington last weekend and came
    into Friday's encounter on an 11-match unbeaten streak.</p>
    ...
   </text>
  </document>
We would get identical output to that obtained above (give or take white
space) with this command::

  cat in/172172.xml | scripts/run -t ltgxml -g geonames -o out 172172

The ``-t type`` is changed to reflect the input and the ``-d docdate``
parameter is no longer required. Apart from the docdate specification, the
other substantive difference with using xml input is that paragraph
markers can be passed to the pipeline if you already have them.
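If your own input is plain text but you already know the paragraph
boundaries, a small script can wrap it in this skeleton before running the
pipeline with ``-t ltgxml``. The following is a minimal sketch only,
assuming blank-line-separated paragraphs and the element names shown
above; "story.txt" and "story.xml" are hypothetical file names, and you
should check ``in/172172.xml`` for the exact markup (including the docdate
element) that the pipeline expects::

  # a sketch: wrap blank-line-separated paragraphs in <p> elements,
  # escaping the xml special characters as we go
  awk 'BEGIN { RS=""; print "<document>\n<text>" }
       { gsub(/&/, "\\&amp;"); gsub(/</, "\\&lt;"); gsub(/>/, "\\&gt;");
         print "<p>" $0 "</p>" }
       END { print "</text>\n</document>" }' story.txt > story.xml

The resulting file can then be fed to the pipeline in the usual way, *eg*
``cat story.xml | scripts/run -t ltgxml -g geonames -o out story``.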