.. _georesolve:

*************
Georesolution
*************

The georesolution step takes the tagged text file as input and processes the location entities to give them spatial co-ordinates. The chosen gazetteer is queried to produce a list of candidate locations for each toponym and these are ranked, with the highest ranking one chosen to be shown as a green marker on the map display, or as the only marker if the ``-top`` option is used.

The tagged text file produced by the geotagging step contains further markup - for other entity categories besides location (person, organisation, time expressions) and for temporal events, which are expressed as binary relations between pairs of entities. Although obviously the geoparser's main business is with spatial entities, the temporal relations are processed at the end of the georesolution step, to produce a timeline display of events detected in the text.

The input file for this step is in a temporary file, labelled "tmp-temprel" in the flowcharts of the Overview chapter; see :ref:`ov-georesolutionFig`. The actual file will be in the ``/tmp`` directory, with a name that includes the username of the process in which the script was run and a unique string generated from the name of the script that's running and its process number, suffixed in this case with "temprel" to identify the content, *eg* "$USER-run-5648-temprel". These temporary files are removed when the pipeline exits unless the $LXDEBUG environment variable is set, in which case they are kept for examination.

The final output file - written to ``$outdir.out.xml`` if ``-o outdir`` is specified and to stdout otherwise - is described at :ref:`output file ` in the Practical Examples chapter, and there is an example file :download:`here ` (html documentation only).

The "tmp-temprel" file differs only in respect of the **location** entities.
In the unprocessed temprel file these look like this::

  Toronto

The georesolution step adds extra attributes to this element, from the Geonames gazetteer in this example::

  Toronto

.. _gr-gazxml:

This is the top-ranked candidate, http://www.geonames.org/6167865/toronto.html. The other candidates are listed in ``$outdir/gaz.xml`` - see example file :download:`here ` (html documentation only). In this example there were 20 candidates for Toronto, which is the maximum number the geoparser considers. The first five are shown below::

  ...
  ...

There is one ```` element for each distinct placename found in the input document - note, not for each individual mention. If a place is mentioned multiple times in a document the geoparser assumes the same place is being talked about each time. Clearly there are examples where this would be an erroneous assumption, *eg* in the text snippet: "Are we talking about London, England or London, Ontario?" There is in fact a special rule to catch containment expressed in this co-ordinated way, but nevertheless the current version of the geoparser will only pick a single location for London (the first one, in England).

The rest of the output files produced if ``-o`` is specified are for visualisation in a browser.

The rest of this chapter looks at each step of the georesolution process in a little more detail: firstly the collection of candidate places from the gazetteer, then the ranking process and finally the production of display files.

Gazetteer Lookup
================

The ``run`` script calls another, named ``geoground``, which carries out two tasks by calling further scripts. The first is gazetteer lookup, done by the ``geogaz`` script which calls a version of ``gazlookup`` tailored for the gazetteer and including the gazetteer name.
So for example, if ``-g geonames`` were specified to the ``run`` script then ``gazlookup-geonames`` would be used at this point, whereas if Pleiades+ were required then ``gazlookup-plplus`` would be invoked.

If you look in the ``scripts`` directory you will find a collection of these ``gazlookup`` scripts, most being completely separate routines, needed because the connection methods and queries to be used differ greatly between different gazetteers. The "Unlock" option is an exception as it has three variants - "unlock", "unlockgeonames" and "naturalearth" (see :ref:`-t and -g parameters `) - but these can be dealt with by parameterisation within a single script, ``gazlookup-unlock``. There are soft links to this script to cover the other two variants because, in order to make it straightforward to add new gazetteer options, the ``geogaz`` script looks for a script named ``gazlookup-$gaz``, where "$gaz" is the ``-g $gaz`` command line parameter. (The OS option differs slightly from the other Unlock gazlookups and is a separate script rather than a soft link.)

This means that to add a new gazetteer to the pipeline, all you need do is create a script named ``gazlookup-newgaz`` that handles the connection and querying appropriately, and returns a set of candidates formatted as required for the next stage; and then alter the ``run`` script to accept "newgaz" as a valid ``-g`` option. Of course, if the domain covered by the new gazetteer is completely new, then alterations to the geotagging stage would also be needed - as for example was the case when the Pleiades gazetteer of ancient places was added to cater for classical texts.

The input to the ``gazlookup-$gaz`` step is a list of the locations found in the input, extracted by an XSL stylesheet named ``extractlocs.xsl``. The list is formatted as shown in this example::

The output of the gazetteer lookup is a collection of up to 20 candidate ```` nodes for each ````.
The final step of the ``geogaz`` script is to sort and deduplicate - as explained above, the assumption is made that multiple references to the same toponym string within a single document are referring to the same place.

The output of this stage is in a temporary file suffixed "gazunres.xml", following the naming conventions described above. An example is :download:`here ` (html documentation only). It contains feature information extracted from the gazetteer for each candidate location, to be used by the ranking algorithm. The first few lines for our example are as follows::

  ...

This example makes clear the need for ranking over a reasonable number of candidates, at least for a gazetteer like Geonames with so many candidates for most placenames. For Toronto, the first four places returned were in Tanzania, Austria, Cuba and Colombia. We are up to numbers 13 and 14 before Canadian places appear in the list. For many places Geonames will return an extremely long list; the geoparser truncates the results at 20, which will almost always include the right one and makes the ranking process manageable in terms of processing time.

Ranking
=======

The ranking of the ```` candidates is done by the ``georesolve`` script. If the gazetteer supplies feature information the ranking makes use of it, for example preferring populated places (Geonames code "PPL") over natural features, and preferring larger to smaller places (based on population size).

Apart from the attributes of the candidate places, the ranking algorithm considers their locations compared pair-wise with each of the other places in the document. It will prefer places that cluster with other locations in the same document. For example, if most of the places mentioned in a text seem to be in Canada, a mention of "London" will probably be placed in Ontario rather than England.

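
The interplay between these factors can be illustrated with a toy sketch. It is not the algorithm used by the ``georesolve`` script: the function names, the dictionary keys, the weights and the approximate co-ordinates and populations are all assumptions made for illustration. It only shows how a preference for large populated places can be outweighed by proximity to the other places in a document.

.. code-block:: python

   import math

   def haversine_km(lat1, lon1, lat2, lon2):
       """Great-circle distance in kilometres between two points."""
       radius_km = 6371.0
       phi1, phi2 = math.radians(lat1), math.radians(lat2)
       dphi = phi2 - phi1
       dlam = math.radians(lon2 - lon1)
       a = (math.sin(dphi / 2) ** 2
            + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
       return 2 * radius_km * math.asin(math.sqrt(a))

   def toy_score(candidate, other_places):
       """Combine feature type, population and proximity into one number.

       The weights below are arbitrary, hypothetical choices.
       """
       score = 0.0
       # Prefer populated places (Geonames codes beginning "PPL").
       if candidate["type"].startswith("PPL"):
           score += 1.0
       # Prefer larger places, on a damped (logarithmic) scale.
       score += math.log10(candidate["pop"] + 1) / 10
       # Prefer candidates close to the other places in the document.
       if other_places:
           mean_km = sum(
               haversine_km(candidate["lat"], candidate["lon"], lat, lon)
               for lat, lon in other_places
           ) / len(other_places)
           score += 2.0 / (1.0 + mean_km / 1000.0)
       return score

   def rank(candidates, other_places):
       """Return the candidates sorted from best to worst toy score."""
       return sorted(candidates,
                     key=lambda c: toy_score(c, other_places),
                     reverse=True)

   # Approximate, illustrative values only.
   candidates = [
       {"name": "London, England", "type": "PPLC", "pop": 8900000,
        "lat": 51.51, "lon": -0.13},
       {"name": "London, Ontario", "type": "PPL", "pop": 380000,
        "lat": 42.98, "lon": -81.25},
   ]
   canadian_context = [(43.70, -79.42), (45.42, -75.70)]  # Toronto, Ottawa

   print(rank(candidates, [])[0]["name"])
   print(rank(candidates, canadian_context)[0]["name"])

With no other places to compare against, the larger city is ranked first; once the two Canadian locations are supplied as context, the Ontario candidate overtakes it.
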
If you know the geographical area that your input document deals with, you can specify either a locality circle or box using the ``-l`` or ``-lb`` command line options. These are explained in the Quick Start chapter, :ref:`qs-locality`. This is another factor that will be considered by the ranker, making it prefer locations in the area specified but still allowing the selection of places elsewhere that may be mentioned in the text. The "score" parameter can be used for weighting the degree of preference; if using this option it is probably best to experiment with different weights.

.. _gr-gazxml2:

The output of the ``georesolve`` ranking step is the ``$outdir/gaz.xml`` that was described :ref:`above `. It is a ranked list of ```` candidates for each ````. The candidates have the features from the gazetteer and the extra attributes added by the ranking algorithm, such as "clusteriness", which refers to how well the places mentioned form a spatial group. The raw scores are scaled and combined to produce an overall "score" attribute, which in turn determines the "rank" for each candidate ````. See the sample output :download:`here ` (html documentation only).

It is worth noting here that for various reasons including the clustering factor, the geoparser works better with short texts than very long ones. It was originally designed to handle large numbers of short text documents (roughly one page at a time) processed in a loop. If an attempt is made to process an entire book in one go, the ranking algorithm may be overloaded - pairwise comparisons of locations throughout the document may break it - and in any case the assumption about locality will probably be invalid. We advise that long texts are split into small parts, preferably into coherent chunks of narrative.

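
One simple way to follow this advice is to split a long text at paragraph boundaries before running the pipeline on each part. The sketch below is only an illustration and is not part of the geoparser: the function name and the 500-word default are hypothetical, and splitting on blank lines is a crude stand-in for finding coherent chunks of narrative.

.. code-block:: python

   def split_into_chunks(text, max_words=500):
       """Group blank-line-separated paragraphs into chunks.

       Each chunk holds at most ``max_words`` words, except that a single
       paragraph longer than the limit is kept whole as its own chunk.
       """
       chunks, current, count = [], [], 0
       for para in text.split("\n\n"):
           para = para.strip()
           if not para:
               continue
           n_words = len(para.split())
           # Start a new chunk if adding this paragraph would exceed the limit.
           if current and count + n_words > max_words:
               chunks.append("\n\n".join(current))
               current, count = [], 0
           current.append(para)
           count += n_words
       if current:
           chunks.append("\n\n".join(current))
       return chunks

Each chunk could then be written to its own file and passed through the pipeline in a loop, which is the mode of use the geoparser was originally designed for.
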
Formatting Output
=================

If the ``-o outdir`` option is not specified then the output of the pipeline is written to standard out (and can of course be redirected to a file), and consists of a single xml ```` as described at :ref:`output file ` in the Practical Examples chapter, with an example file :download:`here ` (html documentation only). The output is a tagged version of the input file, in standoff xml format, with the ```` node having ```` and ```` children (plus a metadata node). The placenames are tagged entities within the text, appearing as ```` nodes in the standoff section with pointers back to their position in the tokenised text. Only the top candidate for each place is included in this output, as a tagged entity, such as::

  Toronto

The ranking detail is removed and only the most important gazetteer features are retained: the latitude and longitude co-ordinates, and (for Geonames which supplies them) the country and feature type codes and population.

If the ``-o outdir`` option **is** specified then the georesolution component has several extra steps, which are simply reformatting of all the output generated so far, using XSL stylesheets to produce a collection of files for visualising the output. These steps are illustrated on the :ref:`ov-georesolutionFig`.

The "plainvis.xsl" stylesheet is used to format the input text as an html page with the toponyms highlighted; DEEP has a special version which adds links back to the source gazetteer. The ``gazmap`` script pulls this html page together with the xml list of candidate placename locations (in the ``$outdir/gaz.xml`` file described :ref:`earlier `) and adds a map display created by plotting the locations using Google Maps. The three components are combined in a single file named ``$outdir.display.html``.
Various examples are shown in the Practical Examples chapter, including :ref:`ex-172displayFig`, which has the maps panel at the top (green markers for top candidates, red for others), the tagged text on the left and the ``$outdir/gaz.xml`` list on the right.

If the ``-top`` option is specified then an additional set of files is created, with only the top candidate locations (green markers) retained. :ref:`ex-herdisplayFig` shows an example.

Finally, the ``timeline`` script takes the tagged file and produces a display highlighting all the entities found: names, organisations and time expressions as well as locations. It also extracts the events detected and, where these can be given a specific date, uses javascript to create a timeline visualisation using a `Simile widget `_. :ref:`ex-timelineFig` shows an example of the ``$outdir.timeline.html`` file. The events found are listed in ``$outdir.events.xml``, which is in the format required by the Timeline widget, as illustrated below::

  Nadal and Murray set up semi showdown
  (CNN) -- Rafael Nadal and Andy Murray are both through to the semifinals of the Rogers Cup in Toronto, where they will face each other for a place in Sunday's final.
  ...

The complete file for this example is :download:`here ` (html documentation only).

In summary, with the ``-o out`` option, the following files are created:

=========================  ==================================================
File                       Description
=========================  ==================================================
$out.out.xml               Main output: tagged and geogrounded text
$out.gaz.xml               Locations list
$out.gazlist.html          Locations list in html format
$out.gazmap.html           Locations plotted on Google Maps
$out.geotagged.html        Geotagged text as html file
$out.display.html          3-panel display: map + text + locations list
$out.gazlist-top.html      Top-ranked candidate list in html format
$out.gazmap-top.html       Top-ranked locations plotted on Google Maps
$out.display-top.html      3-panel display: map + text + top-locations list
$out.nertagged.xml         Output from NER stage
$out.events.xml            Events extracted in Timeline format
$out.timeline.html         Display page with all NEs and timeline
=========================  ==================================================

The three "\*-top\*" files are only produced if the ``-top`` option is used.
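
When the pipeline is run in a batch, it can be handy to check that every expected file was produced. The helper below is a hypothetical sketch and not part of the geoparser distribution; it simply builds the file names listed in the table above from the ``-o`` value and reports any that are absent. The function names are assumptions.

.. code-block:: python

   import os

   # Suffixes taken from the summary table above.
   BASE_SUFFIXES = [
       "out.xml", "gaz.xml", "gazlist.html", "gazmap.html",
       "geotagged.html", "display.html", "nertagged.xml",
       "events.xml", "timeline.html",
   ]
   TOP_SUFFIXES = ["gazlist-top.html", "gazmap-top.html", "display-top.html"]

   def expected_outputs(out, top=False):
       """List the file names expected for a given ``-o`` value."""
       suffixes = BASE_SUFFIXES + (TOP_SUFFIXES if top else [])
       return [f"{out}.{suffix}" for suffix in suffixes]

   def missing_outputs(out, top=False):
       """Return the expected files that do not exist on disk."""
       return [path for path in expected_outputs(out, top)
               if not os.path.exists(path)]

For example, ``missing_outputs("out", top=True)`` returns an empty list only if all twelve files named in the table are present in the current directory.
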