Georesolution
The georesolution step takes the tagged text file as input and
processes the location entities to give them spatial co-ordinates. The
chosen gazetteer is queried to produce a list of candidate locations
for each toponym and these are ranked, with the highest ranking one
chosen to be shown as a green marker on the map display, or as the
only marker if the -top option is used.
The tagged text file produced by the geotagging step contains further markup - for other entity categories besides location (person, organisation, time expressions) and for temporal events, which are expressed as binary relations between pairs of entities. Although obviously the geoparser’s main business is with spatial entities, the temporal relations are processed at the end of the georesolution step, to produce a timeline display of events detected in the text.
The input file for this step is in a temporary file, labelled
“tmp-temprel” in the flowcharts of the Overview chapter; see
Georesolution flowchart. The actual file will be in the /tmp
directory, with a name that includes the username of the process in
which the script was run and a unique string generated from the name
of the script that’s running and its process number, suffixed in this
case with “temprel” to identify the content, eg
“$USER-run-5648-temprel”. These temporary files are removed when the
pipeline exits unless the $LXDEBUG environment variable is set, in
which case they are kept for examination.
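As an illustration of the naming convention just described, the following Python sketch builds such a path and checks the debug setting. The function names and parameters are hypothetical; the pipeline itself is a set of shell scripts, and how it actually derives the name is not shown in this chapter.

```python
import os

def temp_file_path(user, script, pid, suffix, tmp_dir="/tmp"):
    """Build a name shaped like /tmp/$USER-run-5648-temprel (illustrative only)."""
    return f"{tmp_dir}/{user}-{script}-{pid}-{suffix}"

def keep_temp_files(environ=None):
    """Assumption: 'set' means the LXDEBUG variable is present in the environment."""
    environ = os.environ if environ is None else environ
    return "LXDEBUG" in environ

print(temp_file_path("alice", "run", 5648, "temprel"))
```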
The final output file - written to $outdir.out.xml if -o outdir is
specified and to stdout otherwise - is described at output file in the
Practical Examples chapter, and there is an example file here (html
documentation only). The
“tmp-temprel” file differs only in respect of the location
entities. In the unprocessed temprel file these look like this:
<ent type="location" id="rb6">
<parts>
<part sw="w148" ew="w148">Toronto</part>
</parts>
</ent>
The georesolution step adds extra attributes to this element, from the Geonames gazetteer in this example:
<ent id="rb6" type="location" lat="43.7001138" long="-79.4163042"
in-country="CA" gazref="geonames:6167865" feat-type="ppl"
pop-size="4612191">
<parts>
<part ew="w148" sw="w148">Toronto</part>
</parts>
</ent>
This is the top-ranked candidate,
http://www.geonames.org/6167865/toronto.html. The other candidates
are listed in $outdir/gaz.xml - see example file here (html
documentation only). In this example there were 20 candidates for
Toronto, which is the maximum number the geoparser considers. The first
five are shown below:
<placenames>
<placename id="rb6" name="Toronto">
<place rank="1" score="1.762934636" scaled_type="0.8" scaled_pop=
"0.9327814568" scaled_contained_by="0" scaled_contains="0" scaled_near="0"
in-cc="CA" long="-79.4163" lat="43.70011" type="ppla" gazref=
"geonames:6167865" name="Toronto" pop="4612191" clusteriness="870.3494166"
scaled_clusteriness="0.03015317872" clusteriness_rank="9" locality="0"
distance-to-known="99999" scaled_known="0"/>
<place rank="2" score="1.363160631" scaled_type="0.4" scaled_pop=
"0.9327814568" scaled_contained_by="0" scaled_contains="0" scaled_near="0"
in-cc="CA" long="-79.66632" lat="43.60012" type="rgn" gazref=
"geonames:6167864" name="Toronto" pop="4612191" clusteriness="869.4440736"
scaled_clusteriness="0.03037917422" clusteriness_rank="8" locality="0"
distance-to-known="99999" scaled_known="0"/>
<place rank="3" score="1.162435057" scaled_type="0.2" scaled_pop=
"0.9327814568" scaled_contained_by="0" scaled_contains="0" scaled_near="0"
in-cc="CA" long="-79.61286" lat="43.68066" type="fac" gazref=
"geonames:6296338" name="Toronto Pearson International Airport"
pop="4612191" clusteriness="872.3540873" scaled_clusteriness=
"0.02965359988" clusteriness_rank="10" locality="0" distance-to-known=
"99999" scaled_known="0"/>
<place rank="4" score="0.6922152501" scaled_type="0.6" scaled_pop="0"
scaled_contained_by="0" scaled_contains="0" scaled_near="0" in-cc="US"
long="-92.52546" lat="38.00365" type="ppl" gazref="geonames:4411872"
name="Toronto" clusteriness="653.9875787" scaled_clusteriness=
"0.09221525012" clusteriness_rank="1" locality="0" distance-to-known=
"99999" scaled_known="0"/>
<place rank="5" score="0.6883702413" scaled_type="0.6" scaled_pop="0"
scaled_contained_by="0" scaled_contains="0" scaled_near="0" in-cc="US"
long="-89.62982" lat="39.71394" type="ppl" gazref="geonames:4251360"
name="Toronto" clusteriness="665.6708161" scaled_clusteriness=
"0.08837024133" clusteriness_rank="2" locality="0" distance-to-known=
"99999" scaled_known="0"/>
...
</placename>
...
</placenames>
There is one <placename> element for each distinct placename
found in the input document - note, not for each individual
mention. If a place is mentioned multiple times in a document the
geoparser assumes the same place is being talked about each
time. Clearly there are examples where this would be an erroneous
assumption, eg in the text snippet:
“Are we talking about London, England or London, Ontario?”
There is in fact a special rule to catch containment expressed in this co-ordinated way, but nevertheless the current version of the geoparser will only pick a single location for London (the first one, in England).
The rest of the output files produced if -o is specified are for
visualisation in a browser.
The rest of this chapter looks at each step of the georesolution process in a little more detail: firstly the collection of candidate places from the gazetteer, then the ranking process and finally the production of display files.
Gazetteer Lookup
The run script calls another, named geoground, which carries
out two tasks by calling further scripts. The first is gazetteer
lookup, done by the geogaz script which calls a version of
gazlookup tailored for the gazetteer and including the gazetteer
name. So for example, if -g geonames were specified to the run
script then gazlookup-geonames would be used at this point,
whereas if Pleiades+ were required then gazlookup-plplus would be
invoked.
If you look in the scripts directory you will find a collection of
these gazlookup scripts, most being completely separate routines,
needed because the connection methods and queries to be used differ
greatly between different gazetteers. The “Unlock” option is an
exception as it has three variants - “unlock”, “unlockgeonames” and
“naturalearth” (see -t and -g parameters) - but
these can be dealt with by parameterisation within a single script,
gazlookup-unlock. There are soft links to this script to cover the
other two variants because, in order to make it straightforward to add
new gazetteer options, the geogaz script looks for a script named
gazlookup-$gaz, where “$gaz” is the -g $gaz command line
parameter. (The OS option differs slightly from the other Unlock
gazlookups and is a separate script rather than a soft link.)
This means that to add a new gazetteer to the pipeline, all you need
do is create a script named gazlookup-newgaz that handles the
connection and querying appropriately, and returns a set of candidates
formatted as required for the next stage; and then alter the run
script to accept “newgaz” as a valid -g option. Of course, if the
domain covered by the new gazetteer is completely new, then
alterations to the geotagging stage would also be needed - as for
example was the case when the Pleiades gazetteer of ancient places was
added to cater for classical texts.
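The naming convention can be pictured with a short Python sketch. This is not the geogaz script itself (which is a shell script); the function name and the idea of returning None for a missing script are assumptions made for illustration.

```python
from pathlib import Path

def find_gazlookup(scripts_dir, gaz):
    """Return the path of gazlookup-<gaz> inside scripts_dir, or None if absent.

    A soft link counts as present, which is how a single script can serve
    several -g values.
    """
    candidate = Path(scripts_dir) / f"gazlookup-{gaz}"
    return candidate if candidate.exists() else None
```

With this convention, supporting a new -g value only requires that a file (or link) with the matching name exists in the scripts directory.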
The input to the gazlookup-$gaz step is a list of the locations found in the input, extracted by an XSL stylesheet named extractlocs.xsl. The list is formatted as shown in this example:
<?xml version="1.0" encoding="UTF-8"?>
<placenames>
<placename id="rb6" name="Toronto"/>
<placename id="rb11" name="Germany"/>
<placename id="rb14" name="Washington"/>
<placename id="rb22" name="Montreal"/>
<placename id="rb28" name="Wimbledon"/>
<placename id="rb32" name="France"/>
</placenames>
The output of the gazetteer lookup is a collection of up to 20
candidate <place> nodes for each <placename>. The final step
of the geogaz script is to sort and deduplicate - as explained
above, the assumption is made that multiple references to the same
toponym string within a single document are referring to the same
place.
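A minimal Python sketch of this final step is given below. It assumes the lookup results arrive as (name, candidates) pairs, one per mention, and that sorting is by toponym string; neither detail is stated in this chapter, and the function name is hypothetical.

```python
MAX_CANDIDATES = 20  # the limit stated in the text

def dedupe_and_truncate(lookups):
    """Merge per-mention lookup results into one entry per distinct toponym.

    lookups: list of (name, candidates) pairs in document order.
    Returns a dict ordered by name, with at most MAX_CANDIDATES per toponym.
    """
    merged = {}
    for name, candidates in lookups:
        # Keep the first result set seen for a toponym string; later
        # mentions are assumed to refer to the same place.
        merged.setdefault(name, candidates[:MAX_CANDIDATES])
    return dict(sorted(merged.items()))
```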
The output of this stage is in a temporary file suffixed
“gazunres.xml”, following the naming conventions described above. An
example is here (html documentation only). It contains feature
information extracted from the gazetteer
for each candidate location, to be used by the ranking algorithm. The
first few lines for our example are as follows:
<placenames>
<placename name="Toronto" id="rb6">
<place name="Toronto" gazref="geonames:149454" type="ppl"
lat="-4.9000000" long="38.1000000" in-cc="TZ" pop="0"/>
<place name="Toronto" gazref="geonames:2146222" type="ppl"
lat="-33.0000000" long="151.6000000" in-cc="AU" pop="0"/>
<place name="Toronto" gazref="geonames:3535110" type="ppl"
lat="22.7833300" long="-82.5000000" in-cc="CU" pop="0"/>
<place name="Toronto" gazref="geonames:3666869" type="ppl"
lat="8.4039600" long="-75.2790700" in-cc="CO" pop="0"/>
...
This example makes clear the need for ranking over a reasonable number of candidates, at least for a gazetteer like Geonames with so many candidates for most placenames. For Toronto, the first four places returned were in Tanzania, Australia, Cuba and Colombia. We are up to numbers 13 and 14 before Canadian places appear in the list. For many places Geonames will return an extremely long list; the geoparser truncates the results at 20, which will almost always include the right one and makes the ranking process manageable in terms of processing time.
Ranking
The ranking of the <place> candidates is done by the georesolve
script. If the gazetteer supplies feature information
the ranking makes use of it, for example preferring populated places
(Geonames code “PPL”) over natural features, and preferring larger to
smaller places (based on population size).
Apart from the attributes of the candidate places, the ranking algorithm considers their locations compared pair-wise with each of the other places in the document. It will prefer places that cluster with other locations in the same document. For example, if most of the places mentioned in a text seem to be in Canada, a mention of “London” will probably be placed in Ontario rather than England.
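The idea of preferring candidates that cluster with the other places in a document can be sketched in Python as follows. This is not the geoparser's actual clusteriness calculation, which is not specified here; it simply picks the candidate with the smallest mean great-circle distance to the other places, using approximate coordinates chosen for illustration.

```python
from math import asin, cos, radians, sin, sqrt

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, long) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (a[0], a[1], b[0], b[1]))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))

def mean_distance(candidate, others):
    """Average distance from one candidate to the other places in the document."""
    return sum(haversine_km(candidate, o) for o in others) / len(others)

# Approximate coordinates, for illustration only.
context = [(43.70, -79.42), (45.51, -73.59)]  # Toronto, Montreal
candidates = {
    "London, England": (51.51, -0.13),
    "London, Ontario": (42.98, -81.25),
}

# The candidate closest on average to the rest of the document wins.
best = min(candidates, key=lambda name: mean_distance(candidates[name], context))
print(best)
```

In a document dominated by Canadian places, the Ontario candidate comes out ahead under this simplified measure.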
If you know the geographical area that your input document deals with,
you can specify either a locality circle or box using the -l or -lb
command line options. These are explained in the Quick Start chapter,
Limiting geographical area: -l -lb. This is another factor that will be
considered by the ranker, making it prefer locations in the area
specified but still allowing the selection of places elsewhere that
may be mentioned in the text. The “score” parameter can be used for
weighting the degree of preference; if using this option it is
probably best to experiment with different weights.
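As a rough picture of how a locality box can act as a preference rather than a hard filter, consider the Python sketch below. The additive bonus, the function names and the order of the box tuple are all assumptions made for this example; the way the geoparser actually folds the locality weight into the ranking is not described in this chapter.

```python
def in_box(lat, long, west, north, east, south):
    """True if the point lies inside the box (box must not cross the antimeridian)."""
    return south <= lat <= north and west <= long <= east

def apply_locality(score, lat, long, box, weight):
    """Add a bonus to a candidate's score when it falls inside the locality box."""
    west, north, east, south = box
    return score + weight if in_box(lat, long, west, north, east, south) else score

# Hypothetical box roughly covering southern Ontario: (west, north, east, south).
box = (-84.0, 46.0, -76.0, 42.0)
print(apply_locality(0.5, 42.98, -81.25, box, 1.0))  # candidate inside the box
print(apply_locality(0.5, 51.51, -0.13, box, 1.0))   # candidate outside the box
```

Because the candidate outside the box keeps its original score, it can still be selected if its other features outweigh the bonus.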
The output of the georesolve ranking step is the $outdir/gaz.xml file
that was described above. It is a ranked list of <place> candidates for
each <placename>. The candidates have the features from the gazetteer
and the extra attributes added by the ranking algorithm, such as
“clusteriness”, referring to how well the places mentioned form a
spatial group. The raw scores are scaled and combined to produce an
overall “score” attribute, which in turn determines the “rank” for each
candidate <place>. See the sample output here (html documentation only).
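For the Toronto candidates in the gaz.xml sample shown earlier in this chapter, the “score” attribute matches the plain sum of the “scaled_” attributes to the printed precision. The Python sketch below checks this for the candidates ranked 1 and 4, with their attributes copied from that sample. This is an observation about the sample only; whether other weightings apply in other configurations is not stated here.

```python
import xml.etree.ElementTree as ET

# Two candidates copied from the gaz.xml sample (only score and scaled_* kept).
GAZ = """<placenames>
  <placename id="rb6" name="Toronto">
    <place rank="1" score="1.762934636" scaled_type="0.8" scaled_pop="0.9327814568"
           scaled_contained_by="0" scaled_contains="0" scaled_near="0"
           scaled_clusteriness="0.03015317872" scaled_known="0"/>
    <place rank="4" score="0.6922152501" scaled_type="0.6" scaled_pop="0"
           scaled_contained_by="0" scaled_contains="0" scaled_near="0"
           scaled_clusteriness="0.09221525012" scaled_known="0"/>
  </placename>
</placenames>"""

def scaled_sum(place):
    """Sum every attribute whose name starts with 'scaled_'."""
    return sum(float(v) for k, v in place.attrib.items() if k.startswith("scaled_"))

for place in ET.fromstring(GAZ).iter("place"):
    print(place.get("rank"), place.get("score"), round(scaled_sum(place), 9))
```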
It is worth noting here that for various reasons including the clustering factor, the geoparser works better with short texts than very long ones. It was originally designed to handle large numbers of short text documents (roughly one page at a time) processed in a loop. If an attempt is made to process an entire book in one go, the ranking algorithm may be overloaded - pairwise comparisons of locations throughout the document may break it - and in any case the assumption about locality will probably be invalid. We advise that long texts are split into small parts, preferably into coherent chunks of narrative.
Formatting Output
If the -o outdir option is not specified then the output of the
pipeline is written to standard out (and can of course be redirected
to a file), and consists of a single xml <document> as described
at output file in the Practical Examples chapter, with an example file
here (html documentation only). The output is a tagged version of the
input file, in standoff xml format, with the <document> node having
<text> and <standoff> children (plus a metadata node).
The placenames are tagged entities within the text, appearing as
<ent> nodes in the standoff section with pointers back to their
position in the tokenised text. Only the top candidate for each place
is included in this output, as a tagged entity, such as:
<ent id="rb6" type="location" lat="43.70011" long="-79.4163"
gazref="geonames:6167865" in-country="CA" feat-type="ppla"
pop-size="4612191">
<parts>
<part ew="w150" sw="w150">Toronto</part>
</parts>
</ent>
The ranking detail is removed and only the most important gazetteer features are retained: the latitude and longitude co-ordinates, and (for Geonames which supplies them) the country and feature type codes and population.
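A downstream program can pick the resolved locations out of this output with a few lines of Python. The sketch below is only an illustration: the sample wraps the Toronto entity shown above in a cut-down, made-up <document>, and it assumes that the toponym text is the content of the <part> children.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<document>
  <text/>
  <standoff>
    <ent id="rb6" type="location" lat="43.70011" long="-79.4163"
         gazref="geonames:6167865" in-country="CA" feat-type="ppla"
         pop-size="4612191">
      <parts><part ew="w150" sw="w150">Toronto</part></parts>
    </ent>
  </standoff>
</document>"""

def resolved_locations(doc_xml):
    """Yield (name, lat, long, gazref) for each location entity with coordinates."""
    for ent in ET.fromstring(doc_xml).iter("ent"):
        if ent.get("type") == "location" and ent.get("lat") and ent.get("long"):
            name = " ".join((p.text or "").strip() for p in ent.iter("part"))
            yield name, float(ent.get("lat")), float(ent.get("long")), ent.get("gazref")

for row in resolved_locations(SAMPLE):
    print(row)
```

Entities that could not be resolved have no lat and long attributes, so the sketch skips them.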
If the -o outdir option is specified then the georesolution
component has several extra steps, which are simply reformatting of
all the output generated so far, using XSL stylesheets to produce a
collection of files for visualising the output. These steps are
illustrated on the Georesolution flowchart.
The “plainvis.xsl” stylesheet is used to format the input text as an
html page with the toponyms highlighted. The gazmap script pulls this
html page together with the xml list of candidate placename locations
(in the $outdir/gaz.xml file described earlier) and adds a map display
created by plotting the locations using Mapbox / OpenStreetMap. The
three components are combined in a single file named
$outdir.display.html. Various examples are shown in the Practical
Examples chapter, including Geoparser display file for news text input,
which has the maps panel at the top (green markers for top candidates,
red for others), the tagged text on the left and the $outdir/gaz.xml
list on the right.
If the -top option is specified then the display file only shows the
top candidate locations (green markers). Herodotus display file shows an example.
Finally, the timeline script takes the tagged file and produces a
display highlighting all the entities found: names, organisations and
time expressions as well as locations. It also extracts the events
detected and, where these can be given a specific date, uses
JavaScript to create a timeline visualisation using a Simile widget.
Timeline file shows an example of the $outdir.timeline.html file. The
events found are listed in $outdir.events.xml, which is in the format
required by the Timeline widget, as illustrated below:
<?xml version="1.0" encoding="UTF-8"?>
<data date-time-format="iso8601">
<event start="2010-08-15T00:00:00Z" title="will face each other for a place in Sunday">
Nadal and Murray set up semi showdown (CNN) -- Rafael Nadal and Andy
Murray are both through to the semifinals of the Rogers Cup in Toronto,
where they will face each other for a place in Sunday's final.
</event>
...
</data>
The complete file for this example is here
(html documentation only).
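The events file is plain XML, so it can also be read outside the Timeline widget. The Python sketch below parses the sample event shown above; the function name is hypothetical, and the date pattern assumes every start value has the same form as in the sample.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

EVENTS = """<data date-time-format="iso8601">
<event start="2010-08-15T00:00:00Z" title="will face each other for a place in Sunday">
Nadal and Murray set up semi showdown (CNN) -- Rafael Nadal and Andy
Murray are both through to the semifinals of the Rogers Cup in Toronto,
where they will face each other for a place in Sunday's final.
</event>
</data>"""

def read_events(events_xml):
    """Return (start, title) pairs sorted by start time."""
    events = []
    for ev in ET.fromstring(events_xml).iter("event"):
        start = datetime.strptime(ev.get("start"), "%Y-%m-%dT%H:%M:%SZ")
        events.append((start, ev.get("title")))
    return sorted(events)

for start, title in read_events(EVENTS):
    print(start.date(), title)
```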
In summary, with the -o out option, the following files are created:
File | Description |
---|---|
$out.out.xml | Main output: tagged and geogrounded text |
$out.gaz.xml | Locations list |
$out.gazlist.html | Locations list in html format |
$out.gazmap.html | Locations plotted using Mapbox / OpenStreetMap |
$out.geotagged.html | Geotagged text as html file |
$out.display.html | 3-panel display: map + text + locations list |
$out.nertagged.xml | Output from NER stage |
$out.events.xml | Events extracted in Timeline format |
$out.timeline.html | Display page with all NEs and timeline |
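
After a run, a quick way to confirm that all of the files in the table were produced is sketched below in Python. The suffix list is taken from the table; the function names are hypothetical, and the sketch assumes each file name is simply the value given to -o followed by a dot and the suffix.

```python
import os

OUTPUT_SUFFIXES = [
    "out.xml", "gaz.xml", "gazlist.html", "gazmap.html", "geotagged.html",
    "display.html", "nertagged.xml", "events.xml", "timeline.html",
]

def expected_outputs(out):
    """List the file names the table above gives for a run with -o out."""
    return [f"{out}.{suffix}" for suffix in OUTPUT_SUFFIXES]

def missing_outputs(out):
    """Return the expected files that do not exist on disk."""
    return [path for path in expected_outputs(out) if not os.path.exists(path)]
```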