Quick Start Guide

Installation

To install the Edinburgh Geoparser, download the software bundle from the LTG’s geoparser software page and unpack it in a suitable location (in your home directory, say). The directory structure produced will be as shown in the File layout Figure.

The geoparser runs on 64-bit Linux and Macintosh platforms. The underlying LT-XML2 components are available as source code for local compilation from the LTG software page, but some required components are binary only.

MacOS only

Recent versions of MacOS (since Catalina) will not, by default, run programs downloaded from the web. Before running the geoparser for the first time, run this command in your terminal in the top-level geoparser directory:

xattr -d com.apple.quarantine bin/*/*

This will remove the “quarantine” flag from the binaries.
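For setup scripts shared between Linux and Mac machines, the step above can be guarded so that it only runs where it applies. The following is a minimal sketch; the is_macos and clear_quarantine helpers are hypothetical names introduced here and are not part of the geoparser.

```shell
# Hypothetical helper: succeeds only on macOS (kernel name "Darwin").
# An explicit kernel name may be passed as $1; otherwise uname is consulted.
is_macos() {
  [ "${1:-$(uname -s)}" = "Darwin" ]
}

# Hypothetical wrapper around the xattr command shown above.
# Call it from the top-level geoparser directory.
clear_quarantine() {
  if is_macos; then
    xattr -d com.apple.quarantine bin/*/*
  else
    echo "not macOS: nothing to do"
  fi
}

# Usage: clear_quarantine
```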

Mapping

The visualisation component uses Leaflet mapping software in conjunction with either Mapbox or OpenStreetMap map tiles.

To use it with Mapbox you will need a Mapbox key (access token), which can be obtained from www.mapbox.com. When you create a Mapbox account you are automatically assigned a public access token. You can use that or create a new one. Before running the geoparser you should set the environment variable GEOPARSER_MAP_KEY to your access token.

Mapbox now requires you to provide a credit card number when you create an account, and you may not want to do this. If GEOPARSER_MAP_KEY is not set, OpenStreetMap tiles will be used instead. The main disadvantage of this, from the point of view of an English-language geoparser, is that OpenStreetMap generally displays maps in the language of the area rather than in English.
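As a minimal sketch of the choice described above, the following snippet sets GEOPARSER_MAP_KEY and reports which tile source would be expected. The token value is a placeholder and the map_tiles helper is hypothetical. The geoparser makes this decision internally, and treating an empty value the same as an unset one is an assumption.

```shell
# Hypothetical helper: reports which tile source the text above implies,
# based only on whether GEOPARSER_MAP_KEY is set and non-empty.
map_tiles() {
  if [ -n "${GEOPARSER_MAP_KEY:-}" ]; then
    echo "mapbox"
  else
    echo "openstreetmap"
  fi
}

# Placeholder token; substitute your own Mapbox public access token.
export GEOPARSER_MAP_KEY="pk.your-token-here"
map_tiles

# Unset the variable to fall back to OpenStreetMap tiles.
unset GEOPARSER_MAP_KEY
map_tiles
```

Putting the export line in your shell start-up file saves setting it in every new terminal.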

Local Gazetteer Database

The geoparser can reference a range of different gazetteers, hosted by Information Services at the University of Edinburgh or locally. For the former (see -g gazetteer parameter) no additional software is needed.

If you want to use a local copy of a gazetteer, you will need to install and manage it yourself. The pros and cons of using a local gazetteer are discussed in the Options for Local Gazetteer section, which also describes an example using Geonames, for which the geoparser is already configured. This example uses a locally managed MySQL database. If you plan to use the geonames-local option you will need to set the GEOPARSER_DB_COMMAND environment variable to specify how to connect to the server. This is also explained in the Options for Local Gazetteer section.

Running the Pipeline

To test the pipeline, do this:

cd scripts
cat ../in/172172.txt | ./run -t plain -g unlock

This specifies plain text input and uses unlock as the gazetteer. The output XML is written to stdout.

Note that the order of the -t and -g options is immaterial. This applies to all the command line options.

Visualisation output: -o

To run and create visualisation files:

cat ../in/172172.txt | ./run -t plain -g unlock -o ../out 172172

This is the same as before, except that -o takes two arguments: an output directory (../out) and a prefix for the output file names (172172). The output directory must already exist. The results appear in the output directory (../out):

../out/172172.display.html  ../out/172172.geotagged.html
../out/172172.events.xml    ../out/172172.out.xml
../out/172172.gaz.xml       ../out/172172.nertagged.xml
../out/172172.gazlist.html
../out/172172.gazmap.html

  • 172172.display.html is the geoparser map display.

  • 172172.timeline.html is the timeline display (see footnote 1; note that person, location, organisation and date entities are highlighted in this display).

  • 172172.out.xml is the output that goes to stdout when the pipeline is run without -o.

The other files are either used by the map display or may be useful in their own right.
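Because the output directory must already exist, a wrapper script can create it before calling the pipeline. The sketch below also derives the file-name prefix from each input file name. The prefix_for and geoparse_dir helpers, and the choice of prefix, are illustrative assumptions rather than part of the geoparser.

```shell
# Hypothetical helper: derive an output prefix from an input path,
# e.g. ../in/172172.txt -> 172172
prefix_for() {
  basename "$1" .txt
}

# Hypothetical wrapper: run the geoparser on every .txt file in a directory.
# Intended to be called from the scripts directory.
geoparse_dir() {
  in_dir=$1
  out_dir=$2
  mkdir -p "$out_dir"   # -o requires the output directory to exist
  for f in "$in_dir"/*.txt; do
    [ -e "$f" ] || continue   # skip if the glob matched nothing
    ./run -t plain -g unlock -o "$out_dir" "$(prefix_for "$f")" < "$f"
  done
}

# Usage: geoparse_dir ../in ../out
```

Redirecting each file into ./run is equivalent to the cat pipeline used in the examples above.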

Single placename markers: -top

By default, all candidate placenames are shown in the display, with the top-ranked one in green and the rest in red. If the -top option is added to the command line then the display file will show only the top-ranked candidate for each place, not all the alternatives considered.

Input type and gazetteer: -t -g

The options for -t type and -g gazetteer are:

-t   plain          (plain text)
     ltgxml         (xml file in a certain format with paragraphs marked up)
     gb             (Google Books html files)

-g   unlock         (Edina's Unlock gazetteer)
     os             (Just the OS part of Unlock)
     naturalearth   (Just the Natural Earth part of Unlock)
     unlockgeonames (Just the GeoNames part of Unlock)
     geonames       (online world-wide gazetteer)
     plplus         (Pleiades+ gazetteer of ancient places)
     deep           (DEEP gazetteer of historical placenames in England)

   [ geonames-local (locally maintained copy on ed.ac.uk network) ]
   [ plplus-local   (locally maintained Pleiades+, with geonames lookup) ]

The last two gazetteer options will only be usable if local gazetteers are maintained; they are included in case they are useful. See Options for Local Gazetteer for how to make use of them.

If your input is xml with paragraphs already marked, it may be worth converting it to ltgxml format. See the example in/172172.xml for the format.

For Google Books input, which can be extremely untidy, pre-processing is done to ensure it doesn’t break the xml processes in the pipeline.

Docdate: -d

If you know the creation/writing date of the document you can supply this with -d docdate:

cat ../in/172172.txt | ./run -t plain -g unlock -d 2010-08-13
cat ../in/172172.txt | ./run -t plain -g unlock -o ../out 172172 -d 2010-08-13

This will be used in event and relation detection and timeline display.
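The examples above pass the date as YYYY-MM-DD. As a sketch, a small check can catch a malformed date before the pipeline is run. The is_docdate helper is hypothetical, and treating YYYY-MM-DD as the only accepted form is an assumption based on the examples above.

```shell
# Hypothetical helper: succeeds if $1 has the shape YYYY-MM-DD.
# It checks the shape only, not whether the date exists in the calendar.
is_docdate() {
  case "$1" in
    [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]) return 0 ;;
    *) return 1 ;;
  esac
}

docdate="2010-08-13"
if is_docdate "$docdate"; then
  echo "ok: -d $docdate"
else
  echo "unexpected date format: $docdate" >&2
fi
```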

Limiting geographical area: -l -lb

If you know that toponyms in your text are likely to be in a particular geographical area you can specify either a bounding circle (-l locality) or a rectangle (-lb locality box). The geoparser will prefer places in the area specified but will still choose locations outside it if other factors give them higher weighting.

To specify a circular locality:

-l lat long radius score

where

  • lat and long are in decimal degrees (i.e. 57.5 for 57 degrees 30 minutes)

  • radius is in km

  • score is a numeric weight assigned to locations within the area (locations outside it score 0).
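Coordinates are often quoted in degrees and minutes, so a conversion to decimal degrees may be needed first. The sketch below assumes the usual convention that southern latitudes and western longitudes are negative. The dm_to_decimal helper is hypothetical, and the radius and score values are illustrative only.

```shell
# Hypothetical helper: convert degrees and minutes plus a hemisphere letter
# (N, S, E or W) to signed decimal degrees with four decimal places.
dm_to_decimal() {
  awk -v d="$1" -v m="$2" -v h="$3" 'BEGIN {
    v = d + m / 60
    if (h == "S" || h == "W") v = -v
    printf "%.4f\n", v
  }'
}

lat=$(dm_to_decimal 57 30 N)    # 57.5000
long=$(dm_to_decimal 3 12 W)    # -3.2000

# Radius (50 km) and score (2) are example values, not recommendations.
printf '%s\n' "-l $lat $long 50 2"
```

The printed options can then be appended to a ./run command line.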

To specify a locality box:

-lb W N E S score

where

  • W(est) N(orth) E(ast) S(outh) are decimal degrees

  • score is as for -l option.
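The W N E S order is easy to get wrong, so a small helper can assemble the option and reject a box whose northern edge is not north of its southern edge. This is only a sketch: lb_args is a hypothetical helper, and the coordinates and score below are example values.

```shell
# Hypothetical helper: print -lb arguments in the W N E S order described
# above, refusing boxes whose north edge is not north of the south edge.
lb_args() {
  w=$1 n=$2 e=$3 s=$4 score=$5
  if awk -v n="$n" -v s="$s" 'BEGIN { exit !(n > s) }'; then
    printf '%s\n' "-lb $w $n $e $s $score"
  else
    echo "north ($n) must be greater than south ($s)" >&2
    return 1
  fi
}

# Illustrative box roughly covering Scotland.
lb_args -8 61 0 54.5 2
```

West and east are deliberately not compared, since a box that crosses the 180 degree meridian would have a west value greater than its east value.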

DEEP only options: -c -r

For DEEP (see footnote 2) a new -c county option has been added. This allows the user to specify the county that the document is about, so that only DEEP gazetteer entries for that county are considered. Multiple uses of -c allow several counties to be specified. For example:

cat <infile> | ./run -t plain -g deep -c Oxfordshire -c Wiltshire

The values for -c are the county names in the DEEP gazetteer:

Bedfordshire, Berkshire, Buckinghamshire, Cambridgeshire, Cheshire, Cumberland, Derbyshire, Devon, Dorset, Durham, East Riding of Yorkshire, Essex, Gloucestershire, Hertfordshire, Huntingdonshire, Leicestershire, Lincolnshire, Middlesex, Norfolk, North Riding of Yorkshire, Northamptonshire, Nottinghamshire, Oxfordshire, Rutland, Shropshire, Staffordshire, Surrey, Sussex, The Isle of Ely, Warwickshire, West Riding of Yorkshire, Westmorland, Wiltshire, Worcestershire.

Note that county names with white space need to be enclosed in double quotes:

cat <infile> | ./run -t plain -g deep -c Oxfordshire -c Wiltshire \
  -c "North Riding of Yorkshire" -c "East Riding of Yorkshire" \
  -c "West Riding of Yorkshire"
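When the county list is built in a script, each name has to stay a single argument. The following sketch builds the -c options one county at a time using the shell's positional parameters, which preserve embedded spaces. Both show_args and with_counties are hypothetical helpers used here only to make the argument splitting visible.

```shell
# Hypothetical helper: print one bracketed line per argument received.
show_args() {
  for a in "$@"; do
    printf '[%s]\n' "$a"
  done
}

# Hypothetical wrapper: call a command with one -c option per county name.
# The first argument is the command to run; the rest are county names.
with_counties() {
  cmd=$1; shift
  n=$#
  while [ "$n" -gt 0 ]; do
    county=$1; shift
    set -- "$@" -c "$county"   # move the county to the end, preceded by -c
    n=$((n - 1))
  done
  "$cmd" "$@"
}

with_counties show_args Oxfordshire "North Riding of Yorkshire"
```

Each county name appears on a single bracketed line, which shows that the quoting has survived. To use this with the pipeline, the command passed to with_counties would be a small function that calls ./run with the -t and -g options.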

A new -r begindate enddate option is also available for DEEP. It restricts the choice to DEEP gazetteer records which have attestation dates within the date range:

cat ../in/essexff.txt | ./run -t plain -g deep -c Essex -r 1000 1400

Footnotes

1

The timeline display has been tested in Firefox, Safari and Chrome and needs to be served from a web server to work properly. For details on how to do that, see the end of the Modern text section in Practical Examples.

2

DEEP, Digital Exposure of English Placenames http://deep.kdl.kcl.ac.uk/, was a JISC-funded project to digitise and make available the 86 volumes of the Survey of English Place-Names. See http://epns.nottingham.ac.uk to search or browse the source material it worked with, which covers the evolution of placenames in England. The 86-volume county by county survey details over four million variant forms, from classical sources, through the Anglo-Saxon period and into medieval England and beyond to the modern period. For more details about the DEEP data see the paper “A Gazetteer and Georeferencing for Historical English Documents” in Appendix 2: LTG Publications about the Geoparser chapter. See also Practical Examples > Historical documents (relating to England).