Gazetteers

Online Resources

The geoparser allows the user to choose from several different online gazetteers as the source authority against which to ground placenames. All gazetteers except Geonames are hosted by Information Services (IS) at the University of Edinburgh. IS also hosts a mirror of Geonames, which is accessible through either the unlock or unlockgeonames options to the -g parameter, whereas the geonames option is configured to go directly to the http://www.geonames.org site.

When the pipeline is executed using the run command (see Running the Pipeline), the gazetteer to be used must be specified using the -g parameter. The complete set of seven online gazetteer options is as follows:

  • Geonames, -g geonames - a world-wide gazetteer of over eight million placenames, made available free of charge.

  • OS, -g os - a detailed gazetteer of UK places, derived from the Ordnance Survey 1:50,000 scale gazetteer, under the OS Open Data initiative. The geoparser code adds GeoNames entries for large populated places around the globe when using this option to allow resolution of place names outside the UK.

  • Natural Earth, -g naturalearth - a public domain vector and raster map collection of small scale (1:10m, 1:50m, 1:110m) mapping, built by the Natural Earth project.

  • Geonames through Unlock, -g unlockgeonames - access to GeoNames via Unlock.

  • Unlock, -g unlock - a comprehensive gazetteer mainly for the UK, using all of OS, Natural Earth and GeoNames resources. This is the default option on the Unlock Places service and combines all their gazetteers except DEEP.

  • DEEP, -g deep - a gazetteer of historical placenames in England, built by the DEEP project (Digital Exposure of English Placenames). See footnote [1] in the Quick Start Guide and Historical documents (relating to England) in Practical Examples.

  • Pleiades+, -g plplus - a gazetteer of the ancient Greek and Roman world, based on the Pleiades dataset and augmented with links to Geonames.

It may be necessary to experiment with different gazetteer options to see what works best with your text.

Pleiades+

The Pleiades gazetteer of the classical Greek and Roman world was added to the geoparser’s resources as part of the GAP project in 2012-13. The version used was a snapshot of the Pleiades source dataset augmented with links to Geonames - this was dubbed Pleiades+. This static copy of the data is mirrored on the IS server and available with the -g plplus option. A locally hosted copy of it at Edinburgh’s School of Informatics was used by GAP.

Too late for the GAP project, the Pleiades dataset has been considerably augmented and daily snapshots are now available - see the Pleiades data download page. The Pleiades+ project, which aligns ancient places with their modern equivalents in Geonames where possible, has also been extended and provides daily downloads from the Pleiades Plus Github site. The organising teams behind both of these developments have kindly agreed that other sites can mirror their datasets, and we in the Language Technology Group, together with IS, hope to do so. If you are interested in using this data and would like to help us update the geoparser service for it, please get in touch.

Options for Local Gazetteer

The standard way to use the geoparser is by referring to an online gazetteer service, as described above. There may be circumstances in which a locally hosted gazetteer is preferable, for example if the online service is too slow for the many lookups the pipeline requires. The Edinburgh Language Technology Group (LTG) have set up local gazetteers in this way, and this section explains how to do it. If you decide to do the same, our setup may be a helpful model to follow.

The advantages of hosting your gazetteer yourself are that access will typically be much faster so overall processing times are reduced, and you have complete control over the gazetteer so can correct errors or add new items. It may be necessary to have a local copy if your usage rates are so high that you exceed the limits placed by online services. The obvious disadvantage is that you create a maintenance burden for yourself, as you need to create and manage a database and write the software routines to interact with it.

One local gazetteer we use is a local copy of Geonames. Its setup is described below as an example of how to go about the process. The code for running a local gazetteer is included in the geoparser download bundle, but the local MySQL databases on our servers cannot be accessed remotely, as they are not configured as public services.

Example Setup: Geonames

The Geonames service includes a download option with daily updates provided on their download server. The Geonames database is large - around 8 million main entries plus alternative name records - and the online service provides update files so that insertions and deletions can be applied to a local copy, without having to recreate and re-index the tables every day.

In the LTG we created a MySQL database to hold the Geonames dataset. It has a simple structure comprising a main table named “geoname” with one row per place, and a linked subsidiary table named “alternatename” that holds one row for each alternative name for a given place in the main table. There is also a smaller table named “hierarchy” that allows a hierarchical tree of places located within larger places to be constructed.
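To make the role of the “hierarchy” table concrete, the following sketch walks from a place up through the places that contain it. It is only an illustration under stated assumptions: the table is reduced to (parent id, child id) pairs, each child is assumed to have a single parent, and the ids and the NAMES lookup are made up for the example rather than taken from Geonames.

```python
# Hypothetical (parent_id, child_id) rows, as might be read from "hierarchy".
HIERARCHY_ROWS = [
    (1, 2),  # Earth -> Europe
    (2, 3),  # Europe -> United Kingdom
    (3, 4),  # United Kingdom -> Scotland
    (4, 5),  # Scotland -> Edinburgh
]

# Hypothetical id -> name lookup, standing in for the "geoname" table.
NAMES = {1: "Earth", 2: "Europe", 3: "United Kingdom", 4: "Scotland", 5: "Edinburgh"}


def ancestors(place_id, rows):
    """Return the ids of the places containing place_id, nearest first."""
    # Assumption: one parent per child; later rows overwrite earlier ones.
    parent_of = {child: parent for parent, child in rows}
    chain = []
    seen = {place_id}
    current = place_id
    while current in parent_of:
        current = parent_of[current]
        if current in seen:  # guard against cycles in malformed data
            break
        seen.add(current)
        chain.append(current)
    return chain


if __name__ == "__main__":
    path = [5] + ancestors(5, HIERARCHY_ROWS)
    print(" < ".join(NAMES[i] for i in path))
```

With a real database the same walk would more likely be done by repeated queries or joins against “hierarchy”; the dictionary here just keeps the example self-contained.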

The database can be created by downloading the relevant files from the Geonames download server: “allCountries.zip”, “alternateNames.zip” and “hierarchy.zip”. Once unzipped, these can be imported into a MySQL database - set the character encoding to UTF-8 when you create the database:

create database geonames character set utf8 collate utf8_general_ci;

You will need to set up suitable access permissions and will probably also want to create indexes to speed query performance.
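As a rough illustration of the import and indexing step, the sketch below loads tab-separated rows into a cut-down “geoname” table and adds an index on the name column. It makes several assumptions: Python's built-in sqlite3 module stands in for MySQL so that the example runs anywhere, only four columns are kept (the real download files carry more), and the sample rows, the index name and the helper names load_geoname and lookup are invented for the example.

```python
import csv
import io
import sqlite3

# Two sample rows in tab-separated form: id, name, latitude, longitude.
# Illustrative values only, not copied from the Geonames files.
SAMPLE_TSV = "1\tEdinburgh\t55.95\t-3.19\n2\tGlasgow\t55.86\t-4.25\n"


def load_geoname(conn, tsv_file):
    """Create a cut-down "geoname" table, fill it from tsv_file, and index it."""
    conn.execute(
        "CREATE TABLE geoname ("
        " geonameid INTEGER PRIMARY KEY,"
        " name TEXT, latitude REAL, longitude REAL)"
    )
    # QUOTE_NONE: treat the input as plain tab-separated text, so any quote
    # characters inside names are kept literally.
    reader = csv.reader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = ((int(r[0]), r[1], float(r[2]), float(r[3])) for r in reader)
    conn.executemany("INSERT INTO geoname VALUES (?, ?, ?, ?)", rows)
    # Index the column that placename lookups will filter on.
    conn.execute("CREATE INDEX idx_geoname_name ON geoname (name)")
    conn.commit()


def lookup(conn, name):
    """Return (geonameid, latitude, longitude) for every place with this name."""
    return conn.execute(
        "SELECT geonameid, latitude, longitude FROM geoname WHERE name = ?",
        (name,),
    ).fetchall()


if __name__ == "__main__":
    db = sqlite3.connect(":memory:")
    load_geoname(db, io.StringIO(SAMPLE_TSV))
    print(lookup(db, "Edinburgh"))
```

Which columns are worth indexing depends on the queries your own gazetteer routines issue; the name column is simply the most obvious candidate.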

We keep our copy of the database up to date by running nightly cron jobs to download and apply changes. To make this easy, an extra set of tables is used: “updates_geoname”, “updates_alternatename”, “deletes_geoname”, “deletes_alternatename”. The steps are:

  1. Download the update files from the Geonames download server. These are named either “modifications” or “deletes” for the main table or the alternatename table, with a datestamp appended. Also download the hierarchy file.

  2. Load the modification and deletion data into the four holding tables (clearing these of previous data first).

  3. For the deletions, simply remove rows from “geoname” and “alternatename” that have a match in the holding tables for deletions.

  4. For the modifications, remove matching rows from “geoname” and “alternatename” and then insert the rows from the holding tables.

  5. Drop the hierarchy table then recreate and re-index it from the downloaded data.

  6. Log the transactions carried out, for reporting.
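Steps 3 and 4 can be sketched as follows. This is a minimal illustration and not the LTG scripts: it uses Python's built-in sqlite3 module in place of MySQL, cuts the tables down to two or three columns, and assumes that the key columns are named geonameid and alternatenameid and that each “deletes” table holds only the key of the row to remove. The demo data and the function names are invented for the example.

```python
import sqlite3


def make_demo_db():
    """Build tiny main and holding tables, as if step 2 had just been run."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE geoname (geonameid INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE alternatename (
            alternatenameid INTEGER PRIMARY KEY, geonameid INTEGER, name TEXT);
        CREATE TABLE updates_geoname (geonameid INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE updates_alternatename (
            alternatenameid INTEGER PRIMARY KEY, geonameid INTEGER, name TEXT);
        CREATE TABLE deletes_geoname (geonameid INTEGER PRIMARY KEY);
        CREATE TABLE deletes_alternatename (alternatenameid INTEGER PRIMARY KEY);
        INSERT INTO geoname VALUES (1, 'Edinburgh'), (2, 'Glasgo'), (3, 'Atlantis');
        INSERT INTO alternatename VALUES (10, 1, 'Edinburg'), (11, 3, 'Atlantida');
        INSERT INTO updates_geoname VALUES (2, 'Glasgow'), (4, 'Stirling');
        INSERT INTO deletes_geoname VALUES (3);
        INSERT INTO deletes_alternatename VALUES (11);
    """)
    return conn


def apply_updates(conn):
    """Apply steps 3 and 4 to both tables; return row counts for the log."""
    log = {}
    # Table and key names are fixed here, so formatting them into SQL is safe.
    for table, key in (("geoname", "geonameid"),
                       ("alternatename", "alternatenameid")):
        # Step 3: remove rows that appear in the deletions holding table.
        cur = conn.execute(
            f"DELETE FROM {table} WHERE {key} IN (SELECT {key} FROM deletes_{table})"
        )
        log[f"{table}_deleted"] = cur.rowcount
        # Step 4: drop the old version of each modified row, then insert
        # every row from the modifications holding table.
        conn.execute(
            f"DELETE FROM {table} WHERE {key} IN (SELECT {key} FROM updates_{table})"
        )
        cur = conn.execute(f"INSERT INTO {table} SELECT * FROM updates_{table}")
        log[f"{table}_inserted"] = cur.rowcount
    conn.commit()
    return log


if __name__ == "__main__":
    db = make_demo_db()
    print(apply_updates(db))
    print(db.execute("SELECT * FROM geoname ORDER BY geonameid").fetchall())
```

Deleting before inserting means a modification record is handled the same way whether the place already exists locally or is new, so no separate update statement is needed.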

If you want to create a local copy of Geonames for yourself, there is a zip file of the database creation routine, daily update scripts and cron file here (html documentation only). The directory names would need to be tailored to your local setup. You may need to create a Geonames account - see the Geonames website for details, as the policy seems to vary.

If a local copy of Geonames is set up in this way, the -g geonames-local option can be used to access it with the geoparser; without a local copy this option does not work. The command for connecting to the local database is specified by the environment variable GEOPARSER_DB_COMMAND, which should be set before running the pipeline. For a MySQL server running on a machine “dbserver”, with a database username “pipeline” and password “passwd”, a suitable command would be:

lxmysql -h dbserver -u pipeline -p passwd -d geonames

The -h option can be omitted if the database is running on the same machine as the pipeline, and the -p option if there is no password.

It is also possible to use a PostgreSQL database. Use the lxpostgresql command for this. It takes the same arguments as lxmysql.
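The rules for which options can be left out are easy to get wrong when scripting, so here is a small sketch that assembles the command string and places it in GEOPARSER_DB_COMMAND. The helper name build_db_command is hypothetical, and the sketch uses only the -h, -u, -p and -d options described above. It assumes that none of the values contain spaces, since how the pipeline splits the variable is not specified here.

```python
import os


def build_db_command(database, user, host=None, password=None, tool="lxmysql"):
    """Assemble a connection command from the -h, -u, -p and -d options."""
    if tool not in ("lxmysql", "lxpostgresql"):
        raise ValueError(f"unsupported tool: {tool}")
    args = [tool]
    if host:  # -h is omitted when the database is on the same machine
        args += ["-h", host]
    args += ["-u", user]
    if password:  # -p is omitted when there is no password
        args += ["-p", password]
    args += ["-d", database]
    # Assumption: plain space-separated words are all the pipeline needs.
    return " ".join(args)


if __name__ == "__main__":
    command = build_db_command("geonames", "pipeline",
                               host="dbserver", password="passwd")
    # Child processes started from this script inherit the variable.
    os.environ["GEOPARSER_DB_COMMAND"] = command
    print(command)
```

Setting the variable from Python only affects processes launched from the same script; when running the pipeline from a shell, set the variable in that shell instead.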

The lxmysql and lxpostgresql binaries provided may not run on your machine. If necessary you can build your own binaries; they are part of our LTXML2 toolkit which can be downloaded (along with the required XML parser, RXP) from: