.. _gaz:

##########
Gazetteers
##########

Online Resources
================

The geoparser allows the user to choose from several different online
gazetteers as the source authority against which to ground
placenames. All except `Geonames <http://www.geonames.org>`_ are
hosted by `Edina <http://edina.ac.uk>`_ through the Unlock services;
see the :ref:`unlock` chapter. In fact *Unlock Places* also maintains
a mirror of Geonames, which is accessible through either the
``unlock`` or ``unlockgeonames`` options to the ``-g`` parameter, but
the ``geonames`` option is configured to go directly to the
http://www.geonames.org site itself.

When the pipeline is executed using the ``run`` command (see
:ref:`qs-run`), the gazetteer to be used must be specified using the
``-g`` parameter. The complete set of seven online gazetteer options
is as follows:

* Geonames, ``-g geonames`` - a world-wide gazetteer of over eight
  million placenames, made available free of charge.

* OS, ``-g os`` - a detailed gazetteer of UK places, derived from the
  Ordnance Survey 1:50,000 scale gazetteer, released under the OS Open
  Data initiative. The geoparser code adds Geonames entries for large
  populated places around the globe when using this option, to allow
  resolution of placenames outside the UK.

* Natural Earth, ``-g naturalearth`` - a public domain vector and
  raster map collection of small scale (1:10m, 1:50m, 1:110m) mapping,
  built by the `Natural Earth <http://www.naturalearthdata.com>`_
  project.

* Geonames through Unlock, ``-g unlockgeonames`` - access to Geonames
  via Unlock.

* Unlock, ``-g unlock`` - a comprehensive gazetteer mainly covering
  the UK, drawing on all of the OS, Natural Earth and Geonames
  resources. This is the default option on the Unlock Places service
  and combines all of its gazetteers except DEEP.

* DEEP, ``-g deep`` - a gazetteer of historical placenames in England,
  built by the DEEP project (Digital Exposure of English
  Placenames). See footnote [1] in the Quick Start Guide and
  :ref:`ex-deep` in Practical Examples.

* Pleiades+, ``-g plplus`` - a gazetteer of the ancient Greek and
  Roman world, based on the `Pleiades <http://pleiades.stoa.org>`_
  dataset and augmented with links to Geonames.

It may be necessary to experiment with different gazetteer options to
see what works best with your text.
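
Scripts that drive the pipeline in batch mode may want to validate the
requested gazetteer up front. The following minimal Python sketch
checks a choice against the online option names listed above and
builds the corresponding ``-g`` arguments; the helper function and
descriptions are illustrative, not part of the geoparser itself:

```python
# Illustrative sketch only: validate a gazetteer choice against the
# online options documented above.  The descriptions and the helper
# function are hypothetical, not part of the geoparser distribution.
ONLINE_GAZETTEERS = {
    "geonames": "world-wide Geonames gazetteer",
    "os": "Ordnance Survey UK gazetteer",
    "naturalearth": "Natural Earth small-scale mapping",
    "unlockgeonames": "Geonames via Unlock",
    "unlock": "combined Unlock Places gazetteer",
    "deep": "historical placenames in England",
    "plplus": "Pleiades+ classical world gazetteer",
}

def gazetteer_args(option):
    """Return the ``-g`` argument pair for a valid gazetteer option."""
    if option not in ONLINE_GAZETTEERS:
        raise ValueError("unknown gazetteer option: %r" % option)
    return ["-g", option]
```

A wrapper script can then fail fast on a typo such as ``-g geoname``
rather than discovering the problem mid-run.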

.. _gaz-plplus:

**Pleiades+**

The `Pleiades <http://pleiades.stoa.org>`_ gazetteer of the classical
Greek and Roman world was added to the geoparser's resources as part
of the GAP project in 2012-13. The version used was a snapshot of the
Pleiades source dataset augmented with links to Geonames - this was
dubbed Pleiades+. This static copy of the data is mirrored on Edina
and is available with the ``-g plplus`` option. A locally hosted copy
at Edinburgh's School of Informatics was used by GAP.

Too late for the GAP project to benefit, the Pleiades dataset has
since been considerably augmented, and daily snapshots are now
available - see the Pleiades data download page. The Pleiades+
project - to align ancient places with their modern equivalents in
Geonames where possible - has also been extended, and now provides
daily downloads from the Pleiades Plus GitHub site. The organising
teams behind both of these developments have kindly agreed that other
sites can mirror their datasets, and both Edina and we (the Language
Technology Group) are hoping to do so. If you are interested in using
this data and would like to help us update the geoparser service for
it, please get in touch.

We have experimented with setting up a local copy of the latest
version of Pleiades and Pleiades+ privately on the LTG servers, and
the scripts to allow the ``-g plplus-local`` option, which accesses
this local copy, are included in the distribution as explained below.

.. _gaz-local:

Options for Local Gazetteers
============================

The standard way to use the geoparser is by referring to an online
gazetteer service, as described above. There may be circumstances in
which a locally hosted gazetteer is preferable - for example, if the
online service is too slow for the many lookups the pipeline
requires. The Edinburgh Language Technology Group (LTG) has set up
local gazetteers in this way, and this section explains how. If you
decide to do the same, these setups may be helpful models to follow.

The advantages of hosting your gazetteer yourself are that access will
typically be much faster, so overall processing times are reduced, and
that you have complete control over the gazetteer, so you can correct
errors or add new items. It may be necessary to have a local copy if
your usage rates are so high that you exceed the limits placed by
online services. The obvious disadvantage is that you create a
maintenance burden for yourself, as you need to create and manage a
database and write the software routines to interact with it.

The two local gazetteers we use are a local copy of Geonames and of
Pleiades+. The setup of these is described below, as examples of how
to go about the process. The code for these local gazetteers is
included in the geoparser download bundle but it is not possible to
access the local MySQL databases on our servers remotely, as they are
not configured as public services.

Example Setup: Geonames
-----------------------

The Geonames service includes a download option, with daily updates
provided on their `download server
<http://download.geonames.org/export/dump/>`_. The Geonames database
is large - around 8 million main entries plus alternative name
records - and the online service provides update files so that
insertions and deletions can be applied to a local copy without
having to recreate and re-index the tables every day.

In the LTG we created a MySQL database to hold the Geonames
dataset. It has a simple structure comprising a main table named
"geoname" with one row per place, and a linked subsidiary table named
"alternatename" that holds one row for each alternative name for a
given place in the main table. There is also a smaller table named
"hierarchy" that allows a hierarchical tree of places located within
larger places to be constructed.

The database can be created by downloading the relevant files from the
Geonames download server: "allCountries.zip", "alternateNames.zip" and
"hierarchy.zip". Once unzipped, these can be imported into a MySQL
database - set the character encoding to UTF-8 when you create the
database::

  create database geonames character set utf8 collate utf8_general_ci;

You will need to set up suitable access permissions and will probably
also want to create indexes to speed query performance.

We keep our copy of the database up to date by running nightly cron
jobs to download and apply changes. To make this easy, an extra set of
holding tables is used: "updates_geoname", "updates_alternatename",
"deletes_geoname" and "deletes_alternatename". The steps are:

#. Download the update files from the Geonames download server. These
   are named either "modifications" or "deletes", for the main table
   or the alternatename table, with a datestamp appended. Also
   download the hierarchy file.

#. Load the modification and deletion data into the four holding
   tables (clearing these of previous data first).

#. For the deletions, simply remove the rows from "geoname" and
   "alternatename" that have a match in the holding tables for
   deletions.

#. For the modifications, remove matching rows from "geoname" and
   "alternatename" and then insert the rows from the holding tables.

#. Drop the hierarchy table, then recreate and re-index it from the
   downloaded data.

#. Log the transactions carried out, for reporting.

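
Steps 3 and 4 above boil down to delete-then-insert operations keyed
on the place identifier. The following hedged, self-contained sketch
shows the pattern; SQLite stands in for MySQL, only the main table is
shown (the same pattern applies to "alternatename"), and the sample
data is invented:

```python
import sqlite3

# Sketch of steps 3 and 4: apply deletions and modifications from the
# holding tables.  SQLite stands in for MySQL; sample data is invented.
db = sqlite3.connect(":memory:")
db.executescript("""
  CREATE TABLE geoname (geonameid INTEGER PRIMARY KEY, name TEXT);
  CREATE TABLE updates_geoname (geonameid INTEGER PRIMARY KEY, name TEXT);
  CREATE TABLE deletes_geoname (geonameid INTEGER PRIMARY KEY);

  INSERT INTO geoname VALUES (1, 'Oldtown'), (2, 'Gone City'), (3, 'Stable');
  INSERT INTO updates_geoname VALUES (1, 'Newtown'), (4, 'Fresh Village');
  INSERT INTO deletes_geoname VALUES (2);
""")

# Step 3: remove rows matched by the deletions holding table.
db.execute("""
  DELETE FROM geoname
  WHERE geonameid IN (SELECT geonameid FROM deletes_geoname)
""")

# Step 4: remove rows matched by the modifications holding table,
# then insert the updated rows from it.
db.execute("""
  DELETE FROM geoname
  WHERE geonameid IN (SELECT geonameid FROM updates_geoname)
""")
db.execute("INSERT INTO geoname SELECT * FROM updates_geoname")

rows = sorted(db.execute("SELECT geonameid, name FROM geoname"))
print(rows)  # [(1, 'Newtown'), (3, 'Stable'), (4, 'Fresh Village')]
```

Deleting before inserting makes the modification step idempotent, so a
partially failed nightly run can simply be repeated.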
If you want to create a local copy of Geonames for yourself, there is
a zip file of the database creation routine, daily update scripts and
cron file :download:`here ` (HTML
documentation only). The directory names would need to be tailored to
your local setup. You may need to create a Geonames account - see the
Geonames website for details, as the policy seems to vary.

If a local copy of Geonames is set up in this way, then the ``-g
geonames-local`` option can be used to access it with the geoparser;
otherwise this option does not work. The ``gazlookup-geonames-local``
script must be edited to provide connection information for your
local MySQL database. If the username is "pipeline", with no password,
then only the server location needs altering, as this is the default
username in the script. The pipeline user should be set up as a
read-only account, as the pipeline never alters data in the
gazetteer. If the MySQL server is on the same machine that the
pipeline runs on, then the ``-h host`` parameter is not required. In
this, the simplest case, the database connection string in the
``gazlookup-geonames-local`` script is::

  lxmysql -u pipeline -d geonames


Example Setup: Pleiades+
------------------------

In the same way that the ``geonames-local`` option only works if a
local database is being maintained, ``-g plplus-local`` will work if
and only if a local copy of Pleiades+ has been created. This may be
desirable because, at the time of writing, the Edina version is an
out-of-date snapshot, and newer material is available as described
:ref:`above <gaz-plplus>`.

If a local version of Pleiades+ is created, then the relevant scripts
included with the geoparser download bundle will be able to use it. In
fact we have two versions: the one used for GAP and the newer version
released in 2014. If you want to experiment with these, have a look at
the ``gazlookup-plplus-OLDlocal`` (the GAP version) and
``gazlookup-plplus-NEWlocal`` scripts (the 2014 release). The ``-g
plplus-local`` option is set to use the new version.

A bundle of scripts that may be helpful if you wish to set up your own
local copy of the latest version of Pleiades and Pleiades+ is provided
:download:`here ` (HTML documentation
only). It includes routines for downloading the daily files and
loading them into a database. These could easily be set up as cron
jobs to refresh the database daily.
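
For example, a nightly refresh might be scheduled with a crontab entry
along the following lines; the script path, log file and schedule are
hypothetical placeholders, not part of the distributed bundle:

```
# Hypothetical crontab entry: refresh the local Pleiades/Pleiades+
# copy every night at 02:30, logging output for later inspection.
30 2 * * * /path/to/refresh-pleiades.sh >> /var/log/pleiades-refresh.log 2>&1
```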

Note that the gazetteer lookup scripts for accessing a local
``pleiades`` database are currently at an experimental stage. The new
database is much more complicated than the one used by GAP, and the
queries take a little longer, despite indexing. There are also a
number of attributes provided by the full Pleiades dataset that could
be used to refine the georesolution stage, but these alterations have
not yet been attempted. As mentioned above, the LTG would welcome
partners who would like to work with us on this.