.. _quickstart:
*****************
Quick Start Guide
*****************
Installation
============
To install the Edinburgh Geoparser, download the software bundle from
the `LTG's geoparser software page
`_ and unpack it in a
suitable location (in your home directory, say). The directory
structure produced will be as shown in the :ref:`ov-dirsFig` Figure.
The visualisation components use Google Maps and the ``gazmap`` and
``gazmap-top`` scripts contain API keys obtained for the ed.ac.uk
domain, held in the ``defkey`` variable. That kind of API key is no
longer available from Google so, rather than suggest you insert your
own key, we have left ours in place. If you do have a suitable API key
(obtained before 2013) please insert it in these scripts in place of
ours.
The geoparser runs on Linux and Macintosh platforms, both 32-bit and
64-bit. The underlying LT-XML2 components are available in source code
for local compilation, from the `LTG software page
`_ , but some required components
are binary only.
The geoparser can reference a range of different gazetteers, hosted on
the web, on Edina's `Unlock service `_ or
locally. For the web-based and Unlock gazetteers (see :ref:`-g
gazetteer parameter `) no additional software is
needed.
If you want to use a local copy of a gazetteer, you will need to
install and manage it yourself. The pros and cons of using a local
gazetteer are discussed
in the :ref:`gaz-local` section and two examples - for which the
geoparser is already configured - are described: `Geonames
`_ and `Pleiades
`_. Both of these examples use a locally
managed MySQL database. If you plan to use the ``geonames-local`` or
``plplus-local`` options you will need to set up the gazetteers as
described and edit the ``gazlookup-geonames-local`` and
``gazlookup-plplus-local`` scripts to contain the correct connection
string for your MySQL database. This is also explained in the
:ref:`gaz-local` section.
.. _qs-run:
Running the Pipeline
====================
To test the pipeline, do this::
cd scripts
cat ../in/172172.txt | ./run -t plain -g unlock
This specifies plain text input and selects Unlock as the gazetteer.
The output XML is written to stdout.
Note that the order of the ``-t`` and ``-g`` options is
immaterial. This applies to all the command line options.
.. _qs-out:
Visualisation output: -o
------------------------
To run and create visualisation files::
cat ../in/172172.txt | ./run -t plain -g unlock -o ../out 172172
This is the same as before, except that the ``-o`` option takes two
arguments: an output directory (``../out``) and a prefix for the output
file names (``172172``). **The output directory must already exist.**
The results appear in the output directory (``../out``)::
../out/172172.display.html ../out/172172.geotagged.html
../out/172172.events.xml ../out/172172.out.xml
../out/172172.gaz.xml ../out/172172.nertagged.xml
../out/172172.gazlist.html ../out/172172.timeline.html
../out/172172.gazmap.html
* 172172.display.html is the geoparser map display.
* 172172.timeline.html is the timeline display [1]_ (note that person,
location, organisation and date entities are highlighted in this
display).
* 172172.out.xml is the output that goes to stdout when it is run without
``-o``.
The other files are either used for the map and timeline displays or
may be useful in their own right.
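Because ``-o`` needs an existing directory, it can be convenient to create
the directory as part of the invocation. The following is a minimal sketch
of such a wrapper, intended to be run from the ``scripts`` directory. The
function name ``geoparse_to_dir`` is hypothetical and not part of the
geoparser distribution, and the ``-t plain -g unlock`` options are simply
those used in the example above.

```shell
# Hypothetical helper, not part of the geoparser distribution.
# Usage: geoparse_to_dir INPUT_FILE OUTPUT_DIR PREFIX [extra ./run options...]
geoparse_to_dir() {
    if [ "$#" -lt 3 ]; then
        echo "usage: geoparse_to_dir INPUT_FILE OUTPUT_DIR PREFIX [options]" >&2
        return 2
    fi
    gp_input=$1
    gp_outdir=$2
    gp_prefix=$3
    shift 3
    # -o requires the output directory to exist, so create it if necessary.
    mkdir -p "$gp_outdir" || return 1
    # Any remaining arguments (for example -top) are passed through unchanged.
    ./run -t plain -g unlock -o "$gp_outdir" "$gp_prefix" "$@" < "$gp_input"
}
```

For instance, ``geoparse_to_dir ../in/172172.txt ../out 172172`` would
correspond to the command shown earlier in this section.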
Single placename markers: -top
------------------------------
By default, all candidate placenames are shown in the display, with
the top-ranked one in green and the rest in red. If the ``-top`` option is
added to the command line then three extra display files will be
created, which show only the top-ranked candidate for each place, not
all the alternatives considered. For the example used above the extra
files would be::
../out/172172.display-top.html ../out/172172.gazmap-top.html
../out/172172.gazlist-top.html
* 172172.display-top.html is the geoparser map display with only one
placename marker per toponym.
.. _qs-tandgparms:
Input type and gazetteer: -t -g
-------------------------------
The options for ``-t type`` and ``-g gazetteer`` are::
    -t  plain             (plain text)
        ltgxml            (XML file in a certain format with paragraphs marked up)
        gb                (Google Books HTML files)
    -g  unlock            (Edina's gazetteer of mainly UK placenames)
        os                (Just the OS part of Unlock)
        naturalearth      (Just the Natural Earth part of Unlock)
        geonames          (online world-wide gazetteer)
        plplus            (Pleiades+ gazetteer of ancient places, on Edina)
        deep              (DEEP gazetteer of historical placenames in England)
      [ geonames-local    (locally maintained copy on ed.ac.uk network) ]
      [ plplus-local      (locally maintained Pleiades+, with geonames lookup) ]
The last two gazetteer options are only usable if local gazetteers
are maintained; they are included in case they are useful. See
:ref:`gaz-local` for how to make use of them.
If your input is XML with paragraphs already marked, it may be worth converting
it to ltgxml format. See the example ``in/172172.xml`` for the format.
Google Books input can be extremely untidy, so it is pre-processed
to ensure it doesn't break the XML processing steps in the pipeline.
Docdate: -d
-----------
If you know the creation/writing date of the document you can supply
this with ``-d docdate``::
cat ../in/172172.txt | ./run -t plain -g unlock -d 2010-08-13
cat ../in/172172.txt | ./run -t plain -g unlock -o ../out 172172 -d 2010-08-13
The date is used in event and relation detection and in the timeline display.
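The example above suggests that the date is given as ``YYYY-MM-DD``. Assuming
that format, a small check before calling ``./run`` can catch malformed dates
early. This is only a sketch: the helper name ``is_docdate`` is hypothetical,
and the accepted format is inferred from the example rather than from a
specification.

```shell
# Hypothetical helper: succeeds if its argument looks like YYYY-MM-DD.
# It checks the shape and the month/day ranges only, not whether the
# day actually exists in that month.
is_docdate() {
    printf '%s\n' "$1" |
        grep -Eq '^[0-9]{4}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])$'
}

docdate="2010-08-13"
if is_docdate "$docdate"; then
    echo "docdate ok: $docdate"
else
    echo "docdate not in YYYY-MM-DD form: $docdate" >&2
fi
```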
.. _qs-locality:
Limiting geographical area: -l -lb
----------------------------------
If you know that toponyms in your text are likely to be in a particular
geographical area you can specify either a bounding circle (``-l locality``)
or a bounding rectangle (``-lb locality box``). The geoparser will prefer
places in the area specified but will still choose locations outside it if
other factors give them a higher weighting.
To specify a circular locality::
-l lat long radius score
where
* lat and long are in decimal degrees (*i.e.* 57.5 for 57 degrees 30 minutes)
* radius is in km
* score is a numeric weight assigned to locations within the area (locations
  outside it score 0).
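The conversion to decimal degrees is simply degrees plus minutes divided by
60, so 57 degrees 30 minutes is 57 + 30/60 = 57.5. As an illustration, the
sketch below does this with ``awk``. The helper name ``dm_to_decimal`` is
hypothetical, and it assumes the usual convention that southern latitudes
and western longitudes are negative.

```shell
# Hypothetical helper: dm_to_decimal DEGREES MINUTES [N|S|E|W]
# Prints the value in decimal degrees; S and W give negative values
# (an assumed, conventional sign convention).
dm_to_decimal() {
    awk -v d="$1" -v m="$2" -v h="${3:-N}" 'BEGIN {
        v = d + m / 60
        if (h == "S" || h == "W") v = -v
        printf "%.4f\n", v
    }'
}

dm_to_decimal 57 30 N    # prints 57.5000
dm_to_decimal 3 15 W     # prints -3.2500
```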
To specify a locality box::
-lb W N E S score
where
* W(est) N(orth) E(ast) S(outh) are decimal degrees
* score is as for -l option.
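Since the box is given in the order W N E S, it is easy to transpose values
by accident. The following sketch checks that the coordinates are plausible
before printing the option string. The helper name ``lb_args`` and the
example coordinates and score are hypothetical, and the range checks assume
numeric input with western longitudes and southern latitudes negative.

```shell
# Hypothetical helper: lb_args W N E S SCORE
# Prints "-lb W N E S SCORE" if the box is plausible, otherwise fails.
lb_args() {
    if [ "$#" -ne 5 ]; then
        echo "usage: lb_args W N E S SCORE" >&2
        return 2
    fi
    # Longitudes must lie in [-180, 180], latitudes in [-90, 90],
    # and the northern edge must be above the southern edge.
    awk -v w="$1" -v n="$2" -v e="$3" -v s="$4" 'BEGIN {
        ok = (w >= -180 && w <= 180 && e >= -180 && e <= 180 &&
              n >= -90 && n <= 90 && s >= -90 && s <= 90 && n > s)
        exit !ok
    }' || {
        echo "lb_args: implausible box: W=$1 N=$2 E=$3 S=$4" >&2
        return 1
    }
    printf '%s\n' "-lb $1 $2 $3 $4 $5"
}

# Illustrative values only.
lb_args -8 61 0 54.5 2    # prints -lb -8 61 0 54.5 2
```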
DEEP only options: -c -r
------------------------
For DEEP [2]_ a new ``-c county`` option has been added. This allows the
user to specify the county that the document is about, so that only
DEEP gazetteer entries for that county are considered. Multiple uses of
``-c`` allow several counties to be specified. For example::
cat | ./run -t plain -g deep -c Oxfordshire -c Wiltshire
The values for ``-c`` are the county names in the DEEP gazetteer:
Bedfordshire, Berkshire, Buckinghamshire, Cambridgeshire, Cheshire,
Cumberland, Derbyshire, Devon, Dorset, Durham, East Riding of
Yorkshire, Essex, Gloucestershire, Hertfordshire, Huntingdonshire,
Leicestershire, Lincolnshire, Middlesex, Norfolk, North Riding of
Yorkshire, Northamptonshire, Nottinghamshire, Oxfordshire, Rutland,
Shropshire, Staffordshire, Surrey, Sussex, The Isle of Ely,
Warwickshire, West Riding of Yorkshire, Westmorland, Wiltshire,
Worcestershire.
Note that county names with white space need to be enclosed in double quotes::
cat | ./run -t plain -g deep -c Oxfordshire -c Wiltshire \
    -c "North Riding of Yorkshire" -c "East Riding of Yorkshire" \
    -c "West Riding of Yorkshire"
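When the list of counties is long, the repeated ``-c`` options can be
generated from the county names, with the quoting handled by the shell. The
sketch below shows one way to do this in POSIX sh, to be run from the
``scripts`` directory. The function name ``run_deep`` is hypothetical and
not part of the geoparser distribution.

```shell
# Hypothetical helper: run_deep INPUT_FILE COUNTY...
# Passes one "-c COUNTY" pair to ./run for each county name given.
run_deep() {
    rd_input=$1
    shift
    rd_count=$#                 # number of county names supplied
    # Append "-c NAME" for every county; the loop list is expanded once,
    # so extending "$@" inside the loop is safe.
    for rd_county in "$@"; do
        set -- "$@" -c "$rd_county"
    done
    shift "$rd_count"           # drop the original bare names
    ./run -t plain -g deep "$@" < "$rd_input"
}
```

For instance, ``run_deep ../in/essexff.txt Essex "North Riding of Yorkshire"``
would pass ``-c Essex -c "North Riding of Yorkshire"`` to ``./run``, with
each name kept as a single argument.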
A new ``-r begindate enddate`` option is also available for DEEP. It
restricts the choice to DEEP gazetteer records that have attestation
dates within the given date range::
cat ../in/essexff.txt | ./run -t plain -g deep -c Essex -r 1000 1400
.. _qs-ftnote:
.. rubric:: Footnotes
.. [1]
The timeline display has been tested in various browsers
and works without problems in Firefox and Safari on Linux and Mac
platforms. With Chrome, the ``--allow-file-access-from-files`` option
is required (on the command line when Chrome is started).
.. [2]
DEEP, `Digital Exposure of English Placenames
`_, was a JISC-funded
project to digitise and make available the 86 volumes of the Survey
of English Place-Names. See `placenames.org.uk
`_ for the source material it worked
with, which covers the evolution of placenames in England. The
86-volume county by county survey details over four million variant
forms, from classical sources, through the Anglo-Saxon period and
into medieval England and beyond to the modern period.