Introduction

The Edinburgh Geoparser is a language processing tool designed to detect placename references in English text and ground them against an authoritative gazetteer so that they can be plotted on a map. The two main processes involved are entity recognition, to find the placename mentions and categorise them as such, followed by a ranking process that selects the likeliest location for each place from what may be a long list of candidates.

The Quick Start Guide explains how to install the software and start using it. The Practical Examples chapter gives worked examples of how to use it, with illustrations of the output produced.

The geoparser was developed by Claire Grover and Richard Tobin, of the Language Technology Group (LTG) in the School of Informatics at Edinburgh University. Over a number of years they and other colleagues from the LTG have refined and added to the geoparser’s functionality. Appendix 2: LTG Publications about the Geoparser contains a list of some published papers evaluating the geoparser’s performance relative to other similar systems, and discussing how it has been used by the LTG and our partners in various projects.

Like many linguistic tools of this kind, the geoparser software is designed to work in a “pipeline”, where the output of one process forms the input for the next. This construction gives flexibility and makes it relatively easy to switch components in and out: if you prefer your own tokeniser to ours, say, it is easy to make the substitution. The Pipeline chapter explains the two steps: geotagging, which finds the placenames, and georesolution, which grounds them in space. See the Geotagging section for details on changing the linguistic components. The Overview of Software Structure chapter contains flowcharts and diagrams of how the whole pipeline fits together.
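The pipeline idea can be illustrated with a minimal Python sketch. This is not how the geoparser itself is implemented; it only shows the general pattern of chaining stages so that one of them (here the tokeniser) can be swapped. All function names below are hypothetical, and the capitalisation check is a toy stand-in for real geotagging.

```python
import re


def simple_tokenise(text):
    """Hypothetical default tokeniser: words and punctuation become separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def whitespace_tokenise(text):
    """Hypothetical alternative tokeniser: split on whitespace only."""
    return text.split()


def tag_capitalised(tokens):
    """Toy stand-in for geotagging: keep tokens that start with a capital letter."""
    return [tok for tok in tokens if tok[:1].isupper()]


def run_pipeline(text, stages):
    """Feed the output of each stage into the next one, in order."""
    data = text
    for stage in stages:
        data = stage(data)
    return data


sentence = "We flew from Leith to Perth."

# Default arrangement of stages.
default_result = run_pipeline(sentence, [simple_tokenise, tag_capitalised])

# Same pipeline with a different tokeniser substituted for the first stage.
swapped_result = run_pipeline(sentence, [whitespace_tokenise, tag_capitalised])
```

Because each stage only depends on the output of the one before it, replacing the first stage changes the result (the whitespace tokeniser leaves the full stop attached to the last word) without any change to the later stage.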

The geoparser is configured to work with a number of different gazetteers, as explained in the Gazetteers chapter. Although primarily designed to detect and geo-locate spatial references, the pipeline has evolved to find and categorise other kinds of entity as well as locations, namely persons, organisations and time expressions. A range of visualisation files can be produced, including a display that shows all entity categories plus temporal events detected.

The geoparser works best with fairly short texts (up to a few pages), for reasons that are explained in the Georesolution section. Therefore, if you have a very large corpus to process, it’s advisable to divide it into smaller chunks.
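One possible way to divide a long text before processing is sketched below in Python. It is an illustration only, not part of the geoparser distribution: the function name and the max_words parameter are hypothetical, and the default of 1500 words is an arbitrary assumption rather than a limit recommended for the geoparser. The sketch splits on blank lines so that paragraphs are never cut in half.

```python
def split_into_chunks(text, max_words=1500):
    """Group blank-line-separated paragraphs into chunks of at most max_words words.

    A single paragraph longer than max_words is kept whole as its own chunk.
    The default limit is an arbitrary assumption, not a geoparser setting.
    """
    chunks = []
    current = []  # paragraphs collected for the chunk being built
    count = 0     # number of words in the chunk being built
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        n_words = len(para.split())
        # Start a new chunk if adding this paragraph would exceed the limit.
        if current and count + n_words > max_words:
            chunks.append("\n\n".join(current))
            current = []
            count = 0
        current.append(para)
        count += n_words
    if current:
        chunks.append("\n\n".join(current))
    return chunks


sample = "a b c\n\nd e\n\nf g h i"
sample_chunks = split_into_chunks(sample, max_words=5)
```

Each resulting chunk could then be written to its own file and processed separately.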

This documentation covers the downloadable version of the Edinburgh Geoparser, to be installed on your own local machine. There is also an online version embedded in the Edina [http://edina.ac.uk/] Unlock Text [http://edina.ac.uk/unlock/texts/] service, which is described in the Unlock chapter.

We expect the geoparser to continue to evolve, and already have plans for enhancements. We welcome suggestions and collaboration, so please get in touch if you have ideas about how we should develop the software.