The number of databases that contain biomedical data is increasing rapidly. Many of these databases are stand-alone, which makes it difficult for researchers to run queries and analyses over data that spans several of them.
To make these queries and analyses possible, three essential principles need to be followed: (1) a uniform way must exist in which data can be accessed, regardless of the form in which it is stored; (2) a mechanism must exist by which queries can be formulated flexibly, so that researchers can explore new combinations of results from several sources; (3) a reference system must be in place that allows researchers and tools to identify data items correctly, so that the relationships between them stay consistent.
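The three principles can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not a description of any existing system: the class names, the `join_on` helper, the `ref_id` field, and all records are made up for this example. Two sources that store their data differently expose the same `query` method (principle 1), the caller chooses the filters freely (principle 2), and results are combined through a shared identifier (principle 3).

```python
from typing import Dict, Iterable, List, Protocol


class DataSource(Protocol):
    """Principle 1: every source answers queries through the same method."""

    def query(self, **filters: str) -> List[Dict[str, str]]: ...


class InMemoryTable:
    """Hypothetical source that stores its records as a list of rows."""

    def __init__(self, rows: Iterable[Dict[str, str]]) -> None:
        self._rows = list(rows)

    def query(self, **filters: str) -> List[Dict[str, str]]:
        # Keep only the rows that match every requested field value.
        return [r for r in self._rows
                if all(r.get(k) == v for k, v in filters.items())]


class KeyedStore:
    """Hypothetical source that stores its records in a dict keyed on an identifier."""

    def __init__(self, records: Dict[str, Dict[str, str]], key: str) -> None:
        self._records = records
        self._key = key

    def query(self, **filters: str) -> List[Dict[str, str]]:
        # Expose the dict key as an ordinary field so both sources look alike.
        rows = [{self._key: k, **v} for k, v in self._records.items()]
        return [r for r in rows
                if all(r.get(k) == v for k, v in filters.items())]


def join_on(left: DataSource, right: DataSource, key: str,
            **filters: str) -> List[Dict[str, str]]:
    """Principles 2 and 3: filter one source freely, then follow the
    shared identifier `key` into the other source."""
    combined = []
    for row in left.query(**filters):
        for match in right.query(**{key: row[key]}):
            combined.append({**row, **match})
    return combined


# Made-up records; "ref_id" stands in for a shared external identifier.
proteins = InMemoryTable([
    {"protein": "ExampleA", "compartment": "nucleolus", "ref_id": "X1"},
    {"protein": "ExampleB", "compartment": "nuclear speckle", "ref_id": "X2"},
])
disorders = KeyedStore({"X1": {"disorder": "hypothetical disorder"}}, key="ref_id")

result = join_on(proteins, disorders, "ref_id", compartment="nucleolus")
print(result)
```

If the shared identifier were missing or inconsistent between the two sources, the join would silently return nothing, which is why the reference system matters as much as the uniform access method.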
The MRC Human Genetics Unit in Edinburgh maintains a database that collects information about vertebrate proteins localised in the cell nucleus: the Nuclear Protein Database (https://npd.hgu.mrc.ac.uk/). The data is carefully curated by the group leader and contains many links to other biomedical resources such as Entrez, OMIM, and PubMed, so the database adheres to the third principle. However, it does not adhere to the first principle, as it can be accessed only through a web page that offers a simple text query.
In this project, you will Grid-enable this resource by making it available through a web service. A group at the University of Amsterdam wants to use this service through Taverna (http://taverna.sourceforge.net/), a workflow workbench for bioinformatics. They can supply you with queries that test whether the service offers useful ways of accessing the data in the database. One candidate technology for Grid-enabling is OGSA-DAI (http://www.ogsadai.org.uk/), which already integrates with Taverna. A secondary task is to examine the database more closely and to make it easier for the researchers themselves to maintain, preferably through a web-based interface for editing its content.