Edinburgh Data-Intensive Research - MSc student project

Privacy Protection for a Brain Imaging Databank

David.Rodriguez — Fri, 20 Jan 2012 11:02:26 +0000

Student:

Jyothsna Vivekanand Shenoy

In recent years there has been an increasing trend towards releasing micro-data to the public. This can be very important for research, but in some cases (e.g. medical data) these releases are limited by privacy protection issues. Anonymisation is a limited solution that does not fully protect individuals. Even when all personal identifiers have been removed, it may still be possible to identify an individual from an anonymous record by using quasi-identifiers and linking the data with an external data source (see references).

In order to build a Normative Brain Imaging Bank we plan to collect a considerable amount of brain imaging, clinical, demographic and other data. To fully exploit the potential of such a resource the data has to be shared with other researchers.

The project consists of exploring the privacy risks of such a resource. The student will have access to the database schema, but not to the actual data. She will have to produce realistic simulated data to test her solutions. The outcome will be a recommendation on, among other things, which data items are safe to release.
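As an illustration of the kind of check that could be run on simulated data, the sketch below measures k-anonymity (see the Sweeney reference) over a chosen set of quasi-identifiers. The field names and records are hypothetical and are not taken from the databank schema; this is a minimal sketch, not the project's solution.

```python
from collections import Counter


def _key(record, quasi_identifiers):
    """Tuple of the record's quasi-identifier values."""
    return tuple(record[q] for q in quasi_identifiers)


def k_anonymity(records, quasi_identifiers):
    """Size of the smallest group of records that share the same
    quasi-identifier values (the k of k-anonymity); 0 for no records."""
    groups = Counter(_key(r, quasi_identifiers) for r in records)
    return min(groups.values()) if groups else 0


def risky_records(records, quasi_identifiers, k):
    """Records whose quasi-identifier group has fewer than k members."""
    groups = Counter(_key(r, quasi_identifiers) for r in records)
    return [r for r in records if groups[_key(r, quasi_identifiers)] < k]


# Simulated records; the field names are hypothetical.
simulated = [
    {"age_band": "60-69", "sex": "F", "postcode_area": "EH8"},
    {"age_band": "60-69", "sex": "F", "postcode_area": "EH8"},
    {"age_band": "70-79", "sex": "M", "postcode_area": "EH9"},
]
qi = ["age_band", "sex", "postcode_area"]

print(k_anonymity(simulated, qi))            # 1: one record is unique
print(len(risky_records(simulated, qi, 2)))  # 1 record falls below k = 2
```

A release candidate could then be judged by how many records fall below a chosen k once particular fields are dropped or generalised.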

https://projects.inf.ed.ac.uk/msc/project?number=P035

Supervisors @ NeSC:

David.Rodriguez

Project status:

Finished

Degree level:

MSc

Background:

Knowledge of databases. Programming skills.

Student project type:

MSc student project

References:

Fung, B. C. M., Wang, K., Chen, R. and Yu, P. S. "Privacy-preserving data publishing: A survey of recent developments". ACM Computing Surveys, Vol. 42, No. 4, Article 14.

Chen, B.-C., Kifer, D., LeFevre, K. and Machanavajjhala, A. "Privacy-Preserving Data Publishing". Foundations and Trends in Databases, Vol. 2, Nos. 1–2 (2009), pages 1–167.

Sweeney, L. "k-Anonymity: a model for protecting privacy". International Journal on Uncertainty, Fuzziness and Knowledge-Based Systems, 10(5), pages 557–570, 2002.

Samarati, P. "Protecting respondents' identities in microdata release". IEEE Transactions on Knowledge and Data Engineering, 13(6), pages 1010–1027, 2001.

Investigating the Rule Construction Mechanism in Ant-Miner

Paolo.Besana — Thu, 20 Jan 2011 14:34:16 +0000

Student:

Hariharan Anantharaman

This project will appeal to you if you are interested in Learning from Data and Nature-Inspired Computation.

Ant-Miner [1] is the product of the first application of Ant Colony Optimization [2] to the induction of classification rules. The developers of Ant-Miner state that rule pruning is important to improve the accuracy of the induced rules. FRANTIC [3] is a development of ACO for the induction of fuzzy classification rules, and early experiments indicate that the rule pruning procedure used in Ant-Miner does not significantly improve the accuracy of FRANTIC induced rules.

A possible explanation is that Ant-Miner constructs rule antecedents prior to determining their rule consequents, whilst the opposite is true for FRANTIC. During FRANTIC rule construction all terms in an antecedent are added on the basis of how well they describe the same predetermined consequent – the partial rule antecedent must always cover a pre-specified number of training instances of the same class (class-dependent rule construction). During Ant-Miner rule construction the partial rule antecedent must still cover a minimum number of instances, but no restriction is placed on their class label, i.e. the instances covered may belong to different classes (class-independent rule construction).
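The difference between the two coverage constraints can be sketched as follows. This is a simplified illustration with crisp attribute-value terms (FRANTIC itself uses fuzzy terms); the function names and the toy data are hypothetical and are not taken from either system.

```python
def covers(antecedent, instance):
    """True if the instance satisfies every (attribute, value) term."""
    return all(instance[attr] == value for attr, value in antecedent.items())


def valid_class_independent(antecedent, data, min_cases):
    """Ant-Miner style check: enough covered instances, whatever their class."""
    return sum(covers(antecedent, x) for x, _ in data) >= min_cases


def valid_class_dependent(antecedent, data, min_cases, target_class):
    """FRANTIC style check: enough covered instances of the predetermined class."""
    covered = sum(covers(antecedent, x) for x, y in data if y == target_class)
    return covered >= min_cases


# Toy training set of (instance, class label) pairs.
data = [
    ({"outlook": "sunny", "windy": "no"}, "play"),
    ({"outlook": "sunny", "windy": "yes"}, "stay"),
    ({"outlook": "rain", "windy": "no"}, "play"),
]
partial = {"outlook": "sunny"}

print(valid_class_independent(partial, data, 2))        # True: two instances covered
print(valid_class_dependent(partial, data, 2, "play"))  # False: only one is "play"
```

The same partial antecedent passes the class-independent check but fails the class-dependent one, which is the situation the conjecture about rule pruning concerns.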

[3] suggests that the Ant-Miner approach to rule construction necessitates rule pruning to resolve (remove) terms added due to instances with different class labels. However, there is no evidence to support or contradict this conjecture and this is what this project aims to provide.

This project provides the opportunity to explore a research question with the potential to advance our understanding of not merely one specific rule induction algorithm (Ant-Miner), but of several, i.e. algorithms that use class-dependent rule construction versus those that use class-independent rule construction. The project is challenging as it will necessitate identifying elements of the Ant-Miner algorithm that might require changes in order to make a comparison between the two rule construction approaches equitable.

Attachment	Size
MSc-2011-AM-RuleConstruction.pdf	354.92 KB

Supervisors @ NeSC:

Michelle.Galea

David.Rodriguez

Project status:

Finished

Degree level:

MSc

Student project type:

MSc student project

Investigating Array Databases for Managing Climate Data

Paolo.Besana — Thu, 20 Jan 2011 14:31:41 +0000

Student:

Jian Qiang

This is a challenging project and will appeal to students keen to make a contribution in the areas of scientific databases and geoinformatics.

The NCAR Command Language (NCL) is an interpreted language for scientific data analysis and visualisation (http://www.ncl.ucar.edu/). NCL is widely used for processing climate data, which are held in arrays, generally in netCDF format (http://www.unidata.ucar.edu/software/netcdf/index.html). rasdaman is an array database system that uses an SQL-style query language to retrieve and maintain multi-dimensional arrays of unlimited size stored in standard relational databases (http://rasdaman.eecs.jacobs-university.de/trac/rasdaman).

This project will load NetCDF climate data into a rasdaman database, and develop an interface to allow its use from NCL commands. It will explore the technical issues that arise, and the benefits and limitations of the rasdaman system and its associated query languages and APIs.

The student will have an opportunity to interact with different specialists as he or she will have access to additional resources and support from the:

- School of Geosciences, University of Edinburgh, re the provision of climate data, use of NCL, and the identification of a range of example queries and analyses; and,

- Large-Scale Scientific Information Services Research Group, Jacobs University, Bremen, the developers of rasdaman.

This project has a real potential to inform advances in the management and manipulation of large climate data resources, of a type becoming widely used in the continuing exploration of climate science, climate change and its impacts.

Attachment	Size
MSc-2011-rasdaman-1.pdf	276.98 KB

Supervisors @ NeSC:

Michelle.Galea

Paolo.Besana

Jos.Koetsier

Project status:

Finished

Degree level:

MSc

Subject areas:

Databases

Software Engineering

Student project type:

MSc student project

De-identification of faces in 2D DICOM images

Paolo.Besana — Wed, 19 Jan 2011 21:28:38 +0000

With the increasing resolution of MR and CT scans, it has become feasible to reconstruct detailed 3D images of faces.

Usually face de-identification in medical imaging is done after the reconstruction, i.e. in 3D (see references). Different techniques are used to this end including brain extraction, removal of facial features and deformation of the face surface.

In some scenarios this might not be adequate and it would be preferable to remove or alter, in the original 2D DICOM slices, the facial features that contribute most to face recognition (by humans). The student would have to review the literature to determine which features those are, but the eyes, mouth and nose seem the most probable candidates.

One possible approach is to use machine learning to identify those features and then remove or alter them without affecting any of the brain pixels. Nevertheless, other approaches might be explored. The student will evaluate the different possibilities and implement a prototype. The development will preferably be done in Java to ease the integration with existing software.
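The removal step can be sketched as below, assuming that a detector has already produced a bounding box for a facial feature and that a brain mask is available for the slice. Both inputs, and the function name, are hypothetical; the sketch works on a plain 2D list of pixel values, does not parse DICOM, and is written in Python for brevity rather than in the preferred Java.

```python
def deidentify_slice(pixels, feature_box, brain_mask, fill=0):
    """Return a copy of a 2D slice in which the pixels inside feature_box
    are set to fill, except pixels flagged as brain in brain_mask."""
    top, left, bottom, right = feature_box  # rows top..bottom-1, cols left..right-1
    out = [row[:] for row in pixels]        # leave the input slice unmodified
    for r in range(top, bottom):
        for c in range(left, right):
            if not brain_mask[r][c]:
                out[r][c] = fill
    return out


slice_2d = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
brain = [
    [False, False, False],
    [False, True, False],   # the centre pixel belongs to the brain
    [False, False, False],
]
print(deidentify_slice(slice_2d, (0, 0, 2, 2), brain))
# [[0, 0, 3], [0, 5, 6], [7, 8, 9]]
```

Altering rather than removing a feature would replace the constant fill with, for example, a blurred or deformed version of the same region.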

It would be desirable that the software handles both CT and MR images, although this would not be a requirement for completion. The formal evaluation of how good the de-identification is would require the participation of human observers and thus would be done better later on, but it would be good if the student defines the experiments for such an evaluation.

Attachment	Size
De-identification_of_Faces_in_2D.pdf	122.06 KB

Supervisors @ NeSC:

David.Rodriguez

Project status:

Still available

Degree level:

MSc

Other supervisors:

Trevor Carpenter

Subject areas:

Machine Learning/Neural Networks/Connectionist Computing

Student project type:

MSc student project

Scientific applications: exploiting the data bonanza. The microscopy case.

Paolo.Besana — Wed, 19 Jan 2011 21:25:50 +0000

The aim of the project is to perform some exploratory work on how to deal with the problem of I/O-bound processing, by implementing technology-specific components in a provided system. The goal is to distribute data and processing so that a CPU processes data locally, minimising data transfer. The assumption is that I/O is the major bottleneck in processing, and that computation could be done with less powerful (greener and cheaper) CPUs, rather than with a powerful CPU that wastes energy waiting for data.

Different technologies for storing and processing the data can be explored. More than one student can work on this challenge, and each student can explore a different technology. In particular:

1. Distributed image storage and processing in array-based databases: exploiting Rasdaman, SciDB or MonetDB.

2. Distributed image storage and processing with Hadoop and Sector/Sphere, implementations of MapReduce.

The influence of the storage support (HDD and SSD) on performance should be analysed as well. The students will use a new cluster of over 120 nodes, designed for dealing with I/O-bound problems: each node has a lightweight Atom CPU and 6 TB of storage, distributed between HDDs and SSDs.

Attachment	Size
mscmicroscopy.pdf	1.31 MB

Supervisors @ NeSC:

Paolo.Besana

Malcolm.Atkinson

Project status:

Finished

Degree level:

MSc

Subject areas:

e-Science

Databases

Distributed Systems

Software Engineering

Student project type:

MSc student project

Computing the best answer you can afford

Paolo.Besana — Wed, 19 Jan 2011 21:21:37 +0000

We are building a data-intensive machine as a research platform to explore data-intensive computational strategies. We are interested in computations over large bodies of data, where the data-handling is a dominant issue. Computational challenges with these properties are becoming ever more prevalent as digital sensors and computational/societal data sources become ever cheaper, more powerful and more ubiquitous. The use of algorithms over such data is of growing importance in medicine, planning, engineering, policy and science.

It is not possible to resource every computation that interests people, as the cost of accessing data is high (in time, joules or £). Therefore, there is interest in effective strategies for getting the best possible answer within some cost bound.

The project would explore strategies that take account of data locality and which make the best use of each data transfer. For example, each computing node samples data stored locally and when a disk access has been made, all of the records that can contribute to the answer are used.

More precisely: we seek the best answer to f(proj, D), where f is some function we can potentially understand, such as mean, max, min, k-largest, etc., and for which we imagine a well-provided library built by collaborations of Data-Intensive Distributed Computing (DIDC) engineers and Data Analysts. D is a (large) collection of data already distributed over nodes and storage units by an algorithm we may know, and may influence. We may use samples from astronomy, biology or seismology. proj is a user-defined function, such that proj(d_i) generates the value of interest to be analysed by f. So we want the best estimate we can afford of f(proj(d_i)) for all d_i in D. Note that we cannot have prior knowledge of proj; it can be any function that interests the person analysing the data. We cannot predict its cost, but we can learn it as the algorithm runs, or over successive runs. That learning is probably beyond the scope of the student project.

A naive approach would be to keep choosing d_i randomly from D, looking it up, and using that value, iterating until time runs out (strictly, until just enough time remains for the aggregation part of the algorithm to run). Such an approach would make pathologically poor use of the available time t, as it would access whole blocks or whole files to use one object, and those files would not be co-located with the process evaluating f, so there would also be a lot of communication delays.

So the approach the student would take would be to invent a new function f_p that does part of the work, and organise a framework for running f_p as follows:

Clone f_p across n (a control or tunable parameter) nodes of the computational framework as f_{p,k} workers. If we run for time t, then nt is a reasonable indicator of costs. Each f_{p,k} then randomly accesses local data units and uses all of the d_i in each data unit u_j read. The f_{p,k} either stream intermediary results or send final results to an aggregator, ag, when t has elapsed. The results sent are a tuple or some such.

In that tuple, u_n is the number of data storage units processed, u_{max} is the number of data storage units available locally, \hat{f_k} is the local estimate of f(proj(D)), \hat{\epsilon_k} is the local estimate of the error, and count_k is the number of d_i that were sampled.
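A worker f_{p,k} of this kind can be sketched for the case f = mean. The sketch assumes that local storage units are available as in-memory lists, that the result tuple is ordered (u_n, u_max, f_hat, eps_hat, count), and that the error is approximated by a standard error that treats the sampled records as independent (which sampling whole units does not strictly guarantee). The function name and signature are hypothetical.

```python
import random
import time


def worker(local_units, proj, time_budget, rng=None, clock=time.monotonic):
    """Read local storage units in random order until time_budget seconds
    have elapsed, using every record of each unit read, and return
    (u_n, u_max, f_hat, eps_hat, count) for f = mean."""
    rng = rng or random.Random()
    order = list(range(len(local_units)))
    rng.shuffle(order)                 # random access, no unit read twice
    deadline = clock() + time_budget
    u_n = count = 0
    total = total_sq = 0.0
    for j in order:
        if clock() >= deadline:
            break
        for d in local_units[j]:       # use all d_i in the unit just read
            v = proj(d)
            total += v
            total_sq += v * v
            count += 1
        u_n += 1
    if count == 0:
        return (u_n, len(local_units), None, None, 0)
    mean = total / count
    variance = max(total_sq / count - mean * mean, 0.0)
    eps = (variance / count) ** 0.5    # rough standard error (assumption)
    return (u_n, len(local_units), mean, eps, count)


units = [[1, 2, 3], [4, 5], [6]]
print(worker(units, float, time_budget=60.0))
```

With a generous time budget the worker reads all three units, so the local estimate equals the exact local mean of 3.5; with a tight budget it returns an estimate from whichever units it reached.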

The aggregator would produce its best estimate \hat{f}(proj(D)) of f(proj(D)), its best estimate of the error \hat{\epsilon} and an estimate of coverage p, the percentage of D sampled.
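One way the aggregator ag might combine the workers' results for f = mean is sketched below. It assumes each worker sends a tuple ordered (u_n, u_max, f_hat, eps_hat, count), that the total number of records in D is known, and that the local estimates are independent; the function name is hypothetical and other combination rules are possible.

```python
def aggregate(results, total_records):
    """Combine worker tuples (u_n, u_max, f_hat, eps_hat, count) into a
    count-weighted mean, a combined error estimate and the coverage p (%)."""
    used = [r for r in results if r[4] > 0]   # ignore workers that read nothing
    n = sum(r[4] for r in used)
    if n == 0:
        return None, None, 0.0
    f_hat = sum(r[2] * r[4] for r in used) / n
    # Error of a weighted mean of independent estimates (assumption).
    eps = sum((r[4] / n) ** 2 * r[3] ** 2 for r in used) ** 0.5
    coverage = 100.0 * n / total_records
    return f_hat, eps, coverage


results = [
    (2, 4, 10.0, 1.0, 50),   # worker 1: local mean 10.0 from 50 records
    (1, 4, 20.0, 2.0, 50),   # worker 2: local mean 20.0 from 50 records
]
print(aggregate(results, total_records=400))
```

Here the two workers contribute equally, so the global estimate is 15.0 with 25% of D sampled.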

These would then be fed to a user or consumer algorithm.
Students would decide on a data-distribution description schema, e.g. that says how D is distributed over nodes, how it is divided into storage units on each node, and maybe how it was randomised on arrival. They can make this harder, by moving from simple statistical functions towards progressively more challenging data-mining algorithms. They can show how their strategy progressively improves over the naive method. They can show their indicators of result quality are indicative of the goodness of the result by comparing with an exhaustive computation.

A more advanced, streaming framework might be tried by an ambitious student.

We now have F(f, t_0, e, r, D), that when called generates a similar distributed algorithm to the above, where t_0 is time to run, f is the function to evaluate, e is an estimate of f(D), r is an estimate of the current error, coverage, etc.

After the algorithm has distributed its elements, and returned a result to show the user, or to satisfy a consumer, it repeats, with another time step t_1, to obtain a better answer, taking care not to re-use d_i used in step 1 and so on. t_i might be supplied as a Fibonacci series, and the process run continuously until complete coverage or consumer/user says, "Enough".
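The repeated refinement can be sketched on a single node as follows. To keep the sketch deterministic, the time step t_i is translated into a number of storage units that can be read in that step (the parameter units_per_time is an assumption), f is taken to be the mean, and the names are hypothetical.

```python
import random


def fibonacci_steps():
    """Yield the time steps 1, 1, 2, 3, 5, ..."""
    a, b = 1, 1
    while True:
        yield a
        a, b = b, a + b


def progressive_mean(units, proj, units_per_time=1,
                     enough=lambda estimate, coverage: False, seed=0):
    """Yield (t_i, estimate, coverage) after each step, never re-reading a
    unit, until D is fully covered or the consumer says "Enough"."""
    remaining = list(range(len(units)))
    random.Random(seed).shuffle(remaining)     # random order, fixed up front
    n_records = sum(len(u) for u in units)
    total, count = 0.0, 0
    for t in fibonacci_steps():
        budget = t * units_per_time
        batch, remaining = remaining[:budget], remaining[budget:]
        for j in batch:
            for d in units[j]:
                total += proj(d)
                count += 1
        estimate = total / count if count else None
        coverage = 100.0 * count / n_records if n_records else 100.0
        yield t, estimate, coverage
        if not remaining or enough(estimate, coverage):
            return


units = [[1], [2], [3], [4], [5]]
for step in progressive_mean(units, float):
    print(step)
```

The list of unread units is the local housekeeping that has to persist between steps; once it is empty the estimate is exact and the coverage is 100%.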

This may work as a student project (or two), and the second framework is easily expressed in DISPEL, given a supply of incremental algorithms of the first sort. However, the local housekeeping that remembers which data units have been sampled would now need to persist over the iterations, even though a given node might not be sampled again, and it would need to be cleaned up at the end.

Attachment	Size
MScAffordableMalcolmAtkinson20110119.pdf	98.98 KB

Supervisors @ NeSC:

Malcolm.Atkinson

Paolo.Besana

Project status:

Still available

Degree level:

MSc

Subject areas:

e-Science

Algorithm Design

Student project type:

MSc student project

Runoff prediction from a Hydrologic Spatio-Temporal Database

Jano.van.Hemert — Mon, 05 Apr 2010 13:13:33 +0000

Student:

Charalampos Sfyrakis

Grade:

first

Present-day instrumentation networks in rivers provide huge quantities of multi-dimensional data. Although there are numerous machine learning tools that can extract trends, find patterns and predict future states given some data, it is crucial to tune these techniques to the semantic content of the data. Hydrology is a data-intensive science, which requires efficient mining of trajectories of measurements taken at different time points and positions. The underlying dynamics are highly non-linear when examined in a short time window (minutes) and become chaotic in the long term, although they are governed by a periodic annual process. In this project we will deal with a multi-dimensional time-series dataset, collected in parallel from multiple sampling and control hydrological stations on the Orava River in Slovakia. We will investigate how we could optimize and tweak conventional prediction systems in order to make full use of the spatio-temporal hydrological data we have obtained.

The purpose of this project is to try to predict the water height at several locations along the Orava River. These values could further be used to predict floods. In the first case we have a regression problem, while in the second we have a classification problem. In order to construct a prediction system, we will not use anything from conventional hydraulics theory, but rather use machine learning tools. The data used are collected from 16 hydrological measurement stations under the Orava reservoir. An initial goal is to make predictions based on each station's measurements only. The predictions could be the expected water height trend (up/down), the expected runoff, and a long-term water height that could be used for flood prediction. Afterwards, we will try to integrate all the stations' measurements and base our predictions on this new dataset. The challenges of this task are to extract as much information as possible from these multiple data sources, and to properly design a data mining procedure that takes the semantic content of this information into account. We could then try to minimize the time that the model needs for training.
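For a single station, the regression and classification views of the problem can be derived from the same series, as in the sketch below. The window length, the function name and the toy values are assumptions for illustration and do not reflect the Orava dataset.

```python
def make_windows(heights, n_lags):
    """Turn one station's water-height series into training examples
    (lags, next_value, trend): next_value is a regression target and
    trend ("up"/"down") a classification label."""
    examples = []
    for t in range(n_lags, len(heights)):
        lags = heights[t - n_lags:t]
        target = heights[t]
        # A tie with the last observed value is labelled "down" here.
        trend = "up" if target > lags[-1] else "down"
        examples.append((lags, target, trend))
    return examples


series = [1.0, 1.2, 1.1, 1.4]  # toy water heights for one station
for example in make_windows(series, n_lags=2):
    print(example)
# ([1.0, 1.2], 1.1, 'down')
# ([1.2, 1.1], 1.4, 'up')
```

Integrating several stations would amount to concatenating the lag windows of neighbouring stations into one feature vector per time point.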

Supervisors @ NeSC:

Jano.van.Hemert

Project status:

Finished

Degree level:

MSc

Background:

data mining

Subject areas:

e-Science

Machine Learning/Neural Networks/Connectionist Computing

Projects:

ADMIRE

Student project type:

MSc student project

Accelerating Genome-Wide Association Studies with Graphics Processors

Jano.van.Hemert — Fri, 22 Jan 2010 13:04:00 +0000

Student:

Jeff Poznanovic

Grade:

first

Principal goal: to substantially improve the performance of the data-intensive analysis for genome-wide association studies (GWAS) by using graphics processing units (GPUs).

Description of the project: Genome-wide association studies are performed in order to detect the genetic variations associated with a certain disease. To conduct the study, researchers obtain the genomes from two large groups of participants: people who have the disease of interest, and people who do not have the disease. By doing massive-scale genetic comparisons between the two groups, the genetic variations that occur far more frequently in the diseased group can be "associated" with the disease. The knowledge gained from the association studies can help to pinpoint the genetic factors that cause the disease.
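The comparison between the two groups can be illustrated, for a single genetic variant, by a Pearson chi-square statistic on a 2x2 table of allele counts. This is a textbook test used here as an assumption about what a basic association test looks like; it is not taken from the PLINK source code, and the counts are invented.

```python
def allelic_chi_square(case_alt, case_ref, control_alt, control_ref):
    """Pearson chi-square statistic (1 degree of freedom) for a 2x2 table
    of allele counts in the case and control groups."""
    a, b, c, d = case_alt, case_ref, control_alt, control_ref
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # a row or column is empty, so no association is measurable
    return n * (a * d - b * c) ** 2 / denominator


# Variant allele seen 30 times in 100 case alleles, 10 times in 100 control alleles.
print(allelic_chi_square(30, 70, 10, 90))  # 12.5
```

A genome-wide study repeats such a test independently for every variant, which is what makes the workload a natural fit for massively parallel hardware.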

Due to the extremely large datasets involved in GWAS, it is essential to use the most efficient processing methods that are currently available. Although GPUs are not generally applicable to all computing tasks, they have the capability to significantly speed up many types of data-intensive applications due to the GPUs' massively parallel architecture. Only within the last few years have programmers been able to utilise GPUs for non-graphical applications; GPU-targeted compilers now exist that take high-level, general-purpose languages as input (e.g., OpenCL and CUDA).

PLINK is a popular open-source whole-genome association analysis toolset. This project aims to port PLINK's most time-consuming code regions into CUDA and OpenCL, in order to run those code regions on a graphics processor. The steps to complete this goal include the following:
1) Performance profile the PLINK software and identify the "hotspots" with respect to the tasks required to perform GWAS
2) Design and implement the hotspots for GPU in CUDA and OpenCL, which will likely require a significant amount of code refactoring/redesign
3) Attempt various architecture-specific optimisations and compare the hardware/software limitations of the GPU against other forms of parallelisation and distributed systems
4) Compare the performance results of the CUDA and OpenCL implementations using real-world GWAS data

Supervisors @ NeSC:

Jano.van.Hemert

Project status:

Finished

Degree level:

MSc

Other supervisors:

Dave Liewald, Centre for Cognitive Ageing and Cognitive Epidemiology. Gail Davies, Centre for Cognitive Ageing and Cognitive Epidemiology.

Subject areas:

e-Science

Bioinformatics

Computer Architecture

Distributed Systems

Parallel Programming

Student project type:

MSc student project

References:

NIH National Human Genome Research Institute, "Genome-wide association studies," http://www.genome.gov/20019523

PLINK, http://pngu.mgh.harvard.edu/~purcell/plink

CUDA, http://www.nvidia.com/object/cuda_home.html

OpenCL, http://www.khronos.org/opencl/

Parameter fitting of cosmological models using billions of galaxies

Jano.van.Hemert — Fri, 15 Jan 2010 19:45:05 +0000

Student:

Martha Axiak

Grade:

first

Principal goal: to develop, test and make available to the cosmology community a parameter estimation method for models that explain our dark Universe.

Cosmology is undergoing a transformation. The standard cosmological model is dominated by two components, dark matter and dark energy, that collectively account for 96% of the Universe's total energy budget, and yet whose nature is entirely unknown. Dark matter and dark energy cannot be explained by modern physics. The illumination of the nature of these fundamental constituents of our Universe will mark a revolution in physics, impacting particle physics and cosmology, and will require new physics beyond the standard model of particle physics, general relativity or both.

To understand our dark Universe cosmologists have developed complex statistical tools that require information from many billions of galaxies. To create predictions using non-standard models that can then be used to predict the future outcome of experiments, or analyse cosmological data, many models have been developed that exist in the form of software packages, such as cosmomc (http://cosmologist.info/cosmomc/), LAMBDA (http://lambda.gsfc.nasa.gov/), iCosmo (http://www.icosmo.org).

This project will involve the implementation of parameter estimation methodologies based on machine learning and optimisation to aid in fitting the parameters of the existing models. A possible method is to use evolutionary computation.
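As a minimal illustration of parameter fitting by evolutionary computation, the sketch below uses a simple (1+1) evolution strategy to minimise the sum of squared residuals of a toy straight-line model. The model, the data and all names are assumptions for illustration; fitting a real cosmological model would replace the toy model with a call into one of the existing packages.

```python
import random


def fit_parameters(model, data, initial, sigma=0.1, generations=2000, seed=0):
    """Fit model parameters with a (1+1) evolution strategy: mutate the
    current best parameters with Gaussian noise and keep the mutant only
    if its sum of squared residuals is no worse."""
    rng = random.Random(seed)

    def cost(params):
        return sum((model(x, params) - y) ** 2 for x, y in data)

    best, best_cost = list(initial), cost(initial)
    for _ in range(generations):
        child = [p + rng.gauss(0.0, sigma) for p in best]
        child_cost = cost(child)
        if child_cost <= best_cost:
            best, best_cost = child, child_cost
    return best, best_cost


def line(x, params):
    """Toy model: y = a * x + b."""
    a, b = params
    return a * x + b


data = [(0, 1), (1, 3), (2, 5), (3, 7)]  # generated from a = 2, b = 1
params, residual = fit_parameters(line, data, initial=[0.0, 0.0])
print(params, residual)
```

Because a mutant is accepted only when it is no worse, the residual can never increase between generations; the step size sigma and the number of generations are tunable.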

To make the solution available to the cosmology community, the goal is then to create a computational portal that allows cosmologists to search parameter space, given some data, on-line. This is a fraction of the work required, but an important step to get the result of your other efforts used.

Bonus if you can develop better models than the existing ones (where better is a combination of how well the models fit the data and how plausible they are in terms of explaining the universe!)

Supervisors @ NeSC:

Jano.van.Hemert

Project status:

Finished

Degree level:

MSc

Other supervisors:

Tom Kitching, Institute for Astronomy, Edinburgh; tdk@roe.ac.uk, tom.kitching@googlemail.com

Background:

Evolutionary computation, optimisation, machine learning and/or statistics are all desirable.

Subject areas:

Genetic Algorithms/Evolutionary Computing

Machine Learning/Neural Networks/Connectionist Computing

WWW Tools and Programming

Student project type:

MSc student project

References:

A good review of the statistical methods used in cosmology, with some further suggested references, is at http://xxx.lanl.gov/abs/0911.3105; chapter 13 goes into some discussion of the Monte Carlo methods we use.

The standard tool for cosmological parameter estimation is cosmomc: http://cosmologist.info/cosmomc/

The original paper for cosmomc is http://arxiv.org/abs/astro-ph/0205436 and the first application is http://arxiv.org/abs/astro-ph/0302306

A slightly more advanced nested sampling method, called multinest, is described at http://xxx.lanl.gov/abs/0809.3437

A general discussion of the current status of cosmology is http://xxx.lanl.gov/abs/astro-ph/0610906, though be warned that it contains some technical details (and a lot of acronyms).

Data mining to identify small molecules with bioactivity

Jano.van.Hemert — Fri, 15 Jan 2010 19:31:47 +0000

Student:

Gideon Jansen Van Vuuren

Grade:

first

Principal goal: to apply machine learning to identify small molecules that are likely candidates to have relevant bioactivity for follow-up wet-lab experiments.

Nowadays, biologists routinely screen hundreds of small molecules to test them for bioactivity. The screens return many hit compounds that could be interesting for follow-up experiments. Quite often there are too many hits to be followed up and many of the molecules are ubiquitous binders.

One way to deal with this is to classify the hit molecules into groups by learning from their molecular structure and properties. To do that, a dataset of molecules and experimental results is submitted so that the compounds are classified based on their structure and on features that correspond to bioactivity in the screen.

Right now the current tools are only capable of handling binned results: the researchers have to specify which results are positive/negative hits or belong to specific groups A, B, C, etc. Your job would be to deliver a quantitative refinement of the classification. For example, instead of binning the results, you could assign a real-valued number that represents the interestingness of each hit compound, or alternatively rank the hit compounds.

This project will allow you to apply machine learning algorithms to real data and to work with chemo-informatics libraries.

This project has the potential to be published as an application note in the journal Bioinformatics when finished.

Supervisors @ NeSC:

Jano.van.Hemert

Project status:

Finished

Degree level:

MSc

Other supervisors:

Jan Wildenhain, Tyers Lab, School of Biological Sciences (http://tyerslab.bio.ed.ac.uk/lisa/indPage.php?id=jwil315). Michaela Spitzer, Tyers Lab, School of Biological Sciences.

Background:

Machine learning essential, biology/bioinformatics desirable.

Subject areas:

Bioinformatics

Machine Learning/Neural Networks/Connectionist Computing

Student project type:

MSc student project