Historical Interest Only

This is a static HTML version of an old Drupal site. The site is no longer maintained and could be deleted at any point. It is only here for historical interest.

Improving Data Placement Strategy in Data-intensive Computations

Student: Yue Ma
Grade: third

Principal goal: to investigate existing data placement strategies and build a decision model that improves data placement when enacting data-intensive workflows.

It is widely recognised that data-intensive methods are transforming the way research is conducted, as described in The Fourth Paradigm [2]. Challenges arise in large-scale data-intensive research that involves running computations across institutions and continents, where the data can be heterogeneous (e.g., relational databases, XML, physical files on various filesystems such as HDFS or NFS) and scattered across geographically distributed locations. Dealing with the complexity and heterogeneity of these data requires carefully planned data management strategies.

This proposal focuses on one of the data management issues: data placement, and attempts to answer the following research questions.
- What are the decisions involved in data placement?
- What are the data properties that influence the decisions of data placement?
- How can strategies for making these decisions be assessed?

You will study existing workflow systems (e.g. Kepler, Pegasus) or platforms that support large-scale data analysis (e.g. Apache Pig, Microsoft Dryad, Meandre) to establish the fundamental knowledge of data management issues in handling data-intensive workflows, and construct a model to predict the performance of data placement decisions. You will then test the model with workflows constructed from real-world use cases in an existing project called ADMIRE [1]. The challenge is to show how varying the data placement strategy influences the execution performance of workflows. A successful project would devise a model that assists data placement decisions and improves the performance of executing data-intensive workflows.
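To make the idea of a data placement decision model concrete, here is a minimal sketch in Java. All site names, capacity figures, and the cost formula are hypothetical illustrations, not part of the project: it simply estimates, for each candidate site, the time to move the data plus the time to process it, and picks the cheapest site.

```java
import java.util.Comparator;
import java.util.List;

/**
 * Toy cost model for data placement (illustrative only).
 * Cost(site) = transfer time (data size / bandwidth)
 *            + compute time (workload / throughput).
 */
public class PlacementModel {

    /** A candidate execution site with hypothetical capacity figures. */
    record Site(String name, double bandwidthMBps, double throughputOpsPerSec) {}

    static double estimatedCost(Site s, double dataSizeMB, double workloadOps) {
        double transferTime = dataSizeMB / s.bandwidthMBps;        // seconds to move the data
        double computeTime = workloadOps / s.throughputOpsPerSec;  // seconds to process it
        return transferTime + computeTime;
    }

    /** Choose the site with the lowest estimated total cost. */
    static Site bestPlacement(List<Site> sites, double dataSizeMB, double workloadOps) {
        return sites.stream()
                .min(Comparator.comparingDouble(s -> estimatedCost(s, dataSizeMB, workloadOps)))
                .orElseThrow();
    }

    public static void main(String[] args) {
        List<Site> sites = List.of(
                new Site("local-hdfs", 500.0, 1_000.0),    // fast network, slower nodes
                new Site("remote-grid", 50.0, 10_000.0));  // slow network, faster nodes
        Site choice = bestPlacement(sites, 10_000.0, 1_000_000.0);
        System.out.println(choice.name()); // for this compute-heavy workload: remote-grid
    }
}
```

Even this toy model illustrates the trade-off the project investigates: for a compute-heavy workload it can pay to ship data over a slow link to faster nodes, whereas for a data-heavy workload the local site wins. A real model would need many more data properties (size, locality, access pattern, format) as inputs.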

Project status: Finished
Degree level: MSc
Background: Distributed/parallel computing and databases desirable. Java programming essential.
Subject areas: 
Computer Architecture
Distributed Systems
System Level Integration
References: 
[1] M. Atkinson, P. Brezany, O. Corcho, L. Han, J. van Hemert, L. Hluchý, A. Hume, I. Janciak, A. Krause, and D. Snelling. ADMIRE White Paper: Motivation, Strategy, Overview and Impact. Technical Report version 0.9, ADMIRE, EPCC, University of Edinburgh, January 2009.
[2] T. Hey, S. Tansley, and K. Tolle, editors. The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Research, 2009.