Principal goal: to investigate existing data placement strategies and to build a decision model that improves data placement when enacting data-intensive workflows.
Data-intensive methods are transforming the way research is conducted, as argued in The Fourth Paradigm [2]. Challenges arise in large-scale data-intensive research that involves running computations across institutions and continents, where the data can be heterogeneous (e.g., relational databases, XML documents, and files on various filesystems such as HDFS and NFS) and scattered across geographically distributed locations. Dealing with the complexity and heterogeneity of these data requires carefully planned data management strategies.
This proposal focuses on one of these data management issues, data placement, and attempts to answer the following research questions.
- What are the decisions involved in data placement?
- What data properties influence data placement decisions? (One plausible set of such properties is sketched after this list.)
- How can a strategy for making these decisions be assessed?
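To make the second question concrete, the sketch below enumerates one plausible set of dataset properties as a Python data structure. The field list is an assumption for illustration, not a taxonomy drawn from the literature; part of the project would be to discover empirically which properties actually matter and how to weight them.

```python
from dataclasses import dataclass

@dataclass
class DatasetProperties:
    """One plausible set of inputs to a placement decision (assumed fields)."""
    size_gb: float        # volume to be stored or moved
    data_format: str      # e.g. "relational", "XML", "HDFS file"
    location: str         # site currently holding the data
    access_pattern: str   # e.g. "sequential-scan", "random-read"
    mutable: bool         # whether the data changes during workflow enactment
```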
You will study existing workflow systems (e.g. Kepler, Pegasus) and platforms that support large-scale data analysis (e.g. Apache Pig, Microsoft Dryad, Meandre) to establish fundamental knowledge of the data management issues in handling data-intensive workflows, and construct a model to predict the performance of data placement decisions. You will then test the model with workflows constructed from real-world use cases in an existing project, ADMIRE [1]. The challenge is to show how varying the data placement strategy influences the execution performance of workflows. A successful project would devise a model that assists data placement decisions and thereby improves the performance of executing data-intensive workflows.
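As a rough illustration of what such a performance model might look like, the following Python sketch chooses a compute site that minimises estimated input transfer time. All site names, bandwidths, and data sizes are invented for exposition; a real model would also need to account for storage capacity, network contention, and the placement of outputs.

```python
# Minimal placement-cost sketch; all numbers below are assumed, not measured.

# Estimated bandwidth in MB/s between pairs of sites (hypothetical values).
BANDWIDTH_MBPS = {
    ("site_a", "site_a"): 1000.0,  # local access
    ("site_a", "site_b"): 40.0,    # wide-area link
    ("site_b", "site_a"): 40.0,
    ("site_b", "site_b"): 1000.0,
}

def transfer_time(size_mb: float, src: str, dst: str) -> float:
    """Estimated seconds to move size_mb megabytes from src to dst."""
    return size_mb / BANDWIDTH_MBPS[(src, dst)]

def best_placement(inputs: list[tuple[str, float]], candidates: list[str]) -> str:
    """Pick the candidate site minimising total input transfer time.

    inputs is a list of (holding_site, size_mb) pairs, one per input dataset.
    """
    def total_cost(site: str) -> float:
        return sum(transfer_time(size, src, site) for src, size in inputs)
    return min(candidates, key=total_cost)

if __name__ == "__main__":
    # Two inputs: 500 MB held at site_a, 2000 MB held at site_b.
    inputs = [("site_a", 500.0), ("site_b", 2000.0)]
    print(best_placement(inputs, ["site_a", "site_b"]))  # prints "site_b"
```

Validating and refining such estimates against measurements from real workflow enactments, such as those in ADMIRE, would be one natural way to assess a placement strategy (the third question above).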