Principal goal: to investigate existing data placement strategies and to build a decision model that improves data placement when enacting data-intensive workflows.
Data-intensive methods are transforming the way research is conducted, as argued in The Fourth Paradigm [2]. Challenges arise in large-scale data-intensive research that involves running computations across institutions and continents, where the data can be heterogeneous (e.g., relational databases, XML documents, and files on various filesystems such as HDFS and NFS) and scattered across geographically distributed locations. Dealing with the complexity and heterogeneity of these data requires carefully planned data management strategies.
This proposal focuses on one of these data management issues, data placement, and attempts to answer the following research questions.
- What are the decisions involved in data placement?
- What data properties influence data placement decisions? (One plausible set of such properties is sketched after this list.)
- How can a strategy for making these decisions be assessed?
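To make the second question concrete, the sketch below enumerates one plausible set of dataset properties as a Python data structure. The field list is an assumption for illustration, not a taxonomy drawn from the literature; part of the project would be to discover empirically which properties actually matter and how to weight them.

```python
from dataclasses import dataclass

@dataclass
class DatasetProperties:
    """One plausible set of inputs to a placement decision (assumed fields)."""
    size_gb: float        # volume to be stored or moved
    data_format: str      # e.g. "relational", "XML", "HDFS file"
    location: str         # site currently holding the data
    access_pattern: str   # e.g. "sequential-scan", "random-read"
    mutable: bool         # whether the data changes during workflow enactment
```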
You will study existing workflow systems (e.g. Kepler, Pegasus) and platforms that support large-scale data analysis (e.g. Apache Pig, Microsoft Dryad, Meandre) to establish fundamental knowledge of the data management issues in handling data-intensive workflows, and construct a model to predict the performance of data placement decisions. You will then test the model with workflows constructed from real-world use cases in an existing project, ADMIRE [1]. The challenge is to show how varying the data placement strategy influences the execution performance of workflows. A successful project would devise a model that assists data placement decisions and thereby improves the performance of executing data-intensive workflows.
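As a rough illustration of what such a performance model might look like, the following Python sketch chooses a compute site that minimises estimated input transfer time. All site names, bandwidths, and data sizes are invented for exposition; a real model would also need to account for storage capacity, network contention, and the placement of outputs.

```python
# Minimal placement-cost sketch; all numbers below are assumed, not measured.

# Estimated bandwidth in MB/s between pairs of sites (hypothetical values).
BANDWIDTH_MBPS = {
    ("site_a", "site_a"): 1000.0,  # local access
    ("site_a", "site_b"): 40.0,    # wide-area link
    ("site_b", "site_a"): 40.0,
    ("site_b", "site_b"): 1000.0,
}

def transfer_time(size_mb: float, src: str, dst: str) -> float:
    """Estimated seconds to move size_mb megabytes from src to dst."""
    return size_mb / BANDWIDTH_MBPS[(src, dst)]

def best_placement(inputs: list[tuple[str, float]], candidates: list[str]) -> str:
    """Pick the candidate site minimising total input transfer time.

    inputs is a list of (holding_site, size_mb) pairs, one per input dataset.
    """
    def total_cost(site: str) -> float:
        return sum(transfer_time(size, src, site) for src, size in inputs)
    return min(candidates, key=total_cost)

if __name__ == "__main__":
    # Two inputs: 500 MB held at site_a, 2000 MB held at site_b.
    inputs = [("site_a", 500.0), ("site_b", 2000.0)]
    print(best_placement(inputs, ["site_a", "site_b"]))  # prints "site_b"
```

Validating and refining such estimates against measurements from real workflow enactments, such as those in ADMIRE, would be one natural way to assess a placement strategy (the third question above).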