Historical Interest Only

This is a static HTML version of an old Drupal site. The site is no longer maintained and could be deleted at any point. It is only here for historical interest.

Accelerating data intensive applications using MapReduce

Student: 
Hwee Yong Ong
Grade: 
first

Principal goal: by way of a real case study in the Life Sciences, the goals of this project are: 1) understanding data-parallel processing with the MapReduce model to address performance issues in data-intensive applications; 2) investigating how to adapt data mining algorithms to the MapReduce model; 3) prototyping and comparing performance with other frameworks that support data-intensive applications.

Performance is an open issue in data-intensive applications, e.g., distributed data mining and integration. The MapReduce programming model [1], originally developed by Google, has become a popular paradigm due to its simplicity and scalability at low cost: it can easily parallelise data processing over large-scale data centres with thousands of computing nodes, handle data on terabyte and petabyte scales, and thereby improve system performance. The model provides a simple interface of two functions that allows developers to parallelise data processing tasks. The map function performs grouping, producing intermediate data sets, and the reduce function performs aggregation, combining those intermediate data sets into smaller data sets.

This project will apply the MapReduce model to a real data mining use case in the Life Sciences, EURExpress-II [2, 3], which aims to automatically annotate anatomical components in an image with corresponding terms stored in an ontology database. Performance will be evaluated in a comparison study between the prototype built in this project and the prototypes of the ADMIRE project [4], which is conducting research into architectures for large-scale and long-running data-intensive computations.

Through this project, a student will learn at three levels: 1) at the conceptual level, understanding data-parallel frameworks for supporting large-scale data mining and integration applications; 2) from an algorithmic perspective, investigating the adaptation of data mining algorithms to the MapReduce model; 3) from a practical point of view, gaining practical programming skills via the architectural implementation and learning to think critically through a comparison study.
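The two-function interface described above can be illustrated with the classic word-count example. The sketch below, in Java (the background language listed for this project), is a minimal single-machine simulation of the pattern, not the Google or Hadoop implementation: map emits (word, 1) pairs, the "framework" groups them by key, and reduce sums each group. The class and method names are illustrative only.

```java
import java.util.*;
import java.util.stream.*;

// Minimal single-machine sketch of the MapReduce pattern (word count).
// A real framework would run map and reduce tasks in parallel across
// many nodes; here both phases run sequentially in one process.
public class WordCountSketch {

    // Map phase: emit an intermediate (word, 1) pair for every word.
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.toLowerCase().split("\\s+")) {
                if (!word.isEmpty()) {
                    pairs.add(Map.entry(word, 1));
                }
            }
        }
        return pairs;
    }

    // Shuffle + reduce phase: group intermediate pairs by key and
    // aggregate each group by summing its values.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        return pairs.stream().collect(Collectors.groupingBy(
                Map.Entry::getKey,
                Collectors.summingInt(Map.Entry::getValue)));
    }

    public static void main(String[] args) {
        List<String> input = List.of("the cat sat", "the cat ran");
        Map<String, Integer> counts = reduce(map(input));
        // Prints each word with its total count across all input lines.
        System.out.println(counts);
    }
}
```

Because the grouping and aggregation logic lives in the framework, the developer only writes the two functions; this is the simplicity that makes the model easy to parallelise over many nodes.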

Project status: 
Finished
Degree level: 
MSc
Background: 
Knowledge of programming in Java; Database, Data mining and integration, and distributed computing.
Supervisors @ NeSC: 
Liangxiu Han
Subject areas: 
e-Science
Algorithm Design
Computer Architecture
Computer Communication/Networking
Distributed Systems
Parallel Programming
Student project type: 
References: 
* [1] J. Dean, S. Ghemawat, MapReduce: Simplified data processing on large clusters, in: Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI), 2004, pp. 137–150.
* [2] L. Han, J. I. van Hemert, R. Baldock, M. Atkinson, Automating gene expression annotation for mouse embryo, in: R. H. et al. (Ed.), Lecture Notes in Computer Science (Advanced Data Mining and Applications, ADMA 2009), Vol. LNAI 5678, 2009, pp. 469–478.
* [3] EURExpress-II, http://www.eurexpress.org/ee/
* [4] ADMIRE, http://www.admire-project.eu/