Historical Interest Only

This is a static HTML version of an old Drupal site. The site is no longer maintained and could be deleted at any point. It is only here for historical interest.

Optimising Data-Streaming Elements in Distributed Data Mining

Principal goal: To evaluate the existing data-streaming implementations, formulate a model that predicts streaming performance for a given buffering strategy, and then optimise data streaming with a dynamic buffering implementation.

The ADMIRE system [1] uses data streaming to connect software processing elements (PEs) into directed acyclic graphs (DAGs), so that data mining and integration processes can be distributed efficiently across computers. Each data stream carries the output of one PE to an input of another PE. A PE may have several inputs, so buffering may be needed within a data stream to handle different production and consumption rates.
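The arrangement described above can be sketched in a few lines of Java. This is a minimal illustration only, not ADMIRE's actual API: the class name `BoundedStreamSketch`, the `run` method and the end-of-stream sentinel are all assumptions made for the example. A producer PE and a consumer PE run as threads joined by a bounded in-memory buffer; the producer blocks when the buffer is full and the consumer blocks when it is empty.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch: names are illustrative and not taken from ADMIRE.
final class BoundedStreamSketch {
    // Sentinel marking the end of the stream; assumed never to occur as a data value.
    private static final int END_OF_STREAM = Integer.MIN_VALUE;

    /** Streams the integers 1..items through a buffer of the given capacity; returns the sum the consumer saw. */
    static long run(int items, int capacity) {
        BlockingQueue<Integer> stream = new ArrayBlockingQueue<>(capacity);
        long[] sum = new long[1]; // written by the consumer thread, read after join()

        Thread producer = new Thread(() -> {
            try {
                for (int i = 1; i <= items; i++) {
                    stream.put(i); // blocks while the buffer is full
                }
                stream.put(END_OF_STREAM);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        Thread consumer = new Thread(() -> {
            try {
                // take() blocks while the buffer is empty
                for (int v = stream.take(); v != END_OF_STREAM; v = stream.take()) {
                    sum[0] += v;
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        producer.start();
        consumer.start();
        try {
            producer.join();
            consumer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for PEs", e);
        }
        return sum[0];
    }

    public static void main(String[] args) {
        System.out.println(run(1000, 8));
    }
}
```

The `capacity` argument is the quantity of interest here: a small value forces the faster PE to wait for the slower one, while a large value lets it run ahead at the cost of memory.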

The existing implementations interconnect PEs with a bounded main-memory buffer or with the buffers associated with communication between sockets. Jacobs [2] has observed a factor of more than 10^5 in the speed of memory access depending on whether it is serial or random. The workloads require arbitrarily sized buffers because processing speeds are mismatched along different branches of a DAG. The challenge is to show how a set of data-streaming implementations can be designed so that they have efficient access patterns for the various scales of buffering required.
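To make the serial-versus-random distinction concrete, the following micro-benchmark sketch sums the same array twice, once in index order and once in a shuffled order. The class `AccessPatternSketch`, its method names and the array size are assumptions chosen for illustration. The ratio measured this way depends heavily on hardware, JIT warm-up and data size, and an in-heap array should not be expected to reproduce the factor cited from [2]; the point is only that the visit order, not the work done, changes the cost.

```java
import java.util.Random;

// Illustrative micro-benchmark; class and method names are hypothetical.
final class AccessPatternSketch {

    /** Visit order 0, 1, 2, ..., n-1 (serial access). */
    static int[] sequentialOrder(int n) {
        int[] order = new int[n];
        for (int i = 0; i < n; i++) {
            order[i] = i;
        }
        return order;
    }

    /** The same indices rearranged by a seeded Fisher-Yates shuffle (random access). */
    static int[] shuffledOrder(int n, long seed) {
        int[] order = sequentialOrder(n);
        Random rng = new Random(seed);
        for (int i = n - 1; i > 0; i--) {
            int j = rng.nextInt(i + 1);
            int tmp = order[i];
            order[i] = order[j];
            order[j] = tmp;
        }
        return order;
    }

    /** Sums data[] in the given visit order; only the order differs between runs. */
    static long sumInOrder(int[] data, int[] order) {
        long sum = 0;
        for (int index : order) {
            sum += data[index];
        }
        return sum;
    }

    public static void main(String[] args) {
        int n = 1 << 22; // 4M ints; assumed to be larger than many CPU caches
        int[] data = new int[n];
        for (int i = 0; i < n; i++) {
            data[i] = i & 0xFF;
        }
        int[][] orders = { sequentialOrder(n), shuffledOrder(n, 42L) };
        String[] labels = { "serial", "random" };
        for (int k = 0; k < orders.length; k++) {
            long start = System.nanoTime();
            long sum = sumInOrder(data, orders[k]);
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println(labels[k] + ": sum=" + sum + " in " + elapsedMs + " ms");
        }
    }
}
```

Both passes produce the same sum, which is a quick check that the shuffled order really visits every element exactly once.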

A successful project would study the requirements and the existing implementations (both those in use locally and those in the literature) and formulate a model that predicts performance as a function of buffer capacity. It would then test these predictions by measuring performance with simulated workloads. Measurement tools and simulated loads will be provided in Java.
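As a toy illustration of what "performance versus buffer capacity" might look like, the sketch below runs a deterministic, tick-based simulation. It is not the project's model or its measurement tooling. The workload is an assumption made for the example: a bursty producer offers up to 2 items per tick during an "on" phase of `burst` ticks and then nothing for `burst` ticks, while the consumer removes at most 1 item per tick. The class `BufferCapacityModel` and the method `ticksToDrain` are hypothetical names.

```java
// Toy discrete-time model; all names and workload parameters are illustrative assumptions.
final class BufferCapacityModel {

    /**
     * Returns the number of ticks needed for the consumer to receive totalItems
     * when producer and consumer are joined by a buffer of the given capacity.
     */
    static long ticksToDrain(int totalItems, int capacity, int burst) {
        int produced = 0;
        int consumed = 0;
        int buffered = 0;
        long tick = 0;
        while (consumed < totalItems) {
            boolean producerOn = (tick / burst) % 2 == 0;
            if (producerOn) {
                int want = Math.min(2, totalItems - produced); // bursty rate: 2 items per tick
                int room = capacity - buffered;
                int put = Math.min(want, room);                // producer stalls when the buffer is full
                buffered += put;
                produced += put;
            }
            if (buffered > 0) {                                // consumer rate: 1 item per tick
                buffered--;
                consumed++;
            }
            tick++;
        }
        return tick;
    }

    public static void main(String[] args) {
        int[] capacities = { 1, 2, 4, 8, 16, 64 };
        for (int capacity : capacities) {
            System.out.println("capacity=" + capacity
                    + " ticks=" + ticksToDrain(100, capacity, 10));
        }
    }
}
```

With 100 items and a burst length of 10, a capacity of 1 takes 190 ticks because the consumer starves during every "off" phase, whereas a capacity of 64 takes 100 ticks because the buffer absorbs each burst and the consumer is never idle. Intermediate capacities fall between these two values, which is the kind of curve a predictive model would need to reproduce against measured workloads.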

Project status: Finished
Degree level: MSc
Background: Java programming essential. Distributed/parallel computing desirable.
Subject areas: Computer Architecture, Distributed Systems, Parallel Programming, Performance Modelling and Simulation, System Level Integration
References:
[1] M. Atkinson, P. Brezany, O. Corcho, L. Han, J. van Hemert, L. Hluchý, A. Hume, I. Janciak, A. Krause, and D. Snelling. ADMIRE White Paper: Motivation, Strategy, Overview and Impact. Technical Report version 0.9, ADMIRE, EPCC, University of Edinburgh, January 2009.
[2] A. Jacobs. The pathologies of big data. Commun. ACM, 52(8):36–44, 2009.