TY - CHAP
T1 - DISPEL Enactment
T2 - The Data Bonanza: Improving Knowledge Discovery in Science, Engineering, and Business
Y1 - 2013
A1 - Liew, Chee Sun
A1 - Krause, Amrey
A1 - Snelling, David
ED - Atkinson, Malcolm
ED - Baxter, Rob
ED - Brezany, Peter
ED - Corcho, Oscar
ED - Galea, Michelle
ED - Parsons, Mark
ED - Snelling, David
ED - van Hemert, Jano
KW - Data Streaming
KW - Data-Intensive Engineering
KW - DISPEL
KW - Workflow Enactment
AB - Chapter 12, "DISPEL Enactment", describes the four stages of DISPEL enactment. It is targeted at data-intensive engineers who implement enactment services.
PB - John Wiley & Sons Inc.
ER -
TY - JOUR
T1 - Data-Intensive Architecture for Scientific Knowledge Discovery
JF - Distributed and Parallel Databases
Y1 - 2012
A1 - Atkinson, Malcolm P.
A1 - Liew, Chee Sun
A1 - Galea, Michelle
A1 - Martin, Paul
A1 - Krause, Amrey
A1 - Mouat, Adrian
A1 - Corcho, Oscar
A1 - Snelling, David
KW - Knowledge discovery
KW - Workflow management system
AB - This paper presents a data-intensive architecture that demonstrates the ability to support applications from a wide range of application domains, and to support the different types of users involved in defining, designing and executing data-intensive processing tasks. The prototype architecture is introduced, and the pivotal role of DISPEL as a canonical language is explained. The architecture promotes the exploration and exploitation of distributed and heterogeneous data and spans the complete knowledge discovery process, from data preparation, to analysis, to evaluation and reiteration. The architecture evaluation included large-scale applications from astronomy, cosmology, hydrology, functional genetics, image processing and seismology.
VL - 30
UR - http://dx.doi.org/10.1007/s10619-012-7105-3
IS - 5
ER -
TY - THES
T1 - Optimisation of the enactment of fine-grained distributed data-intensive workflows
Y1 - 2012
A1 - Liew, Chee Sun
AB - The emergence of data-intensive science as the fourth science paradigm has posed a data deluge challenge for enacting scientific workflows. The scientific community is facing an imminent flood of data from the next generation of experiments and simulations, besides dealing with the heterogeneity and complexity of data, applications and execution environments. New scientific workflows involve execution on distributed and heterogeneous computing resources across organisational and geographical boundaries, processing gigabytes of live data streams and petabytes of archived and simulation data, in various formats and from multiple sources. Managing the enactment of such workflows not only requires larger storage space and faster machines, but also the capability to support scalability and diversity of the users, applications, data, computing resources and the enactment technologies. We argue that the enactment process can be made efficient using optimisation techniques in an appropriate architecture. This architecture should support the creation of diversified applications and their enactment on diversified execution environments, with a standard interface, i.e. a workflow language. The workflow language should be both human readable and suitable for communication between the enactment environments. The data-streaming model central to this architecture provides a scalable approach to large-scale data exploitation. Data-flow between computational elements in the scientific workflow is implemented as streams. To cope with the exploratory nature of scientific workflows, the architecture should support fast workflow prototyping, and the re-use of workflows and workflow components. Above all, the enactment process should be easily repeated and automated. In this thesis, we present a candidate data-intensive architecture that includes an intermediate workflow language, named DISPEL. We create a new fine-grained measurement framework to capture performance-related data during enactments, and design a performance database to organise them systematically. We propose a new enactment strategy to demonstrate that optimisation of data-streaming workflows can be automated by exploiting performance data gathered during previous enactments.
PB - The University of Edinburgh
CY - Edinburgh
ER -
TY - JOUR
T1 - A Generic Parallel Processing Model for Facilitating Data Mining and Integration
JF - Parallel Computing
Y1 - 2011
A1 - Han, Liangxiu
A1 - Liew, Chee Sun
A1 - van Hemert, Jano
A1 - Atkinson, Malcolm
KW - Data Mining and Data Integration (DMI)
KW - Life Sciences
KW - OGSA-DAI
KW - Parallelism
KW - Pipeline Streaming
KW - Workflow
AB - To facilitate Data Mining and Integration (DMI) processes in a generic way, we investigate a parallel pipeline streaming model. We model a DMI task as a streaming data-flow graph: a directed acyclic graph (DAG) of Processing Elements (PEs). The composition mechanism links PEs via data streams, which may be in memory, buffered via disks or inter-computer data-flows. This makes it possible to build arbitrary DAGs with pipelining and both data and task parallelism, which provides room for performance enhancement. We have applied this approach to a real DMI case in the Life Sciences and implemented a prototype. To demonstrate the feasibility of the modelled DMI task and assess the efficiency of the prototype, we have also built a performance evaluation model. The experimental evaluation results show that a linear speedup has been achieved as the number of distributed computing nodes increases in this case study.
PB - Elsevier
VL - 37
IS - 3
ER -
TY - JOUR
T1 - Performance database: capturing data for optimizing distributed streaming workflows
JF - Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences
Y1 - 2011
A1 - Liew, Chee Sun
A1 - Atkinson, Malcolm P.
A1 - Ostrowski, Radoslaw
A1 - Cole, Murray
A1 - van Hemert, Jano I.
A1 - Han, Liangxiu
KW - measurement framework
KW - performance data
KW - streaming workflows
AB - The performance database (PDB) stores performance-related data gathered during workflow enactment. We argue that by carefully understanding and manipulating this data, we can improve efficiency when enacting workflows. This paper describes the rationale behind the PDB and proposes a systematic way to implement it. The prototype is built as part of the Advanced Data Mining and Integration Research for Europe project. We use workflows from real-world experiments to demonstrate the usage of the PDB.
VL - 369
IS - 1949
ER -
TY - CONF
T1 - Towards Optimising Distributed Data Streaming Graphs using Parallel Streams
T2 - Data Intensive Distributed Computing (DIDC'10), in conjunction with the 19th International Symposium on High Performance Distributed Computing
Y1 - 2010
A1 - Liew, Chee Sun
A1 - Atkinson, Malcolm P.
A1 - van Hemert, Jano
A1 - Han, Liangxiu
KW - Data-intensive Computing
KW - Distributed Computing
KW - Optimisation
KW - Parallel Stream
KW - Scientific Workflows
AB - Modern scientific collaborations have opened up the opportunity of solving complex problems that involve multidisciplinary expertise and large-scale computational experiments. These experiments usually involve large amounts of data that are located in distributed data repositories running various software systems, and managed by different organisations. A common strategy to make the experiments more manageable is executing the processing steps as a workflow. In this paper, we look into the implementation of fine-grained data-flow between computational elements in a scientific workflow as streams. We model the distributed computation as a directed acyclic graph where the nodes represent the processing elements that incrementally implement specific subtasks. The processing elements are connected in a pipelined streaming manner, which allows task executions to overlap. We further optimise the execution by splitting pipelines across processes and by introducing extra parallel streams. We identify performance metrics and design a measurement tool to evaluate each enactment. We conducted experiments to evaluate our optimisation strategies with a real-world problem in the Life Sciences (EURExpress-II). The paper presents our distributed data-handling model, the optimisation and instrumentation strategies and the evaluation experiments. We demonstrate linear speedup and argue that this use of data-streaming to enable both overlapped pipeline and parallelised enactment is a generally applicable optimisation strategy.
PB - ACM
CY - Chicago, Illinois
UR - http://www.cct.lsu.edu/~kosar/didc10/index.php
ER -
TY - CONF
T1 - A Distributed Architecture for Data Mining and Integration
T2 - Data-Aware Distributed Computing (DADC'09), in conjunction with the 18th International Symposium on High Performance Distributed Computing
Y1 - 2009
A1 - Atkinson, Malcolm P.
A1 - van Hemert, Jano
A1 - Han, Liangxiu
A1 - Hume, Ally
A1 - Liew, Chee Sun
AB - This paper presents the rationale for a new architecture to support a significant increase in the scale of data integration and data mining. It proposes the composition into one framework of (1) data mining and (2) data access and integration. We name the combined activity “DMI”. It supports enactment of DMI processes across heterogeneous and distributed data resources and data mining services. It posits that a useful division can be made between the facilities established to support the definition of DMI processes and the computational infrastructure provided to enact DMI processes. Communication between those two divisions is restricted to requests submitted to gateway services in a canonical DMI language. Larger-scale processes are enabled by incremental refinement of DMI-process definitions, often by recomposition of lower-level definitions. It proposes autonomous types and descriptions that will support detection of inconsistencies and semi-automatic insertion of adaptations. These architectural ideas are being evaluated in a feasibility study that involves an application scenario and representatives of the community.
PB - ACM
ER -