TY - CHAP
T1 - The Data-Intensive Survival Guide
T2 - THE DATA BONANZA: Improving Knowledge Discovery for Science, Engineering and Business
Y1 - 2013
A1 - Malcolm Atkinson
ED - Malcolm Atkinson
ED - Rob Baxter
ED - Peter Brezany
ED - Oscar Corcho
ED - Michelle Galea
ED - Parsons, Mark
ED - Snelling, David
ED - van Hemert, Jano
KW - Data-Analysis Experts
KW - Data-Intensive Architecture
KW - Data-intensive Computing
KW - Data-Intensive Engineers
KW - Datascopes
KW - Dispel
KW - Domain Experts
KW - Intellectual Ramps
KW - Knowledge Discovery
KW - Workflows
AB - Chapter 3: "The data-intensive survival guide", presents an overview of all of the elements of the proposed data-intensive strategy. Sufficient detail is presented for readers to understand the principles and practice that we recommend. It should also provide a good preparation for readers who choose to sample later chapters. It introduces three professional viewpoints: domain experts, data-analysis experts, and data-intensive engineers. Success depends on a balanced approach that develops the capacity of all three groups. A data-intensive architecture provides a flexible framework for that balanced approach. This enables the three groups to build and exploit data-intensive processes that incrementally step from data to results. A language is introduced to describe these incremental data processes from all three points of view. The chapter introduces ‘datascopes’ as the productized data-handling environments and ‘intellectual ramps’ as the ‘on ramps’ for the highways from data to knowledge.
JF - THE DATA BONANZA: Improving Knowledge Discovery for Science, Engineering and Business
PB - John Wiley & Sons Ltd.
ER -

TY - CHAP
T1 - Definition of the DISPEL Language
T2 - THE DATA BONANZA: Improving Knowledge Discovery for Science, Engineering and Business
Y1 - 2013
A1 - Paul Martin
A1 - Yaikhom, Gagarine
ED - Malcolm Atkinson
ED - Rob Baxter
ED - Peter Brezany
ED - Oscar Corcho
ED - Michelle Galea
ED - Parsons, Mark
ED - Snelling, David
ED - van Hemert, Jano
KW - Data Streaming
KW - Data-intensive Computing
KW - Dispel
AB - Chapter 10: "Definition of the DISPEL language", describes the novel aspects of the DISPEL language: its constructs, capabilities, and anticipated programming style.
JF - THE DATA BONANZA: Improving Knowledge Discovery for Science, Engineering and Business
T3 - {Parallel and Distributed Computing, series editor Albert Y. Zomaya}
PB - John Wiley & Sons Inc.
ER -

TY - RPRT
T1 - Data-Intensive Research Workshop (15-19 March 2010) Report
Y1 - 2010
A1 - Malcolm Atkinson
A1 - Roure, David De
A1 - van Hemert, Jano
A1 - Shantenu Jha
A1 - Ruth McNally
A1 - Robert Mann
A1 - Stratis Viglas
A1 - Chris Williams
KW - Data-intensive Computing
KW - Data-Intensive Machines
KW - Machine Learning
KW - Scientific Databases
AB - We met at the National e-Science Institute in Edinburgh on 15-19 March 2010 to develop our understanding of DIR. Approximately 100 participants (see Appendix A) worked together to develop their own understanding, and we are offering this report as the first step in communicating that to a wider community. We present this in terms of our developing/emerging understanding of "What is DIR?" and "Why it is important?". We then review the status of the field, report what the workshop achieved and what remains as open questions.
JF - National e-Science Centre
PB - Data-Intensive Research Group, School of Informatics, University of Edinburgh
CY - Edinburgh
ER -

TY - CONF
T1 - Towards Optimising Distributed Data Streaming Graphs using Parallel Streams
T2 - Data Intensive Distributed Computing (DIDC'10), in conjunction with the 19th International Symposium on High Performance Distributed Computing
Y1 - 2010
A1 - Chee Sun Liew
A1 - Atkinson, Malcolm P.
A1 - van Hemert, Jano
A1 - Liangxiu Han
KW - Data-intensive Computing
KW - Distributed Computing
KW - Optimisation
KW - Parallel Stream
KW - Scientific Workflows
AB - Modern scientific collaborations have opened up the opportunity of solving complex problems that involve multi-disciplinary expertise and large-scale computational experiments. These experiments usually involve large amounts of data that are located in distributed data repositories running various software systems, and managed by different organisations. A common strategy to make the experiments more manageable is executing the processing steps as a workflow. In this paper, we look into the implementation of fine-grained data-flow between computational elements in a scientific workflow as streams. We model the distributed computation as a directed acyclic graph where the nodes represent the processing elements that incrementally implement specific subtasks. The processing elements are connected in a pipelined streaming manner, which allows task executions to overlap. We further optimise the execution by splitting pipelines across processes and by introducing extra parallel streams. We identify performance metrics and design a measurement tool to evaluate each enactment. We conducted experiments to evaluate our optimisation strategies with a real world problem in the Life Sciences—EURExpress-II. The paper presents our distributed data-handling model, the optimisation and instrumentation strategies and the evaluation experiments. We demonstrate linear speed up and argue that this use of data-streaming to enable both overlapped pipeline and parallelised enactment is a generally applicable optimisation strategy.
JF - Data Intensive Distributed Computing (DIDC'10), in conjunction with the 19th International Symposium on High Performance Distributed Computing
PB - ACM
CY - Chicago, Illinois
UR - http://www.cct.lsu.edu/~kosar/didc10/index.php
ER -