To explore, analyse and extract useful information and knowledge from massive amounts of data collected from geographically distributed sites, one must overcome both data-intensive and compute-intensive problems in distributed environments.
Much work has been done on parallelising specific data-mining algorithms to address these issues [1-3]. However, due to the complexity of the data mining and integration (DMI) process, even where data-mining algorithms have been customised for parallel implementation, they are rarely put into wider use. For example, the parallelised machine learning method of [3] is only suitable for shared-memory machines; it cannot be applied to cellular or grid-type multiprocessor architectures in which cores have a local cache.
In essence, the main bottleneck in accessing large-scale data (e.g., terabytes or petabytes) is the cost of I/O operations, and most current computing hardware is designed and optimised for sequential access to data. As shown in [4], random disk access is more than 1.5 × 10⁵ times slower than sequential disk access, and more than 10⁶ times slower than sequential RAM reads. Non-sequential access incurs seek times, cache misses and pipeline stalls. This has led to the development of streaming models [5-6]. The streaming computational paradigm provides a powerful capability for processing large-scale data efficiently, as it makes a single pass over massive amounts of data using a small working memory. Nevertheless, with the increasing need to access geographically distributed data resources, it remains a challenge to use the streaming model to efficiently map and schedule tasks across a distributed environment for data-intensive applications.
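To make the single-pass property concrete, the following minimal Python sketch (an illustration of the streaming paradigm in general, not code from this work or from OGSA-DAI) computes summary statistics over an arbitrarily large input in one sequential pass with constant working memory:

    # Single-pass streaming sketch: statistics over an arbitrarily large
    # input are computed with O(1) working memory by reading records
    # sequentially, never materialising the data set.
    import sys

    def stream_stats(lines):
        """One pass, constant memory: count, mean, min and max of numeric records."""
        count, total = 0, 0.0
        lo, hi = float("inf"), float("-inf")
        for line in lines:                 # sequential access: no seeks back into the data
            x = float(line)
            count += 1
            total += x
            lo, hi = min(lo, x), max(hi, x)
        mean = total / count if count else float("nan")
        return {"count": count, "mean": mean, "min": lo, "max": hi}

    if __name__ == "__main__":
        # e.g.  seq 1 1000000 | python stream_stats.py
        print(stream_stats(sys.stdin))

However large the input, the working set stays at a handful of scalars, which is exactly what makes streaming attractive for data whose volume precludes random access.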
Scientific workflow systems such as Taverna [7], Kepler [8] and Pegasus [9] have been used to analyse large data volumes and to transform data into knowledge by harnessing distributed resources. Performance-improvement strategies for these systems try to minimise the makespan by grouping and mapping a workflow's components; they do not support scientific applications over remote data streams [10] that require streaming pipeline parallelism. OGSA-DAI [11] is a data workflow execution system that supports data streams across distributed sites.
To facilitate and speed up DMI processes in a generic way, we investigate a parallel pipeline streaming model. In this approach, we model a DMI task as a directed acyclic graph (DAG) of Processing Elements (PEs). PEs can be assembled and composed to complete a DMI task. The composition mechanism links PEs via data streams, which may be held in memory, buffered via disks, or carried as inter-computer data flows. This makes it possible to build arbitrary DAGs that combine pipelining with both data and task parallelism, and therefore provides room for performance enhancement. We have applied this approach to a real DMI use case and implemented a prototype using OGSA-DAI. The performance evaluation shows that the proposed parallel processing model works well.
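The following Python sketch is a hypothetical illustration of the pipeline part of this model (the pe and compose helpers and the thread-per-PE scheme are our own simplification, not OGSA-DAI's API, and it shows only a linear chain rather than an arbitrary DAG). PEs run concurrently and are linked by bounded in-memory streams, so each stage starts consuming records before its upstream stage has finished, and the bounded buffers provide backpressure:

    # Pipeline-parallel sketch: each Processing Element (PE) is a thread,
    # and PEs are linked by bounded in-memory streams (queues).
    import threading, queue

    SENTINEL = object()                    # marks end-of-stream

    def pe(transform, inp, out):
        """Wrap a per-record transform as a PE reading one stream, writing another."""
        for item in iter(inp.get, SENTINEL):
            out.put(transform(item))
        out.put(SENTINEL)                  # propagate end-of-stream downstream

    def compose(source, transforms, sink, capacity=64):
        """Link PEs into a linear chain via bounded streams and run them in parallel."""
        streams = [queue.Queue(maxsize=capacity) for _ in range(len(transforms) + 1)]
        workers = [threading.Thread(target=pe, args=(t, streams[i], streams[i + 1]))
                   for i, t in enumerate(transforms)]

        def feed():                        # a feeder thread avoids blocking the sink
            for item in source:
                streams[0].put(item)
            streams[0].put(SENTINEL)

        workers.append(threading.Thread(target=feed))
        for w in workers:
            w.start()
        for item in iter(streams[-1].get, SENTINEL):
            sink(item)                     # the calling thread acts as the sink PE
        for w in workers:
            w.join()

    # Example: a three-stage pipeline over a record stream.
    compose(source=range(10),
            transforms=[lambda x: x * 2,
                        lambda x: x + 1,
                        lambda x: ("even", x) if x % 2 == 0 else ("odd", x)],
            sink=print)

A full implementation along the lines described above would additionally support fan-out and fan-in edges for arbitrary DAGs, data-parallel replication of a PE, and streams buffered on disk or carried between computers.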
The work presented here is supported by the ADMIRE project [12] (funded by EU Framework Programme 7, FP7-ICT-215024), which is conducting research into architectures for large-scale and long-running data-intensive computations.
References
[1] D. Caragea, A. Silvescu, V. Honavar, A framework for learning from distributed data using sufficient statistics and its application to learning decision trees, International Journal of Hybrid Intelligent Systems 1 (2) (2004) 80–89.
[2] B. Catanzaro, N. Sundaram, K. Keutzer, Fast support vector machine training and classification on graphics processors, in: Proceedings of the 25th International Conference on Machine Learning, ACM International Conference Proceeding Series, vol. 307, ACM, New York, NY, USA, 2008.
[3] R. Jin, G. Yang, G. Agrawal, Shared memory parallelization of data mining algorithms: Techniques, programming interface, and performance, IEEE Transactions on Knowledge and Data Engineering 17 (1) (2005) 71–89.
[4] A. Jacobs, The pathologies of big data, Commun. ACM 52 (8) (2009) 36–44.
[5] N. Alon, Y. Matias, M. Szegedy, The space complexity of approximating the frequency moments, Journal of Computer and System Sciences 58 (1) (1999) 137–147.
[6] M. R. Henzinger, P. Raghavan, S. Rajagopalan, Computing on data streams, DIMACS Series in Discrete Mathematics and Theoretical Computer Science 50 (1999) 107–118.
[7] T. Oinn, M. Addis, J. Ferris, D. Marvin, M. Senger, M. Greenwood, T. Carver, K. Glover, M. R. Pocock, A. Wipat, P. Li, Taverna: a tool for the composition and enactment of bioinformatics workflows, Bioinformatics 20 (17) (2004) 3045–3054.
[8] I. Altintas, C. Berkley, E. Jaeger, M. Jones, B. Ludascher, S. Mock, Kepler: an extensible system for design and execution of scientific workflows, in: Proceedings of the 16th International Conference on Scientific and Statistical Database Management (SSDBM), 2004, pp. 423–424.
[9] E. Deelman, G. Singh, M.-H. Su, J. Blythe, Y. Gil, C. Kesselman, G. Mehta, K. Vahi, G. B. Berriman, J. Good, A. C. Laity, J. C. Jacob, D. S. Katz, Pegasus: a framework for mapping complex scientific workflows onto distributed systems, Scientific Programming 13 (3) (2005) 219–237.
[10] C. Rueda, M. Gertz, Real-time integration of geospatial raster and point data streams, in: B. Ludäscher, N. Mamoulis (Eds.), 20th International Conference on Scientific and Statistical Database Management (SSDBM), LNCS 5069, Springer, 2008.
[11] M. Antonioletti, N. P. Chue Hong, A. C. Hume, M. Jackson, K. Karasavvas, A. Krause, J. M. Schopf, M. P. Atkinson, B. Dobrzelecki, M. Illingworth, N. McDonnell, M. Parsons, E. Theocharopoulos, OGSA-DAI 3.0 – the what's and the why's, in: Proceedings of the UK e-Science All Hands Meeting, 2007.
[12] ADMIRE, http://www.admire-project.eu.