School of Informatics - University of Edinburgh Institute for Computing Systems Architecture - School of Informatics

Semi-Automatic Extraction and Exploitation of Hierarchical Pipeline Parallelism Using Profiling Information

  • Type: Paper
  • Authors:
    G. Tournavitis and B. Franke.
  • Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT '10), Vienna, Austria, September 11-15, 2010.
  • Abstract:

    In recent years, multi-core computer systems have left the realm of high-performance computing, and virtually all of today's desktop computers and embedded computing systems are equipped with several processing cores. Still, no single parallel programming model has found widespread support, and parallel programming remains an art for the majority of application programmers. In addition, there exists a plethora of sequential legacy applications for which automatic parallelization is the only realistic hope to benefit from the potentially increased processing power of modern multi-core systems. However, automatic parallelization has in the past largely focused on data parallelism. In this paper we present a novel approach to extracting and exploiting pipeline parallelism from sequential applications. We use profiling to overcome the limitations of static data and control flow analysis, enabling more aggressive parallelization. Our approach is orthogonal to existing automatic parallelization approaches, and additional data parallelism may be exploited in the individual pipeline stages. The key contribution of this paper is a whole-program representation that supports profiling, parallelism extraction and exploitation. We demonstrate how this enhances conventional pipeline parallelization by incorporating support for multi-level loops and pipeline stage replication in a uniform and automatic way. We have evaluated our methodology on a set of multimedia and stream processing benchmarks and demonstrate speedups of up to 4.7 on an eight-core Intel Xeon machine.
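    For readers unfamiliar with the terms, the following is a minimal hand-written sketch of pipeline parallelism with stage replication. It is not the paper's method, which extracts such structure semi-automatically from sequential code using profiling; the function `run_pipeline`, its parameters, and the three-stage layout are hypothetical and chosen only for illustration. A sequential reader stage feeds a replicated, stateless transform stage, and a sequential collector stage restores the original order using sequence numbers. Python threads are used to show the structure only; because of CPython's global interpreter lock, this sketch is not expected to reproduce the reported speedups.

```python
import queue
import threading

SENTINEL = None  # end-of-stream marker; messages are tuples, so never None


def run_pipeline(items, transform, replicas=2):
    """Hypothetical three-stage pipeline: read -> transform (replicated) -> collect."""
    q_in = queue.Queue(maxsize=8)   # bounded buffers between stages
    q_out = queue.Queue(maxsize=8)
    results = {}

    def reader():
        # Stage 1 (sequential): tag each item with its position in the stream.
        for seq, item in enumerate(items):
            q_in.put((seq, item))
        # One end-of-stream marker per replica of the next stage.
        for _ in range(replicas):
            q_in.put(SENTINEL)

    def worker():
        # Stage 2 (replicated): stateless, so several copies may run at once.
        while True:
            msg = q_in.get()
            if msg is SENTINEL:
                q_out.put(SENTINEL)
                return
            seq, item = msg
            q_out.put((seq, transform(item)))

    def collector():
        # Stage 3 (sequential): gather results until every replica has finished.
        finished = 0
        while finished < replicas:
            msg = q_out.get()
            if msg is SENTINEL:
                finished += 1
            else:
                seq, value = msg
                results[seq] = value

    threads = [threading.Thread(target=reader), threading.Thread(target=collector)]
    threads += [threading.Thread(target=worker) for _ in range(replicas)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Replicas may finish out of order; sequence numbers restore stream order.
    return [results[i] for i in range(len(results))]
```

    Calling `run_pipeline(range(10), lambda x: x * x, replicas=3)` returns the squares in their original order, even though the three replicas may complete items out of order.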