Efficiently Parallelizing Instruction Set Simulation of Embedded Multi-Core Processors Using Region-based Just-in-Time Dynamic Binary Translation
Abstract:
Embedded systems, as typified by modern mobile phones, are already seeing a drive toward multi-core processors, and the number of cores will likely increase rapidly in the future. Engineers and researchers need to simulate systems as they are expected to look in a few generations' time, which means running simulations of many-core devices on today’s multi-core machines. These requirements place heavy demands on the scalability of simulation engines, the fastest of which have typically evolved from just-in-time (JIT) dynamic binary translators (DBT). Existing work aimed at parallelizing DBT simulators has focused exclusively on trace-based DBT, wherein linear execution traces, or perhaps trees thereof, are the units of translation. Region-based DBT simulators have not received the same attention and require different techniques than their trace-based cousins. In this paper we develop an innovative approach to scaling multi-core embedded simulation through region-based DBT. We first modify the JIT code generator of such a simulator to emit code that does not depend on a particular thread and its thread-specific context and is, therefore, thread-agnostic. We then demonstrate that thread-agnostic code performs comparably to thread-specific code, while also enabling JIT-compiled regions to be shared between different threads. This sharing optimisation, in turn, leads to significant performance improvements for multi-threaded applications. Our results confirm that an average of 76% of all JIT-compiled regions can be shared between 128 threads in representative parallel workloads. We demonstrate that this translates into an overall performance improvement of 1.44x on average and up to 2.40x across 12 multi-threaded benchmarks taken from the SPLASH-2 benchmark suite, targeting our high-performance multi-core DBT simulator for embedded ARC processors running on a 4-core Intel host machine.
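The distinction between thread-specific and thread-agnostic translations, and why only the latter can be shared, can be illustrated with a small sketch. This is not the simulator's implementation: a real DBT emits native machine code, whereas here a "translated region" is modelled as a Python closure. All names (`CpuContext`, `translate_thread_specific`, `translate_thread_agnostic`, `SharedTranslationCache`) are hypothetical, and the region body (increment a register, advance the program counter) is a placeholder for real translated instructions.

```python
import threading
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class CpuContext:
    """Per-thread state of one simulated core (hypothetical layout)."""
    regs: List[int] = field(default_factory=lambda: [0] * 4)
    pc: int = 0


def translate_thread_specific(addr: int, ctx: CpuContext) -> Callable[[], None]:
    """Thread-specific code: the context is baked into the translation,
    so the result is only valid for the one thread that owns `ctx`."""
    def region() -> None:
        ctx.regs[0] += 1      # placeholder for the translated instructions
        ctx.pc = addr + 4
    return region


def translate_thread_agnostic(addr: int) -> Callable[[CpuContext], None]:
    """Thread-agnostic code: the context is a parameter, so any simulated
    core can execute the same translation on its own state."""
    def region(ctx: CpuContext) -> None:
        ctx.regs[0] += 1      # placeholder for the translated instructions
        ctx.pc = addr + 4
    return region


class SharedTranslationCache:
    """Region cache shared by all simulation threads, keyed by the
    region's start address (an assumed keying scheme)."""

    def __init__(self) -> None:
        self._regions: Dict[int, Callable[[CpuContext], None]] = {}
        self._lock = threading.Lock()
        self.compilations = 0  # how many times the translator actually ran

    def lookup_or_translate(self, addr: int) -> Callable[[CpuContext], None]:
        with self._lock:
            region = self._regions.get(addr)
            if region is None:
                # First thread to reach this region pays for translation.
                region = translate_thread_agnostic(addr)
                self._regions[addr] = region
                self.compilations += 1
            return region


# Two simulated cores execute the same region; it is translated only once.
cache = SharedTranslationCache()
core_a, core_b = CpuContext(), CpuContext()
cache.lookup_or_translate(0x100)(core_a)
cache.lookup_or_translate(0x100)(core_b)
```

With thread-specific translation, each of the two cores would need its own copy of the region, because each copy refers to a fixed context. Passing the context as an argument removes that dependency, which is what makes a single shared cache entry usable by every thread.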