by Magnus Jahre and Lasse Natvig
The computer architecture group at the Norwegian University of Science and Technology (NTNU) in Trondheim, Norway, is working on issues that are arising as increasing numbers of processors are integrated on a single chip. Discrete event simulators and high-performance computers are indispensable tools in this quest. By combining the cutting-edge multi-core simulator M5 from the University of Michigan with the 5632-core Stallo cluster at the University of Tromsø, researchers are making progress on the issues facing future multi-core architectures.
The Stallo Cluster at the University of Tromsø (Photo: Thilo Bubek).
Chip multiprocessors (CMPs), also known as multi-core architectures, are a recent technological innovation that has received considerable attention in both academia and industry. The main reason is that such architectures reduce the impact of physical and economic design constraints. Consequently, a number of commercial vendors now produce CMPs, and most new desktop computers are equipped with multi-core processors.
The recent popularity of CMPs is due to the following factors:
- technology scaling has made it feasible to place multiple cores on one chip
- it has become increasingly difficult to improve performance with techniques that exploit Instruction Level Parallelism (ILP) beyond what is common today
- single-core, high-performance processors consume a great deal of power, meaning that expensive packaging and noisy cooling solutions are needed. This limitation is known as the power wall. For large compute clusters, reduced power consumption gives a double benefit: lower power demands from both the processors and the cooling systems. Consequently, multi-core computing contributes to what is now called Green IT
- when designing a CMP, a processor core is designed once and reused as many times as there are cores on the chip. Furthermore, these cores can be simpler than their single-core counterparts. Consequently, CMPs facilitate design reuse and reduce time to market.
Processor performance has been improving at a faster rate than the main memory access time for more than twenty years. CMPs do not automatically reduce the impact of this problem. In fact, they can make it worse because multiple processors need to be fed from the same slow memory. In recent years, our group has pursued several avenues towards reducing the impact of this problem. Firstly, we have proposed techniques that increase shared cache utilization. We have also looked at prefetching, which analyses the memory access stream and then attempts to retrieve data before the processor requests it. Here, we have both proposed new prefetching heuristics and illustrated how prefetching can be used to improve memory bandwidth utilization. Finally, we have proposed techniques that reduce the performance impact of destructive interference between concurrently scheduled processes running on different cores.
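To make the prefetching idea mentioned above concrete, the following is a minimal sketch of a simple stride prefetcher, a common textbook heuristic. It is not one of the heuristics proposed by our group; the class name `StridePrefetcher` and the method `observe` are hypothetical and chosen for illustration only. The prefetcher watches the stream of accessed addresses and, once it has seen the same non-zero stride twice in a row, predicts the next address.

```python
class StridePrefetcher:
    """Illustrative stride prefetcher (hypothetical, not the NTNU heuristics)."""

    def __init__(self):
        self.last_addr = None    # most recently observed address
        self.last_stride = None  # difference between the two previous addresses

    def observe(self, addr):
        """Record one memory access; return an address to prefetch, or None."""
        prediction = None
        if self.last_addr is not None:
            stride = addr - self.last_addr
            # Only predict once the same non-zero stride has repeated.
            if stride != 0 and stride == self.last_stride:
                prediction = addr + stride
            self.last_stride = stride
        self.last_addr = addr
        return prediction


def prefetch_trace(addresses):
    """Run the prefetcher over an access stream and collect its predictions."""
    prefetcher = StridePrefetcher()
    return [prefetcher.observe(addr) for addr in addresses]
```

For the access stream `[100, 164, 228, 292, 300]`, this sketch issues prefetches only after the 64-byte stride has repeated, and stops as soon as the pattern breaks.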
Modern cycle-accurate simulators such as M5 are complex pieces of software. Typically, they contain tens of thousands of lines of code. Most computer architecture simulators are event driven: execution is modelled as discrete events that occur at specific times, and time is typically measured in clock cycles. The simulator is built around a time-ordered event queue. The latency of an operation is modelled by first calculating the latency and then adding an event to the event queue at the time the operation will complete. By expanding on this simple concept, it is possible to model very complex behaviour.
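The event queue concept can be sketched in a few lines. The code below is a toy illustration and does not reflect the actual M5 implementation or its API; the names `EventQueue`, `simulate_accesses` and the two latency constants are assumptions made for this example. Each memory access computes its latency and schedules a completion event at the corresponding clock cycle.

```python
import heapq
import itertools

CACHE_HIT_LATENCY = 2   # cycles, assumed value for illustration
MEMORY_LATENCY = 100    # additional cycles on a cache miss, assumed value


class EventQueue:
    """Toy time-ordered event queue; time is measured in clock cycles."""

    def __init__(self):
        self._heap = []
        self._order = itertools.count()  # tie-breaker keeps insertion order
        self.now = 0

    def schedule(self, delay, action):
        """Add an event that fires `delay` cycles after the current time."""
        if delay < 0:
            raise ValueError("events cannot be scheduled in the past")
        heapq.heappush(self._heap, (self.now + delay, next(self._order), action))

    def run(self):
        """Process events in time order until the queue is empty."""
        while self._heap:
            when, _, action = heapq.heappop(self._heap)
            self.now = when
            action()


def simulate_accesses(addresses, cached):
    """Issue all accesses at cycle 0; return (completion_cycle, address) pairs."""
    queue = EventQueue()
    completions = []
    for addr in addresses:
        # Step 1: calculate the latency of the operation.
        latency = CACHE_HIT_LATENCY
        if addr not in cached:
            latency += MEMORY_LATENCY
        # Step 2: add an event at the time the operation will complete.
        queue.schedule(latency, lambda a=addr: completions.append((queue.now, a)))
    queue.run()
    return completions
```

Calling `simulate_accesses([0x10, 0x20, 0x30], cached={0x20})` completes the cached access at cycle 2 and the two misses at cycle 102. A real simulator adds many more event types, but the scheduling pattern is the same.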
A computer architecture research paper can demand as much as 25 computer-years’ worth of simulation. A significant part of computer architecture research is Design Space Exploration. We test our architectural techniques on a variety of different programs (called benchmarks) to ensure that each technique is sufficiently general. Then, we investigate the impact of changing key architectural features such as the amount of cache space or off-chip bandwidth. Each combination of architectural parameters and benchmark is one point in the design space. Each point is independent of all the others, so the points can be simulated in parallel. Consequently, our research is well suited to large clusters. Fortunately, we have been granted access to the 5632-core Stallo cluster at the University of Tromsø by the Norwegian Metacenter for Computational Science (NOTUR). Having access to a large, professionally maintained compute cluster is a great advantage.
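The structure of such an exploration can be sketched as follows. This is an illustrative outline only: the benchmark names, parameter values and the `run_point` placeholder are hypothetical, and a thread pool stands in for the cluster's job scheduler. On a real cluster, each point would typically be submitted as a separate simulator job.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

# Hypothetical design space, chosen only to illustrate the structure.
BENCHMARKS = ["bench_a", "bench_b", "bench_c"]
CACHE_SIZES_MB = [1, 2, 4]
BANDWIDTHS_GBS = [4, 8]


def design_points():
    """Enumerate every combination of benchmark and architectural parameters."""
    return [
        {"benchmark": b, "cache_mb": c, "bandwidth_gbs": w}
        for b, c, w in product(BENCHMARKS, CACHE_SIZES_MB, BANDWIDTHS_GBS)
    ]


def run_point(point):
    """Placeholder for one simulation run; a real version would launch the simulator."""
    return {**point, "status": "done"}


def explore(max_workers=4):
    """Run all design points concurrently; the points do not depend on each other."""
    points = design_points()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as the input points.
        return list(pool.map(run_point, points))
```

With three benchmarks, three cache sizes and two bandwidth settings, this sketch produces 18 independent points. Adding one more parameter multiplies the number of points, which is why simulation demands grow so quickly.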
Lasse Natvig, NTNU, Norway