Array processing is a good candidate for increasing computing power through parallel computation. It can also help to solve architectural problems (e.g. the distribution of control signals on a chip). The effectiveness of an implementation is measured by a set of parameters: silicon area (A), execution time (T), dissipated power (P), consumed energy (E) and number of input/output pins (#I/O).
A number of different array processor implementations are commercially available. In this project we concentrated mainly on the topographic IBM Cell heterogeneous array processor architecture, because its development environment is open source and we wished to compare the results with those from our previous FPGA-based (field-programmable gate array) implementations.
The Cell Broadband Engine Architecture (CBEA) is designed to achieve high-performance computing with better area/performance and power/performance ratios than conventional multi-core architectures. The CBEA defines a heterogeneous multi-processor architecture in which general-purpose processors called Power Processor Elements (PPEs) and Single Instruction Multiple Data (SIMD) processors called Synergistic Processor Elements (SPEs) are connected via a high-speed, coherent on-chip bus called the Element Interconnect Bus (EIB). The CBEA is flexible, and the ratio of the different elements can be chosen according to the requirements of different applications. The first implementation of the CBEA, the Cell Broadband Engine (Cell BE, or informally Cell), was designed for the Sony PlayStation 3 game console and contains one PPE and eight SPEs.
In this work we concentrated on an efficient cellular neural network (CNN) implementation on the Cell architecture. The basic CNN simulation kernel was successfully implemented on the Cell BE, and both linear and nonlinear CNN arrays can be simulated. The kernel was optimized according to the special requirements of the Cell architecture. A performance comparison showed that a single SPE achieves a roughly sixfold speedup over a high-performance microprocessor, while the speedup is 35-fold when all eight SPEs are utilized. With nonlinear templates the performance advantage of the Cell architecture is much greater: a single SPE achieves a 64-fold speedup, and eight SPEs achieve a 429-fold speedup.
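To make the notion of a CNN simulation kernel concrete, the following is a minimal sketch of one time step of the standard CNN state equation with linear 3×3 templates. It is not the optimized Cell kernel described above: the forward-Euler integration, the zero-valued boundary cells, and all names (`output`, `euler_step`, `A`, `B`, `z`, `h`) are assumptions made for illustration only.

```python
def output(x):
    """Standard CNN piecewise-linear output: saturates the state to [-1, 1]."""
    return 0.5 * (abs(x + 1.0) - abs(x - 1.0))


def euler_step(x, u, A, B, z, h):
    """Advance every cell state by one forward-Euler step of size h.

    x, u : 2D lists of floats (cell states and inputs), same shape
    A, B : 3x3 feedback and control templates (linear case only)
    z    : bias term
    Neighbours outside the array contribute zero (an assumed boundary
    condition; the source does not state which one was used).
    """
    rows, cols = len(x), len(x[0])
    y = [[output(v) for v in row] for row in x]
    new_x = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = z
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    ni, nj = i + di, j + dj
                    if 0 <= ni < rows and 0 <= nj < cols:
                        acc += A[di + 1][dj + 1] * y[ni][nj]
                        acc += B[di + 1][dj + 1] * u[ni][nj]
            # dx/dt = -x + sum(A*y) + sum(B*u) + z
            new_x[i][j] = x[i][j] + h * (-x[i][j] + acc)
    return new_x
```

Every cell performs the same multiply-accumulate over its neighbourhood, which is the kind of regular, data-parallel work that maps naturally onto SIMD units such as the SPEs.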
In addition, the CNN paradigm was used to solve a complex spatio-temporal problem: the 3D Princeton Ocean Model was implemented on the Cell BE, and a significant improvement in performance was achieved. Our solution was optimized according to the special requirements of the Cell architecture. A performance comparison showed that a single SPE achieves an approximately 17-fold improvement over a high-performance microprocessor, while the speedup is 85-fold when all six SPEs are utilized.
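The speedup figures quoted above also indicate how well each workload scales across SPEs. The short sketch below divides the multi-SPE speedup by the single-SPE speedup and then by the number of SPEs; the function name `parallel_efficiency` is hypothetical, and because the quoted speedups are rounded, the results are only indicative.

```python
def parallel_efficiency(single_speedup, multi_speedup, n_units):
    """Return (scaling factor, efficiency) when going from 1 to n_units SPEs."""
    scaling = multi_speedup / single_speedup
    return scaling, scaling / n_units


# (single-SPE speedup, multi-SPE speedup, number of SPEs), as quoted in the text
cases = {
    "CNN, linear templates": (6, 35, 8),
    "CNN, nonlinear templates": (64, 429, 8),
    "Princeton Ocean Model": (17, 85, 6),
}

for name, (s1, sn, n) in cases.items():
    scaling, eff = parallel_efficiency(s1, sn, n)
    print(f"{name}: {scaling:.2f}x over {n} SPEs, efficiency {eff:.0%}")
```

With these rounded inputs the scaling factors come out at roughly 5.8, 6.7 and 5.0, i.e. between about 70% and 85% of ideal linear scaling.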
Figure 1 illustrates flow through a channel that includes two islands at the centre of the domain. The modelled ocean measures 1024 km × 1024 km; the north and south boundaries are closed and the east and west boundaries are open. The grid size is 128 × 128 with a resolution of 8 km. The simulation time step is 6 s, and 360 iterations are computed. In the future, further speedups might be achieved by using the full power of the Cell architecture on Cell processor-based IBM blades.
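As a quick consistency check on the simulation setup described above, the following sketch derives the domain extent and the simulated time span from the stated grid parameters. The variable names are illustrative, and the cell-update count is per horizontal layer only, since the number of vertical layers in the 3D model is not given here.

```python
grid_points = 128      # grid points per horizontal dimension
resolution_km = 8      # grid resolution in km
time_step_s = 6        # simulation time step in seconds
iterations = 360       # number of computed iterations

# Domain extent per side: 128 points at 8 km spacing
domain_km = grid_points * resolution_km

# Total simulated time covered by the run
simulated_s = time_step_s * iterations
simulated_min = simulated_s / 60

# Cell updates for one horizontal layer over the whole run
layer_updates = grid_points * grid_points * iterations

print(f"domain: {domain_km} km x {domain_km} km")
print(f"simulated time: {simulated_s} s ({simulated_min:.0f} min)")
print(f"cell updates per horizontal layer: {layer_updates}")
```

The grid size and resolution reproduce the stated 1024 km extent, and the 360 iterations of 6 s correspond to 36 minutes of simulated time.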