E = AI squared

by Marco Breiling, Bijoy Kundu (Fraunhofer Institute for Integrated Circuits IIS) and Marc Reichenbach (Friedrich-Alexander-Universität Erlangen-Nürnberg)

How small can we make the energy consumed by an artificial intelligence (AI) algorithm plus associated neuromorphic computing hardware for a given task? That was the theme of a German national competition on AI hardware-acceleration, which aimed to foster disruptive innovation. Twenty-seven academic teams, each made up of one or two partners from universities and research institutes, applied to enter the competition. Two of the eleven teams that were selected to enter were Fraunhofer IIS: ADELIA and Lo3-ML (the latter together with Friedrich-Alexander-University Erlangen-Nürnberg - FAU) [L1]. Finally Lo3-ML was one of the four national winners awarded by the German research minister Anja Karliczek for best energy efficiency.

A national competition, sponsored by the German research ministry BMBF [L2], challenged teams of researchers to classify two-minute-long electro-cardiogram (ECG) recordings as healthy or showing signs of atrial fibrillation. The two teams involving Fraunhofer IIS ran completely independently of each other and followed completely different approaches. “ADELIA” implemented a mixed-signal neural network (NN) accelerator, which is primarily made of crossbar arrays of novel analogue processing units (see Figure 1) and standard SRAM cells for storage, while “Lo3-ML” designed a basically digital accelerator with non-volatile Resistive RAM (RRAM) cells, for which analogue write-read-circuits were developed.

Figure 1: An analogue deep learning accelerator uses an analogue crossbar for energy-efficient analogue matrix multiplication, which is at the core of the neural network calculation for classifying the ECG signals. (Sources: Fraunhofer IIS and Creative Commons CC0).

In the ADELIA project, Fraunhofer IIS developed an analogue NN accelerator with a mixed-signal frontend. The NN is trained using quantized weights of seven levels (quasi-3bit), quantized gain (4 bit) and offset (5 bit) of a batch normalisation (BN), and a customised ReLU activation function (AF). Crossbar arrays made of novel analogue processing units (APUs) perform the primary computations of the NN: the multiply and accumulate operations (MACs). These APUs contain resistors and switches plus standard CMOS SRAM cells, which hold the trained weights after chip start-up. The non-linear activation circuits immediately drive (in analogue) their following NN layer, thus avoiding any conversion overhead from ADCs and DACs. Hardware-software co-design was heavily employed during training and circuit implementation of the NN. While the NN training iteratively took into account several hardware constraints such as quantized weights, quantized gains and offsets of BN, customised AF, and some component variations, the NN circuit was generated automatically from the software model using a tool developed during the project. Thus a fast analogue CMOS based energy efficient NN accelerator was developed that harnesses the parallel processing of analogue crossbar computing without inter layer data converters and without the need for memristor technology.

The second project, Lo3-ML, was a collaboration between Fraunhofer IIS and the chairs for Computer Architecture and Electronics Engineering at FAU. The project built upon their previous work with non-volatile RRAMs. These allow the accelerator core of the chip to be powered down while idle (which is most of the time) to save energy, while the pre-processing core is always awake and collects relatively slowly incoming data (e.g., at low sampling frequency). Once sufficient data is present, the accelerator core is woken up and does a quick processing before powering down again. This scheme saves up to 95% of energy compared to a conventional architecture, where both cores are always on. An RRAM cell can store binary or multi-level values [2]. A trade-off analysis showed that using ternary values both in the RRAM cells and as the weights of a severely quantized NN offered the best energy trade-off. However, for such extreme quantization we had to develop dedicated hardware-aware training algorithms, as corresponding tools were not available at that time. Moreover, to cope with this quantization, a Binary-Shift Batch Norn (BSBN) using only the shifting of values was introduced. The design and implementation of the chip was done iteratively. In order to achieve quick turn-around cycles, the design space exploration for both NN and HW architecture was partially automated including, for example, automatic generation of a bit-true simulation [1].

The projects were realised by intensive work in small teams over just 15 months. Despite operating as completely independent projects, both ended up selecting similar NNs with multiple convolutional layers followed by fully connected layers. The achieved metrics are 450 nJ per inference for ADELIA on a 22 nm Global Foundries-FDSOI technology and 270 nJ for Lo3-ML on a 130 nm IHP process. ADELIA produced one of the very few analogue NN accelerators worldwide that are actually ready for chip implementation. Analogue NN acceleration is generally considered as very energy-efficient and hence promising for the next generation of ultra-low-power AI accelerators. Although the task in the project was ECG analysis, the developed architectures can also be used for other applications and could be easily extended for more complex tasks. The design-flows developed in the two projects will be combined in the future and will serve as the basis for a highly automated design space exploration tool jointly for NN topology and HW architecture.

Links:
[L1] https://www.iis.fraunhofer.de/en/ff/kom/ai/neuromorphic.html
[L2] https://www.elektronikforschung.de/projekte/ki-sprung

References:
[1] J. Knödtel, M. Fritscher, et al.: “A Model-to-Netlist Pipline for Parametrized Testing of DNN Accelerators based on Systolic Arrays with Multibit Emerging Memories”, MOCAST Conf., 2020.
[2] M. Fritscher , J. Knödtel, et al.: “Simulating large neural networks embedding MLC RRAM as weight storage considering device variations”, Latin America Symposium on Circuits and System, 2021.

Please contact:
Marco Breiling
Fraunhofer Institute for Integrated Circuits IIS, Germany
This email address is being protected from spambots. You need JavaScript enabled to view it.

Sidebar

Contents

E = AI squared