System-Level Design of Fault-Tolerant Embedded Systems

by Alain Girault

Fault-tolerance is the ability of a system to maintain its functionality, even in the presence of faults. With the advent of ubiquitous computing and distributed embedded systems, it is becoming an aspect more and more crucial. We have provided new functionalities to the SynDEx system-level CAD software. SynDEx is ideal for optimising distributed real-time embedded systems and our new functionalities allow us to guarantee a specified fault-tolerance level for the generated embeddable code.

Our contribution to research in the fault-tolerant embedded systems consists of several scheduling/distribution heuristics. Their common feature is to take as an input two graphs: a data-flow graph ALG describing the algorithm of the application and a graph ARC describing the target distributed architecture (see figure).

To the left is an example of an algorithm graph: it has nine operations (represented by circles) and 11 data-dependences (represented by green arrows). Among the operations, one is a sensor operation (I), one is an actuator operation (O), while the seven others are computations (A to G). Below to the right is an example of an architecture graph: it has three processors (P1, P2, and P3) and three point-to-point communication links (L1.2, L1.3, and L2.3).

Also shown is a table giving the worst-case execution time of each operation onto each processor and the worst-case transmission time of each data-dependence onto each communication link. The architecture being a priori heterogeneous, these need not be identical. Below is an example of such a table for the operations of ALG. The infinity sign expresses the fact that the operation I cannot be executed by the processor P3, for instance, to account for the requirement of certain dedicated hardware.

From these three inputs, the heuristic distributes the operations of ALG onto the processors of ARC and schedules them statically, together with the communications induced by these scheduling decisions. The output of the heuristic is therefore a static schedule from which embeddable code can be generated.

Our fault hypothesis is that the hardware components are fail silent, meaning that a component is either healthy and works fine, or is faulty and produces no output at all. Recent studies on modern hardware architectures have shown that a fail-silent behaviour can be achieved at a reasonable cost, so our fault hypothesis is reasonable.

Our contribution consists of the definition of several new scheduling/distribution heuristics in order to generate static schedules that are also tolerant of a fixed number of hardware components (processors and/or communication links) faults. They are implemented inside SynDEx, as an alternative to its own default heuristics (called DSH: Distribution Scheduling Heuristic):

FTBAR (Fault-Tolerant Based Active Replication) generates a static schedule that tolerates Npf processor faults by replicating actively all the operations of the algorithm graph ALG exactly Npf+1 times. It works with target architectures having either point-to-point communication links or buses, but assumes that all the communication links are reliable. FTBAR tries to minimise the critical path of the obtained schedule w.r.t. the known WCETs of the operations onto the various processors of the architecture.
RBSA (Reliable Bicriteria Scheduling Algorithm) also generates a reliable and static schedule by actively replicating the operations of the algorithm graph. The difference with FTBAR is that the number of times an operation is replicated depends on the individual reliability of the processors it is scheduled on and on the overall reliability level required by the user. RBSA tries both to minimise the critical path of the obtained schedule and to maximise its reliability (these are the two criteria of this heuristic).
GRT + eDSH (Graph Redundancy Transformation + extended Distribution Scheduling Heuristic) generates a static schedule that tolerates Npf processor faults and Nlf communication link faults. It first transforms the algorithm graph ALG into another data-flow graph ALG* by adding redundancy into it such that the required number of hardware component faults will be tolerated. During this phase, it also generates exclusion relations between subsets of operations that must be scheduled onto distinct processors, and subsets of data dependences that must be routed through disjoint paths. Then it uses an extended version of the DSH heuristics to generate a static schedule of ALG* onto ARC, w.r.t. the exclusion relations generated during the first phase.
FPMH (Fault Patterns Merging Heuristic) is an original approach to generate a static schedule of ALG onto ARC, tolerant to a given list of fault patterns. A fault pattern is a subset of the architecture's component that can fail simultaneously. Our method involves two steps. First, for each fault pattern, we generate the corresponding reduced architecture (the architecture from which the pattern's component has been removed) and we generate a static schedule of ALG onto this reduced architecture (we use the basic DSH heuristic of SYnDEx for this). From N fault patterns we therefore obtain N basic schedules. The second step consists of the merging of these N basic schedules into one static schedule that will be, by construction, tolerant to all the specified fault patterns.

Links:
Fault-tolerance: http://pop-art.inrialpes.fr/~girault/Projets/FT/
SynDEx: http://www.syndex.org

Please contact:
Alain Girault, INRIA Rhône-Alpes
Tel: +33 476 61 53 51
E-mail: Alain.Giraultinrialpes.fr