I Introduction
Convolutional Neural Networks (CNNs) have achieved breakthroughs in various tasks, including classification [resnet], detection [ssd], and segmentation [long2015fully]. Due to their promising performance, CNNs have been utilized in various safety-critical applications, such as autonomous driving, intelligent surveillance, and identification. Meanwhile, driven by recent academic and industrial efforts, neural network accelerators based on various hardware platforms (e.g., Application Specific Integrated Circuits (ASIC) [chen2014diannao], Field Programmable Gate Arrays (FPGA) [qiu2016going], Resistive Random-Access Memory (RRAM) [chi2016prime]) have been rapidly evolving.
The robustness and reliability issues of deploying neural networks onto embedded devices for safety-critical applications are attracting more and more attention. There is a large stream of algorithmic studies on various robustness-related characteristics of NNs, e.g., adversarial robustness [szegedy2013intriguing], data poisoning [shafahi2018poison], and interpretability [zhang2018interpreting]. However, no hardware models are taken into consideration in these studies. Besides the issues from the purely algorithmic perspective, there exist hardware-related reliability issues when deploying NNs onto today's embedded devices. With the downscaling of CMOS technology, circuits become more sensitive to cosmic radiation and radioactive impurities [henkel2013reliable]. Voltage instability, aging, and temperature variations are also common effects that can lead to errors. As for the emerging metal-oxide RRAM devices, due to the immature technology, they suffer from many types of device faults [chen2015rramdefect], among which hard faults such as Stuck-at-Faults (SAFs) damage the computing accuracy severely and cannot be easily mitigated [Xia2018StuckatFT]. Moreover, malicious attackers can attack edge devices by embedding hardware Trojans, manipulating backdoors, and injecting memory faults [zhao2019memory].
Recently, some studies [liu2017rescuing, vialatte2017astudy, schorn2018accurate] analyzed the fault sensitivity of NN models. They proposed to predict whether a layer or a neuron is sensitive to faults and to protect the sensitive ones. For fault tolerance, a straightforward way is to introduce redundancy in the hardware. Triple Modular Redundancy (TMR) is a commonly used but expensive method to tolerate a single fault [bolchini2007tmr, she2017reducing, zhao2019finegrained]. Studies [Xia2018StuckatFT, liu2017rescuing] proposed various redundancy schemes for tolerating Stuck-at-Faults in RRAM-based computing systems. For increasing the algorithmic fault resilience capability, studies [he2019noise, hacene2019training] proposed to use fault-tolerant training (FTT), in which random faults are injected during the training process.

Although redesigning the hardware for reliability is effective, it is not flexible and inevitably introduces a large overhead. It would be better if the issues could be mitigated as far as possible from the algorithmic perspective. Existing methods are mainly concerned with designing training methods and analyzing the weight distribution [schorn2018accurate, he2019noise, hacene2019training]. Intuitively, the neural architecture might also be important for the fault tolerance characteristics [arechiga2018robustness, li2017understanding], since it determines the "path" of fault propagation. To verify this intuition, the accuracies of the baselines under a random bit-bias feature fault model (formalized in Sec. III-D) are shown in Table I, and the results under the SAF weight fault model (formalized in Sec. III-E) are shown in Table II. These preliminary experiments on the CIFAR-10 dataset show that the fault tolerance characteristics vary among neural architectures, which motivates the employment of the neural architecture search (NAS) technique in the design of fault-tolerant neural architectures. We emphasize that our work is orthogonal to most of the previous methods based on hardware or mapping strategy design. To the best of our knowledge, our work is the first to increase the algorithmic fault resilience capability by optimizing the NN architecture.
TABLE I: Baseline performance under the random bit-bias feature fault model

Model         | Acc(%)           | #Params | #FLOPs
--------------|------------------|---------|-------
ResNet-20     | 94.7/63.4/10.0   | 11.2M   | 1110M
VGG-16        | 93.1/21.4/10.0   | 14.7M   | 626M
MobileNet-V2  | 92.3/10.0/10.0   | 2.3M    | 182M
TABLE II: Baseline performance under the SAF weight fault model

Model         | Acc(0/4%/8%)     | #Params | #FLOPs
--------------|------------------|---------|-------
ResNet-20     | 94.7/64.8/17.8   | 11.2M   | 1110M
VGG-16        | 93.1/45.7/14.3   | 14.7M   | 626M
MobileNet-V2  | 92.3/26.2/11.7   | 2.3M    | 182M
In this paper, we employ NAS to discover fault-tolerant neural network architectures against feature faults and weight faults, and demonstrate its effectiveness through experiments. The main contributions of this paper are as follows.

We analyze the possible faults in various types of NN accelerators (ASIC-based, FPGA-based, and RRAM-based), and formalize statistical fault models from the algorithmic perspective. Based on this analysis, we adopt the MAC-i.i.d. Bit-Bias (MiBB) model and the arbitrary-distributed Stuck-at-Fault (adSAF) model in the neural architecture search for tolerating feature faults and weight faults, respectively.

We establish a multi-objective neural architecture search framework. On top of this framework, we propose two methods to discover neural architectures with better reliability: FT-NAS (NAS with a fault-tolerant multi-objective), and FTT-NAS (NAS with a fault-tolerant multi-objective and fault-tolerant training (FTT)).

We employ FT-NAS and FTT-NAS to discover architectures for tolerating feature faults and weight faults. The discovered architectures, F-FTT-Net and W-FTT-Net, have comparable or fewer floating-point operations (FLOPs) and parameters, and achieve better fault resilience capabilities than the baselines. Under the same fault settings, F-FTT-Net discovered under the feature fault model achieves an accuracy of 86.2% (vs. 68.1% achieved by MobileNet-V2), and W-FTT-Net discovered under the weight fault model achieves an accuracy of 69.6% (vs. 60.8% achieved by ResNet-20). The ability of W-FTT-Net to defend against several other types of weight faults is also illustrated by experiments.

We analyze the discovered architectures, and discuss how the weight quantization range, the capacity of the model, and the connection pattern influence the fault resilience capability of a neural network.
The rest of this paper is organized as follows. Related studies and preliminaries are introduced in Section II. In Section III, we conduct a comprehensive analysis of the possible faults and formalize the fault models. In Section IV, we elaborate on the design of the fault-tolerant NAS system. In Section V, the effectiveness of our method is illustrated by experiments, and the insights are also presented. Finally, we discuss and conclude our work in Section VI and Section VII.
II Related Work and Preliminary
II-A Convolutional Neural Network
Usually, a convolutional neural network is constructed by stacking multiple convolution layers and optional pooling layers, followed by fully-connected layers. Denoting the input feature map (IFM), the before-activation output feature map, the output feature map (OFM, i.e., the activations), the weights, and the bias of the l-th convolution layer as X^(l), O^(l), Y^(l), W^(l), and b^(l), respectively, the computation can be written as:

O^(l) = W^(l) ⊛ X^(l) + b^(l),  Y^(l) = g(O^(l)),    (1)

where ⊛ is the convolution operator, and g is the activation function, for which ReLU (g(x) = max(x, 0)) is the most common choice. From now on, we omit the superscript (l) for simplicity.

II-B NN Accelerators and Fixed-point Arithmetic
With dedicated data flow designs for efficient neural network processing, FPGA-based NN accelerators can achieve at least 10x better energy efficiency than GPUs [qiu2016going, guo2019survey], and ASIC-based accelerators can achieve even higher efficiency [chen2014diannao]. Besides, RRAM-based Computing Systems (RCSes) are promising solutions for energy-efficient brain-inspired computing [chi2016prime], due to their capability of performing matrix-vector multiplications (MVMs) in memory. Existing studies have shown that RRAM-based Processing-In-Memory (PIM) architectures can improve energy efficiency by orders of magnitude compared with both GPU and ASIC solutions, as they eliminate the large data movements of bandwidth-bounded NN applications [chi2016prime]. For detailed and formal hardware architecture descriptions, we refer the readers to the references listed above.

Currently, fixed-point arithmetic units are implemented by most NN accelerators, as 1) they consume much fewer resources and are much more efficient than floating-point ones [guo2019survey]; and 2) NN models are proven to be insensitive to quantization [qiu2016going, hubara2017quantized]. Consequently, quantization is usually applied before a neural network model is deployed onto edge devices. To keep consistent with the actual deployment scenario, our simulation incorporates 8-bit dynamic fixed-point quantization for the weights and activations. More specifically, independent step sizes are used for the weights and activations of different layers. Denoting the fraction length and bit-width of a tensor as F and Q, the step size (resolution) of the representation is 2^{−F}. For common CMOS platforms, in which two's complement representation is used for numbers, the representation range of both weights and features is

R_W = R_F = [−2^{Q−1} · 2^{−F}, (2^{Q−1} − 1) · 2^{−F}].    (2)
As for RRAM-based NN platforms, two separate crossbars are used for storing positive and negative weights [chi2016prime]. Thus the representation range of the weights is

R_W = [−(2^{Q−1} − 1) · 2^{−F}, (2^{Q−1} − 1) · 2^{−F}].    (3)
For the feature representation in RRAM-based platforms, by assuming that the Analog-to-Digital Converters (ADCs) and Digital-to-Analog Converters (DACs) have enough precision and that the CMOS bit-width is Q bits, the representation range of the features in the CMOS circuits is

R_F = [−2^{Q−1} · 2^{−F}, (2^{Q−1} − 1) · 2^{−F}].    (4)
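To make the quantization scheme concrete, the following minimal NumPy sketch implements Q-bit fixed-point quantization with fraction length F, covering both the two's complement range of Eq. (2) and the symmetric RRAM range of Eq. (3). The function name and default values (Q = 8, F = 4) are illustrative assumptions, not part of any accelerator toolchain:

```python
import numpy as np

def quantize_fixed(x, bitwidth=8, frac_len=4, symmetric=False):
    """Quantize to Q-bit fixed point with fraction length F (step 2^-F).

    symmetric=False: two's complement range [-2^(Q-1)*s, (2^(Q-1)-1)*s]
                     (CMOS platforms, cf. Eq. (2))
    symmetric=True : symmetric range +/-(2^(Q-1)-1)*s (RRAM positive/negative
                     crossbar pair, cf. Eq. (3))
    """
    step = 2.0 ** (-frac_len)
    if symmetric:
        lo, hi = -(2 ** (bitwidth - 1) - 1), 2 ** (bitwidth - 1) - 1
    else:
        lo, hi = -(2 ** (bitwidth - 1)), 2 ** (bitwidth - 1) - 1
    # round to the nearest representable code, then clip to the range
    q = np.clip(np.round(x / step), lo, hi)
    return q * step
```

For example, with Q = 8 and F = 4, values saturate at −8.0 (or −7.9375 in the symmetric case) and 7.9375, with a resolution of 0.0625.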
II-C Fault Resilience for CMOS-based Accelerators
Prior studies [henkel2013reliable, borkar2005designing, slayman2011soft] revealed that advanced nanotechnology makes circuits more vulnerable to soft errors. Unlike hard errors, soft errors do not damage the underlying circuits, but instead trigger an upset of the logic state. The dominant cause of soft errors in CMOS circuits is radioactive events, in which a single particle strikes an electronic device. [arechiga2018robustness, libano2018selective] explored how Single-Event Upset (SEU) faults impact FPGA-based CNN computation systems.
TMR is a commonly used approach to mitigate SEUs [bolchini2007tmr, she2017reducing, zhao2019finegrained]. Traditional TMR methods are agnostic to the NN applications and introduce a large overhead. To exploit the characteristics of NN applications to reduce the overhead, one should understand the behavior of NN models under computational faults. [vialatte2017astudy] analyzed the layer-wise sensitivity of NN models under two hypothetical feature fault models. [libano2018selective] proposed to triplicate only the vulnerable layers after a layer-wise sensitivity analysis, reducing the LUT overhead for an NN model on the Iris Flower dataset from about 200% (full TMR) to 50%. [schorn2018accurate] conducted sensitivity analysis at the individual neuron level. [li2017understanding] found that the impacts and propagation of computational faults in an NN computation system depend on the hardware datapath, the model topology, and the type of layers. These methods analyzed the sensitivity of existing NN models at different granularities and exploited the resilience characteristics to reduce the hardware overhead for reliability. Our methods are complementary, as they discover NN architectures with better algorithmic resilience capability.
To avoid the accumulation of the persistent soft errors in FPGA configuration registers, the scrubbing technique is applied by checking and partially reloading the configuration bits [bolchini2007tmr, xilinx2000partial]. From the algorithmic perspective, [hacene2019training] demonstrated the effectiveness of faulttolerant training (FTT) in the presence of SRAM bit failures.
II-D Fault Resilience for RRAM-based Accelerators
RRAM devices suffer from many types of device faults [chen2015rramdefect], among which the commonly occurring SAFs are shown to cause severe degradation in the performance of mapped neural networks [Xia2018StuckatFT]. RRAM cells containing SAF faults get stuck at the high-resistance state (SAF0) or the low-resistance state (SAF1), thereby causing the weight to be stuck at the lowest or highest magnitude of the representation range, respectively. Besides these hard errors, resistance programming variation [le2019resistive] is another source of faults for NN applications [liu2015vortex].
For the detection of SAFs, [Kannan2015Modeling, Kannan2013Sneak] proposed fault detection methods that provide high fault coverage, and [xia2017fault] proposed an online fault detection method that periodically detects the current distribution of faults.
Most of the existing studies on improving the fault resilience of RRAM-based neural computation systems focus on designing mapping and retraining methods. [Xia2018StuckatFT, liu2017rescuing, xia2017fault, chen2017acceleratorfriendly] proposed different mapping strategies and the corresponding hardware redundancy designs. After detecting the distribution of the faults and variations, they proposed to retrain (i.e., fine-tune) the NN model to tolerate the detected faults, thereby exploiting the intrinsic fault resilience capability of NN models. To overcome the programming variations, [liu2015vortex] calculated the calibrated programming target weights with a log-normal resistance variation model, and proposed to map sensitive synapses onto cells with small variations. From the algorithmic perspective, [liu2019afault] proposed to use error-correcting output codes (ECOC) to improve the NN's resilience to resistance variations and SAFs.

II-E Neural Architecture Search
Neural architecture search, as an automatic neural network architecture design method, has recently been applied to design model architectures for image classification and language models [nasnet, enas, DARTS]. The architectures discovered by NAS techniques have demonstrated performance surpassing that of manually designed ones. NASNet [nasnet] used a recurrent neural network (RNN) controller to sample architectures, trained them, and used the final validation accuracy to instruct the learning of the controller. Instead of using a reinforcement learning (RL)-learned RNN as the controller, [DARTS] used a relaxed differentiable formulation of the neural architecture search problem and applied a gradient-based optimizer to optimize the architecture parameters, while [real2019aging] used evolutionary methods that sample new architectures by mutating the architectures in the population. Although NASNet [nasnet] is powerful, its search process is extremely slow and computationally expensive. To address this pitfall, many methods have been proposed to speed up the performance evaluation in NAS: [baker2017accelerating] incorporated learning curve extrapolation to predict the final performance after a few epochs of training; [real2019aging] sampled architectures by mutating existing models and initialized the weights of the sampled architectures by inheriting from the parent model; [enas] shared the weights among different sampled architectures and used the shared weights to evaluate each sampled architecture.

The goal of the NAS problem is to discover the architecture that maximizes some predefined objectives. The process of the original NAS algorithm goes as follows. At each iteration, an architecture α is sampled from the architecture search space A. This architecture is then assembled as a candidate network N(α, w), where w denotes the weights to be trained. After training the weights w on the training data split D_t, the reward of the candidate network evaluated on the validation data split D_v is used to instruct the sampling process. In its purest form, the NAS problem can be formalized as:
max_α E_{x_v ∼ D_v} [ R(x_v, N(α, w*(α))) ]
s.t.  w*(α) = argmin_w E_{x_t ∼ D_t} [ L(x_t, N(α, w)) ],    (5)

where ∼ is the sampling operator, E_{x ∼ D} denotes the expectation with respect to the data distribution D, R denotes the evaluated reward used to instruct the sampling process, and L denotes the loss criterion for back-propagation during the training of the weights w.
Originally, for the performance evaluation of each sampled architecture α, one needs to find the corresponding w*(α) by fully training the candidate network from scratch. This process is extremely slow, so shared-weights evaluation is commonly used to accelerate it. In shared-weights evaluation, each candidate architecture is a subgraph of a super network and is evaluated using a subset of the super network's weights. The shared weights of the super network are updated along the search process.
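The reward-guided sampling loop described above can be caricatured in a short, library-free sketch. The multiplicative-weights "controller" below is a deliberately simplified, hypothetical stand-in for the RNN controller, and `reward_fn` stands in for the shared-weights evaluation of R; all names are illustrative:

```python
import random

def nas_search(search_space, reward_fn, n_iters=300, seed=0):
    """Toy illustration of the sampling-based NAS loop (cf. Eq. (5)).

    search_space: list of candidate architectures (stand-in for the space A)
    reward_fn: maps an architecture to a scalar reward in [0, 1]
    """
    rng = random.Random(seed)
    scores = {a: 1.0 for a in search_space}
    for _ in range(n_iters):
        total = sum(scores.values())
        # sample an architecture with probability proportional to its score
        r, acc, arch = rng.uniform(0, total), 0.0, None
        for a, s in scores.items():
            acc += s
            if r <= acc:
                arch = a
                break
        if arch is None:          # numerical edge case fallback
            arch = a
        # reward-guided update: good architectures get sampled more often
        scores[arch] *= (1.0 + 0.1 * reward_fn(arch))
    return max(scores, key=scores.get)
```

A real controller would instead update the parameters of a sampling distribution (e.g., via REINFORCE), but the feedback structure — sample, evaluate, reinforce — is the same.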
III Fault Models
In Sec. III-A, we motivate and discuss the formalization of application-level statistical fault models. Platform-specific analyses are conducted in Sec. III-B and Sec. III-C. Finally, the MAC-i.i.d. Bit-Bias (MiBB) feature fault model and the arbitrary-distributed Stuck-at-Fault (adSAF) weight fault model are described in Sec. III-D and Sec. III-E; these are the models used in the neural architecture search process. The analyses in this part are summarized in Fig. 4 (a) and Table III.
III-A Application-Level Modeling of Computational Faults
Computational faults do not necessarily result in functional errors [henkel2013reliable, li2017understanding]. For example, a neural network for classification tasks usually outputs a class probability vector, and our work regards an outcome as a functional error if and only if the top-1 decision differs from the golden result. Due to the complexity of NN computations and the differing functional error definitions, it is very inefficient to incorporate gate-level fault injection or propagation analysis into the training or architecture search process. Therefore, to evaluate and further boost the algorithmic resilience of neural networks to computational faults, application-level fault models should be formalized.
From the algorithmic perspective, the faults fall into two categories: weight faults and feature faults. In this section, we analyze the possible faults in various types of NN accelerators, and formalize the statistical feature and weight fault models. A summary of these fault models is shown in Table III.
Note that we focus on the computational faults along the datapath inside the NN accelerator that can be modeled and mitigated from the algorithmic perspective. Faults in the control units and in other chips in the system are not considered; see the "limitation of application-level fault models" discussion in Sec. VI.
[Table III summarizes the analyzed fault models. For each platform (RRAM, FPGA, ASIC), it lists the error source (e.g., SAFs, programming variations, Single-Event Errors, overstress, voltage scaling), the faulty component (single-bit or multi-bit memristor cells, SRAM, LUTs, combinational logic), the location (crossbar, weight buffer, feature buffer, PE), whether the error is hard (H) or soft (S) and persistent (P) or transient (T), the common mitigation techniques (e.g., ECC, TMR, redundancy with remapping/retraining), whether it manifests at the NN application level as a weight (W) or feature (F) fault, and the corresponding simplified statistical model.]

The table's parameters include the standard deviation of the RRAM programming variations, the soft and hard error rates of memory elements, the soft and hard error rates of logic elements, an amplifying coefficient for the feature error rate due to multiple involved computational components, and a coefficient that abstracts the error accumulation effects over time. Abbreviations: SEE refers to Single-Event Errors, e.g., Single-Event Burnout (SEB), Single-Event Upset (SEU), etc.; "overstress" includes conditions such as high temperature, voltage, or physical stress; VS refers to voltage (down)scaling used for energy efficiency; SB-cell and MB-cell refer to single-bit and multi-bit memristor cells, respectively; CL gates refer to combinational logic gates; 3R refers to various Redundancy schemes and the corresponding Remapping/Retraining techniques; PS loop refers to the programming-sensing loop during memristor programming; TMR refers to Triple Modular Redundancy; DICE refers to Dual Interlocked Cell.

III-B Analysis of CMOS-based Platforms: ASIC and FPGA
The possible errors in CMOS-based platforms are illustrated in Fig. 1. Soft errors that happen in the memory elements or the logic elements can lead to transient faulty outputs in ASICs. Compared with logic elements (e.g., combinational logic gates, flip-flops), memory elements are more susceptible to soft errors [slayman2011soft]: an unprotected SRAM cell usually has a larger bit soft error rate (SER) than a flip-flop. Since the occurrence probability of hard errors is much smaller than that of soft errors, we focus on the analysis of soft errors, even though hard errors lead to permanent failures.
The soft errors in the weight buffer can be modeled as i.i.d. random bit-flips of the weights. Given the original value w, a faulty value w̃ under the random bit-flip (BF) model is distributed as

w̃_{(q)} = w_{(q)} ⊕ e_q,  e_q ∼ Bernoulli(p_b),  q = 0, …, Q − 1,    (6)

where w_{(q)} denotes the q-th bit of the two's complement representation of w, e_q denotes whether a bit-flip occurs at bit position q, and ⊕ is the XOR operator.
By assuming that an error occurs at each bit with an i.i.d. bit SER of p_b, each Q-bit weight has an i.i.d. probability p_w = 1 − (1 − p_b)^Q ≈ Q·p_b of encountering an error, as p_b ≪ 1. It is worth noting that throughout the analysis, we assume that the SERs of all components are ≪ 1; hence the error rate at each level is approximated as the sum of the error rates of the independent subcomponents. As each weight encounters errors independently, a faulty weight tensor is distributed as i.i.d. random bit-flip (iBF): W̃ ∼ iBF(W; p_b), where W denotes the golden weights. [reagen2019ares] showed that the iBF model captures the bit error behavior exhibited by real SRAM hardware.
The soft errors in the feature buffer are modeled similarly as i.i.d. random bit-flips, with a fault probability of approximately Q·p_b for Q-bit feature values. The distribution of the faulty output feature map (OFM) values can be written as F̃ ∼ iBF(F; p_b), where F denotes the golden results.
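As a concrete illustration of the iBF model, the following NumPy sketch injects i.i.d. bit-flips into an already-quantized tensor (weights or features). It assumes 8-bit two's complement quantization with fraction length 4 by default; the function name and parameters are our own illustrative choices:

```python
import numpy as np

def inject_ibf(x_q, p_bit, bitwidth=8, frac_len=4, seed=None):
    """i.i.d. random bit-flip (iBF) injection on a fixed-point tensor.

    x_q: tensor already quantized to Q-bit fixed point with fraction length F
    p_bit: per-bit soft error rate p_b; each of the Q stored bits flips
           independently, so a value is faulty with prob. 1-(1-p_b)^Q ~ Q*p_b
    """
    rng = np.random.default_rng(seed)
    step = 2.0 ** (-frac_len)
    # two's complement integer codes in [0, 2^Q)
    codes = np.round(x_q / step).astype(np.int64) % (2 ** bitwidth)
    flips = rng.random(codes.shape + (bitwidth,)) < p_bit
    # XOR each bit position q with its flip indicator e_q  (cf. Eq. (6))
    for q in range(bitwidth):
        codes ^= (flips[..., q].astype(np.int64) << q)
    # map codes back to signed fixed-point values
    signed = np.where(codes >= 2 ** (bitwidth - 1), codes - 2 ** bitwidth, codes)
    return signed * step
```

Note that a flip of the sign bit moves a value across the whole representation range, which is why even small per-bit error rates can degrade accuracy noticeably.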
FPGA-based implementations are often more vulnerable to soft errors than their ASIC counterparts [asadi2007analytical]. Since the majority of an FPGA chip's area is filled with memory cells, the overall SER is much higher. Moreover, soft errors occurring in the logic configuration bits lead to persistent faulty computation, rather than the transient faults seen in ASIC logic. Persistent errors cannot be mitigated by simple retry methods and lead to statistically significant performance degradation. Furthermore, since persistent errors accumulate if no correction is made, the equivalent error rate keeps increasing over time. We abstract this effect with a monotonically increasing function p_p(t), where the subscript p denotes "persistent" and t denotes time.
Let us recap how one convolution is mapped onto an FPGA-based accelerator, to see what the configuration-bit errors can cause on the OFM values. If the dimension of the convolution kernel is (c, k_h, k_w) (channel, kernel height, and kernel width, respectively), there are N_MAC = c × k_h × k_w additions needed for computing one feature value. We assume that the add operations are spatially expanded onto adder trees constructed by LUTs, i.e., no temporal reuse of adders is used for computing one feature value. That is to say, the add operations are mapped onto different hardware adders (see the "hardware" discussion in Sec. VI) and encounter errors independently. The per-feature error rate can therefore be approximated by the adder-wise SER times N_MAC. Now, let us dive into the adder-level computation: in a 1-bit adder with scale 2^q, a bit-flip in one LUT bit adds a bias of ±2^q to the output value if the input bit signals match the address of this LUT bit. If each LUT cell has an i.i.d. SER of p_l, then in a Q-bit adder with fraction length F, the distribution of the faulty output ỹ under the random bit-bias (BB) fault model can be written as

ỹ = BB(y; p_l) = y + 2^{−F} Σ_{q=0}^{Q−1} e_q β_q 2^{q},  e_q ∼ Bernoulli(p_l),  β_q ∼ U{−1, +1}.    (7)

As for the result of an adder tree constructed by multiple LUT-based adders, since the probability that multiple bit-bias errors co-occur is orders of magnitude smaller, we ignore the accumulation of biases that are smaller than the OFM quantization resolution 2^{−F}. Consequently, the OFM feature values before the activation function follow an i.i.d. random bit-bias distribution, where Q and F are the bit-width and fraction length of the OFM values, respectively.
We can make an intuitive comparison of the equivalent feature error rates induced by LUT soft errors and feature buffer soft errors. As the majority of FPGAs are SRAM-based, considering the bit SERs of LUT cells and BRAM cells to be close, we can see that the feature error rate induced by LUT errors is amplified by a factor on the order of N_MAC. As we have discussed, N_MAC ≫ 1, so the performance degradation induced by LUT errors can be significantly larger than that induced by feature buffer errors.
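The amplification argument above can be made concrete with a short calculation. The numbers below (bit SER of 1e-6, 8-bit features, a 3x3 convolution with 128 input channels) are hypothetical values chosen purely for illustration:

```python
def per_feature_error_rate(p, n):
    """Prob. that at least one of n independent components errs: 1-(1-p)^n ~ n*p."""
    return 1.0 - (1.0 - p) ** n

# Hypothetical illustration: bit SER p = 1e-6, Q = 8 feature bits,
# and a 3x3 convolution with 128 input channels (N_MAC = 128*3*3 = 1152).
p_bit, q_bits, n_mac = 1e-6, 8, 128 * 3 * 3
buffer_rate = per_feature_error_rate(p_bit, q_bits)  # feature-buffer errors
lut_rate = per_feature_error_rate(p_bit, n_mac)      # LUT (adder-tree) errors
# lut_rate / buffer_rate ~ n_mac / q_bits = 144: the amplification factor
```

Under these assumed numbers, the LUT-induced per-feature error rate exceeds the buffer-induced rate by roughly two orders of magnitude, which matches the qualitative argument in the text.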
III-C Analysis of PIM-based Platforms: RRAM as an Example
In an RRAM-based Computing System (RCS), compared with the accompanying CMOS circuits, the RRAM crossbar is much more vulnerable to various non-ideal factors. In multi-bit RRAM cells, studies have shown that the distribution of the resistance due to programming variance is either Gaussian or log-normal [le2019resistive]. As each weight is programmed as the conductance of a memristor cell, the weight can be seen as being distributed as reciprocal-normal or log-normal. Besides the soft errors, common hard errors such as SAFs, caused by fabrication defects or limited endurance, can result in severe performance degradation [Xia2018StuckatFT]. SAFs occur frequently in today's RRAM crossbars: as reported by [chen2015rramdefect], the overall SAF ratio (counting both SAF0 and SAF1) can be larger than 10% in a fabricated RRAM device. The statistical model of SAFs in single-bit and multi-bit RRAM devices is formalized in Sec. III-E.

As the RRAM crossbars also serve as the computation units, some non-ideal factors (e.g., IR-drop, wire resistance) could be abstracted as feature faults. They are not considered in this work, since modeling these effects depends highly on the implementation (e.g., crossbar dimension, mapping strategy) and on hardware-in-the-loop testing [he2019noise].
III-D Feature Fault Model
As analyzed in Sec. III-B, the soft errors in LUTs are the relatively more pernicious source of feature faults, since 1) the SER is usually much higher than the hard error rate; 2) these errors are persistent if no correction is made; and 3) the per-feature equivalent error rate is amplified because multiple adders are involved in computing each feature. Therefore, we use the MiBB fault model in our exploration of mitigating feature faults.
We have p = 1 − (1 − p_m)^{N_MAC} ≈ N_MAC · p_m as the probability of an error occurring at each position in the OFM, where p_m is defined as the per-MAC error rate. Denoting the dimension of the OFM as (c_o, h, w) (channel, height, and width, respectively) and the dimension of each convolution kernel as (c, k_h, k_w), the computation of a convolution layer under this fault model can be written as

Ỹ = g(W ⊛ X + b + θ ⊙ (−1)^S ⊙ 2^{P − F}),
θ ∼ Bernoulli(p),  P ∼ U{0, …, Q − 1},  S ∼ Bernoulli(0.5) (element-wise over the OFM),    (8)

where θ is the mask indicating whether an error occurs at each feature map position, P represents the bit position of the bias, and S represents the bias sign. Note that this formulation is not equivalent to the random bit-bias formalization in Eq. (7) and is adopted for efficient simulation. These two formulations are close when the odds that two errors take effect simultaneously are small (N_MAC · p_m ≪ 1). This fault model is referred to as the MAC-i.i.d. Bit-Bias model (abbreviated as MiBB). An example of injecting feature faults is illustrated in Fig. 2.

Intuitively, convolution computations that need fewer MACs might be more immune to the faults, as the equivalent error rate at each OFM location is lower.
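A simulation-oriented sketch of the MiBB injection, mirroring the mask/bit-position/sign sampling of Eq. (8), is shown below. The function name and defaults (Q = 8, F = 4) are our own illustrative assumptions:

```python
import numpy as np

def inject_mibb(out_fmap, p_mac, n_mac, bitwidth=8, frac_len=4, seed=None):
    """MAC-i.i.d. Bit-Bias (MiBB) feature fault injection (cf. Eq. (8)).

    out_fmap: pre-activation output feature map O = W * X + b
    p_mac: per-MAC error rate p_m
    n_mac: number of MACs per output feature, N_MAC = c * k_h * k_w
    Each output position is hit with prob. 1-(1-p_m)^N_MAC; a hit adds a
    random signed power-of-two bias 2^(q-F) to the value.
    """
    rng = np.random.default_rng(seed)
    p_pos = 1.0 - (1.0 - p_mac) ** n_mac            # per-feature error rate
    mask = rng.random(out_fmap.shape) < p_pos       # error mask (theta)
    q = rng.integers(0, bitwidth, out_fmap.shape)   # bit position of the bias
    sign = rng.choice([-1.0, 1.0], out_fmap.shape)  # bias sign
    bias = sign * (2.0 ** (q - frac_len))
    return out_fmap + mask * bias
```

In fault-tolerant training, an injection like this would be applied to each convolution's pre-activation output in the forward pass.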
III-E Weight Fault Model
As RRAM-based accelerators suffer from a much higher weight error rate than CMOS-based ones, the Stuck-at-Faults in RRAM crossbars are mainly considered in the setup of the weight fault model. We assume that the underlying platform is RRAM with multi-bit cells, and adopt the commonly used mapping scheme in which separate crossbars are used for storing positive and negative weights [chi2016prime]. That is to say, when an SAF0 fault causes a cell to be stuck at HRS, the corresponding logical weight is stuck at 0. When an SAF1 fault causes a cell to be stuck at LRS, the weight is stuck at the negative or positive bound of the representation range, depending on its sign.
The computation of a convolution layer under the SAF weight fault model can be written as

Ỹ = g(W̃ ⊛ X + b),  W̃ = W ⊙ (1 − M) + M ⊙ T,  T = S ⊙ sign(W) · R_W,    (9)

where R_W refers to the representation bound in Eq. (3), M is the mask indicating whether a fault occurs at each weight position, S is the mask representing the SAF type (SAF0 or SAF1) at the faulty weight positions, and T is the mask of faulty target values (0 or ±R_W). Every single weight has an i.i.d. probability p_0 of being stuck at 0, and an i.i.d. probability p_1 of being stuck at the positive or negative bound of the representation range, for positive and negative weights, respectively. An example of injecting weight faults is illustrated in Fig. 3.
Note that this weight fault model, referred to as the arbitrary-distributed Stuck-at-Fault model (adSAF), is much harder to defend against than SAF faults with a specific, known defect map. A neural network model that behaves well under the adSAF model is expected to achieve high reliability across different specific SAF defect maps.
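The adSAF weight corruption of Eq. (9) can be simulated with a few lines of NumPy. The parameter names `p_saf0`, `p_saf1`, and `w_bound` are illustrative; `w_bound` plays the role of the representation bound R_W:

```python
import numpy as np

def inject_adsaf(w, p_saf0, p_saf1, w_bound, seed=None):
    """Arbitrary-distributed Stuck-at-Fault (adSAF) injection (cf. Eq. (9)).

    w: quantized weight tensor (positive/negative crossbar mapping)
    p_saf0: prob. a cell is stuck at HRS -> logical weight stuck at 0
    p_saf1: prob. a cell is stuck at LRS -> weight stuck at +/- w_bound,
            keeping the weight's sign (zero weights stay zero in this sketch)
    """
    rng = np.random.default_rng(seed)
    u = rng.random(w.shape)
    saf0 = u < p_saf0                                  # SAF0 mask
    saf1 = (u >= p_saf0) & (u < p_saf0 + p_saf1)       # SAF1 mask
    w_f = np.where(saf0, 0.0, w)
    w_f = np.where(saf1, np.sign(w) * w_bound, w_f)
    return w_f
```

Sampling a fresh fault mask per forward pass (rather than fixing one defect map) is what makes the model "arbitrary-distributed".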
The above adSAF fault model assumes that the underlying hardware consists of multi-bit RRAM devices; adSAFs in single-bit RRAM devices are also of interest. In single-bit RRAM devices, the multiple bits of one weight value are mapped onto different crossbars, and the crossbar outputs are shifted and added together [zhu2019aconfigurable]. In this case, an SAF fault occurring in a cell causes the corresponding bit of the corresponding weight to be stuck at 0 or 1. The effect of adSAF faults on a weight value in single-bit RRAM devices can be formulated as

w̃ = (w ∧ ¬m) ∨ (v ∧ m),    (10)

where ∧, ∨, and ¬ denote bitwise AND, OR, and NOT on the two's complement representation of w, the binary representation of m indicates whether a fault occurs at each bit position, and the binary representation of v represents the target faulty value (0 or 1) at each bit position if a fault occurs. We will demonstrate that the architecture discovered under the multi-bit adSAF fault model can also defend against single-bit adSAF faults and iBF weight faults caused by errors in the weight buffers of CMOS-based accelerators.
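The bitwise stuck-at behavior of Eq. (10) is a one-liner on integer weight codes; the function and argument names below are our own:

```python
def adsaf_single_bit(code, stuck_mask, stuck_vals):
    """Single-bit-cell adSAF on one two's complement weight code (cf. Eq. (10)).

    code: integer code of the weight (each bit stored in a separate crossbar)
    stuck_mask: bit i set  -> cell i is stuck (the mask m)
    stuck_vals: bit i gives the stuck value, 0 for SAF0 / 1 for SAF1 (the mask v)
    """
    # healthy bits keep their value; stuck bits are forced to stuck_vals
    return (code & ~stuck_mask) | (stuck_vals & stuck_mask)
```

For example, with `code = 0b1010`, `stuck_mask = 0b0110`, and `stuck_vals = 0b0100`, bit 1 is forced to 0 and bit 2 is forced to 1, yielding `0b1100`.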
IV Fault-Tolerant NAS
In this section, we present the FTT-NAS framework. We first give the problem formalization and framework overview in Sec. IV-A. Then, the search space and the sampling and assembling process are described in Sec. IV-B and Sec. IV-C, respectively. Finally, the search process is elaborated in Sec. IV-D.
IV-A Framework Overview
Denoting the fault distribution characterized by the fault models as F, the neural architecture search for fault tolerance can be formalized as

max_α E_{x_v ∼ D_v} E_{f ∼ F} [ R(x_v, f, N(α, w*(α))) ]
s.t.  w*(α) = argmin_w E_{x_t ∼ D_t} E_{f ∼ F} [ L(x_t, f, N(α, w)) ].    (11)

As the cost of finding the best weights w*(α) for each architecture α is almost unbearable, we use the shared-weights based evaluator, in which shared weights are directly used to evaluate sampled architectures. The resulting method, FTT-NAS, solves this NAS problem approximately. FT-NAS can be viewed as a degraded special case of FTT-NAS, in which no fault is injected in the inner optimization of finding w*(α).
The overall neural architecture search (NAS) framework is illustrated in Fig. 4 (b). There are multiple components in the framework: a controller that samples different architecture rollouts from the search space; a candidate network assembled by taking the corresponding subset of weights from the super-net; and a shared-weights based evaluator that evaluates the performance of different rollouts on the CIFAR-10 dataset using fault-tolerant objectives.
IV-B Search Space
The design of the search space is as follows: we use a cell-based macro architecture, similar to the one used in [enas, DARTS]. There are two types of cells: the normal cell, and the reduction cell with stride 2. All normal cells share the same connection topology, while all reduction cells share another connection topology. The layout and connections between cells are illustrated in Fig. 5. In every cell, node 1 and node 2 are treated as the cell's inputs; they are the outputs of the two previous cells. For each of the other nodes, two incoming connections are selected and element-wise added. For each connection, the 11 possible operations are: none; skip connect; 3x3 average (avg.) pool; 3x3 max pool; 1x1 Conv; 3x3 ReLUConvBN block; 5x5 ReLUConvBN block; 3x3 SepConv block; 5x5 SepConv block; 3x3 DilConv block; 5x5 DilConv block.
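As a concrete sketch of this search space, the candidate operations and a count of possible cell architectures for a cell with B nodes can be written as follows. The operation names and the counting convention (two ordered input choices with repetition per node, one operation per connection) are our assumptions, not the paper's code:

```python
# The 11 candidate operations on each connection (names are illustrative).
PRIMITIVES = [
    "none", "skip_connect", "avg_pool_3x3", "max_pool_3x3", "conv_1x1",
    "relu_conv_bn_3x3", "relu_conv_bn_5x5",
    "sep_conv_3x3", "sep_conv_5x5", "dil_conv_3x3", "dil_conv_5x5",
]

def num_cell_archs(num_nodes):
    """Count cell topologies: node i (i >= 3) picks 2 input nodes from the
    i-1 earlier nodes (ordered, repetition allowed) and one of 11 ops per
    connection."""
    count = 1
    for i in range(3, num_nodes + 1):
        count *= ((i - 1) * len(PRIMITIVES)) ** 2
    return count

def num_archs(num_nodes):
    """Normal and reduction cells are searched independently."""
    return num_cell_archs(num_nodes) ** 2
```

For instance, a cell with a single non-input node (B = 3) already admits (2 x 11)^2 = 484 configurations per cell type, and the count grows multiplicatively with every added node.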
The complexity of the search space can be estimated as follows. For each non-input node, the controller chooses two input nodes and two operations out of the 11 primitives, so the number of possible topologies per cell type is the product of these per-node choices. As the two cell types are searched independently, the number of possible architectures in the search space is the square of the per-cell count, an astronomically large number in our experiments.

IV-C Sampling and Assembling Architectures
In our experiments, the controller is a recurrent neural network (RNN), and the performance evaluation is based on a super network with shared weights, as used by [enas].
An example of a sampled cell architecture is illustrated in Fig. 6. Specifically, to sample a cell architecture, the controller RNN samples one block of decisions for each non-input node. In the decision block for node i, two input nodes are sampled from nodes 1, …, i−1 to be connected with node i. Then two operations are sampled from the basic operation primitives, one for each of the two connections. Note that the two sampled input nodes can be the same node j, which results in two independent connections from node j to node i.
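The per-node decision sampling can be sketched as below, with a uniformly random stand-in for the RNN controller (the real controller parameterizes these choices; all names here are illustrative):

```python
import random

def sample_cell(num_nodes, num_ops=11):
    """Sample one cell: for each non-input node, pick 2 input nodes
    (possibly the same one) and one operation per connection."""
    decisions = []
    for node in range(3, num_nodes + 1):
        prev_nodes = list(range(1, node))                 # candidate inputs
        inputs = [random.choice(prev_nodes) for _ in range(2)]  # may repeat
        ops = [random.randrange(num_ops) for _ in range(2)]
        decisions.append((node, inputs, ops))
    return decisions
```

Repeating an input node yields two independent connections from that node, exactly as described above.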
During the search process, the architecture assembling process using the shared-weights super network is straightforward [enas]: simply take out the weights from the super network corresponding to the connections and operation types of the sampled architecture.
IV-D Searching for Fault-Tolerant Architectures
The FTT-NAS algorithm is illustrated in Alg. 1. To search for a fault-tolerant architecture, we use a weighted sum of the clean accuracy and the accuracy with fault injection as the reward to instruct the training of the controller:

(12)  R = (1 − α_r) · acc_clean + α_r · acc_fault

where acc_fault is calculated by injecting faults following the fault distribution described in Sec. III. For the optimization of the controller, we employ the Adam optimizer [kingma2015adam] to optimize the REINFORCE [williams1992simple] objective, together with an entropy-encouraging regularization.
In every epoch of the search process, we alternately train the shared weights and the controller on separate data splits D_t and D_c, respectively. For the training of the shared weights, we carry out experiments under two different settings: without and with FTT. When training with FTT, a weighted sum of the clean cross-entropy loss and the cross-entropy loss with fault injection is used to train the shared weights. The FTT loss can be written as

(13)  L_FTT = (1 − α_l) · L_clean + α_l · L_fault
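The two weighted objectives above can be expressed as simple helper functions (a sketch; the coefficient names alpha_r and alpha_l are our notation for the weighting coefficients):

```python
def ftt_loss(loss_clean, loss_fault, alpha_l):
    """Weighted FTT training loss: mixes clean and fault-injected
    cross-entropy losses."""
    return (1 - alpha_l) * loss_clean + alpha_l * loss_fault

def controller_reward(acc_clean, acc_fault, alpha_r):
    """Weighted reward: mixes clean and fault-injected accuracies."""
    return (1 - alpha_r) * acc_clean + alpha_r * acc_fault
```

Setting alpha_l = 0 recovers plain (non-FTT) training of the shared weights, which is exactly the FT-NAS setting.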
As shown in lines 7–12 of Alg. 1, in each step of training the shared weights, we sample an architecture using the current controller, then backpropagate the FTT loss to update the parameters of the candidate network. Training without FTT (in FT-NAS) is a special case in which the weight of the fault-injected loss term is set to 0. As shown in lines 15–20 of Alg. 1, in each step of training the controller, we sample an architecture from the controller, assemble this architecture using the shared weights, and then get the reward on one data batch from the controller split. Finally, the reward is used to update the controller by applying the REINFORCE technique [williams1992simple], with a moving-average reward baseline.
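A minimal sketch of one controller update with a moving-average baseline might look like the following (a toy single-decision controller rather than an RNN; the hyperparameter names and values are illustrative):

```python
import math

def reinforce_step(logits, sampled_idx, reward, baseline,
                   lr=0.1, momentum=0.99):
    """One REINFORCE update on a categorical 'controller'.

    The gradient of log pi(a) w.r.t. the logits is one_hot(a) - softmax,
    scaled by the advantage (reward - baseline)."""
    exps = [math.exp(l) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    advantage = reward - baseline
    for i in range(len(logits)):
        grad = (1.0 if i == sampled_idx else 0.0) - probs[i]
        logits[i] += lr * advantage * grad          # ascend the objective
    # moving-average reward baseline
    baseline = momentum * baseline + (1 - momentum) * reward
    return logits, baseline
```

Actions that receive above-baseline rewards have their logits pushed up, so the controller gradually concentrates on fault-tolerant architectures.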
V Experiments
In this section, we demonstrate the effectiveness of the FTT-NAS framework and analyze the discovered architectures under different fault models. First, we introduce the experiment setup in Sec. V-A. Then, the effectiveness under the feature and weight fault models is shown in Sec. V-B and Sec. V-C, respectively. The effectiveness of the learned controller is illustrated in Sec. V-D. Finally, analyses and illustrative experiments are presented in Sec. V-E.
V-A Setup
Our experiments are carried out on the CIFAR-10 [cifar10] dataset. CIFAR-10 is one of the most commonly used computer vision datasets and contains 60000 32x32 RGB images. Three manually designed architectures, VGG-16, ResNet-20, and MobileNet-V2, are chosen as the baselines. 8-bit dynamic fixed-point quantization is used throughout the search and training process, and the fraction length is chosen following the minimal-overflow principle.

In the neural architecture search process, we split the training dataset into two subsets: 80% of the training data is used to train the shared weights, and the remaining 20% is used to train the controller. The super network is an 8-cell network with all the possible connections and operations. The channel number of the first cell is set to 20 during the search process, and the channel number is doubled at every reduction cell. The controller network is an RNN with one hidden layer of size 100. The learning rate for training the controller is 1e-3. The reward baseline is updated using a moving average with momentum 0.99. To encourage exploration, we add an entropy-encouraging regularization to the controller's REINFORCE objective. For training the shared weights, we use an SGD optimizer with momentum 0.9 and weight decay 1e-4; the learning rate is scheduled by a cosine annealing scheduler [loshchilov2016sgdr]. Each architecture search process is run for 100 epochs. Note that all these are typical settings similar to [enas]. We build the neural architecture search framework and the fault injection framework upon the PyTorch framework.
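The minimal-overflow fraction-length selection for 8-bit dynamic fixed point can be sketched as follows (our reconstruction, not the authors' code; the rounding and clamping conventions are assumptions):

```python
import math

def quantize_dynamic_fixed(values, bits=8):
    """Quantize to dynamic fixed point: choose the largest fraction length
    such that the maximum absolute value still fits (minimal overflow)."""
    max_abs = max(abs(v) for v in values)
    int_len = max(0, math.ceil(math.log2(max_abs + 1e-12)))  # integer bits
    frac_len = bits - 1 - int_len                            # 1 sign bit
    step = 2.0 ** (-frac_len)
    qmax = 2 ** (bits - 1) - 1
    quantized = [max(-qmax - 1, min(qmax, round(v / step))) for v in values]
    return [q * step for q in quantized], frac_len
```

The fraction length adapts per tensor to its dynamic range; as discussed later in Sec. V-E, this range also determines how damaging a stuck-at fault can be.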
V-B Defend Against MiBB Feature Faults
Table IV: Evaluation under the MiBB feature fault model.

Arch          Training    Accuracy with feature faults (%)           #FLOPs  #Params
                          clean  3e-6  1e-5  3e-5  1e-4  3e-4
ResNet-20     clean       94.7   89.1  63.4  11.5  10.0  10.0       1110M   11.16M
VGG-16        clean       93.1   78.2  21.4  10.0  10.0  10.0       626M    14.65M
MobileNet-V2  clean       92.3   10.0  10.0  10.0  10.0  10.0       182M    2.30M
F-FT-Net      clean       91.0   71.3  22.8  10.0  10.0  10.0       234M    0.61M
ResNet-20     FTT, 1e-4   79.2   79.1  79.6  78.9  60.6  11.3       1110M   11.16M
VGG-16        FTT, 3e-5   83.5   82.4  77.9  50.7  11.1  10.0       626M    14.65M
MobileNet-V2  FTT, 3e-4   71.2   70.3  69.0  68.7  68.1  47.8       182M    2.30M
F-FTT-Net     FTT, 3e-4   88.6   88.7  88.5  88.0  86.2  51.0       245M    0.65M
As described in Sec. IV, we conduct neural architecture search without and with fault-tolerant training (i.e., FT-NAS and FTT-NAS, respectively). The per-MAC injection probability used in the search process is 1e-4. The reward coefficient in Eq. 12 and, in FTT-NAS, the loss coefficient in Eq. 13 are set to the same value. As the baselines for FT-NAS and FTT-NAS, we train ResNet-20, VGG-16, and MobileNet-V2 with both normal training and FTT. For each model trained with FTT, we successively try per-MAC fault injection probabilities in {3e-4, 1e-4, 3e-5}, and use the largest injection probability with which the model achieves a clean accuracy above 50%. Consequently, ResNet-20 and VGG-16 are trained with per-MAC fault injection probabilities of 1e-4 and 3e-5, respectively.
The discovered cell architectures are shown in Fig. 7, and the evaluation results are shown in Table IV. The discovered architecture, F-FTT-Net, outperforms the baselines significantly at various fault ratios. Meanwhile, compared with the most efficient baseline, MobileNet-V2, the FLOPs number of F-FTT-Net is comparable, and the parameter number is only 28.3% (0.65M versus 2.30M). If we require that the accuracy be kept above 70%, MobileNet-V2 can function with a per-MAC error rate of 3e-6, while F-FTT-Net can function with a per-MAC error rate larger than 1e-4. That is to say, while meeting the same accuracy requirement, F-FTT-Net can function in an environment with a much higher SER.
We can see that FTT-NAS is much more effective than its degraded variant, FT-NAS. We conclude that, generally, NAS should be used in conjunction with FTT, as suggested by Eq. 11. Another interesting observation is that, under the MiBB fault model, the relative ranking of the resilience of different architectures changes after FTT: with FTT, MobileNet-V2 suffers the smallest accuracy degradation among the three baselines, whereas it is the most vulnerable one without FTT.
V-C Defend Against adSAF Weight Faults
We conduct FT-NAS and FTT-NAS under the adSAF model. The overall SAF ratio is set to 8%, in which the proportions of SAF0 and SAF1 are 83.7% and 16.3%, respectively (i.e., 6.7% SAF0 and 1.3% SAF1). The reward coefficient in Eq. 12 and the loss coefficient in Eq. 13 used in FTT-NAS are kept fixed throughout the search.
The discovered cell architectures are shown in Fig. 8. As shown in Table V, the discovered W-FTT-Net outperforms the baselines significantly at various test SAF ratios, with comparable FLOPs and fewer parameters. We then apply channel augmentation to the discovered architecture to explore the performance of the model at different scales. We can see that models with larger capacity have better reliability under the adSAF weight fault model, e.g., 69.2% (W-FTT-Net-40) vs. 53.5% (W-FTT-Net-20) with 10% adSAF faults.
Table V: Evaluation under the adSAF weight fault model.

Arch          Training    Accuracy with weight faults (%)            #FLOPs  #Params
                          clean  0.04  0.06  0.08  0.10  0.12
ResNet-20     clean       94.7   64.8  34.9  17.8  12.4  11.0       1110M   11.16M
VGG-16        clean       93.1   45.7  21.7  14.3  12.6  10.6       626M    14.65M
MobileNet-V2  clean       92.3   26.2  14.3  11.7  10.3  10.5       182M    2.30M
W-FT-Net-20   clean       91.7   54.2  30.7  19.6  15.5  11.9       1020M   3.05M
ResNet-20     FTT, 0.08   92.0   86.4  77.9  60.8  41.6  25.6       1110M   11.16M
VGG-16        FTT, 0.08   91.1   82.6  73.3  58.5  41.7  28.1       626M    14.65M
MobileNet-V2  FTT, 0.08   86.3   76.6  55.9  35.7  18.7  15.1       182M    2.30M
W-FTT-Net-20  FTT, 0.08   90.8   86.2  79.5  69.6  53.5  38.4       919M    2.71M
W-FTT-Net-40  FTT, 0.08   92.1   88.8  85.5  79.3  69.2  54.2       3655M   10.78M
To investigate whether a model FTT-trained under the adSAF fault model can tolerate other types of weight faults, we evaluate the reliability of W-FTT-Net under the 1bit-adSAF model and the iBF model. As shown in Fig. 9 (b)(c), under both the 1bit-adSAF and iBF weight fault models, W-FTT-Net outperforms all the baselines consistently at different noise levels.
V-D The Effectiveness of the Learned Controller
To demonstrate the effectiveness of the learned controller, we compare the performance of the architectures sampled by the controller with that of architectures randomly sampled from the search space. For both the MiBB feature fault model and the adSAF weight fault model, we randomly sample five architectures from the search space and train them with FTT for 100 epochs. A per-MAC fault injection probability of 3e-4 is used for the feature faults, and an SAF ratio of 8% (6.7% SAF0, 1.3% SAF1) is used for the weight faults.
As shown in Table VI and Table VII, the performance of different architectures in the search space varies a lot, and the architectures sampled by the learned controllers, F-FTT-Net and W-FTT-Net, outperform all the randomly sampled architectures. Note that, as we use different preprocess operations for feature faults and weight faults (ReLUConvBN 3x3 and SepConv 3x3, respectively), there exist differences in FLOPs and parameter number even with the same cell architectures.
Table VI: Randomly sampled architectures vs. F-FTT-Net (MiBB feature faults).

Model      clean acc (%)  acc @ 3e-4 (%)  #FLOPs  #Params
sample1    60.2           19.5            281M    0.81M
sample2    79.7           29.7            206M    0.58M
sample3    25.0           32.2            340M    1.09M
sample4    32.9           25.8            387M    1.23M
sample5    17.4           10.8            253M    0.77M
F-FTT-Net  88.6           51.0            245M    0.65M
Table VII: Randomly sampled architectures vs. W-FTT-Net (adSAF weight faults).

Model      clean acc (%)  acc @ SAF 8% (%)  #FLOPs  #Params
sample1    90.7           63.6              705M    1.89M
sample2    84.7           36.7              591M    1.54M
sample3    90.3           60.3              799M    2.33M
sample4    90.5           64.0              874M    2.55M
sample5    85.2           45.6              665M    1.83M
W-FTT-Net  90.7           68.5              919M    2.71M
V-E Inspection of the Discovered Architectures
Feature faults: From the discovered cell architectures shown in Fig. 7, we can observe that the controller clearly prefers SepConv and DilConv blocks over ReLUConvBN blocks. This observation is consistent with our expectation: under the MiBB feature fault model, operations with fewer FLOPs result in a lower equivalent fault rate in the OFM.
Under the MiBB feature fault model, there is a trade-off between the capacity of the model and the feature error rate. As the number of channels increases, the operations become more expressive, but the equivalent error rates in the OFMs also get higher. Thus there exists a trade-off point c* for the number of channels. Intuitively, c* depends on the per-MAC error rate p_m: the larger p_m is, the smaller c* is.
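This trade-off can be illustrated with a back-of-the-envelope calculation, assuming (as a simplification) i.i.d. per-MAC faults and counting an output element as faulty if any of its MACs is faulty:

```python
def ofm_error_rate(p_mac, num_macs):
    """Probability that an output feature element is affected, given that
    each of its MACs independently faults with probability p_mac."""
    return 1.0 - (1.0 - p_mac) ** num_macs

# Doubling the input channels doubles the MACs per output element,
# e.g., a 3x3 conv with 32 vs. 64 input channels:
r32 = ofm_error_rate(1e-4, 3 * 3 * 32)   # roughly 3% of outputs affected
r64 = ofm_error_rate(1e-4, 3 * 3 * 64)   # roughly twice as many
```

Widening a layer thus buys expressiveness at the price of a higher equivalent OFM error rate, which is why a finite trade-off point c* exists.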
Besides the choice of primitives, the connection pattern and the combination of different primitives also play a role in making an architecture fault-tolerant. To verify this, we first conduct a simple experiment to confirm the preference among primitives: for each of 4 primitives (SepConv 3x3, SepConv 5x5, DilConv 3x3, DilConv 5x5), we stack 5 layers of the primitive and FTT-train the stacked NN with a per-MAC fault rate of 3e-4. Evaluated at a per-MAC fault rate of 1e-4, the stacked NNs achieve accuracies of 60.0%, 65.1%, 50.0%, and 56.3%, respectively. The stacked NN of SepConv 5x5 blocks achieves the best performance, which is no surprise since the most frequent block in F-FTT-Net is SepConv 5x5. Then, we construct six architectures by randomly sampling five architectures with only SepConv 5x5 connections and by replacing all the primitives in F-FTT-Net with SepConv 5x5 blocks. The best accuracy achieved by these six architectures at a per-MAC fault rate of 1e-4 is 77.5% (versus 86.2% achieved by F-FTT-Net). These illustrative experiments indicate that the connection pattern and the combination of different primitives both contribute to the fault resilience of a neural network architecture.
Weight faults: Under the adSAF fault model, the controller prefers ReLUConvBN blocks over SepConv and DilConv blocks. This preference is not as easy to anticipate. We hypothesize that the weight distributions of different primitives might lead to different behaviors when encountering SAF faults. For example, if the quantization range of a weight value is larger, the value deviation caused by an SAF1 fault is larger, and we know that a large increase in the magnitude of weights can damage performance severely [hacene2019training]. We conduct a simple experiment to verify this hypothesis: we stack several blocks to construct a network, and in each block, one of three operations (a SepConv 3x3 block, a ReLUConvBN 3x3 block, and a ReLUConvBN 1x1 block) is randomly picked in every training step. The SepConv 3x3 block is constructed with a DepthwiseConv 3x3 and two Conv 1x1, while the ReLUConvBN 3x3 and ReLUConvBN 1x1 contain a Conv 3x3 and a Conv 1x1, respectively. After training, the weight magnitude ranges of Conv 3x3, Conv 1x1, and DepthwiseConv 3x3 are about 0.036–0.043, 0.112–0.121, and 0.094–0.140, respectively. Since the magnitude of the weights in the 3x3 convolutions is smaller than that of the 1x1 convolutions and the depthwise convolutions, SAF weight faults would cause larger weight deviations in a SepConv or DilConv block than in a ReLUConvBN 3x3 block.
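The magnitude argument can be made concrete with a small numeric sketch (the bit width, bit position, and range values below are illustrative assumptions, not measurements from the paper):

```python
def saf1_deviation(weight_range_max, bits=8, bit_pos=6):
    """Value deviation if bit `bit_pos` of a quantized weight is stuck at 1
    (and was originally 0), under symmetric fixed-point quantization."""
    step = weight_range_max / 2 ** (bits - 1)   # quantization step
    return step * 2 ** bit_pos

# A primitive whose weights span a larger dynamic range suffers a larger
# deviation from the same stuck-at-1 fault:
dev_conv3x3 = saf1_deviation(0.043)   # Conv 3x3-like range
dev_conv1x1 = saf1_deviation(0.121)   # Conv 1x1-like range
```

The deviation scales linearly with the dynamic range, which is consistent with the observed preference for ReLUConvBN 3x3 blocks under the adSAF model.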
VI Discussion
Orthogonality: Most previous methods exploit the inherent fault resilience of existing NN architectures to tolerate different types of hardware faults. In contrast, our methods improve the inherent fault resilience of NN models, thus effectively increasing the algorithmic fault-resilience “budget” available to hardware-specific methods. Our methods are orthogonal to existing fault-tolerance methods and can be easily integrated with them, e.g., largely reducing the overhead of hardware-based methods.
Limitation of the application-level fault model: Some faults are hard or unlikely to model and mitigate by our methods, e.g., timing errors and routing/DSP errors in FPGAs. A hardware-in-the-loop framework could be established for a thorough evaluation of system-level fault hazards. Since the correspondence between these faults and application-level elements is subtle, they are more suitably mitigated at lower abstraction layers.
Hardware: In the MiBB feature fault model, we assume that the add operations are spatially expanded onto independent hardware adders, which applies to template-based designs [venieris2017convnet]. For ISA (Instruction Set Architecture) based accelerators [qiu2016going], the NN computations are orchestrated by instructions and time-multiplexed onto hardware units. In this case, the accumulation of faults follows a different model and might show different preferences among architectures. Nevertheless, the FTT-NAS framework can be used with different fault models; we leave the exploration of this setting for future work.
Data representation: In our work, an 8-bit dynamic fixed-point representation is used for the weights and features. As pointed out in Sec. V-E, the dynamic range has an impact on the resilience characteristics against weight faults, and the data format itself obviously decides or affects the data range. [yan2019whense] found that errors in the exponent bits of 32-bit floating-point weights have a large impact on performance, and [li2017understanding] investigated the resilience characteristics of several floating-point and non-dynamic fixed-point representations.
VII Conclusion
In this paper, we analyze the possible faults in various types of NN accelerators and formalize statistical fault models from the algorithmic perspective. Based on this analysis, the MAC-i.i.d Bit-Bias (MiBB) model and the arbitrary-distributed Stuck-at-Fault (adSAF) model are adopted in the neural architecture search for tolerating feature faults and weight faults, respectively. To search for fault-tolerant neural network architectures, we propose the multi-objective Fault-Tolerant NAS (FT-NAS) and Fault-Tolerant Training NAS (FTT-NAS) methods; in FTT-NAS, the NAS technique is employed in conjunction with Fault-Tolerant Training (FTT). The fault resilience of the discovered architectures, F-FTT-Net and W-FTT-Net, surpasses multiple manually designed baselines, with comparable or fewer FLOPs and parameters, and W-FTT-Net trained under the 8bit-adSAF model can defend against several other types of weight faults. Generally, FTT-NAS is more effective and should be used. Since operation primitives differ in their MACs, expressiveness, and weight distributions, they exhibit different resilience capabilities under different fault models. The connection pattern is also shown to influence the fault resilience of NN models.