A single event upset (SEU) is a change in the state of a storage element inside a device or system, typically Static Random Access Memory (SRAM), caused by cosmic radiation effects. The changed state is a soft error: it can often be fixed by restoring the storage element to its original value, with no permanent damage to the device itself. Because of the unintended memory state, the device may operate erroneously until the upset is fixed.
The Soft Error Rate (SER) is expressed in Failure-in-Time (FIT) units, where one FIT is one soft error occurrence per billion hours of operation. SEU mitigation is often not required because of the low chance of occurrence. However, for highly complex systems, such as those with multiple high-density components, the error rate may be a significant system design factor. If your system includes multiple FPGAs and requires very high reliability and availability, you should consider the implications of soft errors and use the available techniques for detecting and recovering from these types of errors.
The Stratix® 10 SEU mitigation features help keep the system functioning properly, prevent the system from malfunctioning because of an SEU event, and handle SEU events that are critical to the system. Systems that commonly require SEU mitigation include the following:
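The FIT definition above translates directly into a mean time between soft errors. The following Python sketch shows the arithmetic for a single device and for a system of several identical devices. The function name and the FIT values are hypothetical, chosen only for illustration; they are not Stratix 10 specifications.

```python
HOURS_PER_YEAR = 24 * 365

def mtbf_hours(fit_per_device: float, device_count: int = 1) -> float:
    """Mean time between soft errors, in hours.

    One FIT is one error per 1e9 hours, and the FIT rates of
    independent devices in a system add up.
    """
    system_fit = fit_per_device * device_count
    return 1e9 / system_fit

# Hypothetical numbers, for illustration only.
single = mtbf_hours(1000)       # one device at 1000 FIT
system = mtbf_hours(1000, 20)   # twenty such devices in one system

print(f"{single / HOURS_PER_YEAR:.1f} years")  # 114.2 years
print(f"{system / HOURS_PER_YEAR:.1f} years")  # 5.7 years
```

The second result illustrates why a rate that is negligible for one device can become a design factor in a multi-FPGA system.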
- Military or aerospace—flight systems
- Automotive or industrial—safety applications
- Telecom, data center or cloud computing—high system up-time
FPGAs use memory both in user logic (bulk memory and registers) and in Configuration RAM (CRAM). CRAM is the memory loaded with the user's design, and it configures all logic and routing in the device. If an SEU strikes a CRAM bit, the effect can be harmless if the CRAM bit is not in use. However, a functional error is possible if the upset affects critical logic or internal signal routing, such as a lookup table bit.
The Stratix 10 devices contain two types of memory blocks: M20K blocks and memory logic array blocks (MLABs).
- 20-kilobit (Kb) M20K blocks
  - Blocks of dedicated memory resources.
  - Ideal for larger memory arrays, while providing a large number of independent ports.
- 640-bit MLABs
  - Enhanced memory blocks configured from dual-purpose logic array blocks (LABs).
  - Ideal for wide and shallow memory arrays.
  - Optimized for implementation of shift registers for digital signal processing (DSP) applications, wide and shallow FIFO buffers, and filter delay lines.
  - Each MLAB is made up of ten adaptive logic modules (ALMs).
For more information about Stratix 10 embedded memory, refer to the Stratix 10 Embedded Memory User Guide.
Stratix® 10 devices feature various single-event upset (SEU) mitigation approaches for different application areas.
|Area||SEU Mitigation Approach|
|Silicon design: CRAM/SRAMs/flip-flops||Various design techniques to reduce upsets and/or limit them to correctable double-bit errors.|
|Error Detection and Correction (EDC) / Internal Scrubbing||You can enable the error detection and correction (EDC) feature for detecting CRAM SEU events and automatic correction of CRAM contents.|
|M20K SRAM block||Stratix® 10 devices implement layout techniques and Error Correction Code (ECC) that reduce the SEU failures in time (FIT) rate to almost zero.|
|Sensitivity processing||You can use sensitivity processing to identify whether an SEU struck a used or an unused CRAM bit.|
|Fault injection||You can use the fault injection feature to validate the system response to an SEU event by changing the CRAM state to trigger an error.|
|Hierarchical tagging||A complementary capability to sensitivity processing and fault injection for reporting SEU and constraining injection to specific portions of design logic.|
|Triple Modular Redundancy (TMR)||You can implement the TMR technique on critical logic such as state machines.|
The Stratix® 10 device overlays the rows and columns with sectors to address the core logic for configuration and security. Each sector has a Local Sector Manager (LSM). The LSMs provide low-level control and management of the error detection of a sector and communicate with the Secure Device Manager (SDM) through the configuration network.
Stratix® 10 devices feature on-chip EDC circuitry to detect soft errors. If an error caused by an SEU event is correctable, the device corrects it when the internal scrubbing feature is enabled.
|Error Type||Detection||Correction|
|Single bit error||Yes||Yes|
|Double adjacent errors||Yes||Yes|
|Multiple bit errors||Detect up to 8 CRAM bits that fit in a rectangular box of 8 CRAM bits (8x1, 4x2, 1x8 or 2x4 errors)||Correct up to 4 CRAM bits that fit in a square of 4 bits (2x1, 1x2 or 2x2)|
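The multi-bit rows of the table can be read as bounding-box conditions on the set of flipped bits. The Python sketch below follows that reading. The (row, column) coordinates, the function names, and the three result labels are assumptions made for illustration; the actual physical CRAM addressing is not described here.

```python
# Box shapes taken from the table above, as (height, width) pairs.
DETECT_SHAPES = [(8, 1), (4, 2), (1, 8), (2, 4)]
CORRECT_SHAPES = [(2, 1), (1, 2), (2, 2)]

def bounding_box(bits):
    """Return (height, width) of the smallest box containing all bits."""
    if not bits:
        raise ValueError("at least one flipped bit is required")
    rows = [r for r, _ in bits]
    cols = [c for _, c in bits]
    return (max(rows) - min(rows) + 1, max(cols) - min(cols) + 1)

def fits(bits, shapes):
    """True if the flipped bits fit inside any of the given box shapes."""
    height, width = bounding_box(bits)
    return any(height <= h and width <= w for h, w in shapes)

def classify(bits):
    """Classify a set of (row, col) flipped bits under the assumed reading."""
    if fits(bits, CORRECT_SHAPES):
        return "correctable"
    if fits(bits, DETECT_SHAPES):
        return "detectable"
    return "not guaranteed"
```

For example, `classify({(0, 0), (1, 1)})` fits a 2x2 box and is reported as correctable, while `classify({(0, 0), (0, 3)})` only fits the 1x8 box and is reported as detectable.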
Reconfiguring a running FPGA has a significant impact on the system using the FPGA. When planning for SEU recovery, account for the time required to bring the FPGA to a state consistent with the current state of the system. For example, if an internal state machine is in an illegal state, it may require reset. In addition, the surrounding logic may need to account for this unexpected operation.
Often an SEU impacts CRAM bits not used by the implemented design. Many configuration bits are unused because they control logic and routing wires that the design does not use. Depending on the implementation, only about 40% of all CRAM bits are used, even in the most heavily utilized devices. This means that only about 40% of SEU events require intervention, and you can ignore the remaining 60%. The utilized bits are considered critical bits, while the non-utilized bits are considered non-critical bits.
You may also determine that portions of the implemented design are not essential to the FPGA's function. Examples include test circuitry that is implemented but not important to the operation of the device, and other non-critical functions whose upsets can be logged without reprogramming or resetting the device.
Hierarchy tagging is the process of classifying the sensitivity of the portions of your design.
You can perform hierarchy tagging using the Quartus® Prime software by creating a design partition, and then assigning the parameter Advanced SEU Detection (ASD) Region to that partition. The parameter can assume a value from 0 to 15, so there are 16 different classifications of system responses to the portions of your design.
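On the system side, the 16 possible ASD Region values can be mapped to recovery actions with a simple lookup. The following Python sketch shows one way to organize such a policy. The response names and the region-to-response assignments are entirely hypothetical; the meaning of each region value is up to your design.

```python
from enum import Enum

class Response(Enum):
    IGNORE = "ignore"
    LOG_ONLY = "log only"
    RESET_BLOCK = "reset affected block"
    RECONFIGURE = "reconfigure device"

# Hypothetical policy: which response each tagged region gets.
POLICY = {
    0: Response.IGNORE,
    1: Response.LOG_ONLY,
    2: Response.RESET_BLOCK,
    3: Response.RECONFIGURE,
}

def response_for(asd_region: int) -> Response:
    """Look up the system response for an ASD Region value (0 to 15)."""
    if not 0 <= asd_region <= 15:
        raise ValueError("ASD Region must be in the range 0 to 15")
    # Regions without an explicit entry fall back to the most
    # conservative action in this sketch.
    return POLICY.get(asd_region, Response.RECONFIGURE)
```

Defaulting unlisted regions to the most disruptive response is a design choice in this sketch: it errs on the side of recovery rather than silently ignoring an upset in untagged logic.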
The design hierarchy sensitivity processing depends on the contents of the Sensitivity Map Header file (.smh). This file determines which portion of the FPGA's logic design is sensitive to a CRAM bit flip. You can use sensitivity information from the .smh file to determine the correct (least disruptive) recovery sequence.
To generate the functionally valid .smh, you must designate the sensitivity of the design from a functional logic view, using the hierarchy tagging procedure.
You can use fault injection to refine your SEU recovery response. The fault injection feature allows you to operate the FPGA in your system and inject random CRAM bit flips to test the ability of the FPGA and the system to detect and recover fully from an SEU. You should be able to observe your FPGA and your system recover from these simulated SEU strikes, and you can then refine your FPGA and system recovery sequence based on what you observe. You can also determine the single event functional interrupt (SEFI) rate of your design by using the fault injection feature.
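One simple way to turn fault injection results into a SEFI estimate is to scale the raw CRAM upset rate by the fraction of injected faults that caused a functional failure. The Python sketch below assumes the injections are uniformly random across CRAM and that each trial is recorded as pass or fail. The function names and all numbers are hypothetical, not measured data.

```python
def sefi_fraction(injections: int, functional_failures: int) -> float:
    """Fraction of injected bit flips that caused a functional failure."""
    if injections <= 0:
        raise ValueError("at least one injection is required")
    if not 0 <= functional_failures <= injections:
        raise ValueError("failures must be between 0 and injections")
    return functional_failures / injections

def estimated_sefi_fit(raw_cram_fit: float, injections: int,
                       functional_failures: int) -> float:
    """Scale a raw CRAM FIT rate by the observed failure fraction."""
    return raw_cram_fit * sefi_fraction(injections, functional_failures)

# Hypothetical campaign: 1000 injections, 50 caused a functional failure,
# applied to an assumed raw CRAM rate of 5000 FIT.
print(estimated_sefi_fit(5000, 1000, 50))
```

The estimate is only as good as the sample size, so a small campaign gives a rough figure rather than a precise rate.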
After correcting a bit flip in CRAM, the device is in its original configuration with respect to logic and routing. However, the internal state of the FPGA may be illegal.
The state of the device may be invalid because it may have been operating while SEUs corrupted its configuration. The errors from faulty operation may have propagated elsewhere within the FPGA or to the system outside of the FPGA.
Forcing the FPGA into a known state is system dependent. Determining the possible outcomes from SEU, and designing a recovery response to SEU should be part of the FPGA and system design process.
Only M20K blocks and eSRAM blocks support the ECC feature.
If you engage the ECC feature, you cannot use the following features:
- Byte enable
- Coherent read
For M20K blocks, ECC performs single-error correction, double-adjacent-error correction, and triple-adjacent-error correction in a 32-bit word. However, ECC cannot guarantee detection or correction of non-adjacent errors of two or more bits.
The M20K blocks have built-in support for ECC when in ×32-wide simple dual-port mode.
- When you engage the ECC feature, the M20K runs slower than in non-ECC simple dual-port mode. However, you can enable optional ECC pipeline registers before the output decoder to achieve higher performance than non-pipelined ECC mode, at the expense of one cycle of latency.
- Two ECC status flag signals, e (error) and ue (uncorrectable error), indicate the M20K ECC status. The status flags are part of the regular outputs from the memory block. When you engage ECC, you cannot access two of the parity bits because the ECC status flags replace them.
For eSRAM blocks, ECC performs single-error correction and double-error correction in a 64-bit word.
- Two ECC status flag signals, e (error) and ue (uncorrectable error), indicate the eSRAM ECC status.
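System software or a testbench model typically folds the e and ue flags into a single status. The Python sketch below shows one conservative way to do that. The flag semantics are an assumption: ue set is treated as uncorrectable regardless of e, and e alone is treated as an error that ECC corrected. Check the exact truth table in the Stratix 10 Embedded Memory User Guide before relying on it.

```python
def decode_ecc_status(e: int, ue: int) -> str:
    """Interpret the e/ue flag pair under the assumptions stated above."""
    if e not in (0, 1) or ue not in (0, 1):
        raise ValueError("flags must be 0 or 1")
    if ue:
        # Assumed: any assertion of ue means the output data is unreliable.
        return "uncorrectable error"
    if e:
        # Assumed: e without ue means ECC corrected the read data.
        return "corrected error"
    return "no error"

for flags in [(0, 0), (1, 0), (1, 1)]:
    print(flags, decode_ecc_status(*flags))
```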
TMR is an established technique for improving hardware fault tolerance. In TMR, three identical instances of the hardware are supplied, along with voting hardware at their outputs. If an SEU affects one of the instances, the voting logic takes the majority value of the three outputs, which masks the output of the malfunctioning instance.
The advantage of TMR is that there is no downtime in the case of a single SEU; if a module is found to be in faulty operation, that module can be scrubbed of its error by reprogramming it. The error detection and correction time is many orders of magnitude less than the mean time between failures (MTBF) due to SEU events. Therefore, you can repair a soft error before another SEU affects another instance in the TMR triple.
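The voting step can be modeled in a few lines. The following Python sketch is a behavioral model of a bitwise majority voter, not synthesizable hardware, and the function names are illustrative. It also reports which instance disagrees with the vote, which is the information needed to decide which module to scrub.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority: each output bit is 1 if at least two inputs are 1."""
    return (a & b) | (a & c) | (b & c)

def faulty_instances(a: int, b: int, c: int) -> list[int]:
    """Indices (0, 1, 2) of the instances whose output differs from the vote."""
    voted = majority_vote(a, b, c)
    return [i for i, value in enumerate((a, b, c)) if value != voted]

# Instance 2 has one bit flipped; the vote still yields the correct value.
print(bin(majority_vote(0b1010, 0b1010, 0b0010)))  # 0b1010
print(faulty_instances(0b1010, 0b1010, 0b0010))    # [2]
```

Because the vote is taken bit by bit, it masks any fault confined to one instance, but not faults that corrupt the same bit position in two instances at once.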
The disadvantage of TMR is its extreme cost in hardware resources: it requires three times as much hardware, in addition to voting logic. This hardware cost can be minimized by judiciously implementing TMR only for the most critical part of the design.
There are several automated ways to generate TMR designs by automatically replicating designated functions and synthesizing the required voting logic. Synthesis vendors offering automated TMR synthesis include Synopsys and Mentor Graphics.
To enable the internal scrubbing feature, perform the following steps:
- On the Assignments menu, click Device.
- In Device and Pin Options, select the Error Detection CRC category.
- Turn on Enable internal scrubbing.
- Click OK.
To set the SEU_ERROR pin function, perform the following steps:
- On the Assignments menu, click Device.
- In Device and Pin Options, select the Configuration category and click Configuration Pin Options.
- In the Configuration Pin window, turn on Use SEU_ERROR output.
- Select any unused SDM pin from the drop-down selection to implement the SEU_ERROR pin function.
- Click OK to confirm and close the Configuration Pin window.
|October 2016||2016.10.31||Initial release.|