Abstract
This paper presents an innovative approach for predicting timing errors tailored to near-/sub-threshold operations, addressing the energy-efficient requirements of digital circuits in applications, such as IoT devices and wearables. The method involves assessing deep path activity within an adjustable window prior to the root clock’s rising edge. By dynamically adapting the prediction window and supply voltage based on error detection outcomes, the approach effectively mitigates false predictions—an essential concern in low-voltage prediction techniques. The efficacy of this strategy is demonstrated through its implementation in a near-/sub-threshold 32-bit microprocessor system. The approach incurs only a modest 6.84% area overhead attributed to well-engineered lightweight design methodologies. Furthermore, with the integration of clock gating, the system functions seamlessly across a voltage range of 0.4 V–1.2 V (5–100 MHz), effectively catering to adaptive energy efficiency. Empirical results highlight the potential of the proposed strategy, achieving a significant 46.95% energy reduction at the Minimum Energy Point (MEP, 15 MHz) compared to signoff margins. Additionally, a 19.75% energy decrease is observed compared to the zero-margin operation, demonstrating successful realization of negative margins.
Keywords: dynamic voltage scaling (DVS), negative design margin, error detection and correction (EDaC), ultra-low voltage, near-threshold operation, high stability, error-aware capability
1. Introduction
The demand for energy-efficient digital circuits has surged, driven by the burgeoning applications such as Internet of Things (IoT) devices and wearables, prompting the emergence of near-threshold computing (NTC) [1,2,3]. Despite its merits, NTC brings about considerable degradation in path delay due to process, voltage, and temperature variations, as well as aging effects. The operational point of a chip with minimal stability margin is referred to as the zero-margin point. Traditional chip signoff methods, when applied under worst-case scenarios, result in a significant margin increment for NTC chips, hindering the realization of projected power reduction advantages [1,2,3]. As such, achieving reduced system energy consumption by compressing margins through voltage reduction is contingent upon maintaining chip operational speed.
The simplest recourse for margin alleviation entails the use of a replica delay line for on-chip performance monitoring [4]. Nevertheless, replica systems are inadequate in negating margins for local variations, such as intra-die discrepancies, local resistive (IR) drops, and localized temperature hotspots that elude capture.
To surmount margin concerns across various forms of variations, a multitude of error detection and correction (EDaC) strategies have been postulated [5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]. These approaches pivot around detecting data transitions within a predefined window following the rising edge of the clock signal. A match within this window denotes a timing error occurrence, necessitating the application of architectural-level measures to correct the identified error. Differences in the specifics of implementation may arise due to factors such as the nature of timing elements (flip-flops or latches), the method of window generation, the technique for data transition detection (current-based or voltage-based), and the strategy for timing error correction (pipeline flushing, clock borrowing, instruction replay, etc.). Two predominant issues intertwine with EDaC methodologies: firstly, the tension between detection window (DW) width—indicative of error-aware capacity—and the subsequent addition of hold buffers, leading to area overhead; secondly, the substantial performance drop stemming from frequent timing error corrections, obstructing the attainment of a zero-margin state.
In response to these challenges, this study proposes Deep Path Activity Detection–EDaC (DPAD–EDaC). The approach integrates a timing error prediction circuit into the conventional EDaC framework, addressing both concerns and achieving negative-margin operation at a controlled error rate. A prediction window of programmable width is established just before the root clock’s rising edge. Detection cells, interleaved with combinational logic cells in the timing path, emit pulse signals when they detect data transitions. A pulse within the prediction window signals a potential timing error. Because the error has not yet occurred when it is predicted, it can be corrected through a simple one-cycle clock-enable mechanism. The system can therefore run at a preset predicted error rate with marginal performance loss and, at the correspondingly reduced supply voltage, realize a design with negative margins.
In practical chip operations, the precision of timing error prediction is significantly impacted by variations in clock network delay and data path delay, exacerbated in near-threshold conditions. To counteract these effects, a dynamic voltage regulation circuitry is introduced, calibrating the system’s timing error prediction capability and power supply voltage according to EDaC circuitry’s error detection outcomes. This adaptive calibration ensures stable system operation at the specified error rate, accommodating clock network and data path delay influences, particularly in near-threshold contexts.
This dual-pronged strategy, amalgamating error prediction enhancement and dynamic voltage regulation, epitomizes our proposed approach, culminating in the achievement of stable operations with negative margins. Empirical evaluation and experimental validation are undertaken via the deployment of a near-threshold 32-bit ARM Cortex M0 microprocessor system. The ensuing outcomes underscore the effectiveness of our methodology in realizing negative design margins while concurrently optimizing system performance and energy efficiency. The proposed approach is realized through a 55 nm CMOS implementation within a near-threshold 32-bit ARM Cortex M0 microprocessor system, demonstrating a modest 6.84% area overhead compared to the baseline non-EDaC system. Experimental findings corroborate a 46.95% energy reduction at the Minimum Energy Point (MEP, 15 MHz) when contrasted against signoff margins, and a 19.75% energy decrease relative to the zero-margin operation, underscoring the achievement of negative margins.
2. Overview of EDaC Techniques
2.1. Error Detection
In most EDaC systems, timing error detection is implemented by double sampling (DS). It relies on the fact that the sample at T2 is more likely to be correct than the sample at T1 (T2 > T1). Several works [7,8,9] implement DS by adding a shadow latch to a sequential element. The rising and falling edge of the clock are used as T1 for the sequential element and T2 for the shadow latch. If both samples do not match, this is flagged as a timing error. In order to alleviate area overhead and constraints on the clock, DS is gradually replaced by transition detection (TD). A late activity in DW detected by the TD cell is flagged as a timing error. Several works [11,12,13,14,15,16,17,21,22,23,24] insert a TD cell at the data port of the timing element or integrate it into the element. It should be noted that the error-aware capability of the DS-/TD-based system is limited to timing errors on FF- or latch-based endpoints. This leaves critical paths toward input–output (IO) cells and macros (e.g., SRAM) unprotected. Generally, this limitation is overcome by splitting unprotected critical paths into several non-critical paths, which adds to the design complexity.
2.2. Detection Window
In DS or TD EDaC systems, DW provides a timing range for error detection. However, this brings additional hold constraints for all monitored endpoints to avoid false errors due to short-paths’ conduction. Some works [9,12,13,14,21,24] reuse the system clock as DW, and DW width adjustment is achieved by clock duty tuning. This implementation imposes a large number of hold buffer insertions and continues to grow as supply decreases. Several works [10,11,17] add a dedicated DW circuit to balance the conflict between area overhead and error-aware capability, which is difficult to quantify.
2.3. Error Correction
All error correction strategies can be divided into the following three categories: (1) correction, (2) prediction, (3) masking. Strategy (1) is usually used in DS- or TD-based EDaC systems to ensure operation near the point of first failure (PoFF) with little margin. In [11,24], the corrections were achieved through instruction replay or bubble insertion at the cost of significant cycle overhead, making them difficult to apply to different processors. Both (2) and (3) are preferred in systems operating beyond the PoFF to achieve more energy savings with little overall loss owing to clock gating/stretching. In (2), timing errors are predicted and eliminated before the clock rising edge. In [25,26], the completion of logic operations in the datapath was checked at the end of the clock period to determine, instantaneously, the presence of late-arriving signals. This strategy omits the correction circuit but produces a large number of false errors, because clock latency causes a mismatch between the prediction window (PW) and the endpoint clocks. Finally, in (3), late arrivals are passed to the next cycle, which allows systems to operate without disturbance due to timing errors. This approach is usually combined with a latch-based pipeline and relies on the time-borrowing property of latches, as in [13,19,20]. However, a maximum borrowing cascade limit needs to be imposed to avoid the risk of system failures. Meanwhile, it is difficult to eliminate the adverse impact of glitches on latch-based systems [27,28].
3. Presented Concept and Analysis
To address the shortcomings of the existing EDaC strategies, this work proposes the deep path activity detection–EDaC (DPAD–EDaC) strategy, consisting of lightweight error prediction, detection, and correction circuits. The benefits of error prediction before the clock edge (e.g., simple correction and strong error-aware capability) are retained, while the critical problems (e.g., false prediction errors in the error prediction circuit (EPC) and the critical tradeoff in the error detection circuit (EDC)) are addressed. Finally, the DPAD–EDaC system operates at a certain error rate with the support of DVS to achieve a negative design margin.
3.1. Error Prediction Circuit (EPC) Concept
As shown in Figure 1, timing error prediction relies on the following two operations. First, a TD cell (yellow shadow), consisting of a delay chain and an XOR gate, flags toggle activity in the critical path by outputting a high pulse whose width is set by the delay chain. Next, the DYN-OR TREE (blue shadow) evaluates the outputs of all TDs within the prediction window (PW) at the end of the clock cycle and reduces them to a single prediction error signal. The generation of this signal is shown in the timing diagram, shaded in purple. Because errors are prevented by clock gating/stretching, PW is confined to a time range immediately before the rising edge of the root clock (clk_root).
Figure 1.
Detailed description of the specific implementation of the DPAD–EDaC strategy, where TDd0 and TDc0 identify late-arriving data changes that generate high prediction and detection error signals, respectively.
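The two prediction steps can be illustrated with a small behavioral sketch. It is not the circuit itself: the function name, the time convention (rising edge of the root clock at t = 0, earlier times negative), and the numeric values are assumptions made for illustration. Each toggle of a monitored cell produces a TD pulse, and a prediction error is raised if any pulse overlaps the prediction window.

```python
def predicts_error(toggle_times_ns, td_pulse_ns, pw_width_ns):
    """Behavioral model of TD cells feeding an OR tree sampled during PW.

    toggle_times_ns: toggle instants of monitored cells, relative to the
                     root clock rising edge at t = 0 (negative = before).
    td_pulse_ns:     width of the pulse a TD cell emits on each toggle.
    pw_width_ns:     width of the prediction window, which ends at t = 0.
    """
    window_start, window_end = -pw_width_ns, 0.0
    # A pulse [t, t + td_pulse_ns] is seen if it overlaps the window.
    return any(
        t < window_end and t + td_pulse_ns > window_start
        for t in toggle_times_ns
    )


# Hypothetical numbers: one toggle lands 0.4 ns before the edge.
late = predicts_error([-3.0, -0.4], td_pulse_ns=0.2, pw_width_ns=0.5)
early = predicts_error([-3.0, -1.0], td_pulse_ns=0.2, pw_width_ns=0.5)
```

Widening `pw_width_ns` makes earlier toggles visible, which is the knob the text later adjusts dynamically.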
In an ideal clock network, clk_root coincides with the clock of a sequential endpoint. A high prediction error signal then indicates that ongoing critical activity in PW is close to the endpoint clock edge, which is equivalent to an actual timing error. In an actual chip, however, clock network latency separates PW from the endpoint clock edge, and this latency increases rapidly as the voltage decreases. The resulting loss of equivalence between a high prediction error signal and a timing error produces a large number of false errors in the EP systems of [25,26]. To address this problem, we first push the monitored cells deeper into critical paths to precisely compensate for the clock latency and re-establish the equivalence. However, the variable datapath propagation delay and clock latency will continue to destroy the equivalence, increasing the difficulty of monitored cell selection. Next, the PW width is no longer fixed, but is instead precisely adjusted by the results of EDC. The adjustment process can also be regarded as the process of re-establishing the equivalence. Meanwhile, the error correction circuit (ECC) is necessary to ensure correct system operation during the adjustment process.
In this paper, PW is generated by the PW_GEN module (top left in Figure 1) and adjusted by the PW_status signal from the dynamic window scaling (DWS) module. As stated above, the DYN-OR TREE executes the dynamic behavior, which takes care of the instant sampling of TDs’ outputs during PW. To guarantee reliable sampling, two constraints on the PW width are proposed:
- As the supply decreases after power-up, timing errors are gradually detected by EDC, and PW_status increases accordingly until one of the TDs’ outputs is covered. Therefore, the difference between adjacent PW widths should be greater than a TD’s output pulse width, which needs to be satisfied at all PVT corners. With $W_{PW,i}$ denoting the PW width at PW_status $i$ and $W_{TD}$ the TD output pulse width:

$$W_{PW,i+1} - W_{PW,i} > W_{TD} \quad (1)$$

- A minimum pulse width constraint applies to PW. This minimum equals the sum of the worst-case propagation delay of the DYN-OR TREE and the setup time of the Integrated Clock Gating (ICG) cell:

$$W_{PW,\min} \geq t_{pd,\mathrm{DYN\text{-}OR}} + t_{setup,\mathrm{ICG}} \quad (2)$$
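The two PW width constraints stated in the list above can be checked mechanically. The following is a minimal sketch under stated assumptions: the function name, the argument names, and all numeric values are hypothetical, and a real check would be repeated with values characterized at each PVT corner.

```python
def check_pw_widths(pw_widths_ns, td_pulse_ns, dyn_or_delay_ns, icg_setup_ns):
    """Return a list of violated constraints (empty if the widths are valid).

    pw_widths_ns:    PW width for each PW_status, in increasing order.
    td_pulse_ns:     TD output pulse width.
    dyn_or_delay_ns: worst-case propagation delay of the DYN-OR TREE.
    icg_setup_ns:    setup time of the ICG cell.
    """
    violations = []

    # Constraint (2): the narrowest PW must cover OR-tree delay + ICG setup.
    min_width = dyn_or_delay_ns + icg_setup_ns
    if pw_widths_ns[0] < min_width:
        violations.append("minimum PW width below OR-tree delay + ICG setup")

    # Constraint (1): each step between adjacent PW widths must exceed
    # one TD output pulse width.
    for i in range(1, len(pw_widths_ns)):
        step = pw_widths_ns[i] - pw_widths_ns[i - 1]
        if step <= td_pulse_ns:
            violations.append(f"step {i - 1}->{i} not wider than a TD pulse")

    return violations


# Hypothetical values, in nanoseconds.
ok = check_pw_widths([0.30, 0.55, 0.80], 0.2, 0.2, 0.05)
bad = check_pw_widths([0.20, 0.30], 0.2, 0.2, 0.05)
```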
It is clear that the interval between and is the safe propagation range (SRR) for each TD’s output. This means that TDs’ outputs covered by SRR can be accurately evaluated and propagated through DYN-OR TREE during PW adjustment, that is, these cells can be monitored.
As shown in Figure 2, the path is placed on a timeline with the rising edge of as the origin. For each path, (arrival time for path), (latest arrival time limit), and (the left boundary of PW) can be expressed by constants (i.e., , , , and ) at defined PVT conditions:
| (3) |
Figure 2.
TIMELINE with clk_root as the origin. represents the clock tree latency of the source endpoint. represents the clock period. represents the propagation time of the critical path. represents the setup time needed for proper operation. represents the corresponding PW width.
Firstly, the when timing error occurs coincidentally can be inferred by letting ; then, the in the timeline can be further obtained for each path:
| (4) |
Next, the SRR and the left boundary of PW at each PW_status can be determined by changing the value of i. Finally, monitored cells can be further selected. In this work, we select the cell just covered by (the middle of SRR) as the FMC (First Monitored Cell) for each critical data path. Then, additional monitored cells are selected with a maximum interval constraint to maintain continuous error-aware capability under severe variations. This constraint applies to the adjacent monitored cells, which equals the minimum interval between and :
| (5) |
This ensures that there are always monitored cells within SRR to maintain continuous error-aware capability. It follows that this capability is guaranteed by discontinuous monitored cells.
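The effect of a maximum interval constraint on cell selection can be sketched as a greedy walk along one path. This is an illustrative assumption about how such a selection could be automated, not the procedure used in this work: the function name, the integer picosecond arrival times, the walking direction, and the starting index standing in for the FMC are all hypothetical.

```python
def select_monitored_cells(arrival_ps, first_idx, max_interval_ps):
    """Pick cell indices so adjacent picks are at most max_interval_ps apart.

    arrival_ps:      arrival times of the cells along one path, ascending.
    first_idx:       index of the first monitored cell (stand-in for the FMC).
    max_interval_ps: largest allowed gap between adjacent monitored cells.
    """
    selected = [first_idx]
    i = first_idx
    while i < len(arrival_ps) - 1:
        reach = arrival_ps[i] + max_interval_ps
        j = i
        # Advance to the farthest cell that still satisfies the constraint.
        while j + 1 < len(arrival_ps) and arrival_ps[j + 1] <= reach:
            j += 1
        if j == i:
            raise ValueError("no cell within the maximum interval")
        selected.append(j)
        i = j
    return selected


# Hypothetical arrival times in picoseconds.
cells = select_monitored_cells([0, 30, 60, 100, 130, 200], 0, 70)
```

Taking the farthest admissible cell at each step keeps the selection sparse while never leaving a gap wider than the constraint allows.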
3.2. Error Detection Circuit (EDC) Concept
As suggested above, the PW width is precisely adjusted by the results of EDC, which relies on the following three operations, as shown in Figure 1. First, a TD cell flags toggle activity at the critical endpoint’s data input by outputting a pulse signal. Next, all TDs are grouped, and the TD outputs in each group are evaluated and latched within the DW by the shared error latch (SEL in Figure 3) in the Error Latch Line (gray shadow in Figure 1). The DW generation (DW_GEN) cell (green shadow in Figure 1), consisting of a delay chain and a NAND2 gate, marks the timing range for error detection by outputting a high pulse at the start of the clock cycle. Finally, the Error-OR Tree (orange shadow in Figure 1) ORs all outputs into a single detection error signal (Fd_error). The generation of the Fd_error signal is shown in the timing diagram shaded in orange in Figure 1.
Figure 3.
Schematic and timing diagram of the SEL cell. The fan-in of the cell changes with the number of NMOS transistors in the parallel network.
In more detail, the number of TD cells in each group and the fan-in of the corresponding SEL cell should be equal. We can also configure it flexibly according to design requirements. The design of multiple endpoints sharing one SEL cell can effectively reduce the area overhead of EDC.
As described in Section 2.2, the conflict between DW width and area overhead brings great difficulties to conventional EDaC systems. In this paper, this conflict is naturally attenuated. The results of EDC help with the adjustment of PW and, more specifically, the re-establishment of the equivalence between prediction result and actual timing error. Thus, the system’s accurate operation is guaranteed by prediction, which weakens the constraint on DW width imposed by error-aware capability. Despite that, the process of voltage decreasing from standard voltage (1.2 V) to the appropriate low voltage needs to be specially considered. During the process, the DW must be wide enough to detect all errors caused by any voltage decrease step. As shown in Equation (6), the minimum DW width (FF corner) needs to be greater than the maximum delay difference (SS corner) between the critical paths at any two adjacent voltages:
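$$W_{DW,\min}^{FF} > \max_{k}\left[ t_{d}^{SS}(V_{k+1}) - t_{d}^{SS}(V_{k}) \right] \quad (6)$$

where $W_{DW,\min}^{FF}$ is the minimum DW width at the FF corner, $t_{d}^{SS}(V)$ is the critical path delay at the SS corner and supply voltage $V$, and $V_{k+1}$ is the voltage level one step below $V_{k}$.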
In addition, considering the meta-stability of the timing elements [29,30], a TD’s output pulse width should be greater than the sum of the setup time for the critical endpoint and hold time for the connected SEL:
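$$W_{TD} > t_{setup,EP} + t_{hold,SEL} \quad (7)$$

where $W_{TD}$ is the TD output pulse width, $t_{setup,EP}$ is the setup time of the critical endpoint, and $t_{hold,SEL}$ is the hold time of the connected SEL.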
In this way, the meta-stability event can be detected.
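The DW width and TD pulse width constraints described above reduce to two comparisons once the delays are tabulated. The sketch below assumes a hypothetical table of SS-corner critical path delays per voltage level; the function names and all numbers are illustrative only.

```python
def max_adjacent_delay_increase(ss_delay_ps_by_mv):
    """Largest delay increase between two adjacent voltage levels.

    ss_delay_ps_by_mv: {supply in mV: SS-corner critical path delay in ps},
                       with at least two voltage levels.
    """
    levels = sorted(ss_delay_ps_by_mv, reverse=True)  # high to low voltage
    return max(
        ss_delay_ps_by_mv[low] - ss_delay_ps_by_mv[high]
        for high, low in zip(levels, levels[1:])
    )


def dw_is_wide_enough(ff_dw_width_ps, ss_delay_ps_by_mv):
    """DW constraint: FF-corner DW width exceeds any one-step delay increase."""
    return ff_dw_width_ps > max_adjacent_delay_increase(ss_delay_ps_by_mv)


def td_pulse_is_wide_enough(td_pulse_ps, endpoint_setup_ps, sel_hold_ps):
    """TD constraint: pulse exceeds endpoint setup time + SEL hold time."""
    return td_pulse_ps > endpoint_setup_ps + sel_hold_ps


# Hypothetical delays for three adjacent 25 mV levels.
delays = {500: 40000, 475: 43500, 450: 48000}
worst_step = max_adjacent_delay_increase(delays)
```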
3.3. Error Correction Circuit (ECC) Concept
The error correction operation is necessary when a timing error occurs. However, the correction methods proposed in works [11,24] are strongly correlated with the processor architecture, which limits their implementation in other systems [31]. To address this problem, we propose an instruction replay method for error correction based on a backup of the architectural registers (ARs). The concrete implementation of ECC is shown in Figure 4, where the changes in ARs are backed up at each instruction boundary. In this work, each instruction modifies up to four ARs (PC, PSR, C bit, and one of R0–R15) at the same time, which is an inherent characteristic of the ARM Thumb-2 ISA. Thus, the backup block consists of four memory rows with a low area overhead. A high detection error signal (Fd_error) flushes and stalls the global pipeline until voltage or PW regulation completes, which is indicated by a high replay signal. Then, the modified ARs are overwritten with the corresponding values in the backup block, and the processor returns to normal operation.
Figure 4.
Illustration of the presented ECC. The valid bit in each row represents whether a change in the corresponding AR has occurred.
The validity of this correction technique can be generalized to more processors by distinguishing instruction boundaries, the maximum number, and categories of modified ARs for each instruction.
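The backup-and-replay idea can be modeled in a few lines of software. This is a behavioral sketch under assumptions, not the hardware in Figure 4: the class and method names are hypothetical, and the model assumes the backup rows hold the values the ARs had before the current instruction modified them.

```python
class BackupBlock:
    """Holds pre-instruction values of the ARs touched by one instruction."""

    MAX_ROWS = 4  # the text states at most four ARs change per instruction

    def __init__(self):
        self.rows = {}  # AR name -> value before the current instruction

    def start_instruction(self):
        """At an instruction boundary, invalidate all rows."""
        self.rows.clear()

    def write(self, regs, name, value):
        """Modify an AR, backing up its old value on the first write."""
        if name not in self.rows:
            if len(self.rows) == self.MAX_ROWS:
                raise ValueError("more than four ARs modified")
            self.rows[name] = regs[name]
        regs[name] = value

    def replay(self, regs):
        """On a detected error, restore the ARs so the instruction can rerun."""
        regs.update(self.rows)
        self.rows.clear()


regs = {"PC": 0x100, "PSR": 0, "R1": 5}
backup = BackupBlock()
backup.start_instruction()
backup.write(regs, "R1", 9)      # instruction result (possibly corrupted)
backup.write(regs, "PC", 0x102)
backup.replay(regs)              # timing error detected: roll back
```

Because only the modified registers are stored, the model needs no knowledge of the pipeline, which mirrors the portability argument made in the text.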
3.4. DVS and DWS Module
As shown in Figure 5, the integration of the DPAD–EDaC strategy can be abstracted as the implementation of a distributed error processor (DEP), consisting of distributed EPC, EDC, and the error rate comparison circuit. The signals from DEP are involved in DVS and DWS modules to ensure correct system operation at a certain prediction error rate for negative margins.
Figure 5.
Detailed description of the DVS and DWS modules and the distributed error processor. The DEP obtains data activities from critical paths, and then generates prediction/detection error signals and prediction error rate for DVS and DWS module operation.
The error rate comparison circuit consists of a timer, a counter, and a comparator, which are driven by the ungated clock to calculate the prediction error rate and compare it with high and low thresholds. The registers in the timer (counter) start self-incrementing (counting prediction errors) from the configured initial value until an overflow signal or a reset signal is generated. In more detail, a high timer overflow signal represents timing completion, and a high counter overflow signal represents the error rate exceeding the high threshold. Meanwhile, the high timer overflow signal enables the comparator to compare the counting result with the low threshold, and a high MER signal is generated for voltage decrease if the result is lower.
A high detection error signal from EDC directs the DWS module to increase PW_status and generate the high replay signal for instruction replay. However, if PW_status is about to overflow or the error rate exceeds the high threshold, the DVS module is directed to increase the supply voltage instead. The high replay signal is then generated upon receiving the completion signal from the low-dropout regulator (LDO).
To handle voltage regulation loops caused by temporary variations (e.g., temperature variations, IR drop), a voltage–PW_status pair dictionary is added to reduce the resulting cycle overhead. When the voltage is adjusted from level A to level B, the current PW_status is stored in the dictionary under index A and the entry under index B is read out, so that the PW width is adjusted correctly within one cycle.
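The interaction between window scaling, voltage scaling, and the voltage–PW_status dictionary can be sketched as a small state machine. All names, the 25 mV step, and the maximum PW_status of 7 are assumptions chosen for illustration; the sketch models only the decisions described above, not the timer, counter, or LDO handshake.

```python
class Regulator:
    """Behavioral model of the DWS/DVS decisions with a PW_status dictionary."""

    def __init__(self, voltage_mv=1200, step_mv=25, pw_max=7):
        self.voltage_mv = voltage_mv
        self.step_mv = step_mv
        self.pw_max = pw_max      # assumed largest PW_status value
        self.pw_status = 0
        self.table = {}           # voltage level -> last PW_status used there

    def _set_voltage(self, new_mv):
        # Store PW_status under the old level, read it back for the new one.
        self.table[self.voltage_mv] = self.pw_status
        self.pw_status = self.table.get(new_mv, self.pw_status)
        self.voltage_mv = new_mv

    def on_detection_error(self, rate_too_high=False):
        """EDC flagged an error: widen PW, or raise the supply if PW is maxed."""
        if rate_too_high or self.pw_status >= self.pw_max:
            self._set_voltage(self.voltage_mv + self.step_mv)
        else:
            self.pw_status += 1

    def on_timer_done(self, error_count, low_threshold):
        """Timing window finished: lower the supply if errors are rare."""
        if error_count < low_threshold:
            self._set_voltage(self.voltage_mv - self.step_mv)


reg = Regulator(voltage_mv=500, pw_max=2)
reg.on_timer_done(error_count=0, low_threshold=5)   # 500 -> 475 mV
reg.on_detection_error()                            # PW_status 0 -> 1
reg.on_detection_error()                            # PW_status 1 -> 2
reg.on_detection_error()                            # maxed out: back to 500 mV
reg.on_timer_done(error_count=0, low_threshold=5)   # 475 mV again
```

On the second visit to 475 mV the model restores the PW_status learned there instead of rediscovering it error by error, which is the cycle saving the dictionary is meant to provide.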
4. Implementation
As shown in Figure 6, the DPAD–EDaC strategy was implemented in a near-threshold 32-bit microprocessor system. It consists of a Cortex-M0 core, a 2 KB self-designed stable 7T SRAM, the distributed error processor, the DVS module, the DWS module, and several peripherals.
Figure 6.
The 32-bit microprocessor system with DPAD–EDaC overview. The three colored dashed boxes represent the three voltage domains.
4.1. Ultra-Low Voltage Implementation
This implementation targets 5–100 MHz operation speed, with voltage swept from 0.4 to 1.2 V in steps of 25 mV. To facilitate the ultra-low voltage implementation, the operation of all standard cells is verified at the concerned voltage. Cells with functional errors or extremely large delays are excluded, and retained cells are recharacterized at these voltages to obtain new timing libraries. Next, a low-power synthesis and place and route (P&R) flow translates the microprocessor register transfer level (RTL) to silicon. During this process, the system is split into three power domains, as shown in Figure 6. It is noteworthy that SRAM is placed in the regulated low-voltage domain for lower energy consumption. Considering the read failure problem of conventional 6T-SRAM at low voltages [32,33,34], a customized stable 7T-SRAM with error detection capability suited to EDaC systems was used.
The integration of DPAD–EDaC can also be interpreted as the process of taking EPC and EDC from design to integration. To this end, the design flow was expanded with a series of automated steps, as shown in Figure 7. Firstly, statistical static timing analysis was performed after the initial P&R to determine the design details of EDC and EPC.
Figure 7.
Automated flow for DPAD–EDaC integration. Baseline P&R design is the same as the conventional design.
4.2. EDC Design Details
Monitored cell selection is one of the main focuses in EDC design. The maximum clock frequency is determined by critical paths. Thus, only the most critical endpoints need to be monitored, allowing a limited overhead. Figure 8 shows a histogram of all the endpoints, ordered with respect to the smallest timing slack path they serve. The number of monitored endpoints is determined by the chance of false monitoring of the EDC, i.e., the chance that a non-monitored path propagates more slowly than all monitored paths owing to various variations, as described in Equation (8):

$$P_{false} = P\left( \max_{i \in \mathcal{N}} t_{d,i} > \max_{j \in \mathcal{M}} t_{d,j} \right) \quad (8)$$

where $\mathcal{N}$ represents all non-monitored paths, $\mathcal{M}$ represents all monitored paths, and $t_{d}$ is the path delay. When false monitoring occurs, it is probable that non-monitored paths have failed but monitored paths have not, thus causing a system operation failure. To avoid this occurrence, enough timing slack (12.2% Tperiod in this paper) is covered, with 343 out of 4025 endpoints being monitored. The probability of such an event was determined by the delay distribution of a subset of paths from 1000 Monte Carlo (MC) simulations at 500 mV (see Figure 9), which was less than 1 × in this paper, decreasing with increased voltage.
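The probability of false monitoring described above can be estimated by direct sampling. The sketch below is an illustration under strong assumptions that are not taken from this work: each path delay is modeled as an independent Gaussian, and the function name, path parameters, and seed are hypothetical. A direct count over N trials cannot resolve probabilities below 1/N, so very small probabilities require either more trials or a fitted distribution.

```python
import random


def false_monitor_probability(monitored, non_monitored, trials=1000, seed=0):
    """Estimate P(slowest non-monitored path > slowest monitored path).

    monitored, non_monitored: lists of (mean_delay, sigma) per path.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Draw one delay per path and keep the slowest of each set.
        worst_mon = max(rng.gauss(mu, sigma) for mu, sigma in monitored)
        worst_non = max(rng.gauss(mu, sigma) for mu, sigma in non_monitored)
        if worst_non > worst_mon:
            hits += 1
    return hits / trials


# Hypothetical delays (ns): monitored paths are much slower on average.
p = false_monitor_probability(
    monitored=[(10.0, 0.1), (9.8, 0.1)],
    non_monitored=[(5.0, 0.1), (4.5, 0.1)],
)
```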
Figure 8.
Histogram of the path with the smallest timing slack at each endpoint.
Figure 9.
Boxplot of the slack distribution of a subset of paths, obtained from 1000 MC simulations at 0.5 V, TT corner.
Then, the TD cells are inserted at the data ports of all monitored endpoints. Equation (7) constrains the minimum pulse width of the TD cell. The pulse width is determined by its internal delay chain. An initial TD implementation with existing standard cells results in an area footprint of 10.08 µm². To minimize area, stacked transistors with long gate length, which exploit the RSCE (Reverse Short-Channel Effect), are used in the delay chain. This custom TD cell reduces the area footprint to 6.72 µm².
Another focus of EDC design is the DW_GEN (DW generation) cell design. The EDC is equipped with a 7% Tperiod DW width satisfying the constraint in Equation (6). Then, the DW_GEN cell can be implemented and optimized with the same internal delay optimization methods as in a TD cell. Meanwhile, this results in 608 hold time buffers as overhead, as shown by the selected point in Figure 10. Compared to conventional EDaC systems with 10% Tperiod DW (1789 hold buffers), a 68.59% reduction in hold buffers is achieved. This owes to the weakening correlation between DW width and error aware capability in the DPAD–EDaC system.
Figure 10.
Hold buffers’ insertion number versus DW width.
4.3. EPC Design Details
One of the main focuses in EPC design is the PW_GEN (PW generation) cell design. As suggested above, Equation (2) constrains the minimum width of PW, and Equation (1) constrains the minimum PW width difference between adjacent PW_status values. Then, the PW_GEN cell can be implemented and optimized in the same way as the DW_GEN cell.
Another focus in EPC design is the monitored cell selection. Cells can be selected from the static timing reports, which directly influence the performance of the prediction system. For each path towards monitored endpoints, the defined values obtained from the timing report are substituted into Equations (4) and (5). Then, all monitored cells can be selected discontinuously. We keep in mind that the DVS loop is able to re-scale overall circuit performance to a pre-defined error rate. The system does not require excessive error prediction capability. In total, when 125 monitored cells are selected, the sparse selection in this paper achieves 87.5% area reduction compared to the 1000 monitored cells selected in works [25,26].
Next, all TDs are connected to the monitored cells and endpoints for integration. To minimize extra wire loading on the existing paths, the TDs are placed as close as possible to the selected cells. All layout modifications are made using engineering change order (ECO) commands to avoid large changes to the existing layout. In more detail, the 343 TDs in the EDC are divided into 57 groups based on the position of endpoints in the clock network. Each group is connected to an SEL cell that is placed centrally in the group, which helps to minimize the overall routing overhead. The same procedure is repeated to obtain the other connections in EDC. Then, the 125 TDs in the EPC are divided into groups of 5 by a k-means clustering algorithm based on their location, and the central placement procedure is repeated for the other connections in EPC. All new placement and routing are carried out using ECO commands, and the resulting timing is verified against all constraints. Further iterations of clustering/placement are performed until timing is closed. The DPAD–EDaC system is implemented in a 55 nm CMOS process, and the die photograph is shown in Figure 11. Here, the EDaC insertion incurs a 6.84% area overhead, and the details are shown in Figure 12.
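Location-based grouping with k-means can be sketched in plain Python. This is a generic Lloyd's-algorithm illustration, not the flow used for the chip: the function names, the deterministic farthest-point initialization, and the coordinates are assumptions. Standard k-means does not enforce equal group sizes, so a flow that needs a fixed fan-in per group would need an extra balancing step.

```python
def dist2(p, q):
    """Squared Euclidean distance between two (x, y) points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def kmeans(points, k, iters=50):
    """Cluster (x, y) cell locations; returns (labels, centroids)."""
    # Farthest-point initialization keeps the sketch deterministic.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(dist2(p, c) for c in centroids))
        )
    labels = None
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: dist2(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break  # converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return labels, centroids


# Hypothetical TD locations (um): two well-separated groups.
tds = [(0, 0), (1, 0), (0, 1), (100, 100), (101, 100), (100, 101)]
labels, centers = kmeans(tds, k=2)
```

Each centroid is a natural candidate location for the shared cell of its group, in the spirit of the central placement described above.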
Figure 11.
Chip micrograph.
Figure 12.
The components of DPAD–EDaC area overhead. The EDC accounts for 46.2% of DPAD–EDaC’s area overhead, which is the maximum.
5. Experimental Results
This section reports the voltage and energy results at different operating conditions, obtained from post-simulation and power analysis.
5.1. Experimental Setup
The system was verified to be fully functional by running a Dhrystone benchmark in a frequency range from 5 to 100 MHz under four voltage scaling conditions: (1) zero-margin critical voltage scaling; (2) signoff voltage scaling; (3) PoFF voltage scaling; (4) 5% error rate voltage scaling.
In the first condition, the system operates with a zero margin at the most critical point. This voltage can be obtained from the static timing analysis (STA) results by lowering the voltage until the slack of the most critical path equals zero at the TT corner and 20 °C. In the second condition, the system operates with a conventional signoff margin. The voltage is obtained by the same procedure as in the first condition, except that the conditions are changed to the SS corner, 125 °C, and a 10% supply drop. In the third condition, the system operates more critically than the signoff margin. Here, the supply voltage is adjusted according to the error detection results from EDC, with EPC disabled. The PoFF is defined as 1 detection error per 10,000 cycles at the TT corner and 20 °C. In the fourth condition, the system operates the most critically, with negative margins at a 5% error rate tolerance point. Here, the supply voltage is adjusted by the DVS module according to the error rate at the TT corner and 20 °C.
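The procedure of lowering the voltage until the critical path slack reaches zero can be sketched as a table lookup. The function name and the delay tables are hypothetical toy values, not results from this work; in practice the table would come from STA runs at the chosen corner and temperature.

```python
def lowest_voltage_with_nonnegative_slack(delay_ns_by_mv, period_ns):
    """Walk down the voltage levels until the critical path misses timing.

    delay_ns_by_mv: {supply in mV: critical path delay in ns} from STA.
    period_ns:      clock period.
    Returns the lowest voltage with slack >= 0, or None if none passes.
    """
    lowest = None
    for mv in sorted(delay_ns_by_mv, reverse=True):
        if delay_ns_by_mv[mv] > period_ns:
            break  # slack went negative; stop lowering the voltage
        lowest = mv
    return lowest


# Toy tables for a 10 ns period: typical corner versus a pessimistic corner.
typical = {1000: 4.0, 900: 6.0, 800: 9.5, 700: 14.0}
pessimistic = {1000: 7.0, 900: 9.8, 800: 15.0, 700: 24.0}
v_zero_margin = lowest_voltage_with_nonnegative_slack(typical, 10.0)
v_signoff = lowest_voltage_with_nonnegative_slack(pessimistic, 10.0)
```

Running the same search on the pessimistic table yields a higher voltage, and the gap between the two results is the margin that the signoff condition adds.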
5.2. Operation Comparison under Four Conditions
Figure 13 compares the voltage scaling of the four conditions over the frequency range. The area between the signoff and 5% error rate curves (pink shadow) represents the voltage margin reduction compared to a conventional baseline design, which equals 195 mV at the MEP (15 MHz). The area between the zero-margin and 5% error rate curves (slope shadow) represents the voltage margin reduction compared to a zero-margin design, which equals 10 mV at the MEP. The curves show that the DPAD–EDaC system achieves less voltage margin reduction at near-/sub-threshold voltage. This trend arises because a single voltage step (25 mV drop) in the near-/sub-threshold region degrades the propagation delay of critical paths enough to raise the error rate by more than 5%.
Figure 13.
Voltage scaling and margin over frequency at four operation conditions. The voltage values in MEP are specially marked.
Next, Figure 14 shows the energy consumption analysis under these voltage scaling conditions. The areas between the signoff and 5% error rate curves (purple shadow) and between the zero-margin and 5% error rate curves (diagonal shadow) represent the energy savings compared to a conventional baseline design and a zero-margin design, which equal 14.02 pJ/cycle and 3.74 pJ/cycle at the MEP, respectively.
Figure 14.
Energy scaling and margin over frequency at four operation conditions. The energy values in MEP are specially marked.
5.3. Comparison
A more detailed comparison between the proposed DPAD–EDaC and existing near-threshold-capable EDaC-based systems is provided in Table 1. The top part of the table focuses on strategy integration details (e.g., DW width, area overhead, and low-voltage stability). It shows that our work achieves high error-aware capability with a narrow DW and low area overhead. At the same time, the widespread failures that occur in most prediction systems due to large clock latency at low supply voltage are eliminated in this work.
Table 1.
Summary and comparison with existing near-threshold-capable EDaC systems.
| | TVLSI’17 [31] | JSSC’18 [18] | JSSC’17 [23] | JSSC’19 [20] | JSSC’22 [25] | This Work |
|---|---|---|---|---|---|---|
| Method | EDFF | EDFF | Half path EP | EDL | CD TD | CD TD |
| DW(%TCLK) | - | 5% | 50% | 50% | 20% | 7% |
| Correction | Replay | Borrowing | Predictive clock gating | Borrowing + Replay | Clock gating/Clock stretching | Clock gating + replay |
| Low-voltage failures 1 | None | None | Failures | None | Failures | None |
| Area overhead | 8.70% | 7% | 3.10% | 4.17% | 4.90% | 6.84% |
| Technology | 40 nm | 40 nm | 40 nm | 28 nm | 28 nm | 55 nm |
| Gate count | 145 K | M0/12 K | 5 K | 12 K | 69 K | 12 K |
| F-range (MHz) | 27.4–286 | 5–30 | 40–750 | 18–68 | 1–200 | 1–100 |
| V-range | 0.6 V–1 V | 0.29 V–0.47 V | 0.44 V–1 V | 0.4 V–0.9 V | 0.25 V–0.65 V | 0.4 V–1 V |
| Vdecrease (wrt Vsign) (wrt Ecritical) 2 | 23.10% | 42% | 18% | 40% | 22% | 29.10% |
| Esave (wrt Esign) | 44% | 75% | 50% | 61% | 33% | 47.97% |
| Emargin remained (wrt Ecritical) 3 | - | 37% | - | - | 12% | −19.75% |
1 There are significant failures in strategy at both voltage and higher values. 2 The degree of voltage reduction attained by each method at the minimum energy point is compared to the supply voltage determined by conventional signoff criteria. 3 The conserved energy margin achieved by each method is assessed relative to the zero-margin reference point. Negative values denote further reductions in energy consumption.
Next, the middle part of the table shows the process node, architecture, and the voltage and frequency ranges for functional operation. This indicates that the proposed design operates over a wide voltage and frequency range, extending from the nominal to the subthreshold region.
Finally, the bottom part shows the achieved energy savings. With the predicted error rate set to 5%, we ran Dhrystone benchmark programs on the chip for post-simulation validation. The simulation results indicate that, at the MEP (15 MHz), the chip operates reliably at 0.45 V. In comparison, the corresponding voltage is 0.67 V for the conventional baseline (non-EDaC) design and 0.48 V for zero-margin (critical) operation. Moreover, this work attains a 47.87% reduction in energy consumption compared to the conventional baseline (non-EDaC) design and a 19.75% energy reduction compared to zero-margin critical operation. Because the DVS module adjusts both the supply voltage and the prediction window in real time, our design can lower the supply voltage below the zero-margin point, at the cost of a commensurate performance loss determined by the preset error rate. Notably, the other works that report this figure still retain a positive energy margin of at least 12% relative to the zero-margin point.
6. Conclusions
In conclusion, our research introduces an innovative approach to enhance energy efficiency in near-/sub-threshold computing. The DPAD-EDaC strategy, with its deep path activity evaluation and adaptable prediction windows, effectively addresses the energy challenges in digital circuits. Its integration into a 32-bit microprocessor system demonstrates its robustness in achieving negative-margin operation with minimal overhead. The strategy’s adaptability, accentuated by clock gating, ensures effective error mitigation across voltage ranges. The significant energy savings at the MEP affirm its effectiveness in reducing consumption without compromising performance. Our study offers a valuable avenue for optimizing energy in near-/sub-threshold computing scenarios.
Acknowledgments
This work was supported in part by the Ultra-Low Power Research Center, Huazhong University of Science and Technology, China; and in part by Wuhan Top-AI Semiconductor Co., Ltd.
Abbreviations
The following abbreviations are used in this manuscript:
| EDaC | Error Detection and Correction |
| DVS | Dynamic Voltage Scaling |
| MEP | Minimum Energy Point |
| NTC | Near-Threshold Computing |
| DW | Detection Window |
| EP | Error Prediction |
| PW | Prediction Window |
| EDC | Error Detection Circuit |
| EPC | Error Prediction Circuit |
| ECC | Error Correction Circuit |
| ARs | Architecture Registers |
| SoTA | State-of-The-Art |
| DS | Double Sampling |
| TD | Transition Detection |
| IO | Input–Output |
| PoFF | Point of First Failure |
| DPAD–EDaC | Deep Path Activity Detection–EDaC |
| SEL | Shared Error Latch |
| DEP | Distributed Error Processor |
| RTL | Register Transfer Level |
| MC | Monte Carlo |
| STA | Static Timing Analysis |
Author Contributions
Conceptualization, R.-Z.Y. and Z.-L.L.; methodology, R.-Z.Y. and X.D.; software, R.-Z.Y. and Z.-H.L.; validation, R.-Z.Y., X.D. and Z.-H.L.; formal analysis, R.-Z.Y.; investigation, Z.-L.L.; resources, Z.-L.L.; writing—original draft preparation, R.-Z.Y.; writing—review and editing, R.-Z.Y.; supervision, R.-Z.Y. All authors have read and agreed to the published version of the manuscript.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
Not applicable.
Conflicts of Interest
The authors declare no conflict of interest.
Funding Statement
This work was supported in part by the National Natural Science Foundation of China under Grant 62274068.
Footnotes
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
References
- 1. Kim J.K., Knag P., Chen T., Zhang Z. A 640M pixel/s 3.65 mW sparse event-driven neuromorphic object recognition processor with on-chip learning; Proceedings of the 2015 Symposium on VLSI Circuits (VLSI Circuits); Montpellier, France. 8–10 July 2015; pp. C50–C51.
- 2. Jun J., Song J., Kim C. A near-threshold voltage oriented digital cell library for high-energy efficiency and optimized performance in 65 nm CMOS process. IEEE Trans. Circuits Syst. I Regul. Pap. 2017;65:1567–1580. doi: 10.1109/TCSI.2017.2758793.
- 3. Jain S., Khare S., Yada S., Ambili V., Salihundam P., Ramani S., Muthukumar S., Srinivasan M., Kumar A., Gb S.K., et al. A 280 mV-to-1.2 V wide-operating-range IA-32 processor in 32 nm CMOS; Proceedings of the 2012 IEEE International Solid-State Circuits Conference; San Francisco, CA, USA. 19–23 February 2012; pp. 66–68.
- 4. Dasgupta A., Sepulveda J.L. Accurate Results in the Clinical Laboratory: A Guide to Error Detection and Correction. Elsevier; Amsterdam, The Netherlands: 2019.
- 5. Bowman K.A., Tschanz J.W., Lu S.L.L., Aseron P.A., Khellah M.M., Raychowdhury A., Geuskens B.M., Tokunaga C., Wilkerson C.B., Karnik T., et al. A 45 nm resilient microprocessor core for dynamic variation tolerance. IEEE J. Solid-State Circuits. 2010;46:194–208. doi: 10.1109/JSSC.2010.2089657.
- 6. Bull D., Das S., Shivashankar K., Dasika G.S., Flautner K., Blaauw D. A power-efficient 32 bit ARM processor using timing-error detection and correction for transient-error tolerance and adaptation to PVT variation. IEEE J. Solid-State Circuits. 2010;46:18–31. doi: 10.1109/JSSC.2010.2079410.
- 7. Das S., Roberts D., Lee S., Pant S., Blaauw D., Austin T., Flautner K., Mudge T. A self-tuning DVS processor using delay-error detection and correction. IEEE J. Solid-State Circuits. 2006;41:792–804. doi: 10.1109/JSSC.2006.870912.
- 8. Shan W., Dai W., Zhang C., Cai H., Liu P., Yang J., Shi L. TG-SPP: A one-transmission-gate short-path padding for wide-voltage-range resilient circuits in 28-nm CMOS. IEEE J. Solid-State Circuits. 2019;55:1422–1436. doi: 10.1109/JSSC.2019.2948164.
- 9. Pawlowski R., Krimer E., Crop J., Postman J., Moezzi-Madani N., Erez M., Chiang P. A 530 mV 10-lane SIMD processor with variation resiliency in 45 nm SOI; Proceedings of the 2012 IEEE International Solid-State Circuits Conference; San Francisco, CA, USA. 19–23 February 2012; pp. 492–494.
- 10. Choudhury M., Chandra V., Mohanram K., Aitken R. TIMBER: Time borrowing and error relaying for online timing error resilience; Proceedings of the 2010 Design, Automation & Test in Europe Conference & Exhibition (DATE 2010); Dresden, Germany. 8–12 March 2010; pp. 1554–1559.
- 11. Das S., Tokunaga C., Pant S., Ma W.H., Kalaiselvan S., Lai K., Bull D.M., Blaauw D.T. RazorII: In situ error detection and correction for PVT and SER tolerance. IEEE J. Solid-State Circuits. 2008;44:32–48. doi: 10.1109/JSSC.2008.2007145.
- 12. Bowman K.A., Tschanz J.W., Kim N.S., Lee J.C., Wilkerson C.B., Lu S.L.L., Karnik T., De V.K. Energy-efficient and metastability-immune resilient circuits for dynamic variation tolerance. IEEE J. Solid-State Circuits. 2008;44:49–63. doi: 10.1109/JSSC.2008.2007148.
- 13. Kim S., Seok M. Variation-tolerant, ultra-low-voltage microprocessor with a low-overhead, within-a-cycle in-situ timing-error detection and correction technique. IEEE J. Solid-State Circuits. 2015;50:1478–1490. doi: 10.1109/JSSC.2015.2418713.
- 14. Drake A., Senger R., Deogun H., Carpenter G., Ghiasi S., Nguyen T., James N., Floyd M., Pokala V. A distributed critical-path timing monitor for a 65 nm high-performance microprocessor; Proceedings of the 2007 IEEE International Solid-State Circuits Conference; San Francisco, CA, USA. 11–15 February 2007; pp. 398–399.
- 15. Felice M., Briscoe T. Towards a standard evaluation method for grammatical error detection and correction; Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies; Denver, CO, USA. 31 May–5 June 2015; pp. 578–587.
- 16. Kim S., Cerqueira J.P., Seok M. A 450 mV timing-margin-free waveform sorter based on body swapping error correction; Proceedings of the 2016 IEEE Symposium on VLSI Circuits (VLSI-Circuits); Honolulu, HI, USA. 15–17 June 2016; pp. 1–2.
- 17. Kwon I., Kim S., Fick D., Kim M., Chen Y.P., Sylvester D. Razor-lite: A light-weight register for error detection by observing virtual supply rails. IEEE J. Solid-State Circuits. 2014;49:2054–2066. doi: 10.1109/JSSC.2014.2328658.
- 18. Reyserhove H., Dehaene W. Margin elimination through timing error detection in a near-threshold enabled 32-bit microcontroller in 40-nm CMOS. IEEE J. Solid-State Circuits. 2018;53:2101–2113. doi: 10.1109/JSSC.2018.2821121.
- 19. Kim S., Cerqueira J.P., Seok M. A Near-Threshold Spiking Neural Network Accelerator With a Body-Swapping-Based In-Situ Error Detection and Correction Technique. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2019;27:1886–1896. doi: 10.1109/TVLSI.2019.2910792.
- 20. Hong C.Y., Liu T.T. A variation-resilient microprocessor with a two-level timing error detection and correction system in 28-nm CMOS. IEEE J. Solid-State Circuits. 2019;55:2285–2294. doi: 10.1109/JSSC.2019.2951692.
- 21. Whatmough P.N., Das S., Bull D.M. A low-power 1-GHz razor FIR accelerator with time-borrow tracking pipeline and approximate error correction in 65-nm CMOS. IEEE J. Solid-State Circuits. 2013;49:84–94. doi: 10.1109/JSSC.2013.2284364.
- 22. Hiienkari M., Teittinen J., Koskinen L., Turnquist M., Kaltiokallio M. A 3.15 pJ/cyc 32-bit RISC CPU with timing-error prevention and adaptive clocking in 28 nm CMOS; Proceedings of the IEEE 2014 Custom Integrated Circuits Conference; San Jose, CA, USA. 15–17 September 2014; pp. 1–4.
- 23. Zhang Y., Khayatzadeh M., Yang K., Saligane M., Pinckney N., Alioto M., Blaauw D., Sylvester D. iRazor: Current-based error detection and correction scheme for PVT variation in 40-nm ARM Cortex-R4 processor. IEEE J. Solid-State Circuits. 2017;53:619–631. doi: 10.1109/JSSC.2017.2749423.
- 24. Fojtik M., Fick D., Kim Y., Pinckney N., Harris D.M., Blaauw D., Sylvester D. Bubble razor: Eliminating timing margins in an ARM Cortex-M3 processor in 45 nm CMOS using architecturally independent error detection and correction. IEEE J. Solid-State Circuits. 2012;48:66–81. doi: 10.1109/JSSC.2012.2220912.
- 25. Uytterhoeven R., Dehaene W. Design Margin Reduction Through Completion Detection in a 28-nm Near-Threshold DSP Processor. IEEE J. Solid-State Circuits. 2021;57:651–660. doi: 10.1109/JSSC.2021.3106245.
- 26. Uytterhoeven R., Dehaene W. Completion detection-based timing error detection and correction in a near-threshold RISC-V microprocessor in FDSOI 28 nm. IEEE Solid-State Circuits Lett. 2020;3:230–233. doi: 10.1109/LSSC.2020.3006626.
- 27. Tadros R.N., Hua W., Moreira M.T., Calazans N.L., Beerel P.A. A low-power low-area error-detecting latch for resilient architectures in 28-nm FDSOI. IEEE Trans. Circuits Syst. II Express Briefs. 2016;63:858–862. doi: 10.1109/TCSII.2016.2536179.
- 28. Hua W., Tadros R.N., Beerel P.A. Low area, low power, robust, highly sensitive error detecting latch for resilient architectures; Proceedings of the 2016 International Symposium on Low Power Electronics and Design; Airport, CA, USA. 8–10 August 2016; pp. 16–21.
- 29. Cannizzaro M., Beer S., Cortadella J., Ginosar R., Lavagno L. SafeRazor: Metastability-robust adaptive clocking in resilient circuits. IEEE Trans. Circuits Syst. I Regul. Pap. 2015;62:2238–2247. doi: 10.1109/TCSI.2014.2365878.
- 30. Beer S., Cannizzaro M., Cortadella J., Ginosar R., Lavagno L. Metastability in better-than-worst-case designs; Proceedings of the 2014 20th IEEE International Symposium on Asynchronous Circuits and Systems; Potsdam, Germany. 12–14 May 2014; pp. 101–102.
- 31. Wang S., Chen C., Xiang X.Y., Meng J.Y. A variation-tolerant near-threshold processor with instruction-level error correction. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2017;25:1993–2006. doi: 10.1109/TVLSI.2017.2676026.
- 32. Kong W., Venkatraman R., Castagnetti R., Duan F., Ramesh S. High-density and high-performance 6T-SRAM for system-on-chip in 130 nm CMOS technology; Proceedings of the 2001 Symposium on VLSI Technology; Kyoto, Japan. 9–10 June 2001; pp. 105–106. Digest of Technical Papers (IEEE Cat. No. 01 CH37184)
- 33. Wen L., Cheng X., Zhou K., Tian S., Zeng X. Bit-interleaving-enabled 8T SRAM with shared data-aware write and reference-based sense amplifier. IEEE Trans. Circuits Syst. II Express Briefs. 2016;63:643–647. doi: 10.1109/TCSII.2016.2530881.
- 34. Do A.T., Lee Z.C., Wang B., Chang I.J., Liu X., Kim T.T.H. 0.2 V 8T SRAM with PVT-aware bitline sensing and column-based data randomization. IEEE J. Solid-State Circuits. 2016;51:1487–1498. doi: 10.1109/JSSC.2016.2540799.