Abstract
Field programmable gate arrays (FPGAs) are widely used to accelerate deep learning applications because of their reconfigurability, flexibility, and fast time-to-market. However, conventional FPGAs suffer from a trade-off between chip area and reconfiguration latency, so efficient FPGA acceleration that requires switching between multiple configurations remains elusive. Here, we propose a ferroelectric field-effect transistor (FeFET)–based context-switching FPGA supporting dynamic reconfiguration to break this trade-off, enabling an arbitrary configuration to be loaded without interrupting execution of the active configuration. Leveraging the intrinsic structure and nonvolatility of FeFETs, compact FPGA primitives are proposed and experimentally verified. Evaluation results show that our design achieves a 63.0%/74.7% reduction in look-up table (LUT)/connection block (CB) area and an 82.7%/53.6% reduction in CB/switch box power consumption with a minimal penalty in the critical path delay (9.6%). In addition, our design yields significant time savings of 78.7% and 20.3% on average for context-switching and dynamic reconfiguration applications, respectively.
A FeFET-based context-switching FPGA that can reconfigure while executing is well suited to adaptive learning machines.
INTRODUCTION
Deep neural networks (DNNs) have dominated artificial intelligence (AI) applications due to their cutting-edge performance in a wide range of domains, such as image classification (1, 2), object detection (3, 4), and natural language processing (5, 6). However, with more sophisticated models and more voluminous data to process (7), these DNN workloads are becoming more compute- and data-intensive, requiring hardware accelerators to achieve lower latency, higher throughput, and higher energy efficiency. Field programmable gate array (FPGA) devices, which offer flexible reconfiguration for arbitrary logic functions while maintaining high performance, are gaining popularity as accelerators for such complex deep learning applications (8–10). The reconfigurability of FPGA is enabled by its unique architecture, illustrated in Fig. 1A, which consists of a sea of configuration logic blocks (CLBs), connection blocks (CBs), switch boxes (SBs), configuration memory, and I/O blocks (11). In particular, CLBs are the main components that can be programmed to perform different logic operations, while CBs and SBs are controlled by configuration bits loaded from the configuration memory. A variety of routing networks can be realized by loading different configuration bits. Together, FPGA's reconfigurability, flexibility, high performance, and fast time-to-market make it a promising choice for DNN accelerators. The basic structure of FPGA and the mechanisms of its primitives are shown in fig. S1.
Fig. 1. Overview of the proposed context-switching FPGA and potential applications.
Architectures of (A) a conventional static random-access memory (SRAM)–based FPGA, (B) an SRAM-based FPGA supporting partial reconfiguration, and (C) our proposed ferroelectric field-effect transistor (FeFET)–based context-switching FPGA supporting dynamic reconfiguration. (D) An example of a deep learning network: a two-stage super-sub network (12), in which the superclass "Dog" is identified first and then the subclass "husky" is identified. (E) Conventional FPGA incurs either area overhead or significant reconfiguration latency; this panel shows the two main approaches to implementing the super-sub network in a conventional FPGA. (F) Our approach provides fast reconfiguration and a compact solution. LUT, look-up table.
As a concrete and highly important example of DNN acceleration on FPGA, a two-stage super-sub network is adopted for image classification (12). In this model, a superclass is first inferred using a generalist superclass-level network, and the network output is then passed to a specialized network for final subclass-level classification. The overall classification accuracy has been shown to increase over that of common inference methods when evaluating the "Superclassing ImageNet dataset", which is a subset of ImageNet (13) and consists of 10 superclasses, each containing 7 to 116 related subclasses (e.g., 52 bird types and 116 dog types) (12). Figure 1D shows one specific example of this framework. In the first stage, the superclass "Dog" is identified by the generalist superclass network. Then, fine-grained inference with the subclass network is performed in the second stage, which outputs the final result "husky" for the target image.
Numerous hardware accelerators have been proposed to implement DNNs, such as customized application-specific integrated circuits (14–16), application-driven optimization on graphics processing units (17, 18), and FPGAs (19, 20). Among these various types of DNN accelerators, FPGA, which provides more flexibility while maintaining high performance, is particularly suitable for implementing accelerators for models such as the super-sub network. Figure 1E shows two main approaches to implementing this super-sub network on an FPGA. One distinguishing feature of the implementation is that multiple FPGA configurations are required to map the superclass and subclass networks, respectively. The straightforward approach is to use more than one chip to process the different networks (i.e., configurations). As shown in Fig. 1E, chip 1 is configured to process the general inference task for superclasses, whose outputs are then sent to chip 2, which is configured with the subclass networks to identify the specific subclass. This approach, although fast, incurs penalties in chip area and cost. Another, more compact and cost-efficient approach is to leverage the reconfiguration capability of FPGA by simply reconfiguring chip 1 to the subclass network after it finishes executing the superclass network. In this way, contexts, i.e., FPGA configurations, can be swapped in or out of the FPGA on demand without the need for additional chips (21). This approach saves area but comes with a penalty in reconfiguration latency. In short, although FPGA offers an attractive platform for accelerating the super-sub network model (Fig. 1E), an ideal implementation with high area efficiency and low latency is still elusive with current FPGA technologies and architectures.
Many prior works have explored design options to address these issues at different granularities of reconfiguration and from different application angles; however, all of them remain limited by the same dilemma or incur other overheads. For example, a full context-switching FPGA was first proposed as a time-multiplexed FPGA based on the Xilinx XC4000E in 1997 (22), where eight configurations of the FPGA are stored in on-chip memory and contexts can be switched in a single cycle. With preloaded contexts, reconfiguration is not needed, but this comes with a large area penalty: the more configurations to be supported, the more area overhead is needed to store them. To save area while still speeding up the reconfiguration process, dynamic partial reconfiguration has emerged as another way to support multiple configurations, by which only a portion of the hardware (called the reconfigurable region) is reconfigured while the remainder stays static (23). Partial reconfiguration brings several advantages over conventional context-switching FPGA (24), including shorter reconfiguration time than full-region reconfiguration and a smaller area owing to its increased logic density. However, partial reconfiguration only provides a compromise between area cost and reconfiguration latency and cannot fundamentally solve the problem. Finally, it is possible to support fine-grained reconfiguration at the bit level, as demonstrated by consecutive works on the "NATURE" FPGA architecture for fine-grained temporal logic folding (25, 26), which is based either on a combination of complementary metal-oxide semiconductor (CMOS) circuits [e.g., logic and static random-access memory (SRAM)] and carbon nanotube random-access memory (NRAM) (25) or entirely on CMOS circuits (26). In the former work, NRAM and SRAM work together to support dynamic reconfiguration for temporal logic folding, i.e., realizing different logic functions in the same logic elements through dynamic reconfiguration every few cycles, thereby significantly increasing the logic density. In the latter work, the dynamic reconfiguration delay is hidden behind the computation delay through the use of shadow SRAM cells (i.e., two SRAM copies). However, both works suffer from high area costs, caused mainly by the extra NRAM cells and 10T SRAM cells, respectively. Therefore, to date, a context-switching FPGA that breaks the trade-off between area cost and reconfiguration latency remains elusive, and the goal of this work is to bridge that gap.
To mitigate the aforementioned issues in terms of area, latency, and power, we propose a dynamic context-switching FPGA architecture based on ferroelectric field-effect transistors (FeFETs) that can implement DNN accelerators more efficiently. With joint innovations at the technology, circuit, and architecture levels, our proposed design has several advantages over prior context-switching works. First, from a technology perspective, FeFET is unique in that it behaves both as a transistor switch and as a nonvolatile memory cell, so FPGA basic logic circuits [e.g., look-up tables (LUTs)] and routing elements (e.g., CBs and SBs) can be implemented compactly. Moreover, these FPGA basic elements have no leakage power dissipation because of the nonvolatility of FeFET, which greatly reduces the total power consumption of the entire FPGA. Second, from the circuit perspective, a CB composed of two parallel branches is proposed, which stores two configurations while still consuming much less area than a single-configuration SRAM-based CB. Third, the proposed FPGA is dynamically reconfigurable, with the capability to load one configuration without interrupting the execution of another. As a result, the reconfiguration time can be completely hidden as long as it is smaller than the computation time of the currently active configuration. Therefore, our proposed solution can achieve dynamic context switching with zero penalty in reconfiguration latency and significant area reduction compared to SRAM-based designs, breaking the trade-off between area cost and reconfiguration latency that exists in conventional CMOS implementations.
With the proposed context-switching FPGA, the aforementioned super-sub network can be implemented efficiently, as shown in Fig. 1F. Consider a case in which we are interested in accurately classifying one specific superclass (e.g., Dog); the proposed design fits this scenario well and reduces the reconfiguration latency. Specifically, the two configurations, the superclass network and the subclass network, can be preloaded into the FPGA. First, general inference with the superclass network is performed. If the output of the general inference is "Dog", the configuration corresponding to Dog's subclass network is activated and executed for further inference. In this way, the switching time is far shorter than a full reconfiguration, or even negligible, which leads to almost zero latency overhead. In addition, the total area cost can also be heavily reduced by leveraging dense FeFETs. Note that the proposed context-switching FPGA enables applications in various domains that need switching between different contexts, beyond the super-sub network discussed here. The reconfiguration functionality is especially helpful in dynamic adaptation applications such as changing communication encoders or decoders on demand to match the appropriate protocols (27), changing data rates to vary bandwidths (28), and scaling the computation based on available energy (29). Moreover, with no limitation on the number of configurations, our design can also be scaled to implement multiple configurations depending on application demands. Some potential applications are illustrated in fig. S3.
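To make this control flow concrete, the following minimal Python sketch illustrates how dual-configuration context switching could orchestrate the super-sub inference described above. The Fpga class, its methods, and the placeholder networks are purely illustrative assumptions and do not correspond to any actual hardware or vendor API.

```python
# Minimal sketch of super-sub inference on a dual-configuration, context-switching FPGA.
# All names (Fpga, load_config, switch_to, run) are hypothetical placeholders; no real
# hardware interface is implied.

class Fpga:
    """Toy model of a context-switching FPGA with two configuration slots."""
    def __init__(self):
        self.slots = {}      # slot index -> network (modeled as a callable)
        self.active = None

    def load_config(self, slot, network):
        # In the proposed design, loading one slot does not interrupt the other
        # (dynamic reconfiguration); here it is simply a dictionary assignment.
        self.slots[slot] = network

    def switch_to(self, slot):
        # Context switch between preloaded configurations: sub-nanosecond in hardware.
        self.active = slot

    def run(self, image):
        return self.slots[self.active](image)


def superclass_net(image):       # placeholder generalist (superclass) network
    return "Dog"

def dog_subclass_net(image):     # placeholder specialist (Dog subclass) network
    return "husky"


fpga = Fpga()
fpga.load_config(0, superclass_net)    # configuration 1: superclass network
fpga.load_config(1, dog_subclass_net)  # configuration 2: Dog subclass network

fpga.switch_to(0)
if fpga.run("target_image") == "Dog":  # stage 1: general inference
    fpga.switch_to(1)                  # near-zero-latency context switch
    print(fpga.run("target_image"))    # stage 2: fine-grained inference -> "husky"
```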
RESULTS
Overview of the proposed FPGA architecture
For a deeper look into the design of the proposed context-switching FPGA, details of the architecture and the components that support multiple configurations are shown in Fig. 2. Figure 2A shows the primitive components of the proposed context-switching FPGA supporting dual configurations, including CLBs, CBs, and SBs. Each component is controlled by configuration information stored in the configuration memory. By loading the configuration bits, the logic (LUT) and routing elements (CB/SB) can be connected to form a functional circuit that performs the desired computation. In the proposed context-switching FPGA, there are two local copies of each LUT, CB, and SB, corresponding to two configurations. In this way, when one configuration is active for computation, the other configuration can be loaded without interrupting the execution, thereby significantly reducing the reconfiguration latency. In contrast, conventional context-switching FPGAs either require extra hardware resources to hold multiple configurations on chip or require long serial reconfiguration times. To support run-time reconfiguration and reduce the area cost incurred by the extra copy of the FPGA primitive components, FeFET technology, with its programmability, nonvolatility, and compactness, is chosen in this work to implement the basic programmable FPGA components such as LUTs, CBs, and SBs.
Fig. 2. The proposed dynamic context-switching FPGA architecture.
(A) Primitive FPGA components with dual-configuration support. (B) Existing memory technology–based single-configuration switch implementations. (C) Proposed FeFET-based switches. In the multi-configuration switch, dynamic reconfiguration is achieved by turning the pass transistors on/off to select the active branch or the branch being reconfigured. (D) Proposed FeFET-based LUT for dual configuration. It consists of two single-configuration LUTs and one extra multiplexer that selects the proper configuration when needed.
In recent years, FPGA switches, the basic elements of the routing structures (CBs and SBs), have been realized with various embedded memory technologies. Figure 2B presents existing mainstream memory technology–based single-configuration switches, including SRAM, spin transfer torque magnetic RAM (STT-MRAM), flash memory, resistive RAM (ReRAM), phase change memory (PCM), and FeFET. Because of its logic compatibility, superior write and read performance, and excellent reliability, SRAM is the most straightforward choice, combining an SRAM cell with an n-type pass transistor. However, SRAM-based switches suffer from two crucial overheads: low area density due to the complex cell structure and high leakage power, which accounts for 60 to 70% of total FPGA power dissipation due to long routing tracks (30–33). Recently, emerging embedded nonvolatile memory technologies have been actively investigated as promising alternatives to SRAM because of their density, energy, and performance advantages. However, each comes with its own challenges. For example, a flash memory–based switch is nonvolatile and compact (34), but memory programming is slow (on the order of milliseconds) and requires a high programming voltage (∼10 V). Two-terminal resistive memories, including ReRAM, PCM, and STT-MRAM, are nonvolatile and dense but usually require a large conduction current to program the devices, consuming significant write power. In addition, the limited ON/OFF resistance ratio (∼100 for ReRAM/PCM and ∼5 for STT-MRAM) usually requires additional circuitry, such as the 1T2R structure for ReRAM/PCM (35, 36) and an even more complex supporting structure for STT-MRAM (37), to realize a single switch.
In this regard, we propose an FPGA architecture that adopts FeFETs to implement the logic and routing elements. Ever since the discovery of ferroelectricity in doped HfO2, significant progress has been made in the integration of HfO2-based FeFET because of its nonvolatility, high density, large ON/OFF ratio, and excellent CMOS compatibility (38, 39). In addition, switching of the ferroelectric polarization is induced by an applied electric field rather than a large conduction current, making FeFET a highly energy-efficient nonvolatile memory (40). Because the ferroelectric film is integrated into the gate stack of a FeFET, setting its polarization to point toward the channel/metal gate programs the FeFET threshold voltage (VTH) to the low-VTH/high-VTH state, respectively, thus realizing a compact nonvolatile routing element (41). Leveraging this technology, a mixed FeFET/CMOS switch unit (i.e., 1T-1FeFET) has been proposed as a routing element in FPGA (42), which takes advantage of but does not fully exploit FeFET. In this work, leveraging the intrinsic nonvolatile switch structure of FeFET, we propose a 1FeFET routing switch for single-configuration FPGA and a 2T-2FeFET routing switch for dynamically reconfigurable context-switching FPGA, as shown in Fig. 2C, which achieve optimal area efficiency. The critical design difference of our FeFET switch compared to the flash switch and the prior FeFET switch (42), despite their similarities in device structure, is that our switch is composed of only one FeFET, which significantly improves the integration density. The flash switch requires a pair of n-type and p-type flash devices controlling one normal n-type metal-oxide semiconductor (NMOS) pass transistor. By applying proper biases on the word line (WL) and bit line (BL), only one of the flash devices conducts to turn the pass transistor ON or OFF. The reason it cannot be replaced with a single flash transistor might be the relatively poor pass-gate performance caused by its thick gate stack. Compared to flash devices, FeFET shows great scalability and compatibility with Si CMOS, making a single FeFET feasible as a pass transistor. Moreover, FeFET allows lower operating voltages for both writes and reads. Besides, the 1T-1FeFET switch (42) needs an access transistor in addition to the FeFET to coordinate operation and programming. In our FeFET switch design, in contrast, we leverage a programming mechanism that writes through the gate and body terminals together with program-disturb inhibition schemes (43). In this way, our design eliminates the access transistor and achieves a lower area cost. For the context-switching FPGA, a serial CMOS transistor is added to each branch, which is used to cut off the branch that is loading a new configuration to minimize the disturbance to the other, active branch. Figure 2D shows our proposed LUT array circuit for dual configuration. A compact LUT cell can be efficiently implemented using a single FeFET such that the high-VTH/low-VTH states (HVT/LVT) of the FeFET store bit "1"/"0" for the LUT cell, respectively. Besides, as shown in Fig. 2D, the proposed LUT supports dynamic reconfiguration: while the branch of configuration 1 is operating, the branch of configuration 2 can load a new configuration.
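As a purely behavioral illustration of the 2T-2FeFET switch just described (not a circuit-level model), the sketch below captures its routing rule: the track conducts only through the branch whose series transistor is enabled and whose FeFET stores the low-VTH state, while the disabled branch can be rewritten without disturbing the active one. All class and method names are assumptions made for illustration.

```python
from dataclasses import dataclass

# Behavioral sketch of the 2T-2FeFET multi-configuration switch (Fig. 2C).
# "LVT" (low-VTH) means the FeFET conducts under the read gate bias; "HVT" means it is cut off.
# The series CMOS transistor (en) isolates a branch while it is being reprogrammed.

@dataclass
class Branch:
    vth_state: str = "HVT"   # "LVT" or "HVT"
    en: bool = False         # series pass transistor on/off

class MultiConfigSwitch:
    def __init__(self):
        self.branches = [Branch(), Branch()]

    def program(self, idx, vth_state):
        # A branch may only be rewritten while its series transistor is off, so the
        # other (active) branch keeps routing signals undisturbed.
        assert not self.branches[idx].en, "disable the branch before programming"
        self.branches[idx].vth_state = vth_state

    def activate(self, idx):
        for i, b in enumerate(self.branches):
            b.en = (i == idx)

    def conducts(self):
        # The routing track is connected iff the active branch stores a "1" (LVT).
        return any(b.en and b.vth_state == "LVT" for b in self.branches)

sw = MultiConfigSwitch()
sw.program(0, "LVT")      # configuration 1: switch closed
sw.activate(0)            # execute configuration 1
sw.program(1, "HVT")      # load configuration 2 in the background
print(sw.conducts())      # True while configuration 1 is active
sw.activate(1)            # near-instant context switch
print(sw.conducts())      # False: configuration 2 stores a "0"
```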
Block design and functional verification
In this section, experimental verification of the proposed LUT and routing elements (CB/SB) for the context-switching FPGA is performed. For the experimental demonstration, FeFET devices integrated in a 28-nm high-κ metal gate (HKMG) technology are tested. Figure 3 (A and B) shows the transmission electron microscopy (TEM) image and the schematic cross section of the device, respectively. The device features an 8-nm-thick doped HfO2 layer as the ferroelectric and around 1-nm SiO2 as the interlayer in the gate stack. The FeFET memory performance is characterized by standard pulsed ID−VG measurements after applying ±4 V, 1-μs write pulses on the gate. Figure 3C shows a memory window of about 1.2 V, i.e., the VTH separation between the low-VTH and high-VTH states, which enables a large ON/OFF conductance ratio. It also exhibits well-controlled cycle-to-cycle variation. Figure 3D shows the switching dynamics of the FeFET under different pulse amplitudes and pulse widths, which reveal a trade-off between write speed and pulse amplitude and show that the FeFET can be programmed in less than 10 ns with a 4-V write amplitude. The switching follows the classic nucleation-limited switching model for thin-film polycrystalline HfO2 (38, 44), in which domain switching is mainly limited by the nucleation process and the nucleation time depends exponentially on the applied electric field. These results suggest that HfO2-based FeFET exhibits high performance, showing great promise for many applications including the context-switching FPGA in this work. The endurance and retention characteristics of the FeFETs are measured in fig. S2.
Fig. 3. Experimental verification of the LUT cell operation.
(A and B) TEM image and schematic cross section. (C and D) ID−VG characteristics of the FeFET measured after ±4 V, 1-μs write pulses and the switching dynamics of the FeFET under different pulse amplitudes and pulse widths. (E and F) Operation of the LUT cell storing bit "0"/"1" by exploiting the dynamic programming capability of the high-VTH/low-VTH (HVT/LVT) states. (G) Proposed k-bit LUT. (H) The experimental setup for functional verification of the LUT cell operation. (I) Experimental waveforms of the proposed LUT cells in (E) and (F). (J) The circuitry of a LUT array for multiple configurations.
Figure 3 (E and F) shows the operation principle of our proposed LUT cells that store bits 1 and 0, respectively. Each cell consists of a single FeFET and one p-channel metal-oxide semiconductor (PMOS) transistor, where the PMOS is shared among all the cells and is part of the sense amplifier used to convert the read current to logic voltage levels. Bits 1 and 0 are stored by programming the FeFET into the high-VTH and low-VTH states, respectively. In the LUT read mode, the stored bit can then be read by applying the appropriate read voltage, VREAD, to the gate terminal of the FeFET, as shown in Fig. 3E. Because of the large ON/OFF resistance ratio of the FeFET at VREAD, the output voltage is close to VDD for bit 1 and close to ground for bit 0. This is achieved by choosing an appropriate PMOS gate bias (VB) such that the PMOS resistance lies between the FeFET resistances in the high-VTH and low-VTH states, thereby driving the output voltage rail-to-rail. Figure 3G shows the main structure of the single-configuration LUT integrated with 2^k FeFET-based bit cells (cell 0/cell 1); different logic functions can be achieved by applying different combinations of select signals. In this structure, a sense amplifier composed of one pull-up PMOS transistor and two inverters converts the FeFET read current to a voltage and amplifies the output to full swing. The LUT cell operation is then verified experimentally using the setup shown in Fig. 3H, which includes the major components in Fig. 3G. The operation waveforms are presented in Fig. 3I, which shows the write and read phases of the LUT cell. After programming the FeFET into the high-VTH/low-VTH states using −4 V/+4 V, 1-μs write pulses, the output voltage shows a logic high and low, respectively. This verifies successful cell operation, although, because of the discrete experimental setup, the performance is limited by parasitics. To predict the performance of a fully integrated FeFET LUT, SPICE simulations using a calibrated FeFET model (45) and the 45-nm predictive technology model (PTM) for logic transistors (46) are performed. Figure S4 shows the simulated waveform, indicating that for a six-input LUT, the read delay is 124.3 ps and the read power is 13.1 μW. In the subsequent section, FeFET-based primitive components, including LUTs, CBs, and SBs, are also compared with implementations in other technologies using consistent SPICE simulations (Fig. 5).
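The rail-to-rail read behavior of the LUT cell can be approximated with a simple resistive-divider estimate: the pull-up PMOS (resistance set by VB) and the FeFET form a divider, and the large FeFET ON/OFF ratio pushes the output toward VDD or ground. The resistance and supply values in the sketch below are assumed for illustration only and are not measured device parameters.

```python
# Illustrative resistive-divider estimate of the LUT cell read output (Fig. 3, E and F).
# The pull-up PMOS (biased at VB) sits between VDD and the output node; the FeFET sits
# between the output node and ground. All values are assumptions for illustration.

VDD = 0.7            # assumed supply voltage (V)
R_PMOS = 1e6         # assumed pull-up resistance set by VB, chosen between the two FeFET states (ohms)
R_FEFET = {
    "bit 1 (high-VTH, FeFET off)": 1e9,   # assumed OFF resistance (ohms)
    "bit 0 (low-VTH, FeFET on)":   1e4,   # assumed ON resistance (ohms)
}

for state, r_fefet in R_FEFET.items():
    v_out = VDD * r_fefet / (r_fefet + R_PMOS)   # voltage divider
    print(f"{state}: Vout = {v_out:.2f} V")

# bit 1 -> Vout close to VDD (logic high); bit 0 -> Vout close to 0 V (logic low).
# The inverters in the sense amplifier then restore these levels to full swing.
```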
Fig. 5. Area comparison and simulation results.
(A) Area impact of FeFET LUT cells (storage) and CBs over SRAM-based structures. (B) Delay and power comparison of main components of FPGA based on different memory technologies. (C) Critical path delay of different memory technology–based FPGA designs.
To support dynamic reconfiguration, two LUTs forming an array are designed, and an additional multiplexer is used to select which configuration is active in the current operating period, as shown in Fig. 3J. Programming of a bulk planar single-FeFET array has been extensively investigated (43, 47). The applicable programming schemes depend on the number of terminals accessible during memory writing. In the proposed FPGA architecture, the source/drain terminals are not simultaneously accessible from outside, which precludes write schemes that require applying source/drain voltages. In this case, a convenient solution is shown in Fig. 3J, where the gate and the body terminals are used for programming. The WL is shared among all FeFETs in a configuration block, and the body is shared across different configuration blocks. Two-step programming is then performed: first, all the FeFETs in a configuration are set to the low-VTH state by applying a positive write voltage (i.e., VW) on the WL and keeping all the other terminals grounded. Then, the FeFETs that need to be in the high-VTH state are subjected to a negative gate-to-body voltage (i.e., −VW). To avoid write disturb of the low-VTH-state FeFETs during the second step, the standard inhibition bias scheme (e.g., VW/2) can be applied, which is verified in fig. S5.
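The two-step gate/body programming flow can be summarized as a bias assignment, sketched below. The specific V/2 split (selected WL at −VW, selected body at 0, all unselected terminals at −VW/2) is one standard inhibition assignment chosen for illustration; the text states only that a VW/2-type scheme is used (fig. S5), so the exact values are an assumption.

```python
# Sketch of the two-step gate/body programming flow for the FeFET LUT array (Fig. 3J).
# Step 1: apply +VW to the shared WL of the block, ground everything else -> all cells low-VTH.
# Step 2: apply an effective -VW gate-to-body bias only to cells that must become high-VTH,
#         using a VW/2 inhibition scheme so no other cell sees more than |VW/2|.
# The bias split below is one standard V/2 pattern and is an assumption, not the verified scheme.

VW = 4.0  # write amplitude in volts (the paper uses +/-4 V, 1-us pulses)

def step2_gate_to_body(wl_selected: bool, body_selected: bool) -> float:
    """Gate-to-body voltage a cell sees during step 2 of programming."""
    v_wl   = -VW  if wl_selected   else -VW / 2   # selected block vs. other blocks
    v_body = 0.0  if body_selected else -VW / 2   # columns to set high-VTH vs. columns kept low-VTH
    return v_wl - v_body

for wl_sel in (True, False):
    for body_sel in (True, False):
        v = step2_gate_to_body(wl_sel, body_sel)
        target = "write to high-VTH" if (wl_sel and body_sel) else "inhibited"
        print(f"WL selected={wl_sel}, body selected={body_sel}: V_GB = {v:+.1f} V -> {target}")
        # Inhibited cells never see more than VW/2, which stays below the switching threshold.
        assert (wl_sel and body_sel) or abs(v) <= VW / 2
```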
Next, the functionality of the routing elements is verified, as shown in Fig. 4. Using the CB as an example, Fig. 4A shows the array structure, where the BL and source line (SL) route the actual signal and the WL and the column-wise body contact are used to program the FeFETs. As introduced in Fig. 2C, to support run-time reconfiguration of one branch without interrupting normal operation of the other branch, a serial transistor is added to each branch and is off/on during configuration loading/execution, respectively. The swap between configurations can be conducted easily and swiftly by applying the corresponding read gate biases, as shown in Fig. 4B, such that when one configuration is deactivated, its FeFET is cut off irrespective of its state. Figure 4C shows an example waveform applied to a testing unit (Fig. 4D), where branch 1 is first configured to the low-VTH state while branch 2 is executing, and then branch 1 is activated while branch 2 is configured to the high-VTH state using the two-step programming. Figure 4E shows the experimental results obtained by applying the voltage sequence shown in Fig. 4C for three repeated cycles. The zoomed-in programming waveforms for branches 1 and 2 are shown in Fig. 4 (F and G, respectively). Because branch 1/branch 2 is in the low-VTH/high-VTH state, respectively, in this testing scenario, the output signal switches between 0.7 V (i.e., when branch 1 is active) and 0 V (i.e., when branch 2 is active). The experimental results therefore confirm successful operation. Figures S6 to S8 show experimental results for the other three configuration combinations of the two branches, which further verify successful run-time reconfiguration. Similar to the LUT cell case, SPICE simulations are conducted to predict the speed of a fully integrated CB; the simulated transient waveform of the proposed multi-configuration CB is shown in fig. S9.
Fig. 4. Experimental verification of the multi-configuration CB operation.
(A) The structure of one 2 × 2 CB array. (B) By applying different read gate voltages, the swap between configurations can be achieved. (C) An example waveform applied to set branch 1/branch 2 to the low-VTH/high-VTH state, respectively, without interrupting normal operation. (D) The circuitry of one CB test unit. (E) The experimental transient waveforms of run-time context configuration and switching repeated for three cycles. (F and G) The zoomed-in programming waveforms for branch 1/branch 2 in the tests, respectively. The waveforms are zoomed in because of the small write pulse width.
Evaluation and application case study
To evaluate the feasibility and performance of the proposed FeFET-based context-switching FPGA architecture, simulations are performed and a comprehensive comparison with other relevant works based on different memory technologies is presented in terms of area, delay, and power consumption. Moreover, at the system level, the capability of the proposed architecture to achieve dynamic reconfiguration is demonstrated, and the evaluation results show that the design offers a significant power reduction and area efficiency improvement with a slightly increased critical path delay as the trade-off. To estimate the area of the FeFET-based CB and LUT cells and compare them with other works, the layouts are drawn and the area is calculated using the design rules of the GPDK 45-nm library in fig. S10. All relevant area numbers are shown in Fig. 5A. Our layout analysis shows that the proposed CB and LUT cells are more compact than SRAM-based CBs and LUT cells. For example, the proposed FeFET-based single-configuration CB and LUT cells occupy only 12.6 and 18.5% of the area of their respective SRAM-based counterparts, whereas the prior FeFET-based CB and LUT cells (42) require 77.0 and 97.0% of that area, respectively. Even the proposed multi-configuration FeFET CB and LUT cell areas are only 25.3 and 37.0% of those of the SRAM-based single-configuration design. Therefore, the proposed design shows a significant area reduction compared to both the SRAM-based design and the previous FeFET-based design (42).
Figure 5B summarizes the basic structures of six-input LUTs/CBs/SBs based on existing memory technologies (SRAM, STT-MRAM, RRAM, and FeFET) and compares their corresponding read delay and read power consumption. All circuits are simulated with HSPICE. The 45-nm PTM (46) is adopted for all MOSFETs in this work, and a calibrated FeFET model (45) is used for the proposed design. For resistive memories, the corresponding low-resistance and high-resistance levels are used for simulation (48, 49). According to the simulation results (Fig. 5B), for a six-input LUT, the proposed single-configuration LUT shows the smallest read power consumption, 13.1 μW; for multiple configurations, this number increases slightly but is still less than the power consumed by the magnetic tunnel junction (MTJ)-based single-configuration LUT. This is because the large ON/OFF ratio of FeFET obviates the need for a high read current to differentiate its two states, unlike MTJ designs. As for the read delay, the RRAM-based single-configuration LUT has the longest latency. The proposed FeFET-based single-configuration LUT shows the second-best latency among all the considered nonvolatile LUTs. Moreover, the delay of the proposed FeFET-based multi-configuration LUT is less than that of the RRAM-based single-configuration LUT even though it includes one extra multiplexer for selecting configurations. The switching current through the sense amplifier is larger for FeFET than for RRAM owing to its higher ON/OFF ratio (lower Ron), resulting in a smaller LUT delay than RRAM. For CBs, our 1FeFET single-configuration CB and 2T-2FeFET multi-configuration CB consume ∼95%/∼85% less power than the SRAM-based CB during operation. For SBs, both the FeFET-based single-configuration SB and the multi-configuration SB show much lower power consumption than the other designs because our circuits contain fewer transistors. However, the delay of the 1FeFET CB is around two times that of an SRAM-based CB, and the delay of the FeFET-based SB is the worst among the different memory technology–based designs. This is because the transmission speed of a FeFET is not as high as that of a conventional MOSFET, and this penalty affects the SB even more than the CB. In conclusion, the proposed FeFET-based designs (CB/SB) show significant power advantages over SRAM/STT-MRAM/RRAM–based designs but with a slight penalty in delay. Note that the penalty in the routing elements' (CB/SB) delay does not necessarily affect the overall system, as the routing delay may be a small portion of the overall system delay, which is investigated below (Fig. 5C).
To investigate the impact of the primitive (i.e., LUT/SB/CB) delays on the latency of the whole FPGA, the critical path delay is studied with the Verilog-to-Routing (VTR) tool (50), a popular open-source CAD tool for FPGA architecture development and evaluation. For a fair comparison, all the SRAM-/RRAM-/STT-MRAM-/FeFET–based FPGAs use the same well-optimized commercial FPGA architecture in 45-nm technology available in VTR. To obtain the critical path delay of the different memory technology–based FPGAs, seven circuit benchmarks (stereovision0, blob_merger, sha, spree, boundtop, diffeq2, and or1200) included in VTR are run (50, 51). These represent popular applications in diverse domains, such as image processing, math, cryptography, and computer vision. Figure 5C compares the critical path delays measured for the SRAM-/RRAM-/STT-MRAM-/FeFET–based FPGAs. Compared with the SRAM-based FPGA, the FeFET-based single-configuration FPGA presents an 8.6% reduction in the critical path delay on average and is also better than the RRAM-based architecture. However, the proposed FeFET-based multi-configuration FPGA shows a 9.6% increase in the critical path delay compared to the SRAM-based FPGA. The simulation confirms that the delay of the LUTs dominates the overall delay of the entire FPGA, which explains the aforementioned performance of these FPGAs. More details on the simulation are given in fig. S11.
In addition, to show the feasibility of implementing the whole design in deep learning applications, three case studies under different scenarios are investigated. The first case shows the benefit provided by dynamic reconfiguration in image classification. In the evaluation, two approaches to inference are considered: static inference and dynamic inference. For static inference, the input image is classified by the generalist classifier. For dynamic inference, the input image is first classified by the superclass classifier to identify the superclass. If the superclass is supported by a specialist subclass classifier network, the configuration of the subclass classifier is switched in and executed for enhanced accuracy; otherwise, a generalist classifier is invoked to complete the subclass identification. The whole workflow is shown in Fig. 6A. Figure 6B shows that dynamic inference for superclass classification improves the accuracy by up to 3.0% over static inference, and only a context-switching FPGA can realize dynamic inference efficiently. In the last two cases, the feasibility and advantages of the proposed design over the conventional FPGA design are evaluated in terms of timing for various application scenarios. Three neural networks (ResNet50, CNV, and MobileNetv1) are deployed onto the FPGA through the Xilinx Vitis AI platform (52). The second case study considers a scenario that requires frequent switching between two neural networks (Fig. 6C). A conventional FPGA must load a new configuration before switching contexts, which is time-consuming. In contrast, our context-switching design can preload two configurations and then switch freely between them without reconfiguration latency. The switching time of the proposed design is less than 1 ns, which is much shorter than the reconfiguration time, so the proposed design shows significant speedup (from 39.0 to 97.5%; Fig. 6D). The last case study concerns dynamic reconfiguration. It is assumed that three different neural networks must be implemented and switched between, so there are six situations corresponding to the six orderings of these three networks (ResNet50→CNV→MobileNetv1, ResNet50→MobileNetv1→CNV, CNV→ResNet50→MobileNetv1, CNV→MobileNetv1→ResNet50, MobileNetv1→ResNet50→CNV, and MobileNetv1→CNV→ResNet50). Latency is one of the most critical criteria when evaluating a neural network accelerator. Hence, for all six situations, the total time consumed, including both the execution time and the reconfiguration time of each network, is compared under two conditions: a conventional FPGA and the proposed architecture with dynamic reconfiguration. As shown in Fig. 6E, because dynamic reconfiguration allows the architecture to operate and reconfigure simultaneously, part of or even the complete reconfiguration time of the following network can be overlapped with and hidden by the execution time of the current network, which reduces the total latency. As shown in Fig. 6F, the proposed design with dynamic reconfiguration offers time savings in all six situations, varying from 2.4 to 37.4%. Note that the maximum time saving in the ideal case would be 50%, which occurs when the execution time of the first network equals the reconfiguration time of the second network; the maximum improvement of the proposed design (37.4%) is close to this bound. In addition, the proposed FPGA architecture can be adapted to implement more deep learning frameworks, and the relevant improvements and benefits are investigated in fig. S12. Overall, the case studies demonstrate that the proposed FeFET-based context-switching FPGA design adapts well to various types of deep learning applications.
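The latency accounting behind the third case study (Fig. 6, E and F) can be written down directly: with dynamic reconfiguration, loading the next network overlaps with executing the current one, so only the portion of the reconfiguration time that exceeds the current execution time remains on the critical path. The execution and reconfiguration times in the sketch below are made-up placeholders, not measured values.

```python
# Latency model for sequential execution of several networks (third case study).
# exec_t[i] = execution time of network i; reconf_t[i] = time to load its configuration.
# All numbers below are made-up placeholders for illustration, not measured values.

def conventional_latency(exec_t, reconf_t):
    # Conventional FPGA: each network is loaded, then executed, strictly in series.
    return sum(r + e for r, e in zip(reconf_t, exec_t))

def dynamic_reconfig_latency(exec_t, reconf_t):
    # Proposed FPGA: the first load is exposed; every later load overlaps with the
    # execution of the previous network, so only the part exceeding that execution
    # time remains on the critical path.
    total = reconf_t[0]
    for i, e in enumerate(exec_t):
        total += e
        if i + 1 < len(reconf_t):
            total += max(0.0, reconf_t[i + 1] - e)   # non-hidden remainder
    return total

exec_t   = [30.0, 12.0, 20.0]   # ms, placeholder execution times (e.g., three DNNs)
reconf_t = [25.0, 18.0, 25.0]   # ms, placeholder reconfiguration times

t_conv = conventional_latency(exec_t, reconf_t)
t_dyn  = dynamic_reconfig_latency(exec_t, reconf_t)
print(f"conventional: {t_conv:.1f} ms, dynamic: {t_dyn:.1f} ms, "
      f"saving: {100 * (1 - t_dyn / t_conv):.1f}%")
# When every reconfiguration is exactly hidden by the previous execution and the
# execution and reconfiguration times are comparable, the saving approaches 50%.
```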
Fig. 6. Application case studies of the proposed multi-configuration FPGA for different application scenarios.
(A) Image classification workflow. (B) Dynamic inference for image classification improves accuracy. (C) Diagram of the experimental setup for the second case study: our design preloads two configurations in the FPGA and then switches between them as needed. (D) Compared to a conventional FPGA, the ability to switch between two configurations in our design yields significant time savings, varying from 39.0 to 97.5% in our case (in an ideal case, the maximum time saving would be 100%). (E) Diagram of the experimental setup for the third case study: the proposed FPGA implements and runs three different neural networks using dynamic reconfiguration, which allows operating and reconfiguring simultaneously. (F) Switching between three neural networks with dynamic reconfiguration offers time savings varying from 2.4 to 37.4% compared to a traditional FPGA (in an ideal case, the time saving would be 50%).
DISCUSSION
In summary, we propose a FeFET-based context-switching FPGA architecture with the capability of dynamic reconfiguration, which mitigates the trade-off between chip area cost and reconfiguration latency in conventional FPGAs. In addition, we experimentally verify the functionality of the primitive blocks of the proposed FPGA. The simulation results reveal that, by leveraging FeFETs, the proposed FPGA primitives achieve substantial area and power reductions compared to the conventional SRAM-based design. Moreover, three representative application scenarios are investigated. The evaluation results show that the proposed context-switching FPGA supporting dynamic reconfiguration offers significant time savings in these application scenarios. Our design provides an efficient solution to bridge the gap and makes FPGA more competitive in accelerating complex deep learning applications.
MATERIALS AND METHODS
Device fabrication
Here, the fabricated FeFET features a polycrystalline Si/TiN/doped HfO2/SiO2/p-Si gate stack. The devices were fabricated using a 28-nm node gate-first HKMG CMOS process on 300-mm silicon wafers. Detailed information can be found in (45, 53). The ferroelectric gate stack process module starts with removal of the native oxide through a wet etch and growth of a thin SiO2-based interfacial layer through wet chemical oxidation, followed by deposition of the doped HfO2 film through atomic layer deposition. A TiN metal gate electrode was deposited using physical vapor deposition, on top of which the poly-Si gate electrode was deposited. The source and drain n+ regions were activated by rapid thermal annealing (RTA) at approximately 1000°C. This temperature is used because the source/drain dopant activation and the ferroelectric phase stabilization are performed in the same step in this gate-first process; lower temperatures can be used if a gate-last process is adopted, and with Hf1-xZrxO2, annealing at back-end-of-line–compatible temperatures (≤450°C) is even possible. The RTA step also results in the formation of the ferroelectric orthorhombic phase within the doped HfO2. After RTA, the HfO2 becomes polycrystalline, and multiple crystalline phases can coexist, including the monoclinic dielectric phase, the orthorhombic ferroelectric phase, and the tetragonal antiferroelectric phase. To further suppress device variation in the future, additional optimization toward phase-pure orthorhombic HfO2 is necessary. All electrically characterized devices have the same gate length and width of 0.5 μm by 0.5 μm.
Electrical characterization
The experimental verification was performed with a Keithley 4200-SCS Semiconductor Characterization System (Keithley system), a Tektronix TDS 2012B Two Channel Digital Storage Oscilloscope (oscilloscope), and a Keysight 81150A Pulse Function Arbitrary Generator (waveform generator). Two 4225-PMUs (pulse measurement units) were used to generate proper waveforms. The FeFETs used in experimental verification were connected with devices (inverters, p-type MOSFET, and/or n-type MOSFET) externally on a breadboard. In the experimental verification of the LUT cell operation, VDD was given by the waveform generator. Output pulses were captured by the oscilloscope. Write and read operations were provided by the Keithley system. In the experimental verification of the multi-configuration CB operation, input voltage was given by the waveform generator. Output pulses were captured by the oscilloscope. WL and Enable (EN) signals were generated by the Keithley system. Three repeated cycles were performed for each configuration. State initialization (+4 V or −4 V to both WL1 and WL2 with a pulse width of 1 μs) was added at the beginning of the waveforms to generate the desired output in the first cycle.
Acknowledgments
Funding: This work is mainly supported by the U.S. Department of Energy, Office of Science, Office of Basic Energy Sciences Energy Frontier Research Centers program under Award Number DESC0021118. The experimental characterization is partially supported by the Army Research Office under grant number W911NF-21-1-0341 (to K.N.). The architecture benchmarking is partially supported by ND EPSCoR (to S.G.) and partially by NSF 2132918 (to V.N.) and 2008365 (to V.N.). The device fabrication is funded by the German Bundesministerium für Wirtschaft (BMWI) and by the State of Saxony in the frame of the “Important Project of Common European Interest (IPCEI).”
Author contributions: V.N., S.G., and K.N. proposed and supervised the project. Y. Xu, Y.Xi., T.Y., and E.H. conducted circuit and architectural simulations. Z.Z. performed experimental verification. H.M., D.K., S.D., and S.B. fabricated the FeFET devices. X.G. helped with FeFET array programming. R.J. and X.H. helped with FPGA benchmarking. S.W., A.S.R., K.L., and L.I. conducted the benchmarking of FPGA implementation of super-sub networks. All authors contributed to the write-up of the manuscript.
Competing interests: A patent application was submitted for this work on 9 November 2023 with the names of Y. Xu, Z.Z., Y.Xi., V.N., K.N., and S.G. on it. It has been issued by the Office of Technology Management of Pennsylvania State University. A cover page patent was applied on 30 November 2023 for Penn State invention disclosures (2023-5715) to V.N. The other authors declare that they have no competing interests.
Data and materials availability: All data needed to evaluate the conclusions in the paper are present in the paper and/or the Supplementary Materials.
Supplementary Materials
This PDF file includes:
Figs. S1 to S12
References
REFERENCES AND NOTES
1. K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016, pp. 770–778.
2. G. Huang, Z. Liu, L. Van Der Maaten, K. Q. Weinberger, Densely connected convolutional networks, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21 July 2017, pp. 4700–4708.
3. Ren S., He K., Girshick R., Sun J., Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 39, 1137–1149 (2015).
4. J. Redmon, S. Divvala, R. Girshick, A. Farhadi, You only look once: Unified, real-time object detection, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016), pp. 779–788.
5. J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805 [cs.CL] (2018).
6. Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, Roberta: A robustly optimized bert pretraining approach. arXiv:1907.11692 [cs.CL] (2019).
7. Mehonic A., Kenyon A. J., Brain-inspired computing needs a master plan. Nature 604, 255–260 (2022).
8. Chang J.-W., Kang K.-W., Kang S.-J., An energy-efficient fpga-based deconvolutional neural networks accelerator for single image super-resolution. IEEE Trans. Circuits Syst. Video Technol. 30, 281–295 (2020).
9. Li J., Un K.-F., Yu W.-H., Mak P.-I., Martins R. P., An fpga-based energy-efficient reconfigurable convolutional neural network accelerator for object recognition applications. IEEE Trans. Circuits Syst. II Express Briefs 68, 3143–3147 (2021).
10. Guo K., Zeng S., Yu J., Wang Y., Yang H., A survey of fpga-based neural network inference accelerators. ACM Trans. Reconfigurable Technol. Syst. 12, 1–26 (2019).
11. S. D. Brown, R. J. Francis, J. Rose, Z. G. Vranesic, Field-Programmable Gate Arrays, vol. 180 (Springer Science & Business Media, 1992).
12. S. Wen, A. S. Rios, K. Lekkala, L. Itti, What can we learn from misclassified ImageNet images? arXiv:2201.08098 [cs.CV] (2022).
13. Russakovsky O., Deng J., Su H., Krause J., Satheesh S., Ma S., Huang Z., Karpathy A., Khosla A., Bernstein M., Berg A. C., Fei-Fei L., ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 115, 211–252 (2015).
14. N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, R. Boyle, P. Cantin, C. Chao, C. Clark, J. Coriell, M. Daley, M. Dau, J. Dean, B. Gelb, T. V. Ghaemmaghami, R. Gottipati, W. Gulland, R. Hagmann, C. Richard Ho, D. Hogberg, J. Hu, R. Hundt, D. Hurt, J. Ibarz, A. Jaffey, A. Jaworski, A. Kaplan, H. Khaitan, A. Koch, N. Kumar, S. Lacy, J. Laudon, J. Law, D. Le, C. Leary, Z. Liu, K. Lucke, A. Lundin, G. M. Kean, A. Maggiore, M. Mahony, K. Miller, R. Nagarajan, R. Narayanaswami, R. Ni, K. Nix, T. Norrie, M. Omernick, N. Penukonda, A. Phelps, J. Ross, M. Ross, A. Salek, E. Samadiani, C. Severn, G. Sizikov, M. Snelham, J. Souter, D. Steinberg, A. Swing, M. Tan, G. Thorson, B. Tian, H. Toma, E. Tuttle, V. Vasudevan, R. Walter, W. Wang, E. Wilcox, D. H. Yoon, In-datacenter performance analysis of a tensor processing unit. arXiv:1704.04760 [cs.AR] (2017).
15. Lee J., Kim C., Kang S., Shin D., Kim S., Yoo H.-J., Unpu: An energy-efficient deep neural network accelerator with fully variable weight bit precision. IEEE J. Solid-State Circuits 54, 173–185 (2019).
16. Machupalli R., Hossain M., Mandal M., Review of asic accelerators for deep neural network. Microprocess. Microsyst. 89, 104441 (2022).
17. D. Franklin, Nvidia jetson agx xavier delivers 32 teraops for new era of AI in robotics. https://developer.nvidia.com/blog/nvidia-jetson-agx-xavier-32-teraops-ai-robotics/.
18. Nvidia t4. www.nvidia.com/en-us/data-center/tesla-t4/.
19. J. Qiu, J. Wang, S. Yao, K. Guo, B. Li, E. Zhou, J. Yu, T. Tang, N. Xu, S. Song, Y. Wang, H. Yang, Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, February 21–23, 2016, pp. 26–35.
20. X. Zhang, A. Ramachandran, C. Zhuge, D. He, W. Zuo, Z. Cheng, K. Rupnow, D. Chen, 2017 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Irvine, CA, USA, November 13–17, 2017, pp. 894–901.
21. S. Scalera, J. Vazquez, Proceedings. IEEE Symposium on FPGAs for Custom Computing Machines (Cat. No.98TB100251), Napa Valley, CA, April 15–17, 1998, pp. 78–85.
22. S. Trimberger, D. Carberry, A. Johnson, J. Wong, Proceedings. The 5th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (Cat. No.97TB100186), Napa Valley, CA, April 16–18, 1997, pp. 22–28.
23. Vipin K., Fahmy S. A., Fpga dynamic and partial reconfiguration: A survey of architectures, methods, and applications. ACM Comput. Surv. 51, 1–39 (2018).
24. Babu P., Eswaran P., Reconfigurable fpga architectures: A survey and applications. J. Inst. Eng. India Ser. B 102, 143–156 (2021).
25. Zhang W., Jha N. K., Shang L., A hybrid nano/cmos dynamically reconfigurable system–part i: Architecture. J. Emerg. Technol. Comput. Syst. 5, 1–30 (2009).
26. Lin T.-J., Zhang W., Jha N. K., Sram-based nature: A dynamically reconfigurable fpga based on 10t low-power srams. IEEE Trans. Very Large Scale Integr. 20, 2151–2156 (2012).
27. W. Aubry, B. Le Gal, D. Negru, S. Desfarges, D. Dallet, Proceedings of the 2012 Conference on Design and Architectures for Signal and Image Processing, Karlsruhe, Germany, October 23–25, 2012, pp. 1–7.
28. J. Delorme, A. Nafkha, P. Leray, C. Moy, 2009 International Conference on Reconfigurable Computing and FPGAs, Cancun, Quintana Roo, Mexico, December 9, 2009, pp. 386–391.
29. Hosseinabady M., Nunez-Yanez J. L., Dynamic energy management of fpga accelerators in embedded systems. ACM Trans. Embed. Comput. Syst. 17, 1–26 (2018).
30. A. Rahman, V. Polavarapuv, Proceedings of the 2004 ACM/SIGDA 12th International Symposium on Field Programmable Gate Arrays, FPGA '04, Monterey, CA, February 22–24, 2004, pp. 23–30.
31. L. Shang, A. S. Kaviani, K. Bathala, Proceedings of the 2002 ACM/SIGDA Tenth International Symposium on Field-Programmable Gate Arrays, FPGA '02, Monterey, CA, February 24–26, 2002, pp. 157–164.
32. F. Li, D. Chen, L. He, J. Cong, Proceedings of the 2003 ACM/SIGDA Eleventh International Symposium on Field Programmable Gate Arrays, FPGA '03, Monterey, CA, February 23–25, 2003, pp. 175–184.
33. FPL '02: Proceedings of the Reconfigurable Computing Is Going Mainstream, 12th International Conference on Field-Programmable Logic and Applications (Springer-Verlag, 2002).
34. J. Greene, S. Kaptanoglu, W. Feng, V. Hecht, J. Landry, F. Li, A. Krouglyanskiy, M. Morosan, V. Pevzner, Proceedings of the 19th ACM/SIGDA International Symposium on Field Programmable Gate Arrays, Monterey, CA, USA, Feb 27–March 1, 2011, pp. 87–96.
35. Tanachutiwat S., Liu M., Wang W., Fpga based on integration of cmos and rram. IEEE Trans. Very Large Scale Integr. 19, 2023–2032 (2011).
36. Huang K., Ha Y., Zhao R., Kumar A., Lian Y., A low active leakage and high reliability phase change memory (pcm) based non-volatile fpga storage element. IEEE Trans. Circuits Syst. I: Regul. Pap. 61, 2605–2613 (2014).
37. Zhao W., Belhaire E., Chappert C., Mazoyer P., Spin transfer torque (stt)-mram–based runtime reconfiguration fpga circuit. ACM Trans. Embed. Comput. Syst. 9, 1–16 (2009).
38. Mulaosmanovic H., Breyer E. T., Dünkel S., Beyer S., Mikolajick T., Slesazeck S., Ferroelectric field-effect transistors based on hfo2: A review. Nanotechnology 32, 10.1088/1361-6528/ac189f (2021).
39. Schroeder U., Park M. H., Mikolajick T., Hwang C. S., The fundamentals and applications of ferroelectric HfO2. Nat. Rev. Mater. 7, 653–669 (2022).
40. Khan A. I., Keshavarzi A., Datta S., The future of ferroelectric field-effect transistor technology. Nat. Electron. 3, 588–597 (2020).
41. Yu T., Xu Y., Deng S., Zhao Z., Jao N., Kim Y. S., Duenkel S., Beyer S., Ni K., George S., Narayanan V., Hardware functional obfuscation with ferroelectric active interconnects. Nat. Commun. 13, 2235 (2022).
42. Chen X., Ni K., Niemier M. T., Han Y., Datta S., Hu X. S., Power and area efficient fpga building blocks based on ferroelectric fets. IEEE Trans. Circuits Syst. I: Regul. Pap. 66, 1780–1793 (2019).
43. Jiang Z., Zhao Z., Deng S., Xiao Y., Xu Y., Mulaosmanovic H., Duenkel S., Beyer S., Meninger S., Mohamed M., Joshi R., Gong X., Kurinec S., Narayanan V., Ni K., On the feasibility of 1t ferroelectric FET memory array. IEEE Trans. Electron Devices 69, 6722–6730 (2022).
44. Mulaosmanovic H., Ocker J., Müller S., Schroeder U., Müller J., Polakowski P., Flachowsky S., van Bentum R., Mikolajick T., Slesazeck S., Switching kinetics in nanoscale hafnium oxide based ferroelectric field-effect transistors. ACS Appl. Mater. Interfaces 9, 3792–3798 (2017).
45. S. Deng, G. Yin, W. Chakraborty, S. Dutta, S. Datta, X. Li, K. Ni, 2020 IEEE Symposium on VLSI Technology, June 14–19, 2020, pp. 1–2.
46. Cao Y. K., What is predictive technology model (ptm)? SIGDA Newsl. 39, 1 (2009).
47. Y. Xiao, Y. Xu, Z. Jiang, S. Deng, Z. Zhao, A. Mallick, L. Sun, R. Joshi, X. Li, N. Shukla, V. Narayanan, K. Ni, IEEE International Electron Devices Meeting, San Francisco, CA, USA, Dec. 3–7, 2022.
48. Yoon J.-H., Chang M., Khwa W.-S., Chih Y.-D., Chang M.-F., Raychowdhury A., A 40-nm 118.44-tops/w voltage-sensing compute-in-memory rram macro with write verification and multi-bit encoding. IEEE J. Solid-State Circuits 57, 845–857 (2022).
49. C. Lin, S. Kang, Y. Wang, K. Lee, X. Zhu, W. Chen, X. Li, W. Hsu, Y. Kao, M. Liu, W. Chen, Y. Lin, M. Nowak, N. Yu, L. Tran, 2009 IEEE International Electron Devices Meeting (IEDM), Baltimore, MD, USA, December 7–9, 2009, pp. 1–4.
50. Murray K. E., Petelin O., Zhong S., Wang J. M., ElDafrawy M., Legault J.-P., Sha E., Graham A. G., Wu J., Walker M. J. P., Zeng H., Patros P., Luu J., Kent K. B., Betz V., Vtr 8: High performance cad and customizable fpga architecture modelling. ACM Trans. Reconfigurable Technol. Syst. 13, 1–55 (2020).
51. J. Rose, J. Luu, C. W. Yu, O. Densmore, J. Goeders, A. Somerville, K. B. Kent, P. Jamieson, J. Anderson, Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays, FPGA '12, Monterey, CA, USA, February 22–24, 2012, pp. 77–86.
52. Xilinx vitis ai platform, www.xilinx.com/products/design-tools/vitis/vitis-ai.html.
53. M. Trentzsch, S. Flachowsky, R. Richter, J. Paul, B. Reimer, D. Utess, S. Jansen, H. Mulaosmanovic, S. Müller, S. Slesazeck, J. Ocker, M. Noack, J. Müller, P. Polakowski, J. Schreiter, S. Beyer, T. Mikolajick, B. Rice, 2016 IEEE International Electron Devices Meeting (IEDM) (IEEE, 2016), pp. 11–5.
54. S. Beyer, S. Dünkel, M. Trentzsch, J. Müller, A. Hellmich, D. Utess, J. Paul, D. Kleimaier, J. Pellerin, S. Müller, J. Ocker, A. Benoist, H. Zhou, M. Mennenga, M. Schuster, F. Tassan, M. Noack, A. Pourkeramati, F. Müller, M. Lederer, T. Ali, R. Hoffmann, T. Kämpfe, K. Seidel, H. Mulaosmanovic, E. T. Breyer, T. Mikolajick, S. Slesazeck, 2020 IEEE International Memory Workshop (IMW), Dresden, Germany, May 17–20, 2020, pp. 1–4.
55. S. Dutta, H. Ye, W. Chakraborty, Y.-C. Luo, M. S. Jose, B. Grisafe, A. Khanna, I. Lightcap, S. Shinde, S. Yu, S. Datta, 2020 IEEE International Electron Devices Meeting (IEDM), December 12–18, 2020, pp. 36.4.1–36.4.4.
56. Tan A. J., Liao Y.-H., Wang L.-C., Shanker N., Bae J.-H., Hu C., Salahuddin S., Ferroelectric hfo2 memory transistors with high-κ interfacial layer and write endurance exceeding 10^10 cycles. IEEE Electron Device Lett. 42, 994–997 (2021).
57. Vivado design suite user guide: Partial reconfiguration. https://docs.xilinx.com/v/u/2018.1-English/ug909-vivado-partial-reconfiguration (2018).