Abstract
End-to-end encryption and reliability of the transmitted data are essential requirements in the present era of internet enabled smart devices. Adhering to current industry standards, the Advanced Encryption Standard (AES) and Cyclic Redundancy Check (CRC) are the two most utilized methods for ensuring security and reliability. To integrate AES and CRC functionality in ultralow-power embedded System on Chips (SoCs), dedicated computation engines/co-processors are often used, consuming valuable silicon area and additional battery power. This paper presents the design of an energy-efficient multipurpose encryption engine capable of processing both AES and CRC algorithms using a shared Galois Field Computation Unit (GFCU). By decomposing the necessary Galois Field operations of AES and CRC to their fundamental binary steps, it was possible to identify shared operations in these two algorithms. This approach allowed the development of a resource shared system architecture capable of computing AES-128 and CRC-32 using a single computation unit. The GFCU based design was implemented in an area of 151μm x 151μm in 90nm technology node. The energy consumption of the design operating at 0.8 V supply voltage for a 25.6 Mbps throughput was less than 280pJ and 140pJ for AES-128 encryption and CRC-32, respectively.
Keywords: Energy efficient encryption, energy efficient computing, sustainable computing, galois field, internet of things
1. Introduction
Trends in present technology and the foreseeable future exhibit a steady evolution towards data-centric connected devices [1]. This trend promises a future where internet connectivity will be omnipresent in electronic products. Even at present, wireless connectivity is being introduced in a large variety of new applications, ranging from household electronics to critical biomedical devices such as pacemakers. This upsurge in data traffic creates new challenges around the security and the reliability of the information being transmitted. Privacy of information is of major importance in modern communication, and various encryption standards have been developed to address this need. Data Encryption Standard (DES) [2], Advanced Encryption Standard (AES) [3] and Blowfish [4] are some of the encryption algorithms that are currently in practice [5]. Among these standards, AES has seen the most widespread adoption [6] due to its ease of implementation, computational efficiency and robustness. Alongside security, the need for reliable data transmission has been addressed by error checking methods such as Cyclic Redundancy Check (CRC) [7]. CRC has multiple variants and is typically employed in the hardware layer of the communication protocol. The IEEE 802.15.1 wireless protocol, also known as Bluetooth, is an ideal example where both AES and CRC are utilized [8] to achieve secure and robust connectivity. Other usage examples of AES and CRC include IEEE 802.11 (Wireless LAN) [7] and IEEE 802.15.4 (ZigBee) [8]. All three of these wireless communication protocols are extremely popular for IoT applications and are used in millions of devices. Improving energy efficiency in these applications is important to sustain the next generation of IoT devices, which are expected to grow exponentially in number [11].
Designing a product with a secure and reliable communication interface requires additional computation to process AES, CRC or similar algorithms. These computational workloads can be embedded in the SoC firmware [12] or can be integrated as dedicated coprocessors. The latter is the preferred approach in SoC design but can become challenging in ultra-low-power embedded systems where power and area are at a premium. In many SoCs targeted towards IoT applications, minimizing the power and area requirement of the integrated IPs takes precedence over maximum throughput. Most IoT applications do not require high data rates, which leaves additional headroom for design optimization. Implementations of AES or CRC can also be found in smartcards, where low power operation holds higher priority than data rate.
In this paper, we analyze the underlying computation steps of the AES and CRC algorithms with the intent of developing a low-power and area-efficient solution for computing both algorithms. By identifying the arithmetic similarity of these two algorithms, it was possible to realize a micro-coded design that shares hardware resources. This approach facilitated the implementation of an area-efficient solution, reducing the number of gates compared to a non-micro-coded design. The chosen micro-architecture also allowed for efficient clock gating and a smaller area, delivering a significant reduction in dynamic and leakage power.
2. Theoretical Background
Although AES and CRC are two different algorithms developed for different use cases, their underlying operations are computed using the same finite field arithmetic, also known as Extended Galois Field (GF) arithmetic [13]. The design presented in this paper is based on the simplification of Galois Field arithmetic using logical binary operations. A brief overview of the GF representation of binary numbers, along with the computation steps of AES and CRC, is presented here to support the description of the proposed design.
2.1. Galois Field Representation
A Galois Field (GF) is any number space in which a finite set of unique elements exists, in contrast to the real number space (R), which consists of infinitely many unique elements [14]. AES and CRC are computed in GF(2), which contains only two unique numbers, 0 and 1, similar to a “bit” in binary space. The mapping of a byte (8 bits) to GF(2) is called GF(2^8) and is represented as a polynomial in the Galois Field. Each bit represents the value of the corresponding coefficient of the polynomial. For example, a byte with the value 01010011 is equivalent to the polynomial x^6 + x^4 + x + 1 in GF(2^8). This mapping is visualized in Fig. 1. All computation in the AES algorithm is performed on polynomials with eight coefficients (degree at most 7) in GF(2^8) space.
Fig. 1.
Representation of bits as coefficients of a polynomial in GF(2^8).
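The byte-to-polynomial mapping can be illustrated with a few lines of Python. This is only a sketch for the reader: the function name `byte_to_polynomial` and the text format of the output are assumptions made for this example and are not part of the proposed design.

```python
def byte_to_polynomial(value):
    """Return the GF(2^8) polynomial of an 8-bit value as text.

    Bit i of the byte is the coefficient of x^i, so only the bits
    that are set contribute a term.
    """
    terms = []
    for power in range(7, -1, -1):      # most significant bit first
        if (value >> power) & 1:
            if power == 0:
                terms.append("1")
            elif power == 1:
                terms.append("x")
            else:
                terms.append(f"x^{power}")
    return " + ".join(terms) if terms else "0"


# The example byte used in the text above.
print(byte_to_polynomial(0b01010011))
```

For the byte 01010011, bits 6, 4, 1 and 0 are set, which gives the four terms quoted in the text.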
2.2. Advanced Encryption Standard
AES is a special variant of the Rijndael cipher [15] where the cipher key is limited to three specific sizes: 128, 192 or 256 bits. In this paper, the design of an AES-128 engine is investigated due to its widespread application and popularity. Hardware developed for AES-128 can be easily scaled to larger key sizes if needed [16]. The input and output of the AES-128 algorithm are always 128 bits or 16 bytes long. The algorithm outputs 16 bytes of encrypted data, which eliminates any resemblance to the input by increasing the entropy. The original AES proposal [3] represents the 16 bytes as a two-dimensional array, referred to as the state. A sample state is presented in Fig. 2. The algorithm for 128-bit AES encryption consists of 10 rounds of processing. Each round of processing includes a single-byte based substitution step (SubBytes), a row-wise rotating shift step (ShiftRows), a column-wise mixing step (MixColumns), and the addition of the round key (AddRoundKey). These four operations are performed in each round, except for the last round, which does not require the MixColumns step.
Fig. 2.
Byte and word arrangement in an input array or “state”.
The order in which these four steps are executed is different for encryption and decryption. The decryption procedure requires the inverse of the SubBytes, ShiftRows and MixColumns steps, termed InvSubBytes, InvShiftRows and InvMixColumns. The AddRoundKey step is the same in both the forward and the reverse cipher. In each round of AES, the key is expanded using a “Key Schedule” algorithm. The overall flow of the 128-bit cipher and inverse cipher for AES is shown in Fig. 3. The design proposed in this paper is responsible for computing all 10 rounds of AES-128. In a typical SoC with encryption ability [17], the key expansion is not handled by the AES engine since the key is updated infrequently, usually after processing multiple streams of input states. The key is expanded and placed in memory by the host processor for the AES engine to use.
Fig. 3.
Encryption and decryption flow in AES for a 128-bit key.
Each round of the AES algorithm computes the four steps in sequence. The AddRoundKey step performs an addition of each byte in the state with the corresponding byte in the key. Addition in GF(2^8) is equivalent to a logical XOR between the bytes. The arithmetic rule for addition and subtraction of two 8-bit GF polynomials A(x) and B(x) is listed in
$$A(x) \pm B(x) = \sum_{i=0}^{7} c_i x^i, \qquad c_i = (a_i \pm b_i) \bmod 2 \tag{1}$$
where x^i is the i-th bit of the result, c_i is the coefficient of x^i, a_i is the i-th coefficient in A(x) and b_i is the i-th coefficient in B(x). Equation (1) can be simplified to show that addition and subtraction in GF(2^8) are both a bitwise XOR operation between the coefficients of the two polynomials. The AddRoundKey step is also utilized in the inverse cipher of the AES decryption process. The SubBytes step performs a lookup in the SBOX table using the bytes of the state as addresses. The looked-up values are then used to replace the original bytes. In the reverse cipher, a reversed SBOX is used to perform the lookup in the InvSubBytes step. The ShiftRows step performs a circular left shift of each row of the state. The amount of shift is (r - 1) bytes, where r indicates the row number and ranges from 1 to 4. In contrast, the InvShiftRows step performs a circular right shift of each row with the same shift amount. The MixColumns step is the most complex step of the entire algorithm. In this step, a constant 4×4 matrix in GF(2^8), denoted by C, is multiplied with each column of the state. Since each column consists of four bytes, it can be considered as a 4×1 matrix. Therefore, the matrix multiplication of each column with C results in a new 4×1 matrix. This result is used to replace the original column in the state. The arithmetic representation of the MixColumns step is given in
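$$S'_i = C \cdot S_i \tag{2}$$
where S'_i is column i of the output state, S_i is column i of the input state, and
$$C = \begin{bmatrix} \mathtt{02} & \mathtt{03} & \mathtt{01} & \mathtt{01} \\ \mathtt{01} & \mathtt{02} & \mathtt{03} & \mathtt{01} \\ \mathtt{01} & \mathtt{01} & \mathtt{02} & \mathtt{03} \\ \mathtt{03} & \mathtt{01} & \mathtt{01} & \mathtt{02} \end{bmatrix}$$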
The constant matrix used in this step is defined in the AES specifications. Due to the limited variation in the C matrix, MixColumns only requires doubling (multiply-by-2) and tripling (multiply-by-3) operations along with addition, all of which are performed on GF(2^8) polynomials. Multiplication in GF(2^8) follows different arithmetic rules from integer multiplication and requires special hardware to implement. Rather than implementing an entire multiplier for GF(2^8) polynomials, it is more efficient to implement only doubling and tripling circuits because of the mathematical simplification that can be achieved. The simplification can be derived from the rule of multiplication in GF(2^8), given in
$$A(x) \otimes B(x) = \big(A(x) \times B(x)\big) \bmod P(x) \tag{3}$$
where A(x) and B(x) denote the two polynomial inputs and P(x) is an irreducible polynomial of degree 8 over GF(2). P(x) can take on different values, but a fixed polynomial x^8 + x^4 + x^3 + x + 1 is defined by the specifications of AES. Based on Equation (3), a doubling operation can be simplified to a left shift and a conditional XOR operation with a constant value of 00011011. The step-by-step operation for doubling is given in Equations 4.a to 4.e.
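$$\begin{aligned} \mathtt{02} \cdot A(x) &= x \cdot A(x) \bmod P(x) && \text{(4.a)} \\ &\;\;\vdots \\ &= (A \ll 1) \oplus (a_7 \cdot 00011011) && \text{(4.e)} \end{aligned}$$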
Following the rules of multiplication, the tripling operation can be simplified to a doubling operation followed by an addition of the same polynomial. The addition in GF(2^8) is nothing more than an unconditional XOR operation. The steps for the tripling operation are given in Equations 5.a to 5.e.
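$$\begin{aligned} \mathtt{03} \cdot A(x) &= \big(\mathtt{02} \cdot A(x)\big) \oplus A(x) && \text{(5.a)} \\ &\;\;\vdots \\ &= (A \ll 1) \oplus (a_7 \cdot 00011011) \oplus A && \text{(5.e)} \end{aligned}$$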
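The doubling and tripling rules can be checked with a short Python sketch. The function names `gf_double` and `gf_triple` are hypothetical and chosen only for this illustration; the constant 0x1B is the value 00011011 quoted in the text, and the sample values come from the worked multiplication example in the AES specification.

```python
AES_REDUCTION = 0x1B  # 00011011, the low byte of x^8 + x^4 + x^3 + x + 1


def gf_double(byte):
    """Multiply one GF(2^8) element by 2: left shift, then conditional XOR."""
    carry = byte & 0x80               # bit a_7, shifted out by the left shift
    shifted = (byte << 1) & 0xFF      # keep the result within 8 bits
    return shifted ^ AES_REDUCTION if carry else shifted


def gf_triple(byte):
    """Multiply by 3: doubling followed by an unconditional XOR with the input."""
    return gf_double(byte) ^ byte


print(hex(gf_double(0x57)), hex(gf_triple(0x57)))
```

The XOR with 0x1B is applied only when the most significant bit of the input is set, which is the "conditional XOR" described above.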
Based on the equations presented, the MixColumns step can be decomposed into a conditional sequence of two simple logical operations: i) a logical left shift and ii) a logical XOR. The InvMixColumns step needed for the inverse cipher involves the same transformation given in Equation (2) using a different constant matrix defined in Equation (6). The computations involved in multiplying the state polynomials by 0e, 0b, 0d and 09 can be broken down into a sequence of multiply-by-2 and multiply-by-3 operations.
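$$C^{-1} = \begin{bmatrix} \mathtt{0e} & \mathtt{0b} & \mathtt{0d} & \mathtt{09} \\ \mathtt{09} & \mathtt{0e} & \mathtt{0b} & \mathtt{0d} \\ \mathtt{0d} & \mathtt{09} & \mathtt{0e} & \mathtt{0b} \\ \mathtt{0b} & \mathtt{0d} & \mathtt{09} & \mathtt{0e} \end{bmatrix} \tag{6}$$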
These simplifications to improve the computational efficiency of AES have been the focus of many researchers [18], [19], [20]. As a result, all the arithmetic calculations of AES can be decomposed into a set of basic binary operations. In this paper, these binary operations are further optimized with a focus on improving the reusability of hardware resources and achieving maximum functionality per gate. To add more flexibility to the design, the possibility of computing the CRC algorithm on the same hardware was also investigated.
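To illustrate how the InvMixColumns constants reduce to shifts and XORs, the sketch below multiplies by 0e, 0b, 0d and 09 using repeated doubling and XOR accumulation. This is one possible decomposition chosen for clarity, not necessarily the sequence of multiply-by-2 and multiply-by-3 operations used by the proposed hardware; the helper names are hypothetical.

```python
def gf_double(byte):
    """Multiply a GF(2^8) element by 2 (left shift, conditional XOR with 0x1B)."""
    byte <<= 1
    if byte & 0x100:
        byte ^= 0x11B                 # clears bit 8 and applies 00011011
    return byte


def gf_mul_const(byte, constant):
    """Multiply by a small constant using only doubling and XOR."""
    result = 0
    power = byte                      # byte times 1, 2, 4, 8, ...
    while constant:
        if constant & 1:
            result ^= power
        power = gf_double(power)
        constant >>= 1
    return result


# Constant matrix of the inverse cipher, as defined in the AES specification.
INV_MIX_MATRIX = [
    [0x0E, 0x0B, 0x0D, 0x09],
    [0x09, 0x0E, 0x0B, 0x0D],
    [0x0D, 0x09, 0x0E, 0x0B],
    [0x0B, 0x0D, 0x09, 0x0E],
]


def inv_mix_column(column):
    """Apply the inverse matrix to one 4-byte column of the state."""
    output = []
    for matrix_row in INV_MIX_MATRIX:
        acc = 0
        for coeff, byte in zip(matrix_row, column):
            acc ^= gf_mul_const(byte, coeff)
        output.append(acc)
    return output


print([hex(b) for b in inv_mix_column([0x8E, 0x4D, 0xA1, 0xBC])])
```

For example, 0e is 8 + 4 + 2, so the product is the XOR of the input doubled once, twice and three times.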
2.3. Cyclic Redundancy Check (CRC)
One of the most common computational operations in present-day digital devices is CRC, or Cyclic Redundancy Check. CRC is a technique for ensuring the reliability of data transmission or data storage where there is a possibility of data loss or corruption due to noise or other causes. It works by attaching a few bits of additional encoded information to the actual data before transmission or storage. These extra bits are computed using a hashing algorithm and can be used by the receiver/retriever to ensure that the data integrity is maintained, or in other words, to check whether the data is the same as when the CRC hash was calculated. The output of the CRC calculation can range from a single bit to multiple bytes. Depending on the length of the CRC result, there exist multiple CRC standards. The flow of calculation for these CRC standards is largely the same, and most of the differences lie in the endianness of the data, the chosen polynomial and the choice of an inverted result.
The CRC hash is based on GF(2) polynomial arithmetic, similar to AES. The data for which the CRC has to be computed is considered to be a long binary number or a polynomial in GF(2). The hash algorithm divides this number by a fixed generator polynomial and saves the remainder as the CRC output. Depending on the length of this polynomial, the CRC is called either CRC-8, CRC-16 or CRC-32. Division in GF(2) can be visualized as a simple long division of polynomials and can be simplified to repeated shift and subtract operations. Referring back to Equation (1), a subtraction in GF(2) is the same as an addition and can be performed via an XOR operation. Due to the underlying finite field arithmetic, the fundamental operations of CRC are similar to the AddRoundKey and ShiftRows steps of AES, allowing for a shared design approach.
CRC-32 is a variant of the CRC algorithm that enforces the use of an initial value, inversion of the result, and processing of the input bit stream in a reversed manner [24]. The reverse processing requirement is usually achieved by processing the most significant byte first from a data stream. The overall implementation of CRC-32 can be broken down into three fundamental steps: CRCInit, CRCByteLoad and CRCComp. The CRCInit step loads the initial 32-bit CRC value of 0xFFFFFFFF into the system. The CRCByteLoad step extracts the most significant byte from the data stream and XORs it with the initial value. The result of CRCByteLoad is used in the CRCComp step to compute the CRC of each byte using the “Shift and XOR” method mentioned earlier. To compute the CRC of a state, each byte of the state must be processed sequentially using a combination of CRCByteLoad and CRCComp.
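The three steps can be sketched in software as follows. This is a minimal illustration and not the hardware implementation: it assumes the common reflected formulation of CRC-32 (right shift with the reversed polynomial 0xEDB88320), and its byte and bit ordering may differ from the ordering used inside the proposed design. The Python function names only mirror the step names used in the text.

```python
CRC32_INIT = 0xFFFFFFFF
CRC32_POLY_REFLECTED = 0xEDB88320     # assumed: bit-reversed CRC-32 polynomial


def crc_init():
    """CRCInit: start from the all-ones initial value."""
    return CRC32_INIT


def crc_byte_load(crc, byte):
    """CRCByteLoad: XOR the next data byte into the running CRC."""
    return crc ^ (byte & 0xFF)


def crc_comp(crc):
    """CRCComp: eight rounds of shift and conditional XOR with the polynomial."""
    for _ in range(8):
        shifted_out = crc & 1
        crc >>= 1
        if shifted_out:
            crc ^= CRC32_POLY_REFLECTED
    return crc


def crc32(data):
    """Run the three steps over a byte string and invert the final result."""
    crc = crc_init()
    for byte in data:
        crc = crc_comp(crc_byte_load(crc, byte))
    return crc ^ 0xFFFFFFFF


print(hex(crc32(b"123456789")))
```

The string "123456789" is the conventional check input for CRC implementations, which makes the sketch easy to compare against other CRC-32 tools.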
3. Proposed Design
Looking at the low-level operations of AES and CRC, it is apparent that the XOR operations, shift operations and dataflow patterns are similar between the algorithms. Therefore, a resource-shared computation unit is proposed in this paper that can be configured as needed to process either the four steps of AES or CRC. To reduce the complexity of the design, some modifications to AES need to be introduced in accordance with [20]. These modifications enforce a row-wise access to the state, eliminating the need for byte-level operations in the ShiftRows and MixColumns steps. The computation unit assumes that a row of the input state is provided at the input. The computation methods are designed to generate a row of the output state. This row-wise 32-bit access method is different from the column-wise algorithm described in the original specification document for AES [3]. From an arithmetic perspective, this circuit can be termed a Galois Field Computation Unit (GFCU) with selectable output modes.
3.1. Galois Field Computation Unit (GFCU)
As discussed earlier, computing AES and CRC requires a specific set of operations, and the design is formulated to meet the needs of these operations. Fig. 4 shows the block diagram of the proposed design. The proposed GFCU supports twelve different operations according to the values present at the “mode_select” control pins. Most of these operations work with the inputs (data_in1 and data_in2) and generate an output (data_out). However, some of these operations work with the internal register values (temp_reg1 and temp_reg2) and update the registers. The different modes of operation have been carefully developed with a focus on reusing the components throughout the computation steps of AES-128 and CRC-32. Table 1 summarizes the modes and the functionality of each mode. It is important to point out that the GFCU does not cater for the SubBytes step of AES, since it is a simple lookup operation with no actual logic involved. Therefore, the SubBytes step is integrated as a part of the overall system architecture. The ShiftRows step is the first logical operation of AES. The GFCU can complete the task of circular left shift using modes IV, V, VI and IX. Depending on the row being processed, a variable length left shift of 8, 16 or 24 bits is performed. Each shift is followed by an XOR operation between the result of the shift and the byte(s) shifted out (msb_out). An example of the ShiftRows operation is shown in Fig. 5. The Bn notation is used to indicate the four bytes of the state rows. The result for each row of the ShiftRows step requires a single clock cycle to appear at the output.
Fig. 4.
Functional block diagram of the proposed Galois Field Computation Unit (GFCU) for AES and CRC. The “mode” pins allow for various output logic.
TABLE 1.
Selectable Functions for the Galois Field Computation Unit
Mode | Function | Shift (L/R) | Used Values | Output | Storage | Remarks |
---|---|---|---|---|---|---|
I | XOR | 0 | in1, in2 | in1 ⊕ in2 | none | Used for AddRoundKey |
II | T1_XOR | 0 | in1, temp1 | in1 ⊕ temp1 | temp1 | Used for MixColumns |
III | T12_XOR | 0 | temp1, temp2 | temp1 ⊕ temp2 | temp1 | Used for MixColumns, CRCInit |
IV | ROTATE_8 | 8 (L) | in1, msb_out | in1 ⊕ msb_out | none | Used for ShiftRows, InvShiftRows and CRCInit |
V | ROTATE_16 | 16 (L) | in1, msb_out | in1 ⊕ msb_out | none | Used for ShiftRows, InvShiftRows |
VI | ROTATE_24 | 24 (L) | in1, msb_out | in1 ⊕ msb_out | none | Used for ShiftRows, InvShiftRows |
VII | MULT_T1 | 1 (L) | in1, in2 | in1 ⊕ in2 | temp1 | Used for MixColumns, InvMixColumns |
VIII | MULT_T2 | 1 (L) | in1, in2 | in1 ⊕ in2 | temp2 | Used for MixColumns, InvMixColumns |
IX | PASS1 | 0 | in1 | in1 | temp1 | Used for ShiftRows |
X | PASS2 | 0 | in2 | in2 | temp2 | Used for CRCInit |
XI | BYTE_LOAD | 8 (L) | msb_out, temp2 | msb_out ⊕ temp2 | temp1, temp2 | Used for CRCByteLoad |
XII | CRC_XOR | 1 (R) | temp1, in2 | temp1 ⊕ in2, temp1 | temp1, temp2 | Used for CRCComp |
Fig. 5.
Dataflow within the GFCU (simplified) for computing the ShiftRows operation on the 2nd row of the state.
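The shift-then-XOR formulation of ShiftRows can be sketched in Python as follows. The sketch assumes that each state row is packed into a 32-bit word with the leftmost byte in the most significant position; the function names are hypothetical, and only `msb_out` is named after the signal in Table 1.

```python
WORD_MASK = 0xFFFFFFFF


def rotate_left(row, bits):
    """Circular left shift of a 32-bit row by 0, 8, 16 or 24 bits.

    The bytes that fall off the top (msb_out) are XOR-ed back into the
    vacated low positions, which is equivalent to a rotation.
    """
    msb_out = row >> (32 - bits)          # byte(s) shifted out at the top
    shifted = (row << bits) & WORD_MASK   # logical left shift within 32 bits
    return shifted ^ msb_out


def shift_rows(rows):
    """Rotate row r (counted from 0) left by r bytes; row 0 is unchanged."""
    return [rotate_left(row, 8 * index) for index, row in enumerate(rows)]


state_rows = [0x00010203, 0x10111213, 0x20212223, 0x30313233]
print([f"{row:08x}" for row in shift_rows(state_rows)])
```

Because the low bits of the shifted word are zero, the XOR with `msb_out` behaves like an OR here, so the same XOR logic used elsewhere in the datapath can complete the rotation.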
The MixColumns step requires a series of operations to be completed. The benefit of using row-based access to the state is that the computation of each step can be carried out on 32 bits at a time, compared to column-based access where computation must be done using individual bytes. However, when accessing the state in rows, the calculation of the MixColumns step is performed in a slightly different manner. Rather than multiplying a column from the input state with the constant matrix C to generate the elements of a column in the output state, the row-based access reads a row from the input state and creates a partial result to fill all the rows of the output state. The step is repeated for the four rows of the input state, and the results of each row are accumulated in the output state to yield the complete result. To explain this with clarity, the matrix multiplication steps for the first-row elements of the output state are shown in Fig. 6. To calculate each output row of the MixColumns step, the system needs to compute a multiplication by 2, a multiplication by 3 and two multiplications by 1, one for each input row, and accumulate the result in the designated row of the output state. The system can utilize the GFCU to perform each of these steps. The procedure must be repeated for each row to complete the entire MixColumns operation.
Fig. 6.
Row-wise computation of the MixColumns step shown for a single row.
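The row-wise accumulation can be sketched in Python as shown below. This is an illustrative software model under stated assumptions, not the GFCU itself: each row is packed into a 32-bit word with one byte per column, the four bytes are doubled together with masks, and the names `double_row` and `mix_columns_rowwise` are hypothetical.

```python
# Coefficients of the MixColumns matrix C: entry [i][j] scales input row j
# before it is accumulated into output row i.
MIX_MATRIX = [
    [2, 3, 1, 1],
    [1, 2, 3, 1],
    [1, 1, 2, 3],
    [3, 1, 1, 2],
]


def double_row(row):
    """Multiply all four bytes of a 32-bit row by 2 in GF(2^8) at once."""
    high_bits = row & 0x80808080          # a_7 of each byte
    shifted = (row & 0x7F7F7F7F) << 1     # per-byte left shift, no spill-over
    return shifted ^ ((high_bits >> 7) * 0x1B)   # conditional XOR per byte


def mix_columns_rowwise(rows):
    """Accumulate the partial results of each input row into all output rows."""
    output = [0, 0, 0, 0]
    for j, row in enumerate(rows):
        doubled = double_row(row)
        tripled = doubled ^ row           # multiply-by-3 reuses the doubling
        partial = {1: row, 2: doubled, 3: tripled}
        for i in range(4):
            output[i] ^= partial[MIX_MATRIX[i][j]]
    return output


rows_in = [0xDBF201C6, 0x130A01C6, 0x532201C6, 0x455C01C6]
print([f"{row:08x}" for row in mix_columns_rowwise(rows_in)])
```

Each input row is read once and contributes to all four output rows, which matches the accumulation order described in the text; the result is the same as the column-wise definition in Equation (2).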
It has been shown in Equation 4.e that multiply-by-2 in GF(2^8) is a left shift and XOR operation with a constant value. Mode VII or VIII of the GFCU can be used to achieve this operation. The constant will be provided at input-2 by the system controller. The operation of a multiply-by-2 is depicted in Fig. 7.
Fig. 7.
Dataflow within the GFCU (simplified) for computing the “multiply-by-2” operation for a row of the input state.
Multiplication by 3 is an additional XOR operation between the result of multiply-by-2 and the original input. Given that the result of multiply-by-2 is stored in the temp register, a multiply-by-3 can be achieved by executing mode II right after mode VII or VIII has been executed. This is one of the reasons behind the inclusion of a temp register in the design of the GFCU. Although this approach requires that the multiply-by-3 be computed after the multiply-by-2, it keeps resource usage to a minimum at no extra cost in performance. Due to its complexity, MixColumns consumes the most clock cycles of all the steps. The remaining AddRoundKey step is a simple XOR operation between the state rows and round-key rows. This can be easily done using mode I of the GFCU.
The last requirement of the computation unit is to carry out the steps for the CRC computation. Mode X is designed to load the GFCU with the initial value of CRC-32. Each subsequent byte from the word-sized inputs is loaded and XOR-ed with the existing CRC value using mode XI. This is the previously discussed CRCByteLoad operation, shown in Fig. 8. Computation of the byte-level CRC (CRCComp) is performed in mode XII, with the CRC polynomial provided by the system controller at the second input (in2). It is important to point out that the XOR performed in mode XII is conditional and depends on the most significant bit of the previous result available in temp_reg1. Modes III and IV are also utilized in the CRC computation flow for data arrangement and result inversion.
Fig. 8.
Dataflow within the GFCU (simplified) for computing the CRCByteLoad operation on a row of the input state.
3.2. Additional Components and Integration
A system architecture was designed and developed around the GFCU to create a fully functioning AES/CRC engine. The system integrates all the supporting components needed for the necessary computation. This includes a register file for storing input and output data and two registers for storing the GF multiplication constant and the constant CRC-32 polynomial. In this system architecture, the SBOX/RSBOX values for byte substitution are stored in external memory and retrieved as needed. Since the substitution values are static, it is considerably more efficient to store them in some form of non-volatile memory than to compute them in real time. A central control unit and a mux are also present in the architecture for administrating the data flow and the mode of the GFCU for either the AES or the CRC modes. The functional block diagram of the system is shown in Fig. 9. The programming interface of the system consists of two 32-bit buses (one input and one output), a load/read signal and a control signal for triggering either AES or CRC computation. The system also has a memory interface for accessing the pre-calculated cipher key from the memory. The memory interface consists of an address bus, a data input bus, and a read enable signal.
Fig. 9.
The system architecture for an efficient AES/CRC ASIC engine utilizing a Galois Field Computation Unit.
3.3. Principle of Operation
The system presented in Fig. 9 has two separate flows of computation for AES and CRC. AES is computed for 128 bits and therefore goes through the 10 rounds of AddRoundKey, SubBytes, ShiftRows and MixColumns by using the GFCU in different modes. The role of the system controller is to control the configuration of the functional unit in accordance with the flow of AES. In addition to controlling the functional unit, the control unit also has to send appropriate signals to the supporting registers and muxes to set up the proper datapath for the computed data and its storage. The flow for computing AES is presented in Fig. 10. Encryption of 128 bits (16 bytes) of data takes 52 clock cycles for each full round, or 468 cycles for the nine rounds that include MixColumns. The final round requires fewer clock cycles, as the AES algorithm does not require MixColumns in the last round. It is straightforward to extend the proposed design to AES-192 or AES-256, since incorporating the additional rounds does not require any significant changes.
Fig. 10.
Flow of computation for one round of AES-128 encryption.
The computation flow of CRC is simpler than that of AES as it involves only three distinct operations. The input data can be 16 bytes or larger, and the resulting hash value is stored in the internal temporary register of the GFCU, where it may be used to continue the calculation if more data is available at the input. Once the computation is done, the CRC result can be read from the output port. The computation flow for CRC is shown in Fig. 11. The number of cycles required for CRC computation depends on the size of the input data. For an input of n bits, a total of 2n cycles is required, since each bit requires 2 cycles to complete.
Fig. 11.
Flow for CRC-32 computation for 1-byte of data.
4. RTL Description and Implementation
To verify the advantages of the proposed design, a hierarchical RTL description of the entire design was developed using Verilog. The module descriptions and synchronous procedural blocks were created with a focus on architectural clock gating and data gating to reduce power. The clocked modules were created with properly placed enable signals to allow the logic synthesis tool to identify and automatically insert clock gating cells wherever possible. In addition to clock gating, attention was given to disabling the output of unutilized modules. For example, the register file is kept disabled in all states except for the states with a register read/write. Similarly, the entire GFCU is disabled during non-computational phases, such as SubBytes. The design was tested with a testbench in the VCS simulator [21] and verified against pre-computed AES/CRC outputs to ensure correct functionality.
4.1. Logic Synthesis and Design Constraints
The functionally verified RTL description of the design was synthesized with the Synopsys 90 nm Generic Library [22] using Synopsys Design Compiler. Synopsys Power Compiler was invoked to enable automatic clock gating and datapath gating during the design elaboration step. Switching activity information from the RTL simulation was provided to the synthesis tool to perform activity-based logic optimization to enhance timing and power. A mixture of standard-threshold and high-threshold transistors was used to achieve a low-power implementation within the provided timing constraints. The design was synthesized for a maximum clock frequency of 100.0 MHz with a target of minimum silicon area. A 100.0 MHz clock can support a maximum data throughput of 25.6 Mbps, which exceeds the requirements of Bluetooth 4.0 Low Energy [23], the popular low-power IEEE 802.15.4 protocol ZigBee [10] and most internet-enabled IoT devices. The post-synthesis netlist, parasitic information and standard delay format files were collected and imported to VCS [21] to ensure that the circuit functioned properly with clock gating cells in place and at the intended operating speed.
4.2. Physical Implementation and Power Estimation
The synthesized netlist was imported to Cadence Encounter for automated placement and routing. The netlist was routed with 1 polysilicon layer and 5 layers of metal in a chip area of 22,724 μm². Post-route simulation and DRC verification were performed to ensure correct functionality.
Synopsys PrimeTime PX was used to determine an accurate power consumption of the circuit in its actual operating conditions. Fig. 12 shows the layout of the entire design. The synthesized netlist required 2,332 standard cells.
Fig. 12.
The layout of the proposed AES/CRC engine synthesized and routed with 90 nm standard cell library.
5. Results and Performance Analysis
The proposed design of the AES/CRC engine (GFCU) was created with a focus on achieving a small area and reducing active switching elements. The data and cipher key used for simulating the circuit were collected from the AES processing standards [3]. Given that the operating frequency of IoT devices has a wide range, the dynamic power consumption of the circuit was measured for computing both AES and CRC over a range of clock frequencies: 6.0, 12.0, 25.0, 50.0 and 100.0 MHz. At 0.8 V operating voltage, the power characteristics of the circuit at different frequencies are shown in Fig. 13a. From 6.0 MHz to 100.0 MHz, the delta in power consumption is 28.13 μW and 41.22 μW for AES and CRC computation, respectively. The rate of increase is 0.3 μW/MHz for AES computation and 0.4 μW/MHz for CRC. This sustained low-power behavior enables extremely low energy consumption at higher operating speeds. For an input state, the 10 rounds of AES-128 require a total of 499 clocks, while a CRC-32 requires 176 clocks. With each step-up in frequency, the computation time is reduced significantly for each workload. The corresponding energy consumption per AES and CRC operation is shown in Fig. 13b.
Fig. 13.
(a) Power consumption characteristics over the frequency range. (b) The energy per operation at each operating frequency.
Optimization of AES for low-power applications is a well-researched topic. In most approaches, the goal is to reduce the energy consumption of each AES operation by either improving the performance or lowering the power consumption. Streamlining the design of one or more substeps of AES and minimizing control logic complexity are effective ways to lower the overall power consumption. This approach is seen in the work by Bui et al. [24], which proposes a 32-bit datapath based on a simple permutation network for the ShiftRows step combined with a low-area S-BOX implementation. The work by Hamalainen et al. [27] is also based on an 8-bit byte permutation network, but the authors also optimized the MixColumns step to reduce overall power.
AES architectures focused on both low area and low power tend to minimize the computation data width and reuse resources as much as possible. This approach is practiced by Good and Benaissa in [25], where they propose an 8-bit datapath equipped with a shared SubByte-ShiftRow unit. The multi-purpose unit uses composite field arithmetic and alternates between two modes as required by the datapath. The implementation of the datapath uses buffer registers to enable the use of low-power single-port memories and rigorous clock gating. To the best of our knowledge, none of the existing work on AES takes advantage of the common Galois Field operations between AES and CRC.
The resource-shared design approach towards AES and CRC presented in this paper was compared with existing state-of-the-art AES solutions from different disciplines, and the comparison is shown in Table 2. The work in [25] and [26] is also focused on low resource usage and proposes the design of separate modules for computing SubBytes, MixColumns and AddRoundKey. However, the deep datapath of the design proposed in [25] limits the maximum operating speed and throughput to only 4.31 Mbps. The work in [26] focuses on developing dedicated modules optimized for MixColumns, ShiftRows and SubBytes at the cost of higher area usage and dynamic power. The work in [24] and [27] uses a byte permutation network which is responsible for the ShiftRows operation and temporary storage of intermediate data. MixColumns and SubBytes are implemented as dedicated modules in both. The low logic depth of the units allows a high operating speed of up to 130 MHz. However, the multi-layer flop-based permutation network likely consumes significant sequential power and prevents efficient clock gating. When compared with existing work, the primary strength of the design proposed in this paper is the efficient use of a single 32-bit computation unit for all steps of AES. At the simulated data size of 128-bit encryption, the obtained results highlight the benefits in area utilization and dynamic power consumption against the other implementations. In contrast to all other work listed, the proposed design also has the added functionality of computing CRC in the same silicon area.
TABLE 2.
Power and Area Comparison with Existing Work for Computing 128-bit AES
Design | Technology | Voltage | Clock | Gates | Area | Dynamic Power | Clocks/AES | Energy/AES | CRC Capability |
---|---|---|---|---|---|---|---|---|---|
This Work | 90 nm | 0.8 V | 100 MHz | 2.3K | 22724 μm² | 55.2 μW | 499 | 0.28 nJ | CRC-32 |
Ref [24] | 65 nm | 1.32 V | 10 MHz | NA | NA | 150 μW | 44 | 0.66 nJ | None |
Ref [25] | 130 nm | 0.8 V | 12 MHz | 5.5K | 21000 μm² | 99 μW | 356 | 2.94 nJ | None |
Ref [26] | 90 nm | 1.2 V | 10 MHz | 9.0K | 50377 μm² | 224 μW | 500 | 11.2 nJ | None |
Ref [27] | 130 nm | NA | 130 MHz | 3.5K | NA | 3.9 mW | 160 | 4.82 nJ | None |
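The Energy/AES column in Table 2 is consistent with the relation E = P × N / f, where P is the dynamic power, N the number of clock cycles per 128-bit block and f the clock frequency. The short Python sketch below recomputes the column from the table values as a cross-check. The helper names are hypothetical, and the throughput figure assumes one 128-bit block is completed every N cycles with no overlap between blocks.

```python
def energy_per_block_nj(power_w, clock_hz, cycles):
    """Energy for one block in nJ: power times the time taken by `cycles` clocks."""
    return power_w * cycles / clock_hz * 1e9


def throughput_mbps(clock_hz, cycles, block_bits=128):
    """Throughput in Mbps, assuming one block finishes every `cycles` clocks."""
    return block_bits * clock_hz / cycles / 1e6


# (dynamic power in W, clock in Hz, clocks per AES block), copied from Table 2
designs = {
    "This Work": (55.2e-6, 100e6, 499),
    "Ref [24]": (150e-6, 10e6, 44),
    "Ref [25]": (99e-6, 12e6, 356),
    "Ref [26]": (224e-6, 10e6, 500),
    "Ref [27]": (3.9e-3, 130e6, 160),
}

for name, (power_w, clock_hz, cycles) in designs.items():
    energy = energy_per_block_nj(power_w, clock_hz, cycles)
    rate = throughput_mbps(clock_hz, cycles)
    print(f"{name}: {energy:.2f} nJ per block, {rate:.2f} Mbps")
```

For the proposed design this gives roughly 0.28 nJ per encryption and about 25.65 Mbps, and for [25] it gives about 4.31 Mbps, in line with the figures quoted in the text. The recomputed values for the other rows agree with the table to within rounding.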
6. Conclusions
Low-power and area-efficient AES encryption engines are sought after for resource-constrained embedded systems with security requirements. Given the rapid growth of such systems in IoT devices, low-area and low-energy solutions are critical because they support smaller chip area and longer battery runtime, both of which contribute to the sustainability of emerging technologies. In the context of AES and CRC operations, identifying the similarities between the algorithms allowed the two functionalities to be merged into a single versatile design. Understanding the nature of the data flow in the system enabled the proper application of power-saving techniques such as clock gating. In this paper, the arithmetic foundation of AES and CRC computation was analyzed to find binary operations that are similar or based on an identical principle of calculation. Based on these observations, an efficient AES/CRC engine was proposed and developed that reuses gate-level logic resources to reduce the overall area requirement and the number of gates active at any given time. The design occupies a small silicon area of 22,724 μm² when synthesized in a 90 nm process. Using low-power techniques, an ultra-low power consumption of 52.2 μW and 82.3 μW was achieved for AES-128 encryption and CRC-32 calculation, respectively. At an operating voltage of 0.8 V and a clock frequency of 100 MHz, the energy consumption was measured to be 280 pJ per AES-128 encryption and 140 pJ per CRC-32 calculation over 128 bits of data. The current limitation of the design is its inability to compute AES and CRC in parallel, because a single GFCU serves both tasks. However, the simple Galois Field Computation Unit (GFCU) presented in this paper is highly versatile, and multiple such units can be parallelized in future work to improve bandwidth and enable parallel processing of AES and CRC. The design also scales to lower operating frequencies, where it can achieve significantly lower power.
Acknowledgments
Research reported in this paper was supported in part by the National Institute of General Medical Sciences of the National Institutes of Health under award number SC3 GM122735-01A1. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Biography
Safwat M. Noor (M’15) received the MSc degree in embedded systems from the Nanyang Technological University, Singapore, and the PhD degree in electrical engineering from the University of Texas at San Antonio. Currently, he is working as an ASIC Design Engineer specializing in low power design at Apple Inc. in Austin, TX. His research interests include low power embedded architecture, ASIC design and computer architecture. He has worked as a research scientist associate with the University of Texas at San Antonio and is the recipient of a research fellowship from the same institution. He is a member of the IEEE.
Eugene B. John received the PhD degree in electrical engineering from Pennsylvania State University. He is currently a professor with the Department of Electrical and Computer Engineering, University of Texas at San Antonio. His research interests include low power circuits and systems, low power VLSI design, energy efficient computing, energy efficient hardware for machine learning, computer architecture and computer performance analysis. He is a recipient of the University of Texas System Regents’ Outstanding Teaching Award (2014). He is a senior member of the IEEE.
Contributor Information
Safwat Mostafa Noor, Apple Inc., Austin, TX 78746.
Eugene B. John, Department of Electrical and Computer Engineering, University of Texas at San Antonio, San Antonio, TX 78249.
References
- [1] Andersen M, “Trends in internet of things platforms,” XRDS, vol. 22, no. 2, pp. 40–43, Dec. 2015.
- [2] Standard DE, “FIPS PUB 46,” Append. Fed. Inf. Process. Stand. Publ, vol. 2, pp. 1–22, 1977.
- [3] Rijmen V and Daemen J, “Advanced encryption standard,” Proc. Fed. Inf. Process. Stand. Publ. Natl. Inst. Stand. Technol, pp. 19–22, 2001.
- [4] Schneier B, “The blowfish encryption algorithm,” Dr Dobbs J.-Softw. Tools Prof. Program, vol. 19, no. 4, pp. 38–43, 1994.
- [5] Elminaam DSA, Kader HMA, and Hadhoud MM, “Performance evaluation of symmetric encryption algorithms on power consumption for wireless devices,” Int. J. Comput. Theory Eng, vol. 1, no. 4, Oct. 2009, Art. no. 343.
- [6] Agrawal M and Mishra P, “A comparative survey on symmetric key encryption techniques,” Int. J. Comput. Sci. Eng, vol. 4, no. 5, 2012, Art. no. 877.
- [7] Sobolewski JS, “Cyclic redundancy check,” Encyclopedia of Computer Science. Chichester, UK: Wiley, pp. 476–479.
- [8] IEEE Standard for Telecommunications and Information Exchange Between Systems - LAN/MAN - Specific Requirements - Part 15: Wireless Medium Access Control (MAC) and Physical Layer (PHY) Specifications for Wireless Personal Area Networks (WPANs), IEEE Std 802.15.1-2002, pp. 1–473, Jun. 2002.
- [9] IEEE Standard for Information Technology - Telecommunications and Information Exchange Between Systems - Local and Metropolitan Area Networks - Specific Requirements - Part 11: Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) Specifications, IEEE Std 802.11-1997, pp. i–445, 1997.
- [10] IEEE Standard for Local and Metropolitan Area Networks - Part 15.4: Low-Rate Wireless Personal Area Networks (LR-WPANs), IEEE Std 802.15.4-2011 (Revision of IEEE Std 802.15.4-2006), pp. 1–314, Sep. 2011.
- [11] Lund D, MacGillivray C, Turner V, and Morales M, “Worldwide and regional internet of things (IoT) 2014–2020 forecast: A virtuous circle of proven value and demand,” Int. Data Corp. IDC Tech. Rep, no. 1, 2014.
- [12] Pramstaller N, Mangard S, Dominikus S, and Wolkerstorfer J, “Efficient AES implementations on ASICs and FPGAs,” in Proc. Int. Conf. Advanced Encryption, 2004, pp. 98–112.
- [13] Daemen J and Rijmen V, “AES proposal: Rijndael (version 2),” NIST AES website, csrc.nist.gov/encryption/aes, 1999.
- [14] Omura JK and Massey JL, “United States Patent: 4587627 - Computational method and apparatus for finite field arithmetic,” 4587627, 06-May-1986.
- [15] Daemen J and Rijmen V, The Design of Rijndael: AES - the Advanced Encryption Standard. Berlin, Germany: Springer, 2013.
- [16] Thulasimani L and Madheswaran M, “A single chip design and implementation of AES-128/192/256 encryption algorithms,” Int. J. Eng. Sci. Technol, vol. 2, no. 5, pp. 1052–1059, 2010.
- [17] Daemen J and Rijmen V, “Advanced encryption standard (AES) (FIPS 197),” Kathol. Univ. Leuven ESAT Tech. Rep, 2001.
- [18] Didla S, Ault A, and Bagchi S, “Optimizing AES for embedded devices and wireless sensor networks,” in Proc. 4th Int. Conf. Testbeds Res. Infrastructures Develop. Netw. Communities, 2008, pp. 4:1–4:10.
- [19] Darnall M and Kuhlman D, “AES software implementations on ARM7TDMI,” in Proc. Int. Conf. Cryptology India, 2006, pp. 424–435.
- [20] Bertoni G, Breveglieri L, Fragneto P, Macchetti M, and Marchesin S, “Efficient software implementation of AES on 32-bit platforms,” in Proc. Int. Workshop Cryptographic Hardware Embedded Syst., 2002, pp. 159–171.
- [21] Synopsys V, “Verilog Simulator,” 2004. [Online]. Available: http://www.synopsys.com/products/simulation/simulation.html.
- [22] Goldman R, et al., “Synopsys’ open educational design kit: Capabilities, deployment and future,” in Proc. IEEE Int. Conf. Microelectron. Syst. Edu., 2009, pp. 20–24.
- [23] Gomez C, Oller J, and Paradells J, “Overview and evaluation of bluetooth low energy: An emerging low-power wireless technology,” Sensors, vol. 12, no. 9, pp. 11734–11753, 2012.
- [24] Bui D, Puschini D, Bacles-Min S, Beigné E, and Tran X, “Ultra low-power and low-energy 32-bit datapath AES architecture for IoT applications,” in Proc. Int. Conf. Design Technol., 2016, pp. 1–4.
- [25] Good T and Benaissa M, “692-nW Advanced Encryption Standard (AES) on a 0.13-μm CMOS,” IEEE Trans. Very Large Scale Integr. VLSI Syst, vol. 18, no. 12, pp. 1753–1757, Dec. 2010.
- [26] Sharma TM and Thilagavathy R, “Performance analysis of advanced encryption standard for low power and area applications,” in Proc. IEEE Conf. Inf. Commun. Technol., 2013, pp. 967–972.
- [27] Hamalainen P, Alho T, Hannikainen M, and Hamalainen TD, “Design and implementation of low-area and low-power AES encryption hardware core,” in Proc. 9th EUROMICRO Conf. Digital Syst. Design: Architectures Methods Tools, 2006, pp. 577–583.