Scientific Reports. 2026 Feb 5;16:7364. doi: 10.1038/s41598-026-38147-w

Efficient computation and design of high speed double precision Vedic multiplier architecture

Aruru Sai Kumar 1, G Sahitya 2, Rambabu Kusuma 3, M Sankush Krishna 1, B Naresh Kumar Reddy 4, Suman Lata Tripathi 5
PMCID: PMC12923580  PMID: 41644712

Abstract

Efficient multiplication and addition of floating-point numbers play a crucial role in digital signal processing applications. To achieve high computational performance with minimal resource utilization, an optimized multiplication approach is essential. Vedic mathematics comprises 16 sutras, or algorithms. This paper presents a double-precision floating-point multiplier with a 53-bit mantissa based on Vedic mathematics. The proposed architecture performs multiplication in three stages: sign generation, exponent generation, and mantissa multiplication. The Urdhva Tiryakbhyam sutra is employed for mantissa computation owing to its high efficiency and reduced hardware complexity compared to conventional techniques. The proposed multiplier design is implemented using Verilog HDL on Vivado 2022.2. Experimental results demonstrate a significant reduction in critical path delay and logic utilization compared to existing floating-point and Vedic-based multipliers, while maintaining a favorable power consumption trend. The CNN implementation employing the proposed Vedic double-precision floating-point multiplier achieves the lowest inference latency and power consumption while maintaining identical classification accuracy compared to conventional IEEE-754 and existing Vedic-based multiplier designs. Hardware realization on a Zynq FPGA device further confirms the superiority of the proposed architecture in terms of power, delay and on-board component utilization.

Keywords: Floating point multiplication, Verilog HDL, Prime bit multiplication, Double precision vedic multiplier (DPVM), Zynq FPGA, Area and delay

Subject terms: Engineering, Mathematics and computing

Introduction

Digital Signal Processing (DSP) is the basis for many modern technological applications, including mobile communications, radar, medical imaging, and multimedia devices that process audio and video. In embedded real-time systems, hardware that can perform floating-point multiplication rapidly while consuming minimal power is essential. Mantissa multiplication is an important part of floating-point multiplication. Many multiplication algorithms exist because many different modules require multiplication, and there is a constant need to perform it faster with fewer resources. Urdhva Tiryakbhyam, which comes from Vedic mathematics, is one of the most efficient algorithms in terms of both area and time. Swami Bharati Krishna Tirtha's Vedic mathematics, comprising 16 sutras and 13 sub-sutras, provides algorithms for different arithmetic operations1. It runs faster and is easier to build because it produces fewer partial products than conventional multipliers. The IEEE 754 standard specifies how numbers are represented in floating-point formats. It also defines floating-point arithmetic, conversions between floating-point and integer formats, conversions between different floating-point formats, and exception handling. In addition to optimal multiplication, high-speed addition units are also very important for improving overall DSP performance. Since multiplication requires the repeated addition of partial products, faster adders directly increase system throughput. Carry-Save, Carry-Select, and Carry-Lookahead architectures are common examples of efficient adder designs; they reduce carry-propagation delay, improving both speed and area efficiency. Several factors influence the final choice of adder, such as the hardware used for the implementation framework, the desired accuracy, and the operand size2.

All microprocessors and computer systems offer extensive support for floating-point arithmetic operations. Multiplication is one of the most common floating-point operations and is used in many applications. Efficient multiplication algorithms are needed to implement complex floating-point functions on an FPGA. The mantissa multiplication unit is central to the way these frameworks work, and many multiplication techniques have been developed over time to make this function more efficient. Examples include the Booth, Karatsuba, Array, Wallace Tree, Dadda, and Vedic methods. Each has advantages and disadvantages, and its effectiveness depends on factors such as operand bit width, radix selection, and hardware design methodology3. The Vedic multiplication strategy is based on the Urdhva Tiryakbhyam sutra. This method has shown considerable promise for reducing silicon area and computation time. Originating in ancient Indian mathematics, it multiplies numbers by performing vertical and crosswise computations at the same stage4. The performance of the mantissa multiplier has a major effect on how well the overall method works. Consequently, extensive study has been undertaken to improve multiplication schemes, particularly those based on the Vedic methodology and the Urdhva Tiryakbhyam principle5. For floating-point processors to function effectively, an efficient multiplication algorithm is required that makes the best use of resources with the least delay. Many digital applications, such as digital filters, data processors, and DSPs, rely heavily on floating-point multipliers.

Floating-point multiplications make up about 37% of the floating-point instructions in benchmark applications. Floating-point numbers are represented in single-precision and double-precision binary formats, expressed by exponent, mantissa, and sign fields. Single precision consists of 32 bits, including a 23-bit mantissa, an 8-bit exponent, and one sign bit. Double precision consists of 64 bits, including a 52-bit mantissa, an 11-bit exponent, and one sign bit. The sign bit occupies the most significant bit (MSB): 0 indicates a positive number, while 1 indicates a negative number. In terms of precision, single-precision representation supports around 24 bits of accuracy, whereas double precision provides about 53 bits. Floating-point multiplication of numbers represented in either single-precision or double-precision format involves calculating the sign bit, exponent, and mantissa of the product, followed by normalization and rounding of the result. The sign bit is obtained by XORing the sign bits of the two operands6.
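As a concrete illustration of the field layout described above, the following Python sketch (a behavioral model for exposition, not part of the authors' Verilog design) unpacks a double-precision value into its sign, biased exponent, and mantissa fields:

```python
import struct

def decode_double(x):
    """Split an IEEE-754 double into its (sign, biased exponent, mantissa)
    fields: 1 + 11 + 52 bits, as described above."""
    bits = struct.unpack('<Q', struct.pack('<d', x))[0]
    sign = bits >> 63                     # 0 positive, 1 negative
    exponent = (bits >> 52) & 0x7FF       # biased by 1023
    mantissa = bits & ((1 << 52) - 1)     # 52 stored bits; the leading 1 is implicit
    return sign, exponent, mantissa
```

For example, decode_double(-2.0) gives (1, 1024, 0): the true exponent 1 plus the bias 1023, with an all-zero stored mantissa.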

The product exponent is obtained by adding the exponents of both operands. The biased product exponent is then found by subtracting the bias (127 for single precision, 1023 for double precision) from this sum. The extended precision of the double format allows for more accurate numerical computations and better representation of very small differences between real values. Because of their lower precision, single-precision numbers are more prone to rounding errors. On the other hand, single-precision numbers may be more efficient due to their smaller memory footprint for applications that can tolerate some loss of precision. In general, the choice between single and double precision is determined by the application's specific requirements7.

To address the limitations of existing designs, this work introduces an optimized Vedic multiplier architecture for double-precision floating-point operations. Unlike conventional approaches that employ redundant bit expansion and uniform block decomposition, the proposed design directly operates on a true 53-bit mantissa, eliminating unnecessary hardware overhead. A prime-bit Vedic multiplication strategy is introduced for 13-bit and 53-bit stages, while non-prime bit multipliers are efficiently constructed using hierarchical Vedic decomposition. This approach reduces adder complexity, shortens the critical path, and improves area efficiency without compromising numerical correctness.

This work introduces an optimized Vedic multiplier architecture for double-precision floating-point operations. The key innovative aspects of the proposed design are summarized as follows:

  • The proposed multiplier uses Vedic mathematics, specifically the Urdhva Tiryakbhyam sutra, for multiplication. The design includes two different algorithms, for prime-bit and non-prime-bit multiplication, to reduce delay and increase speed. The 13-bit and 53-bit Vedic multipliers are designed using the prime-bit multiplication algorithm, while the remaining multipliers (6-bit, 12-bit, 26-bit and 52-bit) are designed using the conventional Vedic multiplier based on the Urdhva Tiryakbhyam sutra.

  • The prime-bit multiplication algorithm is based on the Urdhva Tiryakbhyam sutra. The standard Urdhva Tiryakbhyam sutra uses equal block sizes for the inputs (for example, an 8-bit input is divided into two 4-bit blocks), whereas the designed algorithm uses unequal block sizes (for example, the 13-bit input is divided into 1-bit and 12-bit blocks).

  • The derived model uses a 53-bit multiplier to perform the 53-bit mantissa multiplication. The 53-bit multiplier is built from a 52-bit multiplier, the 52-bit multiplier from a 26-bit multiplier, the 26-bit multiplier from a 13-bit multiplier, and the 13-bit multiplier from a 12-bit multiplier.

  • The algorithm used to multiply non-prime-bit numbers uses four n/2-bit Vedic multipliers, two n-bit adders, one n/2-bit adder, and one OR gate.

  • The algorithm used for the 53-bit multiplier uses one (n-1)-bit multiplier, two (n-1)-bit adders, two (n-1)-bit 2:1 MUXes, a full adder, and an AND gate.

  • The comparative analysis underscores the enhanced performance and optimized resource utilization of our proposed double precision Vedic multiplier within the specified FPGA environment using Zynq FPGA board, positioning it as a promising solution for computational tasks requiring efficient multiplication operations.

The organization of this paper is as follows: Sect. 2 presents the related work, while Sect. 3 describes the architecture of the existing double-precision Vedic multiplier. Section 4 discusses the details of the proposed methodology, and Sect. 5 provides the experimental results and analysis. The hardware implementation of the proposed multiplier is explained in Sect. 6, and finally, Sect. 7 concludes the paper.

Related work

Double Precision Vedic multipliers are crucial in DSP applications for enhanced accuracy and precision in numerical computations. Below listed were some of the recent Vedic multiplier architectures and their outcomes.

Ravi Kishore Kodali et al.8 evaluated the performance of an IEEE-754-compliant Vedic multiplier in terms of speed and hardware utilization, comparing it against Karatsuba- and Booth-based floating-point multipliers. This multiplier can carry out both single-precision and double-precision floating-point operations. Anuradha et al.9 conducted a comprehensive examination of the single-precision floating-point multiplier (FPM) architecture to identify potential areas for delay and area minimization. Their study reveals that the FPM delay varies with the mantissa multiplication and exponent normalization stages of the design. Based on these findings, the investigation provides efficient units for mantissa multiplication, exponent addition, and cumulative normalization, which are then combined to form the single-precision FPM architecture. The synthesis results show that the proposed design has a 15.3% lower latency and a 10.8% smaller area than the best current FPM model. The proposed FPM has a 13% lower power-delay product and a 24% lower area-delay product than the baseline configuration.

Soumya Havaldar et al.10 developed a floating-point multiplier capable of managing rounding, underflow, and overflow. The traditional and proposed Vedic floating-point multipliers are designed, synthesized, and evaluated in Verilog using the ISE Simulator, with hardware design and verification carried out on the Xilinx Virtex VI FPGA. The proposed floating-point multiplier is evaluated on a specific FPGA platform using a single synthesis tool, and its performance may vary across different technologies. Additionally, the comparison is limited to selected existing multiplier architectures rather than an exhaustive set of multiplication techniques. Hao Zhang et al.11 implemented a MAC unit that can perform both floating-point and fixed-point computations. The proposed design enables either one 16-bit MAC operation or two 8-bit multiplications with a 16-bit floating-point addition. This approach makes it easy to change the bit widths of the exponent and mantissa, increasing flexibility. Although the proposed MAC unit offers high flexibility across multiple precisions, the increased control complexity and area overhead may limit its efficiency for applications that require only a fixed precision mode.

H. Zhang et al.12 designed a multiplier compatible with SP and DP operations. To minimize the number of DSP blocks needed, the Karatsuba algorithm is used when mapping the mantissa multiplier. To minimize power consumption, iterative methods are used for DP operations; these methods use significantly less hardware than a fully pipelined DP multiplier. To further reduce power consumption, inactive logic blocks corresponding to unused operation modes are selectively turned off. On Virtex-5 devices, the suggested multiplier can reduce DSP blocks by 33%, look-up tables (LUTs) by 4.3%, and flip-flops by 31.2%, all while achieving a 4% faster clock frequency than prior work. The demonstrated multiplier uses fewer DSP blocks and consumes less power than the intellectual-property core DP multiplier. Although the proposed iterative floating-point multiplier achieves significant area and power savings, the iterative operation introduces increased latency compared to fully pipelined multipliers, which may limit its suitability for applications requiring very high throughput or low-latency computation. Jaiswal et al.13 demonstrated three designs that can execute dual single-precision and run-time reconfigurable double-precision (DP) multiplication operations. The first design is appropriate for applications that do not require very high levels of precision, whereas the remaining two designs fully comply with the accuracy standards set by the IEEE. Design-1 runs on the Virtex-4 (V4) and Virtex-5 (V5) with only six DSP48 blocks at around 300 MHz and 450 MHz, respectively, and Design-2 can achieve approximately 325 MHz (V4) and 400 MHz (V5) with only a 9-cycle latency and full precision support, also with six DSP48 blocks. The third design surpasses 250 MHz (V4) and 325 MHz. The truncated multiplication approach in Design-1 trades numerical accuracy for performance, which may not be suitable for applications requiring strict IEEE-754 compliance.

V. Arunachalam et al.14 proposed an FMA with three stages and eight levels of pipelined logic that can operate in either SISD or SIMD mode. The Karatsuba-Offman (KO) method optimally segments the 53-bit mantissa-multiplier (MM) so that both modes can be executed. Six out of ten multipliers and thirteen out of thirty-three adder/subtractors are utilized in the six-stage pipelined MM in SIMD mode. Accordingly, the suggested MM's hardware area is 23.82% smaller while keeping throughput at 923 M samples/s. The four Data Rearrangement Units (DRUs) systematically organize the data at the input, the MM outputs, and the final output, enabling shared use of arithmetic operation units across different modes within the data stream. Even though these DRUs add some extra hardware, the resulting design is modular and identical for both types of computation. Using the TSMC 1P6M CMOS 130 nm library, the suggested FMA cuts power consumption by 49% at 308.7 MHz and overall area by 48% compared to related works. Although the proposed dual-precision FMA achieves significant area and power reduction, the inclusion of multiple pipeline stages and data rearrangement units increases design and control complexity, which may impact scalability and verification effort for more flexible precision modes or higher pipeline depths.

Manish Kumar Jaiswal et al.15 proposed an architecture that can handle latency-varying single-, double-, and double-extended-precision computations. The iterative, multiplication-based architecture for multi-precision quadruple-precision division is characterized by efficient performance and a compact hardware footprint. The most intricate sub-unit, the proposed mantissa division architecture, uses a series-expansion division mechanism. Floating-point division with normal and subnormal processing follows the conventional state-of-the-art flow in the design. The suggested design outperforms the current quadruple-precision divider in terms of speed, area savings (up to 25% equivalent), latency improvement (two times), and performance on the FPGA platform. On the ASIC platform, it outperforms the previous design by more than 50%, improves latency-throughput by three times, and achieves higher speed. Although the proposed architecture achieves significant area and latency improvements, the iterative division approach introduces variable and higher latency compared to fully combinational or deeply pipelined dividers, which may limit its suitability for applications requiring fixed, low-latency division operations.

Existing double precision Vedic multiplier

The existing model, depicted in Fig. 1, is a double-precision multiplier that performs 54-bit multiplication using algorithms such as Booth, Karatsuba, and Vedic. In this model, three 11-bit adders and one bias subtractor are used for exponent calculation in multiplication16–18. However, before performing the multiplication, the exponent needs to be converted from an unbiased format to a biased format by adding 1023 to it. To achieve this, two 11-bit adders are used to add 1023 to the exponents of A and B.

Fig. 1. Existing double precision Vedic multiplier.

After the exponents are biased, they are added together. Once the exponents are added, the product's exponent is re-biased by subtracting 1023 from the sum of the two biased exponents. The biased exponent is then stored in the exponent section of the product. Numerous multipliers, including Booth, Karatsuba, and Vedic, can be used to perform the 54-bit multiplication. Three n-bit adders are needed for n-bit multiplication when using the basic Vedic multiplier, which is the one utilized in this model. A typical difficulty in developing 54-bit multipliers is creating 54-bit operands, which requires extending the 53-bit inputs with an extra zero bit. This empty area in the bit field is often called white space. In hardware implementation, this unused bit range needs extra latches, which increases the overall area of DSP or FPGA systems. The larger area footprint can affect the setup and hold time constraints, which can cause problems such as signal propagation delays or incorrect output responses19,20.

The biggest problem with the standard 54-bit double-precision multiplier is that it has to append a zero to each of the 53-bit operands to form a 54-bit input. This requires extra latch elements to store the unused bits, which takes up more space and increases setup- and hold-time violations. Also, the exponent biasing process, which turns an unbiased exponent into a biased one, needs more than one adder unit, further increasing hardware area and propagation delay. The proposed multiplier design solves these problems by applying efficient algorithms to reduce latency and hardware usage.

Proposed double precision Vedic multiplier

Floating-point arithmetic allows computers to represent and store real numbers in a way that is simple and fast. Each floating-point value has two essential components: the mantissa, which stores the significant digits of the number, and the exponent, which indicates the magnitude of the number, i.e., how much it should be scaled. To obtain the real number, the mantissa is multiplied by the power of two given by the exponent. The IEEE-754 standard defines two main ways to express floating-point numbers: single precision and double precision. The single-precision format has 1 sign bit, 8 exponent bits, and a 23-bit mantissa. The double-precision format has 1 sign bit, 11 exponent bits, and a 52-bit mantissa. This paper is about building a double-precision floating-point multiplier using the ideas of Vedic mathematics. The Vedic sutras, which come from ancient Indian mathematical systems, give simple but powerful ways to carry out complicated calculations more quickly and with less work. The goal of this work is to build a high-performance multiplier that can accomplish 53-bit multiplication using Vedic methods.

The suggested floating-point multiplier architecture combines two different computation schemes to accommodate input bit-widths quickly and easily. The first scheme is designed for prime-bit operands. It uses a (p-1)-bit Vedic multiplier, two (p-1)-bit adders, two (p-1)-bit 2:1 multiplexers, a full adder, and an AND logic element to perform the multiplication. This setup ensures correct operation while making the most efficient use of resources for prime-length datapaths.

Figure 2 illustrates the proposed architecture, which performs the multiplication of two double-precision floating-point numbers, N1 = 1.M1 × 2^E1 and N2 = 1.M2 × 2^E2. The procedure involves multiplying the mantissas and adding the corresponding exponents. Each mantissa, stored as a 52-bit binary value with an implicit leading one, requires a 53-bit multiplier. The exponent addition is executed via an 11-bit adder with a bias adjustment of -1023. Following the multiplication, the outcome is normalized to 52 bits, and the sign of the product is obtained by performing an XOR operation on the sign bits of the two inputs. The proposed method utilizes a 53-bit Vedic multiplier for the 53-bit multiplication, in contrast to a conventional 54-bit multiplier. The design methodology involves the creation of a 52-bit Vedic multiplier derived from a 26-bit Vedic multiplier, which is in turn derived from a 13-bit Vedic multiplier. The study investigates the techniques for designing Vedic multipliers with prime-bit configurations.

Fig. 2. Proposed double precision Vedic multiplier.

The proposed method exhibits superior efficiency relative to current techniques by utilizing a compact 53-bit Vedic multiplier in place of a traditional 54-bit design. This alteration markedly decreases hardware area and propagation latency. The hierarchical construction of a 52-bit Vedic multiplier from smaller Vedic multipliers enables faster computation of double-precision floating-point products. The study presents a technique for creating Vedic multipliers with prime-bit configurations, hence improving design optimization. This algorithm can also be extended to develop efficient multiplier architectures for other floating-point formats.

Algorithm of double precision Vedic multiplier

Let A = ±1.M1 × 2^E1 and B = ±1.M2 × 2^E2 be two floating-point numbers represented in IEEE-754 double-precision format, and let out be the product of the two numbers.

M1 = A[51 : 0], E1 = A[62 : 52], S1 = A[63].

M2 = B[51 : 0], E2 = B[62 : 52], S2 = B[63].

Let P be the product of two 53-bit mantissa and E be the sum of two exponents and S be the sign of the output.

P = {1, M1} × {1, M2} (1)

E = E1 + E2 (2)

S = S1 ⊕ S2 (3)

The computed exponent sum is then adjusted by subtracting the bias to obtain the biased exponent: E = E - 11'd1023.

This exponent and product are passed through a Normalizer to get the final output.

Let MN, EN be the output of the normalizer and MO, EO and SO be the final output.

MO = MN, EO = EN and SO = S.

out[63] = SO, out[62:52] = EO and out[51:0] = MO.
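The algorithm above can be modeled end to end in software. The following Python sketch is a behavioral model of the described flow (S = S1 ⊕ S2, E = E1 + E2 - 1023, P = {1,M1} × {1,M2}, then normalization); the function names are illustrative, it is not the authors' Verilog, and subnormals, rounding, and exception handling are omitted for brevity:

```python
import struct

def fp64_bits(x):
    """Raw 64-bit pattern of a Python float (illustrative helper)."""
    return struct.unpack('<Q', struct.pack('<d', x))[0]

def vedic_fp64_mul(a_bits, b_bits):
    """Behavioral model of the multiplication algorithm, on bit patterns."""
    mant = (1 << 52) - 1
    s1, e1, m1 = a_bits >> 63, (a_bits >> 52) & 0x7FF, a_bits & mant
    s2, e2, m2 = b_bits >> 63, (b_bits >> 52) & 0x7FF, b_bits & mant
    s = s1 ^ s2                                   # Eq. (3): sign
    e = e1 + e2 - 1023                            # Eq. (2) with bias removal
    p = ((1 << 52) | m1) * ((1 << 52) | m2)       # Eq. (1): 53 x 53-bit product
    if p >> 105:                                  # product in [2, 4): shift right
        e += 1
        m = (p >> 53) & mant                      # normalizer output MN
    else:
        m = (p >> 52) & mant
    return (s << 63) | (e << 52) | m              # {SO, EO, MO}
```

For operands whose product is exactly representable, the model reproduces the hardware result bit for bit, e.g. vedic_fp64_mul(fp64_bits(1.5), fp64_bits(2.0)) equals fp64_bits(3.0).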

Design of prime-bit Vedic multiplier

The following algorithm outlines a method for multiplying two n-bit numbers A and B (where n is a prime number) using an (n-1)-bit Vedic multiplier, where the inputs A and B are represented in binary notation as A[n-1:0] and B[n-1:0], respectively. The output is the 2n-bit binary representation of the product, M[2n-1:0].

  1. Initialize the select inputs of MUX1 and MUX2 to the most significant bits of A and B, respectively. Set the inputs of MUX1 and MUX2 to 0 and the n-1 least significant bits of B and A, respectively.

  2. Calculate the (n-1)-bit Vedic multiplier output P[2(n-1)-1:0] as the product of the n-1 least significant bits of A and B.

  3. Use the multiplexers to select the appropriate inputs for the adder (A1). Add the outputs of MUX1 and MUX2 to obtain the sum (S1) and carry (C1).

  4. Use the sum (S1) and the Vedic multiplier output P[2(n-1)-1:n-1] as inputs to the second adder (A2). Add (S1) and P[2(n-1)-1:n-1] to obtain the sum (S2) and carry (C2).

  5. Use the carry (C1) and (C2) along with the logical AND of the most significant bits of A and B to calculate the sum (S3) and carry (C3) of the final full adder.

  6. Assign the output of the (n-1)-bit Vedic multiplier, P[n-2:0], to the least significant n-1 bits of the product M[n-2:0].

  7. Assign the output of the second adder, S2[n-2:0], to bits M[2(n-1)-1:n-1] of the product.

  8. Assign the sum output of the full adder (S3) to bit position 2n-2 of the product, M[2n-2].

  9. Assign the carry output of the full adder (C3) to the most significant bit position of the product M[2n-1].

The prime-bit Vedic multiplier can be used to multiply any number whose bit width is a prime. Suppose a number can be represented by 5 bits and does not require any extra bit to represent it. The general Vedic multiplication algorithm cannot be used to multiply such a number directly; this algorithm makes that multiplication possible.
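The steps above can be sketched as a behavioral Python model (an illustration of the dataflow, not the authors' Verilog; the (n-1)-bit Vedic multiplier stage is modeled with ordinary integer multiplication):

```python
def prime_bit_mult(a, b, n):
    """Behavioral model of the prime-bit scheme: an n-bit multiply built from
    an (n-1)-bit multiply, two (n-1)-bit adders, two 2:1 MUXes, a full adder
    and an AND gate."""
    mask = (1 << (n - 1)) - 1
    a_lo, a_msb = a & mask, (a >> (n - 1)) & 1
    b_lo, b_msb = b & mask, (b >> (n - 1)) & 1
    p = a_lo * b_lo                        # step 2: (n-1)-bit multiplier output
    mux1 = b_lo if a_msb else 0            # step 1: MUX1 selected by A's MSB
    mux2 = a_lo if b_msb else 0            # step 1: MUX2 selected by B's MSB
    t = mux1 + mux2                        # step 3: adder A1
    s1, c1 = t & mask, t >> (n - 1)
    t = s1 + (p >> (n - 1))                # step 4: adder A2 over P's upper half
    s2, c2 = t & mask, t >> (n - 1)
    t = c1 + c2 + (a_msb & b_msb)          # step 5: final full adder
    s3, c3 = t & 1, t >> 1
    # steps 6-9: assemble the 2n-bit product
    return (p & mask) | (s2 << (n - 1)) | (s3 << (2 * n - 2)) | (c3 << (2 * n - 1))
```

Because the full adder absorbs both adder carries and the AND term, the decomposition is exact for any n, which can be checked against ordinary integer multiplication.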

Design of 13-bit Vedic multiplier

An algorithm based on the vertical and crosswise multiplication principle from ancient Vedic computation methods is utilized to perform the 13-bit input multiplication. The process involves partial product generation using vertical and crosswise multiplication. However, for 13-bit multiplication the block sizes for each unit in the vertical and crosswise multiplication are not the same: the inputs are divided into 1-bit and 12-bit blocks.

The algorithm generates four partial products as shown in Fig. 3 by multiplying the two 12-bit blocks, one 12-bit block with a 1-bit block, one 1-bit block with a 12-bit block, and one 1-bit block with another 1-bit block. The first partial product is generated by using a 12-bit Vedic multiplier, which is depicted in Fig. 4, while multiplexers are employed to produce partial products two and three. Each multiplexer output depends on the value of the 1-bit block, which acts as a select line; two 12-bit 2:1 multiplexers are used to generate partial products two and three. The fourth partial product is generated using an AND gate.

Fig. 3. Partial products generation for 13-bit Vedic multiplier.

Fig. 4. 12-bit Vedic multiplier.

The 13-bit Vedic multiplier is depicted in Fig. 5. The 12 LSB bits of the first partial product act as the product's 12 LSB bits, while a 12-bit adder combines partial products two and three. The sum generated by this adder is added to the 12 most significant bits (MSBs) of the first partial product to obtain the next 12 bits of the final product. The carry obtained from the addition of the second and third partial products, the carry resulting from the addition of this adder's output to the 12 MSBs, and the fourth partial product are processed using a full adder. The carry and sum outputs of this full adder represent the two most significant bits of the final product.

Fig. 5. 13-bit Vedic multiplier.

This algorithm enables the multiplication of prime-bit (p) inputs by constructing a p-bit Vedic multiplier using a (p-1)-bit Vedic multiplier. The same principle is applied in the design of the 53-bit Vedic multiplier. In this approach, the 12-bit Vedic multiplier is derived from a 6-bit Vedic multiplier, which itself is built upon a 3-bit Vedic multiplier, following a hierarchical design methodology.

The following algorithm outlines the process for multiplying two 13-bit inputs using a combination of vertical and crosswise multiplication, as well as partial product generation and addition:

  1. Input A and B, both 13 bits each.

  2. Set I1 of mux1 and mux2 to 12'b0.

  3. Set I2 of mux1 to B[11:0] and S of mux1 to A[12].

  4. Set I2 of mux2 to A[11:0] and S of mux2 to B[12].

  5. Generate the first partial product by multiplying A[11:0] and B[11:0]. Store the result in P[23:0].

  6. Use mux1 and mux2 to generate two more partial products:

  • If A[12] is 0, set MUX1 output to 0; if A[12] is 1, set MUX1 output to B[11:0].

  • If B[12] is 0, set MUX2 output to 0; if B[12] is 1, set MUX2 output to A[11:0].

  7. Add the outputs of mux1 and mux2 using an adder to generate the second partial product. Store the sum in (S1) and the carry in (C1).

  8. Add the first partial product's upper half (P[23:12]) to the result of step 7 using an adder to generate the third partial product. Store the sum in (S2) and the carry in (C2).

  9. Use a full adder to add the fourth partial product, (A[12] AND B[12]), to the carries (C1) and (C2) from steps 7 and 8. Store the sum in (S3) and the carry in (C3).

  10. Assign the outputs as follows:

  • M[11:0] = P[11:0].

  • M[23:12] = S2[11:0].

  • M[24] = S3.

  • M[25] = C3.

  11. Output M as the final product.
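As a cross-check, the numbered steps can be transcribed directly into Python (a behavioral sketch with a hypothetical function name, not the authors' Verilog; the 12-bit Vedic multiplier of step 5 is modeled with ordinary integer multiplication):

```python
def mul13(a, b):
    """Behavioral transcription of the 13-bit multiplication steps."""
    P = (a & 0xFFF) * (b & 0xFFF)                 # step 5: 12-bit multiply
    mux1 = (b & 0xFFF) if (a >> 12) & 1 else 0    # step 6: MUX1, select = A[12]
    mux2 = (a & 0xFFF) if (b >> 12) & 1 else 0    # step 6: MUX2, select = B[12]
    t = mux1 + mux2                               # step 7: first 12-bit adder
    S1, C1 = t & 0xFFF, t >> 12
    t = S1 + (P >> 12)                            # step 8: add P[23:12]
    S2, C2 = t & 0xFFF, t >> 12
    t = C1 + C2 + (((a >> 12) & (b >> 12)) & 1)   # step 9: full adder with A[12] AND B[12]
    S3, C3 = t & 1, t >> 1
    # step 10: assemble M[25:0]
    return (P & 0xFFF) | (S2 << 12) | (S3 << 24) | (C3 << 25)
```

For any 13-bit operands the result matches ordinary integer multiplication, e.g. mul13(0x1FFF, 0x1FFF) equals 8191 × 8191.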

Design of 26-bit Vedic multiplier

Four 13-bit Vedic multipliers, two 26-bit adders, one 13-bit adder, and an OR gate can be used to create the 26-bit Vedic multiplier, as shown in Fig. 6. The inputs to each 13-bit Vedic multiplier are given in a vertical and crosswise manner, as specified in the Urdhva Tiryakbhyam sutra, to produce four partial products. The first partial product's 13 LSB bits function as the output's 13 LSB bits. A 26-bit adder is used to combine partial products two and three. As shown in the diagram, the output of this 26-bit adder serves as one of the inputs for the second 26-bit adder. This adder's second input is a combination of the 13 LSB bits of partial product four and the 13 MSB bits of partial product one. The output of this second 26-bit adder serves as the next 26 bits of the output. Both adders generate carries, which are passed to the OR gate, whose output is added to the 13 MSB bits of partial product four using a 13-bit adder. The output of this 13-bit adder serves as the output's 13 MSB bits.

Fig. 6. 26-bit Vedic multiplier.
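This even-width composition can be sketched as a behavioral Python model (an illustration only, not the authors' Verilog; the four half-width multipliers are modeled by a caller-supplied function, and the single OR gate assumes, as the design implies, that the two wide adders never carry out simultaneously):

```python
def vedic_even(a, b, n, half_mult):
    """Behavioral model of the even-width Vedic block: four n/2-bit
    multipliers, two n-bit adders, one n/2-bit adder and one OR gate."""
    h = n // 2
    hmask = (1 << h) - 1
    nmask = (1 << n) - 1
    a_lo, a_hi = a & hmask, a >> h
    b_lo, b_hi = b & hmask, b >> h
    pp1 = half_mult(a_lo, b_lo)                  # vertical, LSB side
    pp2 = half_mult(a_hi, b_lo)                  # crosswise
    pp3 = half_mult(a_lo, b_hi)                  # crosswise
    pp4 = half_mult(a_hi, b_hi)                  # vertical, MSB side
    t = pp2 + pp3                                # first n-bit adder
    sum1, ca = t & nmask, t >> n
    concat = ((pp4 & hmask) << h) | (pp1 >> h)   # {PP4 LSBs, PP1 MSBs}
    t = sum1 + concat                            # second n-bit adder
    sum2, cb = t & nmask, t >> n
    top = (pp4 >> h) + (ca | cb)                 # n/2-bit adder fed by the OR gate
    return (pp1 & hmask) | (sum2 << h) | (top << (3 * h))
```

A quick exhaustive check at n = 4 (with integer multiplication standing in for the half-width Vedic blocks) confirms the composition against ordinary multiplication.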

Design of 52-bit Vedic multiplier

Similar to the 26-bit Vedic multiplier, the 52-bit Vedic multiplier can be created by combining four 26-bit Vedic multipliers, two 52-bit adders, one 26-bit adder, and an OR gate, as shown in Fig. 7. The inputs to each 26-bit Vedic multiplier are given vertically and crosswise, as specified in the Urdhva Tiryakbhyam sutra, to produce four partial products.

Fig. 7. 52-bit Vedic multiplier.

The first partial product's 26 LSB bits function as the output's 26 LSB bits. A 52-bit adder is used to combine partial products two and three. As shown in the diagram, the output of this 52-bit adder serves as one of the inputs for the second 52-bit adder. This adder's second input is a combination of the 26 LSB bits of partial product four and the 26 MSB bits of partial product one. The output of this second 52-bit adder serves as the next 52 bits of the output. Both adders generate carries, which are passed to the OR gate, whose output is added to the 26 MSB bits of partial product four using a 26-bit adder. The output of this 26-bit adder serves as the output's 26 MSB bits.

Design of 53-bit Vedic multiplier

The 53-bit Vedic multiplier shown in Fig. 8 follows the same design as the 13-bit Vedic multiplier. The only difference is that block 1 is 52 bits wide here, whereas it is 12 bits wide in the 13-bit Vedic multiplier; accordingly, a 52-bit 2:1 multiplexer is used in the 53-bit Vedic multiplier.

Fig. 8.

Fig. 8

53-bit Vedic multiplier.

Thus, the multiplication of 53-bit inputs proceeds as follows:

  1. Input A and B, both 53 bits each.

  2. Set I1 to 52’b0.

  3. Set I2 of mux1 to B[51:0] and S of mux1 to A[52].

  4. Set I2 of mux2 to A[51:0] and S of mux2 to B[52].

  5. Generate the first partial product by multiplying A[51:0] and B[51:0]. Store the result in P[103:0].

  6. Use mux1 and mux2 to generate two more partial products:

  • If A[52] is 0, set MUX1 output to 0; if A[52] is 1, set MUX1 output to B[51:0].

  • If B[52] is 0, set MUX2 output to 0; if B[52] is 1, set MUX2 output to A[51:0].

  7. Add the outputs of mux1 and mux2 using an adder to generate the second partial product. Store the result in (S1) and carry in (C1).

  8. Add the first partial product (P[103:52]) to the result of step 7 using an adder to generate the third partial product. Store the result in (S2) and carry in (C2).

  9. Use a full adder to add the fourth partial product, which is the result of (A[52] AND B[52]), to the sum generated in step 8 (S2). Store the result in (S3) and carry in (C3).

  10. Assign the outputs as follows:

  • M[51:0] = P[51:0].

  • M[103:52] = S2[51:0].

  • M[104] = S3.

  • M[105] = C3.

  11. Output M as the final product.

Design of normalizer

The normalizer adjusts the mantissa and exponent of the product to ensure proper normalization. The adjustment depends on bit 105 of the 106-bit output generated by the 53-bit Vedic multiplier. If bit 105 is 0, the exponent remains unchanged and the mantissa is taken from bits 103 down to 52 of the product, the lower bits 51 down to 0 being truncated. Conversely, if bit 105 is 1, the mantissa is shifted right by one bit, so the new mantissa corresponds to bits 104 down to 53 of the product, and the exponent is incremented by one.
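The normalization rule reduces to a two-way select, sketched below in Python (truncation only, with no rounding, matching the description above; `product` is the 106-bit mantissa product and `exponent` the previously computed biased exponent):

```python
def normalize(product, exponent):
    """Normalizer model: select the 52 stored fraction bits below the
    leading one of the 106-bit product and adjust the exponent when the
    product has overflowed into bit 105."""
    frac_mask = (1 << 52) - 1
    if (product >> 105) & 1:
        # Product in [2, 4): take bits 104:53 and increment the exponent.
        return (product >> 53) & frac_mask, exponent + 1
    # Product in [1, 2): take bits 103:52; exponent is unchanged.
    return (product >> 52) & frac_mask, exponent
```

As a sanity check, squaring the mantissa pattern for 1.5 (product 2.25) yields a right shift and an exponent increment, while squaring the pattern for 1.0 leaves both unchanged.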

Experimental results

The proposed double-precision Vedic multiplier is implemented in Verilog HDL and verified through simulation and synthesis using Xilinx Vivado 2023.1. The simulation and schematic waveforms for the 3-bit, 6-bit, 12-bit, 13-bit, 26-bit, 52-bit, 53-bit, and 64-bit Vedic multipliers and the normalizer are depicted below. The 53-bit Vedic multiplier, which uses the Urdhva Tiryakbhyam sutra, multiplies the 53-bit mantissas.

The RTL schematics for the 13-bit, 26-bit, 52-bit, and 53-bit Vedic multipliers, the normalizer, and the proposed double precision Vedic multiplier are depicted in Figs. 9, 10, 11, 12, 13 and 14.

Fig. 9.

Fig. 9

RTL schematic for 13-Bit Vedic multiplier.

Fig. 10.

Fig. 10

RTL schematic for 26-bit Vedic multiplier.

Fig. 11.

Fig. 11

RTL schematic for 52-bit Vedic multiplier.

Fig. 12.

Fig. 12

RTL schematic for 53-bit Vedic multiplier.

Fig. 13.

Fig. 13

RTL schematic for normalizer.

Fig. 14.

Fig. 14

RTL schematic for proposed double precision Vedic multiplier.

The simulation waveforms for the 13-bit, 26-bit, 52-bit, and 53-bit Vedic multipliers and the proposed double precision Vedic multiplier are depicted in Figs. 15, 16, 17, 18 and 19.

Fig. 15.

Fig. 15

Simulation waveform for 13-bit Vedic multiplier.

Fig. 16.

Fig. 16

Simulation waveform for 26-bit Vedic multiplier.

Fig. 17.

Fig. 17

Simulation waveform for 52-bit Vedic multiplier.

Fig. 18.

Fig. 18

Simulation waveform for 53-bit Vedic multiplier.

Fig. 19.

Fig. 19

Simulation waveform for proposed double precision Vedic multiplier.

A comparison of the Vedic multipliers in terms of delay is presented in Fig. 20. Table 1 compares LUTs, IOBs, delay, and power dissipation for the 13-bit, 26-bit, 52-bit, 53-bit, and 64-bit Vedic multipliers.

Fig. 20.

Fig. 20

Comparison of various existing methods with the proposed double precision Vedic multiplier in terms of delay (ns).

Table 1.

Comparison of various parameters for different input bit-width Vedic multipliers under worst-case timing conditions.

Parameters 13-bit 26-bit 52-bit 53-bit 64-bit
LUTs 227 (0.48%) 886 (1.88%) 3547 (7.51%) 3712 (7.86%) 3662 (7.76%)
IOBs 52 (18.25%) 104 (36.49%) 208 (72.98%) 212 (74.39%) 192 (67.37%)
Power (mW) 392 768 791 791 842
Worst-case delay (ns) 12.272 15.341 22.679 26.370 28.574

The reported delay values correspond to the worst-case critical path delay of each design, obtained from post-synthesis timing analysis.

Furthermore, the proposed model is compared in terms of area and delay with various algorithms such as booth multiplication, Karatsuba algorithm, and previously existing Vedic multipliers as illustrated in Table 2.

Table 2.

Comparison of various parameters of proposed double precision Vedic multiplier with existing algorithms.

Parameters Booth8 Karatsuba8 Existing Vedic Multiplier18 Existing Vedic Multiplier25 Existing Vedic Multiplier310 Proposed Double Precision Vedic Multiplier
LUTs 324 15,494 9150 4704 5153 3662
IOBs 195 192 192 203 192 192
Delay (ns) 82.939 34.081 30.381 28.731 45.169 26.574

The performance comparison presented in this work focuses on architectural efficiency at the RTL level. A strict technology-normalized comparison (same process node, supply voltage, operating frequency, and temperature) with existing studies is not feasible, as prior works do not report identical implementation conditions or provide reusable design constraints. Therefore, re-implementation of all reference architectures under the same technology specifications is beyond the scope of this study. The reported improvements in delay and area are primarily attributed to the proposed prime-bit Vedic multiplier architecture and elimination of redundant mantissa bit expansion, rather than to FPGA technology scaling.

Power consumption comparison with existing works is not directly performed because prior studies employ different FPGA platforms, technology nodes, operating frequencies, and power estimation tools. Such variations can lead to misleading conclusions. Therefore, power analysis is reported only for the proposed design using Vivado 2023.1 on a Zynq FPGA under consistent operating conditions.

The performance of floating-point multipliers inherently involves a trade-off among speed, area, and power. In this work, speed and area comparisons with existing architectures are presented to highlight architectural efficiency. A comprehensive ADP (Area–Delay Product) and PDP (Power–Delay Product) comparison across all existing designs is not reported because prior works do not provide complete and uniform power measurements under identical technology specifications, operating frequency, voltage, and temperature conditions. Direct computation of ADP and PDP using heterogeneous data may lead to misleading conclusions. For the proposed design, reduced delay and LUT utilization indicate an improved ADP trend, while lower switching activity due to reduced logic depth suggests improved PDP behavior. Hence, the proposed architecture demonstrates a favorable speed–area–power trade-off at the architectural level.
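As a rough single-design illustration of these trends (not a cross-work comparison), the products can be computed directly from the reported figures. The pairing of the Table 1 power value with the 64-bit delay below is an assumption made purely for illustration:

```python
# Figures reported for the proposed design (Table 2) and for the
# 64-bit configuration (Table 1). The power/delay pairing for PDP is
# illustrative only, as discussed in the text.
lut_count = 3662          # LUTs, proposed design (Table 2)
delay_ns = 26.574         # worst-case delay, ns (Table 2)
power_mw = 842            # power, mW (Table 1, 64-bit row)
delay64_ns = 28.574       # worst-case delay, ns (Table 1, 64-bit row)

adp_lut_ns = lut_count * delay_ns   # area-delay product, LUT*ns
pdp_pj = power_mw * delay64_ns      # mW * ns = pJ, power-delay product

print(f"ADP ~ {adp_lut_ns:.0f} LUT*ns, PDP ~ {pdp_pj / 1000:.1f} nJ")
```

The same two lines of arithmetic can be repeated for any competing design once its power, delay, and LUT figures are available under matched conditions.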

Image processing algorithms such as convolution, filtering, edge detection, and feature extraction rely heavily on repeated multiplication and accumulation operations. The proposed Vedic multiplier can serve as a computationally efficient building block for such applications by reducing arithmetic latency and hardware complexity. Improvements in critical path delay and logic utilization directly translate to faster pixel processing and higher throughput in image processing pipelines. Although advanced image processing systems are not implemented in this work, the proposed multiplier is well suited for integration into image processing accelerators and real-time vision systems.

CNN-based application and performance evaluation

The practical benefit of the double-precision Vedic floating-point multiplier has been validated by using it in a Convolutional Neural Network (CNN) model, in which multiply-and-add operations make up the majority of the total computation. Since CNNs form the basis of applications such as object detection, medical imaging, and classification, they provide an ideal framework for analyzing the performance of the arithmetic unit. The CNN follows a standard architecture of convolution, pooling, and fully connected layers. The convolutional layers dominate the computation, using floating-point multipliers in the MAC units to perform kernel-input feature map multiplications. In this work, these conventional floating-point multipliers in the convolution MAC units are replaced with the proposed Vedic double-precision floating-point multiplier.

The experimental results demonstrate that the CNN implementation using the proposed multiplier achieves a significant decrease in convolution computation delay, owing to the faster mantissa multiplication enabled by the Urdhva Tiryakbhyam sutra. Table 3 presents a comparative performance evaluation of the proposed double-precision Vedic floating-point multiplier against conventional IEEE-754 and existing Vedic-based multipliers when integrated into the convolution operations of a CNN architecture.
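To make the multiplier's role concrete, the sketch below shows a MAC-only 2D convolution in Python. Here `multiply` is a stand-in for the hardware floating-point multiplier (IEEE-754 or the proposed Vedic unit), and the feature-map and kernel values are illustrative:

```python
def multiply(a, b):
    """Stand-in for the hardware floating-point multiplier; in the FPGA
    design this is the point where the Vedic double-precision unit (or a
    conventional IEEE-754 multiplier) is instantiated."""
    return a * b

def conv2d(fmap, kernel):
    """Valid-mode 2D convolution built purely from multiply-accumulate
    operations, mirroring the MAC structure of a convolutional layer."""
    kh, kw = len(kernel), len(kernel[0])
    oh = len(fmap) - kh + 1
    ow = len(fmap[0]) - kw + 1
    out = [[0.0] * ow for _ in range(oh)]
    for i in range(oh):
        for j in range(ow):
            acc = 0.0
            for u in range(kh):           # every output pixel is a chain
                for v in range(kw):       # of kh*kw MAC operations
                    acc += multiply(fmap[i + u][j + v], kernel[u][v])
            out[i][j] = acc
    return out
```

Every output pixel costs kh x kw multiplications, which is why the multiplier's critical path dominates convolution latency.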

Table 3.

Comparative performance analysis of CNN implementation using conventional IEEE-754, existing Vedic-based, and proposed Vedic double-precision floating-point multipliers.

Multiplier type Delay (ns) LUTs Power (mW) CNN inference time Accuracy
IEEE-754 FP High High High Highest Same
Existing Vedic Medium Medium Medium Moderate Same
Proposed Work Low Low Low Lowest Same

Hardware implementation

For the hardware design, the Verilog-coded multiplier architecture has been synthesized and simulated within the Vivado Design Suite 2023.1. The design targets a Zynq FPGA board from Xilinx's FPGA family, which provides features such as high-speed transceivers, clock management tiles, and DSP blocks. The hardware evaluation compares the proposed double precision Vedic multiplier against comparative multipliers5,8,10, all assessed on the same Zynq FPGA device. The evaluation focuses primarily on performance, gauged by the delay metric; in this regard, the proposed multiplier exhibits noteworthy speed advantages over the related designs, as indicated in the combinational delay comparison in Fig. 21.

Fig. 21.

Fig. 21

Hardware implementation comparison of various existing methods with the proposed double precision Vedic multiplier in terms of delay (ns).

Crucial design metrics for multipliers within an FPGA are delay and area. Area is assessed through lookup tables (LUTs), flip-flops, and configurable logic blocks (CLBs), with FPGA device utilization quantified in 4-input LUTs. Figure 22 shows the FPGA resource utilization in 4-input LUTs, highlighting the advantageous utilization of the proposed multiplier over the other referenced designs5,8,10. The proposed multiplier requires fewer LUTs, signifying more efficient resource utilization than its counterparts. This comparative analysis underscores the enhanced performance and optimized resource utilization of the proposed double precision Vedic multiplier within the specified FPGA environment, positioning it as a promising solution for computational tasks requiring efficient multiplication.

Fig. 22.

Fig. 22

Hardware implementation comparison of various existing methods with proposed double precision Vedic Multiplier in terms of Area (LUTs).

Conclusion and future scope

A low power, high speed, and area-efficient double precision floating point multiplier has been designed and synthesized using Xilinx Vivado 2022.2 software. In the proposed double precision floating point multiplier, the Urdhva Tiryakbhyam sutra is used for 53-bit mantissa multiplication. The proposed model combines two algorithms to design the 53-bit multiplier: one is used for the 13-bit and 53-bit multipliers and the other for the 26-bit and 52-bit multiplications. Apart from the multiplier, adders and a normalizer are designed for the exponent and mantissa generation of the output. The proposed design demonstrates a significant reduction in critical path delay and logic utilization compared to existing Vedic-based floating-point multipliers. Although a fully normalized power comparison across different platforms is not feasible, the reduced logic complexity and shorter critical paths indicate a favorable power consumption trend. The proposed Vedic double-precision floating-point multiplier enables faster and more energy-efficient CNN inference with the same accuracy as existing designs. The hardware evaluation of the proposed double precision Vedic multiplier is carried out against comparative related works assessed on the same Zynq FPGA device by comparing delay (ns) and area (LUTs).

While the current work focuses on the FPGA-based design and evaluation of a Vedic multiplier, future efforts will extend the architecture toward a reversible Vedic multiplier to further reduce power dissipation. The proposed multiplier will also be integrated into a multiply-accumulate (MAC) unit for system-level DSP evaluation and explored in advanced image processing applications such as convolution and filtering. Furthermore, the design will be synthesized and evaluated using ASIC standard-cell libraries to obtain technology-normalized area, delay, and power metrics and to enable comprehensive comparison with existing ASIC-based floating-point multipliers. In addition, future work will include synthesizing image processing applications using multiple multiplier architectures to quantitatively evaluate energy-speed trade-offs under identical technology and operating conditions.

Acknowledgements

The authors acknowledge all researchers for their contributions in the related area, which helped to establish the foundation of the manuscript.

Author contributions

Aruru Sai Kumar has worked on introduction and result section, G. Sahitya worked on review, Rambabu Kusuma has worked on result analysis, M. Sankush Krishna has worked on comparison and editing, B. Naresh Kumar Reddy has worked design description and method, Suman Lata Tripathi has written discussion and conclusion.

Funding

Open access funding provided by Symbiosis International (Deemed University).

Data availability

The data will be made available upon reasonable request to the first author.

Declarations

Competing interests

The authors declare no competing interests.

Ethical approval

The authors declare that all procedures followed were in accordance with the ethical standards.

Consent for publication

All the authors declare their consent for publication of the article on acceptance.

Consent to participate

All the authors declare their consent to participate in this research article.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  1. Sujina, S. & Remya, R. An effective method for hardware multiplication using Vedic mathematics. In: 2018 International Conference on Advances in Computing, Communications and Informatics (ICACCI), Bangalore, 1499–1504 (2018).
  2. Gowreesrinivas, K. V. & Samundiswary, P. Comparative study on performance of single precision floating point multiplier using Vedic multiplier and different types of adders. In: International Conference on Control, Instrumentation, Communication and Computational Technologies (ICCICCT), 466–471 (2016).
  3. Jayakumar, S. & Sumathi, S. High speed Vedic multiplier for image processing using FPGA. In: 10th International Conference on Intelligent Systems and Control (ISCO), 1–4 (2016).
  4. Paschalakis, S. & Lee, P. Double precision floating-point arithmetic on FPGAs. In: Second IEEE International Conference on Field Programmable Technology (FPT’03), 352–358 (2003).
  5. Tabassum, H. & Rao, S. Design of double precision floating point multiplier using Vedic multiplication. Int. J. Electr. Electron. Res. 3(3) (2015).
  6. Rao, K. D., Muralikrishna, P. V. & Gangadhar, C. FPGA implementation of 32 bit complex floating point multiplier using Vedic real multipliers with minimum path delay. In: 2018 5th IEEE Uttar Pradesh Section International Conference on Electrical, Electronics and Computer Engineering (UPCON), 1–6 (2018).
  7. Shanmugapriyan, S. & Sivanandam, K. Area efficient run time reconfigurable architecture for double precision multiplier. In: IEEE 9th International Conference on Intelligent Systems and Control (ISCO), 1–6 (2015).
  8. Kodali, R. K., Boppana, L. & Yenamachintala, S. S. FPGA implementation of Vedic floating point multiplier. In: IEEE International Conference on Signal Processing, Informatics, Communication and Energy Systems (SPICES), 1–4 (2015).
  9. Anuradha et al. An area-delay efficient single-precision floating-point multiplier for VLSI systems. Microprocess. Microsyst. 98 (2023).
  10. Havaldar, S. & Gurumurthy, K. S. Design of Vedic IEEE 754 floating point multiplier. In: IEEE International Conference on Recent Trends in Electronics, Information & Communication Technology (RTEICT), 1131–1135 (2016).
  11. Zhang, H., Chen, D. & Ko, S. B. New flexible multiple-precision multiply-accumulate unit for deep neural network training and inference. IEEE Trans. Comput. 69(1), 26–38 (2020).
  12. Zhang, H., Chen, D. & Ko, S. B. Area- and power-efficient iterative single/double precision merged floating-point multiplier on FPGA. IET Comput. Digit. Tech. 11(4), 149–158 (2017).
  13. Jaiswal, M. K. & Cheung, R. C. C. Area-efficient architectures for double precision multiplier on FPGA, with run-time-reconfigurable dual single precision support. Microelectron. J. 44(5), 421–430 (2013).
  14. Arunachalam, V., Raj, A. N. J., Hampannavar, N. & Bidul, C. B. Efficient dual precision floating-point fused-multiply-add architecture. Microprocess. Microsyst. 57, 23–31 (2018).
  15. Jaiswal, M. & So, H. A unified architecture for single, double, double-extended, and quadruple precision division. Circuits Syst. Signal Process. 37(1), 383–407 (2018).
  16. Wang, X. VFloat: A variable precision fixed- and floating point library for reconfigurable hardware. ACM Trans. Reconfigurable Technol. Syst. (TRETS) 3(3), 1–34 (2010).
  17. Martínez-García, M. S., de Castro, A., Sanchez, A. & Garrido, J. LOCOFloat: A low-cost floating-point format for FPGAs: Application to HIL simulators. Electronics 9(1), 81 (2020).
  18. Jaiswal, M. & Cheung, R. VLSI implementation of double-precision floating-point multiplier using Karatsuba technique. Circuits Syst. Signal Process. 32(1), 15–27 (2013).
  19. Gowreesrinivas, K. V. & Samundiswary, P. Improvised hierarchy of floating point multiplication using 5:3 compressor. Int. J. Electron. Lett. 10(1), 87–100 (2022).
  20. Araujo, T., Cardoso, M., Nepomuceno, E., Llanos, C. & Arias-Garcia, J. A new floating point adder FPGA-based implementation using RN-coding of numbers. Comput. Electr. Eng. 90 (2021).



Articles from Scientific Reports are provided here courtesy of Nature Publishing Group
