Abstract
This paper presents a parallel implementation of a kind of discrete Fourier transform (DFT): the vector-valued DFT. The vector-valued DFT is a novel tool for analyzing the spectra of vector-valued discrete-time signals. The parallel implementation is developed in terms of a mathematical framework built on a set of block matrix operations, which contribute to the analysis, design, and implementation of parallel algorithms on multicore processors. In this work, the mathematical framework is implemented and investigated experimentally using MATLAB with the Parallel Computing Toolbox. We found that using multicore processors and a parallel computing environment is advantageous for reducing the high execution time. Additionally, the speedup increases as the number of logical processors and the length of the signal increase.
1. Introduction
Let $l^2(\mathbb{Z}_n, \mathbb{C}^d)$ be the space of vector-valued discrete-time signals with $n$ samples, where each sample is a complex vector of length $d$. Vector-valued discrete-time signals are used in several applications in signal processing and electrical engineering, for example, vector quantization of images [1], time-frequency localization with wavelets [2], image coding [3], vector filter bank theory [4], linear time-dependent MISO [5], and analysis of MMSE estimation for compressive sensing of block sparse signals [6].
To analyze the spectra of vector-valued discrete-time signals, a novel tool called the vector-valued DFT was developed [7, 8]. This transform has applications in vector analysis in complex, quaternion, biquaternion, and Clifford algebras [8]. Additionally, the vector-valued DFT is used in digital signal processing, for example, in the study of new complex-valued constant amplitude zero autocorrelation (CAZAC) signals [9]. These signals serve as coefficients for phase-coded waveforms with prescribed vector-valued ambiguity function behavior, which is relevant in light of time-frequency analysis, vector sensor, and MIMO technologies [7].
This paper presents a parallel framework for the vector-valued DFT. The major contributions of this paper are summarized as follows:
- The construction of a new mathematical structure for the vector-valued DFT using block matrix theory, which allows a parallel implementation on multicore processors.
- The reduction of the elapsed time needed to compute the vector-valued DFT of a vector-valued discrete-time signal, using parallel computing through the aforementioned mathematical framework.
This new framework is developed with a set of block matrix operations, namely, the Kronecker product, direct sum, stride permutation, vec operator, and vec inverse operator (see Section 2.1 for details). These block matrix operations contribute to the analysis, design, and implementation of parallel algorithms on multicore processors [10–12]. The mathematical framework is inspired by the matrix representation of the Cooley-Tukey fast Fourier transform (FFT) algorithm for complex discrete-time signals, which corresponds to the decomposition of the transform size $n$ into the product of two factors $r$ and $s$ and is developed in [10, 12, 13].
The paper is organized as follows. Section 2 reviews the mathematical background on block matrix operations and the discrete Fourier transform. Section 3 defines the vector-valued DFT for vector-valued discrete-time signals. Section 4 develops a mathematical framework for the vector-valued DFT in terms of block matrix operations for vector-valued discrete-time signals of length $n = rs$; this framework supports the implementation of parallel algorithms. Section 5 describes an implementation and experimental investigation of the framework using parallel computing on multicore processors with MATLAB. Finally, conclusions are presented in Section 6.
Throughout the paper, the following notation is used. $\mathbb{Z}_n = \{0, 1, \dots, n-1\}$ is the additive group of integers modulo $n$, $\mathbb{C}^{m \times n}$ is the space of matrices with $m$ rows, $n$ columns, and complex entries, and $\mathbb{C}^n = \mathbb{C}^{n \times 1}$. The rows and columns of $A \in \mathbb{C}^{m \times n}$ are indexed by the elements of $\mathbb{Z}_m$ and $\mathbb{Z}_n$, respectively. $A(j,k)$, $A(j,:)$, $A(:,k)$, and $A^T$ denote entry $(j,k)$, row $j$, column $k$, and the transpose of $A$, respectively. $I_n \in \mathbb{C}^{n \times n}$ is the identity matrix.
2. Background
2.1. Block Matrix Operations
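A block matrix $A \in \mathbb{C}^{mp \times nq}$ with $m$ row partitions and $n$ column partitions and a block vector $x \in \mathbb{C}^{mp}$ with $m$ row blocks are defined as
$$A = \begin{pmatrix} A_{0,0} & \cdots & A_{0,n-1} \\ \vdots & \ddots & \vdots \\ A_{m-1,0} & \cdots & A_{m-1,n-1} \end{pmatrix}, \qquad x = \begin{pmatrix} x_0 \\ \vdots \\ x_{m-1} \end{pmatrix}, \tag{1}$$
respectively, where $A_{j,k} \in \mathbb{C}^{p \times q}$ designates block $(j,k)$ and $x_j \in \mathbb{C}^p$ designates block $j$. In this paper, the following block matrix operations are used: Kronecker product, direct sum, stride permutation, vec operator, and vec inverse operator.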
The Kronecker product of two matrices $A \in \mathbb{C}^{m \times n}$ and $B \in \mathbb{C}^{p \times q}$ is the matrix $A \otimes B \in \mathbb{C}^{mp \times nq}$ obtained by replacing every entry $(j,k)$ of $A$ by the matrix $A(j,k)B$. In the special case $A = I_n$, it is called a parallel operation [12].
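The definition can be illustrated with a minimal pure-Python sketch. The helper names `kron` and `identity` are illustrative choices, not part of the paper's MATLAB code; matrices are stored as lists of rows.

```python
def kron(A, B):
    """Kronecker product: every entry A(j, k) is replaced by the block A(j, k) * B."""
    return [
        [a * b for a in row_a for b in row_b]
        for row_a in A
        for row_b in B
    ]


def identity(n):
    """n x n identity matrix as a list of rows."""
    return [[1 if j == k else 0 for k in range(n)] for j in range(n)]


B = [[1, 2],
     [3, 4]]

# I_2 (x) B is block diagonal with two copies of B, so each copy can be
# applied to its own half of the input independently of the other.
P = kron(identity(2), B)
for row in P:
    print(row)
```

Because the off-diagonal blocks of $I_n \otimes B$ are zero, the $n$ products with $B$ share no data, which is what makes this case a natural unit of parallel work.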
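The direct sum of $n$ matrices constructs a block diagonal matrix from a set of matrices; that is, for $\{C_k\}_{k \in \mathbb{Z}_n}$ such that $C_k \in \mathbb{C}^{p_k \times q_k}$,
$$C = \bigoplus_{k \in \mathbb{Z}_n} C_k = \begin{pmatrix} C_0 & & \\ & \ddots & \\ & & C_{n-1} \end{pmatrix}, \tag{2}$$
where $C \in \mathbb{C}^{p \times q}$, $p = \sum_{j \in \mathbb{Z}_n} p_j$, and $q = \sum_{j \in \mathbb{Z}_n} q_j$.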
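A short sketch of the direct sum, assuming matrices stored as lists of rows; the function name `direct_sum` is a hypothetical helper introduced only for illustration.

```python
def direct_sum(blocks):
    """Block diagonal matrix whose diagonal blocks are the given matrices."""
    total_cols = sum(len(b[0]) for b in blocks)
    out, col_offset = [], 0
    for b in blocks:
        width = len(b[0])
        for row in b:
            # Zero padding to the left and right of the current block.
            left = [0] * col_offset
            right = [0] * (total_cols - col_offset - width)
            out.append(left + list(row) + right)
        col_offset += width
    return out


C0 = [[1, 2]]        # 1 x 2 block
C1 = [[3], [4]]      # 2 x 1 block
C = direct_sum([C0, C1])
for row in C:
    print(row)
```

Here the result has $p = 1 + 2 = 3$ rows and $q = 2 + 1 = 3$ columns, matching the sums of the block sizes.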
Let $n = rs$. The stride permutation matrix $L_s^n \in \mathbb{C}^{n \times n}$ permutes the elements of the input signal $x \in \mathbb{C}^n$ as $jr + k \to ks + j$, $j \in \mathbb{Z}_s$, $k \in \mathbb{Z}_r$ [12, 14]. This permutation matrix governs the data flow required to parallelize a Kronecker product computation [12]. Note that the superscript $n$ is an index, not a power.
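The sketch below applies the stride permutation to a list without forming the matrix. It assumes the reading in which output position $jr + k$ receives the input element at position $ks + j$, which is the reading consistent with the proof of Theorem 2; the function name is illustrative.

```python
def stride_permutation(x, s):
    """Apply L_s^n to x, n = r*s: output position j*r + k receives x[k*s + j]."""
    n = len(x)
    if n % s != 0:
        raise ValueError("s must divide len(x)")
    r = n // s
    return [x[k * s + j] for j in range(s) for k in range(r)]


x = list(range(6))
y = stride_permutation(x, 2)
print(y)                          # [0, 2, 4, 1, 3, 5]
print(stride_permutation(y, 3))   # [0, 1, 2, 3, 4, 5]
```

With $n = 6$ and $s = 2$, the output collects the even-indexed samples first and the odd-indexed samples second; applying the permutation with stride $r = 3$ afterwards restores the original order.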
The vec operator, $\mathcal{V} : \mathbb{C}^{m \times n} \to \mathbb{C}^{mn}$, transforms a matrix into a vector by stacking all the columns of the matrix one underneath the other. Conversely, the vec inverse operator, $\mathcal{R}_{m,n} : \mathbb{C}^{mn} \to \mathbb{C}^{m \times n}$, transforms a vector of dimension $mn$ into a matrix of size $m \times n$.
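A minimal sketch of both operators follows. It assumes that the vec inverse fills the matrix column by column, so that it undoes the vec operator; the function names are illustrative.

```python
def vec(A):
    """Stack the columns of A (a list of rows) into one flat list."""
    rows, cols = len(A), len(A[0])
    return [A[j][k] for k in range(cols) for j in range(rows)]


def vec_inverse(v, m, n):
    """Rebuild the m x n matrix whose stacked columns give v."""
    if len(v) != m * n:
        raise ValueError("len(v) must equal m * n")
    return [[v[k * m + j] for k in range(n)] for j in range(m)]


A = [[1, 2, 3],
     [4, 5, 6]]
v = vec(A)
print(v)                      # [1, 4, 2, 5, 3, 6]
print(vec_inverse(v, 2, 3))   # [[1, 2, 3], [4, 5, 6]]
```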
2.2. Discrete Fourier Transform
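Let $l^2(\mathbb{Z}_n)$ be the set of $\mathbb{C}$-valued signals on $\mathbb{Z}_n$; that is, $x \in l^2(\mathbb{Z}_n)$ if and only if $x \in \mathbb{C}^n$ [9]. Signals are extended periodically: for each $k_1 \in \mathbb{Z}$, $x(k_1) = x(k_2)$, where $k_2 \in \mathbb{Z}_n$ and $k_1 \equiv k_2 \bmod n$. The discrete Fourier transform (DFT) of $x \in l^2(\mathbb{Z}_n)$ is represented as $\mathcal{F}_x : \mathbb{Z}_n \to \mathbb{C}$ such that $\mathcal{F}_x(k) = \sum_{m \in \mathbb{Z}_n} x(m)\,\omega_n^{-mk}$, where $\omega_n = \exp(2\pi i/n)$ and $i = \sqrt{-1}$.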
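The summation form of the DFT translates directly into code. The following sketch evaluates it term by term with the standard library only; it is a reference implementation for checking results, not a fast algorithm.

```python
import cmath


def dft(x):
    """Direct DFT: F_x(k) = sum over m of x(m) * omega_n^(-m*k)."""
    n = len(x)
    out = []
    for k in range(n):
        total = 0j
        for m in range(n):
            # Reduce m*k modulo n first: omega_n has period n in its exponent.
            total += x[m] * cmath.exp(-2j * cmath.pi * ((m * k) % n) / n)
        out.append(total)
    return out


X = dft([1, 2, 3, 4])
print([complex(round(v.real, 6), round(v.imag, 6)) for v in X])
```

For the input $(1, 2, 3, 4)$ the exact transform is $(10, -2 + 2i, -2, -2 - 2i)$, which the sketch reproduces up to rounding error.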
As mentioned in [14], there are two approaches to representing the DFT: as matrix-vector products or as summations. Consequently, fast algorithms using parallel computing are represented either with a matrix formalism, as in [10, 12–14], or with summations, as in most signal processing books. Below, the matrix formalism is introduced and used to express the Cooley-Tukey FFT algorithm corresponding to the decomposition of the transform size $n$ into the product of two factors $r$ and $s$; that is, $n = rs$.
The matrix representation of the DFT of $x$ is $\mathcal{F}_x = F_n x$, where $F_n \in \mathbb{C}^{n \times n}$ such that $F_n(j,k) = \omega_n^{-jk}$. If $n = rs$, then the matrix formalism can be used to express $F_n$ as a factorization in terms of block matrix operations [10, 12, 13]:
$$F_n = (F_s \otimes I_r)\, T_r^n\, (I_s \otimes F_r)\, L_s^n. \tag{3}$$
Here, $T_r^n$ is a diagonal matrix containing the twiddle factors. Note that the superscript $n$ is an index, not a power. This factorization of $F_n$ is the matrix representation of the Cooley-Tukey FFT for $n = rs$. In addition, this representation of $F_n$ allows an implementation using parallel computing [14].
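The factorization can be checked numerically. The sketch below assumes the standard Cooley-Tukey form $F_n = (F_s \otimes I_r)\,T_r^n\,(I_s \otimes F_r)\,L_s^n$, with twiddle factor $\omega_n^{-jk}$ at position $jr + k$, and applies the four factors from right to left without building any matrix. All function names are illustrative.

```python
import cmath


def dft(x):
    """Direct DFT used both as a building block and as the reference."""
    n = len(x)
    return [
        sum(x[m] * cmath.exp(-2j * cmath.pi * ((m * k) % n) / n) for m in range(n))
        for k in range(n)
    ]


def cooley_tukey(x, r, s):
    """Evaluate (F_s (x) I_r) T (I_s (x) F_r) L_s^n x for n = r*s."""
    n = r * s
    if len(x) != n:
        raise ValueError("len(x) must equal r * s")
    # L_s^n: split x into s subsequences of stride s, each of length r.
    blocks = [[x[k * s + j] for k in range(r)] for j in range(s)]
    # I_s (x) F_r: s independent r-point DFTs.
    blocks = [dft(b) for b in blocks]
    # T_r^n: scale entry k of block j by omega_n^(-j*k).
    blocks = [
        [b[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(r)]
        for j, b in enumerate(blocks)
    ]
    # F_s (x) I_r: for each k, an s-point DFT across the s blocks.
    out = [0j] * n
    for k in range(r):
        column = dft([blocks[j][k] for j in range(s)])
        for l in range(s):
            out[l * r + k] = column[l]
    return out


x = [complex(m, (m * m) % 5) for m in range(12)]
error = max(abs(a - b) for a, b in zip(dft(x), cooley_tukey(x, 3, 4)))
print("max deviation from direct DFT:", error)
```

The $s$ short transforms in the second step, and the $r$ transforms in the last step, do not share data, which is the property the parallel framework in Section 4 relies on.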
3. DFT for Vector-Valued Signals
Based on [2, 6–9, 15, 16], the space of vector-valued discrete-time signals with $n$ samples is defined as
$$l^2(\mathbb{Z}_n, \mathbb{C}^d) = \{\, x : \mathbb{Z}_n \to \mathbb{C}^d \,\}. \tag{4}$$
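The space $l^2(\mathbb{Z}_n, \mathbb{C}^d)$ is the set of $\mathbb{C}^d$-valued signals on $\mathbb{Z}_n$; that is, $x \in l^2(\mathbb{Z}_n, \mathbb{C}^d)$ if and only if $x \in \mathbb{C}^{nd}$. Signals are extended periodically: for each $k_1 \in \mathbb{Z}$, $x_{k_1} = x_{k_2}$, where $k_2 \in \mathbb{Z}_n$ and $k_1 \equiv k_2 \bmod n$. Furthermore, if $d = 1$, then $l^2(\mathbb{Z}_n, \mathbb{C}^d) = l^2(\mathbb{Z}_n)$. For $x \in l^2(\mathbb{Z}_n, \mathbb{C}^d)$, there is a kind of DFT for vector-valued signals called the vector-valued DFT. This transform is defined as $\mathcal{F}_x^d : \mathbb{Z}_n \to \mathbb{C}^d$ such that
$$\mathcal{F}_x^d(k) = \sum_{m \in \mathbb{Z}_n} W_n^{-mk}\, x_m, \tag{5}$$
where $W_n \in \mathbb{C}^{d \times d}$ is the matrix kernel. Algorithm 1 shows the implementation of (5). This implementation is a sequential algorithm.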
Algorithm 1. Vector-valued DFT (sequential algorithm).
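In the reviewed literature, there are two kinds of kernels for this transform. The first one is the hypercomplex DFT kernel [8]:
$$W_n = \exp\!\left(\frac{2\pi}{n} J\right) = \cos\!\left(\frac{2\pi}{n}\right) I_d + \sin\!\left(\frac{2\pi}{n}\right) J, \tag{6}$$
where $J \in \mathbb{C}^{d \times d}$ such that $J^2 = -I_d$. The second one is the DFT frame kernel [7]:
$$W_n = \operatorname{diag}\!\left(\omega_n^{\alpha_0}, \omega_n^{\alpha_1}, \dots, \omega_n^{\alpha_{d-1}}\right), \tag{7}$$
where $\mathcal{A} = \{\alpha_0, \alpha_1, \dots, \alpha_{d-1}\} \subset \mathbb{Z}_n$ with $\alpha_j < \alpha_k$ for $j < k$. It is called the DFT frame kernel because $\{e_j\}_{j \in \mathbb{Z}_n} \subset \mathbb{C}^d$, where $e_j = (\omega_n^{j\alpha_0}, \dots, \omega_n^{j\alpha_{d-1}})^T$, is a DFT frame. In this paper, subsets $\mathcal{A} \subset \mathbb{Z}^+$ with $\operatorname{card}(\mathcal{A}) = d$ are also used, although such a subset does not necessarily yield a DFT frame.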
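The sequential evaluation of the vector-valued DFT can be sketched as follows. The sketch assumes that the hypercomplex kernel has the matrix exponential form $W_n = \exp(2\pi J/n)$ and that the DFT frame kernel is $W_n = \operatorname{diag}(\omega_n^{\alpha_j})$; the function names, the matrix $J$, the index set, and the sample data are hypothetical choices made only for illustration.

```python
import cmath
import math


def hypercomplex_power(J, n, e):
    """W_n^e for W_n = exp(2*pi*J/n); since J @ J = -I, this is cos(t) I + sin(t) J."""
    t = 2 * math.pi * e / n
    d = len(J)
    return [
        [math.cos(t) * (1 if a == b else 0) + math.sin(t) * J[a][b] for b in range(d)]
        for a in range(d)
    ]


def frame_power(alphas, n, e):
    """W_n^e for the diagonal kernel W_n = diag(omega_n^alpha)."""
    d = len(alphas)
    return [
        [cmath.exp(2j * math.pi * e * alphas[a] / n) if a == b else 0 for b in range(d)]
        for a in range(d)
    ]


def matvec(M, v):
    """Product of a d x d matrix with a length-d vector."""
    return [sum(M[a][b] * v[b] for b in range(len(v))) for a in range(len(M))]


def vv_dft(x, power):
    """X_k = sum over m of W_n^(-m*k) x_m, where power(e) returns W_n^e."""
    n, d = len(x), len(x[0])
    out = []
    for k in range(n):
        acc = [0j] * d
        for m in range(n):
            term = matvec(power(-m * k), x[m])
            acc = [p + q for p, q in zip(acc, term)]
        out.append(acc)
    return out


n = 4
x = [[1, 0], [0, 1], [1, 1], [2, -1]]   # made-up signal: 4 samples in C^2
J = [[0, -1], [1, 0]]                   # satisfies J @ J = -I_2
X_hyper = vv_dft(x, lambda e: hypercomplex_power(J, n, e))
X_frame = vv_dft(x, lambda e: frame_power([1, 3], n, e))
print(X_hyper[1])
print(X_frame[1])
```

With $d = 1$ and $\mathcal{A} = \{1\}$ the diagonal kernel reduces to $\omega_n$, so the routine then returns the ordinary DFT; this gives a simple consistency check.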
Lemma 1.
Let $W_n \in \mathbb{C}^{d \times d}$ be a hypercomplex DFT kernel or a DFT frame kernel. Then
- (1) $W_n^{j+r} = W_n^{j} \cdot W_n^{r}$.
- (2) $W_n^{0} = W_n^{n} = I_d$.
- (3) If $k \in \mathbb{Z}$ and $r \in \mathbb{Z}_n$, then $W_n^{nk+r} = W_n^{r}$.
- (4) If $n = rs$, then $W_n^{rk} = W_s^{k}$.
Proof.
For the hypercomplex DFT kernel, the proof of each case is analogous to the corresponding proof for the $n$th roots of unity. For the DFT frame kernel, $W_n$ is a diagonal matrix whose entries are powers of $\omega_n$, so each case follows directly from the scalar identities.
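The four properties can be spot-checked numerically. The sketch below does so for a hypercomplex kernel with $d = 2$, assuming the form $W_n^e = \cos(2\pi e/n) I + \sin(2\pi e/n) J$; the matrix $J$ and the chosen exponents are arbitrary illustrative values, and a numerical check of a few cases is of course not a proof.

```python
import math

J = [[0.0, -1.0], [1.0, 0.0]]    # hypothetical choice with J @ J = -I_2
I2 = [[1.0, 0.0], [0.0, 1.0]]


def kernel_power(n, e):
    """W_n^e = cos(2*pi*e/n) I + sin(2*pi*e/n) J, valid because J @ J = -I."""
    t = 2 * math.pi * e / n
    return [
        [math.cos(t) * I2[a][b] + math.sin(t) * J[a][b] for b in range(2)]
        for a in range(2)
    ]


def matmul(A, B):
    """Product of two 2 x 2 matrices."""
    return [[sum(A[a][c] * B[c][b] for c in range(2)) for b in range(2)] for a in range(2)]


def close(A, B, tol=1e-12):
    """Entrywise comparison of two 2 x 2 matrices."""
    return all(abs(A[a][b] - B[a][b]) < tol for a in range(2) for b in range(2))


n, r, s = 12, 3, 4
checks = {
    "(1) W^(j+r) = W^j W^r": close(
        kernel_power(n, 5 + 9), matmul(kernel_power(n, 5), kernel_power(n, 9))
    ),
    "(2) W^0 = W^n = I": close(kernel_power(n, 0), I2) and close(kernel_power(n, n), I2),
    "(3) W^(nk+r) = W^r": close(kernel_power(n, 2 * n + 7), kernel_power(n, 7)),
    "(4) W_n^(rk) = W_s^k": close(kernel_power(n, r * 5), kernel_power(s, 5)),
}
for name, ok in checks.items():
    print(name, ok)
```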
4. A Parallel Framework for n = rs
In this section, the main results of this paper are presented. First, a block matrix representation of the vector-valued DFT is given. Second, a new mathematical framework is derived from this matrix representation using a block matrix formalism (Theorem 2). This result is inspired by the matrix representation of the Cooley-Tukey FFT algorithm for complex discrete-time signals, which corresponds to the decomposition of the transform size $n$ into the product of two factors $r$ and $s$ and is developed in [10, 12, 13]. The result obtained in Theorem 2 is then transformed into a new block matrix representation that contributes to the analysis, design, and implementation of parallel algorithms (Corollary 3); this result is inspired by (3). Finally, a computational complexity analysis of the new algorithm is developed.
As with the DFT representations explained in Section 2.2, there are two approaches to representing the vector-valued DFT: as summations (see (5)) or as matrix-vector products. Both approaches allow a parallel implementation. In fact, the proof of Theorem 2 is developed using summation notation.
The vector-valued DFT can also be written as a matrix-vector product. The block matrix representation of the vector-valued DFT of $x \in l^2(\mathbb{Z}_n, \mathbb{C}^d)$ is defined as $\mathcal{F}_x^d = F_n^d x$, where $F_n^d \in \mathbb{C}^{dn \times dn}$ such that $(F_n^d)_{j,k} = W_n^{-jk} \in \mathbb{C}^{d \times d}$ for $j, k \in \mathbb{Z}_n$. Note that the superscript $d$ is an index, not a power. In this section, a block matrix factorization of $F_n^d$, inspired by (3), is developed. First, a generalization of the stride permutation is defined. Let $n = rs$. The block stride permutation matrix [14, 17] is defined as $L_s^{n,d} \in \mathbb{C}^{dn \times dn}$ such that $L_s^{n,d} = L_s^n \otimes I_d$; for each $x \in \mathbb{C}^{dn}$ with $n$ blocks $x_j \in \mathbb{C}^d$, the operation $L_s^{n,d} x$ permutes the blocks of $x$ as $jr + k \to ks + j$, $j \in \mathbb{Z}_s$, $k \in \mathbb{Z}_r$.
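Theorem 2.
Let $n = rs$ and let $F_n^d \in \mathbb{C}^{dn \times dn}$ be the block matrix of the DFT for vector-valued signals. Then
$$F_n^d = (F_s^d \otimes I_r)\, T_r^{n,d}\, (I_s \otimes F_r^d)\, L_s^{n,d}, \tag{8}$$
where $T_r^{n,d} = \bigoplus_{j \in \mathbb{Z}_s} D_r^{\,j}$ such that $D_r = \bigoplus_{k \in \mathbb{Z}_r} W_n^{-k}$.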
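Proof.
Let $x \in \mathbb{C}^{dn}$, let $l_1, k_1 \in \mathbb{Z}_r$, and let $l_2, k_2 \in \mathbb{Z}_s$. Define the block vector $y = (I_s \otimes F_r^d) L_s^{n,d} x$. Then
$$y_{k_2 r + l_1} = \sum_{k_1 \in \mathbb{Z}_r} W_r^{-k_1 l_1}\, x_{s k_1 + k_2}. \tag{9}$$
Now, let $z = T_r^{n,d} y$. From Lemma 1, $W_r^{-k_1 l_1} = W_n^{-s k_1 l_1}$; then
$$z_{k_2 r + l_1} = W_n^{-k_2 l_1}\, y_{k_2 r + l_1} = \sum_{k_1 \in \mathbb{Z}_r} W_n^{-(s k_1 l_1 + k_2 l_1)}\, x_{s k_1 + k_2}. \tag{10}$$
Let $w = (F_s^d \otimes I_r) z$. Since $W_s^{-k_2 l_2} = W_n^{-r k_2 l_2}$ by Lemma 1,
$$w_{l_2 r + l_1} = \sum_{k_2 \in \mathbb{Z}_s} W_s^{-k_2 l_2}\, z_{k_2 r + l_1} = \sum_{k_2 \in \mathbb{Z}_s} \sum_{k_1 \in \mathbb{Z}_r} W_n^{-(r k_2 l_2 + s k_1 l_1 + k_2 l_1)}\, x_{s k_1 + k_2}. \tag{11}$$
But $r k_2 l_2 + s k_1 l_1 + k_2 l_1 \equiv (k_2 + k_1 s)(l_1 + l_2 r) \bmod n$; then
$$w_{l_1 + l_2 r} = \sum_{k_2 \in \mathbb{Z}_s} \sum_{k_1 \in \mathbb{Z}_r} W_n^{-(k_2 + k_1 s)(l_1 + l_2 r)}\, x_{s k_1 + k_2}. \tag{12}$$
Let $m = s k_1 + k_2$ and $k = l_1 + l_2 r$; then $m, k \in \mathbb{Z}_n$ because $l_1, k_1 \in \mathbb{Z}_r$, $l_2, k_2 \in \mathbb{Z}_s$, and $n = rs$. Then
$$w_k = \sum_{m \in \mathbb{Z}_n} W_n^{-mk}\, x_m = \mathcal{F}_x^d(k). \tag{13}$$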
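Now, if $n = rs$, $A \in \mathbb{C}^{r \times r}$, and $B \in \mathbb{C}^{ds \times ds}$, the following equality [17] is obtained:
$$B \otimes A = L_s^{n,d}\, (A \otimes B)\, L_r^{n,d}. \tag{14}$$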
From Theorem 2 and (14), the following corollary presents a matrix factorization of F n d such that it permits an implementation using parallel computing.
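Corollary 3.
Let $n = rs$ and let $F_n^d \in \mathbb{C}^{dn \times dn}$ be the block matrix of the DFT for vector-valued signals. Then
$$F_n^d = L_s^{n,d}\, (I_r \otimes F_s^d)\, L_r^{n,d}\, T_r^{n,d}\, (I_s \otimes F_r^d)\, L_s^{n,d}, \tag{15}$$
where $T_r^{n,d}$ was defined in Theorem 2.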
Algorithm 2 shows a parallel implementation of (15).
Algorithm 2. Vector-valued DFT (parallel algorithm).
Algorithm 2 contains $r$ independent processes in Steps (3)–(5) and $2s$ independent processes in Steps (6)–(8) and (12)–(14), which makes this approach suitable for parallel execution. A model of Algorithm 2 is shown in Figure 1.
Figure 1. Parallel model of the vector-valued DFT for $x \in l^2(\mathbb{Z}_n, \mathbb{C}^d)$, $n = rs$, using a matrix representation.
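The data flow of the factored transform can be sketched in Python as a stand-in for the MATLAB implementation. The sketch assumes the six-factor form read off from the complexity analysis in Section 4.1 (permutation, $s$ short transforms, twiddle blocks, permutation, $r$ short transforms, permutation) and a diagonal DFT frame kernel. `ALPHAS`, the function names, and the use of `ThreadPoolExecutor` are illustrative assumptions and do not reproduce the steps of Algorithm 2 one to one. In CPython, threads only show that the tasks are independent; a real speedup would need processes or native code.

```python
import cmath
import math
from concurrent.futures import ThreadPoolExecutor

ALPHAS = [1, 2, 4]   # hypothetical index set for the diagonal kernel, so d = 3


def apply_kernel(n, e, v):
    """Apply W_n^e to the sample v for W_n = diag(omega_n^alpha)."""
    return [cmath.exp(2j * math.pi * ((e * a) % n) / n) * c for a, c in zip(ALPHAS, v)]


def vv_dft(x):
    """Direct vector-valued DFT of a list of samples: X_k = sum_m W_n^(-m*k) x_m."""
    n, d = len(x), len(x[0])
    out = []
    for k in range(n):
        acc = [0j] * d
        for m in range(n):
            acc = [p + q for p, q in zip(acc, apply_kernel(n, -m * k, x[m]))]
        out.append(acc)
    return out


def vv_dft_factored(x, r, s, workers=4):
    """Vector-valued DFT for n = r*s computed through short independent transforms."""
    n = r * s
    if len(x) != n:
        raise ValueError("len(x) must equal r * s")
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Block stride permutation, then s independent r-point transforms.
        groups = [[x[k * s + j] for k in range(r)] for j in range(s)]
        groups = list(pool.map(vv_dft, groups))
        # Twiddle step: block k of group j is multiplied by W_n^(-j*k).
        groups = [
            [apply_kernel(n, -j * k, g[k]) for k in range(r)]
            for j, g in enumerate(groups)
        ]
        # Second block permutation, then r independent s-point transforms.
        columns = [[groups[j][k] for j in range(s)] for k in range(r)]
        columns = list(pool.map(vv_dft, columns))
    # Final block permutation: output block l*r + k is entry l of column k.
    return [columns[k][l] for l in range(s) for k in range(r)]


x = [[complex(m + a, m - a) for a in range(3)] for m in range(12)]
reference = vv_dft(x)
factored = vv_dft_factored(x, 3, 4)
error = max(abs(p - q) for u, v in zip(reference, factored) for p, q in zip(u, v))
print("max deviation from direct evaluation:", error)
```

Each call passed to `pool.map` touches only its own group of blocks, so the short transforms can run concurrently without synchronization; only the permutations and the twiddle step sit between the two parallel phases.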
4.1. Computational Complexity Analysis
In this section, the computational complexity (CC) of (15) is analyzed. First, consider the matrix operation $L_s^{n,d} v$. Its CC is $\mathcal{O}(nd)$ [8] when it is computed as the multiplication of a block matrix in $\mathbb{C}^{dn \times dn}$ by a block vector in $\mathbb{C}^{dn}$. However, the operation $L_s^{n,d} v$ can be implemented with a CC of $\mathcal{O}(sd)$ (see, e.g., [12, 14]).
Let $F_n^d \in \mathbb{C}^{dn \times dn}$ be the block matrix and let $x \in l^2(\mathbb{Z}_n, \mathbb{C}^d)$ be a vector-valued signal, where $n = rs$. The CC of the direct operation $y = F_n^d x$ is $\mathcal{O}(n^2 d^2) = \mathcal{O}(r^2 s^2 d^2)$. Now consider the operation $y = F_n^d x$ computed through (15). Taking each matrix-vector multiplication in turn, we obtain the following:
- (1) The CC of $y_1 = L_s^{n,d} x$ is $\mathcal{O}(sd)$.
- (2) The CC of $y_2 = (I_s \otimes F_r^d)\, y_1$ is $\mathcal{O}(s r^2 d^2)$, because it is a block diagonal matrix multiplication.
- (3) The CC of $y_3 = T_r^{n,d}\, y_2$ is $\mathcal{O}(nd)$, because it is a diagonal matrix multiplication.
- (4) The CC of $y_4 = L_r^{n,d}\, y_3$ is $\mathcal{O}(rd)$.
- (5) The CC of $y_5 = (I_r \otimes F_s^d)\, y_4$ is $\mathcal{O}(r s^2 d^2)$, because it is a block diagonal matrix multiplication.
- (6) The CC of $y = L_s^{n,d}\, y_5$ is $\mathcal{O}(sd)$.
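Therefore, the CC of $F_n^d x$ using (15) is
$$\mathcal{O}(sd) + \mathcal{O}(s r^2 d^2) + \mathcal{O}(nd) + \mathcal{O}(rd) + \mathcal{O}(r s^2 d^2) + \mathcal{O}(sd) = \mathcal{O}\big(sr(r+s)d^2\big). \tag{16}$$
Thus, the CC of the direct operation $F_n^d x$ is $\mathcal{O}(r^2 s^2 d^2)$, whereas the CC of $F_n^d x$ computed through (15) is $\mathcal{O}(sr(r+s)d^2)$. This comparison shows the efficiency of the matrix formulation in (15).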
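To give a feel for the size of the gap, the two leading-term counts can be evaluated for the signal lengths used in Section 5. The sketch below only compares the expressions $n^2 d^2$ and $sr(r+s)d^2$; constants and lower-order terms are ignored, so the ratios are indicative and are not predicted running times. The function names are illustrative.

```python
def direct_cost(r, s, d):
    """Leading-term count n^2 d^2 for the dense product, with n = r*s."""
    n = r * s
    return n * n * d * d


def factored_cost(r, s, d):
    """Leading-term count s r (r + s) d^2 for the factored form."""
    return s * r * (r + s) * d * d


d = 5
for r, s in [(32, 32), (64, 32), (64, 64), (128, 64), (128, 128)]:
    ratio = direct_cost(r, s, d) / factored_cost(r, s, d)
    print(f"n = {r * s:5d}: direct / factored = {ratio:.1f}")
```

The ratio simplifies to $rs/(r+s)$, so it grows with the signal length and is largest, for a fixed $n$, when $r$ and $s$ are balanced.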
5. Implementation and Experimental Investigation
5.1. General Information
The investigations were carried out on a computer with a multicore processor: an Intel Core i7-3632QM CPU with 4 cores, a system clock of 2.20 GHz, and 8 GB of RAM. The experiment consists of implementing and testing Algorithms 1 and 2 with the hypercomplex DFT kernel and the DFT frame kernel. Algorithm 1 does not use any parallel implementation, unlike Algorithm 2. The test signal is a CAZAC signal in $l^2(\mathbb{Z}_n, \mathbb{C}^d)$, generated from a Wiener CAZAC signal in $l^2(\mathbb{Z}_n)$ [9] with $d = 5$ and $n = rs$, where $n = 1024 = 32 \cdot 32$, $n = 2048 = 64 \cdot 32$, $n = 4096 = 64 \cdot 64$, $n = 8192 = 128 \cdot 64$, and $n = 16384 = 128 \cdot 128$.
Algorithms 1 and 2 are implemented in MATLAB, and Algorithm 2 uses the Parallel Computing Toolbox. MATLAB supports two forms of parallelism: built-in multithreading and parallelism using MATLAB workers. The latter is used here. With the Parallel Computing Toolbox, multiple MATLAB workers (MATLAB computational engines) can run on a multicore computer to execute applications in parallel. This approach allows more control over the parallelism than built-in multithreading. The parallel MATLAB programs of the framework for the vector-valued DFT are written with programming constructs such as parallel for-loops (parfor) and batch.
5.2. Results and Discussion
Let $T_\ast$ be the execution time of Algorithm 1 without any parallel implementation, and let $T_p$ be the execution time of Algorithm 2, where $p$ is the number of cores. The value of $T_p$ is expected to be less than that of $T_\ast$ for two reasons: Algorithm 2 has a parallel implementation, and the sizes of the matrix multiplications differ. Algorithm 2 is computed with matrices in $\mathbb{C}^{dr \times dr}$ and $\mathbb{C}^{ds \times ds}$, whereas Algorithm 1 is computed with matrices in $\mathbb{C}^{dn \times dn}$, where $n = rs$.
The computational performance of Algorithm 2 is evaluated using two metrics: speedup (or acceleration) and efficiency. The speedup is the ratio between the execution time of the parallel implementation with one core and that of the parallel implementation with two or more cores [18]; it is given by the formula $S = T_1 / T_p$. The efficiency estimates how well utilized the processors are in solving the problem compared to how much effort is wasted in communication and synchronization [18]. It is the ratio between the speedup and the number of processing elements, given by the formula $E = T_1 / (p\,T_p)$.
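As a worked example of the two formulas, the sketch below evaluates them for $p = 2$ with the timings $T_1 = 106.7$ s and $T_2 = 80.44$ s reported in Table 1 for the hypercomplex DFT kernel and $n = 8192$. The function names are illustrative.

```python
def speedup(t1, tp):
    """S = T_1 / T_p."""
    return t1 / tp


def efficiency(t1, tp, p):
    """E = T_1 / (p * T_p), that is, the speedup divided by the number of cores."""
    return t1 / (p * tp)


t1, t2 = 106.7, 80.44   # seconds, hypercomplex DFT kernel, n = 8192 (Table 1)
print(f"S = {speedup(t1, t2):.3f}")          # S = 1.326
print(f"E = {efficiency(t1, t2, 2):.3f}")    # E = 0.663
```

These values agree with the corresponding entries of Tables 2 and 3.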
Table 1 shows the execution time, in seconds (s), of both algorithms. A significant reduction in the execution time of the vector-valued DFT is observed with the parallel implementation. For example, Algorithm 1 with the hypercomplex kernel for a Wiener CAZAC signal in $l^2(\mathbb{Z}_{8192}, \mathbb{C}^5)$ has a serial execution time of $T_\ast = 13408$ s. Using Algorithm 2, however, we obtain $T_1 = 106.7$ s (0.80% of $T_\ast$), $T_2 = 80.44$ s (0.60% of $T_\ast$), $T_3 = 57.35$ s (0.43% of $T_\ast$), and $T_4 = 32.67$ s (0.24% of $T_\ast$). This result shows the advantage of using multicore processors and a parallel computing environment to reduce the high execution time of the vector-valued DFT. Parallel computing is a form of computation in which many calculations are carried out simultaneously [19, 20]; it operates on the principle that large problems can often be divided into smaller ones, which are then solved concurrently, reducing the execution time [20, 21]. The large difference between $T_\ast$ and $T_1$ arises because Algorithm 2 is computed with matrices in $\mathbb{C}^{dr \times dr}$ and $\mathbb{C}^{ds \times ds}$, whereas Algorithm 1 is computed with matrices in $\mathbb{C}^{dn \times dn}$, where $n = rs$.
Table 1.
Execution time, in seconds, of Algorithms 1 and 2.
| Kernel | p | n = 1024 | n = 2048 | n = 4096 | n = 8192 | n = 16384 |
|---|---|---|---|---|---|---|
| Hypercomplex DFT | ∗ | 36.80 | 163.9 | 853.5 | 13408 | >15000 |
| | 1 | 3.142 | 8.950 | 18.78 | 106.7 | 263.4 |
| | 2 | 2.363 | 4.276 | 17.25 | 80.44 | 180.9 |
| | 3 | 1.695 | 3.222 | 12.75 | 57.35 | 154.7 |
| | 4 | 0.983 | 2.966 | 6.481 | 32.67 | 82.65 |
| DFT frame | ∗ | 40.50 | 173.1 | 881.0 | 13913 | >15000 |
| | 1 | 2.438 | 9.749 | 20.97 | 95.29 | 251.6 |
| | 2 | 1.911 | 5.798 | 16.38 | 58.72 | 179.5 |
| | 3 | 1.531 | 5.126 | 12.33 | 57.40 | 151.4 |
| | 4 | 1.199 | 2.329 | 5.366 | 31.46 | 73.42 |

Test signal in $l^2(\mathbb{Z}_n, \mathbb{C}^d)$, where $n = rs$ and $d = 5$.
$p = \ast$ denotes the execution time of Algorithm 1.
$p > 0$ is the number of cores used by Algorithm 2.
Table 2 shows the speedup of Algorithm 2. The acceleration of the vector-valued DFT increases as $p$ increases, regardless of the value of $n$. For example, for $n = 4096$ with the hypercomplex DFT kernel, the proposed parallel implementation with $p = 2$, $3$, and $4$ cores achieves a speedup of 1.09, 1.47, and 2.99, respectively, when computing the vector-valued DFT of a Wiener CAZAC signal. These results imply that the approach with four cores should be preferred to obtain the highest speedup.
Table 2.
Speedup of Algorithm 2.
| Kernel | p | n = 1024 | n = 2048 | n = 4096 | n = 8192 | n = 16384 |
|---|---|---|---|---|---|---|
| Hypercomplex DFT | 2 | 1.333 | 2.093 | 1.089 | 1.326 | 1.456 |
| | 3 | 1.853 | 2.778 | 1.473 | 1.860 | 1.703 |
| | 4 | 3.196 | 3.017 | 2.987 | 3.265 | 3.187 |
| DFT frame | 2 | 1.275 | 1.509 | 1.281 | 1.623 | 1.402 |
| | 3 | 1.592 | 1.707 | 1.701 | 1.660 | 1.661 |
| | 4 | 2.033 | 3.757 | 3.901 | 3.029 | 3.426 |

Test signal in $l^2(\mathbb{Z}_n, \mathbb{C}^d)$, where $n = rs$ and $d = 5$.
$p$ is the number of cores.
Table 3 shows the efficiency of Algorithm 2. A good efficiency (about 65% or higher in most cases) is reached with $p = 2$. However, the efficiency of the vector-valued DFT decreases (down to 36%) as $p$ increases, regardless of the value of $n$. This is attributed to a decrease in the share of simultaneous computation of the partial vector-valued DFTs in Algorithm 2 (Steps (3)–(5) and (12)–(14)), which is responsible for the main effect. The results in Table 3 imply that the approach with two cores should be preferred when efficiency is the priority.
Table 3.
Efficiency of Algorithm 2.
| Kernel | p | n = 1024 | n = 2048 | n = 4096 | n = 8192 | n = 16384 |
|---|---|---|---|---|---|---|
| Hypercomplex DFT | 2 | 0.665 | 1.046 | 0.544 | 0.663 | 0.728 |
| | 3 | 0.463 | 0.694 | 0.368 | 0.465 | 0.426 |
| | 4 | 0.400 | 0.377 | 0.362 | 0.408 | 0.398 |
| DFT frame | 2 | 0.637 | 0.755 | 0.645 | 0.811 | 0.701 |
| | 3 | 0.530 | 0.427 | 0.425 | 0.415 | 0.415 |
| | 4 | 0.508 | 0.470 | 0.489 | 0.379 | 0.428 |

Test signal in $l^2(\mathbb{Z}_n, \mathbb{C}^d)$, where $n = rs$ and $d = 5$.
$p$ is the number of cores.
6. Conclusion
This work presented a parallel framework for the vector-valued DFT of vector-valued discrete-time signals. The framework was inspired by the matrix representation of the Cooley-Tukey FFT algorithm for complex discrete-time signals, which corresponds to the decomposition of the transform size $n$ into the product of two factors $r$ and $s$ and is developed in [10, 12]. It was expressed in (15) and Algorithm 2. The framework was formulated as a matrix representation using a set of block matrix operations: Kronecker product, direct sum, stride permutation, vec operator, and vec inverse operator. These operations contributed to the analysis, design, and implementation of the parallel algorithm. Two kernels were used in the vector-valued DFT: the hypercomplex DFT kernel and the DFT frame kernel.
The experimental investigation indicated that there is a benefit in using MATLAB with the Parallel Computing Toolbox on a computer with a multicore processor. First, using multicore processors and a parallel computing environment reduced the high execution time (with the hypercomplex DFT kernel and $n = 8192$, we obtained $T_\ast = 13408$ s, $T_1 = 106.7$ s, $T_2 = 80.44$ s, $T_3 = 57.35$ s, and $T_4 = 32.67$ s). Second, the speedup increased as $p$ increased, regardless of the value of $n$, and a good efficiency (about 65% or higher in most cases) was obtained when $p = 2$.
As future work, we would like to extend the proposed parallel framework to vector-valued discrete-time signals in $l^2(\mathbb{Z}_n, \mathbb{C}^d)$, where $n = 2^k$, using the idea of the Pease algorithm for complex discrete-time signals [22]. Additionally, we would like to explore further design tradeoffs among approaches beyond those shown in this paper, for example, the approach developed in [23].
Acknowledgment
This work was supported by Vicerrectoría de Investigación y Extensión of Instituto Tecnológico de Costa Rica.
Conflict of Interests
The author declares that there is no conflict of interests regarding the publication of this paper.
References
- 1. Wegmann B., Zetzsche C. Feature-specific vector quantization of images. IEEE Transactions on Image Processing. 1996;5(2):274–288. doi: 10.1109/83.480763.
- 2. Huang J., Lv B.-Q. A feasible algorithm for designing biorthogonal bivariate vector-valued finitely supported wavelets. Physics Procedia. 2012;25:1507–1514. doi: 10.1016/j.phpro.2012.03.269.
- 3. Li W. Vector transform and image coding. IEEE Transactions on Circuits and Systems for Video Technology. 1991;1(4):297–307. doi: 10.1109/76.120769.
- 4. Xia X.-G., Suter B. W. Multirate filter banks with block sampling. IEEE Transactions on Signal Processing. 1996;44(3):484–496. doi: 10.1109/78.489022.
- 5. Avdonin S. A., Ivanov S. A. Sampling and interpolation problems for vector valued signals in the Paley-Wiener spaces. IEEE Transactions on Signal Processing. 2008;56(11):5435–5441. doi: 10.1109/TSP.2008.928702.
- 6. Vehkapera M., Chatterjee S., Skoglund M. Analysis of MMSE estimation for compressive sensing of block sparse signals. Proceedings of the IEEE Information Theory Workshop (ITW '11); October 2011; Paraty, Brazil. IEEE; pp. 553–557.
- 7. Benedetto J. J., Donatelli J. J. Frames and a vector-valued ambiguity function. Proceedings of the 42nd Asilomar Conference on Signals, Systems and Computers; October 2008; Pacific Grove, Calif, USA. IEEE; pp. 8–12.
- 8. Sangwine S. J., Ell T. A. Complex and hypercomplex discrete Fourier transforms based on matrix exponential form of Euler's formula. Applied Mathematics and Computation. 2012;219(2):644–655. doi: 10.1016/j.amc.2012.06.055.
- 9. Benedetto J. J., Donatelli J. J. Ambiguity function and frame-theoretic properties of periodic zero-autocorrelation waveforms. IEEE Journal on Selected Topics in Signal Processing. 2007;1(1):6–20. doi: 10.1109/jstsp.2007.897044.
- 10. Johnson J. R., Johnson R. W., Rodriguez D., Tolimieri R. A methodology for designing, modifying, and implementing Fourier transform algorithms on various architectures. Circuits, Systems, and Signal Processing. 1990;9(4):449–500. doi: 10.1007/BF01189337.
- 11. Rodriguez D., Seguel J., Cruz E. Algebraic methods for the analysis and design of time-frequency signal processing algorithms. Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS '93); May 1993; Chicago, Ill, USA. IEEE; pp. 196–199.
- 12. Tolimieri R., An M., Lu C. Algorithms for Discrete Fourier Transform and Convolution. Berlin, Germany: Springer; 1997. (Signal Processing and Digital Filtering).
- 13. Van Loan C. Computational Frameworks for the Fast Fourier Transform. SIAM; 2012. (Frontiers in Applied Mathematics).
- 14. Franchetti F., Püschel M., Voronenko Y., Chellappa S., Moura J. M. F. Discrete Fourier transform on multicore. IEEE Signal Processing Magazine. 2009;26(6):90–102. doi: 10.1109/msp.2009.934155.
- 15. Saberi A., Stoorvogel A., Sannuti P. Internal and External Stabilization of Linear Systems with Constraints. Berlin, Germany: Springer; 2012. (Systems & Control: Foundations & Applications).
- 16. Shirazinia A., Chatterjee S., Skoglund M. Performance bounds for vector quantized compressive sensing. Proceedings of the International Symposium on Information Theory and Its Applications (ISITA '12); October 2012; pp. 289–293.
- 17. Tolimieri R., An M., Lü C., Burrus C. Mathematics of Multidimensional Fourier Transform Algorithms. Berlin, Germany: Springer; 1997. (Signal Processing and Digital Filtering).
- 18. McCool M. D., Robison A. D., Reinders J. Structured Parallel Programming: Patterns for Efficient Computation. Morgan Kaufmann Publishers, Elsevier; 2012.
- 19. Almasi G. S., Gottlieb A. Highly Parallel Computing. Benjamin-Cummings Publishing Company; 1989.
- 20. Trobec R., Vajteric M., Zinterhof P. Parallel Computing: Numerics, Applications, and Trends. Springer; 2009.
- 21. Tokhi M. O., Hossain M. A., Shaheed M. H. Parallel Computing for Real-Time Signal Processing and Control. Berlin, Germany: Springer; 2003.
- 22. Pease M. C. An adaptation of the fast Fourier transform for parallel processing. Journal of the ACM. 15(2):252–264. doi: 10.1145/321450.321457.
- 23. Qiuling Z., Akin B., Sumbul H. E., et al. A 3D-stacked logic-in-memory accelerator for application-specific data intensive computing. Proceedings of the IEEE International 3D Systems Integration Conference; October 2013; pp. 1–7.
