Named one of the top ten numerical algorithms of the twentieth century by SIAM News, the fast Fourier transform (FFT) is well known for reducing the computational complexity of the discrete Fourier transform (DFT) from $O(N^2)$ to $O(N \log N)$, where $N$ is the number of input samples.
In this project, we use POSIX Threads to further accelerate the computation of a 2-D FFT.

A size-$N$ 1-D DFT of a sequence of numbers $x[n]$ ($n = 0, \cdots, N-1$) can be denoted by $DFT_N(x[n])$. Each output sample is denoted by $X[k]$, where $k = 0, \cdots, N-1$. In the following, we add a superscript $k$ to $DFT_N(x[n])$ to make the relationship between $X[k]$ and $DFT_N(x[n])$ explicit: $X[k] = DFT_N^k(x[n])$.

According to the DFT formula,

(1)
$$X[k] = DFT_N^{k}(x[n]) = \sum_{n=0}^{N-1} x[n]\, W_N^{nk},$$

in which $W_N = e^{-2\pi j/N}$, so that $W_N^{nk} = e^{-2\pi j nk/N}$.
Possible $k$’s can be separated into even ($k = 2s$) and odd ($k = 2s+1$) numbers. For even $k$’s, Eqn. (1) can be written as

(2)
$$\begin{aligned}
X[2s] &= \sum_{n=0}^{N/2-1} x[n]\, W_N^{2ns} + \sum_{n=0}^{N/2-1} x[n+N/2]\, W_N^{2(n+N/2)s} \\
&= \sum_{n=0}^{N/2-1} \left(x[n] + x[n+N/2]\right) W_N^{2ns} \\
&= \sum_{n=0}^{N/2-1} \left(x[n] + x[n+N/2]\right) W_{N/2}^{ns} \\
&= DFT_{N/2}^{s}\!\left(x[n] + x[n+N/2]\right),
\end{aligned}$$

where the second equality uses $W_N^{Ns} = 1$.

Likewise, when $k$ is an odd number, we have

$$\begin{aligned}
X[2s+1] &= \sum_{n=0}^{N-1} x[n]\, W_N^{n(2s+1)} \\
&= \sum_{n=0}^{N/2-1} x[n]\, W_N^{n(2s+1)} + \sum_{n=N/2}^{N-1} x[n]\, W_N^{n(2s+1)} \\
&= \sum_{n=0}^{N/2-1} x[n]\, W_N^{n(2s+1)} + \sum_{n=0}^{N/2-1} x[n+N/2]\, W_N^{(n+N/2)(2s+1)} \\
&= \sum_{n=0}^{N/2-1} x[n]\, W_N^{n(2s+1)} + \sum_{n=0}^{N/2-1} x[n+N/2]\, W_N^{n(2s+1)}\, W_N^{Ns+N/2} \\
&= \sum_{n=0}^{N/2-1} x[n]\, W_N^{n(2s+1)} + \sum_{n=0}^{N/2-1} x[n+N/2]\, W_N^{n(2s+1)}\, W_N^{N/2}.
\end{aligned}$$
Because $W_N^{N/2} = e^{-\pi j} = -1$, the equation above becomes

(3)
$$\begin{aligned}
X[2s+1] &= \sum_{n=0}^{N/2-1} x[n]\, W_N^{n(2s+1)} - \sum_{n=0}^{N/2-1} x[n+N/2]\, W_N^{n(2s+1)} \\
&= \sum_{n=0}^{N/2-1} \left(x[n] - x[n+N/2]\right) W_N^{2ns}\, W_N^{n} \\
&= \sum_{n=0}^{N/2-1} \left(\left(x[n] - x[n+N/2]\right) W_N^{n}\right) W_{N/2}^{ns} \\
&= DFT_{N/2}^{s}\!\left(\left(x[n] - x[n+N/2]\right) W_N^{n}\right).
\end{aligned}$$

The derivations of Eqns. (2) and (3) show that a size-$N$ DFT can be divided into two size-$N/2$ DFTs. If we repeat this strategy until $N = 1$, Eqn. (1) degenerates to $X[0] = x[0]$; that is, the output of a size-1 DFT is identical to its input. Based on this result, we can implement the radix-2 Cooley–Tukey FFT (here decimated in frequency, since the even/odd split is over the output index $k$) in C++ by simply using arrays and loops.

Fig. 1 illustrates the Cooley–Tukey algorithm in the case of $N = 8$. In practice, no matter how large $N$ is, a three-layer nested *for* loop is sufficient to implement the algorithm: the index of the outermost layer loops through the stages, the index of the middle layer loops through the blocks within a stage, and the index of the innermost layer loops through the entries in a block. The arithmetic relationships among the three indices can be observed from the Cooley–Tukey diagram.

We can also observe from the diagram that the inputs of each stage depend only on the outputs of the previous stage. Therefore, from a memory-saving perspective, only one size-$N$ array needs to be allocated to store the computation results. This array initially stores the input sequence, is subsequently overwritten by the intermediate results of the middle stages, and is eventually overwritten by the results of the final stage.
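Concretely, the three-layer loop over a single size-$N$ working array might be sketched as follows. This is a minimal sketch, not the project's actual code: all identifiers are ours, and the input length is assumed to be a power of two. Applying the even/odd split of Eqns. (2) and (3) at every stage leaves the output in bit-reversed order, as in Fig. 1.

```cpp
#include <cmath>
#include <complex>
#include <vector>

// In-place radix-2 Cooley–Tukey FFT over one size-N array (a sketch).
// a.size() must be a power of two; the output is left in bit-reversed order.
void fft_in_place(std::vector<std::complex<double>>& a) {
    const double PI = std::acos(-1.0);
    const int N = static_cast<int>(a.size());
    // Outermost loop: the stages; `len` is the butterfly half-size.
    for (int len = N / 2; len >= 1; len /= 2) {
        // Middle loop: the blocks of size 2*len inside the current stage.
        for (int blk = 0; blk < N; blk += 2 * len) {
            // Innermost loop: the entries (butterflies) inside one block.
            for (int n = 0; n < len; ++n) {
                const std::complex<double> w =
                    std::polar(1.0, -PI * n / len);  // W_{2*len}^n
                const std::complex<double> u = a[blk + n];
                const std::complex<double> v = a[blk + n + len];
                a[blk + n]       = u + v;            // Eqn. (2): x[n] + x[n+N/2]
                a[blk + n + len] = (u - v) * w;      // Eqn. (3): (x[n] - x[n+N/2]) W^n
            }
        }
    }
}
```

Feeding the sketch a unit impulse, for example, yields a flat spectrum of ones, in any output order.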

At this point, the output sequence stored in the array still needs rearrangement to yield the correct answer. For example, the second element (index = 1) of the output sequence, $X[4]$, should be moved to the cell whose index is four, and the seventh element (index = 6), $X[3]$, should be moved to the cell whose index is three. The rearrangement can be performed by reversing the binary representations of the indices of the output sequence. For instance, the cell storing $X[4]$ has index 1, whose binary representation is 001; the reverse of 001 is 100, indicating that this element should be moved to the cell whose index is four.
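This index-reversal pass might be sketched as follows (identifiers are ours; `bits` is $\log_2 N$):

```cpp
#include <complex>
#include <utility>
#include <vector>

// Rearrange the FFT output by swapping each element with the one whose
// index is the bit-reversed counterpart (a sketch; N must be 2^bits).
void bit_reverse_permute(std::vector<std::complex<double>>& a, int bits) {
    const int N = 1 << bits;
    for (int i = 0; i < N; ++i) {
        int r = 0;
        for (int b = 0; b < bits; ++b)       // reverse the `bits` lowest bits
            if (i & (1 << b)) r |= 1 << (bits - 1 - b);
        if (i < r) std::swap(a[i], a[r]);    // swap each pair exactly once
    }
}
```

With $N = 8$, the element at index 1 (binary 001) lands at index 4 (binary 100), matching the example above.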

As shown in Fig. 1, all possible $W_N^n$’s need to be calculated only once at the beginning, and some of them get reused in the subsequent stages (because $W_{N/2}^{n} = W_N^{2n}$). Recall that $W_N^n$ is defined as $e^{-2\pi j n/N}$, so we can use the analytical form, $e^{-2\pi j n/N} = \cos(2\pi n/N) - j\sin(2\pi n/N)$, to evaluate each $W_N^n$. However, this is not the most efficient means of calculating $W_N^n$, given that cosine and sine functions are more computationally expensive than additions or multiplications of floating-point numbers.

To avoid repetitive calls to sine and cosine functions, we can compute $W_N^1$ first and use the fact that $W_N^n = (W_N^1)^n$ to calculate all the $W_N^n$’s. In this method, sine and cosine are called only once, when $W_N^1$ is evaluated, producing two floating-point numbers that represent the real and imaginary parts of a complex number. Each subsequent $W_N^n$ is then obtained from $W_N^{n-1}$ at the cost of a single complex multiplication.

However, this approach is not desirable from the perspective of numerical accuracy, because rounding errors accumulate from $W_N^1$ to $W_N^n$: each term inherits the error of every preceding multiplication. A better approach that improves the numerical accuracy is illustrated in the following figure:

In Fig. 2, $X_0 = W_N^1$, $X_1 = (X_0)^2$, $X_2 = (X_1)^2$, and $X_3 = (X_2)^2$, i.e., $X_i = W_N^{2^i}$. With these terms available, each $W_N^n$ can be obtained by multiplying together the $X_i$’s selected by the set bits in the binary representation of $n$ (for example, $W_N^5 = X_2 X_0$ because $5 = 101_2$). The advantage of this approach is twofold: 1) because sine and cosine are still called only once, the method remains computationally efficient, and 2) numerical errors are distributed much more evenly over all the $W_N^n$’s, since each $W_N^n$ is the product of at most $\log_2 N$ factors.
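Under this reading of Fig. 2, a twiddle-factor table built from the squared terms might look like the following sketch (identifiers are ours; only the first $N/2$ factors are generated, which is all the radix-2 butterflies need):

```cpp
#include <cmath>
#include <complex>
#include <vector>

// Build W_N^0 ... W_N^{N/2-1} from the repeated-squaring terms of Fig. 2
// (a sketch; N must be a power of two). Sine and cosine are called exactly
// once, and every W_N^n is a product of at most log2(N) squared terms.
std::vector<std::complex<double>> make_twiddles(int N) {
    const double PI = std::acos(-1.0);
    // X[i] = W_N^(2^i), each obtained by squaring the previous term.
    std::vector<std::complex<double>> X;
    std::complex<double> w = std::polar(1.0, -2.0 * PI / N);  // X_0 = W_N^1
    for (int p = 1; p < N; p <<= 1) {
        X.push_back(w);
        w *= w;
    }
    // Combine the X[i]'s selected by the set bits of n, e.g. W_N^5 = X_2 * X_0.
    std::vector<std::complex<double>> W(N / 2);
    for (int n = 0; n < N / 2; ++n) {
        std::complex<double> v(1.0, 0.0);
        for (int i = 0; (1 << i) <= n; ++i)
            if (n & (1 << i)) v *= X[static_cast<size_t>(i)];
        W[n] = v;
    }
    return W;
}
```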

Recall that a 2-D DFT can be formulated as

$$\begin{aligned}
\hat{F}(k, l) &= \frac{1}{\sqrt{MN}} \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} F[m, n]\, e^{-j 2\pi \left(\frac{km}{M} + \frac{ln}{N}\right)} \\
&= \frac{1}{\sqrt{M}} \sum_{m=0}^{M-1} \left[ \frac{1}{\sqrt{N}} \sum_{n=0}^{N-1} F[m, n]\, e^{-j 2\pi \frac{ln}{N}} \right] e^{-j 2\pi \frac{km}{M}},
\end{aligned}$$
which indicates that a 2-D FFT can be accomplished by two series of 1-D FFTs: the part inside the square brackets (a 1-D FFT along each row) and the part outside (a 1-D FFT along each column). In fact, these FFTs can be further expedited with parallelism, because the 1-D FFTs applied to different rows (and, in the second pass, to different columns) are independent of one another.

POSIX threads can be used to achieve this parallelism on a shared-memory system. As shown in Fig. 3, a user can allocate only one copy of the input data in memory and create multiple threads to carry out the 2-D FFT, which is decomposed into the following steps:

- Each thread executes the 1-D FFTs of several rows.
- One thread is responsible for transposing the computation results of the previous step while other threads stay idle.
- The first step is repeated.
- The second step is repeated.

All threads must be synchronized between steps to prevent data from being overwritten prematurely. To synchronize a number of threads, we can use the built-in POSIX barrier (pthread_barrier_t and pthread_barrier_wait) or simply implement our own barrier.
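The four-step choreography above might be sketched as follows. This is a minimal sketch, not the project's actual code: the 1-D row FFT is abstracted behind `row_op` (here a stand-in that merely doubles each row, so the sketch stays self-contained and checkable), and `NTHREADS` and the matrix size `N` are placeholder values of ours.

```cpp
#include <pthread.h>
#include <algorithm>
#include <vector>

constexpr int NTHREADS = 4;   // placeholder thread count
constexpr int N = 8;          // placeholder image size (N x N)

static std::vector<std::vector<double>> img(N, std::vector<double>(N, 1.0));
static pthread_barrier_t barrier;

// Stand-in for a 1-D row FFT; a real implementation transforms `row` in place.
static void row_op(std::vector<double>& row) {
    for (double& v : row) v *= 2.0;
}

static void transpose() {  // swap img[i][j] with img[j][i]
    for (int i = 0; i < N; ++i)
        for (int j = i + 1; j < N; ++j)
            std::swap(img[i][j], img[j][i]);
}

static void* worker(void* arg) {
    const long id = reinterpret_cast<long>(arg);
    for (int pass = 0; pass < 2; ++pass) {
        // Steps 1 and 3: each thread transforms its share of the rows.
        for (int r = static_cast<int>(id); r < N; r += NTHREADS)
            row_op(img[r]);
        pthread_barrier_wait(&barrier);  // wait until every row is done
        // Steps 2 and 4: one thread transposes while the others stay idle.
        if (id == 0) transpose();
        pthread_barrier_wait(&barrier);  // wait for the transpose to finish
    }
    return nullptr;
}

void run_2d_pipeline() {
    pthread_barrier_init(&barrier, nullptr, NTHREADS);
    pthread_t tid[NTHREADS];
    for (long i = 0; i < NTHREADS; ++i)
        pthread_create(&tid[i], nullptr, worker, reinterpret_cast<void*>(i));
    for (int i = 0; i < NTHREADS; ++i)
        pthread_join(tid[i], nullptr);
    pthread_barrier_destroy(&barrier);
}
```

Because the stand-in doubles every row in each of the two passes and the transpose is performed twice, the sketch leaves every entry of `img` multiplied by four; replacing `row_op` with a genuine 1-D FFT yields the 2-D transform.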

Since the threads entering a barrier are intended to sleep until the barrier opens, a customized barrier should be built from pthread_cond_wait and pthread_cond_broadcast (and a mutex, of course). The implementation is straightforward: each thread entering the barrier increments a shared counter under the mutex; if the counter has not yet reached the specified number of threads, the thread goes to sleep on the condition variable; otherwise it is the last arrival, so it resets the counter and wakes all the others by broadcasting, at which point all threads resume their scheduled work (conceptually, they leave the barrier). Note that the counter needs the qualifier “volatile” only if some thread polls it outside the mutex; as long as every access is made while holding the mutex, an optimizing compiler such as g++ -O2 cannot cache the value across the lock, and the qualifier is unnecessary.
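A hand-rolled reusable barrier along these lines might look like the following sketch (the struct name and members are ours; the generation counter guards against spurious wakeups and lets the barrier be reused across stages):

```cpp
#include <pthread.h>

// A minimal custom barrier (a sketch). The last thread to arrive wakes the
// sleepers with pthread_cond_broadcast. Because every access to the shared
// counters happens while holding the mutex, no `volatile` is needed here.
struct MyBarrier {
    pthread_mutex_t mtx;
    pthread_cond_t cv;
    int arrived;     // threads currently waiting inside the barrier
    int generation;  // incremented each time the barrier opens (reusable)
    int total;       // number of threads the barrier waits for

    explicit MyBarrier(int n) : arrived(0), generation(0), total(n) {
        pthread_mutex_init(&mtx, nullptr);
        pthread_cond_init(&cv, nullptr);
    }

    void wait() {
        pthread_mutex_lock(&mtx);
        const int gen = generation;
        if (++arrived == total) {          // last arrival: open the barrier
            arrived = 0;
            ++generation;
            pthread_cond_broadcast(&cv);   // wake every sleeping thread
        } else {
            while (gen == generation)      // loop guards against spurious wakeups
                pthread_cond_wait(&cv, &mtx);
        }
        pthread_mutex_unlock(&mtx);
    }
};
```

Comparing the generation counter instead of the arrival counter is what makes the barrier safely reusable: a thread woken spuriously goes back to sleep until its own generation has ended.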