1 Introduction

Named as one of the top ten numerical algorithms of the twentieth century by SIAM News, the fast Fourier transform (FFT) is well known for reducing the computational complexity of the discrete Fourier transform (DFT) from $O(N^2)$ to $O(N \log N)$, where $N$ is the number of input samples. In this project, we use POSIX threads to further accelerate the computation of a 2-D FFT.
A size-$N$ 1-D DFT of a sequence of numbers $x[n]$ ($n = 0, \dots, N-1$) can be denoted by $\mathrm{DFT}_N(x[n])$. Each output sample of $\mathrm{DFT}_N(x[n])$ is denoted by $X[k]$, where $k = 0, \dots, N-1$. In the following, we add a superscript $k$ to $\mathrm{DFT}_N(x[n])$ to make the relationship between $X[k]$ and $\mathrm{DFT}_N(x[n])$ explicit: $X[k] = \mathrm{DFT}_N^k(x[n])$.
According to the DFT formula,

(1) $X[k] = \mathrm{DFT}_N^k(x[n]) = \sum_{n=0}^{N-1} x[n]\, W_N^{nk}$,

in which

$W_N^{nk} = e^{-2\pi j\, nk/N}$.
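To make the definition concrete, Eqn. (1) can be evaluated directly in $O(N^2)$ time. A minimal C++ sketch (the function name `dft` is ours, not part of the project's code):

```cpp
#include <complex>
#include <vector>
#include <cmath>
#include <cstddef>

// Naive O(N^2) evaluation of Eqn. (1): X[k] = sum_n x[n] * W_N^{nk},
// with W_N^{nk} = exp(-2*pi*j*n*k/N).
std::vector<std::complex<double>> dft(const std::vector<std::complex<double>>& x) {
    const std::size_t N = x.size();
    const double PI = std::acos(-1.0);
    std::vector<std::complex<double>> X(N);
    for (std::size_t k = 0; k < N; ++k) {
        std::complex<double> acc(0.0, 0.0);
        for (std::size_t n = 0; n < N; ++n) {
            double angle = -2.0 * PI * double(n * k) / double(N);
            acc += x[n] * std::complex<double>(std::cos(angle), std::sin(angle));
        }
        X[k] = acc;
    }
    return X;
}
```

The FFT derived below produces exactly these values, only faster.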
The possible values of $k$ can be separated into even ($k = 2s$) and odd ($k = 2s+1$) numbers. For even $k$, Eqn. (1) can be written as
$X[2s] = \sum_{n=0}^{N-1} x[n]\, W_N^{2ns} = \sum_{n=0}^{N/2-1} x[n]\, W_N^{2ns} + \sum_{n=N/2}^{N-1} x[n]\, W_N^{2ns} = \sum_{n=0}^{N/2-1} x[n]\, W_N^{2ns} + \sum_{n=0}^{N/2-1} x[n+N/2]\, W_N^{2(n+N/2)s}$.
Now that $W_N^{nk} = e^{-2\pi j\, nk/N}$,

$W_N^{2(n+N/2)s} = e^{-2\pi j \cdot 2(n+N/2)s/N} = e^{-2\pi j \cdot 2ns/N}\, e^{-2\pi j s} = e^{-2\pi j \cdot 2ns/N} = W_N^{2ns}$.
Note that this factor can be further rewritten in terms of a size-$N/2$ twiddle factor:

$e^{-2\pi j \cdot 2ns/N} = e^{-2\pi j\, ns/(N/2)} = W_{N/2}^{ns}$.
Based on these identities, $X[2s]$ above can be further simplified as

(2) $X[2s] = \sum_{n=0}^{N/2-1} x[n]\, W_N^{2ns} + \sum_{n=0}^{N/2-1} x[n+N/2]\, W_N^{2ns} = \sum_{n=0}^{N/2-1} \big(x[n] + x[n+N/2]\big) W_N^{2ns} = \sum_{n=0}^{N/2-1} \big(x[n] + x[n+N/2]\big) W_{N/2}^{ns} = \mathrm{DFT}_{N/2}^{s}\big(x[n] + x[n+N/2]\big)$.
Likewise, when $k$ is an odd number, we have

$X[2s+1] = \sum_{n=0}^{N-1} x[n]\, W_N^{n(2s+1)} = \sum_{n=0}^{N/2-1} x[n]\, W_N^{n(2s+1)} + \sum_{n=N/2}^{N-1} x[n]\, W_N^{n(2s+1)} = \sum_{n=0}^{N/2-1} x[n]\, W_N^{n(2s+1)} + \sum_{n=0}^{N/2-1} x[n+N/2]\, W_N^{(n+N/2)(2s+1)} = \sum_{n=0}^{N/2-1} x[n]\, W_N^{n(2s+1)} + \sum_{n=0}^{N/2-1} x[n+N/2]\, W_N^{n(2s+1)} W_N^{Ns+N/2} = \sum_{n=0}^{N/2-1} x[n]\, W_N^{n(2s+1)} + \sum_{n=0}^{N/2-1} x[n+N/2]\, W_N^{n(2s+1)} W_N^{N/2}$,

where the last step uses $W_N^{Ns} = e^{-2\pi j s} = 1$.
Because $W_N^{N/2} = e^{-\pi j} = -1$, the equation above becomes
(3) $X[2s+1] = \sum_{n=0}^{N/2-1} x[n]\, W_N^{n(2s+1)} - \sum_{n=0}^{N/2-1} x[n+N/2]\, W_N^{n(2s+1)} = \sum_{n=0}^{N/2-1} \big(x[n] - x[n+N/2]\big) W_N^{2ns} W_N^{n} = \sum_{n=0}^{N/2-1} \Big(\big(x[n] - x[n+N/2]\big) W_N^{n}\Big) W_{N/2}^{ns} = \mathrm{DFT}_{N/2}^{s}\Big(\big(x[n] - x[n+N/2]\big) W_N^{n}\Big)$.
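As a quick sanity check, for $N = 2$ (so $s = 0$ and $W_2^0 = 1$) Eqns. (2) and (3) reduce to the familiar two-point butterfly:

$X[0] = \mathrm{DFT}_1^0\big(x[0] + x[1]\big) = x[0] + x[1], \qquad X[1] = \mathrm{DFT}_1^0\big((x[0] - x[1])\, W_2^0\big) = x[0] - x[1]$.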

2 Implementation

The derivations of Eqns. (2) and (3) show that a size-$N$ DFT can be divided into two size-$N/2$ DFTs. If we repeat this strategy until $N = 1$, each remaining transform becomes trivial: according to $X[k] = \mathrm{DFT}_N^k(x[n]) = \sum_{n=0}^{N-1} x[n]\, W_N^{nk}$, the output of a size-1 DFT is identical to its input. Based on this result, we can implement the radix-2 decimation-in-frequency FFT (the Cooley–Tukey algorithm) in C++ by simply using arrays and loops.
figure Cooley_Tukey.png
Figure 1: The Cooley–Tukey algorithm
Fig. 1 illustrates the Cooley–Tukey algorithm in the case of $N = 8$. In practice, no matter how large $N$ is, a three-layer nested for loop suffices to implement the algorithm: the index of the outermost layer loops through the stages, the index of the middle layer loops through the blocks within a stage, and the index of the innermost layer loops through the entries within a block. The arithmetic relationships between the three indices can be read off the Cooley–Tukey diagram.
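The three nested loops can be sketched as follows; this is a minimal illustration matching the derivation above, not the project's exact code (the names `fft_stages`, `half`, and `blk` are ours). The results are left in bit-reversed order:

```cpp
#include <complex>
#include <vector>
#include <cmath>
#include <cstddef>

// Radix-2 decimation-in-frequency butterflies, computed in place.
// a.size() must be a power of two. After the three loops finish,
// the output samples sit in bit-reversed order.
void fft_stages(std::vector<std::complex<double>>& a) {
    const std::size_t N = a.size();
    const double PI = std::acos(-1.0);
    // Precompute W_N^n = exp(-2*pi*j*n/N) for n = 0 .. N/2-1.
    std::vector<std::complex<double>> W(N / 2);
    for (std::size_t n = 0; n < N / 2; ++n)
        W[n] = {std::cos(2 * PI * n / N), -std::sin(2 * PI * n / N)};

    for (std::size_t half = N / 2; half >= 1; half /= 2)     // stages
        for (std::size_t blk = 0; blk < N; blk += 2 * half)  // blocks
            for (std::size_t i = 0; i < half; ++i) {         // entries
                std::complex<double> u = a[blk + i];
                std::complex<double> v = a[blk + i + half];
                a[blk + i]        = u + v;                           // Eqn. (2)
                a[blk + i + half] = (u - v) * W[i * (N / (2 * half))]; // Eqn. (3)
            }
}
```

The twiddle stride `N / (2 * half)` reflects the identity $W_{N/2}^{n} = W_N^{2n}$: each smaller stage reuses every other entry of the precomputed table.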
We can also observe from the diagram that the inputs of each stage depend only on the outputs of the previous stage. Therefore, to save memory, only one size-$N$ array needs to be allocated to store the computation results. This array initially holds the input sequence, is then overwritten by the intermediate results of the middle stages, and is finally overwritten by the results of the last stage.
At this point, the output sequence stored in the array still needs rearrangement to yield the correct answer. For example, the second element (index 1) of the output sequence, $X[4]$, should be moved to the cell whose index is 4, and the seventh element (index 6), $X[3]$, should be moved to the cell whose index is 3. The rearrangement can be performed by reversing the binary representations of the indices of the output sequence. For instance, the index of the cell storing $X[4]$ is 1, whose binary representation is 001; the reverse of 001 is 100, indicating that this element should be moved to the cell whose index is 4.
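The index-reversal step can be sketched as the following in-place permutation (the helper name is ours):

```cpp
#include <complex>
#include <vector>
#include <cstddef>
#include <utility>

// Swap a[i] with a[rev(i)], where rev() reverses the log2(N)-bit
// representation of the index. a.size() must be a power of two.
// Swapping only when rev(i) > i touches each pair exactly once.
void bit_reverse_permute(std::vector<std::complex<double>>& a) {
    const std::size_t N = a.size();
    unsigned bits = 0;
    while ((std::size_t{1} << bits) < N) ++bits;  // bits = log2(N)
    for (std::size_t i = 0; i < N; ++i) {
        std::size_t r = 0;
        for (unsigned b = 0; b < bits; ++b)
            if (i & (std::size_t{1} << b))
                r |= std::size_t{1} << (bits - 1 - b);
        if (r > i) std::swap(a[i], a[r]);
    }
}
```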

3 Numerical efficiency and accuracy

As shown in Fig. 1, all possible $W_N^n$'s need to be calculated only once at the beginning, and some of them are reused in the following stages (because $W_{N/2}^n = W_N^{2n}$). Recall that $W_N^n$ is defined as $e^{-2\pi j n/N}$, so we can use the analytical form $e^{-2\pi j n/N} = \cos(2\pi n/N) - j\sin(2\pi n/N)$ to evaluate each $W_N^n$. However, this is not the most efficient way to calculate the $W_N^n$'s, given that cosine and sine functions are far more computationally expensive than additions or multiplications of floating-point numbers.
To avoid repetitive calls to sine and cosine, we can compute $W_N^1$ first and use the fact that $W_N^n = (W_N^1)^n$ to calculate all $W_N^n$'s. In this method, sine and cosine are called only when $W_N^1$ is evaluated, producing two floating-point numbers representing the real and imaginary parts of a complex number. Each subsequent $W_N^n$ is then obtained from $W_N^{n-1}$ with a single complex multiplication.
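A sketch of this recurrence (the function name is ours):

```cpp
#include <complex>
#include <vector>
#include <cmath>
#include <cstddef>

// Build W_N^n for n = 0 .. N/2-1 from a single sin/cos evaluation.
// N must be a power of two, N >= 2. Rounding errors accumulate as n
// grows, since each entry is derived from the previous one.
std::vector<std::complex<double>> twiddles_by_recurrence(std::size_t N) {
    const double PI = std::acos(-1.0);
    const std::complex<double> w1{std::cos(2 * PI / N), -std::sin(2 * PI / N)};
    std::vector<std::complex<double>> W(N / 2);
    W[0] = {1.0, 0.0};
    for (std::size_t n = 1; n < N / 2; ++n)
        W[n] = W[n - 1] * w1;  // W_N^n = W_N^(n-1) * W_N^1
    return W;
}
```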
However, this approach is not desirable from the perspective of numerical accuracy, because numerical errors accumulate from $W_N^1$ to $W_N^n$. A better approach that improves the numerical accuracy is illustrated in the following figure:
figure FFT_coefficients.png
Figure 2: FFT coefficients
In Fig. 2, $X_0 = W_N^1$, $X_1 = (X_0)^2$, $X_2 = (X_1)^2$, and $X_3 = (X_2)^2$; that is, $X_i = W_N^{2^i}$. With these terms available, any $W_N^n$ can be obtained by multiplying together the $X_i$'s selected by the binary representation of $n$. The advantage of this approach is twofold: 1) because sine and cosine are called only once, the method is computationally efficient, and 2) the numerical errors are distributed much more evenly over all $W_N^n$'s, since each coefficient is the product of at most $\log_2 N$ factors.
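A sketch of this squaring scheme (names are ours); for example, $W_N^5 = X_2 \cdot X_0$ because $5 = 101_2$:

```cpp
#include <complex>
#include <vector>
#include <cmath>
#include <cstddef>

// Build W_N^n for n = 0 .. N/2-1 from the squared terms X_i = W_N^(2^i).
// N must be a power of two, N >= 2. sin/cos are called exactly once, and
// each coefficient is the product of at most log2(N) factors.
std::vector<std::complex<double>> twiddles_by_squaring(std::size_t N) {
    const double PI = std::acos(-1.0);
    unsigned bits = 0;
    while ((std::size_t{1} << bits) < N) ++bits;  // bits = log2(N)

    std::vector<std::complex<double>> X(bits);    // X[i] = W_N^(2^i)
    X[0] = {std::cos(2 * PI / N), -std::sin(2 * PI / N)};
    for (unsigned i = 1; i < bits; ++i)
        X[i] = X[i - 1] * X[i - 1];               // successive squaring

    std::vector<std::complex<double>> W(N / 2);
    for (std::size_t n = 0; n < N / 2; ++n) {
        std::complex<double> w{1.0, 0.0};
        for (unsigned i = 0; i < bits; ++i)
            if ((n >> i) & 1) w *= X[i];          // pick X_i if bit i of n is set
        W[n] = w;
    }
    return W;
}
```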

4 2-D FFT

Recall that a 2-D DFT can be formulated as

$\hat{F}[k, l] = \frac{1}{MN} \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} F[m, n]\, e^{-j2\pi(km/M + ln/N)} = \frac{1}{M} \sum_{m=0}^{M-1} \Big[ \frac{1}{N} \sum_{n=0}^{N-1} F[m, n]\, e^{-j2\pi\, ln/N} \Big] e^{-j2\pi\, km/M}$,
which indicates that a 2-D FFT can be accomplished by two series of 1-D FFTs: the parts inside and outside the square bracket. In fact, these FFTs can be further expedited with parallelism, because the 1-D FFTs applied to different rows (and, in the second pass, different columns) are independent of one another.
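The row-column decomposition can be sketched serially as follows. This is an illustration, not the project's code: `dft_row` is a naive stand-in for the 1-D FFT derived earlier, the helper names are ours, and the $1/(MN)$ normalization of the formula above is applied at the end:

```cpp
#include <complex>
#include <vector>
#include <cmath>
#include <cstddef>

using cd  = std::complex<double>;
using mat = std::vector<std::vector<cd>>;

// Naive 1-D DFT of one row; a stand-in for the radix-2 FFT above.
static std::vector<cd> dft_row(const std::vector<cd>& x) {
    const double PI = std::acos(-1.0);
    const std::size_t N = x.size();
    std::vector<cd> X(N);
    for (std::size_t k = 0; k < N; ++k)
        for (std::size_t n = 0; n < N; ++n)
            X[k] += x[n] * std::polar(1.0, -2 * PI * double(k * n) / double(N));
    return X;
}

static mat transpose(const mat& A) {
    mat T(A[0].size(), std::vector<cd>(A.size()));
    for (std::size_t i = 0; i < A.size(); ++i)
        for (std::size_t j = 0; j < A[i].size(); ++j)
            T[j][i] = A[i][j];
    return T;
}

// Row-column 2-D DFT: transform the rows (the bracketed sum), transpose,
// transform the rows again (the outer sum), transpose back, then scale.
mat dft2d(mat F) {
    for (auto& row : F) row = dft_row(row);
    F = transpose(F);
    for (auto& row : F) row = dft_row(row);
    F = transpose(F);
    const double scale = double(F.size() * F[0].size());  // 1/(MN)
    for (auto& row : F)
        for (auto& v : row) v /= scale;
    return F;
}
```

The two transposes keep every 1-D pass operating on contiguous rows, which is also what makes the multithreaded decomposition below natural.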
figure 2D_FFT_multithreads.png
Figure 3: A 2-D FFT can be decomposed into four steps: 1) each thread executes the 1-D FFTs of several rows, 2) one thread is responsible for transposing the computation results of the previous step while the other threads stay idle, 3) the first step is repeated, and 4) the second step is repeated. All threads are synchronized with barriers between steps. This figure illustrates steps 1 and 2.
POSIX threads can be used to achieve this parallelism on a shared-memory system. As shown in Fig. 3, a user can allocate a single copy of the input data in memory and create multiple threads to carry out the 2-D FFT, which is decomposed into the following steps: 1) each thread executes the 1-D FFTs of several rows; 2) one thread transposes the results of the previous step while the other threads stay idle; 3) step 1 is repeated; and 4) step 2 is repeated.
Between consecutive steps, all threads must be synchronized to prevent data from being overwritten prematurely. To synchronize a number of threads, we can use the built-in pthread_barrier_wait function or simply implement our own barrier.
Since the threads entering a barrier are intended to stay asleep until they are released, a customized barrier should be built from the functions pthread_cond_wait and pthread_cond_broadcast (and mutexes, of course). The implementation is straightforward: the thread that has just arrived (and is therefore still awake) checks the total number of threads that have entered the barrier while the earlier arrivals sleep. Once that total reaches the specified value, this thread wakes all the others by broadcasting, and all threads then resume their scheduled work (conceptually, they leave the barrier). Note that if the counter is ever examined outside the protection of the mutex, it should be declared with the qualifier "volatile" to prevent undesirable optimization imposed by compilers, such as g++ -O2.
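A minimal sketch of such a barrier follows. The `Barrier` struct and its `generation` counter are our additions: the generation guard protects against spurious wakeups and makes the barrier reusable, and because every access to `count` happens while the mutex is held, this particular variant does not rely on volatile:

```cpp
#include <pthread.h>

// A counting barrier built from a mutex and a condition variable.
// `generation` guards against spurious wakeups and allows reuse.
struct Barrier {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    int count;       // threads that have entered the barrier so far
    int trip;        // threads required before anyone may leave
    int generation;  // bumped each time the barrier opens

    explicit Barrier(int n) : count(0), trip(n), generation(0) {
        pthread_mutex_init(&lock, nullptr);
        pthread_cond_init(&cond, nullptr);
    }
    ~Barrier() {
        pthread_cond_destroy(&cond);
        pthread_mutex_destroy(&lock);
    }

    void wait() {
        pthread_mutex_lock(&lock);
        const int gen = generation;
        if (++count == trip) {              // last thread to arrive
            count = 0;
            ++generation;
            pthread_cond_broadcast(&cond);  // wake every sleeping thread
        } else {
            while (gen == generation)       // sleep until released
                pthread_cond_wait(&cond, &lock);
        }
        pthread_mutex_unlock(&lock);
    }
};
```

In the 2-D FFT above, each thread would call `wait()` between the row-FFT, transpose, column-FFT, and final-transpose steps.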