1 Introduction

In this project, we use the Message Passing Interface (MPI) to accelerate the computation of a 2-D discrete Fourier transform (DFT).

Frequently used in digital image processing applications, a 2-D DFT can be mathematically represented as follows:

\[
G[k, l] \;=\; \frac{1}{MN} \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} F[m, n]\, e^{-j 2\pi \left( \frac{k m}{M} + \frac{l n}{N} \right)}. \tag{1}
\]
In this equation, F[m, n] is an input element, G[k, l] is an output element, and M and N are the sizes of the m and n dimensions, respectively.
The representation above can be rewritten as
\begin{align*}
G[k, l] \;&=\; \frac{1}{MN} \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} F[m, n]\, e^{-j 2\pi \left( \frac{k m}{M} + \frac{l n}{N} \right)} \\
\;&=\; \frac{1}{M} \sum_{m=0}^{M-1} \left[ \frac{1}{N} \sum_{n=0}^{N-1} F[m, n]\, e^{-j 2\pi \frac{l n}{N}} \right] e^{-j 2\pi \frac{k m}{M}} \\
\;&=\; A \left( B F^{T} \right)^{T},
\end{align*}
in which A and B are two Vandermonde matrices defined as
\[
A \;=\; \frac{1}{M}
\begin{pmatrix}
\left. e^{-j 2\pi \frac{k m}{M}} \right|_{k=0,\,m=0} & \left. e^{-j 2\pi \frac{k m}{M}} \right|_{k=0,\,m=1} & \cdots & \left. e^{-j 2\pi \frac{k m}{M}} \right|_{k=0,\,m=M-1} \\
\left. e^{-j 2\pi \frac{k m}{M}} \right|_{k=1,\,m=0} & \left. e^{-j 2\pi \frac{k m}{M}} \right|_{k=1,\,m=1} & \cdots & \left. e^{-j 2\pi \frac{k m}{M}} \right|_{k=1,\,m=M-1} \\
\vdots & \vdots & \ddots & \vdots \\
\left. e^{-j 2\pi \frac{k m}{M}} \right|_{k=M-1,\,m=0} & \left. e^{-j 2\pi \frac{k m}{M}} \right|_{k=M-1,\,m=1} & \cdots & \left. e^{-j 2\pi \frac{k m}{M}} \right|_{k=M-1,\,m=M-1}
\end{pmatrix}
\]
\[
B \;=\; \frac{1}{N}
\begin{pmatrix}
\left. e^{-j 2\pi \frac{l n}{N}} \right|_{l=0,\,n=0} & \left. e^{-j 2\pi \frac{l n}{N}} \right|_{l=0,\,n=1} & \cdots & \left. e^{-j 2\pi \frac{l n}{N}} \right|_{l=0,\,n=N-1} \\
\left. e^{-j 2\pi \frac{l n}{N}} \right|_{l=1,\,n=0} & \left. e^{-j 2\pi \frac{l n}{N}} \right|_{l=1,\,n=1} & \cdots & \left. e^{-j 2\pi \frac{l n}{N}} \right|_{l=1,\,n=N-1} \\
\vdots & \vdots & \ddots & \vdots \\
\left. e^{-j 2\pi \frac{l n}{N}} \right|_{l=N-1,\,n=0} & \left. e^{-j 2\pi \frac{l n}{N}} \right|_{l=N-1,\,n=1} & \cdots & \left. e^{-j 2\pi \frac{l n}{N}} \right|_{l=N-1,\,n=N-1}
\end{pmatrix}.
\]
In other words, we can accomplish the computation of a 2-D DFT by conducting two series of 1-D DFTs: B F^T and the multiplication between A and (B F^T)^T.
As the equations above suggest, a 1-D DFT is slow to compute because its computational complexity is O(N²), where N is the number of input samples. However, because matrix multiplication consists of independent inner products, we can separate the computation required for a 2-D DFT into pieces, distribute the pieces to several processes for completion, and collect the results back. This task can be accomplished through the Message Passing Interface (MPI).

2 MPI for 2-D DFT

Although MPI was designed for distributed computing, some MPI libraries also support parallel computation on a single personal computer. In this project, we use the MPICH2 library to parallelize a 2-D DFT on a personal computer.
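Assuming the program lives in a single source file such as dft2d_mpi.c (a hypothetical file name), a typical MPICH2 build-and-run session on one machine might look like:

```shell
# Compile with the MPI compiler wrapper shipped with MPICH2.
mpicc -O2 -o dft2d_mpi dft2d_mpi.c -lm

# Launch four local processes; MPICH2 also accepts mpirun as a synonym.
mpiexec -n 4 ./dft2d_mpi
```

The process count passed to mpiexec determines how the rows are divided among processes in the steps below.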
  1. The first step is to configure each process to perform the 1-D DFTs of several rows, as shown in Fig. 1. From the perspective of matrix multiplication, each process calculates the product of B and several columns of F^T.
    figure First_1DDFT.png
    Figure 1 Each process has a complete copy of the input data but only computes the 1-D DFTs of several rows.
  2. The computation results are shown as the colored bars within the blocks in Fig. 2. These results are segmented into data fragments and exchanged between processes using nonblocking sends and receives to avoid deadlock during message passing. The segments passed into and out of a given process are determined by the ensuing computation assigned to that process, namely its portion of the matrix multiplication between A and (B F^T)^T. In the demonstration of Fig. 2, process 0 passes its second and third segments to processes 1 and 2, respectively, and receives the first segments generated by processes 1 and 2. Likewise, process 1 passes its first and third segments to processes 0 and 2, respectively, and receives the second segments generated by processes 0 and 2.
    figure distribute_results.png
    Figure 2 The computation results of step 1 are shown in color in each process. The computation result in one process is further segmented into data fragments (painted with different gradients of the same color). These data fragments are passed between processes for the second 1-D DFTs.
  3. After all the data exchanges have finished (MPI_Waitall is used to monitor all the nonblocking operations), each process reorganizes the data to reflect the outer transpose in (B F^T)^T and subsequently performs another 1-D DFT, as shown in Fig. 3.
    figure Second_1DDFT.png
    Figure 3 MPI_Waitall is used to monitor all the nonblocking sends and receives applied to each process. Afterwards, the second 1-D DFT is performed in each process, as denoted by the curly arrows.
  4. At this point, the computation required for the 2-D DFT has been completed collaboratively by all processes. The next step is to gather the distributed results to a single process for output using MPI_Gather, as shown in Fig. 4.
    figure gather_results.png
    Figure 4 All the computation results produced in each process are gathered to a single process.
  5. Finally, the collected results require one more transpose operation to become the correct 2-D DFT output.
    figure final_result.png
    Figure 5 Transpose the data collected to process 0.
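The five steps above can be sketched end to end. The program below is a minimal illustration under our own assumptions, not the project's actual source: it uses a square M×M input whose row count divides evenly by the process count, gives every rank a full copy of the input (as in Fig. 1), describes each exchanged segment with an MPI vector datatype, and uses buffer names of our choosing. It needs an MPI installation and a launcher (e.g., mpiexec -n 4) to run.

```c
#include <mpi.h>
#include <complex.h>
#include <math.h>
#include <stdio.h>

#define M 8   /* square M x M input; M is assumed divisible by the process count */

/* Naive 1-D DFT with 1/len scaling, as in Eq. (1). */
static void dft1d(const double complex *in, double complex *out, int len) {
    const double PI = acos(-1.0);
    for (int k = 0; k < len; k++) {
        double complex acc = 0;
        for (int n = 0; n < len; n++)
            acc += in[n] * cexp(-I * 2.0 * PI * k * n / len);
        out[k] = acc / len;
    }
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, P;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &P);
    int S = M / P;                          /* rows handled per process */

    double complex F[M][M];                 /* full input copy on every rank */
    for (int m = 0; m < M; m++)
        for (int n = 0; n < M; n++)
            F[m][n] = m + 2.0 * n + 1.0;    /* arbitrary sample input */

    /* Step 1: 1-D DFTs of this rank's S rows. */
    double complex rows[S][M];
    for (int r = 0; r < S; r++)
        dft1d(F[rank * S + r], rows[r], M);

    /* Step 2: nonblocking exchange. Segment p of each local row goes to
       rank p; we receive the matching segments computed by every rank. */
    double complex recvbuf[P][S][S];        /* [source rank][their row][column] */
    MPI_Request reqs[2 * P];
    int nreq = 0;
    MPI_Datatype seg;                       /* S columns out of each of S rows */
    MPI_Type_vector(S, S, M, MPI_C_DOUBLE_COMPLEX, &seg);
    MPI_Type_commit(&seg);
    for (int p = 0; p < P; p++) {
        MPI_Isend(&rows[0][p * S], 1, seg, p, 0, MPI_COMM_WORLD, &reqs[nreq++]);
        MPI_Irecv(recvbuf[p], S * S, MPI_C_DOUBLE_COMPLEX, p, 0,
                  MPI_COMM_WORLD, &reqs[nreq++]);
    }

    /* Step 3: wait for all transfers, reassemble S full columns
       (rows of (B F^T)^T), and run the second 1-D DFTs. */
    MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);
    double complex cols[S][M], out[S][M];
    for (int c = 0; c < S; c++)
        for (int p = 0; p < P; p++)
            for (int r = 0; r < S; r++)
                cols[c][p * S + r] = recvbuf[p][r][c];
    for (int c = 0; c < S; c++)
        dft1d(cols[c], out[c], M);

    /* Step 4: gather every rank's transformed columns on rank 0. */
    double complex gathered[M][M];
    MPI_Gather(out, S * M, MPI_C_DOUBLE_COMPLEX,
               gathered, S * M, MPI_C_DOUBLE_COMPLEX, 0, MPI_COMM_WORLD);

    /* Step 5: `gathered` holds the transform transposed (its rows are
       columns of G), so rank 0 applies one final transpose. */
    if (rank == 0) {
        double complex G[M][M];
        for (int k = 0; k < M; k++)
            for (int l = 0; l < M; l++)
                G[k][l] = gathered[l][k];
        printf("G[1][2] = %g%+gi\n", creal(G[1][2]), cimag(G[1][2]));
    }

    MPI_Type_free(&seg);
    MPI_Finalize();
    return 0;
}
```

Note that each rank also sends a segment to itself in step 2; pairing a nonblocking send with a nonblocking receive makes this self-transfer safe and keeps the exchange loop uniform.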