1 Introduction
In this project, we use the message passing interface (MPI) to accelerate the computation of a 2D discrete Fourier transform (DFT).
Frequently used in digital image processing applications, a 2D DFT can be mathematically represented as follows:
F̂(k, l)
=
(1)/(√(MN))^{M − 1}⎲⎳_{m = 0}^{N − 1}⎲⎳_{n = 0}F[m, n]e^{ − j2π(k(m)/(M) + l(n)/(N))}.
In this equation,
F[m, n] represents each input element,
F̂(k, l) stands for each output element,
N is the dimension of variable
n, and
M is the dimension of variable
m.
The representation above can be written as
F̂(k, l)
=
(1)/(√(MN))^{M − 1}⎲⎳_{m = 0}^{N − 1}⎲⎳_{n = 0}F[m, n]e^{ − j2π(k(m)/(M) + l(n)/(N))}
=
(1)/(√(M))^{M − 1}⎲⎳_{m = 0}[(1)/(√(N))^{N − 1}⎲⎳_{n = 0}F[m, n]e^{ − j2π(l(n)/(N))}]e^{ − j2π(k(m)/(M))}
=
A(BF^{T})^{T},
in which
A and
B are two Vandermonde matrices defined as
A
= (1)/(√(M))
e^{ − j2π(k(m)/(M))}_{k = 0, m = 0}
e^{ − j2π(k(m)/(M))}_{k = 0, m = 1}
⋯
e^{ − j2π(k(m)/(M))}_{k = 0, m = M − 1}
e^{ − j2π(k(m)/(M))}_{k = 1, m = 0}
e^{ − j2π(k(m)/(M))}_{k = 1, m = 1}
⋯
e^{ − j2π(k(m)/(M))}_{k = 1, m = M − 1}
⋮
⋮
⋮
⋮
e^{ − j2π(k(m)/(M))}_{k = M − 1, m = 0}
e^{ − j2π(k(m)/(M))}_{k = M − 1, m = 1}
⋯
e^{ − j2π(k(m)/(M))}_{k = M − 1, m = M − 1}
and
B
=
(1)/(√(N))
e^{ − j2π(l(n)/(N))}_{l = 0, n = 0}
e^{ − j2π(l(n)/(N))}_{l = 0, n = 1}
⋯
e^{ − j2π(l(n)/(N))}_{l = 0, n = N − 1}
e^{ − j2π(l(n)/(N))}_{l = 1, n = 0}
e^{ − j2π(l(n)/(N))}_{l = 1, n = 1}
⋯
e^{ − j2π(l(n)/(N))}_{l = 1, n = N − 1}
⋮
⋮
⋮
⋮
e^{ − j2π(l(n)/(N))}_{l = N − 1, n = 0}
e^{ − j2π(l(n)/(N))}_{l = N − 1, n = 1}
⋯
e^{ − j2π(l(n)/(N))}_{l = N − 1, n = N − 1}
.
In other words, we can accomplish the computation of a 2D DFT by conducting two series of 1D DFTs:
BF^{T} and the multiplications between
A and
(BF^{T})^{T}.
As can be understood from the equations above, conducting a 1D DFT is very slow because its computation complexity is O(N^{2}), in which N is the number of input samples. However, given that the multiplications between two matrices is performed through independent inner products, we separate the computation required for a 2D DFT into pieces, distribute them to several processes for completion, and collect the results back. This task can be fulfilled through the Message Passing Interface (MPI).
2 MPI for 2D DFT
Although MPI was specified for distributed computing, some MPI libraries support the parallel computation on personal computers. In this project, we use the MPICH2 MPI library to parallelize a 2D DFT on a personal computer.

The first step is to configure each process to perform 1D DFTs of several rows, as shown in Fig. 1↓. From the perspective of matrix multiplications, each process calculates the multiplications between B and several columns of F^{T}.

The computation results are shown as the color bars within the blocks in Fig. 2↓. These results are segmented into data fragments and exchanged between processes using nonblocking sends and receives to avoid deadlocks of message passing. The data segments passed in and out a certain process are determined by the ensuing computation (partial computation of the ensuing matrix multiplication, A and (BF^{T})^{T}) configured to the process. In the demonstration of Fig. 2↓, process 0 passes out the second and third segments to processes 1 and 2, respectively. In addition, process 0 receives both first segments generated by processes 1 and 2. Likewise, process 1 passes out the first and third segments to processes 0 and 2, respectively, and receives both second segments generated by processes 0 and 2.

After all the data exchanges are finished (MPI_Waitall is used to monitor all the nonblocking activities), each process reorganizes the the data to reflect the outer transpose operation of (BF^{T})^{T} and subsequently performs another 1D DFT, as shown in Fig. 3↓.

The required computation for a 2D DFT has been entirely completed by the collaborations between all processes. The next step is to collect the distributedly computed results back to a certain process for output using “MPI_Gather,” as shown in Fig. 4↓.

Finally, the collected results require one more transpose operation to become the correct answer.