How to write code for 3D FFT in parallel processors


Procedures

1. Decompose the original Nx*Ny*Nz data into N slabs of size (Nx, Ny, Nz/N), where N is the number of processors, and store each slab on a distinct GPU.
2. On each slab, perform Nz/N batched 2D FFTs in the x-y plane, or, equivalently, perform Nz*Ny/N 1D FFTs along the X coordinate followed by Nz*Nx/N 1D FFTs along the Y coordinate (a plan-setup sketch follows this list).
3. Perform a global transpose (an all-to-all exchange) so that every processor holds complete columns along Z.
4. Perform 1D FFTs along the Z coordinate.
5. Transfer the data back to its original layout.
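
The batched cuFFT plans used in the pseudocode below (plan_2D for step 2 and plan_1D for step 4) are not created on this page. The following is a minimal sketch of how they might be set up, assuming the local slab is stored with z as the slowest-varying index (so every x-y slice is contiguous) and that, after the global transpose, each z column is contiguous; this layout and the use of cufftPlanMany/cufftPlan1d are assumptions for illustration, not necessarily the original implementation.

// Sketch: plan creation for the batched transforms (data layout assumed as described above)
cufftHandle plan_2D, plan_1D;

// Nz/numprocs batched 2D FFTs over contiguous Ny-by-Nx x-y slices
int n2[2] = {Ny, Nx};                           // sizes from slowest to fastest dimension
cufftPlanMany(&plan_2D, 2, n2,
              NULL, 1, Nx*Ny,                   // input: contiguous slices
              NULL, 1, Nx*Ny,                   // output: same layout
              CUFFT_Z2Z, Nz/numprocs);

// Nx*Ny/numprocs batched 1D FFTs of length Nz along contiguous z columns (used after the transpose)
cufftPlan1d(&plan_1D, Nz, CUFFT_Z2Z, Nx*Ny/numprocs);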

The two figures below illustrate the steps described above.
3d fft 1.png
Figure 1: all-to-all communication (Retrieved from: Roland Schulz, "3D FFT with 2D decomposition", 2008)

3d fft 2.png
Figure 2: data separation and transfer (Retrieved from: http://cecs.anu.edu.au/files/posters/2012/u4408011.pdf)

The pseudocode:

// Do 2D FFT in x-y dimension
cufftExecZ2Z(plan_2D,p_d,p_d,CUFFT_FORWARD);
// Copy data from the GPU to the CPU; numprocs is the number of processors running in parallel
cudaMemcpy(p_h,p_d,sizeof(_Complex double)*Nx*Ny*Nz/numprocs,cudaMemcpyDeviceToHost);

// Do a transpose and store the data into numprocs arrays, each of size Nx*Ny*Nz/(numprocs*numprocs), as Figure 2 shows
transpose(p_h, p_part_h);
// Make sure the transpose operation has finished on every processor
MPI_Barrier(MPI_COMM_WORLD);
// Transfer the data with all-to-all communication, as Figure 1 shows
for(int i=0; i<numprocs; i++){
    if(i!=rank)
        MPI_Sendrecv(&p_part_h[i][0], Nx*Ny*Nz/(numprocs*numprocs), MPI_DOUBLE_COMPLEX, i, 0,
                     &p_part_receive_h[i][0], Nx*Ny*Nz/(numprocs*numprocs), MPI_DOUBLE_COMPLEX, i, 0,
                     MPI_COMM_WORLD, &stat);
}
// Make sure the transfer operation has finished on every processor
MPI_Barrier(MPI_COMM_WORLD);
// Coalesce the distinct chunks from the different arrays into one array
coalescing(p_h, p_part_h, p_part_receive_h);
// Copy data from the CPU to the GPU
cudaMemcpy(p_d, p_h, sizeof(_Complex double)*Nx*Ny*Nz/numprocs, cudaMemcpyHostToDevice);
// Do 1D FFTs along the z coordinate; after this step the 3D FFT computation is finished
cufftExecZ2Z(plan_1D, p_d, p_d, CUFFT_FORWARD);

// Send the data back to its original position on the CPU
cudaMemcpy(p_h, p_d, sizeof(_Complex double)*Nx*Ny*Nz/numprocs, cudaMemcpyDeviceToHost);
transpose(p_h, p_part_h);
MPI_Barrier(MPI_COMM_WORLD);
for(int i=0; i<numprocs; i++){
    if(i!=rank)
        MPI_Sendrecv(&p_part_h[i][0], Nx*Ny*Nz/(numprocs*numprocs), MPI_DOUBLE_COMPLEX, i, 0,
                     &p_part_receive_h[i][0], Nx*Ny*Nz/(numprocs*numprocs), MPI_DOUBLE_COMPLEX, i, 0,
                     MPI_COMM_WORLD, &stat);
}
MPI_Barrier(MPI_COMM_WORLD);
coalescing(p_h, p_part_h, p_part_receive_h);
// Copy the FFT result back to the GPU
cudaMemcpy(p_d, p_h, sizeof(_Complex double)*Nx*Ny*Nz/numprocs, cudaMemcpyHostToDevice);
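
The helpers transpose() and coalescing() are called above but not defined. The sketch below shows one possible implementation of the data movement they perform for the forward redistribution (the reverse step would need the inverse mapping), assuming Nx, Ny, Nz, numprocs, and rank are globally visible, the local slab p_h is stored z-slowest as [zl][y][x], and after redistribution each rank owns a contiguous block of y indices with full, z-contiguous columns (matching the batched 1D plan). The layouts are assumptions for illustration, not the group's actual code.

// Split the local slab into numprocs chunks, one destined for each rank
void transpose(_Complex double *p_h, _Complex double **p_part_h)
{
    int Nzl = Nz/numprocs, Nyl = Ny/numprocs;
    for(int r = 0; r < numprocs; r++)            // chunk destined for rank r: its y range, our z range
        for(int zl = 0; zl < Nzl; zl++)
            for(int yl = 0; yl < Nyl; yl++)
                for(int x = 0; x < Nx; x++)
                    p_part_h[r][(zl*Nyl + yl)*Nx + x] =
                        p_h[(zl*Ny + r*Nyl + yl)*Nx + x];
}

// Merge the received chunks into one array with contiguous z columns
void coalescing(_Complex double *p_h, _Complex double **p_part_h,
                _Complex double **p_part_receive_h)
{
    int Nzl = Nz/numprocs, Nyl = Ny/numprocs;
    for(int s = 0; s < numprocs; s++){           // chunk coming from rank s covers z in [s*Nzl, (s+1)*Nzl)
        _Complex double *src = (s == rank) ? p_part_h[s] : p_part_receive_h[s];
        for(int zl = 0; zl < Nzl; zl++)
            for(int yl = 0; yl < Nyl; yl++)
                for(int x = 0; x < Nx; x++)
                    // store z-fastest so each z column of length Nz is contiguous
                    p_h[(yl*Nx + x)*Nz + s*Nzl + zl] =
                        src[(zl*Nyl + yl)*Nx + x];
    }
}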

If all of the data fits on a single GPU, all of the procedures discussed above are equivalent to the following single line of code:

cufftExecZ2Z(plan_3D, p_d, p_d, CUFFT_FORWARD);
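
For reference, the plan_3D used in that one-liner would be created with cufftPlan3d; a minimal sketch, assuming the full Nx*Ny*Nz array already resides in p_d on one device with the same z-slowest layout:

cufftHandle plan_3D;
cufftPlan3d(&plan_3D, Nz, Ny, Nx, CUFFT_Z2Z);   // sizes given from slowest to fastest dimension
cufftExecZ2Z(plan_3D, p_d, p_d, CUFFT_FORWARD);
cudaDeviceSynchronize();                        // cuFFT launches are asynchronous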

Issue

For this kind of parallel FFT algorithm executed on multiple GPUs, the all-to-all communication between CPUs occupies about 60% of the total time, and the transpose, coalescing, and cudaMemcpy operations occupy about another 30%, while the 1D and 2D cuFFT calls themselves take only 5%~10% of the total time. The overhead of this method is therefore extremely high.
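
A breakdown like this can be measured by bracketing each phase with MPI_Wtime, synchronizing the device first since cuFFT launches return asynchronously; a minimal sketch with illustrative variable names:

double t0 = MPI_Wtime();
cufftExecZ2Z(plan_2D, p_d, p_d, CUFFT_FORWARD);
cudaDeviceSynchronize();                        // wait for the batched 2D FFTs to finish
double t_fft2d = MPI_Wtime() - t0;

t0 = MPI_Wtime();
// ... transpose, MPI_Sendrecv loop, coalescing, cudaMemcpy ...
double t_comm_and_copy = MPI_Wtime() - t0;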