Talk:How to write code for 3D FFT in parallel processors
I read the paper "Large-Scale FFT on GPU Clusters", which describes improvements that could be applied to your implementation. However, since you said that data transfer takes nearly 90% of the run time, this method may not give a significant speedup; you may want to try it and compare. In your implementation, the transpositions on the CPU cost a lot of performance, and doing them on the GPU may be faster. In that case, though, the device memory accesses are "non-coalesced" and extremely slow, so the paper suggests making certain adjustments to the data layout during the transfer from host memory to device memory. You can check the paper for the implementation details. There are also some minor improvements described in that paper. I don't know how much they would help. --Xinhong (talk) 01:49, 1 July 2014 (HKT)
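To make the "adjustment during transfer" idea concrete, here is a minimal sketch of one possible reading: reorder the elements while staging the copy, so that the data arrives already transposed and later reads are contiguous. It is plain Python on flat lists and only illustrates the indexing. It is not the paper's code and not the implementation discussed above; the function name `transposing_copy` and the `tile` parameter are made up for this example.

```python
def transposing_copy(src, rows, cols, tile=4):
    """Copy a row-major rows x cols array into a row-major cols x rows array.

    The data is walked tile by tile so that reads and writes both stay
    inside a small window. A GPU version would apply the same idea to
    keep memory accesses contiguous (assumption: this is only an
    illustration of the indexing, not a tuned transfer routine).
    """
    if len(src) != rows * cols:
        raise ValueError("src length must equal rows * cols")
    dst = [None] * (rows * cols)
    for r0 in range(0, rows, tile):
        for c0 in range(0, cols, tile):
            # Copy one tile, swapping the roles of row and column.
            for r in range(r0, min(r0 + tile, rows)):
                for c in range(c0, min(c0 + tile, cols)):
                    dst[c * rows + r] = src[r * cols + c]
    return dst


# A 2 x 3 array stored row by row: [[0, 1, 2], [3, 4, 5]].
staged = transposing_copy([0, 1, 2, 3, 4, 5], rows=2, cols=3)
print(staged)  # [0, 3, 1, 4, 2, 5], i.e. [[0, 3], [1, 4], [2, 5]]
```

Whether this helps in practice depends on how much of the 90% transfer time can overlap with the reordering, so it would need to be measured against the current CPU transposition.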