Speaker
John Bowman
(University of Alberta)
Description
The matrix transpose is an essential primitive of high-performance parallel computing. In plasma physics and fluid dynamics, a matrix transpose is used to localize the computation of the multidimensional fast Fourier transform (FFT), the engine that powers the pseudospectral collocation method.
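As a minimal single-process sketch of why the transpose localizes the computation, a 2D FFT can be written as a pass of 1D FFTs along rows, a transpose, and a second pass of row FFTs. NumPy and the function name `fft2_via_transpose` are assumptions made for illustration; this is not the implementation discussed in the talk. In a distributed run, the transpose is the only step that requires communication.

```python
import numpy as np

def fft2_via_transpose(a):
    """2D FFT using only row-wise (last-axis) 1D FFTs plus transposes."""
    # Pass 1: transform each row; every row is local and contiguous.
    b = np.fft.fft(a, axis=-1)
    # Transpose so the former columns become rows
    # (this is the communication step in a distributed setting).
    b = np.ascontiguousarray(b.T)
    # Pass 2: transform each row again, i.e. the original columns.
    b = np.fft.fft(b, axis=-1)
    # Transpose back to the original layout.
    return np.ascontiguousarray(b.T)

rng = np.random.default_rng(0)
a = rng.standard_normal((4, 6)) + 1j * rng.standard_normal((4, 6))
result = fft2_via_transpose(a)

# Compare against NumPy's built-in 2D FFT.
print(np.allclose(result, np.fft.fft2(a)))
```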
An adaptive parallel matrix transpose algorithm, optimized for distributed multicore architectures running in a hybrid OpenMP/MPI configuration, is presented. Significant speedups are observed relative to the distributed transpose used in the state-of-the-art adaptive FFTW library. In some cases, a hybrid configuration reduces communication costs by reducing the number of MPI nodes and thereby increasing message sizes. This also allows a more slab-like than pencil-like domain decomposition for multidimensional fast Fourier transforms, reducing the cost of, or even eliminating the need for, a second distributed transpose. Nonblocking all-to-all transfers enable user computation and communication to be overlapped.
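The relationship between the number of processes and the message size can be illustrated with a toy simulation of a slab-decomposed transpose. This is only a sketch under stated assumptions: the "ranks" are plain NumPy arrays in one process, the all-to-all exchange is mimicked by list indexing rather than MPI, and the name `slab_transpose` is hypothetical. It is not the adaptive algorithm presented in the talk.

```python
import numpy as np

def slab_transpose(slabs):
    """Transpose a matrix stored as P row slabs, mimicking an all-to-all exchange."""
    p = len(slabs)
    # Each "rank" r splits its slab into P column blocks;
    # block s plays the role of the message sent from rank r to rank s.
    messages = [np.array_split(slab, p, axis=1) for slab in slabs]
    # Rank s receives block s from every rank r, transposes each block locally,
    # and joins them side by side to form its row slab of the transposed matrix.
    return [np.hstack([messages[r][s].T for r in range(p)]) for s in range(p)]

n = 8
a = np.arange(n * n, dtype=float).reshape(n, n)
for p in (2, 4):
    slabs = np.array_split(a, p, axis=0)      # row slabs, one per simulated rank
    out = np.vstack(slab_transpose(slabs))    # reassemble the distributed result
    assert np.array_equal(out, a.T)
    # With n divisible by p, every message is an (n/p) x (n/p) block.
    print(p, "ranks:", (n // p) ** 2, "elements per message")
```

In this toy setting, halving the number of ranks quadruples the size of each message while reducing the number of messages, which is the trade-off the hybrid configuration exploits.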
We apply adaptive matrix transposition algorithms on hybrid architectures to the parallelization of implicitly dealiased pseudospectral convolutions used to simulate turbulent flow. Implicit dealiasing outperforms conventional zero padding by decoupling the data and temporary work arrays. Parallelized versions of our implicit dealiasing algorithms for hybrid architectures are publicly available in the open-source library FFTW++.
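For context, the conventional zero-padding baseline mentioned above can be sketched in one dimension as follows. This shows explicit padding only; it is not the implicit dealiasing algorithm and does not use the FFTW++ API. NumPy and the function names `convolve_zero_padded` and `convolve_direct` are assumptions made for illustration.

```python
import numpy as np

def convolve_zero_padded(f, g):
    """First m terms of the linear convolution of two length-m sequences,
    computed with FFTs on explicitly zero-padded copies of the inputs."""
    m = len(f)
    # Passing n=2*m makes NumPy zero-pad each input to length 2m, so the
    # cyclic wrap-around (aliasing) cannot contaminate the first m outputs.
    F = np.fft.fft(f, n=2 * m)
    G = np.fft.fft(g, n=2 * m)
    return np.fft.ifft(F * G)[:m]

def convolve_direct(f, g):
    """Reference O(m^2) evaluation of sum_{p=0}^{k} f[p] * g[k-p] for k < m."""
    m = len(f)
    return np.array([sum(f[p] * g[k - p] for p in range(k + 1)) for k in range(m)])

rng = np.random.default_rng(1)
f = rng.standard_normal(8) + 1j * rng.standard_normal(8)
g = rng.standard_normal(8) + 1j * rng.standard_normal(8)
print(np.allclose(convolve_zero_padded(f, g), convolve_direct(f, g)))
```

Here the padded arrays hold both the input data and the extra work space, which is the coupling that implicit dealiasing is described as removing.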
Author
John Bowman
(University of Alberta)
Co-author
Malcolm Roberts
(University of Strasbourg)