Introduction

Since the porting of the particle part of ORB5 to GPUs, the fraction of time spent in the field part has increased substantially. In particular, the main bottleneck is the parallel matrix transpose required to perform the 2D DFT of the toroidal and poloidal dimensions for the RHS (charge and current spline coefficients).

In the original implementation, the pptransp routine performs this parallel data transpose using pairwise exchanges in the form of MPI sendrecv. As an improvement to this method, two new algorithms have been implemented. The first concerns directly the pptransp routine and uses MPI derived types to perform a zero-copy parallel transpose. The second method split the toroidal DFT into local partial DFTs that are then reduced and scattered back.

Local and partial DFT algorithm

In ORB5, the Poisson and Ampere equations are solved by using an equivalent system corresponding to the 2D (toroidal and poloidal) Fourier transform of their spline representation.

Given the RHS B-splines coefficients \(\rho_{ijk}\), their 2D DFT is given by the composition of a poloidal and toroidal transfrom:

\[ \hat{\rho}_{ik}^{m} = \sum_{j=0}^{N_\theta-1}\rho_{ijk}\exp\left(-\frac{2\pi \imath j m }{N_\theta}\right), \quad \forall m \in [m_{1}^{f}, m_{2}^{f}], \]

where \(m_{1}^{f}\) and \(m_{2}^{f}\) are the lower and upper bounds of the poloidal filter.

\[ \hat{\hat{\rho}}_{i}^{nm} = \sum_{k=0}^{N_\varphi-1}\hat{\rho}_{ik}^{m}\exp\left(-\frac{2\pi \imath k n }{N_\varphi}\right), \quad \forall n \in [n_{1}^{f}, n_{2}^{f}], \]

where \(n_{1}^{f}\) and \(n_{2}^{f}\) are the lower and upper bounds of the toroidal filter. Typically, one uses a diagonal filter for the poloidal modes, i.e.

\[ \forall n \in [n_{1}^{f}, n_{2}^{f}], \quad m \in [-nq(s)-\Delta m, -nq(s)+\Delta m], \]

with \(\Delta m \sim 5\).

Current implementation

In the original implementation, the Fourier transformation is done according to the following workflow:

Compute charge/current deposition in real space and store it into rho(s, th, philoc)
Transpose rho and store it as complex numbers:
rho(s, th, phi) \(\longrightarrow\) crhs(th, s, philoc)
DFT in \(\theta\):
crhs(th, s, philoc) \(\longrightarrow\) crhs(m, s, philoc)
Parallel matrix transpose using pptransp:
crhs(th, s, philoc) \(\longrightarrow\) crhs_fft(phi, s, mloc)
DFT in \(\varphi\):
crhs_fft(phi, s, mloc) \(\longrightarrow\) crhs_fft(n, s, mloc)
Filtering
Final parallel transpose using pptransp:
crhs_fft(n, s, mloc) \(\longrightarrow\) crhs(m, s, nloc)
Sum accross all the clones using ppsum
Backsolve to find the fields
IDFT to find fields in real space

Typically, steps 2 to 8 are done in the create_rhs subroutine and step 10 is done in the create_field_bspl subroutine.

Local algorithm

The idea behind the local algorithm is to split the toroidal DFT into two sums, one over the subdomains and one over the toroidal grid points in each subdomain:

\[ \hat{\hat{\rho}}_{i}^{nm} = \sum_{k=0}^{N_\varphi-1}\hat{\rho}_{ik}^{m}\exp\left(-\frac{2\pi \imath k n }{N_\varphi}\right) = \sum_{s\in\mathcal{S}}\sum_{k=0}^{N_\varphi^{l}-1}\hat{\rho}_{ik}^{m}\exp\left(-\frac{2\pi \imath k n }{N_\varphi}\right) \]

Note that here, one only computes \(\forall n \in [n_{1}^{f}, n_{2}^{f}]\) and for \(m \in [-nq(s)-\Delta m, -nq(s)+\Delta m]\). In practice this allows one to combine the toroidal DFT with the filtering, thus computing only the modes that are in the filter. Furthermore, one can replace two calls to the pptransp routine by a call to an MPI reduce_scatter. The later will compute the sum of the partial DFTs across the subdomains and scatter the resulting modes to their corresponding ranks.

Choosing the DFT algorithm to use

Currently, only the original pptransp and the local DFT algorithms are implemented in ORB5. To choose which one to use, one uses the nsel_dft_method input paremeter of the FIELDS namelist. It can currently take the values pptransp and local_dft.

Table of Contents

Introduction

Local and partial DFT algorithm

Current implementation

Local algorithm

Choosing the DFT algorithm to use