![]() |
ORB5
v4.9.4
|
Implementation of the local DFT algorithm.
Since the porting of the particle part of ORB5 to GPUs, the fraction of time spent in the field part has increased substantially. In particular, the main bottleneck is the parallel matrix transpose required to perform the 2D DFT of the toroidal and poloidal dimensions for the RHS (charge and current spline coefficients).
In the original implementation, the pptransp
routine performs this parallel data transpose using pairwise exchanges in the form of MPI sendrecv
. As an improvement to this method, two new algorithms have been implemented. The first concerns directly the pptransp
routine and uses MPI derived types to perform a zero-copy parallel transpose. The second method split the toroidal DFT into local partial DFTs that are then reduced and scattered back.
In ORB5, the Poisson and Ampere equations are solved by using an equivalent system corresponding to the 2D (toroidal and poloidal) Fourier transform of their spline representation.
Given the RHS B-splines coefficients \(\rho_{ijk}\), their 2D DFT is given by the composition of a poloidal and toroidal transfrom:
\[ \hat{\rho}_{ik}^{m} = \sum_{j=0}^{N_\theta-1}\rho_{ijk}\exp\left(-\frac{2\pi \imath j m }{N_\theta}\right), \quad \forall m \in [m_{1}^{f}, m_{2}^{f}], \]
where \(m_{1}^{f}\) and \(m_{2}^{f}\) are the lower and upper bounds of the poloidal filter.
\[ \hat{\hat{\rho}}_{i}^{nm} = \sum_{k=0}^{N_\varphi-1}\hat{\rho}_{ik}^{m}\exp\left(-\frac{2\pi \imath k n }{N_\varphi}\right), \quad \forall n \in [n_{1}^{f}, n_{2}^{f}], \]
where \(n_{1}^{f}\) and \(n_{2}^{f}\) are the lower and upper bounds of the toroidal filter. Typically, one uses a diagonal filter for the poloidal modes, i.e.
\[ \forall n \in [n_{1}^{f}, n_{2}^{f}], \quad m \in [-nq(s)-\Delta m, -nq(s)+\Delta m], \]
with \(\Delta m \sim 5\).
In the original implementation, the Fourier transformation is done according to the following workflow:
rho(s, th, philoc)
rho
and store it as complex numbers: rho(s, th, phi)
\(\longrightarrow\) crhs(th, s, philoc)
crhs(th, s, philoc)
\(\longrightarrow\) crhs(m, s, philoc)
pptransp
: crhs(th, s, philoc)
\(\longrightarrow\) crhs_fft(phi, s, mloc)
crhs_fft(phi, s, mloc)
\(\longrightarrow\) crhs_fft(n, s, mloc)
pptransp
:crhs_fft(n, s, mloc)
\(\longrightarrow\) crhs(m, s, nloc)
ppsum
Typically, steps 2 to 8 are done in the create_rhs
subroutine and step 10 is done in the create_field_bspl
subroutine.
The idea behind the local algorithm is to split the toroidal DFT into two sums, one over the subdomains and one over the toroidal grid points in each subdomain:
\[ \hat{\hat{\rho}}_{i}^{nm} = \sum_{k=0}^{N_\varphi-1}\hat{\rho}_{ik}^{m}\exp\left(-\frac{2\pi \imath k n }{N_\varphi}\right) = \sum_{s\in\mathcal{S}}\sum_{k=0}^{N_\varphi^{l}-1}\hat{\rho}_{ik}^{m}\exp\left(-\frac{2\pi \imath k n }{N_\varphi}\right) \]
Note that here, one only computes \(\forall n \in [n_{1}^{f}, n_{2}^{f}]\) and for \(m \in [-nq(s)-\Delta m, -nq(s)+\Delta m]\). In practice this allows one to combine the toroidal DFT with the filtering, thus computing only the modes that are in the filter. Furthermore, one can replace two calls to the pptransp
routine by a call to an MPI reduce_scatter
. The later will compute the sum of the partial DFTs across the subdomains and scatter the resulting modes to their corresponding ranks.
Currently, only the original pptransp
and the local DFT algorithms are implemented in ORB5. To choose which one to use, one uses the nsel_dft_method
input paremeter of the FIELDS
namelist. It can currently take the values pptransp
and local_dft
.