[Note: this project is in active development.]
This is a SUNDIALS-based demonstration application to assess and demonstrate the large-scale parallel performance of new capabilities that have been added to SUNDIALS in recent years. Namely:
- The new SUNDIALS MPIManyVector implementation, which enables flexibility in how solution data is partitioned across computational resources e.g., CPUs and GPUs.
- The new ARKODE multirate integration module, MRIStep, allowing high-order accurate calculations that subcycle "fast" processes within "slow" ones.
- The new flexible SUNDIALS SUNLinearSolver interfaces, which enable streamlined use of problem-specific and scalable linear solver libraries e.g., SuiteSparse and MAGMA.
This code simulates a 3D nonlinear inviscid compressible Euler equation with advection and reaction of chemical species,

$$w_t = -\nabla \cdot F(w) + R(w) + G(\mathbf{X}, t),$$

for independent variables $(\mathbf{X}, t) = (x, y, z, t) \in \Omega \times (t_0, t_f]$, where the spatial domain is the box $\Omega = [x_l, x_r] \times [y_l, y_r] \times [z_l, z_r]$. The differential equation is completed using the initial condition $w(t_0, \mathbf{X}) = w_0(\mathbf{X})$ and face-specific boundary conditions on each side of $\Omega$.

The system state vector is

$$w = \begin{bmatrix} \rho & m_x & m_y & m_z & e_t & \mathbf{c} \end{bmatrix}^T,$$

corresponding to the density, momentum in the x, y, and z directions, total energy per unit volume, and any number of chemical densities $\mathbf{c}$ that are advected along with the fluid. Here $F(w)$ denotes the inviscid Euler fluxes, the reaction term $R(w)$ defines the chemical kinetics, and the external force $G(\mathbf{X}, t)$ is test-problem dependent.
We have the physical parameters:

- $R$ is the specific ideal gas constant (287.14 J/kg/K),
- $c_v$ is the specific heat capacity at constant volume (717.5 J/kg/K),
- $\gamma = c_p/c_v = 1 + R/c_v$ is the ratio of specific heats (1.4),

corresponding to air (predominantly an ideal diatomic gas). The speed of sound in the gas is then given by

$$c = \sqrt{\frac{\gamma p}{\rho}},$$

where $p$ is the gas pressure.
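For reference, the pressure $p$ appearing above can be recovered from the conserved variables through the standard ideal-gas relations (stated here for orientation; consult the code for the exact formulation it uses):

$$p = (\gamma - 1)\left(e_t - \frac{m_x^2 + m_y^2 + m_z^2}{2\rho}\right), \qquad T = \frac{p}{R\,\rho}.$$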
The fluid variables above are non-dimensionalized; in standard SI units these would be:

- $[\rho] = kg / m^3$,
- $[v_x] = [v_y] = [v_z] = m/s$, which implies $[m_x] = [m_y] = [m_z] = kg / m^2 / s$,
- $[e_t] = kg / m / s^2$, and
- $[\mathbf{c}_i] = kg / m^3$.
Note: the fluid portion of the description above follows the presentation here in sections 7.3.1 - 7.3.3.
We discretize this problem using the method of lines, where we first semi-discretize in space using a regular finite volume grid with dimensions `nx` x `ny` x `nz`, with fluxes at cell faces calculated using a 5th-order FD-WENO reconstruction. MPI parallelization is achieved using a standard 3D domain decomposition, using `nprocs` MPI ranks, with layout `npx` x `npy` x `npz` defined automatically via the `MPI_Dims_create` utility routine. The minimum size for any dimension is 3, so to run a two-dimensional test in the yz-plane, one could specify `nx = 3` and `ny = nz = 200`. When run in parallel, only "active" spatial dimensions (those with extent greater than 3) will be parallelized.
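As a small, self-contained illustration of how the processor layout is chosen (a sketch using the standard MPI routine named above, not code taken from the demo):

```cpp
#include <cstdio>
#include <mpi.h>

int main(int argc, char** argv)
{
  MPI_Init(&argc, &argv);
  int nprocs;
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

  // Let MPI factor nprocs into a 3D layout {npx, npy, npz};
  // entries left at 0 are free for MPI_Dims_create to choose.
  int dims[3] = {0, 0, 0};
  MPI_Dims_create(nprocs, 3, dims);
  std::printf("npx x npy x npz = %d x %d x %d\n", dims[0], dims[1], dims[2]);

  MPI_Finalize();
  return 0;
}
```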
The fluid fields are stored in separate `N_Vector` objects on each MPI rank. The chemical species at all spatial locations over each MPI rank are collocated into a single serial or RAJA `N_Vector` object when running on the CPU or GPU, respectively. The five fluid vectors and the chemical species vector are combined together to form the full "solution" vector using the `MPIManyVector` `N_Vector` module.
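The sketch below shows how such a solution vector could be assembled with the SUNDIALS v6 API; it is not the demo's actual setup code, and the rank-local lengths `nloc` (fluid unknowns per rank) and `cloc` (chemical unknowns per rank) are placeholders.

```cpp
#include <mpi.h>
#include <nvector/nvector_serial.h>
#include <nvector/nvector_mpimanyvector.h>

// Assemble the per-rank fluid fields and the collocated chemistry vector
// into one MPIManyVector "solution" vector.
N_Vector make_solution_vector(MPI_Comm comm, sunindextype nloc, sunindextype cloc,
                              SUNContext sunctx)
{
  N_Vector subvecs[6];
  for (int i = 0; i < 5; i++)                  // rho, mx, my, mz, et
    subvecs[i] = N_VNew_Serial(nloc, sunctx);
  subvecs[5] = N_VNew_Serial(cloc, sunctx);    // all chemical species on this rank
                                               // (a RAJA vector when running on GPUs)
  return N_VMake_MPIManyVector(comm, 6, subvecs, sunctx);
}
```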
After spatial semi-discretization, we are faced with a large initial-value problem (IVP) system,

$$\dot{w}(t) = f_1(w) + f_2(w), \qquad w(t_0) = w_0,$$

where $f_1(w)$ contains the discretized fluid advection terms and external forcing, and $f_2(w)$ contains the spatially-decoupled chemical reaction terms.
For non-reactive flows, the resulting initial-value problem is evolved in time using an adaptive step explicit Runge-Kutta method from the ARKStep module in ARKODE. For problems involving (typically stiff) chemical reactions, the problem may be solved using one of two approaches.
1. It may be treated as a multirate initial-value problem that is solved using the MRIStep module in ARKODE, wherein the gas dynamics equations are evolved explicitly at the slow time scale, while the chemical kinetics are evolved at a faster time scale using a temporally-adaptive, diagonally-implicit Runge-Kutta method from the ARKStep module.

2. It may be treated using mixed implicit-explicit (IMEX) methods at a single time scale. Here, the gas dynamics equations are treated explicitly, while the chemical kinetics are treated implicitly, using an additive Runge-Kutta method from the ARKStep module.
For (2) we use SUNDIALS' modified Newton solver to handle the global nonlinear algebraic systems arising at each implicit stage of each time step. Since only the chemistry term $f_2(w)$ is treated implicitly, the corresponding linear systems decouple across MPI ranks, so we supply a custom `SUNLinearSolver` implementation that solves each MPI rank-local linear system independently. The portion of the Jacobian matrix on each rank is itself block-diagonal. We further leverage this structure by solving each rank-local linear system using either the sparse KLU (CPU-only) or batched dense MAGMA (GPU-enabled) SUNDIALS `SUNLinearSolver` implementations.
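The snippet below gives a rough idea of one such rank-local solve using the KLU `SUNLinearSolver` module; in the demo these solves are wrapped inside the custom `SUNLinearSolver`, and the matrix `A` here is assumed to be a rank-local CSC `SUNSparseMatrix`.

```cpp
#include <nvector/nvector_serial.h>
#include <sunmatrix/sunmatrix_sparse.h>
#include <sunlinsol/sunlinsol_klu.h>

// Solve the rank-local linear system A x = b with the KLU direct solver.
// (In practice the solver object would be created once and reused.)
int solve_local_block(SUNMatrix A, N_Vector b, N_Vector x, SUNContext sunctx)
{
  SUNLinearSolver LS = SUNLinSol_KLU(x, A, sunctx);
  int flag = SUNLinSolInitialize(LS);
  if (flag == 0) flag = SUNLinSolSetup(LS, A);            // sparse LU factorization
  if (flag == 0) flag = SUNLinSolSolve(LS, A, x, b, 0.0); // direct solve; tol unused
  SUNLinSolFree(LS);
  return flag;
}
```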
The multirate approach (1) can leverage the structure of $f_2(w)$ even further: once the forcing functions, and all MPI communication necessary to construct them, are complete, the fast time scale reduces to `nx` x `ny` x `nz` spatially-decoupled fast IVPs. We construct a custom fast integrator that groups all the independent fast IVPs on an MPI rank together as a single system evolved using a rank-local ARKStep instance. The code for this custom integrator itself is minimal, primarily consisting of steps to access the local subvectors of the MPIManyVector solution; it reuses the `SUNLinearSolver` modules listed above for linear systems that arise within the modified Newton iteration.
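For orientation, a minimal sketch of this slow/fast coupling using the stock SUNDIALS v6 API is shown below; the demo's custom rank-local fast integrator plays the role of the inner stepper but additionally unpacks the MPIManyVector subvectors. The names `f_slow`, `f_fast`, and `h_slow` are placeholders.

```cpp
#include <arkode/arkode_arkstep.h>
#include <arkode/arkode_mristep.h>

// f_slow: explicit slow RHS (gas dynamics); f_fast: implicit fast RHS (chemistry).
void* create_multirate_integrator(ARKRhsFn f_slow, ARKRhsFn f_fast,
                                  realtype t0, realtype h_slow,
                                  N_Vector w, SUNContext sunctx)
{
  // Fast (inner) integrator: a diagonally-implicit ARKStep instance.
  void* inner_mem = ARKStepCreate(NULL, f_fast, t0, w, sunctx);

  // Expose it to MRIStep through the inner-stepper interface.
  MRIStepInnerStepper inner;
  ARKStepCreateMRIStepInnerStepper(inner_mem, &inner);

  // Slow (outer) integrator: explicit treatment of the gas dynamics.
  void* outer_mem = MRIStepCreate(f_slow, NULL, t0, w, inner, sunctx);
  MRIStepSetFixedStep(outer_mem, h_slow);   // slow step size, e.g., from a CFL condition
  return outer_mem;
}
```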
The following steps describe how to build the demonstration code in a Linux or OS X environment.
To obtain the code, clone this repository with Git:
git clone https://github.com/sundials-codes/sundials-manyvector-demo.git
To compile the code you will need:
-
CMake 3.20 or newer
-
modern C and C++ compilers
-
the NVIDIA CUDA Toolkit (when using the CUDA backend)
-
the HDF5 high-performance data management and storage suite
-
the RAJA performance portability library
-
the SUNDIALS library of time integrators and nonlinear solvers
-
the SuiteSparse library of sparse direct linear solvers (when using a CPU backend)
-
the MAGMA dense linear solver library (when using a GPU backend)
Many of the above dependencies can be installed using the Spack package manager. For information on using Spack see the getting started guide. The instructions below were formulated from Spack v0.19.0, although newer versions should also work.
Once Spack is set up, we recommend creating a Spack environment with the required dependencies e.g., on a system with Pascal GPUs and CUDA 11.4.2 installed:
spack env create --with-view ~/views/sundials-demo sundials-demo
spack env activate sundials-demo
spack add sundials@6.2.0 +openmp +mpi +logging-mpi +klu +magma +raja +cuda cuda_arch=60 ^cuda@11.4.2 ^magma@2.6.1 +cuda cuda_arch=60 ^raja@0.13.0 +cuda cuda_arch=60 ^suite-sparse@5.8.1
spack add hdf5@1.10.7 +hl +mpi
spack install
To assist in building the dependencies on select systems, the spack directory contains environment files that leverage software already available on the system. For example, on the OLCF Summit system:
module load gcc/10.2.0 cuda/11.4.2 cmake/3.21.3
cd spack
spack env create sundials-demo spack-summit.yaml
spack env activate sundials-demo
spack install
It is also possible to use the Docker containers from the GitHub Container Registry with the necessary dependencies preinstalled for CPU-only testing. Two images are provided:

- sundials-demo-spack-latest -- based on the latest Spack release (currently v0.19.0)
- sundials-demo-spack-develop -- based on the Spack develop branch and updated monthly
Pull the image(s) using Docker (or Podman). For example, the `run` command below will pull the image and start the container, and the `exec` command will start a bash shell inside the container.
docker run -t -d --name sundialsci-demo-spack-latest ghcr.io/sundials-codes/sundials-demo-spack-latest:spack-latest
docker exec -it sundialsci-demo-spack-latest bash
Then clone this repository with Git and configure/build the code as described below. The Spack-installed dependencies are available from the /opt/view directory.
Once the necessary dependencies are installed, the following CMake variables can be used to configure the demonstration code build:

- `CMAKE_INSTALL_PREFIX` - the path where executables and input files should be installed e.g., `my/install/path`. The executables will be installed in the `bin` directory and input files in the `tests` directory under the given path.
- `CMAKE_C_COMPILER` - the C compiler to use e.g., `mpicc`. If not set, CMake will attempt to automatically detect the C compiler.
- `CMAKE_C_FLAGS` - the C compiler flags to use e.g., `-g -O2`.
- `CMAKE_C_STANDARD` - the C standard to use, defaults to `99`.
- `CMAKE_CXX_COMPILER` - the C++ compiler to use e.g., `mpicxx`. If not set, CMake will attempt to automatically detect the C++ compiler.
- `CMAKE_CXX_FLAGS` - the C++ flags to use e.g., `-g -O2`.
- `CMAKE_CXX_STANDARD` - the C++ standard to use, defaults to `11`.
- `RAJA_ROOT` - the root directory of the RAJA installation, defaults to the value of the `RAJA_ROOT` environment variable. If not set, CMake will attempt to automatically locate a RAJA install on the system.
- `RAJA_BACKEND` - the RAJA backend to use with the demonstration code, defaults to `SERIAL`. Supported options are `SERIAL`, `OPENMP`, and `CUDA`. Note that this only applies to the on-node parallelism used when evaluating chemistry-based components associated with $f_2(w)$.
- `SUNDIALS_ROOT` - the root directory of the SUNDIALS installation, defaults to the value of the `SUNDIALS_ROOT` environment variable. If not set, CMake will attempt to automatically locate a SUNDIALS install on the system.
- `ENABLE_HDF5` - build with HDF5 I/O support, defaults to `OFF`.
- `HDF5_ROOT` - the root directory of the HDF5 installation, defaults to the value of the `HDF5_ROOT` environment variable. If not set, CMake will attempt to automatically locate an HDF5 install on the system.
When RAJA is installed with CUDA support enabled, the following additional variables may also be set:
- `CMAKE_CUDA_COMPILER` - the CUDA compiler to use e.g., `nvcc`. If not set, CMake will attempt to automatically detect the CUDA compiler.
- `CMAKE_CUDA_FLAGS` - the CUDA compiler flags to use.
- `CMAKE_CUDA_ARCHITECTURES` - the CUDA architecture to target e.g., `70`.
In-source builds are not permitted; as such, the code should be configured and built from a separate build directory e.g.,
cd sundials-manyvector-demo
mkdir build
cd build
cmake ../. \
-DCMAKE_INSTALL_PREFIX="[install-path]" \
-DRAJA_BACKEND="SERIAL" \
-DENABLE_HDF5="ON" \
-DHDF5_ROOT="[spack-view-path]" \
-DRAJA_ROOT="[spack-view-path]" \
-DSUNDIALS_ROOT="[spack-view-path]"
make
make install
where `[install-path]` is the path to where the binary and test input files should be installed and `[spack-view-path]` is the path to the Spack environment view, i.e., `~/views/sundials-demo` when following the Spack instructions above or `/opt/view` when using the Docker containers.
Several test cases are included with the code and the necessary input files for
each case are contained in the subdirectories within the tests
directory. Each input file is internally documented to discuss all possible
input parameters (in case some have been added since this README
was last
updated).
The input files contain parameters to set up the physical problem:

- spatial domain, $\Omega$ -- `xl`, `xr`, `yl`, `yr`, `zl`, `zr`
- time interval, $[t_0, t_f]$ -- `t0`, `tf`
- the ratio of specific heats, $\gamma$ -- `gamma`
- spatial discretization dimensions -- `nx`, `ny`, `nz`
- boundary condition types -- `xlbc`, `xrbc`, `ylbc`, `yrbc`, `zlbc`, `zrbc`
Parameters to control the execution of the code:

- desired CFL fraction -- `cfl` (if set to zero, then the time step is chosen purely using temporal adaptivity)
- number of desired solution outputs -- `nout`
- a flag to enable optional output of RMS averages for each field at the frequency specified via `nout` -- `showstats`
Numerous parameters are also provided to control how time integration is performed (these are passed directly to ARKODE). For further information on the ARKODE solver parameters and the meaning of individual values, see the ARKODE documentation.
To specify an input file to the executable, the input filename should be provided using the `-f` flag e.g.,
<executable> -f <input_file>
Additionally, any input parameters may also be specified on the command line e.g.,
<executable> --nx=100 --ny=100 --nz=400
For example, continuing with the Summit case from above, the primordial blast test can be run on one Summit node using four cores and four GPUs with the following commands:
cd ${MEMBERWORK}/[projid]/sundials-demo/tests/primordial_blast
bsub -q debug -nnodes 1 -W 0:10 -P [projid] -Is $SHELL
jsrun -n4 -a1 -c1 -g1 ../../bin/primordial_blast_mr.exe -f input_primordial_blast_mr_gpu.txt
The `bsub` command above will submit a request for an interactive job to the debug queue allocating one node for 10 minutes with the compute time charged to `[projid]`. Once the interactive session starts, the test case is launched using the `jsrun` command. Solutions are output to disk using parallel HDF5, solution statistics are optionally output to the screen at specified frequencies, and run statistics are printed at the end of the simulation.
The parallel HDF5 solution snapshots are written at the frequency specified by `nout`. Accompanying these `output-#######.hdf5` files is an automatically generated input file, `restart_parameters.txt`, that stores a complete set of input parameters to restart the simulation from the most recently generated output file. This is a "warm" restart, in that it will pick up the calculation where the previous one left off, using the same initial time step size as ARKStep would use. This restart may differ slightly from an uninterrupted run since other internal ARKStep time adaptivity parameters cannot be reused. We note that the restart must use the same spatial grid size and number of chemical tracers as the original run, but it may use a different number of MPI tasks if desired.
Individual test problems are uniquely specified through an input file and
auxiliary source code file(s) that should be linked with the main routine at
compile time. By default, all codes are built with no chemical species; however,
this may be controlled at compilation time using the NVAR
preprocessor
directive, corresponding to the number of unknowns at any spatial location.
Hence, the (default) minimum value for NVAR
is 5, so for a calculation with 4
chemical species the code should be compiled with the preprocessor directive
NVAR=9
. See src/CMakeLists.txt for examples of how to
specify NVAR
when adding a new test/executable.
The auxiliary source code files for creating a new test must contain three functions, each of which must return an integer flag indicating success (0) or failure (nonzero). The first two are the initial condition function
int initial_conditions(const realtype& t, N_Vector w, const UserData& udata);
and the forcing function
int external_forces(const realtype& t, N_Vector G, const UserData& udata);
Additionally, a function must be supplied to compute/output any desired solution diagnostic information with the signature
int output_diagnostics(const realtype& t, const N_Vector w, const UserData& udata);
If no diagnostics information is desired, then this routine may just return 0.
Here, the `initial_conditions` routine will be called once when the simulation begins, `external_forces` will be called on every evaluation of the ODE right-hand side function for the Euler equations (it is assumed that this does not require the results of the neighbor exchange performed via `UserData::ExchangeStart` / `UserData::ExchangeEnd`), and `output_diagnostics` will be called at the same frequency as the solution is output to disk.
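A skeletal example of the three routines is sketched below. It assumes the project's `UserData` type is available from its headers, and the subvector ordering, field values, and zero forcing are illustrative rather than prescriptive.

```cpp
#include <nvector/nvector_mpimanyvector.h>
// UserData is defined in the demo's own headers, which a real test would include.

int initial_conditions(const realtype& t, N_Vector w, const UserData& udata)
{
  // Fill each field of the MPIManyVector; here a uniform, quiescent state.
  N_VConst(RCONST(1.0), N_VGetSubvector_MPIManyVector(w, 0));    // rho
  for (int i = 1; i < 4; i++)
    N_VConst(RCONST(0.0), N_VGetSubvector_MPIManyVector(w, i));  // mx, my, mz
  N_VConst(RCONST(2.5), N_VGetSubvector_MPIManyVector(w, 4));    // et
  return 0;
}

int external_forces(const realtype& t, N_Vector G, const UserData& udata)
{
  N_VConst(RCONST(0.0), G);        // no external forcing for this test
  return 0;
}

int output_diagnostics(const realtype& t, const N_Vector w, const UserData& udata)
{
  return 0;                        // no additional diagnostics
}
```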
To add a new executable using these auxiliary source code file(s), update src/CMakeLists.txt to include a new call to `sundemo_add_executable` in a similar manner as the existing test problems e.g., `hurricane_yz.exe`.