
HISQ mixed precision deflation

maddyscientist edited this page Dec 7, 2019 · 19 revisions

For this case study we are using a 48x48x48x12 configuration as provided by the HotQCD collaboration.

Parameter            Value
Volume               48x48x48x12
Gauge action         Improved Symanzik
beta                 6.794
Fermion action       HISQ fermions
light quark mass     0.00167
strange quark mass   0.0450

Here we examine the quark mass dependence of the solve time as we scale from the light to the strange quark mass, progressively optimizing the solver with mixed precision and deflation. For this study we use the staggered_invert_test example code that is included with QUDA, run on a workstation with 2x Quadro GV100 GPUs. All of these runs use a launch syntax of the form

export QUDA_RESOURCE_PATH=.
ARGS="--dim 48 48 24 12 --gridsize 1 1 2 1 --load-gauge /scratch/mathias/l4812f21b6794m00167m0450c_130.quda --compute-fat-long true --test 1"

mpirun -np 2 tests/staggered_invert_test $ARGS $RECON $PREC $SOLVER $EIG --mass 0.00167 --verbosity verbose

where we will adjust the variables SOLVER, PREC, RECON, and EIG according to the desired solver parameters.
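The mass sweep reported in the tables below can be scripted around this launch line. The following is a minimal dry-run sketch that only prints the commands: GAUGE_FILE and the helper mass_list are hypothetical names introduced here for illustration, and the six masses are assumed to be successive doublings of the light quark mass, matching the table rows.

```shell
#!/usr/bin/env bash
# Dry-run sketch: print one launch command per quark mass instead of running it.
# GAUGE_FILE is a hypothetical placeholder; point it at your own configuration.
GAUGE_FILE="${GAUGE_FILE:-config.quda}"
ARGS="--dim 48 48 24 12 --gridsize 1 1 2 1 --load-gauge $GAUGE_FILE --compute-fat-long true --test 1"

# Six masses, doubling from the light mass 0.00167 up to 0.05344.
mass_list() {
  awk 'BEGIN { m = 0.00167; for (i = 0; i < 6; i++) { printf "%.5f\n", m; m *= 2 } }'
}

# The option variables are left unquoted on purpose so that they split into words.
for mass in $(mass_list); do
  echo mpirun -np 2 tests/staggered_invert_test $ARGS $RECON $PREC $SOLVER $EIG --mass "$mass" --verbosity verbose
done
```

Dropping the echo turns the dry run into an actual sweep once SOLVER, PREC, RECON, and EIG have been set.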

Starting point - Pure Double CG solver

Our starting point is a standard double precision CG solver. This uses the parameters:

SOLVER="--inv-type cg --tol 1e-10 --reliable-delta 0.001 --niter 20000"
PREC="--prec double"

which means we run a CG solver to a relative residual tolerance of 1e-10, reliably updating the true residual every time the iterated residual drops by 3 orders of magnitude, with a maximum iteration count of 20000 using double precision only.

With HISQ fermions we can also use compression on the long-link field to reduce the memory traffic. We do so with these parameters

RECON="--recon 13 --recon-sloppy 9"

where we only use the maximal reconstruct-9 compression on the sloppy updates to ensure stability. In doing so we find that the iteration count is unchanged, and we improve the overall solve time by around 1.2x. From now on, we will assume that gauge compression is always used.

          no-recon                    recon 13/9
Mass      Iterations  Time   GFLOPS   Iterations  Time   GFLOPS
0.00167   6178        20.9   468      6178        17.5   556
0.00334   4551        15.4   467      4551        12.9   557
0.00668   2621        8.89   467      2621        7.46   556
0.01336   1431        4.87   467      1431        4.09   556
0.02672   759         2.60   467      759         2.18   556
0.05344   400         1.39   467      400         1.16   556
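The quoted gain of around 1.2x can be checked directly against the table. A minimal sketch, with the helper name recon_speedup introduced here purely for illustration:

```shell
# Ratio of the no-recon solve time to the recon 13/9 solve time (values from the table above).
recon_speedup() { awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'; }

recon_speedup 20.9 17.5   # lightest mass 0.00167, prints 1.19
recon_speedup 1.39 1.16   # heaviest mass 0.05344, prints 1.20
```

Because the iteration counts are identical in both columns, the gain comes entirely from the higher sustained GFLOPS.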

Mixed-Precision CG

The first significant performance boost comes from enabling mixed-precision CG. To do so, we simply set the sloppy precision to a lower precision than the outer solver precision. Valid values are single, half and quarter, with the latter two formats being QUDA's custom block-fixed-point formats. E.g.,

SOLVER="--inv-type cg --tol 1e-10 --reliable-delta 0.1 --niter 10000"
PREC="--prec double --prec-sloppy single"

Note we also change the reliable-delta parameter such that the true residual is recomputed every time the residual drops by an order of magnitude. This is to minimize the divergence of the solver due to the reduced precision.
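As a rough illustration, the relation between reliable-delta and the number of orders of magnitude between true-residual updates, and the three sloppy-precision variants compared below, can be sketched as follows. The helper orders_of_magnitude is a hypothetical name, not part of QUDA.

```shell
# Orders of magnitude the iterated residual drops between reliable updates: -log10(delta).
orders_of_magnitude() { awk -v d="$1" 'BEGIN { printf "%.0f\n", -log(d) / log(10) }'; }

orders_of_magnitude 0.001   # pure double setting, prints 3
orders_of_magnitude 0.1     # mixed-precision setting, prints 1

# Build the PREC variable for each sloppy precision considered in this section.
for sloppy in single half quarter; do
  PREC="--prec double --prec-sloppy $sloppy"
  echo "$PREC"
done
```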

          double-single               double-half                 double-quarter
Mass      Iterations  Time   GFLOPS   Iterations  Time   GFLOPS   Iterations  Time   GFLOPS
0.00167   6179        8.73   1120     7710        6.65   1830     14973       8.59   2760
0.00334   4552        6.43   1120     5118        4.40   1830     7786        4.51   2730
0.00668   2621        3.72   1120     2884        2.51   1820     3790        2.24   2690
0.01336   1431        2.04   1110     1456        1.29   1800     1869        1.13   2640
0.02672   759         1.01   1100     759         0.678  1790     900         0.570  2540
0.05344   400         0.594  1090     400         0.374  1730     487         0.325  2440

These results demonstrate that more GFLOPS doesn't mean reduced time to solution. In particular we see that

  • the double-single solver has almost identical iteration count to the pure double solver, and as a result has the expected ideal 2x speedup
  • while the double-half shows increased iteration count at small quark mass, overall it is stable and represents the sweet spot with a 2.6x speedup over the pure double solver at the light quark mass and 3.1x speedup at the heavy quark mass
  • the double-quarter solver is not numerically stable as we reduce the quark mass. At the light quark mass we more than double the iteration count for a solve time similar to the double-single solver, though at heavy quark masses we achieve a 3.6x speedup
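The speedups quoted above are measured against the pure double solver with recon 13/9 and can be reproduced from the two tables. A minimal sketch, with the helper name speedup_vs_double introduced here for illustration:

```shell
# Speedup relative to the pure double solver with recon 13/9 (times from the tables above).
speedup_vs_double() { awk -v ref="$1" -v t="$2" 'BEGIN { printf "%.1f\n", ref / t }'; }

echo "mass 0.00167 (double: 17.5)"
for t in 8.73 6.65 8.59; do speedup_vs_double 17.5 "$t"; done    # single, half, quarter: 2.0 2.6 2.0

echo "mass 0.05344 (double: 1.16)"
for t in 0.594 0.374 0.325; do speedup_vs_double 1.16 "$t"; done # single, half, quarter: 2.0 3.1 3.6
```

At the light mass the double-quarter solver sustains the highest GFLOPS, yet its 2.4x larger iteration count leaves it no faster than double-single.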

Considering Deflation

We now consider adding deflation to accelerate the solver, with the goal of removing the solver's critical slowing down as the quark mass is reduced. For this we shall use the thick-restarted Lanczos eigensolver implemented in QUDA, whose parameters require some consideration.

  • The number of eigenvectors to deflate with. This will of course be problem dependent: a larger deflation space results in a better conditioned system to solve, but increases the setup time, memory costs and deflation overhead. There are actually two parameters exposed in QUDA: the number of eigenvectors we attempt to converge and the number of eigenvectors that are required to converge. The number of deflation modes we will use corresponds to the latter.
  • The size of the Krylov space the eigen-solver should construct before triggering a restart. Typically this is 1.2-2x larger than the number of the desired eigenvalues.
  • Polynomial acceleration parameters: for finding the low eigenvalues of an operator it is usually optimal to use polynomial acceleration to isolate the part of the eigen-spectrum one is interested in.
  • The precision of the eigensolver. In principle we can run the eigensolver in any of the precisions possible in QUDA, however, as we shall see below, in practice it probably only makes sense to consider running in single or half precision.
  • How often to re-deflate the residual. In infinite precision, the deflation need only be applied once prior to the solver, however, in finite precision, and especially with mixed-precision, we will need to re-deflate the residual vector to ensure optimal convergence.
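As a small worked example of the Krylov space rule of thumb in the list above, the following sketch computes the suggested size range for a given number of desired eigenvalues. The helper krylov_range is a hypothetical name, and rounding the lower bound up to an integer is an assumption made here.

```shell
# Suggested Krylov space size: between 1.2x and 2x the number of desired eigenvalues.
# The lower bound is rounded up to the next integer (an assumption of this sketch).
krylov_range() { awk -v n="$1" 'BEGIN { printf "%d %d\n", int(1.2 * n + 0.999999), 2 * n }'; }

krylov_range 256   # prints "308 512"
```

For 256 eigenvalues the range is 308 to 512, so the Krylov size of 512 used in the example below sits at its upper end.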

Before we consider the full mass sweep we initially focus on getting the eigenvectors for the light quark mass only. To enable deflation in QUDA's solver we use the following options

--prec-precondition # precision of the eigensolver and subsequent deflation
--inv-deflate       # enable initial eigen-solver and deflation
--eig-poly-deg      # use polynomial acceleration of degree n
--eig-amax          # sets the upper bound for the polynomial acceleration
--eig-amin          # set the lower bound of the polynomial acceleration
--eig-nEv           # number of eigenvectors to find
--eig-nConv         # number of converged eigenvectors required
--eig-nKr           # size of Krylov search space
--eig-tol           # tolerance of the eigensolver
--df-tol-restart    # how often to re-deflate the residual

The part that can require some tuning and care is deciding how many eigenvectors to deflate and picking the polynomial acceleration parameters. The former will likely require some per-problem tuning. For the latter we refer the reader here.

PREC="--prec double --prec-sloppy half --prec-precondition single"
EIG="--inv-deflate true --eig-amax 22 --eig-amin 0.01 --eig-poly-deg 80 --eig-nEv 256 --eig-nConv 256 --eig-nKr 512 --eig-tol 1e-6 --df-tol-restart 1e-1"

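These options will configure the eigensolver to run in single precision, using degree-80 polynomial acceleration with bounds 0.01 and 22, and to request 256 converged eigenvectors from a Krylov space of size 512 at a tolerance of 1e-6, with --df-tol-restart set to 1e-1 to control how often the residual is re-deflated.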
