HISQ mixed precision deflation
For this case study we are using a 48x48x48x12 configuration as provided by the HotQCD collaboration.
Parameter | Value |
---|---|
Volume | 48x48x48x12 |
Gauge action | Improved Symanzik |
beta | 6.794 |
Fermion action | HISQ fermions |
light quark mass | 0.00167 |
strange quark mass | 0.0450 |
Here we are going to examine the quark mass dependence of the solve time as we scale from the light to the strange quark mass, progressively optimizing the solver with mixed precision and deflation. For this study we use the staggered_invert_test example code that is included with QUDA, run on a workstation with 2x Quadro GV100 GPUs. All of these runs use a launch syntax of the form
export QUDA_RESOURCE_PATH.
ARGS="--dim 48 48 24 12 --gridsize 1 1 2 1 --load-gauge /scratch/mathias/l4812f21b6794m00167m0450c_130.quda --compute-fat-long true --test 1"
mpirun -np 2 tests/staggered_invert_test $ARGS $RECON $PREC $SOLVER $EIG --mass 0.00167 --verbosity verbose
where we will adjust the variables SOLVER, PREC, and RECON according to the solver parameters as desired.
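To repeat the same launch line over several quark masses, the command can be assembled programmatically. The following is a minimal sketch, not part of QUDA: the helper name `build_command` and the `MASSES` list are illustrative, the flags are copied verbatim from the launch line above, and the solver and precision strings are the double-precision baseline values used in this case study. It only prints the commands (a dry run) rather than launching them.

```python
import shlex

# Fixed arguments, copied from the launch line in this case study.
ARGS = ("--dim 48 48 24 12 --gridsize 1 1 2 1 "
        "--load-gauge /scratch/mathias/l4812f21b6794m00167m0450c_130.quda "
        "--compute-fat-long true --test 1")

# Masses used in the result tables: the light mass, doubled five times.
MASSES = ["0.00167", "0.00334", "0.00668", "0.01336", "0.02672", "0.05344"]


def build_command(mass, solver, prec, recon="", eig=""):
    """Return the argv list for one run (hypothetical helper).

    Option groups follow the order of the original launch line:
    $ARGS $RECON $PREC $SOLVER $EIG. Empty groups are skipped.
    """
    argv = ["mpirun", "-np", "2", "tests/staggered_invert_test"]
    for group in (ARGS, recon, prec, solver, eig):
        argv += shlex.split(group)
    argv += ["--mass", mass, "--verbosity", "verbose"]
    return argv


if __name__ == "__main__":
    solver = "--inv-type cg --tol 1e-10 --reliable-delta 0.001 --niter 20000"
    prec = "--prec double"
    for mass in MASSES:
        # Dry run: print the command instead of executing it.
        print(shlex.join(build_command(mass, solver, prec)))
```

Passing the resulting list to `subprocess.run` would launch the job; that step is left out here so the sketch has no side effects.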
Our starting point is a standard double-precision CG solver, which uses the parameters:
SOLVER="--inv-type cg --tol 1e-10 --reliable-delta 0.001 --niter 20000"
PREC="--prec double"
which means we run a CG solver to a relative residual tolerance of 1e-10, reliably updating the true residual every time the iterated residual drops by 3 orders of magnitude, with a maximum iteration count of 20000 using double precision only.
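The reliable-update logic described above can be illustrated on a toy problem. The sketch below is not QUDA's implementation: it is a plain-Python CG on a small 1D periodic Laplacian plus a positive shift (standing in for the staggered normal operator), and the names `apply_op` and `cg_reliable` are made up for this example. The key step is the trigger: whenever the iterated residual norm falls below `delta` times the largest norm seen since the last update, the residual is recomputed explicitly as b - Ax.

```python
import math


def apply_op(x, shift):
    """Toy SPD operator: 1D periodic Laplacian plus a positive shift."""
    n = len(x)
    return [(2.0 + shift) * x[i] - x[i - 1] - x[(i + 1) % n] for i in range(n)]


def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


def cg_reliable(b, shift, tol=1e-10, delta=1e-3, max_iter=20000):
    """CG with reliable updates; returns (x, iterations, reliable_updates)."""
    x = [0.0] * len(b)
    r = list(b)                      # iterated residual (x = 0, so r = b)
    p = list(r)
    b_norm = math.sqrt(dot(b, b))
    r2 = dot(r, r)
    r_max = math.sqrt(r2)            # largest residual norm since last update
    updates = 0
    for it in range(1, max_iter + 1):
        Ap = apply_op(p, shift)
        alpha = r2 / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        r2_new = dot(r, r)
        r_norm = math.sqrt(r2_new)
        r_max = max(r_max, r_norm)
        if r_norm < delta * r_max:
            # Reliable update: replace the iterated residual by the true one.
            r = [bi - axi for bi, axi in zip(b, apply_op(x, shift))]
            r2_new = dot(r, r)
            r_norm = math.sqrt(r2_new)
            r_max = r_norm
            updates += 1
        if r_norm <= tol * b_norm:
            return x, it, updates
        beta = r2_new / r2
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        r2 = r2_new
    return x, max_iter, updates


if __name__ == "__main__":
    n = 64
    b = [0.0] * n
    b[0] = 1.0                       # point source
    x, iters, updates = cg_reliable(b, shift=0.1)
    print(f"converged in {iters} iterations with {updates} reliable updates")
```

In pure double precision the true and iterated residuals agree closely, so the updates cost little and change little; they matter once the iteration itself runs at lower precision.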
With HISQ fermions we can also use compression on the long-link field to reduce the memory traffic. We do so with these parameters
RECON="--recon 13 --recon-sloppy 9"
where we use the maximal reconstruct-9 compression only on the sloppy updates to ensure stability. In doing so we see that the iteration count is unchanged, and the overall solve time improves by around 1.2x. From now on, we will assume that gauge compression is always used.
Mass | Iterations (no-recon) | Time (no-recon) | GFLOPS (no-recon) | Iterations (recon 13/9) | Time (recon 13/9) | GFLOPS (recon 13/9) |
---|---|---|---|---|---|---|
0.00167 | 6178 | 20.9 | 468 | 6178 | 17.5 | 556 |
0.00334 | 4551 | 15.4 | 467 | 4551 | 12.9 | 557 |
0.00668 | 2621 | 8.89 | 467 | 2621 | 7.46 | 556 |
0.01336 | 1431 | 4.87 | 467 | 1431 | 4.09 | 556 |
0.02672 | 759 | 2.60 | 467 | 759 | 2.18 | 556 |
0.05344 | 400 | 1.39 | 467 | 400 | 1.16 | 556 |
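
The quoted 1.2x improvement can be checked directly from the table. The short script below simply divides the uncompressed time by the compressed time for each mass; the numbers are copied from the table, and the variable names are illustrative.

```python
# (mass, iterations, time without compression, time with recon 13/9),
# copied from the table above.
rows = [
    (0.00167, 6178, 20.9, 17.5),
    (0.00334, 4551, 15.4, 12.9),
    (0.00668, 2621, 8.89, 7.46),
    (0.01336, 1431, 4.87, 4.09),
    (0.02672, 759, 2.60, 2.18),
    (0.05344, 400, 1.39, 1.16),
]

# Speedup from gauge compression at each mass.
speedups = [t_plain / t_recon for _, _, t_plain, t_recon in rows]

for (mass, iters, _, _), s in zip(rows, speedups):
    print(f"mass {mass:.5f}: {iters:5d} iterations, speedup {s:.2f}x")
```

Because the iteration count is identical in both columns, the speedup is essentially the ratio of the sustained GFLOPS (556 / 467), independent of the quark mass.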
The first significant performance boost comes from enabling mixed-precision CG. To do so, we simply set the sloppy precision to a lower precision than the outer solver precision. Valid values are single, half and quarter, with the latter two being QUDA's custom block-fixed-point formats. E.g.,
SOLVER="--inv-type cg --tol 1e-10 --reliable-delta 0.1 --niter 10000"
PREC="--prec double --prec-sloppy single"
Note that we also change the reliable-delta parameter to 0.1, so that the true residual is recomputed every time the iterated residual drops by an order of magnitude. This limits the drift between the iterated and true residuals caused by the reduced precision, which could otherwise make the solver diverge.
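To give a feel for why the lower-precision formats need more frequent reliable updates, the sketch below emulates a generic block-fixed-point format: each block of values shares one floating-point scale, and the values themselves are stored as signed integers. This is an illustration only and not QUDA's actual storage layout; the bit widths (16 for half, 8 for quarter) and the function names are assumptions made for this example.

```python
def quantize_block(values, bits):
    """Store a block as one float scale plus signed integers of width `bits`."""
    scale = max(abs(v) for v in values)
    if scale == 0.0:
        return 0.0, [0] * len(values)
    qmax = 2 ** (bits - 1) - 1       # 32767 for 16 bits, 127 for 8 bits
    return scale, [round(v / scale * qmax) for v in values]


def dequantize_block(scale, ints, bits):
    """Recover approximate floating-point values from the stored block."""
    qmax = 2 ** (bits - 1) - 1
    return [scale * q / qmax for q in ints]


def max_relative_error(values, bits):
    """Largest round-trip error in the block, relative to the block scale."""
    scale, ints = quantize_block(values, bits)
    restored = dequantize_block(scale, ints, bits)
    return max(abs(a - b) for a, b in zip(values, restored)) / scale


if __name__ == "__main__":
    # Six real numbers, e.g. the three complex components of a colour vector.
    block = [0.3, -1.7, 0.025, 0.9, -0.41, 1.2]
    for bits in (16, 8):
        print(f"{bits:2d}-bit: max error / scale = {max_relative_error(block, bits):.2e}")
```

The round-trip error is bounded by half an integer step, i.e. 0.5 / 32767 for 16 bits and 0.5 / 127 for 8 bits, relative to the block scale. Errors of this size accumulate in the iterated residual, which is why the true residual is recomputed more often in the mixed-precision runs.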
Mass | Iterations (double-single) | Time (double-single) | GFLOPS (double-single) | Iterations (double-half) | Time (double-half) | GFLOPS (double-half) | Iterations (double-quarter) | Time (double-quarter) | GFLOPS (double-quarter) |
---|---|---|---|---|---|---|---|---|---|
0.00167 | 6179 | 8.73 | 1120 | 7710 | 6.65 | 1830 | 14973 | 8.59 | 2760 |
0.00334 | 4552 | 6.43 | 1120 | 5118 | 4.40 | 1830 | 7786 | 4.51 | 2730 |
0.00668 | 2621 | 3.72 | 1120 | 2884 | 2.51 | 1820 | 3790 | 2.24 | 2690 |
0.01336 | 1431 | 2.04 | 1110 | 1456 | 1.29 | 1800 | 1869 | 1.13 | 2640 |
0.02672 | 759 | 1.01 | 1100 | 759 | 0.678 | 1790 | 900 | 0.570 | 2540 |
0.05344 | 400 | 0.594 | 1090 | 400 | 0.374 | 1730 | 487 | 0.325 | 2440 |
These results demonstrate that more GFLOPS does not necessarily mean a reduced time to solution. In particular we see that
- the double-single solver has an almost identical iteration count to the pure double solver, and as a result achieves the expected ideal 2x speedup
- while the double-half shows increased iteration count at small quark mass, overall it is stable and represents the sweet spot with a 2.6x speedup over the pure double solver at the light quark mass and 3.1x speedup at the heavy quark mass
- the double-quarter solver is not numerically stable as we reduce the quark mass. At the light quark mass the iteration count more than doubles, giving a solve time similar to the double-single solver, though at the heaviest quark mass we achieve a 3.6x speedup over the compressed pure double solver (4.3x over the uncompressed one)
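
The speedups in the list above can be reproduced from the two tables. The script below is a small bookkeeping sketch with illustrative names; it takes the recon 13/9 pure double times as the baseline, since gauge compression is assumed throughout, and reports the speedup and iteration growth for each mixed-precision variant.

```python
# Pure double solver with recon 13/9 (baseline), from the first table.
base_iters = [6178, 4551, 2621, 1431, 759, 400]
base_time = [17.5, 12.9, 7.46, 4.09, 2.18, 1.16]

# Mixed-precision results from the second table, lightest mass first.
mixed = {
    "double-single": {
        "iters": [6179, 4552, 2621, 1431, 759, 400],
        "time": [8.73, 6.43, 3.72, 2.04, 1.01, 0.594],
    },
    "double-half": {
        "iters": [7710, 5118, 2884, 1456, 759, 400],
        "time": [6.65, 4.40, 2.51, 1.29, 0.678, 0.374],
    },
    "double-quarter": {
        "iters": [14973, 7786, 3790, 1869, 900, 487],
        "time": [8.59, 4.51, 2.24, 1.13, 0.570, 0.325],
    },
}


def speedups(name):
    """Time-to-solution speedup over the compressed double solver, per mass."""
    return [t0 / t for t0, t in zip(base_time, mixed[name]["time"])]


def iteration_growth(name):
    """Ratio of iteration counts relative to the double solver, per mass."""
    return [i / i0 for i0, i in zip(base_iters, mixed[name]["iters"])]


if __name__ == "__main__":
    for name in mixed:
        s, g = speedups(name), iteration_growth(name)
        print(f"{name:15s} light: {s[0]:.1f}x (iters x{g[0]:.2f})  "
              f"heavy: {s[-1]:.1f}x (iters x{g[-1]:.2f})")
```

Against this baseline the double-quarter solver reaches about 3.6x at the heaviest mass; the 4.3x figure corresponds to dividing the uncompressed double time (1.39) by the double-quarter time (0.325).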