HISQ mixed precision deflation

maddyscientist edited this page Dec 6, 2019 · 19 revisions

For this case study we are using a 48x48x48x12 configuration as provided by the HotQCD collaboration.

| Parameter | Value |
| --- | --- |
| Volume | 48x48x48x12 |
| Gauge action | Improved Symanzik |
| beta | 6.794 |
| Fermion action | HISQ fermions |
| light quark mass | 0.00167 |
| strange quark mass | 0.0450 |

Here we examine the quark mass dependence of the solve time as we scale from the light to the strange quark mass, progressively optimizing the solver with mixed precision and deflation. For this study we use the staggered_invert_test example code that is included with QUDA, run on a workstation with 2x Quadro GV100 GPUs. All of these runs use a launch syntax of the form:

export QUDA_RESOURCE_PATH=.
ARGS="--dim 48 48 24 12 --gridsize 1 1 2 1 --load-gauge /scratch/mathias/l4812f21b6794m00167m0450c_130.quda --compute-fat-long true --test 1"

mpirun -np 2 tests/staggered_invert_test $ARGS $RECON $PREC $SOLVER $EIG --mass 0.00167 --verbosity verbose

where we adjust the variables SOLVER, PREC, RECON, and EIG to select the solver parameters for each run.
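The mass scan reported in the tables on this page doubles the quark mass five times, starting from the light quark mass. A minimal sketch of a driver loop around the launch line above is shown here. It assumes ARGS, RECON, PREC, SOLVER and EIG have already been set; the `run_mass` helper and the `DRY_RUN` switch are hypothetical names introduced for illustration and are not part of QUDA.

```shell
#!/usr/bin/env bash
# Sketch of a mass sweep around the launch line above.
# run_mass and DRY_RUN are illustrative names, not part of QUDA.
DRY_RUN="${DRY_RUN:-1}"   # default: only print each command

# Masses taken from the result tables on this page.
MASSES="0.00167 0.00334 0.00668 0.01336 0.02672 0.05344"

run_mass() {
  # $1 = quark mass; ARGS, RECON, PREC, SOLVER and EIG come from the caller.
  local cmd="mpirun -np 2 tests/staggered_invert_test $ARGS $RECON $PREC $SOLVER $EIG --mass $1 --verbosity verbose"
  if [ "$DRY_RUN" = "1" ]; then
    echo "$cmd"
  else
    $cmd   # intentionally unquoted so the string splits into arguments
  fi
}

for m in $MASSES; do
  run_mass "$m"
done
```

With the default `DRY_RUN=1` the loop only prints the six command lines, which is a cheap way to check the variables before launching; setting `DRY_RUN=0` runs them.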

Starting point - Pure Double CG solver

Our starting point is a standard double precision CG solver. This uses the parameters:

SOLVER="--inv-type cg --tol 1e-10 --reliable-delta 0.001 --niter 10000"
PREC="--prec double"

which means we run a double precision CG solver to a relative residual tolerance of 1e-10, with a maximum iteration count of 10000. The true residual is reliably updated every time the iterated residual drops by 3 orders of magnitude (--reliable-delta 0.001).
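To make the mapping between flags and description explicit, the same SOLVER string can be assembled from named pieces. The sketch below also gives a rough, idealized count of reliable updates, obtained by reading the description literally: it counts how many full factors of the reliable delta fit between a relative residual of 1 and the tolerance. This assumes a monotonically falling residual, which CG does not guarantee, so the count in a real run may differ. The names `TOL`, `DELTA`, `NITER` and `estimate_updates` are hypothetical.

```shell
#!/usr/bin/env bash
# Same SOLVER string as above, assembled from named pieces.
TOL=1e-10      # target relative residual
DELTA=0.001    # reliable update after a 1000x drop in the iterated residual
NITER=10000    # iteration cap
SOLVER="--inv-type cg --tol $TOL --reliable-delta $DELTA --niter $NITER"
echo "$SOLVER"

# Idealized estimate only: number of full factors of DELTA between
# a relative residual of 1 and TOL, assuming a monotonic decrease.
estimate_updates() {
  awk -v tol="$1" -v delta="$2" 'BEGIN { print int(log(tol) / log(delta)) }'
}

estimate_updates "$TOL" "$DELTA"
```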

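| Mass | Iterations | Time (s) | GFLOPS |
| --- | --- | --- | --- |
| 0.00167 | 6178 | 20.8658 | 467.609 |
| 0.00334 | 4551 | 15.4005 | 466.992 |
| 0.00668 | 2621 | 8.88522 | 466.961 |
| 0.01336 | 1431 | 4.8658 | 467.117 |
| 0.02672 | 759 | 2.59715 | 467.213 |
| 0.05344 | 400 | 1.38505 | 467.437 |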
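One way to read the mass dependence off this table is to look at how much the iteration count falls each time the mass doubles. The sketch below does that arithmetic with awk, using the iteration counts copied from the table above; the function name `iteration_ratios` is hypothetical.

```shell
#!/usr/bin/env bash
# Columns: mass, CG iterations (from the double precision table above).
# Prints the ratio iterations(m) / iterations(2m) for each doubling.
iteration_ratios() {
  awk 'NR > 1 { printf "%s -> %s: %.2f\n", prev_mass, $1, prev_iter / $2 }
       { prev_mass = $1; prev_iter = $2 }' <<'EOF'
0.00167 6178
0.00334 4551
0.00668 2621
0.01336 1431
0.02672 759
0.05344 400
EOF
}

iteration_ratios
```

The ratio is smallest for the lightest pair of masses and rises towards the heavier end of the scan.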

With HISQ fermions we can also use compression on the long-link field to reduce the memory traffic. We do so with these parameters:

RECON="--recon 13 --recon-sloppy 9"

where we only use the maximal reconstruct-9 compression on the sloppy updates to ensure stability. In doing so we see that the iteration count is unchanged, and the overall solve time improves by 1.19x.

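| Mass | Iterations | Time (s) | GFLOPS |
| --- | --- | --- | --- |
| 0.00167 | 6178 | 17.5381 | 556.334 |
| 0.00334 | 4551 | 12.9228 | 556.527 |
| 0.00668 | 2621 | 7.45696 | 556.4 |
| 0.01336 | 1431 | 4.08501 | 556.4 |
| 0.02672 | 759 | 2.18148 | 556.24 |
| 0.05344 | 400 | 1.16466 | 555.893 |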
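The 1.19x figure can be checked per mass by dividing the solve time without compression by the solve time with it. The sketch below does this with awk, using the times copied from the two tables above; the function name `recon_speedups` is hypothetical.

```shell
#!/usr/bin/env bash
# Columns: mass, time without compression,
#          time with --recon 13 --recon-sloppy 9.
# Prints the mass and the speedup from long-link compression.
recon_speedups() {
  awk '{ printf "%s %.2f\n", $1, $2 / $3 }' <<'EOF'
0.00167 20.8658 17.5381
0.00334 15.4005 12.9228
0.00668 8.88522 7.45696
0.01336 4.8658 4.08501
0.02672 2.59715 2.18148
0.05344 1.38505 1.16466
EOF
}

recon_speedups
```

Since the iteration counts are identical in both tables, this ratio reflects the time per iteration only.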