Memory Problem #91

Open
philomat opened this issue May 18, 2021 · 1 comment

Comments

@philomat

I was just trying to run a couple of tests on Summit. Unfortunately, all of them ran out of memory, and I do not understand why.

I am trying to run a simple measurement of a pion correlator on a 64^4 lattice with physical pion mass, using the Wilson clover Dirac operator. I am using a 2-level multigrid solver that I basically took from tests/algorithms/multigrid.py; the only difference is that I use a g.mspincolor(grid) as the right-hand side, as sketched below.
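
For reference, my setup looks roughly like this (a sketch: the kappa/csw values and the file path are placeholders, not my actual parameters):

import gpt as g

grid = g.grid([64, 64, 64, 64], g.double)
U = g.load("/path/to/config")  # placeholder path

# Wilson-clover operator; parameter values here are placeholders
w = g.qcd.fermion.wilson_clover(
    U,
    {
        "kappa": 0.137,
        "csw_r": 1.0,
        "csw_t": 1.0,
        "xi_0": 1,
        "nu": 1,
        "isAnisotropic": False,
        "boundary_phases": [1.0, 1.0, 1.0, -1.0],
    },
)

# full spin-color matrix source instead of the test's single vspincolor
src = g.mspincolor(grid)
g.create.point(src, [0, 0, 0, 0])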

Here is some output from g.mem_report() calls that I put at several places in my code:

After loading the gauge field:

GPT : 131.608838 s : ====================================================================================================================================
GPT : 131.608853 s : GPT Memory Report
GPT : 131.608865 s : ====================================================================================================================================
GPT : 131.608876 s : Index Grid Precision OType CBType Size/GB Created at time
GPT : 131.608912 s : 0 [64, 64, 64, 64] double ot_matrix_su_n_fundamental_group(3) full 2.25 122.323134 s
GPT : 131.608932 s : 1 [64, 64, 64, 64] double ot_matrix_su_n_fundamental_group(3) full 2.25 122.784528 s
GPT : 131.608951 s : 2 [64, 64, 64, 64] double ot_matrix_su_n_fundamental_group(3) full 2.25 123.230915 s
GPT : 131.608968 s : 3 [64, 64, 64, 64] double ot_matrix_su_n_fundamental_group(3) full 2.25 123.696448 s
GPT : 131.608979 s : ====================================================================================================================================
GPT : 131.608989 s : Lattice fields on all ranks 9 GB
GPT : 131.608999 s : Lattice fields per rank 0.140625 GB
GPT : 131.609009 s : Resident memory per rank 3.50134 GB
GPT : 131.609019 s : Total memory available (host) 504.182 GB
GPT : 131.609030 s : Total memory available (accelerator) 9.61707 GB
GPT : 131.609038 s : ====================================================================================================================================

After setting the point source:

GPT : 131.801943 s : ====================================================================================================================================
GPT : 131.801998 s : GPT Memory Report
GPT : 131.802015 s : ====================================================================================================================================
GPT : 131.802032 s : Index Grid Precision OType CBType Size/GB Created at time
GPT : 131.802074 s : 0 [64, 64, 64, 64] double ot_matrix_su_n_fundamental_group(3) full 2.25 122.323134 s
GPT : 131.802102 s : 1 [64, 64, 64, 64] double ot_matrix_su_n_fundamental_group(3) full 2.25 122.784528 s
GPT : 131.802129 s : 2 [64, 64, 64, 64] double ot_matrix_su_n_fundamental_group(3) full 2.25 123.230915 s
GPT : 131.802156 s : 3 [64, 64, 64, 64] double ot_matrix_su_n_fundamental_group(3) full 2.25 123.696448 s
GPT : 131.802183 s : 4 [64, 64, 64, 64] double ot_matrix_spin_color(4,3) full 36 131.609090 s
GPT : 131.802198 s : ====================================================================================================================================
GPT : 131.802214 s : Lattice fields on all ranks 45 GB
GPT : 131.802230 s : Lattice fields per rank 0.703125 GB
GPT : 131.802246 s : Resident memory per rank 3.5094 GB
GPT : 131.802261 s : Total memory available (host) 503.6 GB
GPT : 131.802279 s : Total memory available (accelerator) 9.61707 GB
GPT : 131.802292 s : ====================================================================================================================================

After the multigrid setup (here I only show the summary without details, but there are an additional 30 instances of ot_vector_spin_color(4,3), each 3 GB in size):

GPT : 345.357736 s : ====================================================================================================================================
GPT : 345.357749 s : Lattice fields on all ranks 135 GB
GPT : 345.357763 s : Lattice fields per rank 2.10938 GB
GPT : 345.357776 s : Resident memory per rank 4.46045 GB
GPT : 345.357788 s : Total memory available (host) 496.169 GB
GPT : 345.357802 s : Total memory available (accelerator) 9.21082 GB
GPT : 345.357813 s : ====================================================================================================================================
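
(The numbers add up: 4 × 2.25 GB of gauge links + 36 GB for the source + 30 × 3 GB of multigrid vectors = 9 + 36 + 90 = 135 GB.)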

So up until here everything seems fine. The only thing that confuses me is the "Total memory available (accelerator)": I am running on 64 GPUs and would expect this number to be bigger, but maybe it is just the memory available on a single GPU. I also do not understand what "Resident memory per rank" means.

Now, I define my propagator and start the inversions by calling:

dst = g.eval(fgmres_outer(w) * src)

And I immediately run out of memory:

cudaMalloc failed for 603979776 out of memory
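
For what it's worth, the size of the failing allocation is exactly the local piece of one mspincolor field on this decomposition:

64^4 sites × 144 complex numbers × 16 bytes = 38,654,705,664 bytes ≈ 36 GB (one full mspincolor, matching index 4 above)
38,654,705,664 bytes / 64 ranks = 603,979,776 bytes per rank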

Obviously, the inverter needs additional memory, not only for the propagator but also for all the helper fields used internally. But why more than 800GB?
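
A pattern I have seen in the gpt examples, which might reduce the peak usage, is the propagator wrapper on the operator (a sketch; I am assuming here that the wrapper solves the 12 spin-color columns one at a time, which I have not verified):

# sketch: wrap the solver as a propagator and apply it to the source
prop = w.propagator(fgmres_outer)
dst = g.mspincolor(grid)
dst @= prop * src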

My initial guess is that the inverter tries to run on a single GPU and therefore runs out of memory. But how do I tell it to distribute the work over all cards?

I tried running the program with:

jsrun --smpiargs="-gpu" -n 64 -c 6 -a 1 -g 1 python pion_2lvl_mg.py --mpi 2.2.4.4
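
(If I understand the flags correctly, --mpi 2.2.4.4 requests a 2 × 2 × 4 × 4 processor grid, i.e. 2·2·4·4 = 64 ranks, which matches -n 64 with -g 1, one rank per GPU; each rank then holds a 32 × 32 × 16 × 16 local volume.)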

@jfy4

jfy4 commented Feb 1, 2023

I'm not sure if it's the same issue, but in my simulations the memory usage also gradually increases over time. I'm not sure whether it's me or gpt, but eventually the program bails out because it has used all the memory on my computer.
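
In case it helps with debugging, I have been dropping summary-only memory reports into my main loop (a sketch: step() is a placeholder for one iteration of my simulation, and I am assuming mem_report accepts a details flag):

for it in range(n_iterations):
    step()  # placeholder: one simulation step
    if it % 100 == 0:
        g.mem_report(details=False)  # summary only, assuming this flag exists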
