Dear authors,
I am one of the CP2K developers and I am working on our quartically scaling SOS-MP2 and RPA implementations. Marko Kabic used energy-only calculations with RPA to benchmark COSMA (test system: 128 water molecules).
I am currently implementing gradients for these methods. My gradient implementation (available in the CP2K master branch) requires roughly 3-4 times the memory of an energy-only calculation. I am testing the code on the GPU partition of Piz Daint. The code runs well with ScaLAPACK (libsci_acc). With COSMA I can run smaller systems (up to 64 water molecules) and get a decent speedup of the PDGEMM calls compared to ScaLAPACK. Unfortunately, I cannot run larger systems (like 128 water molecules) even on 1000 nodes.
A gradient calculation consists of two PDGEMM calls with the following global sizes in the case of 128 H2O molecules:
m = n = 17,408 and k = 3,473,408 (also needed for energy-only calculations)
n = 3,473,408 and m = k = 17,408 (not required for energy-only calculations).
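For a rough sense of scale (my own back-of-the-envelope estimate, counting only the global matrices in double precision and none of the library-internal buffers): in the first call, A (m x k) and B (k x n) each hold 17,408 x 3,473,408 doubles, i.e. about 484 GB, while C (m x n) with 17,408 x 17,408 doubles is only about 2.4 GB; the second call has the same total, with the large dimension moving to B and C. So each multiplication involves roughly 1 TB of global matrix data, which already corresponds to about 16 Daint GPU nodes' worth of host memory (64 GB per node) before any algorithm-internal buffers.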
Depending on the setup, I observe out-of-memory events both on the GPU and on the CPU when COSMA is called.
My questions are:
What are COSMA's memory requirements, or at least what scaling behavior should I expect?
Would it be possible to add a hint that reports the actual amount of missing memory whenever COSMA is able to catch the OOM event?
Would it be possible to provide a function that tells COSMA to release its buffers, so that this otherwise idle memory can be used for other operations?
EDIT:
I can run energy-only calculations with 128 water molecules (just PDGEMM step 1) on 64 nodes, and the gradient calculations run on 2048 Daint nodes. Nevertheless, the memory requirements are extremely high, and it is very frustrating (and a waste of resources) to find a suitable number of nodes for a given calculation.
EDIT2:
The calculation with COSMA on 2048 nodes requires 3 times the resources compared to ScaLAPACK on 128 nodes.
I am not a COSMA developer, but I can offer some advice:
Simply set export COSMA_CPU_MAX_MEMORY=XXX
to a value around 2-3 times what you need to store the matrices. This should be enough to find a reasonable setting for COSMA, and you should outperform ScaLAPACK (at least for the large-k case).
ScaLAPACK should need roughly twice the memory of the matrices, as it uses the SUMMA algorithm.
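A sketch of how this might look in a Slurm job script (the matrix-size figure, file names and launch line are placeholders, and the unit that COSMA_CPU_MAX_MEMORY expects should be checked against the COSMA README; GB is assumed here):
# placeholder job-script snippet: cap COSMA's per-rank buffer memory
# at roughly 3x the per-rank share of the global matrices A, B and C
NRANKS=${SLURM_NTASKS:-1}
MATRIX_GB=970                                               # approximate total size of A, B and C in double precision
export COSMA_CPU_MAX_MEMORY=$(( (3 * MATRIX_GB + NRANKS - 1) / NRANKS ))   # ceiling division; verify the expected unit
srun cp2k.psmp -i H2O-128.inp -o H2O-128.out                # launch CP2K as usual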
It does not help with the default settings. I could get it running by simply setting COSMA_ADAPT_STRATEGY=OFF, but I wonder why the default strategy does not handle this case properly.
I have the same issue with GPU runs on NERSC Perlmutter. I am running the COSMA matrix-multiplication miniapp with m=n=k=25,000, and it fails with OOM errors even on 100 nodes. I built COSMA with the regular CUDA options (no NCCL or GPU-aware MPI).
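(For scale, by my count each 25,000 x 25,000 double-precision matrix is 25,000^2 x 8 B = 5 GB, so A, B and C together are only about 15 GB.)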