Can't Allocate Memory (Issue #604)
Dear @bartdemooij, thanks for reporting. It seems your system is too large to fit into your GPU memory. At this point we have implemented only direct algorithms, which are memory-bound depending on matrix sizes. Therefore you have two choices: (i) reduce the system size, or (ii) get a GPU with more memory. We are working on simulation plugins for LAMMPS and AMBER; with domain decomposition you would be able to run much larger systems in a distributed fashion.
Dear @isayev, thanks for the swift reply. We ran this system on a CPU with 64 GB of RAM. Would you say it is to be expected that all of it gets used up by 1000 ethanol molecules (9000 atoms)? Perhaps this is a trivial question, but how does memory usage scale with system size?
Since your code raises a CUDA memory error, I would assume you need to check your run script for correctness; it seems to still be running on a GPU. The typical suspects are the CUDA_VISIBLE_DEVICES environment variable and the torch.device definition in your code.
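As an illustration of that check (a minimal sketch, not code from the thread), hiding all GPUs via CUDA_VISIBLE_DEVICES before PyTorch initializes CUDA forces a genuine CPU run:

```python
import os

# An empty CUDA_VISIBLE_DEVICES hides every GPU from CUDA-aware libraries;
# it must be set before torch first initializes CUDA.
os.environ["CUDA_VISIBLE_DEVICES"] = ""

# With PyTorch/TorchANI installed one would then pin the model to the CPU
# explicitly (sketch only, not executed here):
#   import torch, torchani
#   device = torch.device("cpu")
#   model = torchani.models.ANI1ccx().to(device)
```

Setting the environment variable inside the script only works if it runs before the first CUDA call; setting it in the job submission script is the safer option on a cluster.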
Bart: The current memory scaling is O(N^2), since the TorchANI code calculates an NxN distance matrix to find neighbors. In the case of PBC, the code builds extra images (in your case of a cubic cell, it would be 18 cells) to find all neighbors. This is the stage at which you run out of memory.
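To make the quadratic scaling concrete, here is a hypothetical back-of-the-envelope estimate (my own sketch, assuming the full 3x3x3 block of 27 periodic images — the count the thread later settles on — and one float32 shift vector per candidate pair; the exact bookkeeping inside torchani.aev differs slightly). It lands within about 0.001% of the 26,243,892,000-byte allocation reported in the original traceback:

```python
# Back-of-the-envelope estimate of the failing allocation (assumptions:
# 27 periodic images, float32, one 3-vector per candidate atom pair).
n_atoms = 9000            # 1000 ethanol molecules x 9 atoms each
n_images = 27 * n_atoms   # atoms replicated over a 3x3x3 block of cells
bytes_per_vec = 3 * 4     # one (x, y, z) float32 shift vector

pairwise_bytes = n_atoms * n_images * bytes_per_vec
print(pairwise_bytes)     # 26244000000, i.e. ~26 GB
```

Doubling the atom count roughly quadruples this figure, which matches the observation below that 500 ethanol works while 750 and 1000 do not.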
Thank you, the memory error is clear. @zubatyuk The only thing I don't understand is how you get 18 cells for PBC in three dimensions, as I thought it would be 26. Am I right that you get 18 by 3x3x3 - 1 (the original) - 8 (the corner boxes) = 18? If this is the case, why are you allowed to omit the corner boxes? If not, how do you get 18 images?
Sorry, it was clearly my mistake. 3x3x3 is 27 indeed.
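For completeness, the image count can be checked in a couple of lines (a trivial sketch, not code from the thread):

```python
from itertools import product

# All shift vectors in the 3x3x3 block of periodic images around the cell.
shifts = list(product((-1, 0, 1), repeat=3))
neighbor_images = [s for s in shifts if s != (0, 0, 0)]

print(len(shifts))           # 27 cells in total
print(len(neighbor_images))  # 26 once the original cell is excluded
```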
We are currently preparing to release a new TorchANI version that addresses this issue: it supports a built-in cell list, which makes the scaling O(N). It should be ready in the next month or two, and you will then be able to run much larger systems without issue.
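To illustrate the idea behind a cell list (a minimal pure-Python sketch of the general technique, not TorchANI's implementation; it assumes an open boundary with no periodic wrapping, for clarity): atoms are binned into cubic cells with edge length at least the cutoff, so each atom only needs to be compared against atoms in its own and the 26 adjacent cells, instead of against all N atoms.

```python
from collections import defaultdict
from itertools import product

def cell_list_pairs(coords, cutoff):
    """Return sorted (i, j) pairs with distance <= cutoff, via a cell list."""
    # Bin atom indices into cubic cells of edge length `cutoff`.
    cells = defaultdict(list)
    for i, (x, y, z) in enumerate(coords):
        cells[(int(x // cutoff), int(y // cutoff), int(z // cutoff))].append(i)
    # Only atoms in the same or an adjacent cell can lie within the cutoff.
    pairs = set()
    for (cx, cy, cz), members in cells.items():
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            for j in cells.get((cx + dx, cy + dy, cz + dz), ()):
                for i in members:
                    if i < j:
                        d2 = sum((coords[i][k] - coords[j][k]) ** 2
                                 for k in range(3))
                        if d2 <= cutoff * cutoff:
                            pairs.add((i, j))
    return sorted(pairs)

# Three atoms on a line: only the first two are within the 1.0 cutoff.
print(cell_list_pairs([(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (3.0, 0.0, 0.0)], 1.0))
# [(0, 1)]
```

At roughly constant density the number of atoms per cell is bounded, so the total work grows linearly with N rather than quadratically.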
Dear all,
What would be the best way to perform high-performance molecular dynamics with ANI on a cluster? We run TorchANI in combination with ASE. Currently, running a box of 1000 ethanol molecules gives the following error during the BFGS optimisation:
warnings.warn(
Traceback (most recent call last):
  File "/home/bmooij/ANI_quality_check/MD_ethanol_quality_check_ANI.py", line 49, in <module>
    opt.run(fmax=1.0)
  File "/home/bmooij/.conda/envs/py9/lib/python3.9/site-packages/ase/optimize/optimize.py", line 269, in run
    return Dynamics.run(self)
  File "/home/bmooij/.conda/envs/py9/lib/python3.9/site-packages/ase/optimize/optimize.py", line 156, in run
    for converged in Dynamics.irun(self):
  File "/home/bmooij/.conda/envs/py9/lib/python3.9/site-packages/ase/optimize/optimize.py", line 122, in irun
    self.atoms.get_forces()
  File "/home/bmooij/.conda/envs/py9/lib/python3.9/site-packages/ase/atoms.py", line 788, in get_forces
    forces = self._calc.get_forces(self)
  File "/home/bmooij/.conda/envs/py9/lib/python3.9/site-packages/ase/calculators/abc.py", line 23, in get_forces
    return self.get_property('forces', atoms)
  File "/home/bmooij/.conda/envs/py9/lib/python3.9/site-packages/ase/calculators/calculator.py", line 737, in get_property
    self.calculate(atoms, [name], system_changes)
  File "/home/bmooij/.conda/envs/py9/lib/python3.9/site-packages/torchani/ase.py", line 82, in calculate
    energy = self.model((species, coordinates), cell=cell, pbc=pbc).energies
  File "/home/bmooij/.conda/envs/py9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/bmooij/.conda/envs/py9/lib/python3.9/site-packages/torchani/models.py", line 106, in forward
    species_aevs = self.aev_computer(species_coordinates, cell=cell, pbc=pbc)
  File "/home/bmooij/.conda/envs/py9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/bmooij/.conda/envs/py9/lib/python3.9/site-packages/torchani/aev.py", line 533, in forward
    aev = compute_aev(species, coordinates, self.triu_index, self.constants(), self.sizes, (cell, shifts))
  File "/home/bmooij/.conda/envs/py9/lib/python3.9/site-packages/torchani/aev.py", line 288, in compute_aev
    atom_index12, shifts = neighbor_pairs(species == -1, coordinates_, cell, shifts, Rcr)
  File "/home/bmooij/.conda/envs/py9/lib/python3.9/site-packages/torchani/aev.py", line 171, in neighbor_pairs
    shifts_all = torch.cat([shifts_center, shifts_outside])
RuntimeError: [enforce fail at CPUAllocator.cpp:71] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 26243892000 bytes. Error code 12 (Cannot allocate memory)
It works fine when the box contains few molecules (e.g. 125 ethanol), but starts to give this error for larger systems (e.g. 750 or 1000 ethanol). A system of 500 ethanol also seems to work, but is terribly slow.
Some reproducible code (where the file 'ethanol_1000.pdb' is a box of 1000 ethanol molecules made with Packmol):
Best regards,
Bart