Restart from checkpoint fails when some weights have been equilibrated, but not all #68

Closed
ajfriedman22 opened this issue Oct 7, 2024 · 0 comments · Fixed by #69
Labels
bug Something isn't working

Comments


ajfriedman22 commented Oct 7, 2024

Describe the bug
When you attempt to restart an MT-REXEE or EEXE simulation from a checkpoint file when some of the simulations have equilibrated and others have not, an error occurs. The error text is below. The error occurs because a simulation that equilibrated before the checkpoint is counted as a fixed-weight simulation rather than as a variable-weight simulation that has already equilibrated.

**Error Text**

An error occurred on rank 0:
Traceback (most recent call last):
  File "/projects/anfr8476/code/ensemble_md/ensemble_md/cli/run_REXEE.py", line 246, in main
    _ = REXEE.combine_weights(weights, print_values=False)[1]  # just to print the combined weights  # noqa: E501
  File "/projects/anfr8476/code/ensemble_md/ensemble_md/replica_exchange_EE.py", line 1353, in combine_weights
    weights_modified[i] = self.equilibrated_weights[i]
ValueError: could not broadcast input array from shape (0,) into shape (9,)
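The ValueError itself is a generic NumPy broadcasting failure: after a restart, the equilibrated replica's stored weights come back as an empty array, which cannot be assigned into a row sized for the full set of lambda states. A minimal sketch of the failure mode (the 2-replica/9-state shapes are illustrative, not taken from the actual run):

```python
import numpy as np

# 2 replicas, 9 lambda states each (illustrative shapes)
weights_modified = np.zeros((2, 9))

# After restart, the already-equilibrated replica's weights were lost,
# so the stored entry is an empty array instead of a length-9 vector.
equilibrated_weights = [np.array([]), None]

try:
    weights_modified[0] = equilibrated_weights[0]  # shape (0,) into shape (9,)
except ValueError as e:
    print(e)  # could not broadcast input array from shape (0,) into shape (9,)
```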


To Reproduce

  1. Start a variable weight simulation
  2. When at least one simulation has equilibrated, but not all of them, stop the simulation (a checkpoint file must be saved after at least one simulation has equilibrated)
  3. Restart the simulation from the checkpoint

Proposed Fix
We can add an additional checkpoint .npy file that saves the equilibration times of any simulations that have already equilibrated, and load these when restarting from a checkpoint.
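The proposed fix could be sketched as follows. The file name, function names, and the convention of marking unequilibrated replicas with -1 are all assumptions for illustration, not the actual ensemble_md API:

```python
import os
import numpy as np

EQUIL_FILE = "equil_times.npy"  # hypothetical extra checkpoint file

def save_equil_times(equil_times):
    """Persist per-replica equilibration times (-1 = not yet equilibrated)."""
    np.save(EQUIL_FILE, np.asarray(equil_times, dtype=float))

def load_equil_times(n_replicas):
    """On restart, reload equilibration times; default to 'not equilibrated'."""
    if os.path.exists(EQUIL_FILE):
        return np.load(EQUIL_FILE).tolist()
    return [-1.0] * n_replicas
```

On restart, replicas whose loaded time is non-negative would then be treated as variable-weight simulations that have already equilibrated, rather than as fixed-weight simulations.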

@ajfriedman22 ajfriedman22 added the bug Something isn't working label Oct 7, 2024
@ajfriedman22 ajfriedman22 linked a pull request Oct 17, 2024 that will close this issue