
Use the same stream as the OpenMM context when replaying a CUDA graph. #122

Merged: 2 commits into openmm:master on Sep 27, 2023

Conversation

RaulPPelaez
Contributor

Issue torchmd/torchmd-net#220 uncovered a race condition in the current graph replay mechanism. By asking torch for a CUDA stream, we risk a later OpenMM workload running on a different stream and therefore accessing data that is not yet ready.

In this PR I make sure that the CUDA graph replay uses the same stream as the OpenMM context, while ensuring that CUDA graph capture still occurs on a non-default stream (a requirement for capture).
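
For illustration, here is a minimal Python-level sketch of that stream discipline using PyTorch's public CUDA graph API; the real change lives in the plugin's C++ code, and the model, inputs, and `openmm_stream_ptr` handle below are placeholders:

    import torch

    model = torch.nn.Linear(16, 16).cuda()         # stand-in for the TorchForce model
    positions = torch.randn(8, 16, device="cuda")  # stand-in inputs

    # One eager call outside capture initializes cuBLAS etc. (warmup; see the
    # warmup discussion further down).
    energy = model(positions)
    torch.cuda.synchronize()

    # Capture must happen on a non-default stream.
    graph = torch.cuda.CUDAGraph()
    capture_stream = torch.cuda.Stream()
    with torch.cuda.graph(graph, stream=capture_stream):
        energy = model(positions)

    # Replay on the stream the OpenMM context is using, so kernels that OpenMM
    # enqueues afterwards see the graph's results. In the plugin this handle
    # would come from the context; here it is just a placeholder.
    openmm_stream_ptr = torch.cuda.current_stream().cuda_stream
    context_stream = torch.cuda.ExternalStream(openmm_stream_ptr)
    with torch.cuda.stream(context_stream):
        graph.replay()  # CUDAGraph.replay() launches on the current stream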

@raimis please confirm this fixes the original issue in your environment.

Use a non-default stream for CUDA graph capturing.
@RaulPPelaez
Contributor Author

Sadly this was not enough and the error popped up again.
I also had to increase the warmup steps to at least 5:

    force.setProperty("CUDAGraphWarmupSteps", "5")

I have no idea why 5 is the magic number. Requiring more than 1 (the default) does not surprise me (for instance, DDP requires 11).
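
For context, both knobs are exposed as TorchForce properties on the Python side; a hedged usage sketch, assuming the useCUDAGraphs property name from the plugin's README and a placeholder model file:

    from openmmtorch import TorchForce

    force = TorchForce("model.pt")  # placeholder TorchScript model file

    # Enable CUDA graphs and raise the warmup count from the default of 1 to 5.
    force.setProperty("useCUDAGraphs", "true")
    force.setProperty("CUDAGraphWarmupSteps", "5")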

The error that is generated makes me think warmup is indeed the issue here; this warning goes away when the reproducer runs:

    [W manager.cpp:335] Warning: FALLBACK path has been taken inside: runCudaFusionGroup. This is an indication that codegen Failed for some reason.
    To debug try disable codegen fallback path via setting the env variable `export PYTORCH_NVFUSER_DISABLE=fallback`
    (function runCudaFusionGroup)

I guess torch is sometimes choosing a non-graphable path during the first calls. It sounds like torch waits for a few calls before trying to pass the model through nvfuser, and by capturing the CUDA graph that early we break this.
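
The standard PyTorch recipe for this is to run the model eagerly a few times on a side stream before capturing, which is essentially what CUDAGraphWarmupSteps controls on the plugin side; a sketch of that pattern, with a placeholder model and warmup count:

    import torch

    model = torch.nn.Linear(16, 16).cuda()   # placeholder model
    x = torch.randn(8, 16, device="cuda")
    warmup_steps = 5                          # the value that worked here; the default is 1

    # Run a few eager iterations on a side stream so lazy initialization and
    # JIT/fuser path selection (e.g. nvfuser) settle before anything is captured.
    side_stream = torch.cuda.Stream()
    side_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side_stream):
        for _ in range(warmup_steps):
            y = model(x)
    torch.cuda.current_stream().wait_stream(side_stream)

    # Only capture afterwards, so the graph records the settled kernels.
    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        y = model(x)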

I am not sure how to go about this; perhaps increasing the default warmup steps to something like 10 would be sensible?
I feel like we really do not have control over things like torch calling nvfuser...

I could also try to rewrite it so that, instead of TorchForce calling the torch model X times during initialization, it tracks the number of times it has been called and only captures after X calls. I am more inclined to leave it as is, since whatever torch does to warm up might only happen if there is no other code in between calls.

@raimis
Contributor

raimis commented Sep 26, 2023

@RaulPPelaez good work! I'll test the fix.

@raimis
Contributor

raimis commented Sep 26, 2023

@RaulPPelaez I can preliminarily confirm that the fix works. I'll run a complete simulation to verify definitively.

@raimis
Contributor

raimis commented Sep 26, 2023

> I could also try to rewrite it so that, instead of TorchForce calling the torch model X times during initialization, it tracks the number of times it has been called and only captures after X calls. I am more inclined to leave it as is, since whatever torch does to warm up might only happen if there is no other code in between calls.

Yes, I agree. You can leave it as it is.

Contributor

raimis left a comment


Looks good! Just be explicit in the docs.

Inline comment on README.md (resolved)
Member

peastman left a comment


Looks good.

raimis merged commit a8f467b into openmm:master Sep 27, 2023
4 checks passed
raimis mentioned this pull request Oct 9, 2023