[flang][OpenMP] Privatize locally destroyed values in do concurrent
#112
Conversation
Locally destroyed values are those for which the Fortran runtime calls `@_FortranADestroy` inside the loop's body. If such a value is allocated outside the loop and the loop is mapped to OpenMP, a runtime error occurs because multiple teams try to access the same allocation. In such cases, a privatized copy is created in the OpenMP region so that multiple teams of threads do not access and destroy the same memory block.
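As an illustration, here is a minimal, hypothetical Fortran sketch (not taken from this PR's tests) of the kind of loop where the runtime destroys a temporary inside the body; the type and variable names are invented for the example:

```fortran
! Hypothetical sketch (not from this PR's test suite): each iteration
! assigns a structure-constructor value to items(i). The right-hand
! side is a compiler-managed temporary of a derived type with an
! allocatable component, and its end-of-statement cleanup lowers to a
! Fortran runtime destroy call inside the loop body. If that
! temporary's storage were allocated once outside the loop and the
! loop mapped to OpenMP, every thread would destroy the same memory.
program locally_destroyed_temp
  implicit none
  type :: payload_t
    real, allocatable :: buf(:)
  end type payload_t
  type(payload_t) :: items(100)
  integer :: i

  do concurrent (i = 1:100)
    items(i) = payload_t([real :: i, 2*i, 3*i])
  end do
end program locally_destroyed_temp
```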
Ping 🔔! Please take a look and let me know if you have any concerns or comments.
I think this looks good, but I'm not a real expert here.
It looks to me that the issue here is more fundamental than just temporaries being deleted in a loop. This is a race condition that occurs every time the same shared variable is read and written from within a `do concurrent` loop; it is a consequence of running loop iterations in parallel when they are only guaranteed to be correct if each iteration executes as an atomic unit, even though they may run in any order.
I think the proper solution should involve checking that a given loop can be parallelized before trying to do so. However, we're currently giving compiler users the responsibility of checking their do-concurrent loops for such potential race conditions and blindly transforming loops if they tell us to do so.
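For concreteness, a hedged Fortran illustration (not from this PR) of the kind of dependent loop being described; the subroutine and variable names are invented:

```fortran
! Hedged illustration (not from this PR): `total` is read and written
! by every iteration without a locality specifier or a reduction, so
! the loop is only correct if each iteration executes atomically.
! Mapping it to OpenMP as-is introduces a data race; under the current
! scheme, not requesting that mapping is the user's responsibility.
subroutine sum_racy(a, n, total)
  implicit none
  integer, intent(in)  :: n
  integer, intent(in)  :: a(n)
  integer, intent(out) :: total
  integer :: i

  total = 0
  do concurrent (i = 1:n)
    total = total + a(i)   ! shared read-modify-write across iterations
  end do
end subroutine sum_racy
```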
So, assuming the user has verified that the original loop can actually run in parallel (otherwise they can't expect a correct result given the current restrictions), the only way we can run into these race conditions is through our own transformations. My guess is that the problem here is that we're inserting allocas for temporaries during lowering, before the `omp.parallel` operation is created by this pass, which would otherwise be the target location for them.
I think that in this pass we should identify every alloca that is only used within the loop being transformed and then sink it into the `omp.parallel` operation once it's created (both for host and device), regardless of the presence of "destroy" function calls. Something similar was already implemented by `getSinkableAllocas()` in OpenMPToLLVMIRTranslation.cpp, so maybe that helps as a starting point (it's intended for a different case, so I don't think you can just copy-paste it and use it here).
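As a source-level intuition for "an alloca only used within the loop", here is a hedged, hypothetical Fortran sketch; the names and the behavior described in the comments are illustrative assumptions, not code from this PR:

```fortran
! Hypothetical view of a sinkable, loop-local value: `tmp` is written
! and read only inside the loop body, so the alloca backing it could
! be sunk into (or emitted directly inside) the omp.parallel region,
! giving each thread its own copy. `a` escapes the loop and must
! remain shared.
subroutine scale_shift(a, n)
  implicit none
  integer, intent(in) :: n
  real, intent(inout) :: a(n)
  real :: tmp
  integer :: i

  do concurrent (i = 1:n)
    tmp = 2.0 * a(i)     ! only uses of tmp are within the loop body
    a(i) = tmp + 1.0
  end do
end subroutine scale_shift
```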
This is, frankly, a bit of a philosophical question. There are basically two camps (I'm in camp A):
The current way we are dealing with this, where we will have different levels of parallelization: since we are focusing on getting things to work as fast as we can, we are ignoring
Thanks @skatrak and @mjklemm for the discussion. In addition to what Michael said, something is worth pointing out:
This is not entirely true. For example, take the following snippet:
If you convert this loop to its OMP equivalent:
the allocation for the temp
which means no alloca sinking is taking place for such a case: flang knows from the get-go that the temp allocation should happen within the boundaries of the parallel region. But you raise a valid concern; also, I did not know about the alloca sinking during LLVM translation. What I will do now is try to understand how flang is smart enough to emit the alloca for the temp inside the parallel region. Maybe we can learn something. My first guess was that maybe flang takes into account isolated-from-above regions, but that is not the case of
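Since the original snippets did not survive on this page, the following is only a hypothetical Fortran sketch of the kind of comparison being made, assuming a compiler-managed temporary in the loop body; all names are invented:

```fortran
! Hypothetical sketch only; the snippets referenced in the comment
! above are not preserved here. Writing the OpenMP form of such a loop
! by hand: the right-hand side of the assignment is a compiler-managed
! temporary that is created and destroyed inside the parallel region,
! which is where flang places its allocation as well.
subroutine omp_equivalent(n)
  implicit none
  integer, intent(in) :: n
  type :: payload_t
    real, allocatable :: buf(:)
  end type payload_t
  type(payload_t), allocatable :: items(:)
  integer :: i

  allocate(items(n))
  !$omp parallel do
  do i = 1, n
    items(i) = payload_t([real :: i, 2*i])
  end do
  !$omp end parallel do
end subroutine omp_equivalent
```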
Perhaps I didn't explain myself very well initially. What I was trying to get at is that there are two ways in which we will find these kinds of issues:
Flang decides where to place temporary allocations during lowering based on the MLIR
My first guess was almost correct 😛. The FIR builder finds the proper alloca block when it creates a temporary, and
Yes, that's exactly the statement I was making :-)
That makes sense. I will change the logic in
Thank you for the changes, Kareem, it's looking good. I just have some smaller comments at this point.
LGTM, thanks! I just have minimal nits at this point, no need for another review by me after addressing them.
…vice (#146) Extends #112. This PR extends support for `do concurrent` mapping to the device a bit more. In particular, it handles localization of loop-local values on the device. Previously, this was only supported and tested on the host. See docs for `looputils::collectLoopLocalValues` for the definition of "loop-local" values.
Locally destroyed values are those for which the Fortran runtime calls `@_FortranADestroy` inside the loop's body. If such a value is allocated outside the loop and the loop is mapped to OpenMP, a runtime error occurs because multiple teams try to access the same allocation. In such cases, a privatized copy is created in the OpenMP region so that multiple teams of threads do not access and destroy the same memory block. This is one of the issues uncovered by LBL's inference engine.