[flang][OpenMP] Privatize locally destroyed values in `do concurent` #112

ergawy · 2024-07-09T05:47:53Z

Locally destroyed values are those values for which the Fortran runtime calls @_FortranADestroy inside the loops body. If these values are allocated outside the loop, and the loop is mapped to OpenMP, then a runtime error would occur due to multiple teams trying to access the same allocation. In such cases, a local privatized value is created in the OpenMP region to prevent multiple teams of treads from accessing and destroying the same memory block which causes runtime issues.

This one of the issues uncovered by LBL's inference engine.

Locally destroyed values are those values for which the Fortran runtime calls `@_FortranADestroy` inside the loops body. If these values are allocated outside the loop, and the loop is mapped to OpenMP, then a runtime error would occur due to multiple teams trying to access the same allocation. In such cases, a local privatized value is created in the OpenMP region to prevent multiple teams of treads from accessing and destroying the same memory block which causes runtime issues.

flang/docs/DoConcurrentConversionToOpenMP.md

ergawy · 2024-07-10T12:47:25Z

Ping 🔔! Please take a look and let me know if you have any concerns or comments.

mjklemm

I think this looks good, but I'm not a real expert here.

skatrak

It looks to me that the issue here is more fundamental than just being about temporaries deleted in a loop. This is a race condition which will happen every time the same shared variable is read from and written to from within a do concurrent loop, and it's a result of running in parallel iterations of a loop which is only guaranteed to be correct if each iteration runs as an atomic unit, even if they can run in any order.

I think the proper solution should involve checking that a given loop can be parallelized before trying to do so. However, we're currently giving compiler users the responsibility of checking their do-concurrent loops for such potential race conditions and blindly transforming loops if they tell us to do so.

So, assuming the original loop is actually something that the user verified it can run in parallel (otherwise they can't expect a correct result given the current restrictions), the only way we can run into these race conditions is through our own transformations. My guess is that the problem here is that we're inserting allocas for temporaries during lowering, before the omp.parallel operation is created by this pass, which would otherwise be the target location for these.

I think that in this pass we should identify every alloca that is only used within the loop being transformed and then sink it into the omp.parallel operation once it's created (both for host and device), regardless of the presence of "destroy" function calls. Something similar was implemented already by getSinkableAllocas() in OpenMPToLLVMIRTranslation.cpp, so maybe that helps as a starting point (it's intended for a different case, so I don't think you can just copy-paste it and use it here).

flang/docs/DoConcurrentConversionToOpenMP.md

mjklemm · 2024-07-11T11:44:17Z

I think the proper solution should involve checking that a given loop can be parallelized before trying to do so. However, we're currently giving compiler users the responsibility of checking their do-concurrent loops for such potential race conditions and blindly transforming loops if they tell us to do so.

This is, frankly, a bit of a philosophical question. There are basically two camps (I'm camp A):

Camp A: The programmer has written DO CONCURRENT so they assert that the code is race-free and satisfies the requirements stated in ISO Fortran.
Camp B: The programmer has written DO CONCURRENT and the compiler has to prove that it cannot only execute the loop in any order, but also in parallel.

The current way we are dealing with this, that we will have different levels of parallelization: none (serial execution, maybe default), auto (Camp B behavior), and parallel (for CPU and GPU) (Camp A behavior).

Since we are focusing on getting thing to work as fast we can, we are ignoring auto for the moment, but have it on our radar screen to make sure that this mode will eventually also be supported.,

ergawy · 2024-07-11T11:55:45Z

Thanks @skatrak and @mjklemm for the discussion.

In addition to what Michael said. Somethig is worth pointing out:

This is a race condition which will happen every time the same shared variable is read from and written to from within a do concurrent loop, and it's a result of running in parallel iterations of a loop which is only guaranteed to be correct if each iteration runs as an atomic unit, even if they can run in any order.

This is not entirely true. For example take the following snippet:

    do concurrent (i=1:10)
        a(i) = test_struct(i)
    end do

where test_struct is a user-defined type. Here we are creating a temporary object each loop iteration and assigning it to a(i). In other words, there is no sharing or race condtions going on here.

If you convert this loop to its omp equivalent:

    !$omp parallel do
    do i = 1,10
        a(i) = test_struct(i)
    end do
    !$omp end parallel do

the allocation for the temp test_struct object will happen inside the omp.parallel region even on the hlfir level:

    omp.parallel {
      %22 = fir.alloca !fir.type<_QMstruct_modTtest_struct{x_:!fir.box<!fir.heap<i32>>}> {bindc_name = ".result", pinned}
      ...
    }

which means no alloca sinking is taking place for such case: flang knows from the getgo that the temp aollocation should happen within the boundaries of the parallel region.

With do concurrent we don't have this demarcation of serial/parallel regions, therefore the need to detect code patterns that create temps and sinking these temps inside the region.

But you raise a valid concern, also I did not know about the alloca sinking during LLVM translation.

What I will do now is to try to understand how flang is smart enough to emit the alloca for the temp inside the parallel region. Maybe we can learn something. My first guess was that maybe flang takes into account isolated-from-above regions but that is not the case of omp parallel do.

skatrak · 2024-07-11T12:15:54Z

Perhaps I didn't explain myself very well initially. What I tried to get to is that there are two ways in which we will find these kinds of issues:

If parallelizing the do concurrent loop introduces data races. With our current approach, this is a user error and not something we need to fix in the compiler, but if we want to eventually enable some level of automatic do-concurrent transformations, we'll have to implement these checks.
If we introduce these data races ourselves during lowering. I think this is the case that this PR tries to solve, but I suggest the solution needs to be more general because the problem is not restricted to deallocations.

Flang decides where to place temporary allocations during lowering based on the MLIR OutlineableOpenMPOpInterface (which omp.parallel has), and if a parent with that trait is not present it will use the parent function instead. Since we're introducing the omp.parallel operation after lowering, these allocations are located in the function but they should be moved inside of omp.parallel if they're only used inside of the loop. And I think doing that should take care of this issue and it should hopefully prevent any data races from being introduced by the compiler in this pass.

ergawy · 2024-07-11T12:16:55Z

My first guess was that maybe flang takes into account isolated-from-above regions but that is not the case of omp parallel do.

My first guess was almost correct 😛. FIR builder find the proper alloc block when it creates a temporary. and getAllocaBlock uses the mlir::omp::OutlineableOpenMPOpInterface interface to help with finding that block.

mjklemm · 2024-07-11T12:22:26Z

If parallelizing the do concurrent loop introduces data races. With our current approach, this is a user error and not something we need to fix in the compiler, but if we want to eventually enable some level of automatic do-concurrent transformations, we'll have to implement these checks.

Yes, that's exactly the statement I was making :-)

ergawy · 2024-07-11T12:22:30Z

Since we're introducing the omp.parallel operation after lowering, these allocations are located in the function but they should be moved inside of omp.parallel if they're only used inside of the loop.

That makes sense. I will change the logic in collectLocallyDestroyedValuesInLoop (and its name) to detect values allocated outside the loop and used only inside of it then.

ergawy · 2024-07-11T14:23:22Z

Thanks for the review @mjklemm @skatrak . I think I handled the outstanding issue now.

skatrak

Thank you for the changes Kareem, it's looking good. I just have some smaller comments at this point.

flang/docs/DoConcurrentConversionToOpenMP.md

flang/lib/Optimizer/Transforms/DoConcurrentConversion.cpp

skatrak

LGTM, thanks! I just have minimal nits at this point, no need for another review by me after addressing them.

flang/lib/Optimizer/Transforms/DoConcurrentConversion.cpp

…vice Extends ROCm#112. This PR extends support for `do concurrent` mapping to the device a bit more. In particular, it handles localization of loop-local values on the deive. Previously, this was only supported and tested on the host. See docs for `looputils::collectLoopLocalValues` for the definition of "loop-local" values.

…vice (#146) Extends #112. This PR extends support for `do concurrent` mapping to the device a bit more. In particular, it handles localization of loop-local values on the deive. Previously, this was only supported and tested on the host. See docs for `looputils::collectLoopLocalValues` for the definition of "loop-local" values.

ergawy force-pushed the locally_alloc_loop_destroyed_values branch from d37894a to 2485e85 Compare July 9, 2024 05:49

ergawy requested review from skatrak, jsjodin, mjklemm, raghavendhra, agozillon, kparzysz, TIFitis, DominikAdamski and pbhandar-amd July 9, 2024 05:51

ergawy force-pushed the locally_alloc_loop_destroyed_values branch from 2485e85 to 27c02ed Compare July 9, 2024 07:53

mjklemm reviewed Jul 9, 2024

View reviewed changes

flang/docs/DoConcurrentConversionToOpenMP.md Outdated Show resolved Hide resolved

flang/docs/DoConcurrentConversionToOpenMP.md Outdated Show resolved Hide resolved

Fix docs typos

3ee647d

mjklemm approved these changes Jul 10, 2024

View reviewed changes

skatrak reviewed Jul 11, 2024

View reviewed changes

flang/docs/DoConcurrentConversionToOpenMP.md Outdated Show resolved Hide resolved

handle review comments

5f4b6fc

skatrak reviewed Jul 11, 2024

View reviewed changes

ergawy added 2 commits July 11, 2024 22:24

review comments

efa5efa

fix typo

78f1325

skatrak approved these changes Jul 12, 2024

View reviewed changes

more review comments

3a779eb

ergawy merged commit 24980a6 into ROCm:amd-trunk-dev Jul 14, 2024
3 of 5 checks passed

ergawy mentioned this pull request Aug 19, 2024

[flang][OpenMP] Privatize "loop-local" values in do concurent on device #146

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[flang][OpenMP] Privatize locally destroyed values in `do concurent` #112

[flang][OpenMP] Privatize locally destroyed values in `do concurent` #112

ergawy commented Jul 9, 2024 •

edited

Loading

ergawy commented Jul 10, 2024

mjklemm left a comment

skatrak left a comment

mjklemm commented Jul 11, 2024

ergawy commented Jul 11, 2024 •

edited

Loading

skatrak commented Jul 11, 2024

ergawy commented Jul 11, 2024

mjklemm commented Jul 11, 2024

ergawy commented Jul 11, 2024

ergawy commented Jul 11, 2024

skatrak left a comment

skatrak left a comment

[flang][OpenMP] Privatize locally destroyed values in do concurent #112

[flang][OpenMP] Privatize locally destroyed values in do concurent #112

Conversation

ergawy commented Jul 9, 2024 • edited Loading

ergawy commented Jul 10, 2024

mjklemm left a comment

Choose a reason for hiding this comment

skatrak left a comment

Choose a reason for hiding this comment

mjklemm commented Jul 11, 2024

ergawy commented Jul 11, 2024 • edited Loading

skatrak commented Jul 11, 2024

ergawy commented Jul 11, 2024

mjklemm commented Jul 11, 2024

ergawy commented Jul 11, 2024

ergawy commented Jul 11, 2024

skatrak left a comment

Choose a reason for hiding this comment

skatrak left a comment

Choose a reason for hiding this comment

[flang][OpenMP] Privatize locally destroyed values in `do concurent` #112

[flang][OpenMP] Privatize locally destroyed values in `do concurent` #112

ergawy commented Jul 9, 2024 •

edited

Loading

ergawy commented Jul 11, 2024 •

edited

Loading