Merge from main with loop wrappers + composite support + liboffload #71

skatrak · 2024-05-03T14:28:40Z

This merge brings upstream changes to the representation of OpenMP loop-related operations to the wrapper approach, as well as refactoring of the handling of combined and composite constructs during the PFT to MLIR lowering pass in Flang. The side effects of these changes that needed addressing by this merge are the following:

No other operations apart from a loop wrapper or loop nest and a terminator are allowed directly nested inside of a loop wrapper operation in MLIR. This impacts omp.wsloop, omp.simd, omp.distribute, omp.taskloop and omp.parallel (only if part of a composite construct in the case of omp.parallel). This means that Flang's early privatization had to be updated to create these allocations outside of all loop wrappers. The unfortunate result of this is that, at the end of the compilation flow, some unused copies of private variables end up being passed in as arguments to outlined functions. This is expected to be addressed by transitioning to the delayed privatization approach, which is still under development.
The omp.parallel operation can represent a regular block construct but it can also represent a leaf construct on a composite construct, taking a loop wrapper role. For its first use case, it needs to have the OutlineableOpenMPOpInterface MLIR trait, which is used by some Flang MLIR passes to search for a spot to insert allocations. However, when it's taking a loop wrapper role, allocations must be hoisted outside all wrappers. The solution proposed here is to update these passes to check for this situation, but it kind of defeats the purpose of the OutlineableOpenMPInterface. Long term, it might be preferable to create separate MLIR operations for block and wrapper versions of omp.parallel or dynamically attach which interfaces apply to each instance depending on its role, if possible.
The downstream implementation of PFT to MLIR lowering of loop constructs assumed that combined and composite constructs were the same, so it conflicted with recent upstream changes to process composite constructs separately. Resolving these conflicts basically meant implementing support for all composite constructs (except for TASKLOOP SIMD). Some refactoring of the places of creation and use of the DataSharingProcessor were necessary.
The DoConcurrentConversion pass needed to be slightly updated to create the loop wrapper and omp.loop_nest operations, as expected by the new loop representation approach.
There is some inconsistency as to how MLIR treats terminators of operations having the SingleBlockImplicitTerminator trait. Sometimes they would be skipped and other times they would not be, especially when multiple nested operations having it were nested one inside of the other. This made the LoopWrapperInterface::isWrapper method mistakenly return false whenever the terminator was skipped. The decision made to address this was to mark all wrappers with the SingleBlock trait instead, and to update tests to always explicitly specify the terminator operation. It's possible to update the LoopWrapperInterface::isWrapper method instead to consider the option of not finding a terminator, but I think it's best to have a single consistent representation rather than playing with "syntactic sugar" and complicating some functions as a result.
Updated several Flang unit tests as a result of some reordering of when privatization and clause processing happens, as well as in which order the clauses for each leaf of a composite construct are lowered to MLIR.
Due to the new representation of OpenMP loops in MLIR, the MLIR to LLVM IR translation pass had to be reworked. The processing of the loop itself (and 'COLLAPSE' information) was refactored and reused for 'SIMD' and 'DISTRIBUTE'. This was skipped for 'DO', due to being significantly more difficult to do without inadvertently breaking something else. 'DO SIMD' is still treated the same as 'DO' at this stage, though all of the information is now present in MLIR. The approach followed here can be improved quite a bit still, since the handling of omp.loop_nest is still replicated in convertOmpWsloop and convertLoopNestHelper, and composite constructs are currently handled in a rather unintuitive way.

After applying these changes, all check-mlir, check-flang, check-offload, check-llvm Lit tests are passing locally, and smoke-fort tests are passing as well. To be tested is whether there are any significant performance regressions as a result of passing unused variables to kernels, and checking whether kernel attributes (num teams, num threads, etc) are properly populated in all cases. I have noticed that now implicit map clauses are being introduced for variables that are also explicitly mapped, which doesn't seem to introduce any errors but we should also address.

I made a small typo when writing a test for MathExtras.h, sorry!

… tests (llvm#87094) This also adds a few tests that were missing.

…89625) This reverts commit 8b2ba6a. The uild errors (see below) were likely due to the same issue PR llvm#88074 fixed. Addressed by following that PR. https://lab.llvm.org/buildbot/#/builders/165/builds/52789 https://lab.llvm.org/buildbot/#/builders/91/builds/25273

When responding to review comments, `return {}` was accidentally replaced by `std::nullptr` instead of `return std::nullptr`.

…lvm#89263) This implements a RISCV specific version of the SHL_ADD node proposed in llvm#88791. If that lands, the infrastructure from this patch should seamlessly switch over the to generic DAG node. I'm posting this separately because I've run out of useful multiply strength reduction work to do without having a way to represent MUL X, 3/5/9 as a single instruction. The majority of this change is moving two sets of patterns out of tablgen and into the post-legalize combine. The major reason for this is that I have an upcoming change which needs to reuse the expansion logic, but it also helps common up some code between zba and the THeadBa variants. On the test changes, there's a couple major categories: * We chose a different lowering for mul x, 25. The new lowering involves one fewer register and the same critical path, so this seems like a win. * The order of the two multiplies changes in (3,5,9)*(3,5,9) in some cases. I don't believe this matters. * I'm removing the one use restriction on the multiply. This restriction doesn't really make sense to me, and the test changes appear positive.

This PR massively reorganizes the Test dialect's source files. It moves manually-written op hooks into `TestOpDefs.cpp`, moves format custom directive parsers and printers into `TestFormatUtils`, adds missing comment blocks, and moves around where generated source files are included for types, attributes, enums, etc. into their own source file. This will hopefully help navigate the test dialect source code, but also speeds up compile time of the test dialect by putting generated source files into separate compilation units. This also sets up the test dialect to shard its op definitions, done in the next PR.

Fixes -Werror build after 40137ff.

…lvm#88156) A recent patch added an error message for whole optional dummy argument usage as optional arguments (third or later) to MAX and MIN when those names required type conversion, since that conversion only works when the optional arguments are present. This check shouldn't care about character lengths. Make it so.

…lvm#88184) When a symbol is known to be a procedure due to its being referenced as a function or subroutine, improve the error messages that appear if the symbol is also used as an object by attaching the source location of its procedural use. Also, for errors spotted in name resolution due to how a given symbol has been used, don't unconditionally set the symbol's error flag (which is otherwise generally a good idea, to prevent cascades of errors), so that more unrelated errors related to usage will appear.

llvm#88188) …er powers The code that folds exponentiation by an integer power can report a spurious overflow warning because it calculates one last unnecessary square of the base value. 10.**(+/-32) exposes the problem -- the value of 10.**64 is calculated but not needed. Rearrange the implementation to only calculate squares that are necessary. Fixes llvm#88151.

Addresses llvm#85984 Signed-off-by: Troy-Butler <squintik@outlook.com> Co-authored-by: Troy-Butler <squintik@outlook.com>

…ilt (llvm#89164) Currently, `check-gwp_asan` is added no matter its dependencies are built or not, this is wrong and will cause cmake error when scudo is not built. This patch includes the target in the dependencies check.

Constant folding had a CHECK on array subscript rank that should more gracefully handle a bad program with a subscript that is a matrix or higher rank. Fixes llvm#88112.

When a statement function in a nested scope has a name that clashes with a name that exists in the host scope, the compiler can handle it correctly (with a portability warning)... unless the host scope acquired the name via USE association. Fix. Fixes llvm#88678.

This replaces the old macros LIBC_COPT_TEST_USE_FUCHSIA and LIBC_COPT_TEST_USE_PIGWEED with LIBC_COPT_TEST_ZXTEST and LIBC_COPT_TEST_GTEST, respectively. These are really not about whether the code is in the Fuchsia build or in the Pigweed build, but just about what test framework is being used. The gtest framework can be used in many contexts, and the zxtest framework is not always what's used in the Fuchsia build. The test/UnitTest/Test.h wrapper header now provides the macro LIBC_TEST_HAS_MATCHERS() for use in `#if` conditionals on use of gmock-style matchers, replacing `#if` conditionals that test the framework selection macros directly.

…vm#89429) When the characteristics of a procedure depend on a procedure that hasn't yet been defined, the compiler currently emits an unconditional error message. This includes the case of a procedure whose characteristics depend, perhaps indirectly, on itself. However, in the case where the characteristics of a procedure are needed to resolve a generic, we should not emit an error for a hitherto undefined procedure -- either the call will resolve to another specific procedure, in which case the error is spurious, or it won't, and then an error will issue anyway. Fixes llvm#88677.

Rewrite `LLVM_PARALLEL_{}_JOBS` and `LLVM_RAM_PER_{}_JOB` documentation.

The standard defines C_LOC as being PURE (actually SIMPLE now in F'2023); characterize it appropriately. Fixes llvm#88747.

The intrinsic function OUT_OF_RANGE() lacks support in lowering and the runtime. This patch obviates a need for any such support by implementing OUT_OF_RANGE() via rewriting in semantics. This rewriting of OUT_OF_RANGE() calls replaces the existing code that folds OUT_OF_RANGE() calls with constant arguments. Some changes and fixes were necessary outside of OUT_OF_RANGE()'s folding code (now rewriting code), whose testing exposed some other issues worth fixing. - The common::RealDetails<> template class was recoded in terms of a new base class with a constexpr constructor, so that the the characteristics of the various REAL kinds could be queried dynamically as well. This affected some client usage. - There were bugs in the code that folds TRANSFER() when the type of X or MOLD was REAL(10) -- this is a type that occupies 16 bytes per element in execution memory but only 10 bytes (was 12) in the data of std::vector<Scalar<>> in a Constant<>. - Folds of REAL->REAL conversions weren't preserving infinities.

Trying to address the build failure on the `clang-ve-ninja`bot, which appears hard to repro locally. The target isn't needed currently (there are unit tests exercising the new functionality). Removing it for now to green-ify the build bot.

It has been using isZeroValue(), which is for floats, not integers.

…Control. Updates ExecutionSession to use the ExecutorProcessControl object's TaskDispatcher rather than having a separate dispatch function. This gives the TaskDispatcher a global view of all tasks to be executed, and provides a single point to wait on for tasks to complete when shutting down the JIT.

A recent patch had three declared but unused variables in it, triggering a warning in some build bots. Remove them.

…rProcessControl." This reverts commit 6094b3b. Multiple bots are broken.

This patch introduces HWASan memaccess intrinsics that assume a fixed shadow (with the offset provided by --hwasan-mapping-offset=...), with and without short granule support. The behavior of HWASan is not meaningfully changed by this patch; future work ("Optimize outlined memaccess for fixed shadow on Aarch64": llvm#88544) will make HWASan use these intrinsics. We currently only support lowering the LLVM IR intrinsic to AArch64. The test case is adapted from hwasan-check-memaccess.ll.

…e tests to fail on multiple bots. (llvm#89689) Update the check lines added in llvm#87247 after 14e6f63 updated the output causing the tests to fail. This should hopefully unbreak the bots failing due to these two tests failing.

…ties (llvm#89119) Made the createReadOrMaskedRead and isValidMaskedInputVector utility functions - to be accessible outside of the CU. Needed by the IREE new TopK implementation.

to reflect that there are three variants.

…m#89211) This patch updates verifiers for `omp.ordered`, `omp.ordered.region`, `omp.cancel` and `omp.cancellation_point`, which check for a parent `omp.wsloop`. After transitioning to a loop wrapper-based approach, the expected direct parent will become `omp.loop_nest` instead, so verifiers need to take this into account. This PR on its own will not pass premerge tests. All patches in the stack are needed before it can be compiled and passes tests.

This patch makes changes to the `scf.parallel` to `omp.parallel` + `omp.wsloop` lowering pass in order to introduce a nested `omp.loop_nest` as well, and to follow the new loop wrapper role for `omp.wsloop`. This PR on its own will not pass premerge tests. All patches in the stack are needed before it can be compiled and passes tests.

…9214) This patch introduces minimal changes to the MLIR to LLVM IR translation of `omp.wsloop` to support the loop wrapper approach. There is `omp.loop_nest` related translation code that should be extracted and shared among all loop operations (e.g. `omp.simd`). This would possibly also help in the addition of support for compound constructs later on. This first approach is only intended to keep things running after the transition to loop wrappers and not to add support for other use cases enabled by that transition. This PR on its own will not pass premerge tests. All patches in the stack are needed before it can be compiled and passes tests.

This patch updates lowering from PFT to MLIR of workshare loops to follow the loop wrapper approach. Unit tests impacted by this change are also updated. As the last patch of the stack, this should compile and pass unit tests.

Semantics usually fold SHAPE into an array constructor, but sometimes it cannot (like when the source is a function result that cannot be duplicated in expression analysis). Add lowering handling for shape.

…lane.mask (llvm#89068) When SVE is available we can lower calls to get.active.lane.mask using the SVE whilelo instruction, however in practice since vXi1 types are not legal for NEON we often end up expanding the predicate into a vector of integers, e.g. v4i1 -> v4i32. This usually happens when we have to keep the predicate live out of the block, for example when the predicate is the incoming value to a PHI node in a tail-folded vector loop. Currently in such cases the intrinsic call has a cost of 1, which is far too low when considering the extra instructions required to expand the predicate. This patch fixes that by basing the cost on the number of lane moves required for expansion. This is required for a follow-on patch that adds the cost of the intrinsic call to the vectorisation cost model, so that we can teach the vectoriser to make better choices.

…nctionDefinition (llvm#89801) In the lambda function within clang::Sema::InstantiateFunctionDefinition, the return value of a function that may return null is now checked before dereferencing to avoid potential null pointer dereference issues which can lead to crashes or undefined behavior in the program.

On gfx11 shaders run with PRIV=1, which causes `s_trap 2` to be treated as a nop, which means it isn't a correct lowering for the trap intrinsic. As a workaround, this commit instead lowers the trap intrinsic to instructions that simulate the behavior of s_trap 2. Fixes: SWDEV-438421

…lvm#89485)

…th r2 and pre-R2 (llvm#89881) About unsigned max/min, ANDi is available for all ISA revisions in extend before slt insn. So that we can reduce one instruction.

…ructions (llvm#89867) Since the requirement is EEW=32, it's impossible that EGW=128 needs LMUL=8.

Moxinilian and others added 30 commits April 22, 2024 22:03

[nfc][llvm] Fix a typo in MathExtras.h testing (llvm#89653)

7c58546

I made a small typo when writing a test for MathExtras.h, sorry!

[libc++] Remove _LIBCPP_DISABLE_NODISCARD_EXTENSIONS and refactor the…

83bc7b5

… tests (llvm#87094) This also adds a few tests that were missing.

compiler-rt: fix few builtins build warnings. (llvm#88991)

8482dbd

[Frontend][OpenMP] Add missing "return" statement after 40137ff

b8ff08d

When responding to review comments, `return {}` was accidentally replaced by `std::nullptr` instead of `return std::nullptr`.

[Frontend][OpenMP] Add suggested brackets in array initialization

14e6f63

Fixes -Werror build after 40137ff.

[lldb][Core] Fix pointless if conditon (llvm#89650)

2987fca

Addresses llvm#85984 Signed-off-by: Troy-Butler <squintik@outlook.com> Co-authored-by: Troy-Butler <squintik@outlook.com>

[AMDGPU] simplify VOP3_Real definitions. NFC. (llvm#89656)

ce1b678

[flang] Fix crash on erroneous program (llvm#88192)

138524e

Constant folding had a CHECK on array subscript rank that should more gracefully handle a bad program with a subscript that is a matrix or higher rank. Fixes llvm#88112.

[docs] Rewrite cmake LLVM_RAM_PER_*_JOB description (llvm#88570)

2f77757

Rewrite `LLVM_PARALLEL_{}_JOBS` and `LLVM_RAM_PER_{}_JOB` documentation.

[flang] C_LOC is PURE (llvm#89437)

fde5e47

The standard defines C_LOC as being PURE (actually SIMPLE now in F'2023); characterize it appropriately. Fixes llvm#88747.

Temporarily remove clang_rt.ctx_profile target

579efe0

Trying to address the build failure on the `clang-ve-ninja`bot, which appears hard to repro locally. The target isn't needed currently (there are unit tests exercising the new functionality). Removing it for now to green-ify the build bot.

[GlobalISel] matchSDivByConst should use isNullValue() (llvm#89666)

5fef5e6

It has been using isZeroValue(), which is for floats, not integers.

[flang] Fix build warning (llvm#89686)

2e2ac6f

A recent patch had three declared but unused variables in it, triggering a warning in some build bots. Remove them.

Revert "[ORC] Unify task dispatch across ExecutionSession and Executo…

a28557a

…rProcessControl." This reverts commit 6094b3b. Multiple bots are broken.

[Bazel][libc] Fix build failure

28cea99

Make createReadOrMaskedRead and isValidMaskedInputVector vector utili…

30d4f6a

…ties (llvm#89119) Made the createReadOrMaskedRead and isValidMaskedInputVector utility functions - to be accessible outside of the CU. Needed by the IREE new TopK implementation.

zmodem and others added 17 commits April 24, 2024 15:25

[coro] Tweak comments about CoroAwaitSuspendInst

4c16b12

to reflect that there are three variants.

[flang] lower SHAPE intrinsic (llvm#89785)

3328ccf

Semantics usually fold SHAPE into an array constructor, but sometimes it cannot (like when the source is a function result that cannot be duplicated in expression analysis). Add lowering handling for shape.

[PGO] Test for inheritance relationships for instrprof intrinsics (l…

21ef187

…lvm#89485)

DAG: Fix widening of fptrunc_round vectors (llvm#89918)

50082d6

[Mips] Use ANDi in for zero-extend in subword atomic umax/umin for bo…

e1aa162

…th r2 and pre-R2 (llvm#89881) About unsigned max/min, ANDi is available for all ISA revisions in extend before slt insn. So that we can reduce one instruction.

[IR][test] Fix warning in MMRA tests

a682f52

[clang][RISCV] Remove LMUL=8 scalar input for some vector crypto inst…

418bdb4

…ructions (llvm#89867) Since the requirement is EEW=32, it's impossible that EGW=128 needs LMUL=8.

Merge branch 'main' into amd-trunk-dev

3c30af4

Post-merge fixes

b8211e9

Merge branch 'amd-trunk-dev' into atd/wrapper-merge

e967097

skatrak requested review from gregrodgers, ergawy, jsjodin, abidh, mjklemm, raghavendhra, agozillon, DominikAdamski, dpalermo, TIFitis and kparzysz May 3, 2024 14:28

skatrak merged commit e967097 into ROCm:amd-trunk-dev May 7, 2024
11 of 12 checks passed

skatrak deleted the atd/wrapper-merge branch May 7, 2024 14:48

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Merge from main with loop wrappers + composite support + liboffload #71

Merge from main with loop wrappers + composite support + liboffload #71

skatrak commented May 3, 2024 •

edited

Loading

Merge from main with loop wrappers + composite support + liboffload #71

Merge from main with loop wrappers + composite support + liboffload #71

Conversation

skatrak commented May 3, 2024 • edited Loading

skatrak commented May 3, 2024 •

edited

Loading