Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merge from main with loop wrappers + composite support + liboffload #71

Merged
merged 1,204 commits into from
May 7, 2024

Conversation

skatrak
Copy link

@skatrak skatrak commented May 3, 2024

This merge brings upstream changes to the representation of OpenMP loop-related operations to the wrapper approach, as well as refactoring of the handling of combined and composite constructs during the PFT to MLIR lowering pass in Flang. The side effects of these changes that needed addressing by this merge are the following:

  • No other operations apart from a loop wrapper or loop nest and a terminator are allowed directly nested inside of a loop wrapper operation in MLIR. This impacts omp.wsloop, omp.simd, omp.distribute, omp.taskloop and omp.parallel (only if part of a composite construct in the case of omp.parallel). This means that Flang's early privatization had to be updated to create these allocations outside of all loop wrappers. The unfortunate result of this is that, at the end of the compilation flow, some unused copies of private variables end up being passed in as arguments to outlined functions. This is expected to be addressed by transitioning to the delayed privatization approach, which is still under development.
  • The omp.parallel operation can represent a regular block construct but it can also represent a leaf construct on a composite construct, taking a loop wrapper role. For its first use case, it needs to have the OutlineableOpenMPOpInterface MLIR trait, which is used by some Flang MLIR passes to search for a spot to insert allocations. However, when it's taking a loop wrapper role, allocations must be hoisted outside all wrappers. The solution proposed here is to update these passes to check for this situation, but it kind of defeats the purpose of the OutlineableOpenMPInterface. Long term, it might be preferable to create separate MLIR operations for block and wrapper versions of omp.parallel or dynamically attach which interfaces apply to each instance depending on its role, if possible.
  • The downstream implementation of PFT to MLIR lowering of loop constructs assumed that combined and composite constructs were the same, so it conflicted with recent upstream changes to process composite constructs separately. Resolving these conflicts basically meant implementing support for all composite constructs (except for TASKLOOP SIMD). Some refactoring of the places of creation and use of the DataSharingProcessor were necessary.
  • The DoConcurrentConversion pass needed to be slightly updated to create the loop wrapper and omp.loop_nest operations, as expected by the new loop representation approach.
  • There is some inconsistency as to how MLIR treats terminators of operations having the SingleBlockImplicitTerminator trait. Sometimes they would be skipped and other times they would not be, especially when multiple nested operations having it were nested one inside of the other. This made the LoopWrapperInterface::isWrapper method mistakenly return false whenever the terminator was skipped. The decision made to address this was to mark all wrappers with the SingleBlock trait instead, and to update tests to always explicitly specify the terminator operation. It's possible to update the LoopWrapperInterface::isWrapper method instead to consider the option of not finding a terminator, but I think it's best to have a single consistent representation rather than playing with "syntactic sugar" and complicating some functions as a result.
  • Updated several Flang unit tests as a result of some reordering of when privatization and clause processing happens, as well as in which order the clauses for each leaf of a composite construct are lowered to MLIR.
  • Due to the new representation of OpenMP loops in MLIR, the MLIR to LLVM IR translation pass had to be reworked. The processing of the loop itself (and 'COLLAPSE' information) was refactored and reused for 'SIMD' and 'DISTRIBUTE'. This was skipped for 'DO', due to being significantly more difficult to do without inadvertently breaking something else. 'DO SIMD' is still treated the same as 'DO' at this stage, though all of the information is now present in MLIR. The approach followed here can be improved quite a bit still, since the handling of omp.loop_nest is still replicated in convertOmpWsloop and convertLoopNestHelper, and composite constructs are currently handled in a rather unintuitive way.

After applying these changes, all check-mlir, check-flang, check-offload, check-llvm Lit tests are passing locally, and smoke-fort tests are passing as well. To be tested is whether there are any significant performance regressions as a result of passing unused variables to kernels, and checking whether kernel attributes (num teams, num threads, etc) are properly populated in all cases. I have noticed that now implicit map clauses are being introduced for variables that are also explicitly mapped, which doesn't seem to introduce any errors but we should also address.

Moxinilian and others added 30 commits April 22, 2024 22:03
I made a small typo when writing a test for MathExtras.h, sorry!
… tests (llvm#87094)

This also adds a few tests that were missing.
When responding to review comments, `return {}` was accidentally replaced
by `std::nullptr` instead of `return std::nullptr`.
…lvm#89263)

This implements a RISCV specific version of the SHL_ADD node proposed in
llvm#88791.

If that lands, the infrastructure from this patch should seamlessly
switch over the to generic DAG node. I'm posting this separately because
I've run out of useful multiply strength reduction work to do without
having a way to represent MUL X, 3/5/9 as a single instruction.

The majority of this change is moving two sets of patterns out of
tablgen and into the post-legalize combine. The major reason for this is
that I have an upcoming change which needs to reuse the expansion logic,
but it also helps common up some code between zba and the THeadBa
variants.

On the test changes, there's a couple major categories:
* We chose a different lowering for mul x, 25. The new lowering involves
one fewer register and the same critical path, so this seems like a win.
* The order of the two multiplies changes in (3,5,9)*(3,5,9) in some
cases. I don't believe this matters.
* I'm removing the one use restriction on the multiply. This restriction
doesn't really make sense to me, and the test changes appear positive.
This PR massively reorganizes the Test dialect's source files. It moves
manually-written op hooks into `TestOpDefs.cpp`, moves format custom
directive parsers and printers into `TestFormatUtils`, adds missing
comment blocks, and moves around where generated source files are
included for types, attributes, enums, etc. into their own source file.

This will hopefully help navigate the test dialect source code, but also
speeds up compile time of the test dialect by putting generated source
files into separate compilation units.

This also sets up the test dialect to shard its op definitions, done in
the next PR.
…lvm#88156)

A recent patch added an error message for whole optional dummy argument
usage as optional arguments (third or later) to MAX and MIN when those
names required type conversion, since that conversion only works when
the optional arguments are present. This check shouldn't care about
character lengths. Make it so.
…lvm#88184)

When a symbol is known to be a procedure due to its being referenced as
a function or subroutine, improve the error messages that appear if the
symbol is also used as an object by attaching the source location of its
procedural use. Also, for errors spotted in name resolution due to how a
given symbol has been used, don't unconditionally set the symbol's error
flag (which is otherwise generally a good idea, to prevent cascades of
errors), so that more unrelated errors related to usage will appear.
llvm#88188)

…er powers

The code that folds exponentiation by an integer power can report a
spurious overflow warning because it calculates one last unnecessary
square of the base value. 10.**(+/-32) exposes the problem -- the value
of 10.**64 is calculated but not needed. Rearrange the implementation to
only calculate squares that are necessary.

Fixes llvm#88151.
Addresses llvm#85984

Signed-off-by: Troy-Butler <squintik@outlook.com>
Co-authored-by: Troy-Butler <squintik@outlook.com>
…ilt (llvm#89164)

Currently, `check-gwp_asan` is added no matter its dependencies are
built or not, this is wrong and will cause cmake error when scudo is not
built. This patch includes the target in the dependencies check.
Constant folding had a CHECK on array subscript rank that should more
gracefully handle a bad program with a subscript that is a matrix or
higher rank.

Fixes llvm#88112.
When a statement function in a nested scope has a name that clashes with
a name that exists in the host scope, the compiler can handle it
correctly (with a portability warning)... unless the host scope acquired
the name via USE association. Fix.

Fixes llvm#88678.
This replaces the old macros LIBC_COPT_TEST_USE_FUCHSIA and
LIBC_COPT_TEST_USE_PIGWEED with LIBC_COPT_TEST_ZXTEST and
LIBC_COPT_TEST_GTEST, respectively.  These are really not about
whether the code is in the Fuchsia build or in the Pigweed build,
but just about what test framework is being used.  The gtest
framework can be used in many contexts, and the zxtest framework
is not always what's used in the Fuchsia build.

The test/UnitTest/Test.h wrapper header now provides the macro
LIBC_TEST_HAS_MATCHERS() for use in `#if` conditionals on use of
gmock-style matchers, replacing `#if` conditionals that test the
framework selection macros directly.
…vm#89429)

When the characteristics of a procedure depend on a procedure that
hasn't yet been defined, the compiler currently emits an unconditional
error message. This includes the case of a procedure whose
characteristics depend, perhaps indirectly, on itself. However, in the
case where the characteristics of a procedure are needed to resolve a
generic, we should not emit an error for a hitherto undefined procedure
-- either the call will resolve to another specific procedure, in which
case the error is spurious, or it won't, and then an error will issue
anyway.

Fixes llvm#88677.
Rewrite  `LLVM_PARALLEL_{}_JOBS` and `LLVM_RAM_PER_{}_JOB` documentation.
The standard defines C_LOC as being PURE (actually SIMPLE now in
F'2023); characterize it appropriately.

Fixes llvm#88747.
The intrinsic function OUT_OF_RANGE() lacks support in lowering and the
runtime. This patch obviates a need for any such support by implementing
OUT_OF_RANGE() via rewriting in semantics. This rewriting of
OUT_OF_RANGE() calls replaces the existing code that folds
OUT_OF_RANGE() calls with constant arguments.

Some changes and fixes were necessary outside of OUT_OF_RANGE()'s
folding code (now rewriting code), whose testing exposed some other
issues worth fixing.

- The common::RealDetails<> template class was recoded in terms of a new
base class with a constexpr constructor, so that the the characteristics
of the various REAL kinds could be queried dynamically as well. This
affected some client usage.
- There were bugs in the code that folds TRANSFER() when the type of X
or MOLD was REAL(10) -- this is a type that occupies 16 bytes per
element in execution memory but only 10 bytes (was 12) in the data of
std::vector<Scalar<>> in a Constant<>.
- Folds of REAL->REAL conversions weren't preserving infinities.
Trying to address the build failure on the `clang-ve-ninja`bot, which
appears hard to repro locally. The target isn't needed currently (there
are unit tests exercising the new functionality). Removing it for now
to green-ify the build bot.
It has been using isZeroValue(), which is for floats, not integers.
…Control.

Updates ExecutionSession to use the ExecutorProcessControl object's
TaskDispatcher rather than having a separate dispatch function. This gives the
TaskDispatcher a global view of all tasks to be executed, and provides a
single point to wait on for tasks to complete when shutting down the JIT.
A recent patch had three declared but unused variables in it, triggering
a warning in some build bots. Remove them.
…rProcessControl."

This reverts commit 6094b3b.

Multiple bots are broken.
This patch introduces HWASan memaccess intrinsics that assume a fixed
shadow (with the offset provided by --hwasan-mapping-offset=...), with
and without short granule support.

The behavior of HWASan is not meaningfully changed by this patch;
future work ("Optimize outlined memaccess for
fixed shadow on Aarch64": llvm#88544) will make HWASan use these intrinsics.

We currently only support lowering the LLVM IR intrinsic to AArch64.

The test case is adapted from hwasan-check-memaccess.ll.
…e tests to fail on multiple bots. (llvm#89689)

Update the check lines added in llvm#87247 after 14e6f63 updated the output
causing the tests to fail.

This should hopefully unbreak the bots failing due to these two tests
failing.
…ties (llvm#89119)

Made the createReadOrMaskedRead and isValidMaskedInputVector utility
functions - to be accessible outside of the CU. Needed by the IREE new
TopK implementation.
zmodem and others added 17 commits April 24, 2024 15:25
to reflect that there are three variants.
…m#89211)

This patch updates verifiers for `omp.ordered`, `omp.ordered.region`,
`omp.cancel` and `omp.cancellation_point`, which check for a parent
`omp.wsloop`.

After transitioning to a loop wrapper-based approach, the expected
direct parent will become `omp.loop_nest` instead, so verifiers need to
take this into account.

This PR on its own will not pass premerge tests. All patches in the
stack are needed before it can be compiled and passes tests.
This patch makes changes to the `scf.parallel` to `omp.parallel` +
`omp.wsloop` lowering pass in order to introduce a nested
`omp.loop_nest` as well, and to follow the new loop wrapper role for
`omp.wsloop`.

This PR on its own will not pass premerge tests. All patches in the
stack are needed before it can be compiled and passes tests.
…9214)

This patch introduces minimal changes to the MLIR to LLVM IR translation
of `omp.wsloop` to support the loop wrapper approach.

There is `omp.loop_nest` related translation code that should be
extracted and shared among all loop operations (e.g. `omp.simd`). This
would possibly also help in the addition of support for compound
constructs later on. This first approach is only intended to keep things
running after the transition to loop wrappers and not to add support for
other use cases enabled by that transition.

This PR on its own will not pass premerge tests. All patches in the
stack are needed before it can be compiled and passes tests.
This patch updates lowering from PFT to MLIR of workshare loops to
follow the loop wrapper approach. Unit tests impacted by this change are
also updated.

As the last patch of the stack, this should compile and pass unit tests.
Semantics usually fold SHAPE into an array constructor, but sometimes it
cannot (like when the source is a function result that cannot be
duplicated in expression analysis). Add lowering handling for shape.
…lane.mask (llvm#89068)

When SVE is available we can lower calls to get.active.lane.mask using
the SVE whilelo instruction, however in practice since vXi1 types are
not legal for NEON we often end up expanding the predicate into a vector
of integers, e.g. v4i1 -> v4i32. This usually happens when we have to
keep the predicate live out of the block, for example when the predicate
is the incoming value to a PHI node in a tail-folded vector loop.
Currently in such cases the intrinsic call has a cost of 1, which is far
too low when considering the extra instructions required to expand the
predicate. This patch fixes that by basing the cost on the number of
lane moves required for expansion. This is required for a follow-on
patch that adds the cost of the intrinsic call to the vectorisation cost
model, so that we can teach the vectoriser to make better choices.
…nctionDefinition (llvm#89801)

In the lambda function within
clang::Sema::InstantiateFunctionDefinition, the return value of a
function that may return null is now checked before dereferencing to
avoid potential null pointer dereference issues which can lead to
crashes or undefined behavior in the program.
On gfx11 shaders run with PRIV=1, which causes `s_trap 2` to be treated
as a nop, which means it isn't a correct lowering for the trap
intrinsic. As a workaround, this commit instead lowers the trap
intrinsic to instructions that simulate the behavior of s_trap 2.

Fixes: SWDEV-438421
…th r2 and pre-R2 (llvm#89881)

About unsigned max/min, ANDi is available for all ISA revisions in
extend before slt insn.
So that we can reduce one instruction.
…ructions (llvm#89867)

Since the requirement is EEW=32, it's impossible that EGW=128
needs LMUL=8.
@skatrak skatrak merged commit e967097 into ROCm:amd-trunk-dev May 7, 2024
11 of 12 checks passed
@skatrak skatrak deleted the atd/wrapper-merge branch May 7, 2024 14:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.