[SYCL][Graph] Node Profiling #353

…77231) This patch tries to simplify `X | Y` by replacing occurrences of `Y` in `X` with 0. Similarly, it tries to simplify `X & Y` by replacing occurrences of `Y` in `X` with -1.

Alive2: https://alive2.llvm.org/ce/z/cNjDTR

Note: As the current implementation is too conservative in the one-use checks, I cannot remove other existing hard-coded simplifications if they involve more than two instructions (e.g., `A & ~(A ^ B) --> A & B`).

Compile-time impact: http://llvm-compile-time-tracker.com/compare.php?from=a085402ef54379758e6c996dbaedfcb92ad222b5&to=9d655c6685865ffce0ad336fed81228f3071bd03&stat=instructions%3Au

|stage1-O3|stage1-ReleaseThinLTO|stage1-ReleaseLTO-g|stage1-O0-g|stage2-O3|stage2-O0-g|stage2-clang|
|--|--|--|--|--|--|--|
|+0.01%|-0.00%|+0.00%|-0.02%|+0.01%|+0.02%|-0.01%|

Fixes #76554.
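
The replacement rule can be checked by brute force on a small example. The following is a minimal sketch, not the InstCombine code: `exprX` is a hypothetical `X` built only from bitwise operations (here `Y ^ Z`), and the identity is only claimed for such expressions.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical example expression: X = (Y ^ Z), built only from bitwise ops.
constexpr std::uint8_t exprX(std::uint8_t Y, std::uint8_t Z) {
  return static_cast<std::uint8_t>(Y ^ Z);
}

// Returns true if, for every pair of 8-bit inputs,
//   X | Y == X[Y := 0]  | Y   and
//   X & Y == X[Y := -1] & Y   (-1 is all ones, 0xFF for 8 bits).
inline bool replacementHoldsForAllInputs() {
  for (unsigned YVal = 0; YVal < 256; ++YVal) {
    for (unsigned ZVal = 0; ZVal < 256; ++ZVal) {
      const auto Y = static_cast<std::uint8_t>(YVal);
      const auto Z = static_cast<std::uint8_t>(ZVal);
      if (static_cast<std::uint8_t>(exprX(Y, Z) | Y) !=
          static_cast<std::uint8_t>(exprX(0, Z) | Y))
        return false;
      if (static_cast<std::uint8_t>(exprX(Y, Z) & Y) !=
          static_cast<std::uint8_t>(exprX(0xFF, Z) & Y))
        return false;
    }
  }
  return true;
}
```

For this `X`, the `or` case reduces to `Z | Y` and the `and` case to `~Z & Y`.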

Some time ago, I did a similar patch for local variables. Initializing global variables can fail as well: ```c++ constexpr int a = 1/0; static_assert(a == 0); ``` ... would succeed in the new interpreter, because we never saved the fact that `a` has not been successfully initialized.

The three commits from "[RFC] compiler-rt builtins cleanup and refactoring" rewrote lots of code in compiler-rt builtins. - 082b89b: [builtins] Reformat builtins with clang-format - 0ba22f5: [builtins] Use single line C++/C99 comment style - 84da0e1: [builtins] Use aliases for function redirects

After this change, all current compiler-rt:* labels on GitHub are covered.

* Print `ReturnLoc`, `ReturnVal`, and `ThisPointeeLoc` if applicable. * For entries in `LocToVal` that correspond to declarations, print the names of the declarations next to them. I've removed the FIXME because all relevant fields are now being dumped. I'm not sure we actually need the capability for the caller to specify which fields to dump, so I've simply deleted this part of the comment. Some examples of the output: ![image](https://github.com/llvm/llvm-project/assets/29098113/17d0978f-b86d-4555-8a61-d1f2021f8d59) ![image](https://github.com/llvm/llvm-project/assets/29098113/021dbb24-5fe2-4720-8a08-f48dcf4b88f8)

Make the wmma intrinsic type signatures canonical. We need a type signature as long as the type is not fixed. However, when an argument's type matches a previous argument's type, we do not need the signature for this argument. This patch fixes three general cases:

1. add missing signatures
2. remove signatures for matching arguments
3. reorder the signatures -- return type signature should always appear first

…0108) This re-applies 30155fc with a fix for clangd.

### Description

clang doesn't currently evaluate the object argument of `static operator()` and `static operator[]`, for example:

```cpp
#include <iostream>

struct Foo {
  static int operator()(int x, int y) {
    std::cout << "Foo::operator()" << std::endl;
    return x + y;
  }
  static int operator[](int x, int y) {
    std::cout << "Foo::operator[]" << std::endl;
    return x + y;
  }
};

Foo getFoo() {
  std::cout << "getFoo()" << std::endl;
  return {};
}

int main() {
  std::cout << getFoo()(1, 2) << std::endl;
  std::cout << getFoo()[1, 2] << std::endl;
}
```

`getFoo()` is expected to be called, but clang doesn't currently call it (17.0.6). This PR fixes this issue.

Fixes #67976, reland #68485.

### Walkthrough

- **clang/lib/Sema/SemaOverload.cpp**
  - **`Sema::CreateOverloadedArraySubscriptExpr` & `Sema::BuildCallToObjectOfClassType`**
    Previously clang generated `CallExpr` for static operators, ignoring the object argument. In this PR `CXXOperatorCallExpr` is generated for static operators instead, with the object argument as the first argument.
  - **`TryObjectArgumentInitialization`**
    `const` / `volatile` objects are allowed for static methods, so that we can call static operators on them.
- **clang/lib/CodeGen/CGExpr.cpp**
  - **`CodeGenFunction::EmitCall`**
    CodeGen changes for `CXXOperatorCallExpr` with static operators: emit and ignore the object argument first, then emit the operator call.
- **clang/lib/AST/ExprConstant.cpp**
  - **`ExprEvaluatorBase::handleCallExpr`**
    Evaluation of static operators in constexpr also needs some small changes to work, so that the arguments won't be out of position.
- **clang/lib/Sema/SemaChecking.cpp**
  - **`Sema::CheckFunctionCall`**
    The argument-checking code also needs to be modified, or it will fail the test `clang/test/SemaCXX/overloaded-operator-decl.cpp`.
- **clang-tools-extra/clangd/InlayHints.cpp**
  - **`InlayHintVisitor::VisitCallExpr`**
    Now that the `CXXOperatorCallExpr` for static operators also has an object argument, we should also take care of this situation in clangd.

### Tests

- **Added:**
  - **clang/test/AST/ast-dump-static-operators.cpp**
    Verify the AST generated for static operators.
  - **clang/test/SemaCXX/cxx2b-static-operator.cpp**
    Static operators should be able to be called on const / volatile objects.
- **Modified:**
  - **clang/test/CodeGenCXX/cxx2b-static-call-operator.cpp**
  - **clang/test/CodeGenCXX/cxx2b-static-subscript-operator.cpp**
    Matching the new CodeGen.

### Documentation

- **clang/docs/ReleaseNotes.rst**
  Update release notes.

---------

Co-authored-by: Shafik Yaghmour <shafik@users.noreply.github.com>
Co-authored-by: cor3ntin <corentinjabot@gmail.com>
Co-authored-by: Aaron Ballman <aaron@aaronballman.com>

…79999) Here's an example of the output: ![image](https://github.com/llvm/llvm-project/assets/29098113/63cd509e-c2a7-4794-b758-ea73812ff09f)

This patch replaces the template trick with a constexpr function that is more readable. Once C++20 is available in our code base, we can remove the constexpr function in favor of std::bit_ceil.
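
For illustration only, a constexpr function of this kind might look like the sketch below. The name `powerOfTwoCeil` and the 64-bit width are assumptions, not taken from the patch; the behaviour for small inputs mirrors what `std::bit_ceil` provides in C++20.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not the name used in the patch): smallest power of two
// that is greater than or equal to Value.
constexpr std::uint64_t powerOfTwoCeil(std::uint64_t Value) {
  constexpr std::uint64_t MaxPow2 = std::uint64_t{1} << 63;
  if (Value > MaxPow2)
    return 0; // Not representable in 64 bits; this sketch returns 0 here.
  std::uint64_t Result = 1;
  while (Result < Value)
    Result <<= 1;
  return Result;
}

// The function is usable in constant expressions, like the template trick.
static_assert(powerOfTwoCeil(0) == 1, "zero rounds up to one");
static_assert(powerOfTwoCeil(5) == 8, "five rounds up to eight");
static_assert(powerOfTwoCeil(8) == 8, "powers of two are unchanged");
```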

Enable, test, and document the support for fusing rounded range kernels. This mostly worked already – we just have to query the original kernel's global size, and use that to compute the private memory size used for internalization. --------- Signed-off-by: Julian Oppermann <julian.oppermann@codeplay.com>

This patch folds: ``` ((bitcast X to int) <s 0 ? -X : X) -> fabs(X) ((bitcast X to int) >s -1 ? X : -X) -> fabs(X) ((bitcast X to int) <s 0 ? X : -X) -> -fabs(X) ((bitcast X to int) >s -1 ? -X : X) -> -fabs(X) ``` Alive2: https://alive2.llvm.org/ce/z/rGepow
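
The first and third patterns can be sanity-checked outside the compiler. This is a minimal sketch that assumes a 32-bit IEEE `float`; the helper names are hypothetical and results are compared bit-for-bit so that `+0.0` and `-0.0` are told apart.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Reinterpret the bits of a float as a signed 32-bit integer (a "bitcast").
inline std::int32_t bitsOf(float X) {
  static_assert(sizeof(float) == sizeof(std::int32_t), "expects 32-bit float");
  std::int32_t Bits;
  std::memcpy(&Bits, &X, sizeof(Bits));
  return Bits;
}

// Source pattern: (bitcast X to int) <s 0 ? -X : X
inline float selectOnSignBit(float X) { return bitsOf(X) < 0 ? -X : X; }

// Source pattern: (bitcast X to int) <s 0 ? X : -X
inline float selectOnSignBitNegated(float X) { return bitsOf(X) < 0 ? X : -X; }

// True if both select patterns agree bit-for-bit with fabs(X) and -fabs(X).
inline bool matchesFabs(float X) {
  return bitsOf(selectOnSignBit(X)) == bitsOf(std::fabs(X)) &&
         bitsOf(selectOnSignBitNegated(X)) == bitsOf(-std::fabs(X));
}
```

The `>s -1` variants are the same selects with the condition inverted and the arms swapped.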

Updates the commit tag for the OCK.

…ts (intel#12529) This makes the two tests run when either a CPU or a GPU is present, rather than requiring both to be present.

See llvm/llvm-project#79261 for details. It shows that clang-repl uses a different target triple from clang, so it may be problematic if clang-repl reads a BMI that clang generated for a different target triple. While the underlying issue is not easy to fix, this patch tries to make this test green so that it does not bother developers.

…l#12524) Show detailed error messages when users try to fuse kernels with incompatible ND-ranges, showing different errors for each different scenario. Also combine the validation and fusion logic to reduce the number of ND-ranges list traversals. --------- Signed-off-by: Victor Perez <victor.perez@codeplay.com>

…80079) Similar to #78403, but for scalable `vwadd(u).wv`, given that #76785 is recommited. ### Code ``` define <vscale x 8 x i64> @vwadd_wv_mask_v8i32(<vscale x 8 x i32> %x, <vscale x 8 x i64> %y) { %mask = icmp slt <vscale x 8 x i32> %x, shufflevector (<vscale x 8 x i32> insertelement (<vscale x 8 x i32> poison, i32 42, i64 0), <vscale x 8 x i32> poison, <vscale x 8 x i32> zeroinitializer) %a = select <vscale x 8 x i1> %mask, <vscale x 8 x i32> %x, <vscale x 8 x i32> zeroinitializer %sa = sext <vscale x 8 x i32> %a to <vscale x 8 x i64> %ret = add <vscale x 8 x i64> %sa, %y ret <vscale x 8 x i64> %ret } ``` ### Before this patch [Compiler Explorer](https://godbolt.org/z/xsoa5xPrd) ``` vwadd_wv_mask_v8i32: li a0, 42 vsetvli a1, zero, e32, m4, ta, ma vmslt.vx v0, v8, a0 vmv.v.i v12, 0 vmerge.vvm v24, v12, v8, v0 vwadd.wv v8, v16, v24 ret ``` ### After this patch ``` vwadd_wv_mask_v8i32: li a0, 42 vsetvli a1, zero, e32, m4, ta, ma vmslt.vx v0, v8, a0 vsetvli zero, zero, e32, m4, tu, mu vwadd.wv v16, v16, v8, v0.t vmv8r.v v8, v16 ret ```

The `memref.subview` verifier currently checks result shape, element type, memory space and offset of the result type. However, the strides of the result type are currently not verified. This commit adds verification of result strides for non-rank reducing ops and fixes invalid IR in test cases. Verification of result strides for ops with rank reductions is more complex (and there could be multiple possible result types). That is left for a separate commit. Also refactor the implementation a bit: * If `computeMemRefRankReductionMask` could not compute the dropped dimensions, there must be something wrong with the op. Return `FailureOr` instead of `std::optional`. * `isRankReducedMemRefType` did much more than just checking whether the op has rank reductions or not. Inline the implementation into the verifier and add better comments. * `produceSubViewErrorMsg` does not have to be templatized.

…ocOrder (#80015) Previously we called ignoreCSRForAllocationOrder on every alias of every CSR which was expensive on targets like AMDGPU which define a very large number of overlapping register tuples. On such targets it is simpler and faster to call ignoreCSRForAllocationOrder once for every physical register. Differential Revision: https://reviews.llvm.org/D146735

Reverts llvm/llvm-project#79865 I think there is a bug in the stride computation in `SubViewOp::inferResultType`. (Was already there before this change.) Reverting this commit for now and updating the original pull request with a fix and more test cases.

…12527) The split-dwarf feature can help reduce compile time and build footprint. See examples from: https://www.productive-cpp.com/improving-cpp-builds-with-split-dwarf/

Locally measured size reduction using a debug build shows around 20% reduction for a statically linked build. Footprint reduction:

- after compile.py: 48G -> 37G (23%)
- after check-all: 170G -> 140G (18%)

Debuggability should not be affected. This should also help with compile time, especially for incremental builds. -gsplit-dwarf is not yet supported on Windows, so it is not turned on there for now.

…ID (intel#12526) **Problem:** Currently, the image ID of an RTDeviceBinaryImage instance is simply the pointer value of the underlying pi_device_binary (in [getImageID()](https://github.com/intel/llvm/blob/sycl/sycl/source/detail/device_binary_image.hpp#L221)). However, consider the following scenario:

1) We create a device image
2) Put it into the cache
3) Destroy the image (when it goes out of scope)
4) Create another image that _happens to be created at the same memory address_ (thus having the same image ID)

This causes two instances of RTDeviceBinaryImage to share the same image ID, which ends up causing a collision in the KernelProgramCache.

**Solution (Proposed in this PR)** Have a counter in RTDeviceBinaryImage that increments upon instantiation of this class. The counter value is added to the image ID to ensure that no two instances have the same ID.

**Alternative Solutions**

1. Remove the entry from the KernelProgramCache when the image is destroyed. This solution would require more work as the KernelProgramCache currently [does not support arbitrary element-wise eviction](https://github.com/intel/llvm/blob/sycl/sycl/doc/design/KernelProgramCache.md#in-memory-cache-eviction) (eviction follows an LRU strategy when the cache size exceeds the threshold). Moreover, I expect this to have the additional performance overhead of locking the cache and evicting. The proposed solution is much simpler.
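
The address-reuse scenario and the counter idea can be sketched in isolation. `BinaryImage` below is a hypothetical stand-in for RTDeviceBinaryImage, and as a simplification the ID is the counter value alone rather than a pointer value combined with a counter.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <new>

// Hypothetical stand-in for RTDeviceBinaryImage; only the ID logic is modelled.
class BinaryImage {
public:
  BinaryImage() : Id(NextId.fetch_add(1, std::memory_order_relaxed)) {}

  // The ID comes from the instantiation counter, not from an address, so it
  // stays unique even when a later object reuses the same memory.
  std::uint64_t getImageID() const { return Id; }

private:
  static std::atomic<std::uint64_t> NextId;
  std::uint64_t Id;
};

std::atomic<std::uint64_t> BinaryImage::NextId{0};

// Constructs two images one after the other in the same storage and reports
// whether their IDs differ although their addresses are identical.
inline bool idsDifferAtSameAddress() {
  alignas(BinaryImage) unsigned char Storage[sizeof(BinaryImage)];
  auto *First = new (Storage) BinaryImage();
  const std::uint64_t FirstId = First->getImageID();
  First->~BinaryImage();
  auto *Second = new (Storage) BinaryImage();
  const std::uint64_t SecondId = Second->getImageID();
  const bool SameAddress =
      static_cast<void *>(Second) == static_cast<void *>(Storage);
  Second->~BinaryImage();
  return SameAddress && FirstId != SecondId;
}
```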

Add (de)serialization support for them, like we do for Floating values.

…(#72097) Make transform.structured.tile_using_forall be able to take param type tile sizes. Examples: ``` %tile_sizes = transform.param.constant 16 : i64 -> !transform.param<i64> transform.structured.tile_using_forall %matmul tile_sizes [%tile_sizes : !transform.param<i64>, 32] ( mapping = [#gpu.block<x>, #gpu.block<y>] ) : (!transform.any_op) -> (!transform.any_op, !transform.any_op) ``` ``` %c10 = transform.param.constant 10 : i64 -> !transform.any_param %c20 = transform.param.constant 20 : i64 -> !transform.any_param %tile_sizes = transform.merge_handles %c10, %c20 : !transform.any_param transform.structured.tile_using_forall %matmul tile_sizes *(%tile_sizes : !transform.any_param) ( mapping = [#gpu.block<x>, #gpu.block<y>] ) : (!transform.any_op) -> (!transform.any_op, !transform.any_op) ```

… smstart/smstop. (#78294) This patch introduces a 'COALESCER_BARRIER' which is a pseudo node that expands to a 'nop', but which stops the register allocator from coalescing a COPY node when its use/def crosses a SMSTART or SMSTOP instruction.

For example:

    %0:fpr64 = COPY killed $d0
    undef %2.dsub:zpr = COPY %0       // <- Do not coalesce this COPY
    ADJCALLSTACKDOWN 0, 0
    MSRpstatesvcrImm1 1, 0, csr_aarch64_smstartstop, implicit-def dead $d0
    $d0 = COPY killed %0
    BL @use_f64, csr_aarch64_aapcs

If the COPY would be coalesced, that would lead to:

    $d0 = COPY killed %0

being replaced by:

    $d0 = COPY killed %2.dsub

which means the whole ZPR reg would be live up to the call, causing the MSRpstatesvcrImm1 (smstop) to spill/reload the ZPR register:

    str q0, [sp]   // 16-byte Folded Spill
    smstop sm
    ldr z0, [sp]   // 16-byte Folded Reload
    bl use_f64

which would be incorrect for two reasons:

1. The program may load more data than it has allocated.
2. If there are other SVE objects on the stack, the compiler might use the 'mul vl' addressing modes to access the spill location.

By disabling the coalescing, we get the desired results:

    str d0, [sp, #8]  // 8-byte Folded Spill
    smstop sm
    ldr d0, [sp, #8]  // 8-byte Folded Reload
    bl use_f64

… (#80005) This patch implements the zabha (Byte and Halfword Atomic Memory Operations) v1.0-rc1 extension. See also https://github.com/riscv/riscv-zabha/blob/v1.0-rc1/zabha.adoc.

… (#79626) As far as I am aware, there is no simple way to match on elementwise ops. I propose to add an `elementwise` criterion to the `match.structured.body` op. My only hesitation is that elementwise is determined not only by the body, but also by the indexing maps. So if others find this too awkward, I can implement a separate match op instead.

This patch introduces support for 2-way widening outer products. This enables the fusion of 2 'arm_sme.outerproduct' operations that are chained via the accumulator into a 2-way widening outer product operation. Changes: - Add 'llvm.aarch64.sme.[us]mop[as].za32' intrinsics for 2-way variants. These map to instruction variants added in SME2 and use different intrinsics. Intrinsics are already implemented for widening variants from SME1. - Adds the following operations: - fmopa_2way, fmops_2way - smopa_2way, smops_2way - umopa_2way, umops_2way - Implements conversions for the above ops to intrinsics in ArmSMEToLLVM. - Adds a pass 'arm-sme-outer-product-fusion' that fuses 'arm_sme.outerproduct' operations. For a detailed description of these operations see the 'arm_sme.fmopa_2way' description. The reason for introducing many operations rather than one is the signed/unsigned variants can't be distinguished with types (e.g., ui16, si16) since 'arith.extui' and 'arith.extsi' only support signless integers. A single operation would require this information and an attribute (for example) for the sign doesn't feel right if floating-point types are also supported where this wouldn't apply. Furthermore, the SME FP8 extensions (FEAT_SME_F8F16, FEAT_SME_F8F32) introduce FMOPA 2-way (FP8 to FP16) and 4-way (FP8 to FP32) variants but no subtract variant. Whilst these are not supported in this patch, it felt simpler to have separate ops for add/subtract given this.

…tors (#79979) vector.shuffle is not supported for scalable vectors (outside of splats)

The `memref.subview` verifier currently checks result shape, element type, memory space and offset of the result type. However, the strides of the result type are currently not verified. This commit adds verification of result strides for non-rank reducing ops and fixes invalid IR in test cases. Verification of result strides for ops with rank reductions is more complex (and there could be multiple possible result types). That is left for a separate commit. Also refactor the implementation a bit: * If `computeMemRefRankReductionMask` could not compute the dropped dimensions, there must be something wrong with the op. Return `FailureOr` instead of `std::optional`. * `isRankReducedMemRefType` did much more than just checking whether the op has rank reductions or not. Inline the implementation into the verifier and add better comments. * `produceSubViewErrorMsg` does not have to be templatized. * Fix comment and add additional assert to `ExpandStridedMetadata.cpp`, to make sure that the memref.subview verifier is in sync with the memref.subview -> memref.reinterpret_cast lowering. Note: This change is identical to #79865, but with a fixed comment and an additional assert in `ExpandStridedMetadata.cpp`. (I reverted #79865 in #80116, but the implementation was actually correct, just the comment in `ExpandStridedMetadata.cpp` was confusing.)
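
As a rough model of the new check, the expected result strides of a non-rank-reducing subview are the source strides scaled by the subview strides, dimension by dimension. The sketch below covers static strides only; the function names are hypothetical and are not the ones used in the verifier.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Expected result strides of a non-rank-reducing subview: each source stride
// is multiplied by the subview stride of the same dimension.
inline std::vector<std::int64_t>
expectedResultStrides(const std::vector<std::int64_t> &SourceStrides,
                      const std::vector<std::int64_t> &SubviewStrides) {
  assert(SourceStrides.size() == SubviewStrides.size() && "rank mismatch");
  std::vector<std::int64_t> Result(SourceStrides.size());
  for (std::size_t I = 0; I < SourceStrides.size(); ++I)
    Result[I] = SourceStrides[I] * SubviewStrides[I];
  return Result;
}

// Hypothetical verifier step: the strides declared in the result type must
// equal the expected ones.
inline bool stridesMatch(const std::vector<std::int64_t> &DeclaredStrides,
                         const std::vector<std::int64_t> &SourceStrides,
                         const std::vector<std::int64_t> &SubviewStrides) {
  return DeclaredStrides == expectedResultStrides(SourceStrides, SubviewStrides);
}
```

For a source with strides `[64, 1]` and subview strides `[2, 4]`, a result type declaring strides `[128, 4]` passes, while one that keeps `[64, 1]` is rejected.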

Due to the way the inliner works, the launched function may become very large and go above the inline threshold. This results in a short kernel that only calls one function. The patch adds an always_inline on the call site to force the user function to be inlined into the SYCL kernel to reduce overhead. Signed-off-by: Victor Lomuller <victor@codeplay.com>

CONFLICT (content): Merge conflict in clang/lib/Basic/Targets/NVPTX.cpp CONFLICT (content): Merge conflict in clang/test/Driver/cuda-cross-compiling.c

…PR from a new contributor (#78292) This change adds a comment to the first PR from a new contributor that is merged, which tells them what to expect post merge from the build bots. How they will be notified, where to ask questions, that you're more likely to be reverted than in other projects, etc. The information overlaps with, and links to https://llvm.org/docs/MyFirstTypoFix.html#myfirsttypofix-issues-after-landing-your-pr. So that users who simply read the email are still aware, and know where to follow up if they do get reports. To do this, I have added a hidden HTML comment to the new contributor greeting comment. This workflow will look for that to tell if the author of the PR was a new contributor at the time they opened the merge. It has to be done this way because as soon as the PR is merged, they are by GitHub's definition no longer a new contributor and I suspect that their author association will be "contributor" instead. I cannot 100% confirm that without a whole lot of effort and probably breaking GitHub's terms of service, but it's fairly cheap to work around anyway. It seems rare / almost impossible to reopen a PR in llvm at least, but in case it does happen the buildbot info comment has its own hidden HTML comment. If we find this we will not post another copy of the same information.

If the demanded bits of an instruction are full, we don't have to recurse to its users, but we may still have to clear flags on the instruction itself. Fixes llvm/llvm-project#80113.

…Before` (#79579) This commit adds a new method to the rewriter API: `moveBlockBefore`. This method is used by `inlineRegionBefore` and covered by dialect conversion test cases. Also fixes a bug in `moveOpBefore`, where the previous op location was not passed correctly. Adds a test case to `test-strict-pattern-driver.mlir`.

…SRForAllocOrder (#80015)" This reverts commit f852503. It was supposed to speed things up but llvm-compile-time-tracker.com showed a slight slow down.

…KnownFPClass` (#76360) This patch merges the logic of `cannotBeOrderedLessThanZeroImpl` into `computeKnownFPClass` to improve the signbit inference. --------- Co-authored-by: Matt Arsenault <arsenm2@gmail.com>

PAL uses ELF REL (not RELA) relocations which can only store a 32-bit addend in the instruction, even for reloc types like R_AMDGPU_ABS32_HI which require the upper 32 bits of a 64-bit address calculation to be correct. This means that it is not safe to fold an arbitrary offset into a GlobalAddressSDNode, so stop doing that. In practice this is mostly a problem for small negative offsets which do not work as expected because PAL treats the 32-bit addend as unsigned.
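
The failure mode for small negative offsets can be modelled with plain integer arithmetic. This is an illustrative sketch, not PAL or linker code: the function names are hypothetical, and the model assumes the addend is truncated to 32 bits and then zero-extended.

```cpp
#include <cassert>
#include <cstdint>

// Upper 32 bits of (Symbol + Offset) computed with the full 64-bit offset.
inline std::uint32_t hiWithFullAddend(std::uint64_t Symbol, std::int64_t Offset) {
  return static_cast<std::uint32_t>(
      (Symbol + static_cast<std::uint64_t>(Offset)) >> 32);
}

// Same computation when only a 32-bit addend survives and is zero-extended,
// which models the unsigned treatment described above.
inline std::uint32_t hiWithRelAddend(std::uint64_t Symbol, std::int64_t Offset) {
  const std::uint32_t Stored = static_cast<std::uint32_t>(Offset); // truncated
  return static_cast<std::uint32_t>((Symbol + std::uint64_t{Stored}) >> 32);
}
```

With a symbol at `0x100000000` and an offset of `-4`, the correct upper half is `0`, but the truncated addend yields `1`; a small positive offset gives the same answer either way.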

Added by f2a78e6. Wouldn't normally bother but it's showing up in some CI checks, just want to reduce the noise.

…#12109) Add the ability to specify unique addressing modes per dimension to the bindless_image_sampler Corresponding CUDA adapter UR PR: oneapi-src/unified-runtime#1168 --------- Co-authored-by: Kenneth Benzie (Benie) <k.benzie83@gmail.com>

…#80022) Make it so that when the top-level (root) operation itself is being modified, it is also used as the root for debug output in PatternApplicator. Fix #80021

The `verbatim` operation produces no results and the value is emitted as is followed by a line break ('\n' character) during translation. Note: Use with caution. This operation can have arbitrary effects on the semantics of the emitted code. Use semantically more meaningful operations whenever possible. Additionally this op is *NOT* intended to be used to inject large snippets of code. This operation can be used in situations where a more suitable operation is not yet implemented in the dialect or where preprocessor directives interfere with the structure of the code. Co-authored-by: Marius Brehler <marius.brehler@iml.fraunhofer.de>

@foo

…nterprets a value of 'kernel_arg_type' (#78730) The goal of this PR is to tolerate differences between description of formal arguments by function metadata (represented by "kernel_arg_type") and LLVM actual parameter types. A compiler may use "kernel_arg_type" of function metadata fields to encode detailed type information, whereas LLVM IR may utilize for an actual parameter a more general type, in particular, opaque pointer type. This PR proposes to resolve this by a fallback to LLVM actual parameter types during the lowering of formal function arguments in cases when the type can't be created by string content of "kernel_arg_type", i.e., when "kernel_arg_type" contains a type unknown for the SPIR-V Backend. An example of the issue manifestation is https://github.com/KhronosGroup/SPIRV-LLVM-Translator/blob/main/test/transcoding/KernelArgTypeInOpString.ll, where a compiler generates for the following kernel function detailed `kernel_arg_type` info in a form of `!{!"image_kernel_data*", !"myInt", !"struct struct_name*"}`, and in LLVM IR same arguments are referred to as `@foo(ptr addrspace(1) %in, i32 %out, ptr addrspace(1) %outData)`. Both definitions are correct, and the resulting LLVM IR is correct, but lowering stage of SPIR-V Backend fails to generate SPIR-V type. ``` typedef int myInt; typedef struct { int width; int height; } image_kernel_data; struct struct_name { int i; int y; }; void kernel foo(__global image_kernel_data* in, __global struct struct_name *outData, myInt out) {} ``` ``` define spir_kernel void @foo(ptr addrspace(1) %in, i32 %out, ptr addrspace(1) %outData) ... !kernel_arg_type !7 ... { entry: ret void } ... !7 = !{!"image_kernel_data*", !"myInt", !"struct struct_name*"} ``` The PR changes a contract of `SPIRVType *getArgSPIRVType(...)` in a way that it may return `nullptr` to signal that the metadata string content is not recognized, so corresponding comments are added and a couple of checks for `nullptr` are inserted where appropriate.

…add gnux32 triple tests

…noise We try to only use X32 for gnux32 triple tests.

We try to only use X32 for gnux32 triple tests.

…CHECK prefix) We try to only use X32 for gnux32 triple tests.

…984) Patch ensures that host runtime functions are not called for handling OpenMP teams clause on the device. GPU code for pragma `omp target teams distribute parallel do` will require only one call to OpenMP loop-worksharing GPU runtime. Support for it will be added later. This patch does not include changes required for handling `omp target teams` for the host side.

The comment was incorrect: !range also applies to calls, and we do need to drop it in some cases.

…79855) If `-allow-incomplete-ir` is enabled, automatically insert declarations for missing globals. If a global is only used in calls with the same function type, insert a function declaration with that type. Otherwise, insert a dummy i8 global. The fallback case could be extended with various heuristics (e.g. we could look at load/store types), but I've chosen to keep it simple for now, because I'm unsure to what degree this would really be useful without more experience. I expect that in most cases the declaration type doesn't really matter (note that the type of an external global specifies a *minimum* size only, not a precise size). This is a followup to llvm/llvm-project#78421.
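
The selection rule can be written down as a small decision function. This is a sketch over a made-up data model (`GlobalUse` and textual type strings are hypothetical), not the parser implementation.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical model of one use of an undeclared global.
struct GlobalUse {
  bool IsCallee;            // true if the global is the callee of a call
  std::string FunctionType; // textual function type of that call, if any
};

// Returns the function type to declare when every use is a call with the same
// function type; otherwise returns "i8" for the dummy global fallback.
inline std::string pickDeclarationType(const std::vector<GlobalUse> &Uses) {
  if (Uses.empty())
    return "i8";
  for (const GlobalUse &Use : Uses)
    if (!Use.IsCallee || Use.FunctionType != Uses.front().FunctionType)
      return "i8";
  return Uses.front().FunctionType;
}
```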

This adds support for `#pragma omp atomic compare weak`. It has Parser & AST support for now. --------- Authored-by: Sunil Kuravinakop <kuravina@pe28vega.us.cray.com>

Calling a `__arm_locally_streaming` function from a function that is not a streaming-SVE function would lead to incorrect inlining. The issue didn't surface because the tests were not testing what they were supposed to test.

Miscompilation arises due to instruction combining of cast pairs of the type `bitcast bfloat to half` + `<FPOp> bfloat to half` or `bitcast half to bfloat` + `<FPOp> half to bfloat`. For example `bitcast bfloat to half`+`fpext half to double` or `bitcast half to bfloat`+`fpext bfloat to double` respectively reduce to `fpext bfloat to double` and `fpext half to double`. This is an incorrect conversion as it assumes the representations of `bfloat` and `half` are equivalent because they have the same width. As a consequence, miscompilation arises. Fixes #61984
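
To see why the two 16-bit formats are not interchangeable, the same bit pattern can be decoded both ways. The decoders below are a simplified sketch written for this note (normal numbers only, no zeros, subnormals, infinities or NaNs).

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Decode IEEE half: 1 sign bit, 5 exponent bits (bias 15), 10 mantissa bits.
inline double decodeHalf(std::uint16_t Bits) {
  const int Sign = Bits >> 15;
  const int Exp = (Bits >> 10) & 0x1F;
  const int Mant = Bits & 0x3FF;
  assert(Exp != 0 && Exp != 0x1F && "sketch handles normal values only");
  const double Value = std::ldexp(1.0 + Mant / 1024.0, Exp - 15);
  return Sign ? -Value : Value;
}

// Decode bfloat: 1 sign bit, 8 exponent bits (bias 127), 7 mantissa bits.
inline double decodeBFloat(std::uint16_t Bits) {
  const int Sign = Bits >> 15;
  const int Exp = (Bits >> 7) & 0xFF;
  const int Mant = Bits & 0x7F;
  assert(Exp != 0 && Exp != 0xFF && "sketch handles normal values only");
  const double Value = std::ldexp(1.0 + Mant / 128.0, Exp - 127);
  return Sign ? -Value : Value;
}
```

The pattern `0x3F80` is `1.0` as a `bfloat` but `1.875` as a `half`, so extending the bits under the wrong type changes the value.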

This is already supported in llvm-cvtres, so only a small change is needed.

This adds a new pass (`-arm-sme-vector-legalization`) which legalizes vector operations so that they can be lowered to ArmSME. This initial patch adds decomposition for `vector.outerproduct`, `vector.transfer_read`, and `vector.transfer_write` when they operate on vector types larger than a single SME tile. For example, a [8]x[8]xf32 outer product would be decomposed into four [4]x[4]xf32 outer products, which could then be lowered to ArmSME. These three ops have been picked as supporting them alone allows lowering matmuls that use all ZA accumulators to ArmSME. For it to be possible to legalize a vector type it has to be a multiple of an SME tile size, but other than that any shape can be used. E.g. `vector<[8]x[8]xf32>`, `vector<[4]x[16]xf32>`, `vector<[16]x[4]xf32>` can all be lowered to four `vector<[4]x[4]xf32>` operations. In future, this pass will be extended with more SME-specific rewrites to legalize unrolling the reduction dimension of matmuls (which is not type-decomposition), which is why the pass has quite a general name.
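
The shape arithmetic behind the decomposition can be sketched as follows. This only models the base (unscaled) sizes of the scalable dimensions; `TileOffset` and `decomposeIntoTiles` are hypothetical names, not part of the pass.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Top-left corner of one tile inside the larger vector shape.
struct TileOffset {
  std::int64_t Row;
  std::int64_t Col;
};

// Splits a Rows x Cols shape into TileRows x TileCols tiles, row-major.
// The shape must be a multiple of the tile size in both dimensions.
inline std::vector<TileOffset> decomposeIntoTiles(std::int64_t Rows,
                                                  std::int64_t Cols,
                                                  std::int64_t TileRows,
                                                  std::int64_t TileCols) {
  assert(TileRows > 0 && TileCols > 0 && "tile size must be positive");
  assert(Rows % TileRows == 0 && Cols % TileCols == 0 &&
         "shape must be a multiple of the tile size");
  std::vector<TileOffset> Tiles;
  for (std::int64_t Row = 0; Row < Rows; Row += TileRows)
    for (std::int64_t Col = 0; Col < Cols; Col += TileCols)
      Tiles.push_back({Row, Col});
  return Tiles;
}
```

With a 4x4 tile, the 8x8, 4x16 and 16x4 shapes mentioned above each decompose into four tiles.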

…d of repeated getMaskElt calls. Use a simpler for-range loop to append all shuffle mask elements

… X86 and expose address math We try to only use X32 for gnux32 triple tests. Use no_x86_scrub_mem_shuffle so the test shows updated shuffle intermediate and the +4 offset into the constant pool vector entry

We try to only use X32 for gnux32 triple tests.

Silence unused variable warning which tripped post-commit checks for intel#12492. Signed-off-by: Julian Oppermann <julian.oppermann@codeplay.com>

…, concat(b,d)) (#79464) We can convert concat(v4i16 uhadd(a,b), v4i16 uhadd(c,d)) to v8i16 uhadd(concat(a,c), concat(b,d)), which can lead to further simplifications.
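
Because `uhadd` works lane by lane, concatenating before or after the operation gives the same lanes. A small scalar model, with `std::array` standing in for the vector types and hypothetical helper names:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using V4 = std::array<std::uint16_t, 4>;
using V8 = std::array<std::uint16_t, 8>;

// Lane-wise unsigned halving add: (a + b) >> 1, widened to avoid overflow.
template <std::size_t N>
std::array<std::uint16_t, N> uhadd(const std::array<std::uint16_t, N> &A,
                                   const std::array<std::uint16_t, N> &B) {
  std::array<std::uint16_t, N> Result{};
  for (std::size_t I = 0; I < N; ++I)
    Result[I] =
        static_cast<std::uint16_t>((std::uint32_t{A[I]} + B[I]) >> 1);
  return Result;
}

// Concatenate two 4-lane vectors into one 8-lane vector (Lo lanes first).
inline V8 concat(const V4 &Lo, const V4 &Hi) {
  V8 Result{};
  for (std::size_t I = 0; I < 4; ++I) {
    Result[I] = Lo[I];
    Result[I + 4] = Hi[I];
  }
  return Result;
}
```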

Broadcast of a single float should not be any slower than loading 32B using vmovaps. So rematerializing it can help reduce register spills when register pressure is high.

__builtin_amdgcn_mfma* and __builtin_amdgcn_smfmac*

…T instructions (#79954) Two variants: promoted legacy, NF (no flags update). The syntax of NF instructions is aligned with GNU binutils. https://sourceware.org/pipermail/binutils/2023-September/129545.html

Fixes #78038.

Update createScalarIVSteps to take an insert point as parameter. This ensures that the inserted scalar steps are in the same order as the recipes they replace (vs in reverse order as currently). This helps to reduce the diff for follow-up changes.

When a block is inlined into another block, the nested operations are moved into another block and the `notifyOperationInserted` callback should be triggered. This commit adds the missing notifications for: * `RewriterBase::inlineBlockBefore` * `RewriterBase::mergeBlocks`

The uses of the attribute were removed in code review of #79584, but its definition was inadvertently kept.

When a block is split with `RewriterBase::splitBlock`, a `notifyBlockInserted` notification, followed by `notifyOperationInserted` notifications (for moving over the operations into the new block) should be sent. This commit adds those notifications.

…287) This ensures the odd/even pseudo instructions are allocated to the same register range. This fixes #71763

…n-rules. (#79821) The options `-fcx-limited-range` and `-fcx-fortran-rules` were added in _https://github.com/llvm/llvm-project/pull/70244_ The code adding the options introduced an erroneous warning. `$ clang -c -fcx-limited-range t1.c` `clang: warning: overriding '' option with '-fcx-limited-range' [-Woverriding-option]` and `$ clang -c -fcx-fortran-rules t1.c` `clang: warning: overriding '' option with '-fcx-fortran-rules' [-Woverriding-option]` The warning doesn't make sense. This patch removes it.

JumpThreading may perform AA queries while the dominator tree is not up to date, which may result in miscompilations. Fix this by adding a new AAQI option to disable the use of the dominator tree in BasicAA. Fixes llvm/llvm-project#79175.

This PR moves the lowering of the math dialect later in the pipeline, because the math dialect is lowered correctly by `createConvertGpuOpsToNVVMOps` for the GPU target, and that conversion needs to run first.

This restores (probably temporarily) the array-referring NTTP behavior to what it was prior to #78041 because ~~I'm fed up~~ I have no time to fix regressions.

Extra space causes the checks generated by update_mir_test_checks to be unavailable. ``` # NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py UTC_ARGS: --version 4 # RUN: llc -mtriple=x86_64-- -o - %s -run-pass=none -verify-machineinstrs -simplify-mir | FileCheck %s --- name: foo body: | ; CHECK-LABEL: name: foo ; CHECK: bb.0: ; CHECK-NEXT: successors: ; CHECK-NEXT: {{ $}} ; CHECK-NEXT: {{ $}} ; CHECK-NEXT: bb.1: ; CHECK-NEXT: RET 0, $eax bb.0: successors: bb.1: RET 0, $eax ... ``` The failure log is as follows: ``` llvm/test/CodeGen/MIR/X86/unreachable-block-print.mir:9:16: error: CHECK-NEXT: is on the same line as previous match ; CHECK-NEXT: {{ $}} ^ <stdin>:21:13: note: 'next' match was here successors: ^ <stdin>:21:13: note: previous match ended here successors: ```

This reverts commit 4effff2. It makes `complex.abs(-1)` return `-1`.

…ntel#12557) Upstream now canonicalizes constant GEPs to represent byte offsets, i.e. using `i8` as source element type. This PR adapts the internalization pass to this change by also remapping GEPs with a constant offset, if that offset is a multiple of the internalized accessor's element size. Signed-off-by: Julian Oppermann <julian.oppermann@codeplay.com>

Finish plugging in ASYNCHRONOUS IO in lowering (GetAsynchronousId was not used yet). Add a runtime implementation for GetAsynchronousId (only the signature was defined). Always return zero since the flang runtime "fakes" asynchronous IO (data transfers are always complete, see flang/docs/IORuntimeInternals.md). Update all runtime integer arguments and results for IDs to use the AsynchronousId int alias for consistency. In lowering, the asynchronous attribute is added on the hlfir.declare of ASYNCHRONOUS variables, but nothing else is done. This is OK given the synchronous aspects of flang IO, but it would be safer to treat these variables as volatile (prevent code motion of related stores/loads) since the asynchronous data change can also be done by C-defined user procedures (see 18.10.4 Asynchronous communication). Flang lowering does not give enough info for LLVM to do such code motion anyway (the variables that are passed in a call are not given the noescape attribute, so LLVM will assume any later opaque call may modify the related data and would not move loads/stores of such variables before/after calls even if it could from a pure Fortran point of view without ASYNCHRONOUS).

…… (#80144) …lt align crash (#78400)" This reverts commit 7b33899. A regression was discovered here: llvm/llvm-project#78400 and the author requested a revert to give time to review.

* Split out `MeshDialect.h`, which defines the dialect class, from `MeshOps.h`. This reduces include clutter if you care only about the dialect and not the ops.
* Expose functions `getMesh` and `collectiveProcessGroupSize`. These functions are useful for outside users of the dialect.
* Remove unused code.
* Remove examples and tests of the mesh.shard attribute in tensor encoding, per the decision that Spmdization would be performed on sharding annotations and there will be no tensors with sharding specified in the type. For more info see this RFC comment: https://discourse.llvm.org/t/rfc-sharding-framework-design-for-device-mesh/73533/81

Currently, the `PPCMergeStringPool` pass merges the global variables after the `AsmPrinter` initializer has added them to its symbol list. This patch moves the merging work of `PPCMergeStringPool` to its initializer, just like what GlobalMerge does, to avoid adding merged global variables to the `AsmPrinter` symbol list.

…is_msvc_triple) (#80071) Adding quotes around the `${target_triple}` Fix: #78530

The ability to dump AST nodes is important to ad-hoc debugging, and the fact this doesn't work with TypeLoc nodes is an obvious missing feature in e.g. clang-query (`set output dump` simply does nothing). Having TypeLoc::dump(), and enabling DynTypedNode::dump() for such nodes seems like a clear win. It looks like this:

```
int main(int argc, char **argv);

FunctionProtoTypeLoc <test.cc:3:1, col:31> 'int (int, char **)' cdecl
|-ParmVarDecl 0x30071a8 <col:10, col:14> col:14 argc 'int'
| `-BuiltinTypeLoc <col:10> 'int'
|-ParmVarDecl 0x3007250 <col:20, col:27> col:27 argv 'char **'
| `-PointerTypeLoc <col:20, col:26> 'char **'
|   `-PointerTypeLoc <col:20, col:25> 'char *'
|     `-BuiltinTypeLoc <col:20> 'char'
`-BuiltinTypeLoc <col:1> 'int'
```

It dumps the lexically nested tree of type locs. This often looks similar to how types are dumped, but unlike types we don't look at desugaring e.g. typedefs, as their underlying types are not lexically spelled here.

---

Less clear is exactly when to include these nodes in existing text AST dumps rooted at (TranslationUnit)Decls. These already omit supported nodes sometimes, e.g. NestedNameSpecifiers are often mentioned but not recursively dumped. TypeLocs are a more extreme case: they're ~always more verbose than the current AST dump. So this patch punts on that: TypeLocs are only ever printed recursively as part of a TypeLoc::dump() call.

It would also be nice to be able to invoke `clang` to dump a typeloc somehow, like `clang -cc1 -ast-dump`. But I don't know exactly what the best version of that is, so this patch doesn't do it.

---

There are similar (less critical!) nodes: TemplateArgumentLoc etc, these also don't have dump() functions today and are obvious extensions. I suspect that we should add these, and Loc nodes should dump each other (e.g. the ElaboratedTypeLoc `vector<int>::iterator` should dump the NestedNameSpecifierLoc `vector<int>::`, which dumps the TemplateSpecializationTypeLoc `vector<int>::` etc).
Maybe this generalizes further to a "full syntactic dump" mode, where even Decls and Stmts would print the TypeLocs they lexically contain. But this may be more complex than useful.

---

While here, ConceptReference JSON dumping must be implemented. It's not totally clear to me why this implementation wasn't required before but is now...

This is a follow up of 75d820d, adding more opcodes to the combine target hook enabling more LDP creation. Patch co-authored by Cameron McInally.

This amends 8d1b1c9 which added the functionality the release note refers to.

Adds or add-like-or's of 1 can both be turned into csinc, which can help fold more instructions into a csinc.

Just handle this like two primitive casts.

classifyComplexElementType used to return a std::optional, seems like this was left in a PR and not re-tested. This broke build bots, e.g. https://lab.llvm.org/buildbot/#/builders/68/builds/67930

llvm/llvm-project#78171 added support for non-consecutive local value numbers. This extends the support for global value numbers (for globals and functions). This means that it is now possible to delete an unnamed global definition/declaration without breaking the IR. This is a lot less common than unnamed local values, but it seems like something we should support for consistency. (Unnamed globals are used a lot in Rust though.)

…` (#79608) When calling `Environment::getResultObjectLocation` with a CXXOperatorCallExpr that is a prvalue, we just hit an assert because no record was ever created. --------- Co-authored-by: martinboehme <mboehme@google.com>

…ter argument. (#80072) This PR adds support for NULL intrinsic to have a procedure pointer argument.

…nfo. NFC.

… (#78556)" This reverts commit 74bf0b1. The test always fails. | mlir/test/Dialect/GPU/test-nvvm-pipeline.mlir:23:16: error: CHECK-PTX: expected string not found in input | // CHECK-PTX: __nv_expf https://lab.llvm.org/buildbot/#/builders/61/builds/53789

…cat(a,c), concat(b,d))" (#80157) Reverts llvm/llvm-project#79464 while figuring out why the tests are failing.

This allows it to work with disjoint or's as well as computing the known bits.

Fix linking error: "ld: error: can't create dynamic relocation R_X86_64_64 against local symbol in readonly segment; recompile object files with -fPIC or pass '-Wl,-z,notext' to allow text relocations in the output"

…t (#80080) Since I've formatted the epsilon value, I don't think it's necessary to escape it.

…roughout formatters (#80133) This avoids duplicating the logic to get the first element of a libc++ `__compressed_pair`. This will be useful in supporting upcoming changes to the layout of `__compressed_pair`. Drive-by changes: * Renamed `m_item` to `size_node` for readability; `m_item` suggests it's a member variable, which it is not.

This patch is the next piece of work in my Large Watchpoint proposal, https://discourse.llvm.org/t/rfc-large-watchpoint-support-in-lldb/72116 This patch breaks a user's watchpoint into one or more WatchpointResources which reflect what the hardware registers can cover. This means we can watch objects larger than 8 bytes, and we can watch unaligned address ranges. On a typical 64-bit target with 4 watchpoint registers you can watch 32 bytes of memory if the start address is doubleword aligned. Additionally, if the remote stub implements AArch64 MASK style watchpoints (e.g. debugserver on Darwin), we can watch any power-of-2 size region of memory up to 2GB, aligned to that same size. I updated the Watchpoint constructor and CommandObjectWatchpoint to create a CompilerType of Array<UInt8> when the size of the watched region is greater than pointer-size and we don't have a variable type to use. For pointer-size and smaller, we can display the watched granule as an integer value; for larger-than-pointer-size we will display as an array of bytes. I have `watchpoint list` now print the WatchpointResources used to implement the watchpoint. I added a WatchpointAlgorithm class which has a top-level static method that takes an enum flag mask WatchpointHardwareFeature and a user address and size, and returns a vector of WatchpointResources covering the request. It does not take into account the number of watchpoint registers the target has, or the number still available for use. Right now there is only one algorithm, which monitors power-of-2 regions of memory. For up to pointer-size, this is what Intel hardware supports. AArch64 Byte Address Select watchpoints can watch any number of contiguous bytes in a pointer-size memory granule, that is not currently supported so if you ask to watch bytes 3-5, the algorithm will watch the entire doubleword (8 bytes). The newly default "modify" style means we will silently ignore modifications to bytes outside the watched range.
I've temporarily skipped TestLargeWatchpoint.py for all targets. It was only run on Darwin when using the in-tree debugserver, which was a proxy for "debugserver supports MASK watchpoints". I'll be adding the aforementioned feature flag from the stub and enabling full mask watchpoints when a debugserver with that feature is enabled, and re-enable this test. I added a new TestUnalignedLargeWatchpoint.py which only has one test but it's a great one, watching a 22-byte range that is unaligned and requires four 8-byte watchpoints to cover. I also added a unit test, WatchpointAlgorithmsTests, which has a number of simple tests against WatchpointAlgorithms::PowerOf2Watchpoints. I think there are interesting alternative approaches to how we cover these; I note in the unit test that a user requesting a watch on address 0x12e0 of 120 bytes will be covered by two watchpoints today, 128-byte watchpoints at 0x1280 and at 0x1300. But it could be done with a 16-byte watchpoint at 0x12e0 and a 128-byte at 0x1300, which would have fewer false positives/private stops. As we try refining this one, it's helpful to have a collection of tests to make sure things don't regress. I tested this on arm64 macOS, (genuine) x86_64 macOS, and AArch64 Ubuntu. I have not modified the Windows process plugins yet, I might try that as a standalone patch, I'd be making the change blind, but the necessary changes (see ProcessGDBRemote::EnableWatchpoint) are pretty small so it might be obvious enough that I can change it and see what the Windows CI thinks. There isn't yet a packet (or a qSupported feature query) for the gdb remote serial protocol stub to communicate its watchpoint capabilities to lldb. I'll be doing that in a patch right after this is landed, having debugserver advertise its capability of AArch64 MASK watchpoints, and have ProcessGDBRemote add eWatchpointHardwareArmMASK to WatchpointAlgorithms so we can watch larger than 32-byte requests on Darwin.
I haven't yet tackled WatchpointResource *sharing* by multiple Watchpoints. This is all part of the goal, especially when we may be watching a larger memory range than the user requested, if they then add another watchpoint next to their first request, it may be covered by the same WatchpointResource (hardware watchpoint register). Also one "read" watchpoint and one "write" watchpoint on the same memory granule need to be handled, making the WatchpointResource cover all requests. As WatchpointResources aren't shared among multiple Watchpoints yet, there's no handling of running the conditions/commands/etc on multiple Watchpoints when their shared WatchpointResource is hit. The goal beyond "large watchpoint" is to unify (much more) the Watchpoint and Breakpoint behavior and commands. I have a feeling I may be slowly chipping away at this for a while. rdar://108234227
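
The power-of-2 covering described above can be illustrated with a small sketch. This is not the code in `WatchpointAlgorithms`; the function names, the `(start, length)` tuple result, and the "try one region, then one region twice as large, then tile" order are assumptions made for illustration, chosen so that the sketch reproduces the three cases the message mentions (bytes 3-5 of a doubleword, an unaligned 22-byte range, and 120 bytes at 0x12e0).

```python
def round_up_to_power_of_2(n: int) -> int:
    """Smallest power of two that is >= n (for n >= 1)."""
    power = 1
    while power < n:
        power *= 2
    return power


def power_of_2_regions(addr: int, size: int, min_bytes: int, max_bytes: int):
    """Cover [addr, addr + size) with aligned power-of-2 regions.

    Returns a list of (start, length) tuples. min_bytes and max_bytes are
    assumed to be powers of two describing what one hardware watchpoint
    register can cover (hypothetical parameters for this sketch).
    """
    if size <= 0:
        return []
    end = addr + size
    region = min(max(round_up_to_power_of_2(size), min_bytes), max_bytes)

    # First try a single aligned region, then a single region twice as large.
    for candidate in (region, region * 2):
        if candidate > max_bytes:
            break
        start = addr & ~(candidate - 1)  # align down to the region size
        if start + candidate >= end:
            return [(start, candidate)]

    # Otherwise tile the range with consecutive aligned regions.
    regions = []
    start = addr & ~(region - 1)
    while start < end:
        regions.append((start, region))
        start += region
    return regions
```

With `max_bytes=8`, an unaligned 22-byte request such as `power_of_2_regions(0x1003, 22, 1, 8)` needs four 8-byte regions; with a 2GB limit, 120 bytes at 0x12e0 yields the two 128-byte regions at 0x1280 and 0x1300 that the message describes.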

Summary: A previous patch removed creating these entries in clang in favor of the backend emitting a callable kernel and having the runtime call that if present. The support for the old style was kept around in LLVM 18.0 but now that we have forked to 19.0 we should remove the support. The effect is that, for an application that still has the old constructors and links against a newer libomptarget, those constructors will no longer be called. In that case, users can either recompile or use the `libomptarget.so.18` that comes with the previous release.

Chrome rolls libc++ and libc++abi as separate projects. As a result, they may not always be updated in lockstep, and this can lead to build failures when mixing libc++ that doesn't have <__thread/support.h> with libc++abi that requires it. This patch adds a workaround to make libc++abi work with both versions. While Chrome's setup is not supported, this workaround will allow them to go back to green and do the required work needed to roll libc++ and libc++abi in lockstep. This workaround will be short-lived -- I have a reminder to go back and remove it by EOW.

The way the locals are laid out on the stack on x86-64 Debian is resulting in a test failure with the new large watchpoint support. Collecting more logging before I revert/debug it.

This documents some of the architectural direction for DXIL and tries to provide a bit of a map for where to implement different aspects of DXIL support. Pull Request: llvm/llvm-project#78221

…el#12510) This implements the unified memory API for scatter with USM pointers. --------- Signed-off-by: Sarnie, Nick <nick.sarnie@intel.com>

Enable chained fixups in lld when all platform and version criteria are met. This is an attempt at simplifying the logic used in ld 907: https://github.com/apple-oss-distributions/ld64/blob/93d74eafc37c0558b4ffb88a8bc15c17bed44a20/src/ld/Options.cpp#L5458-L5549 Some changes were made to simplify the logic: - only enable chained fixups for macOS from 13.0 to avoid the arch check - only enable chained fixups for iphonesimulator from 16.0 to avoid the arch check - don't enable chained fixups for not specifically listed platforms - don't enable chained fixups for arm64_32

Watchpoint test fails on arm-ubuntu and x86-64-debian

When verbose lldb watch channel is enabled, print the user requested watchpoint and the resources we've broken it up into.

All GitHub Actions workflows added by intel/llvm project follow similar naming notation: 1. Name starts with `sycl` prefix. 2. Use dash `-` instead of underscore `_` to separate words.

Reverts intel#12525 In addition to file renaming, we need to update file names referenced inside the workflow files.

They started failing in the recent driver update. I can't reproduce it locally with the same driver version but the hardware we have is a little different, maybe that's why. I made an internal tracker for this. Signed-off-by: Sarnie, Nick <nick.sarnie@intel.com>

Deprecated since clang-tidy 17. The rule DCL21-CPP has been removed from the CERT guidelines, so it does not make sense to keep the check. Fixes #42788 Co-authored-by: Carlos Gálvez <carlos.galvez@zenseact.com>

This test is being added as a way to check the behaviour of how progress events are broadcasted when reports are started and ended with the current implementation of progress reports. Here we're mainly checking and ensuring that the current behaviour is that progress events are broadcasted individually and placed in the event queue in order of their creation and deletion.

A DEALLOCATE statement on a pointer should always use PointerDeallocate() in the runtime, even if there's no STAT= or polymorphism or derived types, so that it can be checked to ensure that it is indeed a whole allocation of a pointer.

When testing the arguments to see whether they are integers, check first that they are within the maximum range of a 64-bit integer; otherwise, a value of larger magnitude will set an invalid operand exception flag.

…039) Ensure that #include FOO undergoes macro replacement. But, as is the case with C/C++, continue to not perform macro replacement in a #include directive with <angled brackets>.

When a compilation unit has an interface to an external subroutine or function, and there is a global object (like a module) with the same name, we're emitting an error. This is too strong, the program will still build. This comes up in real applications, too. Downgrade the error to a warning.

…533)" This reverts commit 51e0d1b. That commit breaks a unit test: ``` Failed Tests (1): lldb-unit :: Core/./LLDBCoreTests/4/8 ```

intel#6837 enabled asynchronous buffer destruction for buffers constructed without host data. However, initial fallback assert implementation in intel#3767 predates it and as such had to place the buffer inside `queue_impl` to avoid unintended synchronization point. I don't know if there was the same crash observed on the end-to-end test added as part of this PR prior to intel#3767, but it doesn't even matter because the "new" implementation is both simpler and doesn't result in a crash. I suspect that without it (with the buffer for fallback assert implementation being a data member of `sycl::queue_impl`) we had a cyclic dependency somewhere leading to resource leak and ultimately to the assert in `DeviceGlobalUSMMem::~DeviceGlobalUSMMem()`.

The conflict resolution removed SYCL-related changes; this brings them back.

This reverts commit c84f2ba.

This reverts commit fa42589.

This reverts commit d6e1ae2.

This reverts commit cf2533e.

This reverts commit dad50fe.

This reverts commit 57c66b3.

See intel#12397, the test is flaky in post-commit.

…(#80090) Since we already add a `-fmodule-map-file=` argument for every used modulemap, we can remove all `ModuleMapFiles` entries before adding them. This reduces the number of module variants when `-fmodule-map-file=` appears on the original command line.

As mentioned in llvm/llvm-project#74747, this case is triggering a particularly high cost trip count expansion.

This patch adjusts the Docker container intended for CI use to contain a PGO+ThinLTO+BOLT optimized clang. The toolchain is built within a Github action and takes ~3.5 hours. No caching is utilized. The current PGO optimization is fairly minimal, only running clang over hello world. This can be adjusted as needed.

…encies Removes the MaterializationResponsibility::addDependencies and addDependenciesForAll methods, and transfers dependency registration to the notifyEmitted operation. The new dependency registration allows dependencies to be specified for arbitrary subsets of the MaterializationResponsibility's symbols (rather than just single symbols or all symbols) via an array of SymbolDependenceGroups (pairs of symbol sets and corresponding dependencies for that set). This patch aims to both improve emission performance and simplify dependence tracking. By eliminating some states (e.g. symbols having registered dependencies but not yet being resolved or emitted) we make some errors impossible by construction, and reduce the number of error cases that we need to check. NonOwningSymbolStringPtrs are used for dependence tracking under the session lock, which should reduce ref-counting operations, and intra-emit dependencies are resolved outside the session lock, which should provide better performance when JITing concurrently (since some dependence tracking can happen in parallel). The Orc C API is updated to account for this change, with the LLVMOrcMaterializationResponsibilityNotifyEmitted API being modified and the LLVMOrcMaterializationResponsibilityAddDependencies and LLVMOrcMaterializationResponsibilityAddDependenciesForAll operations being removed.

The inf and nan string index bounds checks were after the index was being used. This patch moves the index usage to the end of the condition. Fixes #79988
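
The ordering issue can be shown with a minimal sketch. The function below is hypothetical and written in Python for illustration (the actual fix is in the C++ string-to-float code, which is not reproduced here); it relies on short-circuit evaluation so that the string is only indexed once the bounds check has passed.

```python
def matches_inf(text: str, index: int) -> bool:
    """True if text[index:index + 3] spells 'inf', case-insensitively."""
    # The bounds checks come first. `and` short-circuits, so text[...] is
    # only evaluated after the index is known to be in range.
    return (
        index >= 0
        and index + 2 < len(text)
        and text[index].lower() == "i"
        and text[index + 1].lower() == "n"
        and text[index + 2].lower() == "f"
    )
```

If the character comparisons came before the length check, a short input such as `"in"` would be indexed out of range: an `IndexError` in Python, and an out-of-bounds read in C or C++.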

…TOC (#79530) This patch adds support for common and local symbols in the TOC for AIX. Note that we need to update isVirtualSection: a common symbol in the TOC has the symbol type XTY_CM and is initialized when placed in the TOC, so sections with this type are no longer virtual. --------- Co-authored-by: Zaara Syeda <syzaara@ca.ibm.com>

…issue. (#79398) There are currently a few checkers that don't fill in the bug report's "decl-with-issue" field (typically a function in which the bug is found). The new attribute `[[clang::suppress]]` uses decl-with-issue to reduce the size of the suppression source range map so that it doesn't need to build it for the entire translation unit. I'm already seeing a few problems with this approach so I'll probably redesign it at some point as it looks like a premature optimization. Not only should checkers not be required to pass decl-with-issue (consider clang-tidy checkers that never had such a notion), but it is also not necessarily uniquely determined (consider leak suppressions at the allocation site). For now I'm adding a simple stop-gap solution that falls back to building the suppression map for the entire TU whenever decl-with-issue isn't specified. This won't happen in the default setup because luckily all default checkers do provide decl-with-issue. --------- Co-authored-by: Balazs Benics <benicsbalazs@gmail.com>

Add a new node `AArch64ISD::URSHR_I_PRED`. `srl(add(X, 1 << (ShiftValue - 1)), ShiftValue)` is transformed to `urshr`, or to `rshrnb` (as before) if the result is truncated. `uzp1(rshrnb(uunpklo(X),C), rshrnb(uunpkhi(X), C))` is converted to `urshr(X, C)` (tested by the wide_trunc tests). Pattern matching code in `canLowerSRLToRoundingShiftForVT` is taken from prior code in rshrnb. It returns true if the add has NUW or if the number of bits used in the return value allows us to not care about the overflow (tested by rshrnb test cases).
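
A small worked example may help show why the overflow condition matters. The helpers below are hypothetical Python models, not the SelectionDAG code: `srl_of_add` lets the add wrap at a fixed bit width, while `rounding_shift_right` models a rounding shift with no intermediate overflow. The last check in the usage note is one reading of the "number of bits used" condition, stated here as an assumption.

```python
def srl_of_add(x: int, shift: int, bits: int) -> int:
    """srl(add(x, 1 << (shift - 1)), shift), with the add wrapping at `bits` bits."""
    mask = (1 << bits) - 1
    return ((x + (1 << (shift - 1))) & mask) >> shift


def rounding_shift_right(x: int, shift: int) -> int:
    """Rounding shift right computed without intermediate overflow."""
    return (x + (1 << (shift - 1))) >> shift


def low_bits_agree(shift: int, bits: int) -> bool:
    """Check that the low (bits - shift) bits of both forms match for every input."""
    low_mask = (1 << (bits - shift)) - 1
    return all(
        srl_of_add(x, shift, bits) & low_mask == rounding_shift_right(x, shift) & low_mask
        for x in range(1 << bits)
    )
```

For 8-bit values and a shift of 2, `x = 10` gives 3 in both forms, but `x = 255` wraps in the add and gives 0 instead of 64. The two forms therefore agree when the add cannot wrap (NUW), and `low_bits_agree(2, 8)` confirms that they also agree whenever only the low 6 bits of the result are used.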

…cs (#80209)

Add TableGen patterns to convert more instructions to boolean expressions: - **mul -> and/or**: i1 multiply instructions currently cannot be selected causing the compiler to crash. See llvm/llvm-project#57404 - **select -> and/or**: Converting selects to and/or can enable more optimizations. `InstCombine` cannot do this as aggressively due to poison semantics.
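
The boolean identities behind these patterns can be checked exhaustively for 1-bit values. The sketch below is a plain truth-table check in Python with hypothetical helper names; it does not model poison, which is the reason the message gives for `InstCombine` being more conservative.

```python
from itertools import product


def mul_i1(a: int, b: int) -> int:
    """Multiply two 1-bit values, keeping only the low bit."""
    return (a * b) & 1


def select(cond: int, if_true: int, if_false: int) -> int:
    """Model of a select on 1-bit operands."""
    return if_true if cond else if_false


def i1_identities_hold() -> bool:
    """Verify mul -> and, select(c, x, 0) -> and, select(c, 1, x) -> or."""
    for a, b in product((0, 1), repeat=2):
        if mul_i1(a, b) != (a & b):
            return False
        if select(a, b, 0) != (a & b):
            return False
        if select(a, 1, b) != (a | b):
            return False
    return True
```

All four input combinations satisfy the three identities, so `i1_identities_hold()` returns `True`.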

If we can't produce a large enough index vector in i8, we may need to legalize the shuffle (via scalarization - which in turn gets lowered into stack usage). This change makes two related changes: * Deferring legalization until we actually need to generate the vrgather instruction. With the new recursive structure, this only happens when doing the fallback for one of the arms. * Check the actual mask values for something outside of the representable range. Both are covered by recently added tests.

The purpose of m_being_created in these classes was to prevent broadcasting an event related to these Breakpoints during the creation of the breakpoint (i.e. in the constructor). In Breakpoint and Watchpoint, m_being_created had no effect. That is to say, removing it does not change behavior. However, BreakpointLocation does still use m_being_created. In the constructor, SetThreadID is called which does broadcast an event only if `m_being_created` is false. Instead of having this logic be roundabout, the constructor instead calls `SetThreadIDInternal`, which actually changes the thread ID. `SetThreadID` also will call `SetThreadIDInternal` in addition to broadcasting a changed event.
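
The resulting structure can be sketched as follows. This is an illustrative Python model with hypothetical names (`BreakpointLocationSketch`, `on_changed`), not LLDB's C++ class: the constructor uses the internal setter, which never broadcasts, and the public setter delegates to it and then notifies listeners.

```python
class BreakpointLocationSketch:
    """Toy model of a location whose constructor must not broadcast events."""

    def __init__(self, thread_id, on_changed):
        self._on_changed = on_changed  # callback standing in for event broadcast
        self._thread_id = None
        # Construction goes through the internal setter: no event is sent.
        self._set_thread_id_internal(thread_id)

    @property
    def thread_id(self):
        return self._thread_id

    def _set_thread_id_internal(self, thread_id):
        """Change the thread ID without notifying anyone."""
        self._thread_id = thread_id

    def set_thread_id(self, thread_id):
        """Change the thread ID and broadcast a 'changed' event."""
        self._set_thread_id_internal(thread_id)
        self._on_changed(self)
```

Passing `events.append` as `on_changed` shows the effect: constructing the object leaves `events` empty, and each later `set_thread_id` call appends exactly one entry. No "being created" flag is needed.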

This is a follow up to an item I noted in my submission comment for e947f95. I don't have a real world example where this is triggering unprofitably, but avoiding the transform when we estimate the loop to be short running from profiling seems quite reasonable. It's also now come up as a possibility in a regression twice in two days, so I'd like to get this in to close out the possibility if nothing else. The original review dropped the threshold for short trip count loops. I will return to that in a separate review if this lands.

This partially reverts commit aa964f1 because it caused perf regressions in rccl due to drop of -mllvm -amgpu-kernarg-preload-count=16 from the linker step. Potentially it could cause similar regressions for other HIP apps using -mllvm options with -fgpu-rdc. Fixes: SWDEV-443345

All GitHub Actions workflows added by intel/llvm project are expected to use the following naming notation: 1. Name starts with `sycl` prefix. 2. Use dash `-` to separate words (instead of underscore `_`). This patch fixes the naming of workflows which do not follow this notation.

…533)" This reverts commit 209fe1f. The original commit failed due to an assertion failure in the unit test `ProgressReportTest` that the commit added. The Debugger::Initialize() function was called more than once, which triggered the assertion, so this commit calls that function under a `std::call_once`.

…rts (#79533)"" This reverts commit a5a8cbb. The test being added by that commit still fails on the assertion that Debugger::Initialize has been called more than once.

We don't have an AMO instruction for Nand, so with the A extension we use an LR/SC loop. If we have Zacas we can use a CAS loop instead. According to the Zacas spec, a CAS loop scales to highly parallel systems better than LR/SC.
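
The shape of a CAS loop for an atomic NAND can be sketched in software. The `AtomicWord` class below is a hypothetical stand-in that simulates compare-and-swap with a lock; it is not RISC-V code and says nothing about the instructions the backend emits. It only shows the retry logic: read the old value, compute `~(old & operand)`, and retry if another writer got in first.

```python
import threading


class AtomicWord:
    """Simulated fixed-width word with a compare-and-swap operation."""

    def __init__(self, value: int, bits: int = 32):
        self.mask = (1 << bits) - 1
        self._value = value & self.mask
        self._lock = threading.Lock()

    def load(self) -> int:
        with self._lock:
            return self._value

    def compare_and_swap(self, expected: int, desired: int) -> bool:
        """Store `desired` only if the current value equals `expected`."""
        with self._lock:
            if self._value != expected:
                return False
            self._value = desired & self.mask
            return True


def fetch_nand(word: AtomicWord, operand: int) -> int:
    """Atomically replace the value with ~(value & operand); return the old value."""
    while True:
        old = word.load()
        new = ~(old & operand) & word.mask
        if word.compare_and_swap(old, new):
            return old
        # Another thread changed the word between load and CAS: retry.
```

For a 4-bit word holding `0b1100`, `fetch_nand(word, 0b1010)` returns the old value `0b1100` and leaves `0b0111` in the word.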

Fix rst comment, add checks for recently implemented functions+macro.

The read function wasn't properly unpoisoning its result under msan, causing test failures downstream when I tried to roll it out. This patch adds the msan unpoison call that fixes the issue.

This PR adds patterns to convert a sub-byte vector transpose into a sequence of instructions that perform the transpose on i8 vector elements. Whereas this rewrite may not lead to the absolute peak performance, it should ensure correctness when dealing with sub-byte transposes.

* Use APFloat conversion function instead of going through double to check if fold results in information loss. * Support folding vector constants.

…on(NFC) (#79874)

So that it can be used by clang-format.

Change AfterPlacementOperator to a boolean and deprecate SBPO_Never, which meant never inserting a space except when after new/delete. Fixes #78892.

…#80130) This makes it easier to count how many iterations an analysis takes to complete. It also makes it easier to compare how a change to the analysis code affects the timeline. Here's a sample screenshot: ![image](https://github.com/llvm/llvm-project/assets/29098113/b3f44b4d-7037-4f28-9532-5418663250e1)

This patch is the next piece of work in my Large Watchpoint proposal, https://discourse.llvm.org/t/rfc-large-watchpoint-support-in-lldb/72116 This patch breaks a user's watchpoint into one or more WatchpointResources which reflect what the hardware registers can cover. This means we can watch objects larger than 8 bytes, and we can watch unaligned address ranges. On a typical 64-bit target with 4 watchpoint registers you can watch 32 bytes of memory if the start address is doubleword aligned. Additionally, if the remote stub implements AArch64 MASK style watchpoints (e.g. debugserver on Darwin), we can watch any power-of-2 size region of memory up to 2GB, aligned to that same size. I updated the Watchpoint constructor and CommandObjectWatchpoint to create a CompilerType of Array<UInt8> when the size of the watched region is greater than pointer-size and we don't have a variable type to use. For pointer-size and smaller, we can display the watched granule as an integer value; for larger-than-pointer-size we will display as an array of bytes. I have `watchpoint list` now print the WatchpointResources used to implement the watchpoint. I added a WatchpointAlgorithm class which has a top-level static method that takes an enum flag mask WatchpointHardwareFeature and a user address and size, and returns a vector of WatchpointResources covering the request. It does not take into account the number of watchpoint registers the target has, or the number still available for use. Right now there is only one algorithm, which monitors power-of-2 regions of memory. For up to pointer-size, this is what Intel hardware supports. AArch64 Byte Address Select watchpoints can watch any number of contiguous bytes in a pointer-size memory granule, that is not currently supported so if you ask to watch bytes 3-5, the algorithm will watch the entire doubleword (8 bytes). The newly default "modify" style means we will silently ignore modifications to bytes outside the watched range.
I've temporarily skipped TestLargeWatchpoint.py for all targets. It was only run on Darwin when using the in-tree debugserver, which was a proxy for "debugserver supports MASK watchpoints". I'll be adding the aforementioned feature flag from the stub and enabling full mask watchpoints when a debugserver with that feature is enabled, and re-enable this test. I added a new TestUnalignedLargeWatchpoint.py which only has one test but it's a great one, watching a 22-byte range that is unaligned and requires four 8-byte watchpoints to cover. I also added a unit test, WatchpointAlgorithmsTests, which has a number of simple tests against WatchpointAlgorithms::PowerOf2Watchpoints. I think there are interesting alternative approaches to how we cover these; I note in the unit test that a user requesting a watch on address 0x12e0 of 120 bytes will be covered by two watchpoints today, 128-byte watchpoints at 0x1280 and at 0x1300. But it could be done with a 16-byte watchpoint at 0x12e0 and a 128-byte at 0x1300, which would have fewer false positives/private stops. As we try refining this one, it's helpful to have a collection of tests to make sure things don't regress. I tested this on arm64 macOS, (genuine) x86_64 macOS, and AArch64 Ubuntu. I have not modified the Windows process plugins yet, I might try that as a standalone patch, I'd be making the change blind, but the necessary changes (see ProcessGDBRemote::EnableWatchpoint) are pretty small so it might be obvious enough that I can change it and see what the Windows CI thinks. There isn't yet a packet (or a qSupported feature query) for the gdb remote serial protocol stub to communicate its watchpoint capabilities to lldb. I'll be doing that in a patch right after this is landed, having debugserver advertise its capability of AArch64 MASK watchpoints, and have ProcessGDBRemote add eWatchpointHardwareArmMASK to WatchpointAlgorithms so we can watch larger than 32-byte requests on Darwin.
I haven't yet tackled WatchpointResource *sharing* by multiple Watchpoints. This is all part of the goal, especially when we may be watching a larger memory range than the user requested, if they then add another watchpoint next to their first request, it may be covered by the same WatchpointResource (hardware watchpoint register). Also one "read" watchpoint and one "write" watchpoint on the same memory granule need to be handled, making the WatchpointResource cover all requests. As WatchpointResources aren't shared among multiple Watchpoints yet, there's no handling of running the conditions/commands/etc on multiple Watchpoints when their shared WatchpointResource is hit. The goal beyond "large watchpoint" is to unify (much more) the Watchpoint and Breakpoint behavior and commands. I have a feeling I may be slowly chipping away at this for a while. Re-landing this patch after fixing two undefined behaviors in WatchpointAlgorithms found by UBSan and by failures on different CI bots. rdar://108234227

Pull Request: llvm/llvm-project#80125

… flag (#79882) FileCheck test added

```
./bin/llvm-lit -sv llvm/test/tools/llvm-gsymutil/X86/elf-dwo.yaml
```

Manual test steps:

- Create binary with split-dwarf:

```
clang++ -g -gdwarf-4 -gsplit-dwarf main.cpp -o main_split
```

- Remove or rename the dwo file to a different name so llvm-gsymutil can't find it

```
mv main_split-main.dwo main_split-main__.dwo
```

- Now run llvm-gsymutil conversion; it should print out a warning with and without the `--quiet` flag

```
$ ./bin/llvm-gsymutil --convert=./main_split
Input file: ./main_split
Output file (x86_64): ./main_split.gsym
warning: Unable to retrieve DWO .debug_info section for main_split-main.dwo
Loaded 0 functions from DWARF.
Loaded 12 functions from symbol table.
Pruned 0 functions, ended with 12 total
```

```
$ ./bin/llvm-gsymutil --convert=./main_split --quiet
Input file: ./main_split
Output file (x86_64): ./main_split.gsym
warning: Unable to retrieve DWO .debug_info section for some object files. (Remove the --quiet flag for full output)
Pruned 0 functions, ended with 12 total
```

@mizvekov

Close llvm/llvm-project#79240

Cite the comment from @mizvekov in https://github.com/llvm/llvm-project/issues/79240:

> There are two kinds of bugs / issues relevant here:
>
> Clang bugs that this change hides
> Here we can add a Frontend flag that disables the GMF ODR check, just so we can keep tracking, testing and fixing these issues.
> The Driver would just always pass that flag.
> We could add that flag in this current issue.
> Bugs in user code:
> I don't think it's worth adding a corresponding Driver flag for controlling the above Frontend flag, since we intend it's behavior to become default as we fix the problems, and users interested in testing the more strict behavior can just use the Frontend flag directly.

This patch follows the suggestion:

- Introduce the CC1 flag `-fskip-odr-check-in-gmf`, which is off by default, so that every existing test will still be tested with ODR violation checking.
- Pass `-fskip-odr-check-in-gmf` in the driver to keep the behavior we intended.
- Edit the document to tell users who are still interested in stricter checks that they can use `-Xclang -fno-skip-odr-check-in-gmf` to get the existing behavior.

… (#71701) Add AllowStringArrays option, enabling the exclusion of array types with deduced sizes constructed from string literals. This includes only var declarations of array of characters constructed directly from c-strings. Closes #59475

If clang-format is not sure whether a `requires` keyword starts a requires clause or a requires expression, it looks ahead to see if any token disqualifies it from being a requires clause. Among these tokens was `decltype`, since it fell through the switch. This patch allows `decltype` to appear in a requires clause. I'm not 100% sure this change won't have repercussions, but that just means we need more test coverage! Fixes llvm/llvm-project#78645

The CI bot crashes when running this unittest. The printfs aren't printing into the CI log output.

The added test case would trigger the removed assertion.

Only the stage2-distribution target is built by default for the stage2 distribution installation target. This means that we don't get a BOLT optimized binary. This patch explicitly builds the stage2-clang-bolt target before the distribution installation target so that the clang binary is optimized before it gets installed.

Initialize the first element to 0 and the second element to the value of the subexpression.

Promoted and NF LZCNT/POPCNT/TZCNT were supported in #79954. Because null_frag is used in the patterns for these variants, TableGen cannot infer mayLoad = 1 for them. This can be tested by MCA tests, which will be added after -mcpu=<cpu_with_apx> is supported.

We create them more often in C, so it's more likely to happen there.

Just delegate to the resulting expression.

…ng has no effect (NFC) (#80129) This aims to clean-up confusing uses of builder.createOrFold<ConstantOp> since folding of constants fails.

Go back to the original form of this file, from before I added the temporary workaround.

After iterating with the arm-ubuntu CI bot, I found the crash (a std::bad_alloc exception being thrown) was caused by these two entries when built on a 32-bit machine. I probably have an assumption about size_t being 64 bits in WatchpointAlgorithms, and we have a problem when it's actually 32 bits and we're dealing with a real 64-bit address. All of the cases where the address can be represented in the low 32 bits of the addr_t work correctly, so for now I'm skipping these two unit tests when building lldb on a 32-bit host until I can review that method and possibly switch to explicit uint64_t's.

…(#78312) The greedy pattern rewrite driver has multiple "expensive checks" to detect invalid rewrite pattern API usage. As part of these checks, it computes fingerprints for every op that is in scope, and compares the fingerprints before and after an attempted pattern application. Until now, each computed fingerprint took into account all nested operations. That is quite expensive because it walks the entire IR subtree. It is also redundant in the expensive checks because we already compute a fingerprint for every op. This commit significantly improves the running time of the "expensive checks" in the greedy pattern rewrite driver.

Derived type translation is proving expensive in modern Fortran apps with many big derived types with dozens of components and parents. Extending the cache that prevents recursion is proving to have little cost on apps with small derived types and a significant gain (it can divide compile time by 2) on modern Fortran apps. It is legal since the cache lifetime is less than the lifetime of the MLIRContext that owns the cached mlir::Type. Doing so also exposed that the current caching was incorrect: the type symbol is the same for kind-parametrized derived types regardless of the kind parameters, but instances with different kinds should lower to different MLIR types. See the added test. Using the type scopes fixes the problem.

Don't search for unnecessary libs when linking the shared lib. This allows the test to run in chroot environment.

…#79612) This adds a `func`, `call` and `return` operation to the EmitC dialect, closely related to the corresponding operations of the Func dialect. In contrast to the operations of the Func dialect, the EmitC operations do not support multiple results. The `emitc.func` op features a `specifiers` argument that for example allows, with corresponding support in the emitter, to emit `inline static` functions. Furthermore, this adds patterns and a pass to convert the Func dialect to EmitC. A `func.func` op that is `private` is converted to `emitc.func` with a `"static"` specifier.

These two are intertwined enough so it doesn't really make sense to have it standalone and hack around it by putting headers into both.

…nd on This is a mess and needs to be cleaned up some day.

…7153) This patch replaces --num-repetitions with --min-instructions to make it more clear that the value refers to the minimum number of instructions in the final assembled snippet rather than the number of repetitions of the snippet. This patch also refactors some llvm-exegesis internal variable names to reflect the name change. Fixes #76890.

The verifiers are currently very strict: requiring intrinsic operations to be used only in cases where the Fortran standard permits the intrinsic to be used. There have now been a lot of cases where these verifiers have caused bugs in corner cases. In a recent ticket, @jeanPerier pointed out that it could be useful for future optimizations if somewhat invalid uses of these operations could be allowed in dead code. See this comment: llvm/llvm-project#79995 (comment) In response to all of this, I have decided to relax the intrinsic operation verifiers. The intention is now to only disallow operation uses that are likely to crash the compiler. Other checks are still available under `-strict-intrinsic-verifier`. The disadvantage of this approach is that IR can now represent intrinsic invocations which are incorrect. The lowering and implementation of these intrinsic functions is unlikely to do the right thing in all of these cases, and as they should mostly be impossible to generate using normal Fortran code, these edge cases will see very little testing, before some new optimization causes them to become more common. Fixes #79995

This updates clang's target defines to include the ACLE changes covering the FEAT_PAuth_LR architecture extension. The changes include: * The new `__ARM_FEATURE_PAUTH_LR` feature macro, which is set to 1 when FEAT_PAuth_LR is available in the target. * A new bit field for the existing `__ARM_FEATURE_PAC_DEFAULT` macro, indicating the use of PC as a diversifier for Pointer Authentication (from -mbranch-protection=pac-ret+pc). The approved changes to the ACLE spec can be found here: ARM-software/acle#292
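As a rough sketch of how user code might consume the new feature macro, the helper below tests `__ARM_FEATURE_PAUTH_LR` at compile time. The function name `has_pauth_lr` is hypothetical and chosen only for illustration; the sketch deliberately does not decode the new `__ARM_FEATURE_PAC_DEFAULT` bit, since the bit position is not stated here.

```cpp
#include <cassert>

// Hypothetical helper (not part of clang or the ACLE): reports whether the
// target advertises FEAT_PAuth_LR through the ACLE feature macro.
constexpr bool has_pauth_lr() {
#if defined(__ARM_FEATURE_PAUTH_LR) && __ARM_FEATURE_PAUTH_LR == 1
  return true;   // macro is set to 1 when FEAT_PAuth_LR is available
#else
  return false;  // macro absent: feature not available in the target
#endif
}
```

Because the check is resolved by the preprocessor, the result can also be used in `static_assert` or `if constexpr` to select a code path per target.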

…#79816) According to its doc-comment `isImplicit` is meant to return true if the expression is an implicit location description (describes an object or part of an object which has no location by computing the value from available program state). There's a brief entry for `DW_OP_LLVM_tag_offset` in the LangRef and there's some info in the original commit fb9ce10. From what I can tell it doesn't look like `DW_OP_LLVM_tag_offset` affects whether or not the location is implicit; the opcode doesn't get included in the final location description but instead is added as an attribute to the variable. This was tripping an assertion in the latest application of the fix to #76545, #78606, where an expression containing a `DW_OP_LLVM_tag_offset` is split into a fragment (i.e., describe a part of the whole variable).

When the markdown link renders the line gets a lot shorter.

…. (#79512) We are replacing with a wider increment. If both OrigInc and IsomorphicInc are NUW/NSW, then we can preserve them on the wider increment; the narrower IsomorphicInc would wrap before the wider OrigInc, so the replacement won't make IsomorphicInc's uses more poisonous. PR: llvm/llvm-project#79512
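The claim that the narrower increment wraps before the wider one can be illustrated with a small arithmetic sketch. It assumes both counters start at 0 and advance by the same positive step; `increments_before_signed_wrap` is a hypothetical helper for illustration and is not code from the patch.

```cpp
#include <cassert>

// Hypothetical helper: number of increments by `step` (> 0), starting from 0,
// that a signed counter of width `bits` (2..62) can take before the next
// signed add would overflow. After k increments the value is k * step, which
// must stay <= the largest representable value.
constexpr long long increments_before_signed_wrap(int bits, long long step) {
  const long long max_value = (1LL << (bits - 1)) - 1;
  return max_value / step;
}
```

For an 8-bit and a 16-bit counter with step 1 this gives 127 and 32767 increments respectively, so any iteration on which the wide add would overflow comes after the narrow add has already overflowed.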

…s` (#79978) Implements P2652R2 <https://wg21.link/P2652R2>: - https://eel.is/c++draft/allocator.requirements.general - https://eel.is/c++draft/memory.syn - https://eel.is/c++draft/allocator.traits.general - https://eel.is/c++draft/allocator.traits.members - https://eel.is/c++draft/diff.cpp20.concepts - https://eel.is/c++draft/diff.cpp20.utilities --------- Co-authored-by: Zingam <zingam@outlook.com>

…#12538) Do not allow fusion when one of the kernels has an explicit local size and it requires ID remapping, i.e., it has a different number of dimensions w.r.t. the fused ND-range or different global size in dimensions [2, N). In this case, two work-items belonging to the same work-group may not belong to the same work-group in the fused ND-range. Signed-off-by: Victor Perez <victor.perez@codeplay.com> --------- Signed-off-by: Victor Perez <victor.perez@codeplay.com>

…2521) oneapi-src/unified-runtime#1070 and intel#11952 introduced a new variant of the `enableCUDATracing` function that takes a context pointer parameter, replacing the parameterless variant of that function. The older variant will be removed from UR once this PR is merged.

Improved joint_matrix layout test coverage. The test framework that the CUDA backend tests use has been updated to support all possible `joint_matrix` gemm API combinations, including all matrix layouts. The gemm header is backend agnostic; hence all backends could use this test framework in the future. This test framework can also act as an example to show how to deal with different layout combinations when computing a general GEMM. Signed-off-by: JackAKirk <jack.kirk@codeplay.com>

Split out from #78417. Reviewers: topperc, asb, kito-cheng Reviewed By: asb Pull Request: llvm/llvm-project#79248

…v_pulldown

Original commit: KhronosGroup/SPIRV-LLVM-Translator@fd22f8e

Add support for load/store operations on a cooperative matrix such that the original matrix shape is known and implementations are able to reason about how to handle out-of-bounds accesses.

CapabilityCooperativeMatrixCheckedInstructionsINTEL = 6192
CooperativeMatrixLoadCheckedINTEL = 6193
CooperativeMatrixStoreCheckedINTEL = 6194

Original commit: KhronosGroup/SPIRV-LLVM-Translator@b62cb55

The goal of the PR is to add an API to the SPIR-V LLVM Translator for querying an error message by error code, as discussed in intel#2298. The need, and a possible application, is generating human-readable error information from error codes returned by other SPIRV Translator API calls, including getSpirvReport(). Original commit: KhronosGroup/SPIRV-LLVM-Translator@afe1971

Map the @llvm.frexp intrinsic to the OpenCL Extended Instruction frexp builtin. The difference in signatures and return values is covered by extracting/combining values from and into a composite type.

LLVM IR: { float %fract, i32 %exp } @llvm.frexp.f32.i32(float %val)
SPIR-V: { float %fract } ExtInst frexp (float %val, i32 %exp)

Original commit: KhronosGroup/SPIRV-LLVM-Translator@e8b2018
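The signature mismatch described above (an aggregate return value in LLVM IR versus an exponent passed as an operand in the OpenCL builtin) has a close analogue in plain C++, where `std::frexp` also returns the fraction and writes the exponent through an out-parameter. The sketch below adapts one form to the other; `FrexpResult` and `frexp_as_aggregate` are hypothetical names used only for illustration and are not translator code.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical struct mirroring the { float, i32 } aggregate that
// @llvm.frexp.f32.i32 returns.
struct FrexpResult {
  float fract;
  int exp;
};

// Hypothetical adapter: calls the out-parameter form (std::frexp) and
// combines both results into one composite value.
inline FrexpResult frexp_as_aggregate(float val) {
  int exp = 0;
  const float fract = std::frexp(val, &exp);  // fract in [0.5, 1) for val > 0
  return FrexpResult{fract, exp};
}
```

For example, `frexp_as_aggregate(8.0f)` yields a fraction of 0.5 and an exponent of 4, since 8 = 0.5 * 2^4.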

…ntel#2288) The translator failed the assertion `V->user_empty()` during function regularization when the result of `shl i1` or `lshr i1` is used. E.g.

%2 = shl i1 %0 %1
store %2, ptr addrspace(1) @G.1, align 1

The `shl i1` instruction is converted to `lshr i32`, whose arithmetic has the same behavior. Original commit: KhronosGroup/SPIRV-LLVM-Translator@239fbd4

…#2331) Add support for checked matrix construct instruction. Specification draft: https://github.com/intel/llvm/blob/2fa153ee852ea3d7d64df097f1f494cddacee90e/sycl/doc/design/spirv-extensions/SPV_INTEL_joint_matrix.asciidoc Original commit: KhronosGroup/SPIRV-LLVM-Translator@a1b1f49

Replace some deprecated 'startswith' and 'endswith' with 'starts_with' and 'ends_with' to clear some warnings when building SYCL compiler. --------- Signed-off-by: jinge90 <ge.jin@intel.com>

This reverts commit 3d4c6c7. Due to | * 6e6aa44 2024-01-31 Revert "[Clang][Sema] fix outline member function template with defau… (#80144) ekeane@nvidia.com

We currently support -O3 for Linux compilations; expand this to also be available on Windows. This also better aligns with our existing product offerings.

The compiler was crashing when the user requested fp-accuracy for the functions in a call of the form f1(f2(f3(...))), where f1, f2 and f3 were fpbuiltin but the innermost function didn't have an fpbuiltin. The current builtinID was used instead of getting the builtinID from the current function, which caused a crash in the compiler. This patch fixes the issue and renames the function EmitFPBuiltinIndirectCall to MaybeEmitFPBuiltinofFD.

intel#12297) We want to change the signature of `piMemGetNativeHandle` for reasons explained here oneapi-src/unified-runtime#1199 Corresponding UR PR: oneapi-src/unified-runtime#1226 A previous PR added a new entry point intel#12199 but it was decided that it is better to modify the existing entry point

…:intel::math (intel#12571) Signed-off-by: jinge90 <ge.jin@intel.com>

LLVM: llvm/llvm-project@178719e SPIRV-LLVM-Translator: KhronosGroup/SPIRV-LLVM-Translator@a1b1f49

Problems found by Gregory (thanks!):

1) There were some duplicated tests; remove those.
2) We didn't test non-LSC mask on Gen12.
3) We get an ambiguous call because we had an old function that didn't have VS, but the new functions have default VS=1, so we don't need the old one.
4) When we pass a simd_view for the vals, we got a template match failure. This is the same issue we hit in the compile-time tests, where even if we have a simd_view overload the compiler can't infer N, so we need to provide T,N anyway; add that in the tests.

I tested this on Gen12. Signed-off-by: Sarnie, Nick <nick.sarnie@intel.com>

…12579) Add `EXCLUDE_FROM_ALL` to the `add_subdirectory` call for the OneAPI Construction Kit, in order to avoid building its components unless they are required by the SYCL toolchain.

The FPGA emulator is currently affected by the same issue as the CPU runtime. Signed-off-by: John Pennycook <john.pennycook@intel.com>

…ntel#12548) We have flakiness in nightly testing results. Having more variety would hopefully provide some insight into the conditions under which it happens. The task is only executed once a day, so the extra resources needed shouldn't affect the load on the runners much.

oneapi-src/unified-runtime#1302

Commits on Jan 31, 2024

Commits on Feb 1, 2024

Commits on Feb 2, 2024