Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SYCL][Graph] Update spec supported features #338

Closed
wants to merge 899 commits into from

Conversation

EwanC
Copy link
Collaborator

@EwanC EwanC commented Oct 30, 2023

@EwanC EwanC added the Graph Specification Extension Specification related label Oct 30, 2023
Comment on lines 1170 to 1171
the Explicit API. Empty nodes can be used instead of barriers when a user is
building a graph with the explicit API.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently if users using the Explicit API want to implement a barrier that waits for all previous nodes to complete, they have to create an empty node that explicitly depends on all previous nodes. Does it make sense to change the definition of empty nodes such that if users add an empty node with no dependencies, this node will automatically take all previous nodes as dependencies (general barrier)? Or, if we want to keep a empty node without dependency (not sure why should be necessary?) to propose a wildcard (shortcut) to add all previous nodes as dependencies?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That idea makes sense to me 👍 keeping the existing empty node with no dependency semantics doesn't really add any value to the user, but making it depend on previous leaf nodes corresponds to barrier semantics to is useful and better aligns with the message of an alternative to the barrier extension.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like having that automatic creation of dependencies on the explicit API is a little bit counter to the existing behaviour where the explicit API doesn't do things like this in the background. This perhaps makes it a bit unintuitive.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tend to agree with @Bensuo here. However, we might want to add a barrier shortcut if this is frequently used in applications.

. Using reductions in a graph node.
. Using sycl streams in a graph node.
. Using a kernel bundle in a graph node.
. Profiling an event returned from graph submission with
`event::get_profiling_info()`.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should also add a bullet point here about passing the no immediate command-list property on queue creation to workaround current issue with immediate command-lists.

Dinistro and others added 23 commits November 1, 2023 08:40
This commit removes the typed pointer support from the LaunchFunc's
lowering to Vukan dialect. Typed pointers have been deprecated for a
while now and it's planned to soon remove them from the LLVM dialect.

Related PSA:
https://discourse.llvm.org/t/psa-removal-of-typed-pointers-from-the-llvm-dialect/74502
Recent versions of GNU binutils starting from 2.39 support symbol+offset
lookup in addition to the usual numeric address lookup. This change adds
symbol lookup to llvm-symbolize and llvm-addr2line.

Now llvm-symbolize behaves closer to GNU addr2line, - if the value specified
as address in command line or input stream is not a number, it is treated as
a symbol name. For example:

    llvm-symbolize --obj=abc.so func_22
    llvm-symbolize --obj=abc.so "CODE func_22"

This lookup is now supported only for functions. Specification with
offset is not supported yet.

This is a recommit of 2b27948, reverted
in 39fec54 because the test
llvm/test/Support/interrupts.test started failing on Windows. The test was
changed in 18f036d and is also updated in
this commit.

Differential Revision: https://reviews.llvm.org/D149759
This patch moves `RecordDecl::ArgPassingKind` to DeclBase.h to namespace scope, so that it's complete at the time bit-field is declared.
This commit removes the support for lowering GPU to ROCDL dialect with
typed pointers. Typed pointers have been deprecated for a while now and
it's planned to soon remove them from the LLVM dialect.

Related PSA:
https://discourse.llvm.org/t/psa-removal-of-typed-pointers-from-the-llvm-dialect/74502
This patch adds missing dependencies required by the new unittest introduced by #68406.
This patch moves `OMPDeclareReductionDecl::InitKind` to DeclBase.h, so that it's complete at the point where corresponding bit-field is declared. This patch also converts it to scoped enum named `OMPDeclareReductionInitKind`
This adds a writable attribute, which in conjunction with
dereferenceable(N) states that a spurious store of N bytes is
introduced on function entry. This implies that this many bytes
are writable without trapping or introducing data races. See
https://llvm.org/docs/Atomics.html#optimization-outside-atomic for
why the second point is important.

This attribute can be added to sret arguments. I believe Rust will
also be able to use it for by-value (moved) arguments. Rust likely
won't be able to use it for &mut arguments (tree borrows does not
appear to allow spurious stores).

In this patch the new attribute is only used by LICM scalar promotion.
However, the actual motivation for this is to fix a correctness issue
in call slot optimization, which needs this attribute to avoid
optimization regressions.

Followup to the discussion on D157499.

Differential Revision: https://reviews.llvm.org/D158081
…or is not declared in the base class

Fixes #70464

When ctor is not declared in the base class, initializing the base class
with the initializer list will not trigger a proper assignment of the
base region, as a CXXConstructExpr doing that is not available in the
AST.

This patch checks whether the init expr is an InitListExpr under a base
initializer, and adds a binding if so.
…9945)"

This reverts commit 5bfd89b.

It was causing build failures on ffmpeg on i686.
Adds additional test coverage for Issue #68466
This is intended as the replacement for ConstantExpr::getIntegerCast(),
which does not require availability of the corresponding constant
expressions. It just forwards to ConstantFoldCastOperand with the
correct opcode.
This always works on a constant integer or integer splat, so the
constant fold here should always succeed.
Use ConstantFoldIntegerCast() instead, to remove the reliance on
constant expressions.
…rted to other translation unit. (#68774)

Fixes: #68769

Co-authored-by: miaozhiyuan <miaozhiyuan@feysh.com>
…SS/PACKUS

truncateVectorWithPACK handling of sub-128-bit result types was improved some time ago, so remove the old 64-bit limit

Fixes #68466
This patch moves `ObjCMethodDecl::ImplementationControl` to a DeclBase.h so that it's complete at the point where corresponsing bit-field is declared. This patch also converts it to a scoped enum `clang::ObjCImplementationControl`.
SME2 is documented as part of the main SME supplement:
https://developer.arm.com/documentation/ddi0616/latest/

The one change for debug is this new ZT0 register. This register
contains data to be used with new table lookup instructions.
It's size is always 512 bits (not scalable) and can be
interpreted in many different ways depending on the instructions
that use it. 

The kernel has implemented this as a new register set containing
this single register. It always returns register data (with no header,
unlike ZA which does have a header).

https://docs.kernel.org/arch/arm64/sme.html

ZT0 is only active when ZA is active (when SVCR.ZA is 1). In the 
inactive state the kernel returns 0s for its contents. Therefore
lldb doesn't need to create 0s like it does for ZA. 

However, we will skip restoring the value of ZT0 if we know that
ZA is inactive. As writing to an inactive ZT0 sets SVCR.ZA to 1,
which is not desireable as it would activate ZA also. Whether
SVCR.ZA is set will be determined only by the ZA data we restore.

Due to this, I've added a new save/restore kind SME2. This is easier
than accounting for the variable length ZA in the SME data. We'll only
save an SME2 data block if ZA is active. If it's not we can get fresh
0s back from the kernel for ZT0 anyway so there's nothing for us to
restore.

This new register will only show up if the system has SME2 therefore
the SME set presented to the user may change, and I've had to account
for that in in a few places.

I've referred to it internally as simply "ZT" as the kernel does in
NT_ARM_ZT, but the architecture refers to the specific register as "ZT0"
so that's what you'll see in lldb.

```
(lldb) register read -s 6
Scalable Matrix Extension Registers:
      svcr = 0x0000000000000000
       svg = 0x0000000000000004
        za = {0x00 <...> 0x00}
       zt0 = {0x00 <...> 0x00}
```
Looks like there's code out there that, instead of using
'__attribute__((constructor(x)))' to add constructor functions, they
just declare a global function pointer and use
'__attribute__((section('.ctors')))' instead.

Problem is, with memtag-globals, we pad the global function pointer to
be 16 bytes large. This of course means we have an 8-byte real function
pointer, then 8 bytes of zero padding, and this trips up the loader when
it processes this section.

Fixes #69939
konradkusiak97 and others added 27 commits November 7, 2023 07:53
Small fix to strip down additional CPU information from AMD query, like:
`gfx90a:sramecc+:xnack-` to just `gfx90a`
Signed-off-by: Tikhomirova, Kseniya <kseniya.tikhomirova@intel.com>
…intel#11793)

QA asked us to do this.

---------

Signed-off-by: Sarnie, Nick <nick.sarnie@intel.com>
…d joint_matrix_mad (intel#11738)

This patch adds two new properties `joint_matrix` and `joint_matrix_mad`
to device requirements in
sycl-post-link.
SYCL RT reads these properties and throws exception if objects of
`joint_matrix` type or `joint_matrix_mad` functions are not supported by
the current device. "Unsupported" means matrix type and sizes provided
by user are not compatible with the list of all supported matrix types
and sizes from the runtime query in
`get_info<...matrix_combinations>`.
Since bool vectors have a backing storage of chars, the unary minus
operator was in fact producing `(char)-1`, and thus not adhering to the
ABI where bools should be either 0 or 1.

This could manifest itself in bugs, for instance where the "bool"
elements wouldn't compare equal to other bools, such as those in arrays.

This only manifested itself in device code (possibly because the array
of bools is also an array of bytes under the hood), hence the test case
that differs in style somewhat from the rest.
Old version is temporary kept in SPV_INTEL_joint_matrix_legacy.asciidoc

---------

Signed-off-by: Sidorov, Dmitry <dmitry.sidorov@intel.com>
Since vectors of bools are backed by byte-sized storage, we must ensure
that conversion results from other vector types are correctly brought
into the expected range of bools, e.g., `(char)0` or `(char)1`.
…#11809)

Previously, we committed intel#9642 aiming
to break dependency on metadata order for check_has.cpp but the commit
was broken unexpectedly. This PR re-lands it.

---------

Signed-off-by: jinge90 <ge.jin@intel.com>
- Enables specialization constants handling in SYCL-Graph extension.
- Adds E2E tests that verify this behavior.
- Removes unittests tests that checked for unsupported feature exception
throwing.

---------

Co-authored-by: Maxime France-Pillois <maxime.francepillois@codeplay.com>
Signed-off-by: Tikhomirova, Kseniya <kseniya.tikhomirova@intel.com>
…el#11665)

This patch also implements support for 8- and 16-bit data types in
slm_block_load().

---------

Signed-off-by: Klochkov, Vyacheslav N <vyacheslav.n.klochkov@intel.com>
intel#11832)

Cuda specifies the launch bounds as:
```
__launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor, maxBlocksPerCluster)
```
making it impossible to specify `maxBlocksPerCluster` without the
preceding two attributes (similarly for `minBlocksPerMultiprocessor`),
issue warnings and ignore attributes if the condition is not met.
…1815)

Fixes generating filters like this one:
`DeviceName:{{gfx90a:sramecc\+:xnack\-}},DriverVersion:{{HIP 50422.80}}`
When using -O0 to disable optimizations, also set
-fsycl-disable-range-rounding to further disable optimizations to
improve debugability.
Unused variable - clean it up.
This started causing a hang on a newer GPU driver with O0, we have an
internal tracker for this. Force O2 for now.

Signed-off-by: Sarnie, Nick <nick.sarnie@intel.com>
… op (intel#11847)

Even though std::max is completely legal/correct to use in SYCL/ESIMD
especially in 'constexpr' context, it may cause problems on Windows in
some non-trivial configurations with some odd order of includes of
system and SYCL header files.

Signed-off-by: Klochkov, Vyacheslav N <vyacheslav.n.klochkov@intel.com>
intel#11850)

The element-size address alignment is valid from correctness point of
view, but using 1-byte and 2-byte alignment implicitly causes
performance regression for block_load(const int8_t *, ...) and
block_load(const int16_t *, ...) because GPU BE have to generate slower
GATHER instead of more efficient BLOCK-LOAD. Without this fix block-load
causes up to 44% performance slow-down on some apps that used
block_load() with alignment assumptions used before block_load(usm, ...,
compile_time_props) was implemented.

The reasoning for the expected/assumed alignment from element-size to
4-bytes for byte- and word-vectors is such:
   The idea of block_load() call (opposing to gather() call) is to have
   efficient block-load, and thus the assumed alignment is such that
   allows to generate block-load. This is a bit more tricky for user
   but that is how block_load/store API always worked before: block-load
   had restrictions that needed to be honored.
   To be on safer side, user can always pass the guaranteed alignment.

---------

Signed-off-by: Klochkov, Vyacheslav N <vyacheslav.n.klochkov@intel.com>
The following features defined in the specification
as unsupported, have working implementations upstream.

* intel#11418
* intel#11505
* intel#11556
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Graph Specification Extension Specification related
Projects
None yet
Development

Successfully merging this pull request may close these issues.