[libclc][hip] Fix half shuffles and reenable reduction test #13016

JackAKirk · 2024-03-13T19:29:30Z

Fix broken half shuffles on AMD.
Reenable the Reduction test.

The fix is to bitcast half to its storage type (unsigned short) without performing a value conversion, and then extend the result to int for the shuffle.
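The idea can be sketched on the host in plain C. This is only an illustration under stated assumptions, not the libclc code: `SUBGROUP_SIZE`, `shuffle_i32`, and `shuffle_half_bits` are hypothetical names, the sub-group is modelled as an array of lanes, and each half is represented by its raw 16-bit storage pattern.

```c
#include <assert.h>
#include <stdint.h>

#define SUBGROUP_SIZE 4 /* hypothetical sub-group width for this sketch */

/* Stand-in for a 32-bit sub-group shuffle: returns the value held by `lane`. */
static int32_t shuffle_i32(const int32_t lanes[SUBGROUP_SIZE], uint32_t lane) {
  assert(lane < SUBGROUP_SIZE);
  return lanes[lane];
}

/* Shuffle 16-bit half payloads by widening the raw storage bits.
 * The numeric value of the half is never converted, so the bit
 * pattern survives the trip through the 32-bit shuffle unchanged. */
static uint16_t shuffle_half_bits(const uint16_t half_bits[SUBGROUP_SIZE],
                                  uint32_t lane) {
  int32_t widened[SUBGROUP_SIZE];
  for (int i = 0; i < SUBGROUP_SIZE; ++i)
    widened[i] = (int32_t)half_bits[i]; /* zero-extend the storage bits */
  return (uint16_t)shuffle_i32(widened, lane); /* truncate back to 16 bits */
}
```

For example, with lanes holding the binary16 patterns 0x3C00 (1.0) and 0xC000 (-2.0), the selected lane's pattern comes back bit for bit, whereas a value conversion to int would have carried 1 and -2 instead.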

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>

jchlanda · 2024-03-14T07:43:46Z

libclc/amdgcn-amdhsa/libspirv/misc/sub_group_shuffle.cl

+_CLC_DEF half _Z28__spirv_SubgroupShuffleINTELIDF16_ET_S0_j(
+    half Data, unsigned int InvocationId) {
+  union {
+    int i;


This is a nit, but I'd go with unsigned int, even though we're type punning back to half.


steffenlarsen · 2024-03-14T12:05:41Z

libclc/amdgcn-amdhsa/libspirv/misc/sub_group_shuffle.cl

+    half h;
+  } tmp;
+  tmp.h = Data;
+  tmp.i = _Z28__spirv_SubgroupShuffleINTELIiET_S0_j(tmp.i, InvocationId);


Beware that this relies on UB. See https://en.cppreference.com/w/cpp/language/union:

> It is undefined behavior to read from the member of the union that wasn't most recently written.

That said, I don't know of a better way to get it done... Unless we can use some sort of bitcast builtin here. Have you had a look at alternatives?

Yeah, it is UB according to OpenCL C for one half of the 32 bits (which half depends on endianness).
If OpenCL C had inherited memcpy from C99 it would be possible to do this cleanly, but it didn't. The spec has a section that discusses the issue briefly: https://registry.khronos.org/OpenCL/specs/3.0-unified/html/OpenCL_C.html#reinterpreting-data-as-another-type
One alternative that is not UB is to first convert to float, then do as_int(float), and do the reverse after the shuffle. I tried this and checked that it gives the same result as the approach in this PR. However, it generates extra machine instructions for the float conversions.
The HIP fp16 headers are open source and I can see that they use the union trick, so I figured that if they do it we may as well do it too. At the same time, I could just go via the float route.
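The float route described above can be sketched in host C for a single lane's payload. This is a rough illustration only: every helper name here is hypothetical, `memcpy` stands in for OpenCL C's `as_int`/`as_float`, float is assumed to be IEEE 754 binary32, and the conversions handle only normal binary16 values (no zeros, subnormals, infinities, or NaNs).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Widen binary16 bits to a float of the same value (normal numbers only). */
static float half_bits_to_float(uint16_t h) {
  uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
  uint32_t exp = (h >> 10) & 0x1Fu;
  uint32_t mant = h & 0x3FFu;
  assert(exp != 0 && exp != 0x1Fu); /* sketch skips zero/subnormal/inf/NaN */
  /* Rebias the exponent (127 - 15 = 112) and left-align the mantissa. */
  uint32_t bits = sign | ((exp + 112u) << 23) | (mant << 13);
  float f;
  memcpy(&f, &bits, sizeof f);
  return f;
}

/* Narrow a float that is exactly representable as a normal binary16 value. */
static uint16_t float_to_half_bits(float f) {
  uint32_t bits;
  memcpy(&bits, &f, sizeof bits);
  uint32_t exp = ((bits >> 23) & 0xFFu) - 112u;
  return (uint16_t)(((bits >> 16) & 0x8000u) | (exp << 10) |
                    ((bits >> 13) & 0x3FFu));
}

/* Host stand-ins for as_int(float) and as_float(int): copy the bits. */
static int32_t float_as_int(float f) {
  int32_t i;
  memcpy(&i, &f, sizeof i);
  return i;
}

static float int_as_float(int32_t i) {
  float f;
  memcpy(&f, &i, sizeof f);
  return f;
}

/* Float route: convert, reinterpret, (shuffle would happen here), reverse. */
static uint16_t float_route_round_trip(uint16_t h) {
  int32_t payload = float_as_int(half_bits_to_float(h));
  return float_to_half_bits(int_as_float(payload));
}
```

The two value conversions on either side of the shuffle are the extra work this route pays for avoiding the union.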

And yeah, I looked at the AMD clang builtins and I don't think there is a bitcast one (actually CUDA doesn't have one either). For CUDA I used inline PTX asm, but for AMD this is more dangerous because there is no equivalent to PTX that is independent of the arch generation/family.

Ah, fair point. I missed the fact that this is OpenCL C. I think it is still better to not rely on UB, unless the performance overhead is high. The compiler moves fast and it wouldn't be the first time UB suddenly becomes an issue.

Going via the float route makes the add test cases in the reduction test I reenabled in this PR ~10% slower. I think it is fine to do it like that if you want? I guess fp16 shuffles are unlikely to be performance critical in most cases, although I can't say that with confidence.

Ah, I didn't realize you meant cast the pointer to unsigned short *. I still think this is an alias violation, so the compiler could easily go do some magic that gives us the wrong behavior.

If half has an unsigned int storage type behind the scenes, could we maybe use some inline AMD instruction helpers here to do the conversion? It might even be as simple as a no-op or ID operation, assuming the representation is the same in the resulting AMD representation.

> Ah, I didn't realize you meant cast the pointer to unsigned short *. I still think this is an alias violation, so the compiler could easily go do some magic that gives us the wrong behavior.

I see, thanks.

> If half has an unsigned int storage type behind the scenes, could we maybe use some inline AMD instruction helpers here to do the conversion? It might even be as simple as a no-op or ID operation, assuming the representation is the same in the resulting AMD representation.

Generally I think inline AMD asm should be avoided, since it is kind of like having inline SASS assembly if this were CUDA. For AMD I think clang builtins, which are lowered to the appropriate AMD asm, should always be used.
I've found a non-UB way of doing it anyway, via as_ushort. I had tried this before using as_short etc., which didn't work, but as_ushort etc. works perfectly.

> I've found a non-UB way of doing it anyway, via as_ushort. I had tried this before using as_short etc., which didn't work, but as_ushort etc. works perfectly.

Actually as_short works too; I must have just messed it up the first time.

Great, thank you! I like this solution. 😄

FYI: type punning through a union is not UB in C99, but this is moot now that this code uses as_ushort.

The original C99 standard missed some wording that should have come from C89, and they corrected it in the early 2000s:

https://open-std.org/jtc1/sc22/wg14/www/docs/dr_283.htm

OpenCL C seems to have missed this, and also declared type punning through a union as the ordained way to do this as per https://registry.khronos.org/OpenCL/specs/3.0-unified/html/OpenCL_C.html#reinterpreting-types-using-unions
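As a small illustration of the union technique in plain C (not OpenCL C), the sketch below reinterprets a float's bits through a union. The function name is hypothetical, and the code assumes float is a 32-bit IEEE 754 type, so both members have the same size and every byte read was written.

```c
#include <assert.h>
#include <stdint.h>

/* Reinterpret a float's object representation as a 32-bit unsigned integer
 * by writing one union member and reading the other. Both members are
 * assumed to be 4 bytes wide, so no unwritten bytes are read. */
static uint32_t float_bits_via_union(float f) {
  union {
    float f;
    uint32_t u;
  } pun;
  assert(sizeof(float) == sizeof(uint32_t)); /* assumption of this sketch */
  pun.f = f;
  return pun.u;
}
```

The half case discussed earlier in the thread differs in that the members have different sizes: writing the 16-bit member and then reading the 32-bit one also reads bytes that the write did not define.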


steffenlarsen

💯

fix half shuffles in amd libclc.

d4a205c

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>

JackAKirk requested review from a team as code owners March 13, 2024 19:29

JackAKirk requested a review from steffenlarsen March 13, 2024 19:29

JackAKirk temporarily deployed to WindowsCILock March 13, 2024 19:29 — with GitHub Actions Inactive

JackAKirk temporarily deployed to WindowsCILock March 13, 2024 19:52 — with GitHub Actions Inactive

jchlanda approved these changes Mar 14, 2024


Switch to unsigned.

e698a26

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>

JackAKirk temporarily deployed to WindowsCILock March 14, 2024 11:31 — with GitHub Actions Inactive

JackAKirk temporarily deployed to WindowsCILock March 14, 2024 11:58 — with GitHub Actions Inactive

steffenlarsen reviewed Mar 14, 2024


swap union usage with explicit copy.

78a704e

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>

JackAKirk temporarily deployed to WindowsCILock March 14, 2024 17:35 — with GitHub Actions Inactive

JackAKirk had a problem deploying to WindowsCILock March 14, 2024 19:36 — with GitHub Actions Failure

copy to unsigned short/extend to int.

453ae5a

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>

JackAKirk temporarily deployed to WindowsCILock March 15, 2024 12:11 — with GitHub Actions Inactive

JackAKirk temporarily deployed to WindowsCILock March 15, 2024 12:38 — with GitHub Actions Inactive

use as_ushort() etc to fix UB.

3a26020

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>

JackAKirk temporarily deployed to WindowsCILock March 15, 2024 13:57 — with GitHub Actions Inactive

steffenlarsen approved these changes Mar 15, 2024


JackAKirk temporarily deployed to WindowsCILock March 15, 2024 14:16 — with GitHub Actions Inactive

Merge branch 'sycl' into amd-fix-half-shuffle

cc89e97

JackAKirk temporarily deployed to WindowsCILock March 18, 2024 17:31 — with GitHub Actions Inactive

JackAKirk temporarily deployed to WindowsCILock March 18, 2024 18:36 — with GitHub Actions Inactive

ldrumm merged commit b13a3c4 into intel:sycl Mar 19, 2024
12 checks passed


jchlanda Mar 14, 2024

JackAKirk Mar 14, 2024

steffenlarsen Mar 14, 2024

JackAKirk Mar 14, 2024 (edited)

JackAKirk Mar 14, 2024

steffenlarsen Mar 14, 2024

JackAKirk Mar 14, 2024

steffenlarsen Mar 15, 2024

JackAKirk Mar 15, 2024 (edited)

JackAKirk Mar 15, 2024

steffenlarsen Mar 15, 2024

ldrumm Mar 19, 2024
