Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[libclc][hip] Fix half shuffles and reenable reduction test #13016

Merged
merged 6 commits into from
Mar 19, 2024

Conversation

JackAKirk
Copy link
Contributor

@JackAKirk JackAKirk commented Mar 13, 2024

  • Fix broken half shuffles on amd.
  • Reenable Reduction test.

Fix is to bitcast to the storage type of half (unsigned short) without doing a type conversion, before then extending to int for the shuffle.

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
_CLC_DEF half _Z28__spirv_SubgroupShuffleINTELIDF16_ET_S0_j(
half Data, unsigned int InvocationId) {
union {
int i;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a nit, but I'd go with unsigned int, even thought we're type punning back to half.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ta done

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
half h;
} tmp;
tmp.h = Data;
tmp.i = _Z28__spirv_SubgroupShuffleINTELIiET_S0_j(tmp.i, InvocationId);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Beware that this relies on UB. See https://en.cppreference.com/w/cpp/language/union:

It is undefined behavior to read from the member of the union that wasn't most recently written.

That said, I don't know of a better way to get it done... Unless we can use some sort of bitcast builtin here. Have you had a look at alternatives?

Copy link
Contributor Author

@JackAKirk JackAKirk Mar 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it is UB according to openclc for one half of the 32 bits (which one depends on endianness).
If openclc inherited memcpy from c99 it would be possible to do it that way cleanly, but it doesn't. It has a section that discusses the issue briefly: https://registry.khronos.org/OpenCL/specs/3.0-unified/html/OpenCL_C.html#reinterpreting-data-as-another-type
One alternative that is not UB is to first do a conversion to float, then do as_int(float), and do the reverse after the shuffle. I tried this and compared the result with doing it the way I have here to check that it gives the same result. However doing it that way generates the extra machine instructions to do the float converts etc.
HIP fp16 headers is open source and I can see that they use the union trick. Hence I figured that if they do it we may as well do it. But at the same time I could just do it via the float route.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And yeah and I looked at amd clang builtins and I don't think they have a bitcast one (actually cuda doesn't either). For cuda I used inline ptx asm, but for amd this is more dangerous because they don't have an equivalent to ptx that is arch generation/family independent.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, fair point. I missed the fact that this is OpenCL C. I think it is still better to not rely on UB, unless the performance overhead is high. The compiler moves fast and it wouldn't be the first time UB suddenly becomes an issue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes the add test cases in the reduction test I reenabled in this PR ~10% slower if I go via the float route. I think it is fine to do it like that if you want? I guess that fp16 shuffles are not so likely to be performance critical for much stuff, although I can't say that with confidence.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I didn't realize you meant cast the pointer to unsigned short *. I still think this is an alias violation, so the compiler could easily go do some magic that gives us the wrong behavior.

If half has a unsigned int storage type behind the scenes, could we maybe use some inline AMD instruction helpers here to do the conversion? It might even be as simple as a no-op or ID operation, assuming the representation is the same in the resulting AMD representation.

Copy link
Contributor Author

@JackAKirk JackAKirk Mar 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I didn't realize you meant cast the pointer to unsigned short *. I still think this is an alias violation, so the compiler could easily go do some magic that gives us the wrong behavior.

I see thanks.

If half has a unsigned int storage type behind the scenes, could we maybe use some inline AMD instruction helpers here to do the conversion? It might even be as simple as a no-op or ID operation, assuming the representation is the same in the resulting AMD representation.

Generally I think inline AMD should be avoided since it is kind of like having inline assembly for SASS if it were cuda. For amd I think clang builtins should always be used that are lowered to the appropriate AMD asm.
I've found a non UB way of doing it anyway via as_ushort. I had tried this before, but with using as_short etc, which didn't work, but the as_ushort etc works perfectly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've found a non UB way of doing it anyway via as_ushort. I had tried this before, but with using as_short etc, which didn't work, but the as_ushort etc works perfectly.

Actually as_short works too, I must have just messed it up the first time.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, thank you! I like this solution. 😄

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI Type punning through a union is not UB in C99 but this is moot now this code uses as_uchar

The original C standard missed some wording that should have come from c89, and they corrected it in the early 2000s:

https://open-std.org/jtc1/sc22/wg14/www/docs/dr_283.htm

OpenCL C seems to have missed this, and also declared type punning through a union as the ordained way to do this as per https://registry.khronos.org/OpenCL/specs/3.0-unified/html/OpenCL_C.html#reinterpreting-types-using-unions

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
Copy link
Contributor

@steffenlarsen steffenlarsen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯

@ldrumm ldrumm merged commit b13a3c4 into intel:sycl Mar 19, 2024
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants