-
Notifications
You must be signed in to change notification settings - Fork 738
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[libclc][hip] Fix half shuffles and reenable reduction test #13016
Conversation
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
_CLC_DEF half _Z28__spirv_SubgroupShuffleINTELIDF16_ET_S0_j( | ||
half Data, unsigned int InvocationId) { | ||
union { | ||
int i; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a nit, but I'd go with unsigned int
, even thought we're type punning back to half
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ta done
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
half h; | ||
} tmp; | ||
tmp.h = Data; | ||
tmp.i = _Z28__spirv_SubgroupShuffleINTELIiET_S0_j(tmp.i, InvocationId); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Beware that this relies on UB. See https://en.cppreference.com/w/cpp/language/union:
It is undefined behavior to read from the member of the union that wasn't most recently written.
That said, I don't know of a better way to get it done... Unless we can use some sort of bitcast builtin here. Have you had a look at alternatives?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah it is UB according to openclc for one half of the 32 bits (which one depends on endianness).
If openclc inherited memcpy from c99 it would be possible to do it that way cleanly, but it doesn't. It has a section that discusses the issue briefly: https://registry.khronos.org/OpenCL/specs/3.0-unified/html/OpenCL_C.html#reinterpreting-data-as-another-type
One alternative that is not UB is to first do a conversion to float, then do as_int(float), and do the reverse after the shuffle. I tried this and compared the result with doing it the way I have here to check that it gives the same result. However doing it that way generates the extra machine instructions to do the float converts etc.
HIP fp16 headers is open source and I can see that they use the union trick. Hence I figured that if they do it we may as well do it. But at the same time I could just do it via the float route.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And yeah and I looked at amd clang builtins and I don't think they have a bitcast one (actually cuda doesn't either). For cuda I used inline ptx asm, but for amd this is more dangerous because they don't have an equivalent to ptx that is arch generation/family independent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, fair point. I missed the fact that this is OpenCL C. I think it is still better to not rely on UB, unless the performance overhead is high. The compiler moves fast and it wouldn't be the first time UB suddenly becomes an issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It makes the add test cases in the reduction test I reenabled in this PR ~10% slower if I go via the float route. I think it is fine to do it like that if you want? I guess that fp16 shuffles are not so likely to be performance critical for much stuff, although I can't say that with confidence.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I didn't realize you meant cast the pointer to unsigned short *
. I still think this is an alias violation, so the compiler could easily go do some magic that gives us the wrong behavior.
If half
has a unsigned int
storage type behind the scenes, could we maybe use some inline AMD instruction helpers here to do the conversion? It might even be as simple as a no-op or ID operation, assuming the representation is the same in the resulting AMD representation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I didn't realize you meant cast the pointer to
unsigned short *
. I still think this is an alias violation, so the compiler could easily go do some magic that gives us the wrong behavior.
I see thanks.
If
half
has aunsigned int
storage type behind the scenes, could we maybe use some inline AMD instruction helpers here to do the conversion? It might even be as simple as a no-op or ID operation, assuming the representation is the same in the resulting AMD representation.
Generally I think inline AMD should be avoided since it is kind of like having inline assembly for SASS if it were cuda. For amd I think clang builtins should always be used that are lowered to the appropriate AMD asm.
I've found a non UB way of doing it anyway via as_ushort
. I had tried this before, but with using as_short
etc, which didn't work, but the as_ushort
etc works perfectly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've found a non UB way of doing it anyway via
as_ushort
. I had tried this before, but with usingas_short
etc, which didn't work, but theas_ushort
etc works perfectly.
Actually as_short
works too, I must have just messed it up the first time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great, thank you! I like this solution. 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI Type punning through a union is not UB in C99 but this is moot now this code uses as_uchar
The original C standard missed some wording that should have come from c89, and they corrected it in the early 2000s:
https://open-std.org/jtc1/sc22/wg14/www/docs/dr_283.htm
OpenCL C seems to have missed this, and also declared type punning through a union as the ordained way to do this as per https://registry.khronos.org/OpenCL/specs/3.0-unified/html/OpenCL_C.html#reinterpreting-types-using-unions
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💯
Fix is to bitcast to the storage type of half (unsigned short) without doing a type conversion, before then extending to int for the shuffle.