KernelAbstractions support #147
base: master
Conversation
Ah, I guess while I'm here, I'll briefly explain the differences with CUDA syntactically:
The tricky thing about this PR was removing the CUDA dependency outside of the kernels. There is still one call in zygote.jl that I need to figure out: https://github.com/leios/Molly.jl/blob/KA_support/src/zygote.jl#L698
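A rough illustration of the kind of syntactic difference being referred to (my own minimal elementwise kernel, not code from this PR): the hand-written thread/block indexing of a CUDA.jl kernel is replaced by the @kernel / @index macros from KernelAbstractions.jl.

using KernelAbstractions

# CUDA.jl version, shown as a comment for comparison:
#     function scale_kernel!(y, x, a)
#         i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
#         if i <= length(y)
#             @inbounds y[i] = a * x[i]
#         end
#         return nothing
#     end
#     @cuda threads=256 blocks=cld(length(y), 256) scale_kernel!(y, x, a)

# KernelAbstractions.jl version: bounds handling comes from the ndrange at
# launch time, and the same kernel compiles for any supported backend.
@kernel function scale_kernel!(y, @Const(x), a)
    i = @index(Global, Linear)
    @inbounds y[i] = a * x[i]
end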
Great work so far. Making the code compatible with generic array types is a nice change, and having the kernels work on different devices would be a selling point of the package. I would be interested to see the performance of the kernels compared to the CUDA versions. Also whether it plays nicely with Enzyme. Good luck with the runtime errors.
I think I can finish this up today or else early next week (emphasis on think), but to quickly answer the questions:
Great. Not urgent, but how well does KernelAbstractions.jl deal with warp-level code, e.g.
That's a good question. We can probably expose the APIs available from CUDA, but I am not sure how AMDGPU deals with these. We would also just need to figure out what that corresponds to on parallel CPU. I think these are the tools we need: https://rocm.docs.amd.com/projects/rocPRIM/en/latest/warp_ops/index.html
Ah, as an important note (that I somehow failed to mention before), an advantage of KA is that it also provides a parallel CPU implementation, so the kernels can be written once and executed everywhere. I didn't do that in this PR because that brings up design questions related to Molly internals.
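To make the "written once and executed everywhere" point concrete, here is a minimal sketch (my example, not one of Molly's kernels) launching the same KernelAbstractions.jl kernel on the multithreaded CPU backend and, with CUDA.jl loaded, on the GPU:

using KernelAbstractions

@kernel function axpy_kernel!(y, @Const(x), a)
    i = @index(Global, Linear)
    @inbounds y[i] = a * x[i] + y[i]
end

x, y = rand(Float32, 1024), rand(Float32, 1024)

# Multithreaded CPU backend, no GPU required.
axpy_kernel!(CPU(), 64)(y, x, 2.0f0, ndrange=length(y))
KernelAbstractions.synchronize(CPU())

# Same kernel on a CUDA device (requires `using CUDA`):
#     xd, yd = CuArray(x), CuArray(y)
#     axpy_kernel!(CUDABackend(), 64)(yd, xd, 2.0f0, ndrange=length(yd))
#     KernelAbstractions.synchronize(CUDABackend())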
Yeah we can discuss that after this PR. I would be okay with switching if there was no performance hit. Judging from discussion on the linked PR there is not currently warp support in KA. It may be necessary to leave that CUDA kernel in and have a separate KA kernel for other backends until warp support comes to KA.
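A minimal sketch of that split (function and kernel names are mine, not Molly's): dispatch on the array type so that CuArrays keep a CUDA.jl-specific method where warp-level code remains available, while every other backend goes through a generic KernelAbstractions.jl kernel.

using CUDA
using KernelAbstractions

@kernel function double_kernel!(x)
    i = @index(Global, Linear)
    @inbounds x[i] *= 2
end

# Generic path: Array, ROCArray, etc. via whichever backend the data lives on.
function double!(x::AbstractArray)
    backend = get_backend(x)
    double_kernel!(backend, 64)(x, ndrange=length(x))
    KernelAbstractions.synchronize(backend)
    return x
end

# CUDA-specific path, where the existing warp-optimised kernel can stay as is.
function double!(x::CuArray)
    # ... launch the hand-written CUDA.jl kernel here ...
    x .*= 2  # stand-in so the sketch runs
    return x
end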
Ok, so a couple of quick notes here:
That is okay.
I wouldn't worry about this. Currently I only merge stuff that I am able to maintain, or where I think I can skill up to the point of maintaining it. The changes here seem reasonable and worth merging once any errors and performance regressions are fixed. There is a wider question about whether KernelAbstractions.jl will continue to be maintained compared to CUDA.jl, but it seems to have decent traction now.
Yeah, the plan is for KA to be used even within GPUArrays, so it's not going anywhere anytime soon. Speaking of which, the "correct" course of action for KA in Molly would be to get KA into GPUArrays first and then use that to implement any missing features at the GPUArrays level. Would it be better for me to separate this PR then? Maybe one doing the generic Array stuff and then another with the KA support?
I would try and get this PR working as is. Only if that becomes difficult would it be worth splitting out and merging the generic array support. If KA is here for the long haul then there is a benefit to switching the kernels even if only CUDA works currently. Because then when changes happen elsewhere, AMDGPU will work without any changes required in Molly.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@            Coverage Diff             @@
##           master     #147      +/-   ##
==========================================
- Coverage   71.68%   69.40%    -2.29%
==========================================
  Files          37       38        +1
  Lines        5549     5726      +177
==========================================
- Hits         3978     3974        -4
- Misses       1571     1752      +181

☔ View full report in Codecov by Sentry.
Getting around to this and noticed a bunch of segfaults in the CPU tests. I then found that there's a strange conflict between
But only if
st:
Note that using a single thread "fixes" the issue. It seems to be a UCX / MPI issue, but I am not loading them and neither is in the Manifest.
This looks exactly like https://juliaparallel.org/MPI.jl/stable/knownissues/#Multi-threading-and-signal-handling
What is
The fix mentioned there seems to work:
Note
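For reference, my understanding of the workaround documented on that MPI.jl page (I have not checked it against the exact snippet used here) is to stop UCX from trapping SIGSEGV, which Julia's GC deliberately raises for thread synchronisation when running multithreaded:

# Set before MPI/UCX is initialised, e.g. in the shell as
#     export UCX_ERROR_SIGNALS="SIGILL,SIGBUS,SIGFPE"
# or from Julia before the relevant libraries are loaded:
ENV["UCX_ERROR_SIGNALS"] = "SIGILL,SIGBUS,SIGFPE"  # everything except SIGSEGV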
Wild... What is
Is it a Linux thing like libpthread?
# This triggers an error but it isn't printed
# See https://discourse.julialang.org/t/error-handling-in-cuda-kernels/79692
# for how to throw a more meaningful error
error("wrong force unit returned, was expecting $F but got $(unit(f[1]))")
The interpolation here is particularly tricky. I would avoid that if at all possible.
To be clear: are you referring to the error(...) call in an inlined function within an @kernel? Or carrying the units through to this stage in the first place?
error("Oops")
is fine, error("Oops, $F")
is sadly not since it string interpolation is really tough on the GPU compiler.
We can remove that, no problem.
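A minimal sketch of the resulting change, using a hypothetical helper name (not Molly's actual function) and the unit function from Unitful:

using Unitful: unit

function check_force_units_gpu(f, F)
    # A static message is fine inside GPU kernels; an interpolated one like
    # "was expecting $F but got ..." is not, so the specifics are dropped
    # from the GPU code path.
    if unit(f[1]) != F
        error("wrong force unit returned")
    end
    return nothing
end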
Right, having an issue right now with the Zygote tests. Not really sure how to go about debugging it, so I'll paste it here and then think about it.
As a note, it looks like #182 is close-ish to being done? If so, it might be best to rework this PR once those changes are in. As it stands, this PR can work as a branch for any non-CUDA GPU (as long as the user does not need differentiable sims).
Thanks for keeping going on this. Yes #182 will get merged soon, I was waiting on full Enzyme GPU broadcasting support to avoid GPU differentiable simulations breaking, but I might just merge it and wait for that to arrive later. Consequently I wouldn't worry about the Zygote error. I'm happy to help update this PR after #182, I realise that the ground has been moving under you.
Yeah, no worries. I should go ahead and close the #99 PR since this one supersedes it. On my end, I am happy to wait a little longer and rebase up when you are happy with #182. I have other projects to work on in the meantime. Could you link the Enzyme issue? Is it this one? EnzymeAD/Enzyme.jl#1672
Okay great. The example I posted on JuliaGPU/CUDA.jl#2471 doesn't work as it stands; that and similar issues around reduction and broadcasting are the problem.
@jgreener64 if there isn't an open issue on Enzyme.jl associated with it, it is very likely that Billy and I will lose track of it.
Billy has favoured moving CUDA-specific issues to CUDA.jl, e.g. EnzymeAD/Enzyme.jl#1454 (comment). In either case I can find some MWEs in the next few days of what is erroring and make sure they are tracked in issues.
#182 is now merged.
The particular issue is this one: JuliaGPU/CUDA.jl#2455, which is essentially that we haven't yet implemented the derivative of the CUDA.jl reduction of a broadcast (we just added the reduction of an array function in cuda.jl).
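For anyone following along, the failing pattern is, as far as I understand it, roughly of this shape (my own illustration, not the exact MWE from the linked issues):

using CUDA, Enzyme

loss(x) = sum(abs2.(x))  # a broadcast followed by a reduction

x  = CUDA.rand(Float32, 32)
dx = CUDA.zeros(Float32, 32)

# Reverse-mode differentiation through the broadcast + reduction on a
# CuArray is what currently hits the missing derivative rule.
Enzyme.autodiff(Reverse, loss, Active, Duplicated(x, dx))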
The KernelAbstractions branch now compiles, so I thought I would put forward a quick draft PR while I figure out all the runtime bugs.
Notes:
CUDA has been removed and replaced with KernelAbstractions and GPUArrays. As an important note here, GPUArrays is not strictly necessary except to replicate the behavior of the boolean GPU flag (i.e. isa(a, AbstractGPUArray)).
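As a minimal sketch of that flag (helper name is mine, not Molly's): the boolean GPU check only needs a backend-agnostic way to ask whether an array lives on a GPU, which is what AbstractGPUArray provides.

using GPUArrays: AbstractGPUArray

# true for CuArray, ROCArray, etc.; false for a plain Array
uses_gpu(a::AbstractArray) = a isa AbstractGPUArray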