
KernelAbstractions support #147

Draft · wants to merge 9 commits into master

Conversation

leios
Contributor

@leios leios commented Sep 7, 2023

The KernelAbstractions branch now compiles, so I thought I would put forward a quick draft PR while I figure out all the runtime bugs.

Notes:

  1. This builds off of adding preliminary AMDGPU support #99 and should replace it entirely.
  2. CUDA has been removed and replaced with KernelAbstractions and GPUArrays. As an important note here, GPUArrays is not strictly necessary except to replicate the behavior of the boolean GPU flag (i.e. isa(a, AbstractGPUArray)).
  3. If this is merged, other GPU types (Metal, AMD, Intel) will also be supported, but I can only test on AMD (and maybe Metal if I can get someone to try it on a Mac).
  4. I need to add in the changes from Non-atomic pairwise force summation kernels #133. If there is something we are missing on the KernelAbstractions side, I can try to add it, but I think we are good to go.
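As a sketch of point 2, the boolean GPU flag can be replicated by dispatching on the array type. This is a minimal illustration, assuming GPUArrays.jl is a dependency; the helper name on_gpu is hypothetical, not Molly API:

```julia
using GPUArrays: AbstractGPUArray

# Hypothetical replacement for a boolean `gpu` flag: dispatch on the
# array type instead of carrying a flag around.
on_gpu(::AbstractGPUArray) = true
on_gpu(::AbstractArray) = false

# e.g. on_gpu(rand(4)) -> false; a CuArray, ROCArray, etc. would return true.
```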

@leios
Contributor Author

leios commented Sep 7, 2023

Ah, I guess while I'm here, I'll briefly explain the differences with CUDA syntactically:

  1. Indexing is easier: @index(Global / Group / Local, Linear / Cartesian / NTuple) vs (blockIdx().x - 1) * blockDim().x + threadIdx().x for CUDA
  2. Kernels run over an ndrange specifying the range of elements (OpenCL-inspired syntax)
  3. Launching kernels requires configuration with a backend, see: https://github.com/leios/Molly.jl/blob/KA_support/src/kernels.jl#L21
  4. Certain functions now execute on the backend: CUDA.zeros(...) -> zeros(backend, args...)
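To make the list above concrete, here is a minimal sketch of a KA kernel and its launch. The elementwise-add kernel is hypothetical, not code from this PR:

```julia
using KernelAbstractions

@kernel function add_kernel!(c, a, b)
    # Replaces (blockIdx().x - 1) * blockDim().x + threadIdx().x in CUDA
    i = @index(Global, Linear)
    @inbounds c[i] = a[i] + b[i]
end

function add!(c, a, b)
    backend = get_backend(c)            # CPU(), CUDABackend(), ROCBackend(), ...
    kernel! = add_kernel!(backend, 256) # configure with a backend and workgroup size
    kernel!(c, a, b; ndrange = length(c))
    KernelAbstractions.synchronize(backend)
    return c
end
```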

The tricky thing about this PR was removing the CUDA dependency outside of the kernels. There is still one call in zygote.jl I need to figure out: https://github.com/leios/Molly.jl/blob/KA_support/src/zygote.jl#L698

@jgreener64
Collaborator

Great work so far. Making the code compatible with generic array types is a nice change, and having the kernels work on different devices would be a selling point of the package.

I would be interested to see the performance of the kernels compared to the CUDA versions. Also whether it plays nicely with Enzyme. Good luck with the runtime errors.

@leios
Contributor Author

leios commented Sep 8, 2023

I think I can finish this up today or else early next week (emphasis on think), but to quickly answer the questions:

  1. KA essentially just generates vendor-specific code (i.e. CUDA) from the generic code input, so if we don't have identical performance to CUDA, then that's a bug. I'll do performance testing similar to Non-atomic pairwise force summation kernels #133 once the code is cleaned up.
  2. Enzyme should also not be an issue; however, there are some reports of error handling being an issue: Enzyme + KA Stalls on Error instead of reporting it EnzymeAD/Enzyme.jl#365

@jgreener64
Collaborator

Great. Not urgent, but how well does KernelAbstractions.jl deal with warp-level code, e.g. warpsize() and sync_warp()?

@leios
Contributor Author

leios commented Sep 8, 2023

That's a good question. We can probably expose the APIs available from CUDA, but I am not sure how AMDGPU deals with these. We would also need to figure out what they correspond to on the parallel CPU backend.

I think these are the tools we need: https://rocm.docs.amd.com/projects/rocPRIM/en/latest/warp_ops/index.html
So they are available; it's just a matter of exposing them in KA and figuring out what they correspond to on different backends.

Ah, as an important note (that I somehow failed to mention before), an advantage of KA is that it also provides a parallel CPU implementation, so the kernels can be written once and executed everywhere. I didn't do that in this PR because it brings up design questions related to Molly internals.
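As a small illustration of the write-once point above, the same kernel definition can be instantiated for the threaded CPU backend. The saxpy kernel here is hypothetical, not Molly code:

```julia
using KernelAbstractions

@kernel function saxpy_kernel!(y, a, x)
    i = @index(Global, Linear)
    @inbounds y[i] = a * x[i] + y[i]
end

x = ones(Float32, 1024)
y = zeros(Float32, 1024)
kernel! = saxpy_kernel!(CPU(), 64)  # CPU() here; CUDABackend() etc. elsewhere
kernel!(y, 2.0f0, x; ndrange = length(x))
KernelAbstractions.synchronize(CPU())
# y now contains 2.0f0 in every element
```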

@jgreener64
Collaborator

I didn't do that in this PR because that brings up design questions related to Molly internals.

Yeah we can discuss that after this PR. I would be okay with switching if there was no performance hit.

Judging from discussion on the linked PR there is not currently warp support in KA. It may be necessary to leave that CUDA kernel in and have a separate KA kernel for other backends until warp support comes to KA.

@leios
Contributor Author

leios commented Sep 9, 2023

Ok, so a couple of quick notes here:

  1. There are a few host calls that are not yet supported by AMDGPU (such as findall). My understanding was that such calls would eventually be ported to GPUArrays, but I don't think that has happened yet. Note that some of the stalling here is because we are waiting to get KA into GPUArrays (Transition GPUArrays to KernelAbstractions JuliaGPU/GPUArrays.jl#451). At least for findall, the kernel is not that complex: https://github.com/JuliaGPU/CUDA.jl/blob/master/src/indexing.jl#L23, so we could put it into AMDGPU or something for now; however, we are stuck on an older version of AMDGPU due to some package conflicts. The quick fix would be to do it the old-fashioned way and just stick the necessary kernels in Molly under a file like kernel_hacks.jl or something. Such issues were also what stalled adding preliminary AMDGPU support #99.
  2. Non-atomic pairwise force summation kernels #133 seems to only use warpsize and warp_sync for warp-level semantics. The KA kernel would probably get the warpsize on the host and then pass it in as a parameter. warp_sync is a bit more interesting because, well, at least in the old days warps didn't need any synchronizing. It seems that things changed with Volta and most people missed the memo. Because of this, the easiest thing to do would be to keep the CUDA dependency for that one kernel. We could also add warp-level semantics to KA, but that would take some time to propagate to all the independent GPU APIs and (as mentioned in 1) we are somewhat stuck on older versions of AMDGPU and CUDA because of compatibility with other packages.
  3. I am realizing that there is a greater conflict with this PR. Namely, I don't know if I have the bandwidth to do any sort of maintenance on Molly after this PR is in. I don't know if it's fair to ask you to merge 1000 lines of code with a new API and then leave. On the other hand, getting this to work on AMD would be great and really useful. Let me think on that.
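A rough sketch of the workaround described in point 2, passing a host-queried warp size into the kernel as an ordinary argument. The kernel and its tiling logic are hypothetical; the host-side query in the comment assumes CUDA.jl's device attribute API:

```julia
using KernelAbstractions

# Hypothetical: query the warp size on the host and pass it in as a
# kernel argument, since KA has no in-kernel warpsize().
@kernel function tiled_kernel!(out, data, warp_size::Int32)
    i = @index(Global, Linear)
    lane = (i - 1) % warp_size + 1  # lane index within a warp-sized tile
    @inbounds out[i] = data[i] + lane
end

# On the host (CUDA backend):
#   ws = Int32(CUDA.warpsize(CUDA.device()))  # 32 on NVIDIA hardware
#   tiled_kernel!(CUDABackend(), 256)(out, data, ws; ndrange = length(out))
```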

@jgreener64
Collaborator

Because of this, the easiest thing to do would be to keep the CUDA dependency for that one kernel.

That is okay.

I don't know if it's fair to ask you to merge 1000 lines of code with a new API and then leave.

I wouldn't worry about this. Currently I only merge stuff that I am able to maintain, or where I think I can skill up to the point of maintaining it. The changes here seem reasonable and worth merging once any errors and performance regressions are fixed. There is a wider question about whether KernelAbstractions.jl will continue to be maintained compared to CUDA.jl, but it seems to have decent traction now.

@leios
Contributor Author

leios commented Sep 10, 2023

Yeah, the plan is for KA to be used even within GPUArrays, so it's not going anywhere anytime soon. Speaking of which, the "correct" course of action for KA in Molly would be to get KA into GPUArrays first and then use that to implement any missing features at the GPUArrays level.

Would it be better for me to split this PR, then? Maybe one PR doing the generic array stuff and another with the KA support?

@jgreener64
Collaborator

I would try and get this PR working as is. Only if that becomes difficult would it be worth splitting out and merging the generic array support.

If KA is here for the long haul then there is a benefit to switching the kernels even if only CUDA works currently. Because then when changes happen elsewhere, AMDGPU will work without any changes required in Molly.


codecov bot commented Jun 28, 2024

Codecov Report

Attention: Patch coverage is 12.82660% with 367 lines in your changes missing coverage. Please review.

Project coverage is 69.40%. Comparing base (18a1991) to head (06de12a).

Files with missing lines Patch % Lines
src/kernels.jl 1.07% 184 Missing ⚠️
src/chain_rules.jl 0.99% 100 Missing ⚠️
src/interactions/implicit_solvent.jl 5.26% 36 Missing ⚠️
ext/MollyCUDAExt.jl 0.00% 12 Missing ⚠️
src/neighbors.jl 28.57% 10 Missing ⚠️
src/spatial.jl 33.33% 8 Missing ⚠️
src/zygote.jl 0.00% 8 Missing ⚠️
ext/MollyPythonCallExt.jl 0.00% 2 Missing ⚠️
src/energy.jl 33.33% 2 Missing ⚠️
src/force.jl 33.33% 2 Missing ⚠️
... and 2 more
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #147      +/-   ##
==========================================
- Coverage   71.68%   69.40%   -2.29%     
==========================================
  Files          37       38       +1     
  Lines        5549     5726     +177     
==========================================
- Hits         3978     3974       -4     
- Misses       1571     1752     +181     


src/chain_rules.jl (outdated review thread, resolved)
@leios
Contributor Author

leios commented Sep 30, 2024

Getting around to this and noticed a bunch of segfaults in the CPU tests. I then found that there's a strange conflict between AMDGPU and Molly. Even on the master branch, this script will segfault:

using Molly

n_atoms = 100
atom_mass = 10.0f0u"g/mol"
boundary = CubicBoundary(2.0f0u"nm")
temp = 100.0f0u"K"
cpu_coords = place_atoms(n_atoms, boundary; min_dist=0.3u"nm")
cpu_atoms = Array([Atom(mass=atom_mass, σ=0.3f0u"nm", ϵ=0.2f0u"kJ * mol^-1") for i in 1:n_atoms])
cpu_velocities = Array([random_velocity(atom_mass, temp) for i in 1:n_atoms])
cpu_simulator = VelocityVerlet(dt=0.002f0u"ps")

cpu_sys = System(
    atoms=cpu_atoms,
    coords=cpu_coords,
    boundary=boundary,
    velocities=cpu_velocities,
    pairwise_inters=(LennardJones(),),
    loggers=(
        temp=TemperatureLogger(typeof(1.0f0u"K"), 10),
        coords=CoordinateLogger(typeof(1.0f0u"nm"), 10),
    ),
)

simulate!(deepcopy(cpu_sys), cpu_simulator, 20) # Compile function
simulate!(cpu_sys, cpu_simulator, 2000)

But only if AMDGPU is loaded before include("cpu.jl"). Not sure how to go about debugging this one, but writing it down so it is documented somewhere. The segfault:


julia> include("tmp/cpu.jl")
System with 100 atoms, boundary CubicBoundary{Quantity{Float32, 𝐋, Unitful.FreeUnits{(nm,), 𝐋, nothing}}}(Quantity{Float32, 𝐋, Unitful.FreeUnits{(nm,), 𝐋, nothing}}[2.0f0 nm, 2.0f0 nm, 2.0f0 nm])

julia> 
[leios@noema Molly.jl]$ julia --project -t 12
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.10.2 (2024-03-01)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> using AMDGPU

julia> include("tmp/cpu.jl")
[1727708527.809644] [noema:37885:0]        spinlock.c:29   UCX  WARN  ucs_recursive_spinlock_destroy() failed: busy
[1727708527.809644] [noema:37885:1]        spinlock.c:29   UCX  WARN  ucs_recursive_spinlock_destroy() failed: busy
[noema:37885:0:37894] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x756d3781c008)
[noema:37885:1:37897] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x756d3781c008)
[1727708527.809644] [noema:37885:3]        spinlock.c:29   UCX  WARN  ucs_recursive_spinlock_destroy() failed: busy
[noema:37885:3:37893] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x756d3781c008)
[1727708527.809649] [noema:37885:2]           debug.c:1297 UCX  WARN  ucs_debug_disable_signal: signal 1 was not set in ucs
[noema:37885:2:37892] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x756d3781c008)
[1727708527.809644] [noema:37885:4]        spinlock.c:29   UCX  WARN  ucs_recursive_spinlock_destroy() failed: busy
[noema:37885:4:37889] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x756d3781c008)
[noema:37885:6:37890] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x756d3781c008)
[1727708527.809728] [noema:37885:0]        spinlock.c:29   UCX  WARN  ucs_recursive_spinlock_destroy() failed: busy
[noema:37885:7:37888] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x756d3781c008)
[noema:37885:8:37891] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x756d3781c008)
[noema:37885:9:37895] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x756d3781c008)
[1727708527.809730] [noema:37885:1]        spinlock.c:29   UCX  WARN  ucs_recursive_spinlock_destroy() failed: busy
[1727708527.809735] [noema:37885:5]        spinlock.c:29   UCX  WARN  ucs_recursive_spinlock_destroy() failed: busy
[noema:37885:5:37898] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x756d3781c008)
[1727708527.809741] [noema:37885:3]        spinlock.c:29   UCX  WARN  ucs_recursive_spinlock_destroy() failed: busy
[noema:37885:10:37896] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x756d3781c008)
==== backtrace (tid:  37894) ====
 0 0x000000000004d212 ucs_event_set_fd_get()  ???:0
 1 0x000000000004d3dd ucs_event_set_fd_get()  ???:0
 2 0x000000000003d1d0 __sigaction()  ???:0
 3 0x00000000000845d4 ijl_process_events()  /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/jl_uv.c:277
 4 0x0000000000097f8d ijl_task_get_next()  /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/partr.c:524
 5 0x0000000001cb0bd8 julia_poptask_75383()  ./task.jl:985
 6 0x0000000001cb0bd8 julia_poptask_75383()  ./task.jl:987
 7 0x0000000000997f72 julia_wait_74665()  ./task.jl:994
 8 0x0000000000962c1c julia_task_done_hook_75296()  ./task.jl:675
 9 0x0000000001443a97 jfptr_task_done_hook_75297.1()  :0
10 0x0000000000046a0e _jl_invoke()  /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:2894
11 0x0000000000069c17 jl_apply()  /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/julia.h:1982
12 0x0000000000069d9e start_task()  /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/task.c:1249
=================================

[37885] signal (11.-6): Segmentation fault
in expression starting at /home/leios/projects/CESMIX/Molly.jl/tmp/cpu.jl:25
ijl_process_events at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/jl_uv.c:277
ijl_task_get_next at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/partr.c:524
poptask at ./task.jl:985
wait at ./task.jl:994
task_done_hook at ./task.jl:675
jfptr_task_done_hook_75297.1 at /home/leios/builds/julia-1.10.2/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:3076
jl_apply at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/julia.h:1982 [inlined]
jl_finish_task at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/task.c:320
start_task at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/task.c:1249
Allocations: 39536048 (Pool: 39464158; Big: 71890); GC: 46
Segmentation fault (core dumped)

st:

(Molly) pkg> st
Project Molly v0.21.1
Status `~/projects/CESMIX/Molly.jl/Project.toml`
  [a9b6321e] Atomix v0.1.0
⌅ [a963bdd2] AtomsBase v0.3.5
  [a3e0e189] AtomsCalculators v0.2.2
  [de9282ab] BioStructures v4.2.0
⌃ [052768ef] CUDA v5.4.3
  [69e1c6dd] CellListMap v0.9.6
  [082447d4] ChainRules v1.71.0
  [d360d2e6] ChainRulesCore v1.25.0
  [46823bd8] Chemfiles v0.10.41
  [861a8166] Combinatorics v1.0.2
  [864edb3b] DataStructures v0.18.20
  [b4f34e82] Distances v0.10.11
  [31c24e10] Distributions v0.25.112
⌅ [7da242da] Enzyme v0.12.36
  [8f5d6c58] EzXML v1.2.0
  [cc61a311] FLoops v0.2.2
  [f6369f11] ForwardDiff v0.10.36
  [86223c79] Graphs v1.12.0
  [5ab0869b] KernelDensity v0.6.9
  [b8a86587] NearestNeighbors v0.4.20
  [7b2266bf] PeriodicTable v1.2.1
  [189a3867] Reexport v1.2.2
⌅ [64031d72] SimpleCrystals v0.2.0
  [90137ffa] StaticArrays v1.9.7
  [1986cc42] Unitful v1.21.0
  [a7773ee8] UnitfulAtomic v1.0.0
  [f31437dd] UnitfulChainRules v0.1.2
  [d80eeb9a] UnsafeAtomicsLLVM v0.2.1
  [e88e6eb3] Zygote v0.6.71
  [37e2e46d] LinearAlgebra
  [9a3f8284] Random
  [2f01184e] SparseArrays v1.10.0
  [10745b16] Statistics v1.10.0
Info Packages marked with ⌃ and ⌅ have new versions available. Those with ⌃ may be upgradable, but those with ⌅ are restricted by compatibility constraints from upgrading. To see why use `status --outdated`

Note that using a single thread "fixes" the issue. It seems to be a UCX / MPI issue, but I am not loading them and neither is in the Manifest.

@vchuravy

vchuravy commented Sep 30, 2024

@leios
Contributor Author

leios commented Sep 30, 2024

The fix mentioned there seems to work:

[leios@noema Molly.jl]$ export UCX_ERROR_SIGNALS="SIGILL,SIGBUS,SIGFPE"
[leios@noema Molly.jl]$ julia --project -t 12
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.10.2 (2024-03-01)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> using AMDGPU

julia> include("tmp/cpu.jl")
System with 100 atoms, boundary CubicBoundary{Quantity{Float32, 𝐋, Unitful.FreeUnits{(nm,), 𝐋, nothing}}}(Quantity{Float32, 𝐋, Unitful.FreeUnits{(nm,), 𝐋, nothing}}[2.0f0 nm, 2.0f0 nm, 2.0f0 nm])

julia> 

Note that st -m shows no MPI or UCX:

(Molly) pkg> st -m
Project Molly v0.21.1
Status `~/projects/CESMIX/Molly.jl/Manifest.toml`
  [621f4979] AbstractFFTs v1.5.0
  [7d9f7c33] Accessors v0.1.38
  [79e6a3ab] Adapt v4.0.4
  [66dad0bd] AliasTables v1.1.3
  [dce04be8] ArgCheck v2.3.0
  [ec485272] ArnoldiMethod v0.4.0
  [a9b6321e] Atomix v0.1.0
⌅ [a963bdd2] AtomsBase v0.3.5
  [a3e0e189] AtomsCalculators v0.2.2
  [13072b0f] AxisAlgorithms v1.1.0
  [ab4f0b2a] BFloat16s v0.5.0
  [198e06fe] BangBang v0.4.3
  [9718e550] Baselet v0.1.1
  [47718e42] BioGenerics v0.1.5
  [de9282ab] BioStructures v4.2.0
  [3c28c6f8] BioSymbols v5.1.3
  [fa961155] CEnum v0.5.0
⌃ [052768ef] CUDA v5.4.3
  [1af6417a] CUDA_Runtime_Discovery v0.3.5
  [69e1c6dd] CellListMap v0.9.6
  [082447d4] ChainRules v1.71.0
  [d360d2e6] ChainRulesCore v1.25.0
  [46823bd8] Chemfiles v0.10.41
  [944b1d66] CodecZlib v0.7.6
  [3da002f7] ColorTypes v0.11.5
  [5ae59095] Colors v0.12.11
  [861a8166] Combinatorics v1.0.2
  [bbf7d656] CommonSubexpressions v0.3.1
  [34da2185] Compat v4.16.0
  [a33af91c] CompositionsBase v0.1.2
  [187b0558] ConstructionBase v1.5.8
  [6add18c4] ContextVariablesX v0.1.3
  [a8cc5b0e] Crayons v4.1.1
  [9a962f9c] DataAPI v1.16.0
  [a93c6f00] DataFrames v1.7.0
  [864edb3b] DataStructures v0.18.20
  [e2d170a0] DataValueInterfaces v1.0.0
  [244e2a9f] DefineSingletons v0.1.2
  [163ba53b] DiffResults v1.1.0
  [b552c78f] DiffRules v1.15.1
  [b4f34e82] Distances v0.10.11
  [31c24e10] Distributions v0.25.112
  [ffbed154] DocStringExtensions v0.9.3
⌅ [7da242da] Enzyme v0.12.36
⌅ [f151be2c] EnzymeCore v0.7.8
  [e2ba6199] ExprTools v0.1.10
  [8f5d6c58] EzXML v1.2.0
  [7a1cc6ca] FFTW v1.8.0
  [cc61a311] FLoops v0.2.2
  [b9860ae5] FLoopsBase v0.1.1
  [1a297f60] FillArrays v1.13.0
  [53c48c17] FixedPointNumbers v0.8.5
  [1fa38f19] Format v1.3.7
  [f6369f11] ForwardDiff v0.10.36
  [0c68f7d7] GPUArrays v10.3.1
  [46192b85] GPUArraysCore v0.1.6
⌅ [61eb1bfa] GPUCompiler v0.26.7
  [86223c79] Graphs v1.12.0
  [34004b35] HypergeometricFunctions v0.3.24
  [7869d1d1] IRTools v0.4.14
  [d25df0c9] Inflate v0.1.5
  [22cec73e] InitialValues v0.3.1
  [842dd82b] InlineStrings v1.4.2
  [a98d9a8b] Interpolations v0.15.1
  [3587e190] InverseFunctions v0.1.17
  [41ab1584] InvertedIndices v1.3.0
  [92d709cd] IrrationalConstants v0.2.2
  [82899510] IteratorInterfaceExtensions v1.0.0
  [692b3bcd] JLLWrappers v1.6.0
  [b14d175d] JuliaVariables v0.2.4
⌃ [63c18a36] KernelAbstractions v0.9.26
  [5ab0869b] KernelDensity v0.6.9
⌅ [929cbde3] LLVM v8.1.0
  [8b046642] LLVMLoopInfo v1.0.0
  [b964fa9f] LaTeXStrings v1.3.1
  [2ab3a3ac] LogExpFunctions v0.3.28
  [d8e11817] MLStyle v0.4.17
  [1914dd2f] MacroTools v0.5.13
  [128add7d] MicroCollections v0.2.0
  [e1d29d7a] Missings v1.2.0
  [5da4648a] NVTX v0.3.4
  [77ba4419] NaNMath v1.0.2
  [71a1bf82] NameResolution v0.1.5
  [b8a86587] NearestNeighbors v0.4.20
  [d8793406] ObjectFile v0.4.2
  [6fe1bfb0] OffsetArrays v1.14.1
  [bac558e1] OrderedCollections v1.6.3
  [90014a1f] PDMats v0.11.31
  [d96e819e] Parameters v0.12.3
  [7b2266bf] PeriodicTable v1.2.1
  [2dfb63ee] PooledArrays v1.4.3
  [aea7be01] PrecompileTools v1.2.1
  [21216c6a] Preferences v1.4.3
  [8162dcfd] PrettyPrint v0.2.0
  [08abe8d2] PrettyTables v2.4.0
  [92933f4c] ProgressMeter v1.10.2
  [43287f4e] PtrArrays v1.2.1
  [1fd47b50] QuadGK v2.11.1
  [74087812] Random123 v1.7.0
  [e6cf234a] RandomNumbers v1.6.0
  [c84ed2f1] Ratios v0.4.5
  [c1ae055f] RealDot v0.1.0
  [3cdcf5f2] RecipesBase v1.3.4
  [189a3867] Reexport v1.2.2
  [ae029012] Requires v1.3.0
  [79098fc4] Rmath v0.8.0
  [6c6a2e73] Scratch v1.2.1
  [91c51154] SentinelArrays v1.4.5
  [efcf1570] Setfield v1.1.1
⌅ [64031d72] SimpleCrystals v0.2.0
  [699a6c99] SimpleTraits v0.9.4
  [a2af1166] SortingAlgorithms v1.2.1
  [dc90abb0] SparseInverseSubset v0.1.2
  [276daf66] SpecialFunctions v2.4.0
  [171d559e] SplittablesBase v0.1.15
  [90137ffa] StaticArrays v1.9.7
  [1e83bf80] StaticArraysCore v1.4.3
  [82ae8749] StatsAPI v1.7.0
  [2913bbd2] StatsBase v0.34.3
  [4c63d2b9] StatsFuns v1.3.2
  [892a3eda] StringManipulation v0.4.0
  [09ab397b] StructArrays v0.6.18
  [53d494c1] StructIO v0.3.1
  [3783bdb8] TableTraits v1.0.1
  [bd369af6] Tables v1.12.0
  [1c621080] TestItems v1.0.0
  [a759f4b9] TimerOutputs v0.5.24
  [3bb67fe8] TranscodingStreams v0.11.2
  [28d57a85] Transducers v0.4.82
  [3a884ed6] UnPack v1.0.2
  [1986cc42] Unitful v1.21.0
  [a7773ee8] UnitfulAtomic v1.0.0
  [f31437dd] UnitfulChainRules v0.1.2
  [013be700] UnsafeAtomics v0.2.1
  [d80eeb9a] UnsafeAtomicsLLVM v0.2.1
  [efce3f68] WoodburyMatrices v1.0.0
  [e88e6eb3] Zygote v0.6.71
  [700de1a5] ZygoteRules v0.2.5
⌅ [4ee394cb] CUDA_Driver_jll v0.9.2+0
⌅ [76a88914] CUDA_Runtime_jll v0.14.1+0
  [78a364fa] Chemfiles_jll v0.10.4+0
⌅ [7cc45869] Enzyme_jll v0.0.148+0
  [f5851436] FFTW_jll v3.3.10+1
  [1d5cc7b8] IntelOpenMP_jll v2024.2.1+0
  [9c1d0b0a] JuliaNVTXCallbacks_jll v0.2.1+0
⌅ [dad2f222] LLVMExtra_jll v0.0.31+0
  [94ce4f54] Libiconv_jll v1.17.0+0
  [856f044c] MKL_jll v2024.2.0+0
  [e98f9f5b] NVTX_jll v3.1.0+2
  [efe28fd5] OpenSpecFun_jll v0.5.5+0
  [f50d1b31] Rmath_jll v0.5.1+0
  [02c8fc9c] XML2_jll v2.13.3+0
  [1317d2d5] oneTBB_jll v2021.12.0+0
  [0dad84c5] ArgTools v1.1.1
  [56f22d72] Artifacts
  [2a0f44e3] Base64
  [ade2ca70] Dates
  [8ba89e20] Distributed
  [f43a241f] Downloads v1.6.0
  [7b1f6079] FileWatching
  [9fa8497b] Future
  [b77e0a4c] InteractiveUtils
  [4af54fe1] LazyArtifacts
  [b27032c2] LibCURL v0.6.4
  [76f85450] LibGit2
  [8f399da3] Libdl
  [37e2e46d] LinearAlgebra
  [56ddb016] Logging
  [d6f4376e] Markdown
  [a63ad114] Mmap
  [ca575930] NetworkOptions v1.2.0
  [44cfe95a] Pkg v1.10.0
  [de0858da] Printf
  [3fa0cd96] REPL
  [9a3f8284] Random
  [ea8e919c] SHA v0.7.0
  [9e88b42a] Serialization
  [1a1011a3] SharedArrays
  [6462fe0b] Sockets
  [2f01184e] SparseArrays v1.10.0
  [10745b16] Statistics v1.10.0
  [4607b0f0] SuiteSparse
  [fa267f1f] TOML v1.0.3
  [a4e569a6] Tar v1.10.0
  [8dfed614] Test
  [cf7118a7] UUIDs
  [4ec0a83e] Unicode
  [e66e0078] CompilerSupportLibraries_jll v1.1.0+0
  [deac9b47] LibCURL_jll v8.4.0+0
  [e37daf67] LibGit2_jll v1.6.4+0
  [29816b5a] LibSSH2_jll v1.11.0+1
  [c8ffd9c3] MbedTLS_jll v2.28.2+1
  [14a3606d] MozillaCACerts_jll v2023.1.10
  [4536629a] OpenBLAS_jll v0.3.23+4
  [05823500] OpenLibm_jll v0.8.1+2
  [bea87d4a] SuiteSparse_jll v7.2.1+1
  [83775a58] Zlib_jll v1.2.13+1
  [8e850b90] libblastrampoline_jll v5.8.0+1
  [8e850ede] nghttp2_jll v1.52.0+1
  [3f19e933] p7zip_jll v17.4.0+2
Info Packages marked with ⌃ and ⌅ have new versions available. Those with ⌃ may be upgradable, but those with ⌅ are restricted by compatibility constraints from upgrading. To see why use `status --outdated -m`

@vchuravy

vchuravy commented Oct 1, 2024

Wild... What does Libc.dllist() show? Who loads this darn library?

@leios
Contributor Author

leios commented Oct 1, 2024

julia> Libc.Libdl.dllist()
32-element Vector{String}:
 "linux-vdso.so.1"
 "/usr/lib/libdl.so.2"
 "/usr/lib/libpthread.so.0"
 "/usr/lib/libc.so.6"
 "/home/leios/builds/julia-1.10.2/bin/../lib/libjulia.so.1.10"
 "/lib64/ld-linux-x86-64.so.2"
 "/home/leios/builds/julia-1.10.2/bin/../lib/julia/libgcc_s.so.1"
 "/home/leios/builds/julia-1.10.2/bin/../lib/julia/libopenlibm.so"
 "/usr/lib/libstdc++.so.6"
 "/usr/lib/libm.so.6"
 "/home/leios/builds/julia-1.10.2/bin/../lib/julia/libjulia-internal.so.1.10"
 "/home/leios/builds/julia-1.10.2/bin/../lib/julia/libunwind.so.8"
 "/usr/lib/librt.so.1"
 "/home/leios/builds/julia-1.10.2/bin/../lib/julia/libz.so.1"
 "/home/leios/builds/julia-1.10.2/bin/../lib/julia/libatomic.so.1"
 "/home/leios/builds/julia-1.10.2/bin/../lib/julia/libjulia-codegen.so.1.10"
 "/home/leios/builds/julia-1.10.2/bin/../lib/julia/libLLVM-15jl.so"
 "/home/leios/builds/julia-1.10.2/lib/julia/sys.so"
 "/home/leios/builds/julia-1.10.2/bin/../lib/julia/libpcre2-8.so"
 "/home/leios/builds/julia-1.10.2/bin/../lib/julia/libgmp.so.10"
 "/home/leios/builds/julia-1.10.2/bin/../lib/julia/libmpfr.so.6"
 "/home/leios/builds/julia-1.10.2/bin/../lib/julia/libgfortran.so.5"
 "/home/leios/builds/julia-1.10.2/bin/../lib/julia/libquadmath.so.0"
 "/home/leios/builds/julia-1.10.2/bin/../lib/julia/libopenblas64_.so"
 "/home/leios/builds/julia-1.10.2/bin/../lib/julia/libblastrampoline.so.5"
 "/home/leios/builds/julia-1.10.2/bin/../lib/julia/libmbedcrypto.so.7"
 "/home/leios/builds/julia-1.10.2/bin/../lib/julia/libmbedtls.so.14"
 "/home/leios/builds/julia-1.10.2/bin/../lib/julia/libmbedx509.so.1"
 "/home/leios/builds/julia-1.10.2/bin/../lib/julia/libssh2.so.1"
 "/home/leios/builds/julia-1.10.2/bin/../lib/julia/libgit2.so.1.6"
 "/home/leios/builds/julia-1.10.2/bin/../lib/julia/libnghttp2.so.14"
 "/home/leios/builds/julia-1.10.2/bin/../lib/julia/libcurl.so.4"

Is it a linux thing like libpthread?

# This triggers an error but it isn't printed
# See https://discourse.julialang.org/t/error-handling-in-cuda-kernels/79692
# for how to throw a more meaningful error
error("wrong force unit returned, was expecting $F but got $(unit(f[1]))")


The interpolation here is particularly tricky. I would avoid that if at all possible.

Contributor Author

@leios leios Oct 21, 2024


To be clear: are you referring to the error(...) call in an inlined function within an @kernel? Or carrying the units through to this stage in the first place?


error("Oops") is fine, error("Oops, $F") sadly is not, since string interpolation is really tough on the GPU compiler.
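A sketch of the two patterns being contrasted; the fix is simply to drop the interpolation from the in-kernel message:

```julia
# Inside a GPU kernel:
# error("wrong force unit returned")                     # fine: static message
# error("wrong force unit returned, was expecting $F")   # problematic: the
#   interpolation builds a string dynamically, which the GPU compiler
#   struggles to compile
```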

Collaborator


We can remove that, no problem.

@leios
Contributor Author

leios commented Oct 24, 2024

Right, having an issue right now with the Zygote tests. Not really sure how to go about debugging it, so I'll paste it here and then think about it:

Differentiable simulation: Error During Test at /home/leios/projects/CESMIX/Molly.jl/test/zygote.jl:30
  Got exception outside of a @test
  MethodError: _pullback(::Zygote.Context{false}, ::typeof(Base.Broadcast.broadcasted), ::CUDA.CuArrayStyle{1, CUDA.DeviceMemory}, ::typeof(mass), ::CuArray{Atom{Float64, Float64, Float64, Float64}, 1, CUDA.DeviceMemory}) is ambiguous.
  
  Candidates:
    _pullback(__context__::ZygoteRules.AContext, var"639"::typeof(Base.Broadcast.broadcasted), var"640"::AbstractGPUArrayStyle, f, args...)
      @ Zygote ~/.julia/packages/ZygoteRules/M4xmc/src/adjoint.jl:66
    _pullback(__context__::ZygoteRules.AContext, var"387"::typeof(Base.Broadcast.broadcasted), var"388"::Base.Broadcast.AbstractArrayStyle, f::typeof(mass), args...)
      @ Molly ~/.julia/packages/ZygoteRules/M4xmc/src/adjoint.jl:66
  
  Possible fix, define
    _pullback(::ZygoteRules.AContext, ::typeof(Base.Broadcast.broadcasted), ::AbstractGPUArrayStyle, ::typeof(mass), ::Vararg{Any})
  
  Stacktrace:
    [1] _apply(::Function, ::Vararg{Any})
      @ Core ./boot.jl:838
    [2] adjoint
      @ ~/.julia/packages/Zygote/NRp5C/src/lib/lib.jl:203 [inlined]
    [3] _pullback
      @ ~/.julia/packages/ZygoteRules/M4xmc/src/adjoint.jl:67 [inlined]
    [4] broadcasted
      @ ./broadcast.jl:1341 [inlined]
    [5] #System#3
      @ ~/projects/CESMIX/Molly.jl/src/types.jl:565 [inlined]
    [6] _pullback(::Zygote.Context{false}, ::Molly.var"##System#3", ::CuArray{Atom{Float64, Float64, Float64, Float64}, 1, CUDA.DeviceMemory}, ::CuArray{SVector{3, Float64}, 1, CUDA.DeviceMemory}, ::CubicBoundary{Float64}, ::CuArray{SVector{3, Float64}, 1, CUDA.DeviceMemory}, ::Vector{Any}, ::Nothing, ::Tuple{LennardJones{false, DistanceCutoff{Float64, Float64, Float64}, Int64, Int64, Unitful.FreeUnits{(), NoDims, nothing}, Unitful.FreeUnits{(), NoDims, nothing}}, CoulombReactionField{Float64, Float64, Int64, Float64, Unitful.FreeUnits{(), NoDims, nothing}, Unitful.FreeUnits{(), NoDims, nothing}}}, ::Tuple{InteractionList2Atoms{CuArray{Int32, 1, CUDA.DeviceMemory}, CuArray{HarmonicBond{Float32, Float64}, 1, CUDA.DeviceMemory}}, InteractionList3Atoms{CuArray{Int32, 1, CUDA.DeviceMemory}, CuArray{HarmonicAngle{Float64, Float64}, 1, CUDA.DeviceMemory}}, InteractionList4Atoms{CuArray{Int32, 1, CUDA.DeviceMemory}, CuArray{PeriodicTorsion{6, Float64, Float64}, 1, CUDA.DeviceMemory}}}, ::Tuple{}, ::Tuple{}, ::DistanceNeighborFinder{CuArray{Bool, 2, CUDA.DeviceMemory}, Float64}, ::Tuple{}, ::Unitful.FreeUnits{(), NoDims, nothing}, ::Unitful.FreeUnits{(), NoDims, nothing}, ::Float64, ::Nothing, ::Type{System})
      @ Zygote ~/.julia/packages/Zygote/NRp5C/src/compiler/interface2.jl:0
    [7] System
      @ ~/projects/CESMIX/Molly.jl/src/types.jl:485 [inlined]
    [8] _pullback(::Zygote.Context{false}, ::typeof(Core.kwcall), ::@NamedTuple{atoms::CuArray{Atom{Float64, Float64, Float64, Float64}, 1, CUDA.DeviceMemory}, coords::CuArray{SVector{3, Float64}, 1, CUDA.DeviceMemory}, boundary::CubicBoundary{Float64}, velocities::CuArray{SVector{3, Float64}, 1, CUDA.DeviceMemory}, pairwise_inters::Tuple{LennardJones{false, DistanceCutoff{Float64, Float64, Float64}, Int64, Int64, Unitful.FreeUnits{(), NoDims, nothing}, Unitful.FreeUnits{(), NoDims, nothing}}, CoulombReactionField{Float64, Float64, Int64, Float64, Unitful.FreeUnits{(), NoDims, nothing}, Unitful.FreeUnits{(), NoDims, nothing}}}, specific_inter_lists::Tuple{InteractionList2Atoms{CuArray{Int32, 1, CUDA.DeviceMemory}, CuArray{HarmonicBond{Float32, Float64}, 1, CUDA.DeviceMemory}}, InteractionList3Atoms{CuArray{Int32, 1, CUDA.DeviceMemory}, CuArray{HarmonicAngle{Float64, Float64}, 1, CUDA.DeviceMemory}}, InteractionList4Atoms{CuArray{Int32, 1, CUDA.DeviceMemory}, CuArray{PeriodicTorsion{6, Float64, Float64}, 1, CUDA.DeviceMemory}}}, general_inters::Tuple{}, neighbor_finder::DistanceNeighborFinder{CuArray{Bool, 2, CUDA.DeviceMemory}, Float64}, force_units::Unitful.FreeUnits{(), NoDims, nothing}, energy_units::Unitful.FreeUnits{(), NoDims, nothing}}, ::Type{System})
      @ Zygote ~/.julia/packages/Zygote/NRp5C/src/compiler/interface2.jl:0
    [9] loss
      @ ~/projects/CESMIX/Molly.jl/test/zygote.jl:140 [inlined]
   [10] _pullback(::Zygote.Context{false}, ::var"#loss#39"{UnionAll, Bool, Bool, Bool, Bool, DistanceNeighborFinder{CuArray{Bool, 2, CUDA.DeviceMemory}, Float64}, InteractionList4Atoms{CuArray{Int32, 1, CUDA.DeviceMemory}, CuArray{PeriodicTorsion{6, Float64, Float64}, 1, CUDA.DeviceMemory}}, InteractionList3Atoms{CuArray{Int32, 1, CUDA.DeviceMemory}, CuArray{HarmonicAngle{Float64, Float64}, 1, CUDA.DeviceMemory}}, Vector{Float64}, CuArray{Int32, 1, CUDA.DeviceMemory}, CuArray{Int32, 1, CUDA.DeviceMemory}, Tuple{LennardJones{false, DistanceCutoff{Float64, Float64, Float64}, Int64, Int64, Unitful.FreeUnits{(), NoDims, nothing}, Unitful.FreeUnits{(), NoDims, nothing}}, CoulombReactionField{Float64, Float64, Int64, Float64, Unitful.FreeUnits{(), NoDims, nothing}, Unitful.FreeUnits{(), NoDims, nothing}}}, Vector{SVector{3, ForwardDiff.Dual{Nothing, Float64, 1}}}, Vector{SVector{3, ForwardDiff.Dual{Nothing, Float64, 1}}}, Vector{SVector{3, Float64}}, Vector{SVector{3, Float64}}, VelocityVerlet{Float64, RescaleThermostat{Float64}}, CubicBoundary{Float64}, Float64, Int64, Int64, var"#mean_min_separation#28"{var"#abs2_vec#27"}}, ::Float64, ::Float64)
      @ Zygote ~/.julia/packages/Zygote/NRp5C/src/compiler/interface2.jl:0
   [11] pullback(::Function, ::Zygote.Context{false}, ::Float64, ::Vararg{Float64})
      @ Zygote ~/.julia/packages/Zygote/NRp5C/src/compiler/interface.jl:90
   [12] pullback(::Function, ::Float64, ::Float64)
      @ Zygote ~/.julia/packages/Zygote/NRp5C/src/compiler/interface.jl:88
   [13] gradient(::Function, ::Float64, ::Vararg{Float64})
      @ Zygote ~/.julia/packages/Zygote/NRp5C/src/compiler/interface.jl:147
   [14] macro expansion
      @ ~/projects/CESMIX/Molly.jl/test/zygote.jl:202 [inlined]
   [15] macro expansion
      @ ~/builds/julia-1.10.2/share/julia/stdlib/v1.10/Test/src/Test.jl:1577 [inlined]
   [16] top-level scope
      @ ~/projects/CESMIX/Molly.jl/test/zygote.jl:31
   [17] include(fname::String)
      @ Base.MainInclude ./client.jl:489
   [18] top-level scope
      @ ~/projects/CESMIX/Molly.jl/test/runtests.jl:110
   [19] include(fname::String)
      @ Base.MainInclude ./client.jl:489
   [20] top-level scope
      @ none:6
   [21] eval
      @ ./boot.jl:385 [inlined]
   [22] exec_options(opts::Base.JLOptions)
      @ Base ./client.jl:291
   [23] _start()
      @ Base ./client.jl:552

@leios
Copy link
Contributor Author

leios commented Oct 24, 2024

As a note, it looks like #182 is close-ish to being done? If so, it might be best to rework this PR once those changes are in.

As it stands, this PR can work as a branch for any non-CUDA GPU (as long as the user does not need differentiable sims).
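
To illustrate why the branch is vendor-agnostic, here is a hedged sketch of the KernelAbstractions pattern this PR uses (illustrative only, not the exact Molly code — `scale!` and the backend choice are made up for the example). The same kernel runs unchanged on any KA backend, so non-CUDA GPUs need no special-casing:

```julia
using KernelAbstractions

# A trivial kernel: scale each element of `a` by `s`.
@kernel function scale!(a, s)
    i = @index(Global, Linear)
    a[i] *= s
end

backend = CPU()  # swap in ROCBackend(), CUDABackend(), etc. without kernel changes
a = KernelAbstractions.ones(backend, Float32, 64)
scale!(backend)(a, 2f0; ndrange = length(a))
KernelAbstractions.synchronize(backend)
```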

@jgreener64
Copy link
Collaborator

Thanks for keeping going on this. Yes, #182 will get merged soon; I was waiting on full Enzyme GPU broadcasting support to avoid breaking GPU differentiable simulations, but I might just merge it and wait for that to arrive later. Consequently, I wouldn't worry about the Zygote error.

I'm happy to help update this PR after #182, I realise that the ground has been moving under you.

@leios
Copy link
Contributor Author

leios commented Oct 24, 2024

Yeah, no worries. I should go ahead and close the #99 PR since this one supersedes it.

On my end, I am happy to wait a little longer and rebase up when you are happy with #182. I have other projects to work on in the meantime.

Could you link the Enzyme issue? Is it this one? EnzymeAD/Enzyme.jl#1672

@jgreener64
Copy link
Collaborator

Okay great. The example I posted on JuliaGPU/CUDA.jl#2471 doesn't work as it stands; that and similar issues around reduction and broadcasting are the problem.

@vchuravy
Copy link

@jgreener64 if there isn't an open issue on Enzyme.jl associated with it, it is very likely that Billy and I will lose track of it.

@jgreener64
Copy link
Collaborator

jgreener64 commented Oct 24, 2024

Billy has favoured moving CUDA-specific issues to CUDA.jl, e.g. EnzymeAD/Enzyme.jl#1454 (comment); in either case I can find some MWEs of what is erroring in the next few days and make sure they are tracked in issues.

@jgreener64
Copy link
Collaborator

#182 is now merged.

@wsmoses
Copy link

wsmoses commented Oct 30, 2024

Particular issue is this one: JuliaGPU/CUDA.jl#2455

It is essentially that we haven't yet implemented the derivative of CUDA.jl's reduction of a broadcast (the derivative for the reduction of a plain array was just added in CUDA.jl).
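
For clarity, a hedged sketch of the distinction (my own illustration, not code from the issue — `f` and `g` are made-up names):

```julia
using CUDA, Enzyme

x  = CUDA.rand(Float32, 16)
dx = CUDA.zeros(Float32, 16)

# Reduction of a plain array: a derivative rule was recently added in
# CUDA.jl, so this pattern should differentiate.
f(v) = sum(v)
Enzyme.autodiff(Reverse, f, Active, Duplicated(x, dx))

# Reduction of a fused broadcast: the broadcast-then-reduce derivative is
# the missing piece tracked in JuliaGPU/CUDA.jl#2455.
g(v) = sum(abs2.(v))
```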
