Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update Kokkos version to 4.4.1 #1191

Merged
merged 27 commits into from
Nov 22, 2024
Merged

Update Kokkos version to 4.4.1 #1191

merged 27 commits into from
Nov 22, 2024

Conversation

Yurlungur
Copy link
Collaborator

@Yurlungur Yurlungur commented Oct 16, 2024

PR Summary

Bumps Kokkos to 4.4.1.

This revealed some memory leaks in our view of view (because we never deallocate/destruct the inner views).
This PR fixes this by using the SequentialHostInit property that was introduced in 4.4.1 for the outer view allocation.
With this property the destructors of the inner views are called plainly on the host (rather than inside a parallel region, which is illegal in the Kokkos programming model).
Moreover, this required different logic as our view of view are on device (so we need to take special care when to pass SequentialHostInit because it cannot be passed to a device outer view).
Thus we only pass SequentialHostInit in the outer view allocation when compiling for a host execution space (because mirror views are noops then) and only pass SequentialHostInit for the outer view host mirror, when compiling on device.

Given that our ParArray#D interface did not allow to parse arbitrary allocation properties in the ctor, I decided to switch the outer views to plain Kokkos::Views (without state info) over changing the ctor interface of the ParArray#D to also parse allocation properties (to keep the interface small/clean).

It's probably worth to double check if any of the outer views need a state (but I don't think so because the code compiles fine).

Additional details/discussion also #1205 and #1193

PR Checklist

  • Code passes cpplint
  • New features are documented.
  • Adds a test for any bugs fixed. Adds tests for new features.
  • Code is formatted
  • Changes are summarized in CHANGELOG.md
  • Change is breaking (API, behavior, ...)
    • Change is additionally added to CHANGELOG.md in the breaking section
    • PR is marked as breaking
    • Short summary API changes at the top of the PR (plus optionally with an automated update/fix script)
  • CI has been triggered on Darwin for performance regression tests.
  • Docs build
  • (@lanl.gov employees) Update copyright on changed files

@Yurlungur Yurlungur linked an issue Oct 16, 2024 that may be closed by this pull request
Copy link
Collaborator

@pgrete pgrete left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely in favor (plus the bump to 4.4.1).

CHANGELOG.md Outdated Show resolved Hide resolved
@pdmullen pdmullen changed the title [trivial] bump Kokkos version to 4.2.01 [trivial] Update Kokkos version to 4.4.1 Oct 17, 2024
@BenWibking
Copy link
Collaborator

Just wanted to note that updating to Kokkos 4.4.x causes finalization of parthenon::Mesh to fail: #1193.

@BenWibking
Copy link
Collaborator

@pgrete It looks like that didn't fix it. Are there other view-of-views usage elsewhere in the code?

@pgrete pgrete mentioned this pull request Oct 21, 2024
@pgrete
Copy link
Collaborator

pgrete commented Oct 21, 2024

@pgrete It looks like that didn't fix it. Are there other view-of-views usage elsewhere in the code?

Potentially. But the fix didn't work because I put the view_alloc info to the device view (which triggered a static assert fail).
With the latest version here, I don't see finalize issues in AthenaPK any more (but I still need to check with the CI here does still fail).

@pgrete
Copy link
Collaborator

pgrete commented Oct 21, 2024

$ export KOKKOS_TOOLS_LIBS=/home/pgrete/src/kokkos-tools/build/debugging/vov-bug-finder/libkp_view_of_views_bug_finder.so
[18:02:47][pgrete@haerke: ~/src/parthenon/build-skx]
$ /home/pgrete/src/parthenon/build-skx/tst/unit/unit_tests "Can pull variables from containers based on Metadata"
Filters: Can pull variables from containers based on Metadata
view of views "MakePack::cv" not properly cleared this fence labelled "HostSpace::impl_deallocate before free" will hang

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
unit_tests is a Catch v2.13.8 host application.
Run with -? for options

-------------------------------------------------------------------------------
Can pull variables from containers based on Metadata
      Given: A Container with a set of variables initialized to zero
-------------------------------------------------------------------------------
/home/pgrete/src/parthenon/tst/unit/test_meshblock_data_iterator.cpp:76
...............................................................................

/home/pgrete/src/parthenon/tst/unit/test_meshblock_data_iterator.cpp:364: FAILED:
  {Unknown expression after the reported line}
due to a fatal error condition:
  SIGABRT - Abort (abnormal termination) signal

===============================================================================
test cases:  1 |  0 passed | 1 failed
assertions: 48 | 47 passed | 1 failed

I guess all our packs are broken (so the corresponding unit tests fail correctly).

@pgrete
Copy link
Collaborator

pgrete commented Oct 22, 2024

Ugh, this gets uglier with the minute...
The host tests now fail because the create_mirror_view is a noop so the new SequentialHostInit is not used and the ViewOfView on host fails to deallocate.

I'm removing the "trivial" from the PR and adding the bug label as we're leaking memory...

@pgrete pgrete changed the title [trivial] Update Kokkos version to 4.4.1 Update Kokkos version to 4.4.1 Oct 22, 2024
@pgrete pgrete added the bug Something isn't working label Oct 22, 2024
@pgrete pgrete disabled auto-merge November 6, 2024 15:37
@pgrete pgrete requested a review from pdmullen November 8, 2024 15:12
Copy link
Collaborator

@pdmullen pdmullen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if there should be some sort of "announcement" for downstream devs that if they invoke views of views, they need to revisit things...

@@ -127,11 +128,11 @@ struct ProResInfo {
int GetBufferSize(MeshBlock *pmb, const NeighborBlock &nb,
std::shared_ptr<Variable<Real>> v);

using BndInfoArr_t = ParArray1D<BndInfo>;
using BndInfoArr_t = Kokkos::View<BndInfo *, LayoutWrapper, DevMemSpace>;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No objection... just for my own understanding, can you explain the necessity of this change?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to pass the Kokkos::SequentialHostInit property, which is done by not passing a string label but a properties object constructed via Kokkos::view_alloc(Kokkos::SequentialHostInit, label).
This works for Views but ParArray#D are not plain View any more but carry a state.
Thus, in order to pass the properties object, we'd have to mirror that interface for ParArray#D, too.
Given that there's not place in the code (I could find) where we use the state for the outer view in a view of view, I decided to just use plain Views again (rather than extending the interface).

In order to not leak (more) "Kokkos" pieces in the codebase, I'd be happy to introduce RawParArray#D that directly map to views if that gets a 👍 by more people.

cache.bnd_info_h = Kokkos::create_mirror_view(cache.bnd_info);
cache.bnd_info = BndInfoArr_t(ViewOfViewAlloc("bnd_info"), nbound);
cache.bnd_info_h = Kokkos::create_mirror_view(
Kokkos::view_alloc(Kokkos::SequentialHostInit), cache.bnd_info);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder why kokkos can't automatically handle this for us.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we have our own type/calls that hide these kokkos calls for views of views to minimize boilerplate?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder why kokkos can't automatically handle this for us.

Probably because the general answer to views of views is: No :D

Should we have our own type/calls that hide these kokkos calls for views of views to minimize boilerplate?

Yes, I can add that. How about ViewOfViewMirror(view) to mirror the ViewOfViewAlloc?

Copy link
Collaborator Author

@Yurlungur Yurlungur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't tested downstream and it looks like some tests are failing but otherwise LGTM. (I can't approve since it was my branch originally ;) )

doc/sphinx/src/development.rst Outdated Show resolved Hide resolved
src/bvals/comms/bnd_info.cpp Outdated Show resolved Hide resolved
cache.bnd_info_h = Kokkos::create_mirror_view(cache.bnd_info);
cache.bnd_info = BndInfoArr_t(ViewOfViewAlloc("bnd_info"), nbound);
cache.bnd_info_h = Kokkos::create_mirror_view(
Kokkos::view_alloc(Kokkos::SequentialHostInit), cache.bnd_info);
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we have our own type/calls that hide these kokkos calls for views of views to minimize boilerplate?

// on the host. If the ViewOfViews in on the device, then `SequentialHostInit` should be
// passed when calling `create_mirror_view`.
template <typename T = DevMemSpace>
auto ViewOfViewAlloc(const std::string &label) {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@pgrete
Copy link
Collaborator

pgrete commented Nov 12, 2024

I haven't tested downstream and it looks like some tests are failing but otherwise

The failing tests are due to the increased build size, see discussion on Matrix.

$ du -hs build-*dbg
18G	build-cuda-dbg
2.6G	build-host-dbg
8.5G	build-hip-dbg
``
The CI runner by default only has 14G space.
Let's see if there's an easy command line/compiler switch or if we have to update the build (to sequentially build and clean pieces).

Comment on lines +72 to +73
But if you have to (which is the case in some places inside Parthenon)
then follow this pattern:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this preferred over just making the inner Views unmanaged in a View of Views? My understanding is that the unmanaged view doesn't call the destructors, so doing this would also solve the problem since I don't think we ever make a view of views that actually manages the inner views (maybe this is wrong?). I think it would be trivial to make an UnmanagedParArray#D by just changing the view type it is templated on. Then we wouldn't have to have raw Kokkos views floating around in the code.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly, I don't know and potentially am afraid of some unknown side effects.
What happens to the reference counter of the existing managed view when we assign it to an unmanaged element of the outer view.
Similarly, for the boundary info and prolong/restrict info outer view that contain objects which implicitly contain ParArray#Ds. If they're not reference counted, we'd manually have to destruct those, don't we (or how else would the inner object know that they don't need to exist anymore)?

Copy link
Collaborator

@lroberts36 lroberts36 Nov 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the Kokkos docs:

“Unmanaged” means that Kokkos does not do reference counting or automatic deallocation for those Views.

I think this implies that the unmanaged view doesn't increment the reference count. So it is possible that the other view pointing to the same memory deallocates the memory while the unmanaged view still points to it, but the unmanaged view will never try to call the destructor.

I believe that in BndInfo and prolong/restrict info we are never creating a ParArray#D that is only held by those objects. Rather, we are essentially "pointing" to ParArray#Ds that are held elsewhere (e.g. the communication buffer, variable data). I think the other places those objects persist longer than the packs, BndInfo, etc. so it is safe to have the internal Views in views of views unmanaged.

I think my real concern is that it is unclear (at least to me) what SequentialHostInit is doing.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the Kokkos docs:

“Unmanaged” means that Kokkos does not do reference counting or automatic deallocation for those Views.

I think this implies that the unmanaged view doesn't increment the reference count. So it is possible that the other view pointing to the same memory deallocates the memory while the unmanaged view still points to it, but the unmanaged view will never try to call the destructor.

I believe that in BndInfo and prolong/restrict info we are never creating a ParArray#D that is only held by those objects. Rather, we are essentially "pointing" to ParArray#Ds that are held elsewhere (e.g. the communication buffer, variable data). I think the other places those objects persist longer than the packs, BndInfo, etc. so it is safe to have the internal Views in views of views unmanaged.

I think my real concern is that it is unclear (at least to me) what SequentialHostInit is doing.

The PR that added this feature to Kokkos is just a few lines (kokkos/kokkos#7229 excluding the test code).
AFAIK the property effectively tells Kokkos to explicitly call the destructor (line 229 https://github.com/kokkos/kokkos/pull/7229/files#diff-0d719ff2418eb0512b065dca1765d2788ddd4a524734f8f96b7af34a22f4b560) which results in a clean deallocation of the inner view.
Otherwise, the default dtor of the outer view would call the dtor of the inner views in a parallel region, which is illegal.

But maybe I'm also wrong.

But if that's correct, then I somehow feel more comfortable with this way (for the moment) as I understand that concept whereas it's unclear to me if there are any side effects (not necessarily from a Kokkos point of view but from the Parthenon usage point of view) in using unmanaged inner views.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think what I didn't understand was that it seemed like I could only set SequentialHostInit for host space views of device space views and it seemed like we actually needed this for device space views of device space views. But it is very possible that I just don't understand what is going on.

@fglines-nv
Copy link
Collaborator

Just throwing some input here -- this PR was necessary for Parthenon to work on Venado with multiple GPUs per node after the recent DST. Old Parthenon builds and downstream codes (with Kokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC=OFF) were giving this error:

terminate called after throwing an instance of 'std::runtime_error'
  what():  cudaMemsetAsync( dst.data(), 0, dst.size() * sizeof(typename View<DT, DP...>::value_type), exec_space_instance.cuda_stream()) error( cudaErrorInvalidValue): invalid argument /users/forrestglines/projects/parthenon-project/parthenon/external/Kokkos/core/src/Kokkos_Cuda.hpp:252

Updating to this PR avoids the issue.

@pgrete
Copy link
Collaborator

pgrete commented Nov 21, 2024

Alright, I made the changes as discussed last week, i.e.,

  • the manual SequentialHostInit for creating a host mirror is now explicitly hidden behind a create_view_of_view_mirror function
    -Removed the plain Kokkos::Views I introduced in favor of ParArray#DRaw (which directly map to Kokkos Views and hide that interface in the Parthenon code outside the kokkos_abstraction
  • Changed the "debug" build for the "check compilers" stage to -O0 (--dopt=off as suggested by @forrestglines didn't work locally so let's see if the other change alonge brings us back to where we want to be)
  • Bonus: the check compiler for every single commit was probably overkill, I now updated the logic to match the "extended" logic so that it's only called when auto merge is on

Given the extend of the changes, might be good if someone could have a final look before merging.

@pgrete pgrete enabled auto-merge (squash) November 21, 2024 14:35
@pgrete pgrete disabled auto-merge November 21, 2024 14:35
@Yurlungur
Copy link
Collaborator Author

@brryan @bprather is this working for your downstreams? @pdmullen @lroberts36 @AstroBarker we should check for phoebus and riot. (Though I don't expect any blocks.)

@BenWibking
Copy link
Collaborator

The HIP builds are failing due to a compiler bug:

/usr/bin/hipcc  -DCATCH_CONFIG_ENABLE_BENCHMARKING -DKOKKOS_DEPENDENCE -I../src -Isrc/generated -I/usr/local/hdf5/serial/include -IKokkos -IKokkos/core/src -I../external/Kokkos/core/src -I../external/Kokkos/core/src/../../tpls/desul/include -I../external/Kokkos/core/src/../../tpls/mdspan/include -IKokkos/containers/src -I../external/Kokkos/containers/src -IKokkos/algorithms/src -I../external/Kokkos/algorithms/src -IKokkos/simd/src -I../external/Kokkos/simd/src -I../external/Catch2/single_include -O0   -fno-gpu-rdc --offload-arch=gfx1030 -std=c++17 -MD -MT tst/unit/CMakeFiles/unit_tests.dir/test_unit_params.cpp.o -MF tst/unit/CMakeFiles/unit_tests.dir/test_unit_params.cpp.o.d -o tst/unit/CMakeFiles/unit_tests.dir/test_unit_params.cpp.o -c ../tst/unit/test_unit_params.cpp
error: Illegal instruction detected: Invalid dpp_ctrl value: wavefront shifts are not supported on GFX10+
renamable $vgpr2 = V_MOV_B32_dpp $vgpr2(tied-def 0), $vgpr1, 304, 15, 15, -1, implicit $exec
1 error generated when compiling for gfx1030.

But I don't think it makes sense to build for gfx1030, since that's not a datacenter GPU... Would it make sense to change the CI build to gfx90a (MI250X) or gfx942 (MI300) (https://rocm.docs.amd.com/en/latest/reference/gpu-arch-specs.html)?

@pgrete
Copy link
Collaborator

pgrete commented Nov 21, 2024

The HIP builds are failing due to a compiler bug:

/usr/bin/hipcc  -DCATCH_CONFIG_ENABLE_BENCHMARKING -DKOKKOS_DEPENDENCE -I../src -Isrc/generated -I/usr/local/hdf5/serial/include -IKokkos -IKokkos/core/src -I../external/Kokkos/core/src -I../external/Kokkos/core/src/../../tpls/desul/include -I../external/Kokkos/core/src/../../tpls/mdspan/include -IKokkos/containers/src -I../external/Kokkos/containers/src -IKokkos/algorithms/src -I../external/Kokkos/algorithms/src -IKokkos/simd/src -I../external/Kokkos/simd/src -I../external/Catch2/single_include -O0   -fno-gpu-rdc --offload-arch=gfx1030 -std=c++17 -MD -MT tst/unit/CMakeFiles/unit_tests.dir/test_unit_params.cpp.o -MF tst/unit/CMakeFiles/unit_tests.dir/test_unit_params.cpp.o.d -o tst/unit/CMakeFiles/unit_tests.dir/test_unit_params.cpp.o -c ../tst/unit/test_unit_params.cpp
error: Illegal instruction detected: Invalid dpp_ctrl value: wavefront shifts are not supported on GFX10+
renamable $vgpr2 = V_MOV_B32_dpp $vgpr2(tied-def 0), $vgpr1, 304, 15, 15, -1, implicit $exec
1 error generated when compiling for gfx1030.

But I don't think it makes sense to build for gfx1030, since that's not a datacenter GPU... Would it make sense to change the CI build to gfx90a (MI250X) or gfx942 (MI300) (https://rocm.docs.amd.com/en/latest/reference/gpu-arch-specs.html)?

Or is this a compiler issue?
Because it builds and runs for the integration-amdgpu tests which is executed on a Kokkos_ARCH_NAVI1030 machine.

@BenWibking
Copy link
Collaborator

But I don't think it makes sense to build for gfx1030, since that's not a datacenter GPU... Would it make sense to change the CI build to gfx90a (MI250X) or gfx942 (MI300) (https://rocm.docs.amd.com/en/latest/reference/gpu-arch-specs.html)?

Or is this a compiler issue? Because it builds and runs for the integration-amdgpu tests which is executed on a Kokkos_ARCH_NAVI1030 machine.

Ah, didn't realize the CI machine had a gfx1030 card. Yes, definitely a compiler issue.

@brryan
Copy link
Collaborator

brryan commented Nov 21, 2024

@brryan @bprather is this working for your downstreams? @pdmullen @lroberts36 @AstroBarker we should check for phoebus and riot. (Though I don't expect any blocks.)

This works for me downstream in Artemis on Venado GPUs, at least with 2 ranks on one interactive node. The ultimate test is 8 ranks on 2 nodes but the machine seems busy right now so I think from my perspective its fine if this gets merged now.

@pgrete
Copy link
Collaborator

pgrete commented Nov 22, 2024

Bumping rocm to 6.2 didn't fix the issue.
I now changed the Arch just for the debug test build (which worked locally).
Not sure why this is causing problems. Hopefully, this works now.

@pgrete pgrete enabled auto-merge (squash) November 22, 2024 14:21
@pgrete pgrete disabled auto-merge November 22, 2024 15:14
@pgrete pgrete enabled auto-merge (squash) November 22, 2024 15:14
@pgrete pgrete merged commit b559452 into develop Nov 22, 2024
35 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working build configuration
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Update to Kokkos 4.2?
8 participants