
Fix Kokkos Memory Pool when using Shared Memory Spaces #67

Merged

2 commits merged into SCOREC:master on Dec 14, 2023

Conversation

matthew-mccall

This pull request fixes an issue that causes a segmentation fault when using the memory pool with the Kokkos backend. Specifically, kokkos_malloc and kokkos_free allocate and free from the default memory space unless told otherwise, and that default is not SharedSpace. As such, SharedSpace must be passed explicitly as a template parameter to kokkos_malloc and kokkos_free when building with shared memory spaces.
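As a minimal sketch of the allocation pattern described above (not the actual Omega_h pool code; the "pool_chunk" label and helper names are hypothetical), with no template argument kokkos_malloc and kokkos_free operate on the default memory space, so SharedSpace has to be named explicitly:

#include <Kokkos_Core.hpp>
#include <cstddef>

// Allocate pool storage from Kokkos::SharedSpace (CudaUVMSpace on CUDA builds)
// instead of the default memory space of the default execution space.
void* allocate_shared(std::size_t bytes) {
  return Kokkos::kokkos_malloc<Kokkos::SharedSpace>("pool_chunk", bytes);
}

// The memory space used to free must match the one used to allocate.
void free_shared(void* ptr) {
  Kokkos::kokkos_free<Kokkos::SharedSpace>(ptr);
}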

While this rectifies the problem for most tests, run_unit_mesh still fails (for a different reason this time). It appears some fencing is needed after Kokkos::parallel_for; without it, an error is thrown later in the test.
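A sketch of that fencing pattern, using a made-up kernel rather than the run_unit_mesh code: with SharedSpace data the host can legally touch the memory right after the kernel launch, so a fence is needed before reading results written on the device.

#include <Kokkos_Core.hpp>

// Hypothetical kernel illustrating the missing-fence issue described above.
void fill_and_read(int n) {
  Kokkos::View<double*, Kokkos::SharedSpace> x("x", n);
  Kokkos::parallel_for(
      "fill", n, KOKKOS_LAMBDA(const int i) { x(i) = 2.0 * i; });
  Kokkos::fence();      // without this, the host may read x before the kernel finishes
  double first = x(0);  // safe only after the fence
  (void)first;
}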

@matthew-mccall changed the title from "Fix Kokkos Memory Pool with using Shared Memory Spaces" to "Fix Kokkos Memory Pool when using Shared Memory Spaces" on Oct 30, 2023

cwsmith commented Oct 31, 2023

@matthew-mccall Running ctest on a Perlmutter compute node in an interactive allocation.

With this branch (e830e7e), the deltawing adapt '500k' case (see https://github.com/SCOREC/omega_h/wiki/Building-on-NERSC-Perlmutter for run instructions) completed in 6.3346 seconds with the CUDA memory space and the mempool enabled. With the Kokkos Shared (CUDA UVM) memory space and the mempool enabled, the test completed in 7.22439 seconds, about 15% slower. The entity counts and mesh quality histograms are identical between the two runs. Testing of the same case in July (2745b6b) took 6.04235 seconds; the fence appears to add about 5% overhead.

this branch (e830e7e) test failures

All tests except run_mpi_tests are passing (ran with ctest -E mpi --repeat until-fail:3).

9/23 Testing: run_mpi_tests
9/23 Test: run_mpi_tests
Command: "/pscratch/sd/c/cwsmith/omegahUvmBuilds/build-omegah-CUDAUVM-fix/src/mpi_tests" "--osh-pool"
Directory: /pscratch/sd/c/cwsmith/omegahUvmBuilds/build-omegah-CUDAUVM-fix/src
"run_mpi_tests" start time: Oct 31 09:42 PDT
Output:
----------------------------------------------------------
mpi_tests: /global/homes/c/cwsmith/develop/omegahUvm/omega_h/src/mpi_tests.cpp:31: void test_one_rank(Omega_h::CommPtr): Assertion `b == Read<GO>({3, 2, 1, 0})' failed.
<end of output>
Test time =   1.73 sec
----------------------------------------------------------
Test Failed.
"run_mpi_tests" end time: Oct 31 09:42 PDT
"run_mpi_tests" time elapsed: 00:00:01

Running it multiple times indicates there is non-deterministic behavior:

$ ctest -R mpi --repeat until-fail:10
Test project /pscratch/sd/c/cwsmith/omegahUvmBuilds/build-omegah-CUDAUVM-fix
    Start 9: run_mpi_tests
    Test #9: run_mpi_tests ....................   Passed    1.71 sec
    Start 9: run_mpi_tests
    Test #9: run_mpi_tests ....................   Passed    1.61 sec
    Start 9: run_mpi_tests
    Test #9: run_mpi_tests ....................   Passed    1.63 sec
    Start 9: run_mpi_tests
    Test #9: run_mpi_tests ....................Subprocess aborted***Exception:   1.61 sec

0% tests passed, 1 tests failed out of 1

Label Time Summary:
mesh    =   1.61 sec*proc (1 test)

Total Test time (real) =   6.57 sec

The following tests FAILED:
          9 - run_mpi_tests (Subprocess aborted)
Errors while running CTest
Output from these tests are in: /pscratch/sd/c/cwsmith/omegahUvmBuilds/build-omegah-CUDAUVM-fix/Testing/Temporary/LastTest.log
Use "--rerun-failed --output-on-failure" to re-run the failed cases verbosely.

scorec-v10.8.2 test failures

2:run_arrayops
3:run_reprosum
4:run_unit_array_algs
5:run_unit_mesh
6:run_unit_io
7:run_unit_parser
9:run_mpi_tests
10:serial_1d_test
11:run_corner_test
12:run_coarsen_test
13:serial_2d_conserve
14:warp_test_serial
15:run_aniso_test
16:run_random_test
17:amr_test2
18:reverse_class_test
19:rc_field_test

These tests all fail with Exception: **SegFault** without any useful output in the ctest log. For example:

test 2
    Start 2: run_arrayops

2: Test command: /pscratch/sd/c/cwsmith/omegahUvmBuilds/build-omegah-CUDAUVM/src/arrayops_test "--osh-pool"
2: Working Directory: /pscratch/sd/c/cwsmith/omegahUvmBuilds/build-omegah-CUDAUVM/src
2: Test timeout computed to be: 1500
1/1 Test #2: run_arrayops .....................***Exception: SegFault  1.79 sec

0% tests passed, 1 tests failed out of 1

Label Time Summary:
base    =   1.79 sec*proc (1 test)

Total Test time (real) =   1.80 sec

The following tests FAILED:
          2 - run_arrayops (SEGFAULT)

build details

Using gcc 11.2 and cuda 11.7 via the following modules:

module load PrgEnv-gnu
module load cmake/3.24.3

kokkos 4.1.00 cmake configure command

bdir=build-kokkos
cmake -S kokkos -B $bdir \
  -DBUILD_SHARED_LIBS=ON \
  -DCRAYPE_LINK_TYPE=dynamic \
  -DCMAKE_CXX_COMPILER=$PWD/kokkos/bin/nvcc_wrapper \
  -DKokkos_ARCH_AMPERE80=ON \
  -DKokkos_ENABLE_SERIAL=ON \
  -DKokkos_ENABLE_OPENMP=off \
  -DKokkos_ENABLE_CUDA=on \
  -DKokkos_ENABLE_CUDA_LAMBDA=on \
  -DKokkos_ENABLE_DEBUG=off \
  -DCMAKE_INSTALL_PREFIX=$bdir/install

omegah cmake configure command

cmake -S omega_h -B $bdir \
  -DCMAKE_INSTALL_PREFIX=$bdir/install \
  -DBUILD_SHARED_LIBS=on \
  -DOmega_h_USE_Kokkos=on \
  -DOmega_h_CUDA_ARCH=80 \
  -DOmega_h_MEM_SPACE_SHARED=on \
  -DOmega_h_USE_MPI=off \
  -DOmega_h_USE_libMeshb=on \
  -DBUILD_TESTING=on \
  -DCMAKE_CXX_COMPILER=CC \
  -DENABLE_CTEST_MEMPOOL=on

slurm interactive allocation command

salloc --nodes 1 --qos interactive --time 00:10:00 --constraint gpu -n 1 --gpus-per-task 1 -c 32

@matthew-mccall self-assigned this Nov 6, 2023
@matthew-mccall marked this pull request as ready for review December 14, 2023 19:11

cwsmith commented Dec 14, 2023

With CUDA UVM, all tests are passing on a workstation with an Nvidia 2080, gcc 10.1.0, and cuda 11.4. More debugging/testing will be needed to resolve the failures in run_mpi_tests, but I think we should merge this 'as is' and create an issue for the A100/Perlmutter UVM failure.

@cwsmith merged commit ecee285 into SCOREC:master Dec 14, 2023
2 checks passed
@matthew-mccall deleted the kokkosPoolCudaUvmFix branch January 30, 2024 01:55