Update to Parthenon with Kokkos 4.4.1 #13

brryan · 2024-11-20T03:00:28Z

Background

Runs on Chicoma and Venado are failing at runtime when using multiple GPUs with CUDA-aware MPI. Forrest found that moving to Kokkos 4.4.1 fixes this issue, at least on Venado.

Description of Changes

Switch our Parthenon submodule to the current branch that includes Kokkos 4.4.1 support.
Update env/bash script to support Venado (sort of)
Fix a compiler error (constexpr if capture) and a warning (unused variable) on recent nvcc

Checklist

New features are documented
Tests added for bug fixes and new features
(@lanl.gov employees) Update copyright on changed files
Parthenon PR is merged

pdmullen

LGTM!

adamdempsey90 · 2024-11-22T21:26:05Z

env/bash

    PARTITION="venado-gh"
-elif [[ "$HOSTNAME" =~ ^ve-rfe[4-7]$ || ( $SLURM_CLUSTER_NAME == "venado" && $SLURM_JOB_PARTITION == "cpu" ) ]]; then
+elif [[ "$HOSTNAME" =~ ^ve-rfe[4-7]$ || "$HOSTNAME" =~ ^ve-fe[4-7]$ || ( $SLURM_CLUSTER_NAME == "venado" && $SLURM_JOB_PARTITION == "cpu" ) ]]; then


I don't think you can distinguish CPU/GPU from the hostname. You can be on a grace-grace frontend but submit to a grace-hopper backend. The hostname you check on the backend is still the grace-grace one. I think if you checked for hostname in [1-3] or $SLURM_GPUS_ON_NODE > 0, that would always work.

Yeah you're right, thanks for noticing this. I thought at some point in the past HOSTNAME wasn't defined on venado backends? But it definitely is now, maybe I'm just misremembering. Yes I can update this logic to fix this.

This also doesn't work for e.g. SLURM_JOB_PARTITION=gpu_debug (it's only been sneaking through because I always use gpu frontends for gpu backends etc.)

OK this should be fixed now. I tested it on a CPU frontend, and on a GPU backend via either a CPU frontend or a GPU frontend. I'm not sure I actually have access to the CPU backends to test those.
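For readers following the thread, here is a rough sketch of the kind of check discussed above. It is not the code that was merged. Several things are assumptions for illustration only: the function name `detect_venado_partition`, the `venado-gg` label, the `[1-3]` GPU frontend host range (inferred from the `[4-7]` CPU patterns in the diff), and the choice to trust Slurm variables over the hostname inside a job. `SLURM_JOB_ID`, `SLURM_GPUS_ON_NODE`, and `SLURM_JOB_PARTITION` are standard Slurm environment variables.

```shell
#!/usr/bin/env bash
# Sketch only. "venado-gg", the [1-3] host range, and the function name are
# hypothetical; only "venado-gh" and the [4-7] patterns appear in the diff.
detect_venado_partition() {
  local host="${1:-${HOSTNAME:-}}"
  local gpus="${SLURM_GPUS_ON_NODE:-0}"
  local part="${SLURM_JOB_PARTITION:-}"

  if [[ -n "${SLURM_JOB_ID:-}" ]]; then
    # Inside a job the hostname may still be the frontend's, so look at
    # what Slurm allocated. The gpu* glob also covers gpu_debug.
    if { [[ "$gpus" =~ ^[0-9]+$ ]] && (( 10#$gpus > 0 )); } || [[ "$part" == gpu* ]]; then
      echo "venado-gh"
    else
      echo "venado-gg"
    fi
  elif [[ "$host" =~ ^ve-r?fe[1-3]$ ]]; then
    # Assumed GPU (grace-hopper) frontends.
    echo "venado-gh"
  elif [[ "$host" =~ ^ve-r?fe[4-7]$ ]]; then
    # CPU (grace-grace) frontends, as in the diff above.
    echo "venado-gg"
  else
    echo "unknown"
  fi
}
```

Called as `PARTITION="$(detect_venado_partition)"`, this would select the GPU partition on a GPU backend regardless of which frontend submitted the job, which is the case the hostname-only check missed.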

Benjamin Ransom Ryan and others added 8 commits November 8, 2024 13:36

Compiles and runs on one gpu

f35db4f

Merge branch 'develop' of github.com:lanl/artemis into brryan/venado_…

08d658f

…updates

Parthenon with Kokkos 4.4.1

a5736f2

Merge branch 'brryan/venado_updates' of github.com:lanl/artemis into …

7726375

…brryan/kokkos_441

Latest

809057b

Latest parthenon

28059c6

Clean up sloppy errors

f1409b1

One more error

f273b1b

brryan changed the title ~~Draft: Update to Parthenon with Kokkos 4.4.1~~ Update to Parthenon with Kokkos 4.4.1 Nov 22, 2024

brryan requested review from adamdempsey90 and pdmullen November 22, 2024 18:46

pdmullen approved these changes Nov 22, 2024

adamdempsey90 approved these changes Nov 22, 2024

Fix venado partition identification h/t AMD

c9e41ef

adamdempsey90 merged commit e2568f2 into develop Nov 22, 2024
4 checks passed

adamdempsey90 deleted the brryan/kokkos_441 branch November 22, 2024 23:37

brryan mentioned this pull request Dec 2, 2024

Fix Venado submission script issue #23

Merged

5 tasks
