Update to Parthenon with Kokkos 4.4.1 #13

Merged
merged 9 commits into from
Nov 22, 2024

@brryan (Collaborator) commented Nov 20, 2024

Background

Chicoma and Venado are failing at runtime when running with multiple GPUs using CUDA-aware MPI. Forrest found that moving to Kokkos 4.4.1 fixes this issue at least on Venado.

Description of Changes

  • Switch our Parthenon submodule to the current branch that includes Kokkos 4.4.1 support.
  • Update env/bash script to support Venado (sort of)
  • Fix a compiler error (constexpr if capture) and a warning (unused variable) on recent nvcc

Checklist

  • New features are documented
  • Tests added for bug fixes and new features
  • (@lanl.gov employees) Update copyright on changed files
  • Parthenon PR is merged

@brryan changed the title from "Draft: Update to Parthenon with Kokkos 4.4.1" to "Update to Parthenon with Kokkos 4.4.1" on Nov 22, 2024

@pdmullen (Collaborator) left a comment

LGTM!

env/bash (outdated)
  PARTITION="venado-gh"
- elif [[ "$HOSTNAME" =~ ^ve-rfe[4-7]$ || ( $SLURM_CLUSTER_NAME == "venado" && $SLURM_JOB_PARTITION == "cpu" ) ]]; then
+ elif [[ "$HOSTNAME" =~ ^ve-rfe[4-7]$ || "$HOSTNAME" =~ ^ve-fe[4-7]$ || ( $SLURM_CLUSTER_NAME == "venado" && $SLURM_JOB_PARTITION == "cpu" ) ]]; then
Collaborator

I don't think you can distinguish cpu/gpu from the hostname. You can be on a grace-grace frontend but submit to a grace-hopper backend, and the hostname you check on the backend is still the grace-grace one. I think if you checked for a hostname in [1-3] or $SLURM_GPUS_ON_NODE > 0, that would always work.

Collaborator Author

Yeah, you're right, thanks for noticing this. I thought that at some point in the past HOSTNAME wasn't defined on Venado backends? But it definitely is now, so maybe I'm just misremembering. Yes, I can update this logic to fix this.

This also doesn't work for e.g. SLURM_JOB_PARTITION=gpu_debug (it's only been sneaking through because I always use GPU frontends for GPU backends, etc.).

Collaborator Author

OK, this should be fixed now. I tested it on a CPU frontend, and on a GPU backend reached from either a CPU frontend or a GPU frontend. I'm not sure I actually have access to the CPU backends to test those.
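
For readers following the thread, the detection logic discussed above might look roughly like the sketch below. It is not the code that was merged: `venado_node_kind` is a hypothetical helper, the `ve-fe[1-3]`/`ve-rfe[1-3]` GPU frontend pattern is inferred from the "hostname in [1-3]" suggestion, and the `gpu*` partition prefix is an assumption based on the `gpu_debug` example. Inside a Slurm job it ignores the hostname, since the thread notes that the value seen on the backend can still be the frontend's.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the PR): classify a Venado node as gpu or cpu.
# Arguments stand in for $HOSTNAME, $SLURM_JOB_PARTITION, $SLURM_GPUS_ON_NODE.
venado_node_kind() {
  local host="$1" partition="$2" gpus="${3:-0}"

  # Inside a Slurm job, trust the allocation rather than the hostname,
  # which may still name the frontend the job was submitted from.
  if [[ -n "$partition" ]]; then
    if [[ "$gpus" =~ ^[0-9]+$ ]] && (( 10#$gpus > 0 )); then
      echo gpu
    elif [[ "$partition" == gpu* ]]; then
      # Assumed naming: also covers partitions such as gpu_debug.
      echo gpu
    else
      echo cpu
    fi
    return
  fi

  # No job context: fall back to the frontend hostname.
  # Assumption: frontends 1-3 are GPU (grace-hopper), 4-7 are CPU (grace-grace).
  if [[ "$host" =~ ^ve-r?fe[1-3]$ ]]; then
    echo gpu
  elif [[ "$host" =~ ^ve-r?fe[4-7]$ ]]; then
    echo cpu
  else
    echo unknown
  fi
}
```

A call such as `venado_node_kind "$HOSTNAME" "$SLURM_JOB_PARTITION" "$SLURM_GPUS_ON_NODE"` would then drive the choice of `PARTITION`. Passing the values as arguments keeps the function easy to exercise without a real allocation.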

@adamdempsey90 merged commit e2568f2 into develop on Nov 22, 2024 (4 checks passed)
@adamdempsey90 deleted the brryan/kokkos_441 branch on November 22, 2024 23:37
@brryan mentioned this pull request on Dec 2, 2024 (5 tasks)