Move GPU CI pipelines from old daint to new daint #1239

msimberg · 2024-09-10T12:18:09Z

No description provided.

codacy-production · 2024-09-10T12:22:16Z

Coverage summary from Codacy

See diff coverage on Codacy

| Coverage variation | Diff coverage |
| --- | --- |
| ✅ -0.05% (target: -1.00%) | ✅ ∅ (target: 90.00%) |

Coverage variation details

| | Coverable lines | Covered lines | Coverage |
| --- | --- | --- | --- |
| Common ancestor commit (`aa6ef39`) | 18217 | 13769 | 75.58% |
| Head commit (`fe405f0`) | 18217 (+0) | 13760 (-9) | 75.53% (-0.05%) |

Coverage variation is the difference between the coverage for the head and common ancestor commits of the pull request branch: <coverage of head commit> - <coverage of common ancestor commit>

Diff coverage details

| | Coverable lines | Covered lines | Diff coverage |
| --- | --- | --- | --- |
| Pull request (#1239) | 0 | 0 | ∅ (not applicable) |

Diff coverage is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified: <covered lines added or modified>/<coverable lines added or modified> * 100%

Codacy stopped sending the deprecated coverage status on June 5th, 2024.

aurianer

Thanks a lot!

.gitlab/includes/clang14_cuda11_pipeline.yml

msimberg · 2024-09-11T14:04:24Z

.gitlab/pipelines_on_push.yml

@@ -9,3 +9,6 @@ include:
  - local: '.gitlab/includes/clang14_cuda11_pipeline.yml'
  - local: '.gitlab/includes/gcc12_hip6_pipeline.yml'
  - local: '.gitlab/includes/sloc.yml'
+  # TODO: move to on_merge before merging


msimberg · 2024-09-11T14:07:18Z

Exporting NVIDIA_VISIBLE_DEVICES=all and NVIDIA_DRIVER_CAPABILITIES="compute,utility" seems to be what was required to get the container images to load the correct drivers etc. and avoid

cudaErrorInsufficientDriver (CUDA driver version is insufficient for CUDA runtime version)

These are from https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/docker-specialized.html#constraints.

These work when testing manually, but don't seem to work in CI yet.
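For reference, one way these two variables might be set for a CI job is through a `variables:` block in the GitLab CI YAML. This is only a sketch: the template name `.cuda_test_template` is hypothetical and not taken from this PR, and as noted above, exporting the variables was not yet enough to make CI work at the time of writing.

```yaml
# Hypothetical hidden job template; the name is illustrative only.
.cuda_test_template:
  variables:
    # Expose all GPUs on the node to the container.
    NVIDIA_VISIBLE_DEVICES: "all"
    # Request the driver components needed for CUDA (compute)
    # and for tools such as nvidia-smi (utility).
    NVIDIA_DRIVER_CAPABILITIES: "compute,utility"
```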

msimberg · 2024-09-17T09:42:21Z

All right, we're making some progress:

- The GCC 12/CUDA 12 pipeline is working now.
- The clang/cuda one isn't working because it uses valgrind and valgrind doesn't seem to like some of the instructions used. I'll see if using a less specific arm instruction set may help valgrind here.
- The CUDA 11 pipeline isn't working. I would've expected it to be compatible despite the old CUDA version. I'll see if changing required driver versions etc. helps at all.

I may end up disabling the test steps for the latter two in this PR to reenable them in separate PRs.
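A rough sketch of what "a less specific arm instruction set" could look like, assuming the target is chosen through a Spack architecture triple held in a CI variable. The variable name `SPACK_ARCH` appears in this PR's commit titles, but the values below are assumptions: the OS component is a placeholder, and `neoverse_v2` versus generic `aarch64` is only a guess at the targets involved.

```yaml
# Illustrative only; the OS component and both targets are assumptions.
variables:
  # A microarchitecture-specific target may emit instructions
  # that valgrind cannot decode:
  # SPACK_ARCH: linux-ubuntu22.04-neoverse_v2
  # A generic target restricts code generation to baseline aarch64:
  SPACK_ARCH: linux-ubuntu22.04-aarch64
```

The trade-off is that a generic target gives up microarchitecture-specific optimizations in exchange for binaries that instrumentation tools are more likely to handle.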

.gitlab/includes/common_pipeline.yml

msimberg · 2024-09-18T07:56:57Z

.gitlab/includes/gcc13_gh200_pipeline.yml

@@ -8,7 +8,7 @@ include:
  - local: '.gitlab/includes/common_pipeline.yml'


Remove pipeline?

msimberg · 2024-09-18T08:20:04Z

The clang/cuda configuration with valgrind no longer complains about illegal instructions: good. It now reports many issues, and I don't yet know whether they're real.

I'll aim to get the GCC 12/CUDA 12 pipeline running properly (still some tweaks needed on the CSCS CI side apparently) and then I'll attempt to revive the two other CUDA configurations separately, possibly introducing another valgrind configuration on x86.

.gitlab/docker/Dockerfile.spack_build

.gitlab/includes/clang14_cuda11_pipeline.yml

Moving to aarch64 triggers too many false positives.

Default stays 300 seconds.

The first test that uses the GPU can take significantly longer to run. Following tests take a normal amount of time. This seems to affect older CUDA 11.X versions. 11.8 does not have this issue, but 11.5 and 11.2 do.
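A minimal sketch of how a per-pipeline test timeout could be wired up, assuming the test step runs `ctest` and reads the timeout from a CI variable. The variable name `TEST_TIMEOUT`, the job names, and the value 600 are hypothetical; only the 300-second default comes from the text above.

```yaml
# Hypothetical shared test template (names are illustrative, not from this PR).
.test_template:
  variables:
    TEST_TIMEOUT: "300"  # default, in seconds
  script:
    # ctest's --timeout option sets the default per-test timeout in seconds.
    - ctest --output-on-failure --timeout "${TEST_TIMEOUT}"

# A CUDA 11 job overrides the default to absorb the slow first GPU test.
cuda11_test:
  extends: .test_template
  variables:
    TEST_TIMEOUT: "600"  # illustrative value
```

Because `extends` merges the `variables` maps, the job-level value replaces the template default while the `script` is inherited unchanged.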

There is already a GCC 12 CI configuration with CUDA running on gh200/aarch64. Remove the GCC 13 configuration since it's too similar.

msimberg added this to the 0.29.0 milestone Sep 10, 2024

msimberg self-assigned this Sep 10, 2024

msimberg force-pushed the cuda-pipelines-gh200 branch 4 times, most recently from d11f8c2 to acdfcb0 on September 10, 2024 16:05

aurianer mentioned this pull request Sep 10, 2024

Rename santis pipeline with gh200 + add a test step on daint-alps #1244

Closed

aurianer approved these changes Sep 10, 2024

.gitlab/includes/clang14_cuda11_pipeline.yml (outdated, resolved)

msimberg force-pushed the cuda-pipelines-gh200 branch 2 times, most recently from a72b2f8 to 2798c26 on September 11, 2024 14:01

msimberg commented Sep 11, 2024

msimberg force-pushed the cuda-pipelines-gh200 branch from b6105a5 to 4716cc4 on September 12, 2024 07:52

aurianer mentioned this pull request Sep 12, 2024

Enable testing for gh200 #1130

Closed

msimberg force-pushed the cuda-pipelines-gh200 branch from 0dea610 to bd072ec on September 17, 2024 09:52

msimberg commented Sep 18, 2024

.gitlab/includes/common_pipeline.yml (outdated, resolved)

msimberg commented Sep 18, 2024

aurianer mentioned this pull request Sep 27, 2024

Move perftests reporting on PRs to alps #1255

Draft

msimberg removed this from the 0.29.0 milestone Sep 30, 2024

msimberg mentioned this pull request Oct 3, 2024

Add nvhpc@24.9 pipeline on alps #1260

Merged

msimberg force-pushed the cuda-pipelines-gh200 branch 3 times, most recently from 214b45e to 286b0c5 on November 14, 2024 12:10

msimberg force-pushed the cuda-pipelines-gh200 branch from 286b0c5 to 7541a28 on November 25, 2024 10:08

msimberg force-pushed the cuda-pipelines-gh200 branch from 9a982e3 to 559c234 on December 6, 2024 10:08

msimberg commented Dec 6, 2024

.gitlab/docker/Dockerfile.spack_build (outdated, resolved)

msimberg commented Dec 6, 2024

.gitlab/includes/clang14_cuda11_pipeline.yml (resolved)

msimberg mentioned this pull request Dec 9, 2024

Reenable CUDA + valgrind CI configuration #1367

Open

msimberg force-pushed the cuda-pipelines-gh200 branch 4 times, most recently from d474e3e to 178bad0 on December 9, 2024 16:24

msimberg force-pushed the cuda-pipelines-gh200 branch 4 times, most recently from 0bbf943 to 88e86d8 on December 20, 2024 19:51

msimberg added 15 commits January 6, 2025 11:02

Move GPU CI pipelines from old daint to new daint

894fe00

Rename CI templates with _rosa suffix to use _zen2 suffix

3a9dca5

Enable valgrind testing on GCC 12 CI configuration

e7def48

Disable valgrind testing on clang 14/CUDA 11 CI pipeline

bf2467d

Moving to aarch64 triggers too many false positives.

Don't allow failures on CSCS CI GH200 pipelines

8d0aeff

Make test timeout configurable by pipeline in CSCS CI

e506574

Default stays 300 seconds.

Use higher test timeout in CUDA 11 CI pipelines

4efbed3

The first test that uses the GPU can take significantly longer to run. Following tests take a normal amount of time. This seems to affect older CUDA 11.X versions. 11.8 does not have this issue, but 11.5 and 11.2 do.

Remove gcc13_gh200 CI configuration

9c5c320

There is already a GCC 12 CI configuration with CUDA running on gh200/aarch64. Remove the GCC 13 configuration since it's too similar.

TEMP: Run all CUDA pipelines on push

d629548

Update SPACK_ARCH for NVHPC 24.9 CI configuration

e18af6a

Ensure NVHPC gets installed in CI even if GCC is used from system

6f06e06

Patch math-vector.h on aarch64 to fix CUDA/glibc incompatibility

9ffefa3

Don't use unsupported versions in CUDA/clang CI configuration

762276a

Update nested namespaces to satisfy modernize-concat-nested-namespaces

73895d7

Use GCC 13.2.0 to avoid system GCC for NVHPC 24.7 CI configuration

fe405f0

msimberg force-pushed the cuda-pipelines-gh200 branch from 88e86d8 to fe405f0 on January 6, 2025 10:03
