Update PkgCI test_amd to use MI300x conductor cluster #19517
base: main
@@ -19,7 +19,7 @@ on:

 jobs:
   test_mi300:
-    runs-on: nodai-amdgpu-mi300-x86-64
+    runs-on: linux-mi300-gpu-1
https://github.com/iree-org/iree/actions/runs/12399057884/job/34613576470?pr=19517#step:8:195
iree/runtime/src/iree/hal/drivers/hip/dynamic_symbols.c:160: UNAVAILABLE; HIP runtime library 'amdhip64.dll'/'libamdhip64.so' not available: please ensure installed and in dynamic library search path:
Tried: libamdhip64.so
iree/runtime/src/iree/base/internal/dynamic_library_posix.c:165: NOT_FOUND; failed to load dynamic library (possibly not found on any search path): libamdhip64.so: cannot open shared object file: No such file or directory; creating driver for device 'hip'; resolving dependencies for 'module'
What drivers and software are installed on these new runners? Should we run under Docker (either in the runners themselves or in the job)?
Yeah, I'm discussing this with Jodie offline. We can use the ROCm GitHub runner Docker image for the runner instantiation, but I think it is probably best to specify the Docker image in the job (workflow file), so that the ROCm version in use is visible to everyone.
I think it may depend on how long it takes to download and start the docker image. If it takes 5 minutes to download a 70GB docker image, configuring that in the runner itself could help hide some latency from workflows?
We could revive these dockerfiles as needed:
- https://github.com/iree-org/base-docker-images/blob/main/dockerfiles/amdgpu_ubuntu_jammy_x86_64.Dockerfile
- https://github.com/iree-org/base-docker-images/blob/main/dockerfiles/amdgpu_ubuntu_jammy_ghr_x86_64.Dockerfile
If we need to use Docker, I'd definitely prefer to use either upstream public images if no extra deps are needed or those iree-org ones.
Yeah, I was thinking about just using the latest release: rocm/dev-ubuntu-22.04:6.3. This one shouldn't be too bad to pull down, I think. We can give it a shot and see how long it takes.
That would be great. These jobs do need CMake and ninja but otherwise the deps are pretty minimal.
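If the image does not already ship them, a step along these lines could install the two build dependencies mentioned here. This is only a sketch: it assumes an Ubuntu-based image where the job runs as root (otherwise `sudo` would be needed), and the step name is made up for illustration.

```yaml
- name: Install build dependencies  # hypothetical step name
  run: |
    # Assumption: Debian/Ubuntu base image with apt available and root access.
    apt-get update
    apt-get install -y cmake ninja-build
```

`cmake` and `ninja-build` are the standard Ubuntu package names; whether the chosen image needs anything beyond these is not something this thread confirms.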
Agreed that it's clearer to set a container image in the workflow, but I'm not sure about the image download time. Updated and let's see how long it takes.
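For reference, a minimal sketch of what setting the container in the workflow could look like. The runner label and image tag are the ones mentioned in this PR and thread; the `options` flags are an assumption based on the usual way AMD GPUs are exposed to a container and are not taken from this PR.

```yaml
jobs:
  test_mi300:
    runs-on: linux-mi300-gpu-1
    container:
      # Image proposed in this thread.
      image: rocm/dev-ubuntu-22.04:6.3
      # Assumption: typical flags for giving a container access to AMD GPUs.
      options: --device=/dev/kfd --device=/dev/dri --group-add video
```

Keeping the image in the workflow file makes the ROCm version visible in review, at the cost of pulling the image on each job unless the runner caches it.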
 - name: Check out repository
   uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
   with:
     submodules: false
 - name: Check out runtime submodules
-  run: ./build_tools/scripts/git/update_runtime_submodules.sh
+  run: |
+    git config --global --add safe.directory /__w/iree/iree
I think we can replace this with git config --global --add safe.directory ${{ github.workspace }} so that we are not hardcoding the path. Now that we have added the Docker options, this may also no longer be necessary.
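A sketch of a non-hardcoded variant as a workflow step. It uses the `GITHUB_WORKSPACE` environment variable, which the shell expands inside the container, instead of the `${{ github.workspace }}` expression. Whether the two resolve to the same path in a container job should be checked on the runner, so treat that choice as an assumption. Keeping the original script call after the config line is also an assumption about the rest of the step.

```yaml
- name: Check out runtime submodules
  run: |
    # Mark the checkout as safe without hardcoding /__w/iree/iree.
    git config --global --add safe.directory "$GITHUB_WORKSPACE"
    # Assumption: the original submodule script still runs afterwards.
    ./build_tools/scripts/git/update_runtime_submodules.sh
```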
We want to migrate the workflows that use MI300 and do not require cache support to our conductor cluster. A new runner with one GPU has been created.
This PR updates the runs-on label to point at it.