IREE runtime seems to misbehave intermittently in CI tests (query/create devices failure and seg fault) #735

Open
sogartar opened this issue Jan 2, 2025 · 3 comments

sogartar commented Jan 2, 2025

On the last 3 commits on main
277618662a515f80f537d9b058b8ff1f15ca4ec0
f5e9cb4ef3fb37c31b438ea6a88b9c8179b7e9e7
ffb0dd2c6be7106e725d49a25b591a3e467913f3

some tests appear to fail inside the IREE runtime.
Two of the failures occur when querying the available devices:

>           hal_device_id = haldriver.query_available_devices()[device_idx]["device_id"]
E           IndexError: list index out of range

Once it crashed with a Segmentation fault.

The data dependent CI tests also failed on one occasion with

FAILED sharktank/tests/models/clip/clip_test.py::ClipTextIreeTest::testCompareLargeIreeBf16AgainstTorchEagerF32 - RuntimeError: Error creating device: c/runtime/src/iree/hal/drivers/hip/hip_device.c:467: ALREADY_EXISTS; HIP driver error 'hipErrorPeerAccessAlreadyEnabled' (704): peer access is already enabled; creating device 'hip'

sogartar commented Jan 2, 2025

Here is a fix for the hipErrorPeerAccessAlreadyEnabled error.
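
The linked change is not reproduced here. As a rough illustration of the general idea only, one way to make this error harmless is to treat "already enabled" as success when enabling peer access. The sketch below is Python, not the IREE C code; `FakeHip`, `device_enable_peer_access`, and `enable_peer_access` are hypothetical names, and only the error code 704 is taken from the log above.

```python
HIP_SUCCESS = 0
# Error code reported in the CI log above (hipErrorPeerAccessAlreadyEnabled).
HIP_ERROR_PEER_ACCESS_ALREADY_ENABLED = 704


class FakeHip:
    """Hypothetical stand-in for the HIP driver, for illustration only."""

    def __init__(self):
        self._enabled = set()

    def device_enable_peer_access(self, src, dst):
        # A second request for the same pair reports "already enabled".
        if (src, dst) in self._enabled:
            return HIP_ERROR_PEER_ACCESS_ALREADY_ENABLED
        self._enabled.add((src, dst))
        return HIP_SUCCESS


def enable_peer_access(hip, src, dst):
    """Enable peer access, treating 'already enabled' as success."""
    status = hip.device_enable_peer_access(src, dst)
    if status in (HIP_SUCCESS, HIP_ERROR_PEER_ACCESS_ALREADY_ENABLED):
        return
    raise RuntimeError(f"enabling peer access {src} -> {dst} failed: {status}")
```

With this shape, creating a device twice in the same process would no longer fail on the second attempt, assuming the repeated enable is in fact benign.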

@AWoloszyn

The failure at hal_device_id = haldriver.query_available_devices()[device_idx]["device_id"] is curious, as iree_hal_hip_driver_query_available_devices is not a complex call.

It:

  1. enumerates the devices from the driver
  2. walks each device and returns a small amount of information about it (queried from the driver).
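
The two steps above can be modeled roughly as follows. This is a toy Python model, not the IREE implementation; `FakeDriver`, `device_count`, `device_name`, and the `name` field are hypothetical. It only illustrates that the returned list has exactly as many entries as the driver enumerates, so a short list points at the enumeration step.

```python
class FakeDriver:
    """Hypothetical driver that reports a fixed set of devices."""

    def __init__(self, names):
        self._names = list(names)

    def device_count(self):
        return len(self._names)

    def device_name(self, ordinal):
        return self._names[ordinal]


def query_available_devices(driver):
    # Step 1: enumerate the devices from the driver.
    count = driver.device_count()
    # Step 2: walk each device and collect a small info record.
    return [
        {"device_id": ordinal, "name": driver.device_name(ordinal)}
        for ordinal in range(count)
    ]
```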

@AWoloszyn

But the IndexError: list index out of range implies that the returned list has no entry at device_idx, i.e. it holds device_idx or fewer devices. We are trying to get information for a device that does not exist.
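
A guarded lookup would make this case easier to diagnose in CI. A minimal sketch, assuming only that `haldriver.query_available_devices()` returns a list of dicts with a `device_id` key, as in the failing test line; `get_hal_device_id` is a hypothetical helper, not an existing function:

```python
def get_hal_device_id(haldriver, device_idx):
    """Return the device_id at device_idx, or fail with a descriptive error."""
    devices = haldriver.query_available_devices()
    if not 0 <= device_idx < len(devices):
        # Report what the driver actually returned instead of a bare IndexError.
        raise RuntimeError(
            f"requested device index {device_idx}, but the driver reported "
            f"only {len(devices)} device(s): {devices}"
        )
    return devices[device_idx]["device_id"]
```

The error message then records how many devices the driver reported at the moment of failure, which helps tell "no devices visible" apart from "fewer devices than expected".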
