[SYCL] Fix post-commit failure #13657

Closed
wants to merge 15 commits

Conversation

konradkusiak97
Contributor

No description provided.

@aelovikov-intel
Contributor

@konradkusiak97, why can't this be tested locally? Our CI has limited resources; we wouldn't want the entire development to be done in CI. Logs for post-commit CI tasks contain the options passed to buildbot/configure.py, which should give you enough information to reproduce this locally.

@konradkusiak97
Contributor Author

@aelovikov-intel, one of the previous failures only appeared in the post-commit CI and wasn't reproducible locally, which is why I was trying to get some insight into the problem here. But I understand your concern about the CI's limited resources. I'll close this PR and try to work on the fix locally.

@aelovikov-intel
Contributor

I don't mind using CI this way if the failure is specific to the HW we use in CI. Otherwise, I'd be happy to help you reproduce it locally or share a way to trigger post-commit only (without running pre-commit in addition to post-commit).

@konradkusiak97
Contributor Author

That sounds great, thanks. I'll get back to you if my efforts to reproduce this locally turn out to be fully unsuccessful and I need further help.

@konradkusiak97
Contributor Author

konradkusiak97 commented May 10, 2024

I didn't manage to reproduce this failure on the other Intel GPUs that I have access to. It seems to be specific to Iris(R) Xe Graphics, unless something else is going on.

For the latest post-commit run, I modified test-e2e/out_of_order_queue_status.cpp to use Q.memset() instead of Q.fill(), and the test fails again with the same error message as after merging #12702, which suggests there might be a bug in the current memset implementation.
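
As a rough illustration of that modification (a minimal sketch, not the actual test-e2e file; it assumes the test submits the operation to an out-of-order queue and then checks the queue's status via the DPC++ ext_oneapi_empty() extension):

#include <sycl/sycl.hpp>

int main() {
  sycl::queue Q; // out-of-order by default (no in_order property)

  constexpr size_t N = 1024;
  char *Data = sycl::malloc_device<char>(N, Q);

  // The original test used Q.fill(); the change described above swaps in Q.memset().
  sycl::event E = Q.memset(Data, 0, N);

  E.wait();
  // Assumption: the status check goes through the sycl_ext_oneapi_queue_empty extension.
  bool Empty = Q.ext_oneapi_empty();

  sycl::free(Data, Q);
  return Empty ? 0 : 1;
}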

I've noticed that, unlike the pre-commit run where the same test passes, DPC++ is built with clang/clang++ in post-commit, which might explain the difference. I tried to reproduce it on:

[level_zero:gpu] Intel(R) Level-Zero, Intel(R) UHD Graphics 770 1.3 [1.3.29138]

with this configuration:

export CXX=clang++
export CC=clang

python ../llvm/buildbot/configure.py -t Release --ci-defaults --shared-libs --no-assertions \
 --cmake-opt="-DSYCL_ENABLE_STACK_PRINTING=ON" \
 --cmake-opt="-DSYCL_LIB_WITH_DEBUG_SYMBOL=ON" \
 --cmake-opt="-DLLVM_INSTALL_UTILS=ON" \
 --cmake-opt="-DNATIVECPU_USE_OCK=Off" \
 --cmake-opt="-DSYCL_PI_TESTS=OFF" \
-o ./

and running the test with:

clang++ -fsycl -fsycl-targets=spir64 out_of_order_queue_status_memset.cpp -o res
env ONEAPI_DEVICE_SELECTOR=level_zero:gpu ./res

@aelovikov-intel, do you think it might be a bug specific to Iris(R) Xe Graphics? If so, I'll file a bug report.

@aelovikov-intel
Contributor

Not sure I understand this. Who/what component do you want to file that bug against?

@konradkusiak97
Contributor Author

The test that I added in this PR, out_of_order_queue_status_memset.cpp, uses Q.memset instead of Q.fill and produces the same failure as after merging #12702.

So the faulty component would be the current implementation of memset. The failure is specific to the Level Zero backend on Iris(R) Xe Graphics with DPC++ built with clang.
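
For context on why the memset path is the suspect: for a byte pattern, queue::memset and queue::fill should produce identical results, so a failure that shows up only with memset points at its implementation rather than at the test logic. A minimal sketch of that equivalence (illustrative only; buffer sizes and names are made up):

#include <sycl/sycl.hpp>
#include <cassert>

int main() {
  sycl::queue Q;
  constexpr size_t N = 256;
  char *A = sycl::malloc_shared<char>(N, Q);
  char *B = sycl::malloc_shared<char>(N, Q);

  Q.memset(A, 0, N).wait();                  // byte-wise zeroing
  Q.fill(B, static_cast<char>(0), N).wait(); // element-wise fill with the same byte pattern

  for (size_t I = 0; I < N; ++I)
    assert(A[I] == B[I]); // both paths should yield identical contents

  sycl::free(A, Q);
  sycl::free(B, Q);
  return 0;
}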
