Address a deadlock issue in multi-GPU scenario #1407

Open
wants to merge 1 commit into
base: clang_tot_upgrade

mangupta
Contributor

@mangupta mangupta commented Mar 4, 2020

  • Fixes SWDEV-219322

@mangupta mangupta changed the base branch from master to clang_tot_upgrade March 4, 2020 05:11
@jeffdaily
Collaborator

The error message for the failing tests is curious: "error while loading shared libraries: libhsa-runtime64.so.1: cannot open shared object file: No such file or directory".

@jeffdaily
Collaborator

I kicked off a unit test run on my local system to see if I could reproduce the failures.

@@ -4176,14 +4176,14 @@ void HSAQueue::dispose() {

Kalmar::HSADevice* device = static_cast<Kalmar::HSADevice*>(getDev());

wait();

A comment here would help. Calling 'wait()' appears to be equivalent to holding qmutex and calling wait_no_lock(), which is what line 4185 below does after locking rocrQueuesMutex. Does this fix imply that rocrQueuesMutex should NOT be held before qmutex, because that ordering may cause a deadlock? If so, shouldn't the locking of qmutex at line 4185 be removed?
Or could acquiring rocrQueuesMutex just before calling removeRocrQueue() help?

Collaborator

wait_no_lock() could potentially call EnqueueMarkerNoLock(). If the HSAQueue does not hold a rocr queue, it will end up calling createOrstealRocrQueue() and lock rocrQueuesMutex.

Moving the lock on rocrQueuesMutex down to just before line 4206 might work, too.
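To make the lock-ordering concern concrete, here is a minimal sketch of the alternative discussed above: drain the queue while rocrQueuesMutex is not held, and take that mutex only around the release step. `Device`, `Queue`, `enqueue`, `acquireRocrQueue` and the counters are hypothetical stand-ins, not the HCC classes. The sketch assumes the only nested locking is qmutex followed by rocrQueuesMutex, and it does not model stealing a queue from another HSAQueue. The thread does not say whether the real failure is a self-deadlock or a cross-thread lock-order inversion, so this only illustrates a consistent ordering.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical stand-in for the device: a pool of hardware (ROCr) queues
// guarded by a single mutex. Not the real Kalmar::HSADevice.
struct Device {
    std::mutex rocrQueuesMutex;
    int freeRocrQueues = 1;
};

// Hypothetical stand-in for HSAQueue.
// Lock order used everywhere: qmutex first, then rocrQueuesMutex.
class Queue {
public:
    explicit Queue(Device& d) : dev(d) {}

    // Record one unit of outstanding work.
    void enqueue() {
        std::lock_guard<std::mutex> q(qmutex);
        ++pending;
    }

    // Public wait: holds qmutex, then drains.
    void wait() {
        std::lock_guard<std::mutex> q(qmutex);
        waitNoLock();
    }

    void dispose() {
        std::lock_guard<std::mutex> q(qmutex);
        assert(!disposed);
        // Drain while rocrQueuesMutex is NOT held: draining may need to
        // lock it in order to obtain a hardware queue for the marker.
        waitNoLock();
        {
            // Take the pool lock only around the release step.
            std::lock_guard<std::mutex> pool(dev.rocrQueuesMutex);
            if (hasRocrQueue) {
                hasRocrQueue = false;
                ++dev.freeRocrQueues;
            }
        }
        disposed = true;
    }

    bool isDisposed() const {
        std::lock_guard<std::mutex> q(qmutex);
        return disposed;
    }

private:
    // Caller must hold qmutex.
    void waitNoLock() {
        if (pending == 0) return;
        if (!hasRocrQueue) acquireRocrQueue();
        pending = 0;  // pretend the marker completed
    }

    // Caller must hold qmutex and must NOT hold rocrQueuesMutex.
    void acquireRocrQueue() {
        std::lock_guard<std::mutex> pool(dev.rocrQueuesMutex);
        if (dev.freeRocrQueues > 0) {
            --dev.freeRocrQueues;
            hasRocrQueue = true;
        }
    }

    Device& dev;
    mutable std::mutex qmutex;
    int pending = 0;
    bool hasRocrQueue = false;
    bool disposed = false;
};
```

If `dispose()` in this sketch instead locked rocrQueuesMutex before calling `waitNoLock()`, the nested lock in `acquireRocrQueue()` would block on a mutex the same thread already holds, since `std::mutex` is not recursive.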

@scchan
Collaborator

scchan commented Mar 5, 2020

> The error message for the failing tests is curious. error while loading shared libraries: libhsa-runtime64.so.1: cannot open shared object file: No such file or directory

That's a known problem that @david-salinas will address

@scchan
Collaborator

scchan commented Mar 5, 2020

@jeffdaily do we still need an explicit wait if we clear the asyncops vector?

@jeffdaily
Collaborator

> @jeffdaily do we still need an explicit wait if we clear the asyncops vector?

Isn't the wait necessary due to HCC_OPT_FLUSH?


4 participants