-
Notifications
You must be signed in to change notification settings - Fork 108
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Address a deadlock issue in multi-GPU scenario #1407
base: clang_tot_upgrade
Are you sure you want to change the base?
Conversation
mangupta
commented
Mar 4, 2020
- Fixes SWDEV-219322
The error message for the failing tests is curious. |
I kicked off a unit test run on my local system to see if I could reproduce the failures. |
@@ -4176,14 +4176,14 @@ void HSAQueue::dispose() { | |||
|
|||
Kalmar::HSADevice* device = static_cast<Kalmar::HSADevice*>(getDev()); | |||
|
|||
wait(); | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some comment would help. It seems calling 'wait()' is equivalent to holding qmutex
and wait_no_lock()
which is what is lines 4185 below after locking rocrQueuesMutex
. Does this fix imply that rocrQueuesMutex
should NOT be held before qmutex
as it may cause deadlock? If so, shouldn't line 4185 locking of qmutex
be removed?
Or, could moving acquiring of rocrQueuesMutex
to just before calling removeRocrQueue()
help?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wait_no_lock()
could potentially call EnqueueMarkerNoLock()
. If the HSAQueue
does not hold a rocr queue, it will end up calling createOrstealRocrQueue()
and lock rocrQueuesMutex
.
Moving the lock on rocrQueuesMutex
until just before line 4206 might work, too.
That's a known problem that @david-salinas will address |
@jeffdaily do we still need an explicit wait if we clear the asyncops vector? |
Isn't the |