We see the following warnings (followed by asserts in debug mode) when memory on the device is tight:
W@00000 GPU[hip(0)]: Write access to data copy 0x7fbe35bdbb10 [ref_count 1] with existing readers [1024] (possible anti-dependency,
or concurrent accesses), please prevent that with CTL dependencies
The 1024 is suspicious and points us to #575. @therault and I found that the rollback of the CAS is wrong. The CAS is done on an element that we will abandon and is only there to block someone from taking the element. There is no need to rollback the CAS.
Once we have released the LRU element, we go back to `malloc_data`. Now there is a pretty good chance that the `zone_alloc` succeeds. We still have `PARSEC_CUDA_DATA_COPY_ATOMIC_SENTINEL` as `copy_readers_update`, which will then be applied to the `gpu_elem` at the end.
I think it's safe to remove everything to do with `copy_readers_update` (i.e., the fetch-and-op and all places where we set it) as the `readers` field in the final `gpu_elem` does not need to be adjusted.
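To make the suspected sequence concrete, here is a minimal, self-contained sketch of it using C11 atomics. It is not the PaRSEC code: `sketch_copy_t`, `sketch_reserve`, and `SKETCH_SENTINEL` are hypothetical names, and the sentinel value of 1024 is an assumption taken from the `[1024]` in the warning. The `reset_on_retry` flag only models whether the pending update is cleared before the allocation is retried.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Assumed value: chosen to match the "[1024]" in the warning above. */
#define SKETCH_SENTINEL 1024

/* Hypothetical stand-in for a device data copy; only the readers counter is modeled. */
typedef struct {
    _Atomic int32_t readers;
} sketch_copy_t;

/*
 * Models one eviction followed by a successful retry of the allocation.
 * reset_on_retry selects whether the pending update is cleared when
 * control returns to the allocation step.
 * Returns the readers count left on the newly allocated element.
 */
int32_t sketch_reserve(int reset_on_retry)
{
    sketch_copy_t lru_elem = { 0 };   /* eviction victim */
    sketch_copy_t gpu_elem = { 0 };   /* element obtained by the retry */
    int32_t copy_readers_update = 0;
    int32_t expected = 0;

    /* First pass: allocation failed, so block new readers on the victim. */
    if (atomic_compare_exchange_strong(&lru_elem.readers, &expected,
                                       SKETCH_SENTINEL)) {
        copy_readers_update = SKETCH_SENTINEL;
    }

    /* Victim released; control goes back to the allocation step. */
    if (reset_on_retry) {
        copy_readers_update = 0;
    }

    /* Second pass: allocation succeeds; any pending update lands on gpu_elem. */
    if (copy_readers_update != 0) {
        atomic_fetch_add(&gpu_elem.readers, copy_readers_update);
    }
    return atomic_load(&gpu_elem.readers);
}
```

In this model, `sketch_reserve(0)` leaves 1024 readers on the fresh element, which is the pattern the warning reports, while `sketch_reserve(1)` leaves 0. Whether the real code path can reach the final update without a reset is exactly what needs to be checked.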
Nobody can take that element. This entire function runs in the context of the thread handling the current device (where the copy is located), so it is protected. What the CAS protects against is another thread trying to use the copy as the source for a device-to-device transfer (this is not ownership).
We do not abandon the copy; we detach it from the old master and then repurpose it for another data. Once this is done, the readers count shall be 0 again.
When we go back to malloc_data, the first thing we do is reset copy_readers_update to zero.
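The role of the CAS described in this reply can be illustrated with a small sketch. This is a simplified protocol written for illustration, not the PaRSEC implementation: all `sketch_*` names are hypothetical, the sentinel value of 1024 is assumed from the warning, and the way a reader registers itself here (a CAS loop that refuses once the sentinel is set) is an assumption.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_SENTINEL 1024  /* assumed value, see the "[1024]" in the warning */

/* Hypothetical stand-in for a device data copy. */
typedef struct {
    _Atomic int32_t readers;
} sketch_copy_t;

/* Device-manager thread: block new readers, but only if there are none now. */
bool sketch_block_readers(sketch_copy_t *c)
{
    int32_t expected = 0;
    return atomic_compare_exchange_strong(&c->readers, &expected,
                                          SKETCH_SENTINEL);
}

/* Another thread: try to use the copy as a device-to-device source. */
bool sketch_try_acquire_reader(sketch_copy_t *c)
{
    int32_t cur = atomic_load(&c->readers);
    while (cur < SKETCH_SENTINEL) {
        /* On failure, cur is refreshed with the current value and we retry. */
        if (atomic_compare_exchange_weak(&c->readers, &cur, cur + 1)) {
            return true;
        }
    }
    return false;  /* copy is blocked, pick another source */
}

/* Reader is done with the device-to-device transfer. */
void sketch_release_reader(sketch_copy_t *c)
{
    atomic_fetch_sub(&c->readers, 1);
}

/* Copy detached from its old master and reused: readers starts at 0 again. */
void sketch_repurpose(sketch_copy_t *c)
{
    atomic_store(&c->readers, 0);
}
```

In this model the sentinel never expresses ownership: it only makes `sketch_try_acquire_reader` fail while the device thread detaches the copy, and `sketch_repurpose` returns the counter to 0 before the copy is attached to new data.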