Support for oversubscription broken #601

Open
devreal opened this issue Nov 29, 2023 · 1 comment
Labels
bug Something isn't working

Comments

@devreal
Contributor

devreal commented Nov 29, 2023

Describe the bug

We see the following warnings (followed by asserts in debug mode) when memory on the device is tight:

W@00000 GPU[hip(0)]:	Write access to data copy 0x7fbe35bdbb10 [ref_count 1] with existing readers [1024] (possible anti-dependency,
or concurrent accesses), please prevent that with CTL dependencies

The 1024 is suspicious and points us to #575. @therault and I found that the rollback of the CAS is wrong. The CAS is done on an element that we are about to abandon, and it is only there to block someone else from taking the element, so there is no need to roll back the CAS.

Once we have released the LRU element, we go back to malloc_data. Now there is a pretty good chance that the zone_alloc succeeds. We still have PARSEC_CUDA_DATA_COPY_ATOMIC_SENTINEL as copy_readers_update, which will then be applied to the gpu_elem at the end.

I think it's safe to remove everything to do with copy_readers_update (i.e., the fetch-and-op and all places where we set it) as the readers field in the final gpu_elem does not need to be adjusted.
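The suspected sequence can be modeled in a few lines of C11. This is only a sketch under stated assumptions, not the PaRSEC code: `copy_t`, `SENTINEL`, and `readers_after_retry` are hypothetical names, the value 1024 is chosen only because it matches the count in the warning, and the final update is assumed to be an atomic add. The `reset_on_retry` flag makes the one deciding question explicit: is `copy_readers_update` cleared before the retry or not?

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for PARSEC_CUDA_DATA_COPY_ATOMIC_SENTINEL.
 * The value 1024 is an assumption taken from the warning message. */
#define SENTINEL 1024

/* Hypothetical stand-in for a GPU data copy: only the readers count. */
typedef struct {
    atomic_int readers;
} copy_t;

/* Models one pass through the allocation path and returns the readers
 * count that the finally allocated element ends up with. */
static int readers_after_retry(bool reset_on_retry)
{
    copy_t lru, fresh;
    atomic_init(&lru.readers, 0);
    atomic_init(&fresh.readers, 0);

    int copy_readers_update = 0;
    int expected = 0;

    /* Block the LRU victim by swinging readers from 0 to SENTINEL. */
    if (atomic_compare_exchange_strong(&lru.readers, &expected, SENTINEL)) {
        copy_readers_update = SENTINEL; /* pending adjustment */
    }

    /* The victim is released and control returns to the retry label. */
    if (reset_on_retry) {
        copy_readers_update = 0;
    }

    /* The allocation now succeeds and yields a different element; the
     * pending adjustment is applied to it (assumed to be an add). */
    atomic_fetch_add(&fresh.readers, copy_readers_update);
    return atomic_load(&fresh.readers);
}
```

In this model, `readers_after_retry(false)` leaves the sentinel on the fresh element, while `readers_after_retry(true)` leaves it at zero, so the hypothesis stands or falls with whether the real code performs that reset.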

@devreal devreal added the bug Something isn't working label Nov 29, 2023
@bosilca
Contributor

bosilca commented Feb 2, 2024

I don't think this analysis is correct.

  1. Nobody can take that element. This entire function runs in the context of the thread handling the current device (where the copy is located), so it is protected. What that CAS protects against is another thread trying to use the copy as the source for a device-to-device transfer (this is not about ownership).
  2. We do not abandon the copy; we detach it from the old master and then repurpose it for another data. Once this is done, the readers shall be 0 again.
  3. When we go back to malloc_data, the first thing we do is reset copy_readers_update to zero.
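The protocol described in these three points can be sketched with C11 atomics. This is a simplified illustration under assumptions, not the actual implementation: `copy_t`, `SENTINEL`, `begin_repurpose`, `try_read_for_d2d`, and `finish_repurpose` are hypothetical names, and the way a reader registers itself here (a CAS loop that refuses to go past the sentinel) is one possible scheme among several.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define SENTINEL 1024 /* hypothetical value, for illustration only */

/* Hypothetical stand-in for a device data copy. */
typedef struct {
    atomic_int readers;
    int master_key; /* which data this copy currently belongs to */
} copy_t;

/* Device-manager thread: mark the copy as being repurposed.
 * Succeeds only if nobody is currently reading it. */
static bool begin_repurpose(copy_t *c)
{
    int expected = 0;
    return atomic_compare_exchange_strong(&c->readers, &expected, SENTINEL);
}

/* Any other thread: try to register as a reader so the copy can be
 * used as the source of a device-to-device transfer. */
static bool try_read_for_d2d(copy_t *c)
{
    int seen = atomic_load(&c->readers);
    while (seen < SENTINEL) {
        /* On failure, seen is refreshed with the current value. */
        if (atomic_compare_exchange_weak(&c->readers, &seen, seen + 1)) {
            return true;
        }
    }
    return false; /* the copy is being repurposed: do not use it */
}

/* Device-manager thread: attach the copy to its new data and make
 * it usable again with a readers count of zero. */
static void finish_repurpose(copy_t *c, int new_master_key)
{
    c->master_key = new_master_key;
    atomic_store(&c->readers, 0);
}
```

The point of the sketch is that the sentinel never transfers ownership. It only makes `try_read_for_d2d` fail while the device thread detaches and reattaches the copy, after which the count is back at zero.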
