Locking seems to be broken if the lock variable is runtime-allocated. #9
Comments
Working fine for me. |
@tonycurtis What UCX version are you using? What system did you test on? This is my package list from system repositories:
as well as mpiexec as the launcher (Open MPI 4.0.5, also from system packages), and this is my osss-ucx configure line:
Any tips on how to further debug the issue on my side? |
UCX git latest, 1.10.1. Tested on 2 different ARM Infiniband clusters, plus Fedora x86 VM (with knem transport). Also PMIx 3.2.3. |
I appear to have eventually triggered the behavior you see. My quiet() is still using a nearly-deprecated UCX call: will try updating. |
The cluster I was using went down over the weekend for some reason, just managed to find somewhere else that demonstrates this behavior. I suspect my lock code is aging badly with relaxed memory on e.g. ARM. Will take more investigation/rethink. |
Please note that I'm running the reproducer on a rather unremarkable x86_64 desktop machine, Linux 5.12.7, no knem. |
Yeah, it's behaving differently on 2 ARM systems too. The lock code comes from an earlier shmem implementation and may need modernizing. Memory consistency is becoming much more difficult. |
Have a go now. I think I've fixed it (make sure to be using the "main" branch) |
Now have problems with
#include <shmem.h>
#include <stdio.h>

int main(void) {
    static int count = 0;
    shmem_init();
#if 0
    static long _lock = 0;
    long *lock = &_lock;
#else
    long *lock = shmem_malloc(sizeof(long));
    *lock = 0;
    shmem_barrier_all();
#endif
    int mype = shmem_my_pe();
    while (shmem_test_lock(lock));
    int val = shmem_g(&count, 0); /* get count value on PE 0 */
    printf("%d: count is %d\n", mype, val);
    val++; /* incrementing and updating count on PE 0 */
    shmem_p(&count, val, 0);
    shmem_clear_lock(lock); /* ensures count update completes before clearing the lock */
    shmem_finalize();
    return 0;
}
I'm running Fedora 33 with system packages openmpi-4.0.5, ucx-1.10.1, gcc-10.3.1
Reproducer (slight modification of the OpenSHMEM 1.5 spec document):
Note: If you switch #if 0 -> #if 1, then things work as expected: