Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Locking seems to be broken if the lock variable is runtime-allocated. #9

Open
dalcinl opened this issue May 30, 2021 · 9 comments
Open

Comments

@dalcinl
Copy link

dalcinl commented May 30, 2021

I'm running Fedora 33 with system packages openmpi-4.0.5, ucx-1.10.1, gcc-10.3.1

Reproducer (slight modification of the OpenSHMEM 1.5 spec document):

#include <shmem.h>
#include <stdio.h>

int main(void) {
  static int count = 0;
  
  shmem_init();

#if 0
  static long _lock  = 0;
  long *lock  = &_lock;
#else
  long *lock = shmem_malloc(sizeof(long));
  *lock = 0;
  shmem_barrier_all();
#endif
  
  int mype = shmem_my_pe();
  shmem_set_lock(lock);
  int val = shmem_g(&count, 0); /* get count value on PE 0 */
  printf("%d: count is %d\n", mype, val);
  val++; /* incrementing and updating count on PE 0 */
  shmem_p(&count, val, 0);
  shmem_clear_lock(lock); /* ensures count update completes before clearing the lock */
  shmem_finalize();
  return 0;
}
$ oshcc tmp2.c 

$ oshrun -n 5 ./a.out 
0: count is 0
2: count is 1
3: count is 2
4: count is 1
1: count is 3

$ oshrun -n 5 ./a.out 
1: count is 0
3: count is 0
4: count is 0
0: count is 1
2: count is 2

Note: If you switch #if 0 -> #if 1, then things work as expected:

$ oshrun -n 5 ./a.out 
2: count is 2
4: count is 0
0: count is 1
1: count is 3
3: count is 4
@tonycurtis
Copy link
Contributor

Working fine for me.

@dalcinl
Copy link
Author

dalcinl commented May 31, 2021

@tonycurtis What UCX version are you using? What system you tested on?

This is my package list from system repositories:

$ rpm -qa | egrep 'ucx|pmix'
pmix-3.2.3-1.fc33.x86_64
ucx-1.10.1-1.fc33.x86_64
pmix-devel-3.2.3-1.fc33.x86_64
ucx-devel-1.10.1-1.fc33.x86_64
ucx-cma-1.10.1-1.fc33.x86_64
pmix-pmi-3.2.3-1.fc33.x86_64
pmix-pmi-devel-3.2.3-1.fc33.x86_64

as well as mpiexec for launcher (Open MPI 4.0.5, also from system packages), and this is my osss-ucx configure line:

./configure --prefix=/home/devel/shmem/osss --with-ucx --with-pmix --enable-debug --enable-logging

Any tips on how to further debug the issue on my side?

@tonycurtis
Copy link
Contributor

tonycurtis commented May 31, 2021

UCX git latest, 1.10.1. Tested on 2 different ARM Infiniband clusters, plus Fedora x86 VM (with knem transport). Also PMIx 3.2.3.

@tonycurtis
Copy link
Contributor

I appear to have eventually triggered the behavior you see. My quiet() is still using a nearly-deprecated UCX call: will try updating.

@tonycurtis
Copy link
Contributor

The cluster I was using went down over the weekend for some reason, just managed to find somewhere else that demonstrates this behavior. I suspect my lock code is aging badly with relaxed memory on e.g. ARM. Will take more investigation/rethink.

@dalcinl
Copy link
Author

dalcinl commented Jun 3, 2021

Please note that I'm running the reproducer in a rather unremarkable x64_64 desktop machine, Linux 5.12.7, no knem.

@tonycurtis
Copy link
Contributor

Yeah, it's behaving differently on 2 ARM systems too. The lock code comes from an earlier shmem implementation, may need to modernize. memory consistency is becoming much more difficult.

@tonycurtis
Copy link
Contributor

Have a go now. I think I've fixed it (make sure to be using the "main" branch)

@dalcinl
Copy link
Author

dalcinl commented Jun 9, 2021

Now have problems with shmem_test_lock, run the following with at least two PEs. It is almost the same code, but instead of shmem_set_lock(), I'm using while (shmem_test_lock); :

#include <shmem.h>
#include <stdio.h>

int main(void) {
  static int count = 0;
  
  shmem_init();

#if 0
  static long _lock  = 0;
  long *lock  = &_lock;
#else
  long *lock = shmem_malloc(sizeof(long));
  *lock = 0;
  shmem_barrier_all();
#endif
  
  int mype = shmem_my_pe();

  while (shmem_test_lock(lock));
  
  int val = shmem_g(&count, 0); /* get count value on PE 0 */
  printf("%d: count is %d\n", mype, val);
  val++; /* incrementing and updating count on PE 0 */
  shmem_p(&count, val, 0);
  shmem_clear_lock(lock); /* ensures count update completes before clearing the lock */
  shmem_finalize();
  return 0;
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants