OLCF/Summit: Issues on running MPICH on Summit #6142
-
Problem

The current MPICH does not run in the current Summit environment with …

How to reproduce the issue

I used the latest MPICH with the default MPI stack (IBM's Spectrum MPI and its PMIx) and configured as follows after loading the CUDA and GCC modules:

./configure --with-device=ch4:ucx --prefix=$HOME/software/ci-build --enable-ch4-am-only \
    --with-pm=none --with-pmix=$MPI_ROOT
# if you use CUDA
# ./configure --with-device=ch4:ucx --prefix=$HOME/software/ci-build --enable-ch4-am-only \
#     --enable-gpu-tests-only --with-cuda=$CUDAPATH --with-pm=none --with-pmix=$MPI_ROOT

It shows the following error when I run:

bash-4.2$ echo "$MPI_ROOT"
/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/xl-16.1.1-5/spectrum-mpi-10.3.1.2-.../
bash-4.2$ jsrun -n 1 -r 1 -a 1 -g 1 --smpiargs="-disable_gpu_hooks" ./cpi
Abort(201935621) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Comm_size: Invalid communicator, error stack:
PMPI_Comm_size(109): MPI_Comm_size(comm=0x18d12a0, size=0x2000000b049c) failed
PMPI_Comm_size(66).: Invalid communicator

When I use GDB on a compute node, the error seems to be in … Note that …
It seems that, with exactly the same …
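(For reference, a minimal sketch of the call sequence that trips the abort above; this is not the actual cpi source, but MPI_Comm_size on MPI_COMM_WORLD is where the error stack reports the failure.)

#include <stdio.h>
#include <mpi.h>

/* Minimal reproducer sketch: the failure already occurs in the
 * MPI_Init / MPI_Comm_size sequence that cpi performs at startup. */
int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* aborts with "Invalid communicator" here */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}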
-
Workaround (example: running two processes)

The workaround is to set the hosts manually, which at least worked on Summit. The following worked with MPICH commit 404cd8a.
# You also need to install newer libtool/autotools, etc., to compile MPICH.
module load cuda/11.0.3 gcc/9.1.0
# CUDA-enabled build (installed to $MPICH_CUDA_PATH)
./configure --with-device=ch4:ucx --prefix=$MPICH_CUDA_PATH --enable-ch4-am-only --enable-gpu-tests-only --with-cuda="$(realpath $(dirname $(which nvcc))/..)" CC=gcc CXX=gcc
# Non-CUDA build (installed to $MPICH_NOCUDA_PATH) to get an mpiexec that does not need CUDA
./configure --with-device=ch4:ucx --prefix=$MPICH_NOCUDA_PATH --enable-ch4-am-only CC=gcc CXX=gcc
bsub -W 2:00 -nnodes 2 -P csc371 -Is $SHELL
# Get LD_LIBRARY_PATH on a compute node
# $ jsrun -n 1 -r 1 echo $LD_LIBRARY_PATH
# Get a list of accessible hosts
# $ jsrun -n 2 -r 1 hostname | paste -d, -s -
echo "# two nodes, one process per node"
echo "LD_LIBRARY_PATH=$(jsrun -n 1 -r 1 echo $LD_LIBRARY_PATH) ${MPICH_NOCUDA_PATH}/bin/mpiexec -host $(jsrun -n 2 -r 1 hostname | paste -d, -s -) -n 2 <APP>"
echo ""
echo "# one node, two processes"
echo "LD_LIBRARY_PATH=$(jsrun -n 1 -r 1 echo $LD_LIBRARY_PATH) ${MPICH_NOCUDA_PATH}/bin/mpiexec -env CUDA_VISIBLE_DEVICES 0 -n 1 <APP> : -env CUDA_VISIBLE_DEVICES 1 -n 1 <APP>"
ssh $(jsrun -n 1 -r 1 hostname)
Note: on Summit, the home directory is read-only from compute nodes.
-
It was caused by Darshan; use …
-
Thanks. I will check it.
-
It worked. Thanks, @hzhou! I updated the MPICH wiki (https://wiki.mpich.org/mpich/index.php/Summit) and will close this issue.
-
Liked the Summit wiki!
-
After MPICH/main configure:
…

Note: $MPI_ROOT was set by …

Compile test program:
…

Execution command with an interactive allocation (two ranks running on a single node):
…

Error

Abort at …

Some debugging notes
…

Naive fix

Increase …
-
@raffenet Can you please suggest the right fix for the above PMIx bug? I did a naive fix (increasing the value of …).
-
TODO: Both hydra and jsrun work with mpich/main on Summit now. Going to write a note at https://wiki.mpich.org/mpich/index.php/Summit [DONE]
-
Changing the …
In …
-
But such a limit does not exist in PMIx anymore (I haven't read the PMIx spec carefully enough, so please correct me if I'm wrong), and now the temporary recvbuf is allocated internally by PMIx.
An initial thought is that we might need to modify …
-
The user still needs to allocate the recv buffer. I think the reason to have … In fact, the very bug here is the recv buffer overflow, right?
-
Oh, the tricky part is that we do not put/get the original message directly; we transmit the encoded message, which is bigger than the original message and thus won't fit into the user-allocated buffer. I guess we can assume the encoded message is double the size of the original message, allocate that size for the recv buffer, and modify … If you worry about performance, we can always set …
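As a rough illustration of the sizing argument (a standalone sketch, not the actual PMI encoder): a hex-style ASCII encoding emits two characters per input byte, so a receive buffer sized for the original value cannot hold the encoded form.

#include <stdio.h>
#include <string.h>

/* Sketch: hex-encode a binary value the way an ASCII-only KVS
 * (PMI1/PMI2-style) must. Output is 2x the input size plus a NUL. */
static void hex_encode(const unsigned char *in, size_t n, char *out)
{
    for (size_t i = 0; i < n; i++)
        sprintf(out + 2 * i, "%02x", (unsigned) in[i]);
}

int main(void)
{
    unsigned char orig[8] = {0, 1, 2, 3, 0xaa, 0xbb, 0xcc, 0xdd};
    char encoded[2 * sizeof(orig) + 1];

    hex_encode(orig, sizeof(orig), encoded);
    /* 8 original bytes become 16 characters: a recv buffer sized
     * for sizeof(orig) overflows once the encoded string arrives. */
    printf("original %zu bytes -> encoded %zu chars\n",
           sizeof(orig), strlen(encoded));
    return 0;
}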
-
@hzhou I don't understand the PMI code well enough to make a design decision now. I will try to spend more time on it and fix it later. I guess the fix is not super urgent, as we can work around it by either increasing the buffer or switching to hydra on Summit.
-
One thing we can investigate with PMIx is using …
-
@raffenet why do we have the encode/decode steps in PMI1/PMI2?
-
Because the PMI1/PMI2 protocol only handles ASCII strings, I believe.
-
That's right. Only PMIx supports binary blob data.
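For reference, a minimal sketch of publishing a binary blob directly through PMIx with no ASCII encoding step. This is not MPICH's actual code; the key name "ucx-addr" and the blob are illustrative, and PMIx_Init is assumed to have been called already.

#include <pmix.h>

/* Sketch: publish a binary blob via the PMIx byte-object type.
 * PMIx copies the value internally, so the caller keeps ownership
 * of the blob memory. */
static pmix_status_t put_blob(const char *blob, size_t len)
{
    pmix_value_t val;
    pmix_status_t rc;

    PMIX_VALUE_CONSTRUCT(&val);
    val.type = PMIX_BYTE_OBJECT;
    val.data.bo.bytes = (char *) blob;   /* raw bytes, no encoding */
    val.data.bo.size = len;

    rc = PMIx_Put(PMIX_GLOBAL, "ucx-addr", &val);
    if (rc == PMIX_SUCCESS)
        rc = PMIx_Commit();
    return rc;
}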
-
I tried to follow the instructions on the wiki but didn't get it working (trying both the commit mentioned on the wiki as well as the current …):

jsrun --nrs 6 --tasks_per_rs 1 --cpu_per_rs 7 --gpu_per_rs 1 --rs_per_host 6 --smpiargs="-disable_gpu_hooks" ./myapp
[1642360280.626409] [h36n14:2999445:0] address.c:1059 UCX ERROR failed to parse address: number of addresses exceeds 128
[1642360280.626413] [h36n14:2999447:0] address.c:1059 UCX ERROR failed to parse address: number of addresses exceeds 128
[1642360280.626411] [h36n14:2999448:0] address.c:1059 UCX ERROR failed to parse address: number of addresses exceeds 128
[1642360280.626413] [h36n14:2999446:0] address.c:1059 UCX ERROR failed to parse address: number of addresses exceeds 128
[1642360280.626411] [h36n14:2999449:0] address.c:1059 UCX ERROR failed to parse address: number of addresses exceeds 128
[1642360280.626413] [h36n14:2999450:0] address.c:1059 UCX ERROR failed to parse address: number of addresses exceeds 128
Abort(138006287) on node 3 (rank 3 in comm 0): Fatal error in internal_Init: Other MPI error, error stack:
internal_Init(59).............: MPI_Init(argc=0x7fffc40ab630, argv=0x7fffc40ab638) failed
MPII_Init_thread(217).........:
MPIR_init_comm_world(34)......:
MPIR_Comm_commit(722).........:
MPIR_Comm_commit_internal(510):
MPID_Comm_commit_pre_hook(158):
MPIDI_UCX_init_world(288).....:
initial_address_exchange(145).: ucx function returned with failed status(ucx_init.c 145 initial_address_exchange Invalid parameter)
Abort(272224015) on node 5 (rank 5 in comm 0): Fatal error in internal_Init: Other MPI error, error stack:
internal_Init(59).............: MPI_Init(argc=0x7fffebfe66f0, argv=0x7fffebfe66f8) failed
MPII_Init_thread(217).........:
MPIR_init_comm_world(34)......:
MPIR_Comm_commit(722).........:
MPIR_Comm_commit_internal(510):
MPID_Comm_commit_pre_hook(158):
MPIDI_UCX_init_world(288).....:
initial_address_exchange(145).: ucx function returned with failed status(ucx_init.c 145 initial_address_exchange Invalid parameter)
Abort(3788559) on node 0 (rank 0 in comm 0): Fatal error in internal_Init: Other MPI error, error stack:
internal_Init(59).............: MPI_Init(argc=0x7ffff63599d0, argv=0x7ffff63599d8) failed
MPII_Init_thread(217).........:
MPIR_init_comm_world(34)......:
MPIR_Comm_commit(722).........:
MPIR_Comm_commit_internal(510):
MPID_Comm_commit_pre_hook(158):
MPIDI_UCX_init_world(288).....:
initial_address_exchange(145).: ucx function returned with failed status(ucx_init.c 145 initial_address_exchange Invalid parameter)
[h36n14:2999450:0:2999450] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
[h36n14:2999445:0:2999445] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
Abort(406441743) on node 4 (rank 4 in comm 0): Fatal error in internal_Init: Other MPI error, error stack:
internal_Init(59).............: MPI_Init(argc=0x7ffff5ab77a0, argv=0x7ffff5ab77a8) failed
MPII_Init_thread(217).........:
MPIR_init_comm_world(34)......:
MPIR_Comm_commit(722).........:
MPIR_Comm_commit_internal(510):
MPID_Comm_commit_pre_hook(158):
MPIDI_UCX_init_world(288).....:
initial_address_exchange(145).: ucx function returned with failed status(ucx_init.c 145 initial_address_exchange Invalid parameter)
[h36n14:2999449:0:2999449] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
Abort(943312655) on node 2 (rank 2 in comm 0): Fatal error in internal_Init: Other MPI error, error stack:
internal_Init(59).............: MPI_Init(argc=0x7ffff3b8e630, argv=0x7ffff3b8e638) failed
MPII_Init_thread(217).........:
MPIR_init_comm_world(34)......:
MPIR_Comm_commit(722).........:
MPIR_Comm_commit_internal(510):
MPID_Comm_commit_pre_hook(158):
MPIDI_UCX_init_world(288).....:
initial_address_exchange(145).: ucx function returned with failed status(ucx_init.c 145 initial_address_exchange Invalid parameter)
[h36n14:2999447:0:2999447] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
As far as I can tell this is different from the errors reported so far. Shall I open a new issue or keep it here (as the issue title still fits)?
-
@pgrete Which mpich version were you testing?
-
I tried commit …