OLCF/Summit: Issues on running MPICH on Summit #6142
-
Problem

The current MPICH does not run in the current Summit environment with …

How to reproduce the issue

I used the latest MPICH with the default MPI stack (IBM's Spectrum MPI and its PMIx) and configured as follows after loading the CUDA and GCC modules:

./configure --with-device=ch4:ucx --prefix=$HOME/software/ci-build --enable-ch4-am-only \
    --with-pm=none --with-pmix=$MPI_ROOT
# if you use CUDA
# ./configure --with-device=ch4:ucx --prefix=$HOME/software/ci-build --enable-ch4-am-only \
#     --enable-gpu-tests-only --with-cuda=$CUDAPATH --with-pm=none --with-pmix=$MPI_ROOT

It shows the following error when I run:

bash-4.2$ echo "$MPI_ROOT"
/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/xl-16.1.1-5/spectrum-mpi-10.3.1.2-.../
bash-4.2$ jsrun -n 1 -r 1 -a 1 -g 1 --smpiargs="-disable_gpu_hooks" ./cpi
Abort(201935621) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Comm_size: Invalid communicator, error stack:
PMPI_Comm_size(109): MPI_Comm_size(comm=0x18d12a0, size=0x2000000b049c) failed
PMPI_Comm_size(66).: Invalid communicator

When I use GDB on a compute node, the error seems to be in … Note that …
It seems that, with exactly the same …
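(For reference, a minimal sketch of the call sequence that trips the abort above; this is not the actual cpi source, but MPI_Comm_size on MPI_COMM_WORLD is where the error stack reports the failure.)

#include <stdio.h>
#include <mpi.h>

/* Minimal reproducer sketch: the failure already occurs in the
 * MPI_Init / MPI_Comm_size sequence that cpi performs at startup. */
int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* aborts with "Invalid communicator" here */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}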
-
Workaround (example: running two processes)

The workaround is to set the hosts manually, which at least worked on Summit. The following worked with MPICH commit 404cd8a.
# You also need to install newer libtool/autotools, etc., to compile MPICH.
module load cuda/11.0.3 gcc/9.1.0
# CUDA-enabled build (installed to $MPICH_CUDA_PATH)
./configure --with-device=ch4:ucx --prefix=$MPICH_CUDA_PATH --enable-ch4-am-only --enable-gpu-tests-only --with-cuda="$(realpath $(dirname $(which nvcc))/..)" CC=gcc CXX=gcc
# Non-CUDA build (installed to $MPICH_NOCUDA_PATH) to get an mpiexec that does not need CUDA
./configure --with-device=ch4:ucx --prefix=$MPICH_NOCUDA_PATH --enable-ch4-am-only CC=gcc CXX=gcc
bsub -W 2:00 -nnodes 2 -P csc371 -Is $SHELL
# Get LD_LIBRARY_PATH on a compute node
# $ jsrun -n 1 -r 1 echo $LD_LIBRARY_PATH
# Get a list of accessible hosts
# $ jsrun -n 2 -r 1 hostname | paste -d, -s -
echo "# two nodes, one process per node"
echo "LD_LIBRARY_PATH=$(jsrun -n 1 -r 1 echo $LD_LIBRARY_PATH) ${MPICH_NOCUDA_PATH}/bin/mpiexec -host $(jsrun -n 2 -r 1 hostname | paste -d, -s -) -n 2 <APP>"
echo ""
echo "# one node, two processes"
echo "LD_LIBRARY_PATH=$(jsrun -n 1 -r 1 echo $LD_LIBRARY_PATH) ${MPICH_NOCUDA_PATH}/bin/mpiexec -env CUDA_VISIBLE_DEVICES 0 -n 1 <APP> : -env CUDA_VISIBLE_DEVICES 1 -n 1 <APP>"
ssh $(jsrun -n 1 -r 1 hostname)
Note: on Summit, the home directory is read-only from compute nodes.
-
It was caused by Darshan; use …
-
Thanks. I will check it.
-
It worked. Thanks, @hzhou! I updated the MPICH wiki (https://wiki.mpich.org/mpich/index.php/Summit) and will close this issue.
-
Liked the Summit wiki!
-
After MPICH/main configure:
…

Note: $MPI_ROOT was set by …

Compile test program:
…

Execution command with an interactive allocation (two ranks running on a single node):
…

Error

Abort at …

Some debugging notes
…

Naive fix

Increase …
-
@raffenet Can you please suggest the right fix for the above PMIx bug? I did a naive fix (increasing the value of …).
-
TODO: Both hydra and jsrun work with mpich/main on Summit now. Going to write a note at https://wiki.mpich.org/mpich/index.php/Summit [DONE]
-
Changing the …
In …
-
But such a limit does not exist in PMIx anymore (I haven't read the PMIx spec carefully enough, so please correct me if I'm wrong), and now the temporary recvbuf is allocated internally by PMIx.
An initial thought is that we might need to modify …
-
The user still needs to allocate the recv buffer. I think the reason to have … In fact, the very bug here is the recv buffer overflow, right?
-
Oh, the tricky part is that we do not put/get the original message directly; we transmit the encoded message, which is bigger than the original message and thus won't fit into the user-allocated buffer. I guess we can assume the encoded message is double the size of the original message, allocate that size for the recv buffer, and modify … If you worry about performance, we can always set …
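As a rough illustration of the sizing argument (a standalone sketch, not the actual PMI encoder): a hex-style ASCII encoding emits two characters per input byte, so a receive buffer sized for the original value cannot hold the encoded form.

#include <stdio.h>
#include <string.h>

/* Sketch: hex-encode a binary value the way an ASCII-only KVS
 * (PMI1/PMI2-style) must. Output is 2x the input size plus a NUL. */
static void hex_encode(const unsigned char *in, size_t n, char *out)
{
    for (size_t i = 0; i < n; i++)
        sprintf(out + 2 * i, "%02x", (unsigned) in[i]);
}

int main(void)
{
    unsigned char orig[8] = {0, 1, 2, 3, 0xaa, 0xbb, 0xcc, 0xdd};
    char encoded[2 * sizeof(orig) + 1];

    hex_encode(orig, sizeof(orig), encoded);
    /* 8 original bytes become 16 characters: a recv buffer sized
     * for sizeof(orig) overflows once the encoded string arrives. */
    printf("original %zu bytes -> encoded %zu chars\n",
           sizeof(orig), strlen(encoded));
    return 0;
}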
-
@hzhou I don't understand the PMI code well enough to make a design decision now. I will try to spend more time on it and fix it later. I guess the fix is not super urgent, as we can work around it by either increasing the buffer or switching to hydra on Summit.
-
One thing we can investigate with PMIx is using …
-
@raffenet why do we have the encode/decode steps in PMI1/PMI2?
-
Because the PMI1/PMI2 protocol only handles ASCII strings, I believe.
-
That's right. Only PMIx supports binary blob data.
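For reference, a minimal sketch of publishing a binary blob directly through PMIx with no ASCII encoding step. This is not MPICH's actual code; the key name "ucx-addr" and the blob are illustrative, and PMIx_Init is assumed to have been called already.

#include <pmix.h>

/* Sketch: publish a binary blob via the PMIx byte-object type.
 * PMIx copies the value internally, so the caller keeps ownership
 * of the blob memory. */
static pmix_status_t put_blob(const char *blob, size_t len)
{
    pmix_value_t val;
    pmix_status_t rc;

    PMIX_VALUE_CONSTRUCT(&val);
    val.type = PMIX_BYTE_OBJECT;
    val.data.bo.bytes = (char *) blob;   /* raw bytes, no encoding */
    val.data.bo.size = len;

    rc = PMIx_Put(PMIX_GLOBAL, "ucx-addr", &val);
    if (rc == PMIX_SUCCESS)
        rc = PMIx_Commit();
    return rc;
}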
-
I tried to follow the instructions on the wiki but didn't get it working (trying both the commit mentioned on the wiki as well as the current …):

jsrun --nrs 6 --tasks_per_rs 1 --cpu_per_rs 7 --gpu_per_rs 1 --rs_per_host 6 --smpiargs="-disable_gpu_hooks" ./myapp
[1642360280.626409] [h36n14:2999445:0] address.c:1059 UCX ERROR failed to parse address: number of addresses exceeds 128
[1642360280.626413] [h36n14:2999447:0] address.c:1059 UCX ERROR failed to parse address: number of addresses exceeds 128
[1642360280.626411] [h36n14:2999448:0] address.c:1059 UCX ERROR failed to parse address: number of addresses exceeds 128
[1642360280.626413] [h36n14:2999446:0] address.c:1059 UCX ERROR failed to parse address: number of addresses exceeds 128
[1642360280.626411] [h36n14:2999449:0] address.c:1059 UCX ERROR failed to parse address: number of addresses exceeds 128
[1642360280.626413] [h36n14:2999450:0] address.c:1059 UCX ERROR failed to parse address: number of addresses exceeds 128
Abort(138006287) on node 3 (rank 3 in comm 0): Fatal error in internal_Init: Other MPI error, error stack:
internal_Init(59).............: MPI_Init(argc=0x7fffc40ab630, argv=0x7fffc40ab638) failed
MPII_Init_thread(217).........:
MPIR_init_comm_world(34)......:
MPIR_Comm_commit(722).........:
MPIR_Comm_commit_internal(510):
MPID_Comm_commit_pre_hook(158):
MPIDI_UCX_init_world(288).....:
initial_address_exchange(145).: ucx function returned with failed status(ucx_init.c 145 initial_address_exchange Invalid parameter)
Abort(272224015) on node 5 (rank 5 in comm 0): Fatal error in internal_Init: Other MPI error, error stack:
internal_Init(59).............: MPI_Init(argc=0x7fffebfe66f0, argv=0x7fffebfe66f8) failed
MPII_Init_thread(217).........:
MPIR_init_comm_world(34)......:
MPIR_Comm_commit(722).........:
MPIR_Comm_commit_internal(510):
MPID_Comm_commit_pre_hook(158):
MPIDI_UCX_init_world(288).....:
initial_address_exchange(145).: ucx function returned with failed status(ucx_init.c 145 initial_address_exchange Invalid parameter)
Abort(3788559) on node 0 (rank 0 in comm 0): Fatal error in internal_Init: Other MPI error, error stack:
internal_Init(59).............: MPI_Init(argc=0x7ffff63599d0, argv=0x7ffff63599d8) failed
MPII_Init_thread(217).........:
MPIR_init_comm_world(34)......:
MPIR_Comm_commit(722).........:
MPIR_Comm_commit_internal(510):
MPID_Comm_commit_pre_hook(158):
MPIDI_UCX_init_world(288).....:
initial_address_exchange(145).: ucx function returned with failed status(ucx_init.c 145 initial_address_exchange Invalid parameter)
[h36n14:2999450:0:2999450] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
[h36n14:2999445:0:2999445] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
Abort(406441743) on node 4 (rank 4 in comm 0): Fatal error in internal_Init: Other MPI error, error stack:
internal_Init(59).............: MPI_Init(argc=0x7ffff5ab77a0, argv=0x7ffff5ab77a8) failed
MPII_Init_thread(217).........:
MPIR_init_comm_world(34)......:
MPIR_Comm_commit(722).........:
MPIR_Comm_commit_internal(510):
MPID_Comm_commit_pre_hook(158):
MPIDI_UCX_init_world(288).....:
initial_address_exchange(145).: ucx function returned with failed status(ucx_init.c 145 initial_address_exchange Invalid parameter)
[h36n14:2999449:0:2999449] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
Abort(943312655) on node 2 (rank 2 in comm 0): Fatal error in internal_Init: Other MPI error, error stack:
internal_Init(59).............: MPI_Init(argc=0x7ffff3b8e630, argv=0x7ffff3b8e638) failed
MPII_Init_thread(217).........:
MPIR_init_comm_world(34)......:
MPIR_Comm_commit(722).........:
MPIR_Comm_commit_internal(510):
MPID_Comm_commit_pre_hook(158):
MPIDI_UCX_init_world(288).....:
initial_address_exchange(145).: ucx function returned with failed status(ucx_init.c 145 initial_address_exchange Invalid parameter)
[h36n14:2999447:0:2999447] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
As far as I can tell this is different from the errors reported so far. Shall I open a new issue or keep it here (as the issue title still fits)?
-
@pgrete Which mpich version were you testing?
-
I tried commit …