This is not our bug and we will not fix it, but the details are documented here for posterity.
There is a bug in Intel MPI 2021.10 and Cray MPI 8.1.29 when using request-based RMA (#53). It could be an MPICH bug in the argument-checking macros, but I tested MPICH 4.2 extensively today and the problem does not appear there.
In `MPI_Rget_accumulate(NULL, 0, MPI_BYTE, .. , MPI_NO_OP, ..)`, the implementation incorrectly reports that `MPI_BYTE` has not been committed.
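For completeness, here is a minimal standalone sketch of the failing call pattern, independent of ARMCI-MPI. The window setup, buffer sizes, and target rank are my own choices for illustration, not taken from the library; a compliant implementation should accept the zero-count `MPI_BYTE` origin tuple without complaint.

```c
/* Sketch: zero-count MPI_BYTE origin buffer with MPI_NO_OP, the same
 * call pattern that the buggy implementations reject. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Win win;
    long *base;
    MPI_Win_allocate(sizeof(long), sizeof(long), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &base, &win);
    base[0] = rank;
    MPI_Barrier(MPI_COMM_WORLD);

    MPI_Win_lock_all(MPI_MODE_NOCHECK, win);

    long result = -1;
    MPI_Request req;
    /* The (NULL, 0, MPI_BYTE) origin tuple is unused with MPI_NO_OP;
     * the affected implementations claim MPI_BYTE is "not committed". */
    MPI_Rget_accumulate(NULL, 0, MPI_BYTE,
                        &result, 1, MPI_LONG,
                        0 /* target rank */, 0 /* disp */, 1, MPI_LONG,
                        MPI_NO_OP, win, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    MPI_Win_unlock_all(win);

    printf("rank %d read %ld from rank 0\n", rank, result);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```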
Reproduce by running this in e.g. /tmp:

```sh
. /opt/intel/oneapi/setvars.sh --force
git clone --depth 1 https://github.com/jeffhammond/armci-mpi -b request-based-rma
cd armci-mpi
./autogen.sh
mkdir build
cd build
../configure CC=/opt/intel/oneapi/mpi/2021.10.0/bin/mpicc --enable-g
make -j checkprogs
export ARMCI_VERBOSE=1
mpirun -n 4 ./tests/contrib/armci-test # this fails
export ARMCI_RMA_ATOMICITY=0 # this disables MPI_Rget_accumulate(MPI_NO_OP)
mpirun -n 4 ./tests/contrib/armci-test # this works
```
It fails here:

```
Testing non-blocking gets and puts
local[0:2] -> remote[0:2] -> local[1:3]
local[1:3,0:0] -> remote[1:3,0:0] -> local[1:3,1:1]
local[2:3,0:1,2:3] -> remote[2:3,0:1,2:3] -> local[1:2,0:1,2:3]
local[2:2,1:1,3:5,1:5] -> remote[4:4,0:0,1:3,1:5] -> local[3:3,1:1,1:3,2:6]
local[1:4,1:1,0:0,2:6,0:2] -> remote[1:4,2:2,1:1,2:6,1:3] -> local[0:3,1:1,5:5,2:6,2:4]
local[1:4,0:2,1:7,5:6,0:6,1:2] -> remote[0:3,0:2,1:7,7:8,0:6,0:1] -> local[0:3,0:2,0:6,3:4,0:6,0:1]
local[3:4,0:1,0:0,5:7,5:6,0:1,0:1] -> remote[1:2,0:1,0:0,5:7,2:3,0:1,0:1] -> local[0:1,0:1,4:4,2:4,3:4,0:1,0:1]
Abort(336723971) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Rget_accumulate: Invalid datatype, error stack:
PMPI_Rget_accumulate(218): MPI_Rget_accumulate(origin_addr=(nil), origin_count=0, MPI_BYTE, result_addr=0x60d4e105cfd0, result_count=1, dtype=USER<contig>, target_rank=3, target_disp=8, target_count=1, dtype=USER<contig>, MPI_NO_OP, win=0xa0000001, 0x7ffc044c9558) failed
PMPI_Rget_accumulate(159): Datatype has not been committed
```
`MPI_BYTE` is a predefined datatype and does not need to be committed.
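For contrast, a small sketch (not from ARMCI-MPI) of the only case where `MPI_Type_commit` is actually required, namely a derived datatype like the `USER<contig>` types in the error stack above:

```c
/* Sketch: only derived datatypes need MPI_Type_commit before use in
 * communication or RMA; predefined datatypes such as MPI_BYTE are
 * always in the committed state. */
MPI_Datatype contig;
MPI_Type_contiguous(4, MPI_DOUBLE, &contig);
MPI_Type_commit(&contig);   /* required for the derived type only */
/* ... use contig (and MPI_BYTE) in RMA calls ... */
MPI_Type_free(&contig);
```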
Here is a patch that works around the Intel MPI bug and thereby confirms where the problem lies:
```diff
diff --git a/src/gmr.c b/src/gmr.c
index 129b97c..acf8539 100644
--- a/src/gmr.c
+++ b/src/gmr.c
@@ -603,7 +603,9 @@ int gmr_get_typed(gmr_t *mreg, void *src, int src_count, MPI_Datatype src_type,
     MPI_Request req = MPI_REQUEST_NULL;
     if (ARMCII_GLOBAL_STATE.rma_atomicity) {
-        MPI_Rget_accumulate(NULL, 0, MPI_BYTE,
+        // using the source type instead of MPI_BYTE works around an Intel MPI 2021.10 bug...
+        MPI_Rget_accumulate(NULL, 0, src_type /* MPI_BYTE */,
                             dst, dst_count, dst_type, grp_proc,
                             (MPI_Aint) disp, src_count, src_type,
                             MPI_NO_OP, mreg->window, &req);
```
Setting `ARMCI_RMA_ATOMICITY=0` disables this code path in favor of the following `MPI_Get`, which works just fine with the same arguments except for the `(NULL, 0, MPI_BYTE)` tuple, which is of course unused.
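That `MPI_Get` call is not shown in the hunk above; a hedged sketch of what it would look like with the same argument names (not copied from gmr.c, which may differ) is:

```c
/* Sketch of the non-atomic branch: a plain get with the same target
 * arguments, with no (NULL, 0, MPI_BYTE) origin tuple involved at all. */
MPI_Get(dst, dst_count, dst_type, grp_proc,
        (MPI_Aint) disp, src_count, src_type,
        mreg->window);
```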