Intel MPI 2021.10 Rget_accumulate false positive in error checking #55

Open

jeffhammond opened this issue Oct 3, 2024 · 1 comment
jeffhammond commented Oct 3, 2024

This is not our bug and we will not fix it, but the details are documented here for posterity.

There is a bug in Intel MPI 2021.10 and Cray MPI 8.1.29 when using request-based RMA (#53). It could be an MPICH bug in the argument-checking macros, but I tested MPICH 4.2 extensively today and it does not appear there.

In MPI_Rget_accumulate(NULL, 0, MPI_BYTE, ..., MPI_NO_OP, ...), the implementation incorrectly reports that MPI_BYTE has not been committed.
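
For illustration, here is a minimal standalone sketch (not taken from the issue; window setup and buffer names are illustrative) of the call pattern that trips the false positive, i.e. an atomic get expressed as a zero-count MPI_BYTE origin with MPI_NO_OP:

/* minimal_rget_acc.c -- illustrative sketch of the failing call pattern */
#include <mpi.h>

int main(void)
{
    MPI_Init(NULL, NULL);

    MPI_Win win;
    double *base;
    MPI_Win_allocate(8 * sizeof(double), sizeof(double),
                     MPI_INFO_NULL, MPI_COMM_WORLD, &base, &win);
    MPI_Win_lock_all(MPI_MODE_NOCHECK, win);

    double result;
    MPI_Request req;
    /* The origin triple is (NULL, 0, MPI_BYTE): nothing is accumulated,
     * and MPI_NO_OP turns the call into an atomic fetch into `result`.
     * This is legal MPI, but the affected implementations abort with
     * "Datatype has not been committed". */
    MPI_Rget_accumulate(NULL, 0, MPI_BYTE,
                        &result, 1, MPI_DOUBLE,
                        0 /* target rank */, 0 /* target disp */, 1, MPI_DOUBLE,
                        MPI_NO_OP, win, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    MPI_Win_unlock_all(win);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}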

Reproduce by running the following in e.g. /tmp:

. /opt/intel/oneapi/setvars.sh  --force
git clone --depth 1 https://github.com/jeffhammond/armci-mpi -b request-based-rma
cd armci-mpi
./autogen.sh
mkdir build
cd build
../configure CC=/opt/intel/oneapi/mpi/2021.10.0/bin/mpicc --enable-g
make -j checkprogs
export ARMCI_VERBOSE=1
mpirun -n 4 ./tests/contrib/armci-test # this fails
export ARMCI_RMA_ATOMICITY=0 # this disables MPI_Rget_accumulate(MPI_NO_OP)
mpirun -n 4 ./tests/contrib/armci-test # this works

It fails here:

Testing non-blocking gets and puts
local[0:2] -> remote[0:2] -> local[1:3]
local[1:3,0:0] -> remote[1:3,0:0] -> local[1:3,1:1]
local[2:3,0:1,2:3] -> remote[2:3,0:1,2:3] -> local[1:2,0:1,2:3]
local[2:2,1:1,3:5,1:5] -> remote[4:4,0:0,1:3,1:5] -> local[3:3,1:1,1:3,2:6]
local[1:4,1:1,0:0,2:6,0:2] -> remote[1:4,2:2,1:1,2:6,1:3] -> local[0:3,1:1,5:5,2:6,2:4]
local[1:4,0:2,1:7,5:6,0:6,1:2] -> remote[0:3,0:2,1:7,7:8,0:6,0:1] -> local[0:3,0:2,0:6,3:4,0:6,0:1]
local[3:4,0:1,0:0,5:7,5:6,0:1,0:1] -> remote[1:2,0:1,0:0,5:7,2:3,0:1,0:1] -> local[0:1,0:1,4:4,2:4,3:4,0:1,0:1]
Abort(336723971) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Rget_accumulate: Invalid datatype, error stack:
PMPI_Rget_accumulate(218): MPI_Rget_accumulate(origin_addr=(nil), origin_count=0, MPI_BYTE, result_addr=0x60d4e105cfd0, result_count=1, dtype=USER<contig>, target_rank=3, target_disp=8, target_count=1, dtype=USER<contig>, MPI_NO_OP, win=0xa0000001, 0x7ffc044c9558) failed
PMPI_Rget_accumulate(159): Datatype has not been committed

MPI_BYTE does not need to be committed.
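
For reference, the MPI standard treats all predefined datatypes as precommitted; only derived datatypes ever need MPI_Type_commit. A minimal illustration (not from armci-mpi):

MPI_Datatype contig;
MPI_Type_contiguous(4, MPI_DOUBLE, &contig);
MPI_Type_commit(&contig);   /* required: contig is a derived datatype */
/* ... use contig in communication or RMA ... */
MPI_Type_free(&contig);

/* MPI_BYTE (like every predefined datatype) is precommitted and may be
 * used directly; an argument-checking layer should never flag it as
 * "Datatype has not been committed". */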

This patch works around the Intel MPI bug, and thereby pinpoints the problem:

diff --git a/src/gmr.c b/src/gmr.c
index 129b97c..acf8539 100644
--- a/src/gmr.c
+++ b/src/gmr.c
@@ -603,7 +603,9 @@ int gmr_get_typed(gmr_t *mreg, void *src, int src_count, MPI_Datatype src_type,
     MPI_Request req = MPI_REQUEST_NULL;

     if (ARMCII_GLOBAL_STATE.rma_atomicity) {
-        MPI_Rget_accumulate(NULL, 0, MPI_BYTE,
+        // using the source type instead of MPI_BYTE works around an Intel MPI 2021.10 bug...
+        MPI_Rget_accumulate(NULL, 0, src_type /* MPI_BYTE */,
                             dst, dst_count, dst_type, grp_proc,
                             (MPI_Aint) disp, src_count, src_type,
                             MPI_NO_OP, mreg->window, &req);

Setting ARMCI_RMA_ATOMICITY=0 disables this code path in favor of the MPI_Get that follows in gmr.c, which works fine with the same arguments minus the (NULL, 0, MPI_BYTE) origin tuple, which is of course unused anyway.
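
For context, a rough sketch (assumed here, not copied from gmr.c) of what that non-atomic path looks like, using the same argument names as the diff above:

/* Non-atomic fallback taken when ARMCI_RMA_ATOMICITY=0: a plain get with
 * the same local/target arguments; the unused (NULL, 0, MPI_BYTE) origin
 * tuple of the accumulate variant is simply gone, so the buggy datatype
 * check is never reached. */
MPI_Get(dst, dst_count, dst_type,
        grp_proc, (MPI_Aint) disp, src_count, src_type,
        mreg->window);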

jeffhammond self-assigned this Oct 3, 2024
jeffhammond commented:

This bug is in Cray MPI (/opt/cray/pe/mpich/8.1.29), too, so it must be from MPICH.

Fatal error in PMPI_Rget_accumulate: Invalid datatype, error stack:
PMPI_Rget_accumulate(235): MPI_Rget_accumulate(origin_addr=(nil), origin_count=0, MPI_BYTE, result_addr=0x37899d0, result_count=1, dtype=USER<contig>, target_rank=0, target_disp=8, target_count=1, dtype=USER<contig>, MPI_NO_OP, win=0xa0000002, 0x7fff7159aafc) failed
PMPI_Rget_accumulate(170): Datatype has not been committed
srun: error: nid002439: task 3: Exited with exit code 255
srun: Terminating StepId=8094140.0
slurmstepd: error: *** STEP 8094140.0 ON nid002438 CANCELLED AT 2024-10-03T17:53:15 ***
srun: error: nid002438: tasks 0-2: Exited with exit code 255
