I mainly want to open an issue so that eventually I remember to look into this more. I don't have a lot of useful information yet.
I'm running some calculations using ORMAS restrictions to create wavefunctions with holes in core orbitals. To do this, I need to start from a set of determinants (or single determinant) that fits the particular ORMAS constraints I've defined (i.e. has a hole in the specified set of core orbitals).
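To make the constraint concrete, here is a minimal sketch of an ORMAS-style check on a single determinant. It is not Quantum Package code: the function names, the bit-string layout (bit i set means orbital i is occupied), and the tiny 4-orbital system are all assumptions chosen for illustration. Each orbital space carries a minimum and maximum allowed electron count.

```python
def count_electrons(alpha, beta, orbitals):
    """Count alpha + beta electrons sitting in the given orbital indices."""
    mask = sum(1 << i for i in orbitals)
    return bin(alpha & mask).count("1") + bin(beta & mask).count("1")


def satisfies_ormas(alpha, beta, spaces):
    """spaces is a list of (orbital_indices, min_electrons, max_electrons)."""
    return all(
        lo <= count_electrons(alpha, beta, orbs) <= hi
        for orbs, lo, hi in spaces
    )


# Hypothetical system: orbital 0 is the core orbital, orbitals 1-3 are valence.
# The core space must hold exactly one electron (i.e. one core hole).
spaces = [([0], 1, 1), ([1, 2, 3], 0, 6)]

# Closed-shell reference: orbitals 0-2 doubly occupied.
hf_alpha, hf_beta = 0b0111, 0b0111

# Same electron count, but the alpha core electron is moved to orbital 3.
hole_alpha, hole_beta = 0b1110, 0b0111

print(satisfies_ormas(hf_alpha, hf_beta, spaces))      # False
print(satisfies_ormas(hole_alpha, hole_beta, spaces))  # True
```

In this toy setup the closed-shell reference is rejected by the constraints, which is why the starting determinant has to be chosen with the hole already in place.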
I've been running these calculations using 5-10 nodes (with something like `mpirun -n 1 qp_run fci` and `mpirun -n $(N-1) qp_run -s fci`), and I often see one or more of the additional N-1 nodes fail at some point with a message like this:
Abort(875110415) on node 1 (rank 1 in comm 0): Fatal error in internal_Bcast: Other MPI error, error stack:
internal_Bcast(2016)....................: MPI_Bcast(buffer=0xe13020, count=128, MPI_CHARACTER, 0, MPI_COMM_WORLD) failed
MPIR_Bcast(501).........................:
MPIDI_Bcast_intra_composition_gamma(559):
MPIDI_NM_mpi_bcast(141).................:
MPIR_Bcast_intra_auto(85)...............:
MPIR_Bcast_intra_binomial(135)..........: message sizes do not match across processes in the collective routine: Received 4 but expected 128
Abort(875110415) on node 2 (rank 2 in comm 0): Fatal error in internal_Bcast: Other MPI error, error stack:
internal_Bcast(2016)....................: MPI_Bcast(buffer=0xe13020, count=128, MPI_CHARACTER, 0, MPI_COMM_WORLD) failed
MPIR_Bcast(501).........................:
MPIDI_Bcast_intra_composition_gamma(559):
MPIDI_NM_mpi_bcast(141).................:
MPIR_Bcast_intra_auto(85)...............:
MPIR_Bcast_intra_binomial(135)..........: message sizes do not match across processes in the collective routine: Received 4 but expected 128
I see that this broadcast is the same size as what is expected in the error message (128), but that is as far as I've gotten in tracking down anything useful so far.
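One general way this kind of error can arise: MPI matches collectives on a communicator in the order each rank calls them, so if one rank skips a broadcast, its next broadcast gets paired with a different one on the root. The sketch below simulates that pairing in plain Python, without MPI. The call sequences and sizes are hypothetical and are not taken from the Quantum Package source; they only show how a 128-character broadcast could end up receiving 4 bytes.

```python
def first_mismatch(root_calls, worker_calls):
    """Pair the k-th broadcast on each side, in call order, and return
    (index, bytes_sent_by_root, bytes_expected_by_worker) for the first
    pair whose sizes differ, or None if every pair agrees."""
    for k, (sent, expected) in enumerate(zip(root_calls, worker_calls)):
        if sent != expected:
            return k, sent, expected
    return None


# Hypothetical broadcast sizes in bytes, in call order on the root.
root = [128, 4, 128, 4]

# A worker that makes the same calls stays in sync.
in_sync = [128, 4, 128, 4]

# A worker that skips the first 4-byte broadcast (assumed scenario).
skipped_one = [128, 128, 4]

print(first_mismatch(root, in_sync))      # None
print(first_mismatch(root, skipped_one))  # (1, 4, 128)
```

The second result reads as "received 4 but expected 128", the same shape as the message in the error stack. This does not identify which broadcast is out of step in the real code, only that the ranks appear to disagree about the sequence of collectives.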
I'm not sure whether this is a more general bug or something specific to the ORMAS constraints. I haven't thought through all of the multi-level filtering around the ionized generators, but I wouldn't be surprised if many of the generated connected determinants are disallowed by ORMAS, so that some ranks end up with no connected dets at all after ORMAS filtering. That could be a corner case that wasn't considered because it wouldn't happen in any reasonable calculation without harsh constraints like these.
There might also be an issue somewhere related to the ref_bitmask/HF_bitmask. If any parts of the code rely on the assumption that the HF_bitmask determinant is included in the wavefunction, then the ORMAS constraints could cause problems (if the HF determinant doesn't fit the ORMAS constraints, it will not be in the wavefunction). I haven't had a chance to dig into this more, but it's on my list of things to look into.
I haven't had time to do a lot of testing or track down anything more than this, so I'm sorry I don't have more to say about it yet. I'm planning to take a closer look when I find the time, but I wanted to at least open the issue as a reminder to myself, and also in case someone else sees an obvious problem that I'm missing, or if this is caused by something other than ORMAS and has maybe affected somebody else as well.