[work-in-progress] possible issue with multi-node cipsi using ORMAS? #342

Open
kgasperich opened this issue Aug 15, 2024 · 0 comments

Comments

@kgasperich
Collaborator

I mainly want to open an issue so that eventually I remember to look into this more. I don't have a lot of useful information yet.

I'm running some calculations using ORMAS restrictions to create wavefunctions with holes in core orbitals. To do this, I need to start from a set of determinants (or single determinant) that fits the particular ORMAS constraints I've defined (i.e. has a hole in the specified set of core orbitals).
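To make the constraint concrete, here is a minimal sketch of an ORMAS-style occupation check on bitstring determinants. It is not the Quantum Package implementation: the `OrmasSpace` class, the `fits_ormas` helper, and the 6-orbital toy system are all hypothetical names and values chosen for illustration. It shows that a closed-shell reference determinant is rejected once the core space is limited to one electron, while a determinant with a core hole is accepted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OrmasSpace:
    """Hypothetical description of one ORMAS orbital space."""
    mask: int   # bit i set => spatial orbital i belongs to this space
    min_e: int  # minimum number of electrons allowed in the space
    max_e: int  # maximum number of electrons allowed in the space


def popcount(x: int) -> int:
    """Number of set bits, i.e. occupied orbitals in a bitstring."""
    return bin(x).count("1")


def fits_ormas(alpha: int, beta: int, spaces) -> bool:
    """Return True if the (alpha, beta) determinant satisfies every space."""
    for space in spaces:
        n_elec = popcount(alpha & space.mask) + popcount(beta & space.mask)
        if not (space.min_e <= n_elec <= space.max_e):
            return False
    return True


# Toy system (assumed): 6 spatial orbitals, 3 alpha + 3 beta electrons.
# The core space is orbital 0 and must hold exactly one electron (one hole).
spaces = [
    OrmasSpace(mask=0b000001, min_e=1, max_e=1),
    OrmasSpace(mask=0b111110, min_e=0, max_e=10),
]

hf = (0b000111, 0b000111)         # core orbital doubly occupied
core_hole = (0b000111, 0b001110)  # beta electron moved from orbital 0 to 3

print(fits_ormas(*hf, spaces))         # False
print(fits_ormas(*core_hole, spaces))  # True
```

Any starting wavefunction for this kind of run has to consist of determinants for which a check like `fits_ormas` returns `True`.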

I've been running these calculations on 5-10 nodes (with something like mpirun -n 1 qp_run fci and mpirun -n $(N-1) qp_run -s fci), and I often see one or more of the additional N-1 nodes fail at some point with a message like this:

Abort(875110415) on node 1 (rank 1 in comm 0): Fatal error in internal_Bcast: Other MPI error, error stack:
internal_Bcast(2016)....................: MPI_Bcast(buffer=0xe13020, count=128, MPI_CHARACTER, 0, MPI_COMM_WORLD) failed
MPIR_Bcast(501).........................:
MPIDI_Bcast_intra_composition_gamma(559):
MPIDI_NM_mpi_bcast(141).................:
MPIR_Bcast_intra_auto(85)...............:
MPIR_Bcast_intra_binomial(135)..........: message sizes do not match across processes in the collective routine: Received 4 but expected 128
Abort(875110415) on node 2 (rank 2 in comm 0): Fatal error in internal_Bcast: Other MPI error, error stack:
internal_Bcast(2016)....................: MPI_Bcast(buffer=0xe13020, count=128, MPI_CHARACTER, 0, MPI_COMM_WORLD) failed
MPIR_Bcast(501).........................:
MPIDI_Bcast_intra_composition_gamma(559):
MPIDI_NM_mpi_bcast(141).................:
MPIR_Bcast_intra_auto(85)...............:
MPIR_Bcast_intra_binomial(135)..........: message sizes do not match across processes in the collective routine: Received 4 but expected 128

I see that this broadcast is the same size as what is expected in the error message (128), but that is as far as I've gotten in tracking down anything useful so far.

I'm not sure whether this is a more general bug or something specific to the ORMAS constraints. I haven't thought through all of the multi-level filtering around the ionized generators, but I wouldn't be surprised if many of the generated connected determinants are disallowed by ORMAS, so that some ranks end up with no connected dets at all after ORMAS filtering. That could be a corner case that wasn't considered because it wouldn't happen in any reasonable calculation without harsh constraints like these.
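The hypothesis above can be illustrated with a toy model. This is only a sketch under loud assumptions: the round-robin assignment of candidates to ranks, the `allowed` filter (exactly one electron in core orbital 0), and the candidate list are all made up for illustration and do not reflect how the code actually distributes work. It only shows how a harsh filter can leave some ranks with nothing to process.

```python
def popcount(x: int) -> int:
    """Number of set bits in a bitstring."""
    return bin(x).count("1")


def allowed(alpha: int, beta: int, core_mask: int = 0b1,
            core_electrons: int = 1) -> bool:
    """Hypothetical filter: require exactly `core_electrons` in the core."""
    n_core = popcount(alpha & core_mask) + popcount(beta & core_mask)
    return n_core == core_electrons


def surviving_per_rank(candidates, n_ranks: int):
    """Count allowed determinants per rank, assuming round-robin assignment."""
    counts = [0] * n_ranks
    for i, (alpha, beta) in enumerate(candidates):
        if allowed(alpha, beta):
            counts[i % n_ranks] += 1
    return counts


# Made-up candidate determinants (4 orbitals, 3 alpha + 3 beta electrons).
candidates = [
    (0b0111, 0b0111),  # core doubly occupied: rejected
    (0b0111, 0b1110),  # one core hole: allowed
    (0b1011, 0b0111),  # core doubly occupied: rejected
    (0b1110, 0b1110),  # two core holes: rejected
    (0b1101, 0b1011),  # core doubly occupied: rejected
    (0b1011, 0b1101),  # core doubly occupied: rejected
]

counts = surviving_per_rank(candidates, n_ranks=3)
empty_ranks = [rank for rank, n in enumerate(counts) if n == 0]
print(counts)       # [0, 1, 0]
print(empty_ranks)  # [0, 2]
```

If a rank with an empty list skipped a collective call that the other ranks still made, the ranks would fall out of step, which is one way a size mismatch in a later broadcast could arise. Whether anything like that happens here is unverified.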

There might also be an issue somewhere related to the ref_bitmask/HF_bitmask. If any part of the code relies on the assumption that the HF_bitmask is included in the wavefunction, then the ORMAS constraints could cause problems: if the HF determinant doesn't fit the constraints, it will not be in the wavefunction. I haven't had a chance to dig into this more, but it's on my list of things to look into.

I haven't had time to do much testing or to track down anything beyond this, so I'm sorry I don't have more to report yet. I'm planning to take a closer look when I find the time, but I wanted to open the issue as a reminder to myself, in case someone else sees an obvious problem that I'm missing, and in case this is caused by something other than ORMAS and has affected somebody else as well.
