
Overflow in scatter/broadcast for arrays larger than 2GB #115

Open
mrava87 opened this issue Oct 29, 2024 · 10 comments

@mrava87
Contributor

mrava87 commented Oct 29, 2024

Problem

There may be scenarios where the input array that is broadcast across ranks exceeds 2GB in size. This line

self.local_array[index] = self.base_comm.bcast(value)

will generate an error of the kind reported in mpi4py/mpi4py#119.
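
For reference, a minimal standalone sketch of how this surfaces (not pylops-mpi code; the array size and names are illustrative):

```python
# Hypothetical repro: the lowercase bcast pickles the object, and for payloads
# above ~2GB the message count overflows a 32-bit int on MPI<4 / older mpi4py.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
x = np.ones(300_000_000, dtype=np.float64) if comm.rank == 0 else None  # ~2.4GB on rank 0
x = comm.bcast(x, root=0)  # OverflowError: integer ... does not fit in 'int'
```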

Solution

One option is to look into what is suggested in the linked issue. However, I am wondering if (and when) we really need to do broadcasting at all... for example, when we define an MPILinearOperator that wraps a pylops operator, we know that the input at each rank will be the same, and each rank will perform the same operation, so could we avoid doing this altogether:

y[:] = self.Op._matvec(x.local_array)

Wouldn't this also save a lot of (perhaps unnecessary) communication time?

@mrava87 mrava87 added the enhancement New feature or request label Oct 29, 2024
@hongyx11
Collaborator

hongyx11 commented Oct 31, 2024

There are two other solutions. One is plain MPI Send and Recv, as suggested by Lisandro in mpi4py/mpi4py#119 (comment).
The second is the MPI one-sided API: https://mpi4py.readthedocs.io/en/stable/tutorial.html#one-sided-communication-rma
I recently benchmarked both the collective API and the one-sided API, using all_to_allv for the collective case and a passive window for the one-sided case.
It turns out that after a certain threshold the collective API performance drops dramatically, while the one-sided API stays roughly the same.
The one-sided API will generally be slower than the collective API, but where performance is not critical it is a good choice.

[Figure: roofline_benchmark — collective (all_to_allv) vs one-sided (passive window) API performance]
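
For readers less familiar with RMA, a minimal sketch of the passive-window pattern mentioned above (buffer size and names are illustrative, not pylops-mpi code):

```python
# Hypothetical sketch: expose rank 0's buffer through an MPI window and let the
# other ranks pull it with a passive-target Get, instead of broadcasting it.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
buf = np.zeros(1000, dtype=np.float64)
if comm.rank == 0:
    buf[:] = np.arange(1000)

win = MPI.Win.Create(buf, comm=comm)  # collective: every rank exposes its buffer
if comm.rank != 0:
    win.Lock(0, MPI.LOCK_SHARED)  # passive-target epoch on rank 0's window
    win.Get(buf, 0)               # copy rank 0's data into the local buffer
    win.Unlock(0)
comm.Barrier()                    # keep rank 0's window alive until all Gets finish
win.Free()
```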

@mrava87
Contributor Author

mrava87 commented Oct 31, 2024

Thanks @hongyx11, interesting 😀

Two questions:

  • I dug a bit deeper and realized that the 2GB problem is solved with MPI 4.0 (and mpi4py>=4.0.0, which started supporting MPI 4.0). In fact, on one cluster I have mpi4py==4.0.0 and I don’t experience this problem, while on the other cluster an older version of mpi4py is installed, which is why I started to face this issue. Now the question is: do you know whether the 2GB limit has just been pushed up to some larger value, or whether there is no limit at all because MPI 4.0 handles communication differently internally?
  • Regarding one-sided MPI: I read a bit about it but I’m not super familiar with it; would it be straightforward to use in pylops-MPI, or would we need to completely reconsider the design of the library?

@mrava87
Contributor Author

mrava87 commented Oct 31, 2024

Btw, @rohanbabbar04, my main question remains valid: there is nothing better than NOT communicating when it is not needed, and I wonder if we are actually over-communicating when we have a partition.BROADCAST array. The only reason I see for doing this bcast operation is to ensure that, if the arrays were to become different across ranks, the one on rank 0 overwrites the others. Perhaps we could create something new like partition.UNSAFEBROADCAST that does not do the bcast operation, at the risk of arrays ending up different from each other. Users would switch to it, after ensuring their code works, for large runs where you want to be as efficient as possible?
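
A rough sketch of what this could look like (the UNSAFEBROADCAST member and the setter branch below are hypothetical, not existing pylops-mpi code):

```python
# Hypothetical: an "unsafe" broadcast partition that skips the bcast and trusts
# that every rank assigns the same values.
from enum import Enum

class Partition(Enum):
    BROADCAST = "Broadcast"              # rank 0's value is bcast to keep ranks in sync
    UNSAFEBROADCAST = "UnsafeBroadcast"  # no bcast: assume ranks stay consistent
    SCATTER = "Scatter"

def setitem(dist_array, index, value):
    if dist_array.partition == Partition.BROADCAST:
        # current behaviour: one bcast per assignment, rank 0's value wins
        dist_array.local_array[index] = dist_array.base_comm.bcast(value)
    else:
        # UNSAFEBROADCAST / SCATTER: purely local assignment, no communication
        dist_array.local_array[index] = value
```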

@hongyx11
Collaborator

hongyx11 commented Nov 1, 2024

Thanks @hongyx11, interesting 😀

Two questions:

  • I dug a bit deeper and realized that the 2GB problem is solved with MPI 4.0 (and mpi4py>=4.0.0, which started supporting MPI 4.0). In fact, on one cluster I have mpi4py==4.0.0 and I don’t experience this problem, while on the other cluster an older version of mpi4py is installed, which is why I started to face this issue. Now the question is: do you know whether the 2GB limit has just been pushed up to some larger value, or whether there is no limit at all because MPI 4.0 handles communication differently internally?
  • Regarding one-sided MPI: I read a bit about it but I’m not super familiar with it; would it be straightforward to use in pylops-MPI, or would we need to completely reconsider the design of the library?

It seems to me that MPI 4.0 has not solved the int problem, but my impression might be wrong. Could you please provide more details about the experiment?
Indeed, moving to one-sided MPI would increase the complexity, so I think Send and Recv are more reasonable.
But the best option is, of course, no communication when it is not needed.
Yuxi
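
For completeness, a minimal sketch of the Send/Recv route (chunk size and names are illustrative, not pylops-mpi code): rank 0 ships the array in pieces so that each individual message stays below the 32-bit count limit.

```python
# Hypothetical chunked "broadcast" built on buffer-based Send/Recv: every message
# carries at most CHUNK elements, so the count fits in a 32-bit int even when the
# full array is larger than 2GB. arr must be C-contiguous and preallocated with
# the same shape/dtype on every rank.
from mpi4py import MPI

CHUNK = 2**27  # 128M float64 elements, i.e. 1GiB per message

def chunked_bcast(arr, comm, root=0):
    flat = arr.reshape(-1)
    for start in range(0, flat.size, CHUNK):
        chunk = flat[start:start + CHUNK]
        if comm.rank == root:
            for dest in range(comm.size):
                if dest != root:
                    comm.Send(chunk, dest=dest, tag=77)
        else:
            comm.Recv(chunk, source=root, tag=77)
    return arr
```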

@mrava87
Contributor Author

mrava87 commented Nov 1, 2024

I can try to reproduce the example, but basically I was running the same code on Ibex and Shaheen where, because of the broadcasting operation I am talking about, a large vector was sent across nodes. In one case things worked fine; in the other I started to get the error OverflowError: integer 2176051861 does not fit in 'int'... after mpi4py was updated to >=4.0.1 the problem was gone.

Reading Lisandro's comment here, mpi4py/mpi4py#119:

This happens when the total memory of the pickled object is larger than 2GB, the message count overflows a 32bit int. This is a well known limitation of MPI-1/2/3, hopefully fixed on MPI-4, but not yet available on a released Open MPI or MPICH (in mpi4py, there is already a branch with MPI-4 support).

I had assumed this problem was fixed in MPI-4? Or maybe the allowed size is now just bigger, but there is still a limit?
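
One way to narrow this down (a small diagnostic, nothing cluster-specific assumed) is to print which MPI standard and library each environment actually provides:

```python
# Quick check of the mpi4py version and the MPI standard behind it;
# large-count support should only be expected from MPI >= 4.0.
import mpi4py
from mpi4py import MPI

print("mpi4py:", mpi4py.__version__)
print("MPI standard:", MPI.Get_version())        # e.g. (3, 1) or (4, 0)
print("MPI library:", MPI.Get_library_version())
```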

@mrava87
Contributor Author

mrava87 commented Nov 1, 2024

For sure, one thing we may want to look at is mpi4py.util.pkl5, as our library is bound to work with large vectors 😸

I'll give it a try and report my findings
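
A minimal sketch of what that test could look like (array size illustrative); as I understand it, mpi4py.util.pkl5 provides drop-in communicator wrappers that send pickle protocol 5 out-of-band buffers, which is exactly what we'd want to probe against the 2GB limit:

```python
# Hypothetical test: the same broadcast as before, but through the pkl5
# communicator wrapper instead of the plain pickle-based bcast.
import numpy as np
from mpi4py import MPI
from mpi4py.util import pkl5

comm = pkl5.Intracomm(MPI.COMM_WORLD)  # drop-in wrapper around the base communicator
x = np.ones(300_000_000, dtype=np.float64) if comm.rank == 0 else None  # ~2.4GB
x = comm.bcast(x, root=0)  # does this still hit the 2GB limit?
```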

@rohanbabbar04
Collaborator

Hmm...
Sorry for jumping in a little late.

I agree with @mrava87 that eliminating communication altogether should work. I'm encountering the 2GB error with mpi4py<4.0.0. Since we can get the correct result without using bcast (just sending and receiving), the best approach might be to assign the local arrays directly at each rank, as @mrava87 suggested. This would avoid communication overhead and could work effectively. Do you think we should pursue both the UNSAFE and SAFE approaches? I think we can test this by assigning the same local array at each rank for the broadcast. If we are able to bypass the 2GB error, then we should be good.

I have one question that just came to mind:

  • We use global sum reductions in some cases, which require the participation of all processes. I think this might not be efficient with one-sided MPI.

One option would be to integrate it as an additional requirement without recommending it as the default. What do you all think?

@mrava87
Contributor Author

mrava87 commented Nov 2, 2024

Hi @rohanbabbar04,
Thanks for the feedback.

I have not yet tried getting rid of the bcast line, but I’m happy you also think it should work as long as we assign the same array on all ranks at the start and stay consistent throughout the run.

I wasn’t sure why we did it this way in the first place, but I seem to remember the idea was that we want to prevent users from accidentally overwriting an array that should be broadcast in one rank while not doing the same in the others… hence the idea of having an unsafe and a safe option. But we could also change it and just have the current broadcast not enforce anything, so we reduce communication (but give users more power to mess things up 🙈).

I also don’t think we should switch to one-sided MPI… perhaps just add the possibility to use it where it is deemed a better fit in some scenarios, but I’m not familiar with it, so I would not be sure when and where to use it.

@mrava87
Contributor Author

mrava87 commented Nov 2, 2024

For sure, one thing we may want to look at is mpi4py.util.pkl5, as our library is bound to work with large vectors 😸

I'll give it a try and report my findings

I tried this, but I still seem to have issues with large arrays, so the 2GB limit is probably not fully gone in MPI 4.0.

@rohanbabbar04
Collaborator

Hi @rohanbabbar04, Thanks for the feedback.

I have not yet tried getting rid of the bcast line, but I’m happy you also think it should work as long as we assign the same array on all ranks at the start and stay consistent throughout the run.

I wasn’t sure why we did it this way in the first place, but I seem to remember the idea was that we want to prevent users from accidentally overwriting an array that should be broadcast in one rank while not doing the same in the others… hence the idea of having an unsafe and a safe option. But we could also change it and just have the current broadcast not enforce anything, so we reduce communication (but give users more power to mess things up 🙈).

I also don’t think we should switch to one-sided MPI… perhaps just add the possibility to use it where it is deemed a better fit in some scenarios, but I’m not familiar with it, so I would not be sure when and where to use it.

Let me add a PR for this change, assigning the value directly to the local array; I will test it.
