
Overflow in scatter/broadcast for arrays larger than 2GB #115

Open
mrava87 opened this issue Oct 29, 2024 · 10 comments

@mrava87
Contributor

mrava87 commented Oct 29, 2024

Problem

There may be scenarios where the input array that is broadcast across ranks exceeds 2GB in size. This line

self.local_array[index] = self.base_comm.bcast(value)

will generate an error of the kind reported in mpi4py/mpi4py#119.
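
For reference, a minimal standalone sketch of how this surfaces (not pylops-mpi code; the array size and names are illustrative):

```python
# Hypothetical repro: the lowercase bcast pickles the object, and for payloads
# above ~2GB the message count overflows a 32-bit int on MPI<4 / older mpi4py.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
x = np.ones(300_000_000, dtype=np.float64) if comm.rank == 0 else None  # ~2.4GB on rank 0
x = comm.bcast(x, root=0)  # OverflowError: integer ... does not fit in 'int'
```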

Solution

One option is to look into what is suggested in the linked issue. However, I am wondering if (and when) we really need to do broadcasting at all... for example, when we define an MPILinearOperator that wraps a pylops operator, we know that the input at each rank will be the same, and each rank will perform the same operation, so could we avoid doing this altogether:

y[:] = self.Op._matvec(x.local_array)

Wouldn't this also save a lot of (perhaps unnecessary) communication time?

@mrava87 mrava87 added the enhancement New feature or request label Oct 29, 2024
@hongyx11
Collaborator

hongyx11 commented Oct 31, 2024

There are two other solutions. One is plain MPI Send and Recv, as suggested by Lisandro in mpi4py/mpi4py#119 (comment).
The second is the MPI one-sided API: https://mpi4py.readthedocs.io/en/stable/tutorial.html#one-sided-communication-rma
I recently benchmarked both the collective API and the one-sided API, using all_to_allv for the collective case and a passive window for the one-sided case.
It turns out that after a certain threshold the collective API performance drops dramatically, while the one-sided API stays roughly the same.
The one-sided API will generally be slower than the collective API, but where performance is not critical it is a good choice.

[Figure: roofline_benchmark — collective (all_to_allv) vs one-sided (passive window) API performance]
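
For readers less familiar with RMA, a minimal sketch of the passive-window pattern mentioned above (buffer size and names are illustrative, not pylops-mpi code):

```python
# Hypothetical sketch: expose rank 0's buffer through an MPI window and let the
# other ranks pull it with a passive-target Get, instead of broadcasting it.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
buf = np.zeros(1000, dtype=np.float64)
if comm.rank == 0:
    buf[:] = np.arange(1000)

win = MPI.Win.Create(buf, comm=comm)  # collective: every rank exposes its buffer
if comm.rank != 0:
    win.Lock(0, MPI.LOCK_SHARED)  # passive-target epoch on rank 0's window
    win.Get(buf, 0)               # copy rank 0's data into the local buffer
    win.Unlock(0)
comm.Barrier()                    # keep rank 0's window alive until all Gets finish
win.Free()
```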

@mrava87
Contributor Author

mrava87 commented Oct 31, 2024

Thanks @hongyx11, interesting 😀

Two questions:

  • I dug a bit deeper and realized that the 2GB problem is solved with MPI 4.0 (and mpi4py>=4.0.0, which started supporting MPI 4.0). In fact, on one cluster I have mpi4py==4.0.0 and I don’t experience this problem, while on the other cluster an older version of mpi4py is installed, which is why I started to face this issue. Now the question is: do you know whether the 2GB limit has just been pushed up to some larger value, or whether there is no limit at all because MPI 4.0 handles communication differently internally?
  • Regarding one-sided MPI: I read a bit about it but I’m not super familiar with it; would it be straightforward to use in pylops-MPI, or would we need to completely reconsider the design of the library?

@mrava87
Contributor Author

mrava87 commented Oct 31, 2024

Btw, @rohanbabbar04, my main question remains valid: there is nothing better than NOT communicating when it is not needed, and I wonder if we are actually over-communicating when we have a partition.BROADCAST array. The only reason I see for doing this bcast operation is to ensure that, if the arrays were to become different across ranks, the one on rank 0 overwrites the others. Perhaps we could create something new like partition.UNSAFEBROADCAST that does not do the bcast operation, at the risk of arrays ending up different from each other. Users would switch to it, after ensuring their code works, for large runs where you want to be as efficient as possible?
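
A rough sketch of what this could look like (the UNSAFEBROADCAST member and the setter branch below are hypothetical, not existing pylops-mpi code):

```python
# Hypothetical: an "unsafe" broadcast partition that skips the bcast and trusts
# that every rank assigns the same values.
from enum import Enum

class Partition(Enum):
    BROADCAST = "Broadcast"              # rank 0's value is bcast to keep ranks in sync
    UNSAFEBROADCAST = "UnsafeBroadcast"  # no bcast: assume ranks stay consistent
    SCATTER = "Scatter"

def setitem(dist_array, index, value):
    if dist_array.partition == Partition.BROADCAST:
        # current behaviour: one bcast per assignment, rank 0's value wins
        dist_array.local_array[index] = dist_array.base_comm.bcast(value)
    else:
        # UNSAFEBROADCAST / SCATTER: purely local assignment, no communication
        dist_array.local_array[index] = value
```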

@hongyx11
Collaborator

hongyx11 commented Nov 1, 2024

Thanks @hongyx11, interesting 😀

Two questions:

  • I dug a bit deeper and realized that the 2GB problem is solved with MPI 4.0 (and mpi4py>=4.0.0, which started supporting MPI 4.0). In fact, on one cluster I have mpi4py==4.0.0 and I don’t experience this problem, while on the other cluster an older version of mpi4py is installed, which is why I started to face this issue. Now the question is: do you know whether the 2GB limit has just been pushed up to some larger value, or whether there is no limit at all because MPI 4.0 handles communication differently internally?
  • Regarding one-sided MPI: I read a bit about it but I’m not super familiar with it; would it be straightforward to use in pylops-MPI, or would we need to completely reconsider the design of the library?

It seems to me that MPI 4.0 has not solved the int problem, but my impression might be wrong. Could you please provide more details about the experiment?
Indeed, moving to one-sided MPI would increase the complexity, so I think Send and Recv are more reasonable.
But the best option is, of course, no communication when it is not needed.
Yuxi
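
For completeness, a minimal sketch of the Send/Recv route (chunk size and names are illustrative, not pylops-mpi code): rank 0 ships the array in pieces so that each individual message stays below the 32-bit count limit.

```python
# Hypothetical chunked "broadcast" built on buffer-based Send/Recv: every message
# carries at most CHUNK elements, so the count fits in a 32-bit int even when the
# full array is larger than 2GB. arr must be C-contiguous and preallocated with
# the same shape/dtype on every rank.
from mpi4py import MPI

CHUNK = 2**27  # 128M float64 elements, i.e. 1GiB per message

def chunked_bcast(arr, comm, root=0):
    flat = arr.reshape(-1)
    for start in range(0, flat.size, CHUNK):
        chunk = flat[start:start + CHUNK]
        if comm.rank == root:
            for dest in range(comm.size):
                if dest != root:
                    comm.Send(chunk, dest=dest, tag=77)
        else:
            comm.Recv(chunk, source=root, tag=77)
    return arr
```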

@mrava87
Contributor Author

mrava87 commented Nov 1, 2024

I can try to reproduce the example, but basically I was running the same code on Ibex and Shaheen where, because of the broadcasting operation I am talking about, a large vector was sent across nodes. In one case things worked fine; in the other I started to get the error OverflowError: integer 2176051861 does not fit in 'int'... after mpi4py was updated to >=4.0.1 the problem was gone.

Reading Lisandro's comment here, mpi4py/mpi4py#119:

This happens when the total memory of the pickled object is larger than 2GB, the message count overflows a 32bit int. This is a well known limitation of MPI-1/2/3, hopefully fixed on MPI-4, but not yet available on a released Open MPI or MPICH (in mpi4py, there is already a branch with MPI-4 support).

I had assumed this problem was fixed in MPI-4? Or maybe the allowed size is now just bigger, but there is still a limit?
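
One way to narrow this down (a small diagnostic, nothing cluster-specific assumed) is to print which MPI standard and library each environment actually provides:

```python
# Quick check of the mpi4py version and the MPI standard behind it;
# large-count support should only be expected from MPI >= 4.0.
import mpi4py
from mpi4py import MPI

print("mpi4py:", mpi4py.__version__)
print("MPI standard:", MPI.Get_version())        # e.g. (3, 1) or (4, 0)
print("MPI library:", MPI.Get_library_version())
```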

@mrava87
Contributor Author

mrava87 commented Nov 1, 2024

For sure, one thing we may want to look at is mpi4py.util.pkl5, as our library is bound to work with large vectors 😸

I'll give it a try and report my findings
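
A minimal sketch of what that test could look like (array size illustrative); as I understand it, mpi4py.util.pkl5 provides drop-in communicator wrappers that send pickle protocol 5 out-of-band buffers, which is exactly what we'd want to probe against the 2GB limit:

```python
# Hypothetical test: the same broadcast as before, but through the pkl5
# communicator wrapper instead of the plain pickle-based bcast.
import numpy as np
from mpi4py import MPI
from mpi4py.util import pkl5

comm = pkl5.Intracomm(MPI.COMM_WORLD)  # drop-in wrapper around the base communicator
x = np.ones(300_000_000, dtype=np.float64) if comm.rank == 0 else None  # ~2.4GB
x = comm.bcast(x, root=0)  # does this still hit the 2GB limit?
```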

@rohanbabbar04
Collaborator

Hmm...
Sorry for jumping in a little late.

I agree with @mrava87 that eliminating communication altogether should work. I'm encountering the 2GB error with mpi4py<4.0.0. Since we can get the correct result without using bcast (just sending and receiving), the best approach might be to assign the local arrays directly at each rank, as @mrava87 suggested. This would avoid communication overhead and could work effectively. Do you think we should pursue both the UNSAFE and SAFE approaches? I think we can test this by assigning the same local array at each rank for the broadcast. If we are able to bypass the 2GB error, then we should be good.

I have one question that just came to mind:

  • We use global sum reductions in some cases, which require the participation of all processes. I think this might not be efficient with one-sided MPI.

One option would be to integrate it as an additional requirement without recommending it as the default. What do you all think?

@mrava87
Contributor Author

mrava87 commented Nov 2, 2024

Hi @rohanbabbar04,
Thanks for the feedback.

I have not yet tried getting rid of the bcast line, but I’m happy you also think it should work as long as we assign the same array on all ranks at the start and stay consistent throughout the run.

I wasn’t sure why we did it this way in the first place, but I seem to remember the idea was that we want to prevent users from accidentally overwriting an array that should be broadcast in one rank while not doing the same in the others… hence the idea of having an unsafe and a safe option. But we could also change it and just have the current broadcast not enforce anything, so we reduce communication (but give users more power to mess things up 🙈).

I also don’t think we should switch to one-sided MPI… perhaps just add the possibility to use it where it is deemed a better fit in some scenarios, but I’m not familiar with it, so I would not be sure when and where to use it.

@mrava87
Contributor Author

mrava87 commented Nov 2, 2024

For sure, one thing we may want to look at is mpi4py.util.pkl5, as our library is bound to work with large vectors 😸

I'll give it a try and report my findings

I tried this, but I still seem to have issues with large arrays, so the 2GB limit is probably not fully gone in MPI 4.0.

@rohanbabbar04
Collaborator

Hi @rohanbabbar04, Thanks for the feedback.

I have not yet tried getting rid of the bcast line, but I’m happy you also think it should work as long as we assign the same array on all ranks at the start and stay consistent throughout the run.

I wasn’t sure why we did it this way in the first place, but I seem to remember the idea was that we want to prevent users from accidentally overwriting an array that should be broadcast in one rank while not doing the same in the others… hence the idea of having an unsafe and a safe option. But we could also change it and just have the current broadcast not enforce anything, so we reduce communication (but give users more power to mess things up 🙈).

I also don’t think we should switch to one-sided MPI… perhaps just add the possibility to use it where it is deemed a better fit in some scenarios, but I’m not familiar with it, so I would not be sure when and where to use it.

Let me add a PR for this change, assigning the value directly to the local array; I will test it.
