Overflow in scatter/broadcast for arrays larger than 2GB #115
Comments
There are two other solutions; one is plain MPI Send and Recv, as stated by Lisandro in mpi4py/mpi4py#119 (comment).
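A minimal sketch of that plain Send/Recv route, assuming a 1-D NumPy array replicated from rank 0 and a chunk size chosen to keep every message well under 2GB (array size, chunk size, and variable names are illustrative, not pylops-mpi code):

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

MAX_COUNT = 2**27   # ~1 GB of float64 per message, comfortably below the 2 GB limit
n = 2**29           # illustrative ~4 GB array; adjust to available memory

x = np.empty(n, dtype=np.float64)
if rank == 0:
    x[:] = 1.0
    # replicate rank 0's array on every other rank with chunked Send/Recv
    for dest in range(1, size):
        for start in range(0, n, MAX_COUNT):
            comm.Send(x[start:start + MAX_COUNT], dest=dest)
else:
    for start in range(0, n, MAX_COUNT):
        comm.Recv(x[start:start + MAX_COUNT], source=0)
```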
Thanks @hongyx11, interesting 😀 Two questions:
Btw, @rohanbabbar04, my main question remains valid - there is nothing better than NOT communicating when it is not needed, and I wonder if we are actually over-communicating when we have a partition.BROADCAST array. The only reason I see for doing this bcast operation is to ensure that, if the array became different across ranks, the one on rank 0 overwrites the others - perhaps we could create something new like partition.UNSAFEBROADCAST that skips the bcast operation, at the risk of the arrays differing from each other. Users would switch to it after they have ensured their code works, for large runs where you want to be as efficient as possible?
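As a rough illustration of the safe vs. unsafe idea (the helper name `fill_broadcast`, the `unsafe` flag, and the `local_array` argument are hypothetical, not the actual pylops-mpi API):

```python
from mpi4py import MPI

def fill_broadcast(local_array, x, comm=MPI.COMM_WORLD, unsafe=False):
    """Fill a replicated array on every rank.

    With unsafe=True we trust that every rank already holds the same x and
    simply copy it locally, with no communication at all; with unsafe=False
    rank 0's copy is broadcast so it overwrites any replica that may have
    diverged (the current behaviour, which is what hits the 2GB limit).
    """
    if unsafe:
        local_array[:] = x
    else:
        local_array[:] = comm.bcast(x if comm.Get_rank() == 0 else None, root=0)
```

Users who have verified their setup could then opt into the no-communication path, e.g. `fill_broadcast(arr_local, x, unsafe=True)`.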
It seems to me that MPI 4.0 has not solved the int problem, but my impression might be wrong. Could you please provide more details about the experiment?
I can try to reproduce the example, but basically I was running the same code on Ibex and Shaheen where, because of this broadcasting operation I am talking about, a large vector was sent across nodes. In one case things worked fine, in the other I started to get the error. Reading the comment of Lisandro here mpi4py/mpi4py#119, I had assumed this problem was fixed in MPI-4? But maybe the allowed size is just bigger now and there is still a limit?
For sure, one thing we may want to look at is mpi4py.util.pkl5, as our library is bound to work with large vectors 😸 I'll give it a try and report my findings
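A quick way to try it could look like the snippet below: `pkl5.Intracomm` is a drop-in wrapper whose pickle-protocol-5 methods move large buffers out-of-band. The array size is illustrative, and the `bcast` call assumes a recent mpi4py where pkl5 provides it.

```python
import numpy as np
from mpi4py import MPI
from mpi4py.util import pkl5

comm = pkl5.Intracomm(MPI.COMM_WORLD)   # wraps COMM_WORLD, same interface
rank = comm.Get_rank()

# ~2.1 GB payload on rank 0, above the limit of the plain pickle-based bcast
x = np.ones(2**28, dtype=np.float64) if rank == 0 else None
x = comm.bcast(x, root=0)
print(rank, x.nbytes)
```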
Hmm... I agree with @mrava87 that eliminating communication altogether should work. I'm encountering the 2GB error with mpi4py<4.0.0. Since we can get the correct result without using bcast (just sending and receiving), the best approach might be to assign the local arrays directly at each rank, as @mrava87 suggested. This would avoid communication overhead and could work effectively. Do you think we should pursue both the UNSAFE and SAFE approaches? I think we can test this by assigning the same local array at each rank for the broadcast. If we are able to bypass the 2GB error, then we should be good. I have one question that just came to mind:
One option would be to integrate it as an additional requirement without recommending it as the default. What do you all think?
Hi @rohanbabbar04, I have not yet tried getting rid of the bcast line, but I'm happy you also think it should work as long as we assign the same array in all ranks at the start and are consistent through the run. I wasn't sure why we did it this way in the first place; I seem to remember the idea was that we wanted to prevent users from accidentally overwriting an array that should be broadcast in one rank without doing the same in the others… hence the idea of having an unsafe and a safe option, but we could change and just have the current broadcast not enforce anything, so we reduce communication (but give users more power to mess things up 🙈). I also don't think we should switch to one-sided MPI… perhaps just add the possibility to use it if deemed a better fit in some scenarios - but I'm not familiar with it, so I would not be sure when and where to use it.
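If the enforcing broadcast were dropped, one cheap guard against ranks silently diverging could be a digest comparison instead of moving the full array; a sketch under that assumption (not pylops-mpi code):

```python
import hashlib

import numpy as np
from mpi4py import MPI

def assert_replicas_match(local_array, comm=MPI.COMM_WORLD):
    """Raise if a supposedly replicated array differs across ranks."""
    digest = hashlib.sha1(np.ascontiguousarray(local_array)).hexdigest()
    digests = comm.allgather(digest)      # only a few bytes per rank travel
    if len(set(digests)) != 1:
        raise RuntimeError("replicated array differs across ranks")
```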
I tried this, but I seem to still have issues with large arrays, so probably the 2GB thing is not fully gone in MPI 4.0.
Let me add a PR for this change, assigning it directly to the local array; I will test it.
Problem
There may be scenarios where the input array that is broadcasted across ranks has a size that exceeds 2GB. This line
pylops-mpi/pylops_mpi/DistributedArray.py
Line 137 in 57b793e
will generate an error of the kind in mpi4py/mpi4py#119
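A hedged reproduction of the failure mode (the array size is illustrative and needs a few GB of memory on rank 0): the lowercase `bcast` pickles the whole array and, once the message exceeds ~2GB, the 32-bit count in the underlying MPI call overflows.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
# ~2.4 GB once pickled, enough to exceed the 32-bit message-size limit
x = np.ones(300_000_000, dtype=np.float64) if comm.Get_rank() == 0 else None
x = comm.bcast(x, root=0)   # raises an overflow-type error with mpi4py<4.0.0, as discussed above
```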
Solution
One option is to look into what is suggested in the linked issue. However, I am wondering if (and when) we really need to do broadcasting at all... for example, when we define an MPILinearOperator that wraps a PyLops operator, we know that the input at each rank will be the same and each rank will perform the same operation, so could we avoid doing this altogether:
pylops-mpi/pylops_mpi/LinearOperator.py
Line 86 in 57b793e
This would also reduce a lot of (perhaps unnecessary) communication time.