Commit 3941117: Add documentation
lroberts36 committed Nov 14, 2024 (1 parent: 255e85a)
1 changed file: doc/sphinx/src/boundary_communication.rst (+90, -0)
For backwards compatibility, we keep the aliases
- ``ReceiveFluxCorrections`` = ``ReceiveBoundBufs<BoundaryType::flxcor_recv>``
- ``SetFluxCorrections`` = ``SetBoundBufs<BoundaryType::flxcor_recv>``

Coalesced MPI Communication
---------------------------

As described above, a one-dimensional buffer is packed and unpacked for each communicated
field on each pair of blocks that share a unique topological element. For codes with large
numbers of variables and/or simulations run with small block sizes, this can result in
a large total number of buffers and, importantly, a large number of buffers that must be
communicated across MPI ranks. The latter can have significant performance implications,
since each ``CommBuffer<T>::Send()`` call for a non-local buffer corresponds to an
``MPI_Isend``. These messages generally contain a small amount of data, which results in
a small effective MPI bandwidth. Additionally, MPI implementations tend to struggle with
the large number of messages required. In some cases, this can result in poor
scaling behavior for Parthenon.

To get around this, we introduce a second level of buffers for communicating across ranks.
For each ``MeshData`` object on a given MPI rank, a coalesced buffer equal in size to all of
the non-local variable-boundary buffers is created for each other MPI rank that the ``MeshData``
communicates with. These coalesced buffers are then filled from the individual variable-boundary
buffers, a *single* MPI send is posted per MPI rank pair, and the receiving ranks unpack the
coalesced buffer back into the individual variable-boundary buffers. This can drastically reduce
the number of MPI sends and increase the total amount of data sent per message, thereby
increasing the effective bandwidth. Further, in cases where Parthenon is running on GPUs but
GPUDirect MPI is not available, this can also minimize the number of DtoH and HtoD copies
during communication.
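
As a rough illustration of the difference, the following sketch contrasts posting one
``MPI_Isend`` per variable-boundary buffer with packing everything bound for the same rank into
a single message. This is not Parthenon's actual API; all names below are invented for
illustration only.

.. code:: cpp

  #include <mpi.h>

  #include <map>
  #include <vector>

  using Real = double;

  // One small buffer per (variable, topological element, block pair); illustrative only.
  struct VarBoundBuf {
    int dest_rank;
    std::vector<Real> data;
  };

  // Non-coalesced path: one MPI_Isend per variable-boundary buffer.
  void SendSeparately(const std::vector<VarBoundBuf> &bufs, int tag,
                      std::vector<MPI_Request> &reqs) {
    for (const auto &b : bufs) {
      reqs.emplace_back();
      MPI_Isend(b.data.data(), static_cast<int>(b.data.size()), MPI_DOUBLE,
                b.dest_rank, tag, MPI_COMM_WORLD, &reqs.back());
    }
  }

  // Coalesced path: concatenate everything destined for the same rank into one
  // large buffer and post a single MPI_Isend per destination rank.
  void SendCoalesced(const std::vector<VarBoundBuf> &bufs, int tag,
                     std::map<int, std::vector<Real>> &coalesced,
                     std::vector<MPI_Request> &reqs) {
    for (const auto &b : bufs) {
      auto &big = coalesced[b.dest_rank];
      big.insert(big.end(), b.data.begin(), b.data.end());
    }
    for (auto &[rank, big] : coalesced) {
      reqs.emplace_back();
      MPI_Isend(big.data(), static_cast<int>(big.size()), MPI_DOUBLE, rank, tag,
                MPI_COMM_WORLD, &reqs.back());
    }
  }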

To use coalesced communication, your input must include:

.. code::

  parthenon/mesh/do_combined_comms = true

Currently, this parameter defaults to ``false``.
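
For reference, a minimal sketch of how this might appear in a Parthenon-style input deck; the
surrounding mesh parameters are illustrative placeholders, not taken from this documentation.

.. code::

  <parthenon/mesh>
  nx1 = 128                  # illustrative mesh dimensions
  nx2 = 128
  nx3 = 128
  do_combined_comms = true   # enable coalesced MPI communication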

Implementation Details
~~~~~~~~~~~~~~~~~~~~~~

The coalesced send and receive buffers for each rank are stored in ``Mesh::pcombined_buffers``,
which is a ``std::shared_ptr`` to a ``CombinedBuffers`` object. Coalesced communication requires
two pieces: 1) an initialization step that tells all ranks which coalesced buffer messages to
expect, and 2) a mechanism for packing, sending, and unpacking the coalesced buffers during each
boundary communication step.

For the first piece, after every remesh during ``BuildBoundaryBuffers``, each non-local
variable-boundary buffer is registered with ``pcombined_buffers``. Once all these buffers are
registered, ``CombinedBuffers::ResolveAndSendSendBuffers()`` is called, which determines all
the coalesced buffers that will be sent from a given rank to every other rank, packs
information about each of the coalesced buffers into MPI messages, and sends them to the other
ranks so that the receiving ranks know how to interpret the messages they receive from a given
rank. ``CombinedBuffers::ReceiveBufferInfo()`` is then called to receive this information from
other ranks. This process packs ``BndId`` objects, which contain the information
necessary to identify a variable-boundary communication channel and the amount of data that
is communicated across that channel, and then unpacks them on the receiving end and finds the
correct variable-boundary buffers. These routines are called once per rank (rather than per
``MeshData``).
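
The shape of this handshake can be sketched roughly as follows. ``BndIdSketch``,
``SendChannelInfo``, and ``ReceiveChannelInfo`` are illustrative stand-ins, not the actual
``BndId`` class or Parthenon's routines; the fields shown are assumptions about the kind of
information exchanged, not the real layout.

.. code:: cpp

  #include <mpi.h>

  #include <vector>

  // Illustrative only: the real BndId layout lives in the Parthenon source.
  struct BndIdSketch {
    int send_gid;  // global id of the sending block
    int recv_gid;  // global id of the receiving block
    int var_id;    // which variable this channel carries
    int element;   // which shared topological element (face/edge/node)
    int buf_size;  // number of Reals communicated across this channel
  };

  // Sender side: tell `other_rank` which channels its coalesced messages will contain.
  void SendChannelInfo(const std::vector<BndIdSketch> &ids, int other_rank, int tag) {
    MPI_Send(ids.data(), static_cast<int>(ids.size() * sizeof(BndIdSketch)), MPI_BYTE,
             other_rank, tag, MPI_COMM_WORLD);
  }

  // Receiver side: probe for the message size, then unpack the channel descriptions.
  std::vector<BndIdSketch> ReceiveChannelInfo(int other_rank, int tag) {
    MPI_Status status;
    MPI_Probe(other_rank, tag, MPI_COMM_WORLD, &status);
    int nbytes;
    MPI_Get_count(&status, MPI_BYTE, &nbytes);
    std::vector<BndIdSketch> ids(nbytes / sizeof(BndIdSketch));
    MPI_Recv(ids.data(), nbytes, MPI_BYTE, other_rank, tag, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
    return ids;
  }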

For the second piece, variable-boundary buffers are first filled as normal in ``SendBoundBufs``,
but the states of the ``CommBuffer`` objects are updated without actually calling the associated
``MPI_Isend``. Then ``CombinedBuffers::PackAndSend(MeshData<Real> *pmd, BoundaryType b_type)``
is called, which, for each rank pair associated with ``pmd``, packs the variable-boundary buffers
into the coalesced buffer, packs a second message containing the sparse allocation status of
each variable-boundary buffer, sends these two messages, and then stales the associated
variable-boundary buffers since their data is no longer required. On the receiving side,
``ReceiveBoundBufs`` receives these messages, sets the corresponding variable-boundary
buffers to the correct ``received`` or ``received_null`` state, and then unpacks the data
into the buffers. Note that the messages received here do not necessarily correspond to the
``MeshData`` that is passed to the associated ``ReceiveBoundBufs`` call, so all
variable-boundary buffers associated with a given receiving ``MeshData`` must still be checked
for being in a received state. Once they are all received, setting of boundaries,
prolongation, etc. can proceed as normal.
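
A minimal sketch of the pack/unpack logic, assuming the per-channel sizes are known from the
initialization handshake described above; the types and names here are illustrative, not
Parthenon's internal API.

.. code:: cpp

  #include <cstddef>
  #include <cstdint>
  #include <vector>

  using Real = double;

  // Minimal stand-in for a variable-boundary buffer; illustrative only.
  struct VarBuf {
    bool allocated;          // sparse allocation status
    std::vector<Real> data;  // payload (empty if not allocated)
  };

  // Pack side: concatenate all allocated buffers into one coalesced message and
  // record a parallel allocation-status message, mirroring the two messages
  // sent per rank pair.
  void Pack(const std::vector<VarBuf> &bufs, std::vector<Real> &coalesced,
            std::vector<std::int32_t> &alloc_status) {
    coalesced.clear();
    alloc_status.clear();
    for (const auto &b : bufs) {
      alloc_status.push_back(b.allocated ? 1 : 0);
      if (b.allocated) coalesced.insert(coalesced.end(), b.data.begin(), b.data.end());
    }
  }

  // Unpack side: the expected size of each channel is known from the post-remesh
  // handshake, so the coalesced message can be split back into individual buffers.
  void Unpack(const std::vector<Real> &coalesced,
              const std::vector<std::int32_t> &alloc_status,
              const std::vector<std::size_t> &sizes, std::vector<VarBuf> &bufs) {
    std::size_t offset = 0;
    bufs.resize(sizes.size());
    for (std::size_t i = 0; i < sizes.size(); ++i) {
      bufs[i].allocated = (alloc_status[i] != 0);  // received vs. received_null
      bufs[i].data.clear();
      if (bufs[i].allocated) {
        bufs[i].data.assign(coalesced.begin() + offset,
                            coalesced.begin() + offset + sizes[i]);
        offset += sizes[i];
      }
    }
  }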

Some notes:

- Internally, ``CombinedBuffers`` contains maps from MPI rank and ``BoundaryType`` (e.g. regular
  communication, flux correction) to ``CombinedBuffersRank`` objects for sending and receiving
  rank pairs. These ``CombinedBuffersRank`` objects in turn contain maps from the partition id
  of the sending ``MeshData`` (which also doubles as the MPI tag for the messages) to
  ``CombinedBuffersRankPartition`` objects (see the sketch after this list).
- ``CombinedBuffersRank`` is where the post-remesh initialization routines are actually
  implemented. A given instance corresponds to either the send side or the receive side.
- ``CombinedBuffersRankPartition`` corresponds to a single coalesced buffer and is where the
  packing, sending, receiving, and unpacking details for coalesced boundary communication
  are implemented. This object internally owns the ``CommunicationBuffer<BufArray1D<Real>>``
  that is used for sending and receiving the coalesced data (as well as the communication
  buffer used for communicating allocation status).
- Because Parthenon allows communication on ``MeshData`` objects that contain a subset of the
  ``Metadata::FillGhost`` fields in a simulation, we need to be able to interpret coalesced
  messages that contain a subset of fields. Most of what is needed for this is implemented
  in ``GetBndIdsOnDevice``.
- Currently, there is a ``Compare`` method in ``CombinedBuffersRankPartition`` that exists only
  for debugging. It should compare the received coalesced messages to the variable-boundary
  buffer messages, but using it requires some hacks in the code to send both types of buffers.
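
A rough sketch of how these objects nest; member names, key types, and the ``Sketch`` suffix are
illustrative only, and the real class definitions in the Parthenon source differ in detail.

.. code:: cpp

  #include <map>
  #include <utility>

  // Subset of boundary types, for illustration only.
  enum class BoundaryType { any, flxcor_send, flxcor_recv };

  struct CombinedBuffersRankPartitionSketch {
    // One coalesced buffer: owns the coalesced data buffer and the allocation-status
    // buffer, and implements pack/send/receive/unpack for one partition.
  };

  struct CombinedBuffersRankSketch {
    // Post-remesh initialization (ResolveAndSendSendBuffers / ReceiveBufferInfo) lives
    // at this level. Keyed by the partition id of the sending MeshData, which also
    // doubles as the MPI tag for the messages.
    std::map<int, CombinedBuffersRankPartitionSketch> partitions;
  };

  struct CombinedBuffersSketch {
    // Keyed by (other rank, BoundaryType), separately for the send and receive sides.
    std::map<std::pair<int, BoundaryType>, CombinedBuffersRankSketch> send, recv;
  };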
