Commit 3941117: Add documentation
lroberts36 committed Nov 14, 2024 (1 parent: 255e85a)
1 changed file: doc/sphinx/src/boundary_communication.rst (+90, -0)
For backwards compatibility, we keep the aliases
- ``ReceiveFluxCorrections`` = ``ReceiveBoundBufs<BoundaryType::flxcor_recv>``
- ``SetFluxCorrections`` = ``SetBoundBufs<BoundaryType::flxcor_recv>``

Coalesced MPI Communication
---------------------------

As described above, a one-dimensional buffer is packed and unpacked for each communicated
field on each pair of blocks that share a unique topological element. For codes with large
numbers of variables and/or simulations run with small block sizes, this can result in
a large total number of buffers and, importantly, a large number of buffers that must be
communicated across MPI ranks. The latter can have significant performance implications,
since each ``CommBuffer<T>::Send()`` call for a non-local buffer corresponds to an
``MPI_Isend``. These messages generally contain a small amount of data, which results in
a small effective MPI bandwidth. Additionally, MPI implementations tend to struggle with
the large number of messages required. In some cases, this can result in poor
scaling behavior for Parthenon.

To get around this, we introduce a second level of buffers for communicating across ranks.
For each ``MeshData`` object on a given MPI rank, a coalesced buffer equal in size to all of
the non-local variable-boundary buffers is created for each other MPI rank that the ``MeshData``
communicates with. These coalesced buffers are then filled from the individual variable-boundary
buffers, a *single* MPI send is posted per MPI rank pair, and the receiving ranks unpack the
coalesced buffer back into the individual variable-boundary buffers. This can drastically reduce
the number of MPI sends and increase the total amount of data sent per message, thereby
increasing the effective bandwidth. Further, in cases where Parthenon is running on GPUs but
GPUDirect MPI is not available, this can also minimize the number of DtoH and HtoD copies
during communication.
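
As a rough illustration of the difference, the following sketch contrasts posting one
``MPI_Isend`` per variable-boundary buffer with packing everything bound for the same rank into
a single message. This is not Parthenon's actual API; all names below are invented for
illustration only.

.. code:: cpp

  #include <mpi.h>

  #include <map>
  #include <vector>

  using Real = double;

  // One small buffer per (variable, topological element, block pair); illustrative only.
  struct VarBoundBuf {
    int dest_rank;
    std::vector<Real> data;
  };

  // Non-coalesced path: one MPI_Isend per variable-boundary buffer.
  void SendSeparately(const std::vector<VarBoundBuf> &bufs, int tag,
                      std::vector<MPI_Request> &reqs) {
    for (const auto &b : bufs) {
      reqs.emplace_back();
      MPI_Isend(b.data.data(), static_cast<int>(b.data.size()), MPI_DOUBLE,
                b.dest_rank, tag, MPI_COMM_WORLD, &reqs.back());
    }
  }

  // Coalesced path: concatenate everything destined for the same rank into one
  // large buffer and post a single MPI_Isend per destination rank.
  void SendCoalesced(const std::vector<VarBoundBuf> &bufs, int tag,
                     std::map<int, std::vector<Real>> &coalesced,
                     std::vector<MPI_Request> &reqs) {
    for (const auto &b : bufs) {
      auto &big = coalesced[b.dest_rank];
      big.insert(big.end(), b.data.begin(), b.data.end());
    }
    for (auto &[rank, big] : coalesced) {
      reqs.emplace_back();
      MPI_Isend(big.data(), static_cast<int>(big.size()), MPI_DOUBLE, rank, tag,
                MPI_COMM_WORLD, &reqs.back());
    }
  }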

To use coalesced communication, your input must include:

.. code::

  parthenon/mesh/do_combined_comms = true

Currently, this parameter defaults to ``false``.
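
For reference, a minimal sketch of how this might appear in a Parthenon-style input deck; the
surrounding mesh parameters are illustrative placeholders, not taken from this documentation.

.. code::

  <parthenon/mesh>
  nx1 = 128                  # illustrative mesh dimensions
  nx2 = 128
  nx3 = 128
  do_combined_comms = true   # enable coalesced MPI communication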

Implementation Details
~~~~~~~~~~~~~~~~~~~~~~

The coalesced send and receive buffers for each rank are stored in ``Mesh::pcombined_buffers``,
which is a ``std::shared_ptr`` to a ``CombinedBuffers`` object. Coalesced communication requires
two pieces: 1) an initialization step that tells all ranks which coalesced buffer messages to
expect, and 2) a mechanism for packing, sending, and unpacking the coalesced buffers during each
boundary communication step.

For the first piece, after every remesh during ``BuildBoundaryBuffers``, each non-local
variable-boundary buffer is registered with ``pcombined_buffers``. Once all these buffers are
registered, ``CombinedBuffers::ResolveAndSendSendBuffers()`` is called, which determines all
the coalesced buffers that will be sent from a given rank to every other rank, packs
information about each of the coalesced buffers into MPI messages, and sends them to the other
ranks so that the receiving ranks know how to interpret the messages they receive from a given
rank. ``CombinedBuffers::ReceiveBufferInfo()`` is then called to receive this information from
other ranks. This process packs ``BndId`` objects, which contain the information
necessary to identify a variable-boundary communication channel and the amount of data that
is communicated across that channel, and then unpacks them on the receiving end and finds the
correct variable-boundary buffers. These routines are called once per rank (rather than per
``MeshData``).
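
The shape of this handshake can be sketched roughly as follows. ``BndIdSketch``,
``SendChannelInfo``, and ``ReceiveChannelInfo`` are illustrative stand-ins, not the actual
``BndId`` class or Parthenon's routines; the fields shown are assumptions about the kind of
information exchanged, not the real layout.

.. code:: cpp

  #include <mpi.h>

  #include <vector>

  // Illustrative only: the real BndId layout lives in the Parthenon source.
  struct BndIdSketch {
    int send_gid;  // global id of the sending block
    int recv_gid;  // global id of the receiving block
    int var_id;    // which variable this channel carries
    int element;   // which shared topological element (face/edge/node)
    int buf_size;  // number of Reals communicated across this channel
  };

  // Sender side: tell `other_rank` which channels its coalesced messages will contain.
  void SendChannelInfo(const std::vector<BndIdSketch> &ids, int other_rank, int tag) {
    MPI_Send(ids.data(), static_cast<int>(ids.size() * sizeof(BndIdSketch)), MPI_BYTE,
             other_rank, tag, MPI_COMM_WORLD);
  }

  // Receiver side: probe for the message size, then unpack the channel descriptions.
  std::vector<BndIdSketch> ReceiveChannelInfo(int other_rank, int tag) {
    MPI_Status status;
    MPI_Probe(other_rank, tag, MPI_COMM_WORLD, &status);
    int nbytes;
    MPI_Get_count(&status, MPI_BYTE, &nbytes);
    std::vector<BndIdSketch> ids(nbytes / sizeof(BndIdSketch));
    MPI_Recv(ids.data(), nbytes, MPI_BYTE, other_rank, tag, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
    return ids;
  }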

For the second piece, variable-boundary buffers are first filled as normal in ``SendBoundBufs``,
but the states of the ``CommBuffer`` objects are updated without actually calling the associated
``MPI_Isend``. Then ``CombinedBuffers::PackAndSend(MeshData<Real> *pmd, BoundaryType b_type)``
is called, which, for each rank pair associated with ``pmd``, packs the variable-boundary buffers
into the coalesced buffer, packs a second message containing the sparse allocation status of
each variable-boundary buffer, sends these two messages, and then stales the associated
variable-boundary buffers since their data is no longer required. On the receiving side,
``ReceiveBoundBufs`` receives these messages, sets the corresponding variable-boundary
buffers to the correct ``received`` or ``received_null`` state, and then unpacks the data
into the buffers. Note that the messages received here do not necessarily correspond to the
``MeshData`` that is passed to the associated ``ReceiveBoundBufs`` call, so all
variable-boundary buffers associated with a given receiving ``MeshData`` must still be checked
for being in a received state. Once they are all received, setting of boundaries,
prolongation, etc. can proceed as normal.
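
A minimal sketch of the pack/unpack logic, assuming the per-channel sizes are known from the
initialization handshake described above; the types and names here are illustrative, not
Parthenon's internal API.

.. code:: cpp

  #include <cstddef>
  #include <cstdint>
  #include <vector>

  using Real = double;

  // Minimal stand-in for a variable-boundary buffer; illustrative only.
  struct VarBuf {
    bool allocated;          // sparse allocation status
    std::vector<Real> data;  // payload (empty if not allocated)
  };

  // Pack side: concatenate all allocated buffers into one coalesced message and
  // record a parallel allocation-status message, mirroring the two messages
  // sent per rank pair.
  void Pack(const std::vector<VarBuf> &bufs, std::vector<Real> &coalesced,
            std::vector<std::int32_t> &alloc_status) {
    coalesced.clear();
    alloc_status.clear();
    for (const auto &b : bufs) {
      alloc_status.push_back(b.allocated ? 1 : 0);
      if (b.allocated) coalesced.insert(coalesced.end(), b.data.begin(), b.data.end());
    }
  }

  // Unpack side: the expected size of each channel is known from the post-remesh
  // handshake, so the coalesced message can be split back into individual buffers.
  void Unpack(const std::vector<Real> &coalesced,
              const std::vector<std::int32_t> &alloc_status,
              const std::vector<std::size_t> &sizes, std::vector<VarBuf> &bufs) {
    std::size_t offset = 0;
    bufs.resize(sizes.size());
    for (std::size_t i = 0; i < sizes.size(); ++i) {
      bufs[i].allocated = (alloc_status[i] != 0);  // received vs. received_null
      bufs[i].data.clear();
      if (bufs[i].allocated) {
        bufs[i].data.assign(coalesced.begin() + offset,
                            coalesced.begin() + offset + sizes[i]);
        offset += sizes[i];
      }
    }
  }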

Some notes:

- Internally, ``CombinedBuffers`` contains maps from MPI rank and ``BoundaryType`` (e.g. regular
  communication, flux correction) to ``CombinedBuffersRank`` objects for sending and receiving
  rank pairs. These ``CombinedBuffersRank`` objects in turn contain maps from the partition id
  of the sending ``MeshData`` (which also doubles as the MPI tag for the messages) to
  ``CombinedBuffersRankPartition`` objects (see the sketch after this list).
- ``CombinedBuffersRank`` is where the post-remesh initialization routines are actually
  implemented. A given instance corresponds to either the send side or the receive side.
- ``CombinedBuffersRankPartition`` corresponds to a single coalesced buffer and is where the
  packing, sending, receiving, and unpacking details for coalesced boundary communication
  are implemented. This object internally owns the ``CommunicationBuffer<BufArray1D<Real>>``
  that is used for sending and receiving the coalesced data (as well as the communication
  buffer used for communicating allocation status).
- Because Parthenon allows communication on ``MeshData`` objects that contain a subset of the
  ``Metadata::FillGhost`` fields in a simulation, we need to be able to interpret coalesced
  messages that contain a subset of fields. Most of what is needed for this is implemented
  in ``GetBndIdsOnDevice``.
- Currently, there is a ``Compare`` method in ``CombinedBuffersRankPartition`` that exists only
  for debugging. It should compare the received coalesced messages to the variable-boundary
  buffer messages, but using it requires some hacks in the code to send both types of buffers.
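
A rough sketch of how these objects nest; member names, key types, and the ``Sketch`` suffix are
illustrative only, and the real class definitions in the Parthenon source differ in detail.

.. code:: cpp

  #include <map>
  #include <utility>

  // Subset of boundary types, for illustration only.
  enum class BoundaryType { any, flxcor_send, flxcor_recv };

  struct CombinedBuffersRankPartitionSketch {
    // One coalesced buffer: owns the coalesced data buffer and the allocation-status
    // buffer, and implements pack/send/receive/unpack for one partition.
  };

  struct CombinedBuffersRankSketch {
    // Post-remesh initialization (ResolveAndSendSendBuffers / ReceiveBufferInfo) lives
    // at this level. Keyed by the partition id of the sending MeshData, which also
    // doubles as the MPI tag for the messages.
    std::map<int, CombinedBuffersRankPartitionSketch> partitions;
  };

  struct CombinedBuffersSketch {
    // Keyed by (other rank, BoundaryType), separately for the send and receive sides.
    std::map<std::pair<int, BoundaryType>, CombinedBuffersRankSketch> send, recv;
  };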
