diff --git a/doc/sphinx/src/boundary_communication.rst b/doc/sphinx/src/boundary_communication.rst
index 39e38e81749c..ad2a1d486af3 100644
--- a/doc/sphinx/src/boundary_communication.rst
+++ b/doc/sphinx/src/boundary_communication.rst
@@ -476,3 +476,93 @@ For backwards compatibility, we keep the aliases
 
 - ``ReceiveFluxCorrections`` = ``ReceiveBoundBufs``
 - ``SetFluxCorrections`` = ``SetBoundBufs``
+
+Coalesced MPI Communication
+---------------------------
+
+As described above, a one-dimensional buffer is packed and unpacked for each communicated
+field on each pair of blocks that share a unique topological element. For codes with large
+numbers of variables and/or simulations run with small block sizes, this can result in a
+large total number of buffers and, importantly, a large number of buffers that must be
+communicated across MPI ranks. The latter can have significant performance implications,
+since each ``CommBuffer::Send()`` call for these non-local buffers corresponds to an
+``MPI_Isend``. These messages generally contain a small amount of data, which results in a
+small effective MPI bandwidth. Additionally, MPI implementations seem to struggle with the
+large number of messages required. In some cases, this can result in poor scaling behavior
+for Parthenon.
+
+To get around this, we introduce a second level of buffers for communicating across ranks.
+For each ``MeshData`` object on a given MPI rank and for each other MPI rank that the
+``MeshData`` communicates with, a coalesced buffer is created whose size equals the combined
+size of all the MPI non-local variable-boundary buffers shared between those ranks. These
+coalesced buffers are filled from the individual variable-boundary buffers, a *single* MPI
+send is called per MPI rank pair, and the receiving ranks unpack the coalesced buffer into
+the individual variable-boundary buffers. This can drastically reduce the number of MPI
+sends and increase the total amount of data sent per message, thereby increasing the
+effective bandwidth. Further, in cases where Parthenon is running on GPUs but GPUDirect MPI
+is not available, this can also minimize the number of DtoH and HtoD copies during
+communication.
+
+To use coalesced communication, your input file must include:
+
+.. code::
+
+   parthenon/mesh/do_combined_comms = true
+
+Currently, this is set to ``false`` by default.
+
+Implementation Details
+~~~~~~~~~~~~~~~~~~~~~~
+
+The coalesced send and receive buffers for each rank are stored in
+``Mesh::pcombined_buffers``, which is a ``std::shared_ptr`` to a ``CombinedBuffers`` object.
+Coalesced communication requires two pieces: 1) an initialization step telling all ranks
+which coalesced buffer messages they can expect and 2) a mechanism for packing, sending, and
+unpacking the coalesced buffers during each boundary communication step.
+
+For the first piece, after every remesh during ``BuildBoundaryBuffers``, each non-local
+variable-boundary buffer is registered with ``pcombined_buffers``. Once all these buffers
+are registered, ``CombinedBuffers::ResolveAndSendSendBuffers()`` is called, which determines
+all the coalesced buffers that are going to be sent from a given rank to every other rank,
+packs information about each of the coalesced buffers into MPI messages, and sends that
+information to the other ranks so that the receiving ranks know how to interpret the
+messages they receive from a given rank. ``CombinedBuffers::ReceiveBufferInfo()`` is then
+called to receive this information from other ranks. This process packs ``BndId`` objects,
+which contain the information necessary to identify a variable-boundary communication
+channel and the amount of data communicated across that channel; the receiving end unpacks
+them and finds the corresponding variable-boundary buffers. These routines are called once
+per rank (rather than once per ``MeshData``).
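+
+To make the descriptor exchange concrete, the following is a minimal sketch of the idea,
+not the actual Parthenon implementation: ``BndIdSketch``, ``PackDescriptors``, and
+``UnpackDescriptors`` are hypothetical stand-ins for ``BndId`` and the packing and unpacking
+performed by ``ResolveAndSendSendBuffers()`` and ``ReceiveBufferInfo()``. The point is that
+the channel descriptions are flattened into one contiguous message per rank pair, which can
+be sent with a single MPI call and rebuilt on the receiving side.
+
+.. code:: cpp
+
+   #include <cstdint>
+   #include <vector>
+
+   // Hypothetical stand-in for the information carried by a BndId object.
+   struct BndIdSketch {
+     int send_gid;      // global id of the sending block
+     int recv_gid;      // global id of the receiving block
+     int var_id;        // which variable is communicated
+     std::int64_t size; // number of elements in the variable-boundary buffer
+   };
+
+   // Flatten the descriptors into one contiguous message for a given rank pair.
+   std::vector<std::int64_t> PackDescriptors(const std::vector<BndIdSketch> &ids) {
+     std::vector<std::int64_t> msg;
+     msg.reserve(4 * ids.size());
+     for (const auto &id : ids) {
+       msg.push_back(id.send_gid);
+       msg.push_back(id.recv_gid);
+       msg.push_back(id.var_id);
+       msg.push_back(id.size);
+     }
+     return msg;
+   }
+
+   // Rebuild the descriptors on the receiving rank so it can associate each
+   // chunk of a coalesced buffer with the correct variable-boundary buffer.
+   std::vector<BndIdSketch> UnpackDescriptors(const std::vector<std::int64_t> &msg) {
+     std::vector<BndIdSketch> ids(msg.size() / 4);
+     for (std::size_t i = 0; i < ids.size(); ++i) {
+       ids[i] = {static_cast<int>(msg[4 * i]), static_cast<int>(msg[4 * i + 1]),
+                 static_cast<int>(msg[4 * i + 2]), msg[4 * i + 3]};
+     }
+     return ids;
+   }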
+
+For the second piece, variable-boundary buffers are first filled as normal in
+``SendBoundBufs``, but the states of the ``CommBuffer`` objects are updated without actually
+making the associated ``MPI_Isend`` calls. Then
+``CombinedBuffers::PackAndSend(MeshData *pmd, BoundaryType b_type)`` is called, which, for
+each rank pair associated with ``pmd``, packs the variable-boundary buffers into the
+coalesced buffer, packs a second message containing the sparse allocation status of each
+variable-boundary buffer, sends these two messages, and then stales the associated
+variable-boundary buffers since their data is no longer required. On the receiving side,
+``ReceiveBoundBufs`` receives these messages, sets the corresponding variable-boundary
+buffers to the correct ``received`` or ``received_null`` state, and then unpacks the data
+into the buffers. Note that the messages received here do not necessarily correspond to the
+``MeshData`` that is passed to the associated ``ReceiveBoundBufs`` call, so all
+variable-boundary buffers associated with a given receiving ``MeshData`` must still be
+checked for being in a received state. Once they are all in a received state, setting of
+boundaries, prolongation, etc. can proceed normally.
+
+Some notes:
+
+- Internally, ``CombinedBuffers`` contains maps from MPI rank and ``BoundaryType`` (e.g.
+  regular communication, flux correction) to ``CombinedBuffersRank`` objects for sending and
+  receiving rank pairs (see the sketch after this list). These ``CombinedBuffersRank``
+  objects in turn contain maps from the partition id of the sending ``MeshData`` (which also
+  doubles as the MPI tag for the messages) to ``CombinedBuffersRankPartition`` objects.
+- ``CombinedBuffersRank`` is where the post-remesh initialization routines are actually
+  implemented. A given ``CombinedBuffersRank`` object corresponds to either the send side or
+  the receive side.
+- ``CombinedBuffersRankPartition`` corresponds to a single coalesced buffer and is where the
+  packing, sending, receiving, and unpacking details for coalesced boundary communication
+  are implemented. This object internally owns the ``CommBuffer`` that is used for sending
+  and receiving the coalesced data (as well as the communication buffer used for
+  communicating allocation status).
+- Because Parthenon allows communication on ``MeshData`` objects that contain a subset of the
+  ``MetaData::FillGhost`` fields in a simulation, we need to be able to interpret coalesced
+  messages that contain a subset of fields. Most of what is needed for this is implemented in
+  ``GetBndIdsOnDevice``.
+- Currently, there is a ``Compare`` method in ``CombinedBuffersRankPartition`` that is meant
+  purely for debugging. It should compare the received coalesced messages to the
+  variable-boundary buffer messages, but using it requires some hacks in the code to send
+  both types of buffers.
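+
+The bullet points above describe a nested ownership structure. The following is a rough,
+hypothetical sketch of that hierarchy using simplified stand-in types (the ``*Sketch`` names
+are not part of Parthenon, and the real classes carry considerably more state, such as the
+coalesced ``CommBuffer`` and the allocation-status buffer):
+
+.. code:: cpp
+
+   #include <map>
+   #include <utility>
+
+   enum class BoundaryTypeSketch { regular, flux_correction };
+
+   // One coalesced buffer for a (rank pair, MeshData partition) combination;
+   // implements packing, sending, receiving, and unpacking of coalesced data.
+   struct CombinedBuffersRankPartitionSketch {};
+
+   // One sending or receiving rank pair; keyed on the partition id of the
+   // sending MeshData, which doubles as the MPI tag of the coalesced messages.
+   struct CombinedBuffersRankSketch {
+     std::map<int, CombinedBuffersRankPartitionSketch> partitions;
+   };
+
+   // Owned by the Mesh; keyed on (other rank, boundary type), separately for
+   // the send side and the receive side.
+   struct CombinedBuffersSketch {
+     std::map<std::pair<int, BoundaryTypeSketch>, CombinedBuffersRankSketch> send_side;
+     std::map<std::pair<int, BoundaryTypeSketch>, CombinedBuffersRankSketch> recv_side;
+   };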