% This file is part of the GROMACS molecular simulation package.
%
% Copyright (c) 2013, by the GROMACS development team, led by
% David van der Spoel, Berk Hess, Erik Lindahl, and including many
% others, as listed in the AUTHORS file in the top-level source
% directory and at http://www.gromacs.org.
%
% GROMACS is free software; you can redistribute it and/or
% modify it under the terms of the GNU Lesser General Public License
% as published by the Free Software Foundation; either version 2.1
% of the License, or (at your option) any later version.
%
% GROMACS is distributed in the hope that it will be useful,
% but WITHOUT ANY WARRANTY; without even the implied warranty of
% MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
% Lesser General Public License for more details.
%
% You should have received a copy of the GNU Lesser General Public
% License along with GROMACS; if not, see
% http://www.gnu.org/licenses, or write to the Free Software Foundation,
% Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
%
% If you want to redistribute modifications to GROMACS, please
% consider that scientific software is very special. Version
% control is crucial - bugs must be traceable. We will be happy to
% consider code for inclusion in the official distribution, but
% derived work must not be called official GROMACS. Details are found
% in the README & COPYING files - if they are missing, get the
% official version at http://www.gromacs.org.
%
% To help us fund GROMACS development, we humbly ask that you cite
% the research papers on the package. Check out http://www.gromacs.org
\section{Parallelization}
\label{sec:par}
\newcommand{\abs}[1]{\mid \! {#1} \! \mid}
The purpose of this section is to discuss the
\normindex{parallelization} of the
principal MD algorithm and not to describe the algorithms that are in
practical use for molecular systems with their complex variety of atoms
and terms in the force field descriptions. We shall therefore consider
as an example a simple system consisting only of a single type of atoms
with a simple form of the interaction potential. The emphasis will be
on the special problems that arise when the algorithm is implemented on
a parallel computer.

The simple model problem already contains the bottleneck of all MD
simulations: the computationally intensive evaluation of the
{\em non-bonded} forces between pairs of atoms, based on the distance
between particles. Complex molecular systems will in addition
involve many different kinds of {\em bonded} forces between designated
atoms. Such interactions add to the complexity of the algorithm but do
not modify the basic considerations concerning parallelization.
\subsection{Methods of parallelization}
There are a number of methods to parallelize the MD algorithm, each of
them with their own advantages and disadvantages. The method to
choose depends on the hardware and compilers available.
We list them here:
\begin{enumerate}
\item[1] {\em \normindex{Message Passing}.}\\
In this method, which is more or less the traditional
way of parallel programming, all the parallelism is
explicitly programmed by the user. The disadvantage
is that it takes extra code and effort, the advantage
is that the programmer keeps full control over the data
flow and can do optimizations a compiler could not come
up with.
The implementation is typically done by calling a set of
library routines to send and receive data to and from
other processors. Almost all hardware vendors support
this way of
parallelism in their C and Fortran compilers.
\item[2] {\em \swapindex{Data}{Parallel}.}\\
This method lets the user define arrays on which to
operate in parallel. Programming this way is much
like vectorizing: recurrence is not parallelized
({\eg} {\tt for(i=1; (i<MAX); i++) a[i] = a[i-1] + 1;}
neither vectorizes nor parallelizes, because for
every i the result of the previous step is needed;
see the sketch after this list).
The advantage of data parallelism is that it is
easier for the user; the compiler takes care of the
parallelism. The disadvantage is that it is supported
by a small (though growing) number of hardware vendors,
and that it is much harder to maintain a program that has to
run on both parallel and sequential machines, because
the only standard language that supports it is Fortran-90
which is not available on many platforms.
\end{enumerate}
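To make the difference concrete, the first loop below carries the recurrence
from the example above and can be neither vectorized nor parallelized, whereas
the second has independent iterations and can be distributed over processors
(this is an illustrative stand-alone program, not {\gromacs} code):
\begin{footnotesize}
\begin{verbatim}
#include <stdio.h>

#define MAX 8

int main(void)
{
  float a[MAX] = { 0 }, b[MAX], c[MAX];
  int   i;

  /* recurrence: iteration i needs the result of iteration i-1, so
   * the loop can be neither vectorized nor parallelized            */
  for (i=1; (i<MAX); i++)
    a[i] = a[i-1] + 1;

  /* no recurrence: every iteration is independent, so the loop can
   * be vectorized or distributed over processors                   */
  for (i=0; (i<MAX); i++) {
    b[i] = 2.0f*i;
    c[i] = a[i] + b[i];
  }

  for (i=0; (i<MAX); i++)
    printf("%d %g %g %g\n", i, a[i], b[i], c[i]);

  return 0;
}
\end{verbatim}
\end{footnotesize}
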
Both methods allow for the MD algorithm to be implemented without much
trouble. Message-passing MD algorithms have been published
since the mid-1980s (\cite{Fincham87}, \cite{Raine89})
and development is still continuing.
Data-parallel programming is newer,
but starting from a well-vectorized program it is not hard to do.

Our implementation of MD is a message-passing one, partly for
historical reasons: the project to develop a parallel MD program started
when Fortran-90 was still in the making, and no compilers were
expected to be available.
At present, after some experiments with data-parallel programming on a
Connection Machine (CM-5), we still believe that message passing is the
way to go, because it is portable to other hardware, because the code
produced by the data-parallel compilers performed poorly, and because
this way of programming
has the same drawback as vectorization: the part of the program that is
not vectorized or parallelized determines the run time of the program
(\normindex{Amdahl's law}).

The approach we took to parallelism was a minimalist one: use as few
non-standard elements in the software as possible, and use the
simplest \swapindex{processor}{topology} that does the job. We therefore
decided to use a standard language (ANSI-C) with as few non-standard
routines as possible. We use only five non-standard communication
routines, which makes it easy to port our code to other machines.

For an $O(N^2)$ problem like MD, one of the best schemes for the
inter-processor connections is a ring, so our software requires that
the inter-processor connections contain a ring. A ring can essentially
always be mapped onto another network such as a \normindex{hypercube}, a
bus interface ({\eg} Ethernet using
\seeindex{Message Passing Interface}{MPI} \normindex{MPI}) or
a \normindex{tree}
(CM-5). Some hardware vendors offer very luxurious connection schemes
that connect every processor to every other processor, but we do not
need such connectivity and therefore do not use it, even though it might
come in handy at times. The advantage of this simple scheme is that
{\gromacs} performs extremely well even on inexpensive workstation clusters.

When using a message-passing scheme one has to divide the particles
over the processors, which can be done in two ways:
\begin{itemize}
\item {\em \swapindex{Space}{Decomposition}.}\\
An element of space is allocated to each processor. When dividing
a cubic box with edge $b$ over $P$ processors, this can be done
by giving
each processor a slab of length $b/P$. This method
has the advantage
that each processor has about the same number of interactions
to calculate (at least when the simulated system has a homogeneous
density, like a liquid or a gas). The disadvantage is that a lot of
bookkeeping is necessary for particles that move over processor
boundaries. When using more complex systems, such as macromolecules, there
are also 3- and 4-atom interactions that make
the bookkeeping too complicated for our taste.
\item {\em \swapindex{Particle}{Decomposition}.}\\
Every processor is allocated a number of particles. When
dividing $N$ particles over $P$ processors, each processor will
get $N/P$ particles (see the sketch after this list). The implementation
of this method is described in the next section.
\end{itemize}
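A minimal sketch of such a division (the function {\tt divide\_particles} is
hypothetical, not the actual {\gromacs} code); it produces the per-processor
starting indices and numbers of home particles that are also the arguments of
the communication routine shown in the next section:
\begin{footnotesize}
\begin{verbatim}
/* Divide N particles over nprocs processors: processor p gets the
 * particles start[p] ... start[p]+homenr[p]-1 as its home particles.
 * With the integer division below the numbers of home particles
 * differ by at most one when N is not a multiple of nprocs.         */
void divide_particles(int N, int nprocs, int start[], int homenr[])
{
  int p;

  for (p=0; (p<nprocs); p++) {
    start[p]  = (p*N)/nprocs;
    homenr[p] = ((p+1)*N)/nprocs - start[p];
  }
}
\end{verbatim}
\end{footnotesize}
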
\begin {figure}[p]
\centerline{\includegraphics[width=10cm]{plots/int-mat}}
\caption[The interaction matrix.]{The interaction matrix (left) and
the same using action$~=~-$reaction (right).}
\label{fig:int_mat}
\end {figure}
\begin{table}[p]
\centerline{
\newcolumntype{s}[1]{D{/}{/}{#1}}
\begin{tabular}{|l|s4|s4|s4|s4|}
\dline
& \mcc{1}{i mod 2 = 0}
& \mcc{1}{i mod 2 = 0}
& \mcc{1}{i mod 2 = 1}
& \mcc{1}{i mod 2 = 1} \\
& \mcc{1}{i $<$ N/2}
& \mcc{1}{i $\ge$ N/2}
& \mcc{1}{i $<$ N/2}
& \mcc{1}{i $\ge$ N/2} \\
\hline
N mod 2 = 1 & N/2 & N/2 & N/2 & N/2 \\
N mod 4 = 2 & N/2 & N/2 & N/2 - 1 & N/2 - 1 \\
N mod 4 = 0 & N/2 & N/2 - 1 & N/2 - 1 & N/2 \\
\dline
\end{tabular}
}
\caption[The number of interactions between particles.]{The number of
interactions between particles. The number of $j$ particles per $i$
particle is a function of the total number of particles $N$ and
particle number $i$. Note that here the $/$ operator is used for
integer division, {\ie} discarding the remainder.}
\label{tab:decomp}
\end{table}
\begin{figure}[p]
\centerline{\includegraphics[width=\linewidth]{plots/decomp}}
\caption[Interaction matrices for different $N$.]{Interaction matrices for different $N$. The number of $j$-particles an $i$-particle interacts with depends on the {\em total} number of particles and on the {\em particle number}.}
\label{fig:decomp}
\end{figure}
\subsection{MD on a ring of processors}
When a neighbor list is not used, the MD problem is in principle an $O(N^2)$
problem, as each particle can interact
with every other particle. This can be simplified using Newton's third law
\beq
F_{ij} ~=~ -F_{ji}
\label{eqn:Newt3}
\eeq
This implies that there is half a matrix of interactions (without the diagonal;
a particle does not interact with itself) to consider (\figref{int_mat}).
When we reflect the upper right triangle of interactions to the lower
left triangle of the matrix, we still cover all possible interactions,
but now every row in the matrix has almost the same number of points
or possible interactions. We can now assign a (preferably equal)
number of rows to each processor to compute the forces and at the same
time a number of particles to do the update on, the {\em home}
particles. The number of interactions per particle depends on the
{\em total number} $N$ of particles (see \figref{decomp}) and on the
{\em particle number} $i$. The exact formulae are given in
\tabref{decomp} and are written out in the code sketch below.
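As an illustration, the entries of \tabref{decomp} can be written out directly
in C (the function {\tt n\_partners} is hypothetical and not part of the
{\gromacs} code; {\tt /} is integer division, as in the table):
\begin{footnotesize}
\begin{verbatim}
/* Number of j particles interacting with particle number i
 * (0 <= i < N), following the table above.                  */
int n_partners(int N, int i)
{
  if (N % 2 == 1)                          /* N odd           */
    return N/2;
  else if (N % 4 == 2)                     /* N mod 4 = 2     */
    return (i % 2 == 0) ? N/2 : N/2 - 1;
  else if (i % 2 == 0)                     /* N mod 4 = 0     */
    return (i < N/2) ? N/2     : N/2 - 1;
  else
    return (i < N/2) ? N/2 - 1 : N/2;
}
\end{verbatim}
\end{footnotesize}
Summing {\tt n\_partners} over all $i$ gives $N(N-1)/2$, the total number of
particle pairs, as it should.
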
A flow chart of the algorithm is given in \figref{mdpar}.
\begin{figure}
\centerline{\includegraphics[height=18cm]{plots/mdpar}}
\caption[The Parallel MD algorithm.]{The Parallel MD algorithm. If
the steps marked * are left out we have the sequential algorithm
again.}
\label{fig:mdpar}
\end{figure}
\vfill

It is the same as the sequential algorithm, except for two
communication steps. After the particles have been reset in the box,
each processor sends its coordinates onward (left) and then starts computation
of the forces. After this step each processor holds the {\em partial
forces} for the available particles, {\eg} processor 0 holds forces
acting on home particles from processors 0, 1, 2 and 3. These forces
must be accumulated and sent back (right) to the home
processor. Finally the update of the velocities and coordinates is done
on the home processor.

The {\tt communicate\_r} routine is given below in full C code:\\
\begin{footnotesize}
\begin{verbatim}
void communicate_r(int nprocs,int pid,rvec vecs[],int start[],int homenr[])
/*
 * nprocs = number of processors
 * pid    = processor id (0..nprocs-1)
 * vecs   = vectors
 * start  = starting index in vecs for each processor
 * homenr = number of home particles for each processor
 */
{
  int i;      /* processor counter                             */
  int shift;  /* the amount of processors to communicate with  */
  int cur;    /* current processor to send data from           */
  int next;   /* next processor on a ring (using modulo)       */

  cur   = pid;
  shift = nprocs/2;

  for (i=0; (i<shift); i++) {
    next = (cur+1) % nprocs;
    send   (left,  vecs[start[cur]],  homenr[cur]);
    receive(right, vecs[start[next]], homenr[next]);
    cur = next;
  }
}
\end{verbatim}
\end{footnotesize}
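The forces can be accumulated with a routine that mirrors {\tt communicate\_r},
sending to the right and summing along the way. The sketch below (a hypothetical
{\tt communicate\_f}; the actual {\gromacs} routine may differ in details) uses
the same assumed {\tt send}/{\tt receive} primitives as above, an assumed
receive buffer of size {\tt MAXHOME}, and an assumed {\tt rvec\_inc()} that adds
one force vector onto another:
\begin{footnotesize}
\begin{verbatim}
void communicate_f(int nprocs,int pid,rvec vecs[],int start[],int homenr[])
/*
 * Accumulate partial forces: each processor is assumed to hold partial
 * forces for the particle blocks pid ... pid+nprocs/2. These are sent
 * to the right around the ring and summed along the way, so that after
 * nprocs/2 steps block pid holds the total forces on the home particles.
 */
{
  int  i,j;
  int  shift,cur,prev;
  rvec buf[MAXHOME];                  /* assumed receive buffer       */

  shift = nprocs/2;
  cur   = (pid+shift) % nprocs;       /* farthest block we hold       */

  for (i=0; (i<shift); i++) {
    prev = (cur-1+nprocs) % nprocs;
    send   (right, vecs[start[cur]], homenr[cur]);
    receive(left,  buf,              homenr[prev]);
    for (j=0; (j<homenr[prev]); j++)  /* add incoming partial forces  */
      rvec_inc(vecs[start[prev]+j], buf[j]);
    cur = prev;
  }
}
\end{verbatim}
\end{footnotesize}
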
The data flow around the ring is visualized in \figref{ring}.
Note that because of the ring topology each processor automatically
gets the proper particles to interact with.
\begin {figure}
\centerline{\includegraphics[width=6cm]{plots/ring}}
\caption {Data flow in a ring of processors.}
\label{fig:ring}
\end {figure}
\section{Parallel Molecular Dynamics}
In this section we describe some details of the \swapindex{parallel}{MD}
algorithm used
in {\gromacs}. This also includes some information on neighbor searching and
a side excursion to parallel sorting.
Please note the following definitions, which we use throughout this section:\\
{\natom}: number of particles; {\nproc}: number of processors.\\
{\gromacs} employs two different grids: the neighbor searching grid ({\nsgrid})
and the combined charge/potential grid ({\fftgrid}), as will be described below.
To maximize the confusion,
these two grids are mapped onto a grid of processors when {\gromacs} runs on a
parallel computer.
\subsection{Domain decomposition}
Modern parallel computers, such as an IBM SP/2 or a Cray T3E,
consist of relatively small numbers of relatively fast scalar
processors (typically 8 to 256). The communication channels that are
available in hardware on these machines are not directly visible to
the programmer; a software layer (usually \normindex{MPI})
hides this, and makes communication from all processors to all
others possible. In contrast, in the {\gromacs} hardware~\cite{Berendsen95a}
only communication in a ring was available, {\ie} each processor could communicate
with its direct neighbors only.

It seems logical to map the computational box of an MD simulation system
onto a 3D grid of
processors ({\eg} $4 \times 4 \times 4$ for a 64-processor system). This ensures that most
interactions that are local in space can be computed with information from
neighboring processors only. However, this means that there have to be
communication channels in 3 dimensions too, which is not necessarily the case.
Although this may be overcome in software, such a mapping complicates the MD
software as well, without clear performance benefits on most
parallel computers.

Therefore we opt for a simple one-dimensional division scheme
for the computational box. Each processor gets a slab of this box in the
X dimension.
For the communication between processors this has two main advantages:
\begin{enumerate}
\item Simplicity of coding. Communication can only be to two neighbors
(called {\em left} and {\em right} in {\gromacs}).
\item Communication can usually be done in large chunks, which makes it
more efficient on most hardware platforms.
\end{enumerate}

Most interactions in molecular dynamics have in principle a
short-range character. Bonds, angles and dihedrals are guaranteed to
have the corresponding particles close in space.
\subsection{Domain decomposition for non-bonded forces}
For large parallel computers, domain decomposition is preferable over
particle decomposition, since it is easier to do load
balancing. Without load balancing the scaling of the code is rather
poor. For this purpose, the computational box is divided into {\nproc}
slabs, where {\nproc} is equal to the number of processors. There are
multiple ways of dividing the box over processors, but since the
{\gromacs} code assumes a ring topology for the processors, it is
logical to cut the system into slabs in just one dimension, the X
dimension. The algorithm for neighbor searching then becomes:
\begin{enumerate}
\item Make a list of charge group indices sorted on (increasing) X coordinate
(\figref{parsort}).
{\bf Note} that care must be taken to parallelize the sorting algorithm
as well. See \secref{parsort}.
\item Divide this list into slabs, with each slab having the same number of
charge groups.
\item Put the particles corresponding to the local slab on a 3D {\nsgrid} as
described in \secref{nsgrid}.
\item Communicate the {\nsgrid} to neighboring processors (not necessarily to all
processors). The number of neighboring {\nsgrid} cells (N$_{gx}$) to
communicate is determined by the cut-off length $r_c$ according to
\beq
N_{gx} ~=~ \frac{r_c \nproc}{l_x}
\eeq
where $l_x$ is the box length in the slabbing direction
(a small sketch of this computation is given after this list).
\item On each processor compute the neighbor list for all charge groups in
its slab using the normal grid neighbor-searching.
\end{enumerate}
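A minimal sketch of the cell count in step 4; rounding up is our assumption
here, since a partially covered cell layer still has to be communicated in
full:
\begin{footnotesize}
\begin{verbatim}
#include <math.h>

/* number of neighboring ns-grid cell layers N_gx to communicate,
 * following the equation above: N_gx = r_c * nprocs / l_x        */
int ngx_to_communicate(double rc, int nprocs, double lx)
{
  return (int)ceil(rc*nprocs/lx);
}
\end{verbatim}
\end{footnotesize}
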
\begin{figure}
\centerline{\includegraphics[width=10cm]{plots/parsort}}
\caption[Index in the coordinate array.]{Index in the coordinate
array. The division in slabs is indicated by dashed lines.}
\label{fig:parsort}
\end{figure}

For a homogeneous system, this is close to an optimal load balancing,
without actually doing load balancing. For inhomogeneous systems, such
as membranes or interfaces, the slabs should be perpendicular to the
interface; this way, each processor has ``a little bit of
everything''. The {\gromacs} utility program {\tt editconf} has an
option to rotate a whole computational box.

The following observations are important here:
\begin{itemize}
\item Particles may diffuse from one slab to the other; therefore each processor
must hold coordinates for all particles all the time, and distribute forces
back to all processors as well.
\item Velocities are kept on the ``home processor'' for each particle,
where the integration of Newton's equations is done.
\item Fixed interaction lists (bonds, angles, etc.) are each kept
on a single processor. Since all processors have all
coordinates, it does not matter where interactions are
calculated. The division is actually done by the {\gromacs}
preprocessor {\tt grompp}, and care is taken that, as far as
possible, every processor gets the same number of bonded
interactions.
\end{itemize}

In all, this makes for a mixed particle decomposition/domain decomposition scheme
for the parallelization of the MD code. The communication costs are four times higher
than for the simple particle decomposition method described in \secref{par}:
the whole coordinate and force arrays are communicated across the whole ring,
rather than half the arrays over half the ring (see the estimate below).
However, for large numbers of processors the improved load balancing
compensates for this easily.
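Counting the communication cost as (fraction of the array communicated)
$\times$ (fraction of the ring it travels), the factor of four follows
directly:
\beq
\frac{1 \times 1}{\frac{1}{2} \times \frac{1}{2}} ~=~ 4.
\eeq
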
\subsection{Parallel \normindex{PPPM}}
A further reason for domain decomposition is the PPPM algorithm. This
algorithm works with a 3D Fast Fourier Transform. It employs a
discrete grid of dimensions (\nx,\ny,\nz), the {\fftgrid}. The
algorithm consists of five steps, each of which has to be
parallelized:
\begin{enumerate}
\item Spreading charges on the {\fftgrid} to obtain the charge
distribution $\rho(\ve{r})$.
This bit involves the following sub-steps (a minimal C sketch of which
is given after the five steps):
\begin{enumerate}
\item[{\bf a.}] put the particle in the box
\item[{\bf b.}] find the {\fftgrid} cell in which the particle resides
\item[{\bf c.}] add the charge of the particle times the appropriate
weight factor (see \secref{pppm}) to
each of the 27 grid points ($3 \times 3 \times 3$).
\end{enumerate}
In the parallel case, the {\fftgrid}
must be filled on each processor with its
share of the particles, and subsequently the {\fftgrid}s of all processors
must be summed to find the total charge distribution. It may be clear that
this induces a large amount of unnecessary work, unless we use domain
decomposition. If each processor only has particles in a certain region
of space, it only has to calculate the charge distribution for
that region of space. Since {\gromacs} works with slabs, this means that
each processor fills the {\fftgrid} cells corresponding to its slab in space
and addition of {\fftgrid}s need only be done for neighboring slabs.\\
To be more precise, the slab of $x$ values for processor $i$ (where $M$
denotes the number of slabs, {\ie} processors) is defined as:
\beq
i\, \frac{l_x}{M} \le x <\, (i+1)\frac{l_x}{M}
\eeq
Particles in this $x$ coordinate range will add to the charge distribution
on the following range
of {\fftgrid} slabs in the $x$ direction:
\beq
{\rm trunc}\left(i\,\frac{l_x \nx}{M}\right)-1 \le i_x \le {\rm trunc}\left((i+1)\,\frac{l_x \nx}{M}\right)+2
\eeq
where trunc indicates the truncation of a real number to the largest integer
smaller than or equal to that real number.
\item Doing the Fourier transform of the charge distribution $\rho(\ve{r})$
in parallel to obtain $\hat{\rho}(\ve{k})$. This is done using
the FFTW library (see \href{http://www.fftw.org}{www.fftw.org})
which employs the MPI library for message passing programs
(note that there are also \swapindex{shared}{memory} versions
of the FFTW code).\\
This FFT algorithm actually uses slabs as well (good
thinking!). Each processor does 2D FFTs on its slab, and then
the whole {\fftgrid} is transposed {\em in place}
({\ie} without using extra memory). This means that after the
FFT the X and Y components are swapped. To complete the FFT,
this swapping should be undone in principle (by transposing
back). Happily the FFTW code has an option to omit this,
which we use in the next step.
\item Convolute $\hat{\rho}(\ve{k})$ with the Fourier transform of the
charge spread function $\hat{g}(\ve{k})$ (which we have tabulated before)
to obtain the potential $\hat{\phi}(\ve{k})$.
As an optimization, we store $\hat{g}(\ve{k})$ in transposed form
as well, matching the transposed form of $\hat{\rho}(\ve{k})$
that we get from the FFTW routine. After this step we have the
potential $\hat{\phi}(\ve{k})$ in Fourier space, but still on the transposed
{\fftgrid}.
\item Do an inverse transform of $\hat{\phi}(\ve{k})$ to obtain
${\phi}(\ve{r})$. Since the algorithm must do a transpose of the data,
this step actually yields the desired result: the un-transposed
potential in real space.
\item Interpolate the potential ${\phi}(\ve{r})$ in real space at the particle
positions to obtain forces and energy. For this bit the same considerations
about parallelism hold as for the charge spreading. However, in this
case more neighboring grid cells are needed, implying that we need
the following set of {\fftgrid} slabs in the $x$ direction:
\beq
{\rm trunc}\left(i\,\frac{l_x \nx}{M}\right)-3 \le i_x \le {\rm trunc}\left((i+1)\,\frac{l_x \nx}{M}\right)+4
\eeq
\end{enumerate}
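As announced in step 1, here is a minimal sketch of the charge-spreading
sub-steps {\bf a}--{\bf c}; the weight function {\tt spread\_weight()} and the
grid layout chosen here are hypothetical (the actual weights are tabulated, see
\secref{pppm}), and the box is assumed rectangular:
\begin{footnotesize}
\begin{verbatim}
#include <math.h>

double spread_weight(int dx, int dy, int dz);  /* assumed, see PPPM */

/* Spread the charge q of a particle at (x,y,z) onto the 3x3x3 fftgrid
 * cells around the cell in which the particle resides. The box
 * (lx,ly,lz) is periodic and carries nx*ny*nz grid cells; the charge
 * grid is stored as rho[(ix*ny + iy)*nz + iz].                        */
void spread_charge(double q, double x, double y, double z,
                   double lx, double ly, double lz,
                   int nx, int ny, int nz, double rho[])
{
  int ix,iy,iz, dx,dy,dz, jx,jy,jz;

  /* a: put the particle in the box; b: find its fftgrid cell */
  x -= lx*floor(x/lx);  ix = (int)(x*nx/lx);
  y -= ly*floor(y/ly);  iy = (int)(y*ny/ly);
  z -= lz*floor(z/lz);  iz = (int)(z*nz/lz);

  /* c: add the weighted charge to each of the 27 grid points  */
  for (dx=-1; (dx<=1); dx++)
    for (dy=-1; (dy<=1); dy++)
      for (dz=-1; (dz<=1); dz++) {
        jx = (ix+dx+nx) % nx;          /* periodic wrapping    */
        jy = (iy+dy+ny) % ny;
        jz = (iz+dz+nz) % nz;
        rho[(jx*ny + jy)*nz + jz] += q*spread_weight(dx,dy,dz);
      }
}
\end{verbatim}
\end{footnotesize}
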
The algorithm as sketched above requires communication for spreading
the charges, for the forward and backward FFTs, and for interpolating
the forces. The {\gromacs} bits of the program use only left and
right communication, {\ie} using two communication channels. The FFTW
routines actually use other forms of communication as well, and these
routines are coded with MPI routines for message passing. This implies
that {\gromacs} can only perform the PPPM algorithm on parallel
computers that support MPI. However, most
\swapindex{shared}{memory} computers, such as the SGI Origin, also
support MPI using the
shared memory for communication.
\subsection{Parallel sorting}
\label{sec:parsort}
For the domain decomposition bit of {\gromacs} it is necessary to sort the
coordinates (or rather the indices to the coordinates) every time a neighbor list is made.
If we use brute force and sort all coordinates on each processor (which is
technically possible since we have all the coordinates), then this sorting procedure
will take a constant wall clock time, proportional to {\natom}$\log${\natom},
regardless of the number of processors. We can, however, do a little
better if we assume that particles diffuse only slowly.

A parallel sorting algorithm can be conceived as follows: \\
At the first step of the simulation
\begin{enumerate}
\item Do a full sort of all indices using {\eg} the Quicksort algorithm that is
built into the standard C library ({\tt qsort}; a sketch is given after these lists).
\item Divide the sorted array into slabs (as described above; see
\figref{parsort}).
\end{enumerate}
At subsequent steps of the simulation:
\begin{enumerate}
\item Send the indices for each processor to the preceding processor (if
not processor 0) and to the next processor (if not {\nproc}-1). The
communication associated with this operation is proportional to
2{\natom}/{\nproc}.
\item Sort the combined indices of the three (or two) processors. Note that
the CPU time associated with sorting is now proportional to
(3{\natom}/{\nproc})$\,\log\,$(3{\natom}/{\nproc}).
\item On each processor, the indices belonging to its slab can be determined
from the order of the array (\figref{parsort}).
\end{enumerate}
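A minimal sketch of the sorting itself, using the {\tt qsort} routine from the
standard C library on an index array and comparing on the $x$ coordinate (the
coordinate array {\tt xx} and the function names are hypothetical; {\tt rvec}
is assumed to be indexable as {\tt xx[i][0]}):
\begin{footnotesize}
\begin{verbatim}
#include <stdlib.h>

static rvec *xx;   /* coordinates; xx[i][0] is the x coordinate of i */

/* compare two indices on increasing x coordinate, for qsort() */
static int compare_x(const void *a, const void *b)
{
  double xa = xx[*(const int *)a][0];
  double xb = xx[*(const int *)b][0];

  return (xa < xb) ? -1 : ((xa > xb) ? 1 : 0);
}

/* sort n indices index[0..n-1] on increasing x coordinate; used for
 * the full sort at the first step and for the local sort over the
 * (at most) three neighboring slabs at subsequent steps             */
void sort_on_x(int n, int index[])
{
  qsort(index, n, sizeof(index[0]), compare_x);
}
\end{verbatim}
\end{footnotesize}
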
%\section{A Worked Example}
%Suppose our box size is 4.0 nm and the cut-off is 0.8 nm. For neighborsearching
%we use {\dgrid} = 2, such that there 10 {\nsgrid} cells.
% LocalWords: Parallelization parallelization parallelize Fortran vectorizing
% LocalWords: parallelized vectorize Amdahl's MPI ij ji decomp mdpar nprocs gx
% LocalWords: pid rvec vecs homenr dihedrals indices parsort nsgrid editconf
% LocalWords: preprocessor grompp PPPM pppm trunc FFTW FFT FFTs Convolute un
% LocalWords: SGI Quicksort formulae