% -*- mode: latex; mode: visual-line; fill-column: 9999; coding: utf-8 -*-
\section{Benchmark Environment}
\label{sec:system}
Our benchmark environment consisted of three different XSEDE \cite{xsede} HPC resources (described in Section~\ref{sec:hpcresources}), the software stack (Section~\ref{sec:software}), which had to be compiled for each resource, and a common test data set (Section~\ref{sec:data}).
\subsection{HPC Resources}
\label{sec:hpcresources}
The computational experiments were executed on standard compute nodes of three XSEDE \cite{xsede} supercomputers, \emph{SDSC Comet}, \emph{PSC Bridges}, and \emph{LSU SuperMIC} (Table~\ref{tab:sys-config}).
\emph{SDSC Comet} is a 2 PFlop/s cluster with 2,020 compute nodes in total. It is optimized for running a large number of medium-size calculations (up to 1,024 cores) to support the most prevalent type of calculation on XSEDE resources.
\emph{PSC Bridges} is a 1.35 PFlop/s cluster with several types of compute nodes, including 16 GPU nodes, 8 large memory nodes, 2 extreme memory nodes, and 752 regular shared memory (RSM) nodes.
It was designed to flexibly support both traditional HPC uses (medium-scale calculations) and non-traditional ones (data analytics).
\emph{LSU SuperMIC} offers 360 standard compute nodes with a peak performance of 557 TFlop/s.
The parallel file system on all three machines is Lustre (\url{http://lustre.org/}) and is shared between the nodes of each cluster.
\begin{table}[ht!]
\centering
\caption[Configuration of HPC resources]
{Configuration of the HPC resources that were benchmarked. Only a subset of the total available nodes was used. IB: InfiniBand; OPA: Omni-Path Architecture.}
\label{tab:sys-config}
\begin{adjustbox}{max width=\textwidth}
\begin{tabular}{c c c c c c c c}
\toprule
\bfseries\thead{Name} & \bfseries\thead{Node Type} & \makecell{\bfseries\thead{Number \\of Nodes}} & \bfseries\thead{CPUs} & \bfseries\thead{RAM} & \bfseries\thead{Network Topology} & \makecell{\bfseries\thead{Scheduler and \\ Resource Manager}} & \makecell{\bfseries\thead{Parallel\\File System}}\\
\midrule
\bfseries \emph{SDSC Comet} & Compute & 1,944 & \makecell{2 Intel Xeon (E5-2680 v3) \\ 12 cores/CPU, 2.5 GHz} & 128 GB DDR4 DRAM & 56 Gbps IB & SLURM & Lustre\\
\bfseries \emph{PSC Bridges} & RSM & 752 & \makecell{2 Intel Haswell (E5-2695 v3) \\ 14 cores/CPU, 2.3 GHz} & 128 GB DDR4-2133 MHz & 12.37 GB/s OPA & SLURM & Lustre\\
\bfseries \emph{LSU SuperMIC} & Standard & 360 & \makecell{2 Intel Ivy Bridge (E5-2680 v2) \\ 10 cores/CPU, 2.8 GHz} & 64 GB DDR3-1866 MHz & 56 Gbps IB & PBS & Lustre\\
\bottomrule
\end{tabular}
\end{adjustbox}
\end{table}
\subsection{Software}
\label{sec:software}
Table~\ref{tab:version} lists the tools and libraries that were required for our computational experiments. Many domain-specific packages are not available in the standard software installation on supercomputers.
We therefore had to compile them, which in some cases required substantial effort due to non-standard building and installation procedures or lack of good documentation.
Because this is a common problem that hinders reproducibility, we provide in Table~\ref{tab:version} detailed version information, notes on the installation process, and comments on the ease of installation and the quality of the documentation.
For the MPI implementation we used Open MPI release 1.10.7 (\url{https://www.open-mpi.org/}) consistently everywhere.
We used the \package{h5py} package to access HDF5 files; parallel HDF5 from Python was possible because both of its dependencies, the HDF5 library itself and \package{mpi4py}, were built against the same Open MPI installation.
We used Python 2.7 because it provided maximum compatibility between packages at the time when the project was started.
In principle, the complete Python-dependent software stack could also be set up with Python 3.5 or higher, which is now recommended because Python 2 reached its end of life in January 2020.
Detailed instructions to create the computing environments together with the benchmarking code can be found in the GitHub repository as described in Section~\ref{sec:sharing}.
Carefully setting up the same software stack on the three different supercomputers allowed us to clearly demonstrate the reproducibility of our results and showed that our findings were not dependent on machine specifics.
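As an illustration of what a consistently built stack enables (a minimal sketch only, not the benchmark code itself; the file and dataset names are placeholders), an HDF5 file can be opened collectively from Python with \package{h5py} and \package{mpi4py} so that each MPI rank reads its own block of frames:
\begin{verbatim}
# Minimal sketch of MPI-parallel HDF5 access from Python.
# Assumes h5py was built against parallel HDF5 and the same MPI as mpi4py.
from mpi4py import MPI
import h5py

comm = MPI.COMM_WORLD

# The "mpio" driver switches h5py to MPI-IO; all ranks open the file together.
with h5py.File("trajectory.h5", "r", driver="mpio", comm=comm) as f:
    positions = f["positions"]        # placeholder dataset name
    n_frames = positions.shape[0]
    block = n_frames // comm.size     # contiguous block of frames per rank
    start = comm.rank * block
    stop = n_frames if comm.rank == comm.size - 1 else start + block
    local_xyz = positions[start:stop, :]  # independent read of this rank's slice
\end{verbatim}
Such a script would be launched through the resource manager listed in Table~\ref{tab:sys-config} (e.g., \texttt{srun python script.py} under SLURM or \texttt{mpirun python script.py} under PBS).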
\begin{table}[ht!]
\centering
\caption[Versions of the packages used in the present study]%
{Detailed comparison of the dependencies and installation of the software packages used in the present study. Software was built from source or obtained via a package manager and installed on the multi-user HPC systems in Table~\protect\ref{tab:sys-config}. The evaluation of ease of installation and documentation uses a subjective scale with ``++'' (excellent), ``+'' (good), ``0'' (average), and ``$-$'' (difficult/lacking) and reflects the experience of a typical domain scientist at the graduate/post-graduate level in a discipline such as computational biophysics or chemistry.}
\label{tab:version}
\begin{adjustbox}{max width=\textwidth}
\begin{tabular}{l c l c c l l}
\toprule
\bfseries\thead{Package} & \bfseries\thead{Version} & \bfseries\thead{Description} & \bfseries\thead{Ease of Installation} & \bfseries\thead{Documentation} & \bfseries\thead{Installation} & \bfseries\thead{Dependencies}\\
\midrule
\bfseries GCC & 4.9.4 & GNU Compiler Collection & 0 & ++ & \makecell[l]{via configuration files,\\ environment, or\\ command-line options;\\ minimal configuration\\ required} & --\\
\midrule
\bfseries Open MPI & 1.10.7 & MPI implementation & 0 & ++ & \makecell[l]{via configuration files,\\ environment, or\\ command-line options;\\ minimal configuration\\ required} & --\\
\midrule
\bfseries Python & 2.7.13 & Python language & + & ++ & Conda Installation & --\\
\midrule
\bfseries mpi4py & 3.0.0 & MPI for Python & + & ++ & Conda Installation & \makecell[l]{Python 2.7 or above;\\ MPI 1.x/2.x/3.x\\ implementation such as\\ Open MPI, built with\\ shared/dynamic libraries;\\ Cython}\\
\midrule
\bfseries PHDF5 & 1.10.1 & Parallel HDF5 & $-$ & ++ & \makecell[l]{via configuration files,\\ environment, or\\ command-line options;\\ several optional\\ configuration settings available} & \makecell[l]{MPI 1.x/2.x/3.x\\ implementation such as\\ Open MPI;\\ GNU compilers,\\ MPIF90, MPICC, MPICXX}\\
\midrule
\bfseries h5py & 2.7.1 & Pythonic wrapper around HDF5 & + & ++ & Conda Installation & \makecell[l]{Python 2.7 or above,\\ PHDF5, Cython}\\
\midrule
\bfseries MDAnalysis & 0.17.0 & \makecell[l]{Python library to analyze \\trajectories from MD simulations} & + & ++ & Conda Installation & \makecell[l]{Python $\ge$2.7, Cython,\\ GNU compilers, NumPy}\\
\bottomrule
\end{tabular}
\end{adjustbox}
\end{table}
\subsection{Data Set}
\label{sec:data}
The test system contained the protein adenylate kinase with 214 amino acid residues and 3341 atoms in total~\cite{Seyler:2014il}; the topology information (atom types and bonds) was stored in a file in CHARMM PSF format \cite{Brooks:2009pt}.
The test trajectory was created by concatenating 600 copies of an MD trajectory with 4,187 time frames \cite{Seyler:2017aa} (saved every 240~ps for a total simulated time of 1.004~$\mu\text{s}$) in CHARMM DCD format \cite{Brooks:2009pt} and converting to the Gromacs \cite{Abraham:2015aa} XTC trajectory format, as described for the ``600x'' trajectory in~\citet{Khoshlessan:2017ab}.
The trajectory had a file size of about 30 GB and contained 2,512,200 frames (corresponding to 602.4~$\mu\text{s}$ simulated time).
The file size was relatively small because the water molecules that were part of the original MD simulations had been stripped, which reduced the original file size by a factor of about 10; such preprocessing is a common approach when one is only interested in the behavior of the protein.
Thus, the trajectory represents a small to medium system size in the number of atoms and coordinates that have to be loaded into memory for each time frame.
The XTC format uses lossy compression \cite{Lindahl01, Spangberg:2011zr}, which also contributed to the compact file size.
XTC trades higher CPU demands during decompression for lower I/O demands and therefore performed well in our previous study~\cite{Khoshlessan:2017ab}.
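As a brief illustration of this preprocessing (a sketch under assumed file names; the actual workflow is described in \citet{Khoshlessan:2017ab} and in the scripts referenced in Section~\ref{sec:sharing}), the concatenation and format conversion can be carried out with \package{MDAnalysis}:
\begin{verbatim}
# Sketch: build the "600x" test trajectory by reading 600 copies of the
# original DCD back to back and writing a single Gromacs XTC file.
# File names are placeholders.
import MDAnalysis as mda

u = mda.Universe("adk.psf", ["adk.dcd"] * 600)  # ChainReader concatenates the copies

with mda.Writer("adk600x.xtc", u.atoms.n_atoms) as w:
    for ts in u.trajectory:
        w.write(u.atoms)
\end{verbatim}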
In order to assess the performance of reading from an HDF5 file in parallel (see Section~\ref{sec:methods-hdf5}) we generated a trajectory-like HDF5 file with the data required to perform the RMSD calculation.
This HDF5 file was created from the XTC file by sub-selecting the atoms for which the RMSD was calculated as detailed in Section~\ref{sec:mda}; a Python script to perform the trajectory conversion can be found in the GitHub repository (see Section~\ref{sec:sharing}).
The coordinates were stored as a two-dimensional $T \times 3N$ array where the first dimension contained $T=2,512,200$ frames and the second dimension the $3N = 438$ Cartesian coordinates.
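A minimal sketch of such a conversion (with placeholder file names and a placeholder atom selection; the actual selection is the one described in Section~\ref{sec:mda}, and the full script is in the repository, Section~\ref{sec:sharing}) is:
\begin{verbatim}
# Sketch: store the coordinates of the selected atoms from the XTC
# trajectory as a (T, 3N) float32 array in an HDF5 file.
import MDAnalysis as mda
import h5py
import numpy as np

u = mda.Universe("adk.psf", "adk600x.xtc")      # placeholder file names
sel = u.select_atoms("name CA")                 # placeholder selection (see Methods)

T, N = u.trajectory.n_frames, sel.n_atoms
with h5py.File("adk600x.h5", "w") as f:
    pos = f.create_dataset("positions", shape=(T, 3 * N), dtype=np.float32)
    for i, ts in enumerate(u.trajectory):
        pos[i, :] = sel.positions.ravel()       # flatten (N, 3) -> 3N per frame
\end{verbatim}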
Although 2,512,200 frames represents a long simulation by current standards, such trajectories will become increasingly common due to the use of special-purpose hardware~\cite{Shaw:2009ly, Shaw:2014aa} and GPU acceleration~\cite{Salomon-Ferrer:2013cr, Glaser:2015ys, Abraham:2015aa}.