% -*- mode: latex; mode: visual-line; fill-column: 9999; coding: utf-8 -*-
\section{Computational Experiments}
\label{sec:experiments}
We had previously measured the performance of the MPI-parallelized RMSD analysis task on two different HPC resources (\emph{SDSC Comet} and \emph{TACC Stampede}) and had found that it only scaled well up to a single node due to high variance in the runtime of the MPI ranks, similar to the straggler phenomenon observed in big-data analytics \cite{Khoshlessan:2017ab}.
However, the ultimate cause for this high variance could not be ascertained.
We therefore performed more measurements with more detailed timing information (see section \ref{sec:methods}) on \emph{SDSC Comet} (described in this section) and two other supercomputers (summarized in section \ref{sec:clusters}) in order to better understand the origin of the stragglers and find solutions to overcome them.
\subsection{RMSD Benchmark}
\label{sec:RMSD}
We measured strong scaling for the RMSD analysis task (Algorithm \ref{alg:RMSD}) with the 2,512,200 frame test trajectory (section \ref{sec:data}) on 1 to 72 cores (one to three nodes) of \emph{SDSC Comet} (Figures~\ref{fig:MPIscaling} and \ref{fig:MPIspeedup}).
We observed poor strong scaling performance beyond a single node (24 cores), comparable to our previous results \cite{Khoshlessan:2017ab}.
A more detailed analysis showed that the RMSD computation, and to a lesser degree the read I/O, considered on their own, scaled well beyond 50 cores (yellow and blue lines in Figure~\ref{fig:ScalingComputeIO}).
But communication (sending results back to MPI rank 0 with \texttt{MPI\_Gather()}; red line in Figure~\ref{fig:ScalingComputeIO}) and the initial file opening (loading the system information into the \texttt{MDAnalysis.Universe} data structure from a shared topology file and opening the shared trajectory file; gray line in Figure~\ref{fig:ScalingComputeIO}) started to dominate beyond 50 cores.
Communication cost and initial time for opening the trajectory were distributed unevenly across MPI ranks, as shown in Figure~\ref{fig:MPIranks}.
The ranks that took much longer to complete than the typical execution time of the other ranks were the stragglers that hurt performance.
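For orientation, the following is a minimal sketch of the per-rank structure of the task, assuming \texttt{mpi4py} and MDAnalysis; the file names and the atom selection are placeholders, and the detailed timing instrumentation of Algorithm~\ref{alg:RMSD} is omitted:
\begin{verbatim}
from mpi4py import MPI
import numpy as np
import MDAnalysis as mda
from MDAnalysis.analysis.rms import rmsd

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# every rank opens the shared topology and trajectory file
u = mda.Universe("topology.psf", "trajectory.xtc")
ca = u.select_atoms("protein and name CA")
ref = ca.positions.copy()

# block decomposition: rank i analyses frames [start, stop)
n_frames = len(u.trajectory)
n_b = n_frames // size
start = rank * n_b
stop = n_frames if rank == size - 1 else (rank + 1) * n_b

results = np.zeros((stop - start, 2))
for i, ts in enumerate(u.trajectory[start:stop]):      # read I/O per frame
    results[i, :] = ts.time, rmsd(ca.positions, ref,
                                  superposition=True)  # compute per frame
all_results = comm.gather(results, root=0)             # communication to rank 0
\end{verbatim}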
\begin{figure}[!htb]
\centering
\begin{subfigure}{.4\textwidth}
\includegraphics[width=\linewidth]{figures/main-RMSD-t_total.pdf}
\captionsetup{format=hang}
\caption{Scaling total (five repeats)}
\label{fig:MPIscaling}
\end{subfigure}
\hfill
\begin{subfigure}{.4\textwidth}
\includegraphics[width=\linewidth]{figures/main-RMSD-speed_up.pdf}
\captionsetup{format=hang}
\caption{Speed-up (five repeats)}
\label{fig:MPIspeedup}
\end{subfigure}
\bigskip
\begin{subfigure}{.45\textwidth}
\includegraphics[width=\linewidth]{figures/main-RMSD-time_comp_IO_comparison.pdf}
\captionsetup{format=hang}
\caption{Scaling for different components (five repeats)}
\label{fig:ScalingComputeIO}
\end{subfigure}
\hfill
\begin{subfigure} {.5\textwidth}
\includegraphics[width=\linewidth]{figures/main-RMSD-BarPlot-rank-comparison_72_5.pdf}
\captionsetup{format=hang}
\caption{Time comparison on different parts of the calculations per MPI rank (example)}
\label{fig:MPIranks}
\end{subfigure}
\caption{Performance of the RMSD task parallelized with MPI on \emph{SDSC Comet}.
Results were communicated back to rank 0.
Five independent repeats were performed to collect statistics.
(a-c) The error bars show standard deviation with respect to the mean.
In serial, there is no communication and no data points are shown for $N=1$ in (c).
(d) Compute \tcomp, read I/O \tIO, communication \tcomm, ending the for loop $t_{\text{end\_loop}}$, opening the trajectory $t_{\text{opening\_trajectory}}$, and overheads $t_{\text{overhead1}}$, $t_{\text{overhead2}}$ per MPI rank; see Table \ref{tab:notation} for definitions.
These are data from one run of the five repeats.
MPI ranks 0, 12--27 and 29--72 are stragglers.
}
\label{fig:MPIwithIO}
\end{figure}
We qualitatively classified as a \emph{straggler} any MPI rank that took at least about twice as long as the group of ranks that finished fastest, roughly following the original description of a straggler as a task that took an ``unusually long time to complete'' \cite{Dean2008}.
The fast-finishing ranks were generally clearly distinguishable in the per-rank timings such as in Figures~\ref{fig:MPIranks} and \ref{fig:MPIranks-Bridges}.
Such a qualitative definition of stragglers was sufficient for our purpose of identifying scalability bottlenecks, as shown in the following discussion.
\subsubsection*{Identification of Scalability Bottlenecks}
In the example shown in Figure~\ref{fig:MPIranks}, 62 ranks out of 72 took about 60~s (the stragglers) whereas the remaining ranks only took about 20~s.
In other instances, far fewer ranks were stragglers, as shown, for example, in Figure~\ref{fig:MPIranks-Bridges}.
The detailed breakdown of the time spent on each rank (Figure~\ref{fig:MPIranks}) showed that the computation, \tcomp, was relatively constant across ranks.
The time spent on reading data from the shared trajectory file on the Lustre file system into memory, \tIO, showed variability across different ranks.
The stragglers, however, appeared to be defined by occasionally much larger \emph{communication} times, \tcomm (line 16 in Algorithm~\ref{alg:RMSD}), which were on the order of 30~s, and by larger times to initially open the trajectory (line 2 in Algorithm~\ref{alg:RMSD}).
\tcomm varied across different ranks and was barely measurable for a few of them.
Although the data in Figure~\ref{fig:MPIranks} represented a single run and the number of straggling ranks differed between runs, the averages over all ranks in five independent repeats (Figure~\ref{fig:ScalingComputeIO}) showed that an increased \tcomm was generally the reason for the large variation in run time across ranks.
This initial analysis indicated that communication was a major issue that prevented good scaling beyond a single node but the problems related to file I/O also played an important role in limiting scaling performance.
\subsubsection*{Influence of Hardware}
We ran the same benchmarks on multiple HPC systems equipped with a Lustre parallel file system [XSEDE's \emph{PSC Bridges} (Figure~\ref{fig:MPIwithIO-Bridges}) and \emph{LSU SuperMIC} (Figure~\ref{fig:MPIwithIO-SuperMIC})] and observed the occurrence of stragglers in a manner very similar to the results described for \emph{SDSC Comet}.
There was no clear pattern in which certain MPI ranks would always be stragglers, nor could we trace stragglers to specific cores or nodes.
Therefore, the phenomenon of stragglers in the RMSD case was reproducible on different clusters and thus appeared to be independent of the underlying hardware.
\subsection{Effect of Compute to I/O Ratio on Performance}
\label{sec:bound}
The results in section~\ref{sec:RMSD} indicated that opening the trajectory, communication, and read I/O were important factors that appeared to correlate with stragglers.
In order to better characterize the RMSD task, we computed the ratio between the time to complete the computation and the time spent on I/O per frame, \RcompIO (Eq.~\ref{eq:Compute-IO}).
The average values were $\overline{\tcomp^{\text{frame}}} = 0.09\ \text{ms}$, $\overline{t_{\text{IO}}^{\text{frame}}} = 0.3\ \text{ms}$, resulting in a compute-to-I/O ratio $\RcompIO \approx 0.3$.
With $\RcompIO \ll 1$, the RMSD analysis task was characterized as I/O bound.
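Explicitly, with the measured per-frame averages,
\[
  \RcompIO = \frac{\overline{\tcomp^{\text{frame}}}}{\overline{t_{\text{IO}}^{\text{frame}}}}
           = \frac{0.09\ \text{ms}}{0.3\ \text{ms}} = 0.3 .
\]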
In other studies, better scaling was observed for analysis tasks that were more compute-intensive than the RMSD calculation, such as a radial distribution function calculation \cite{Roe:2018aa, Fan:2019aa}, i.e., analysis tasks that could be characterized as compute-bound.
Such behavior is expected, as the contribution from the parallel part of the program that requires neither I/O nor communication is increased.
From a practical point of view, it is of interest to quantify the effect of increasing the computational load on strong scaling, and in our case we wanted to see whether changes in the compute part (namely the RMSD calculation on coordinates held in memory) would affect the execution time of other parts of the program.
In Appendix~\ref{sec:shiftload} we set out to analyze compute-bound tasks, i.e., ones with $\RcompIO \gg 1$.
To assess the effect of the $\RcompIO$ ratio on performance while leaving other parameters the same, we artificially increased the computational load by repeating the RMSD calculation and measured strong scaling (Figure \ref{fig:tcomp_tIO_effect}).
With increasing $\RcompIO$, the impact of stragglers appeared to lessen (although they did not disappear) and scaling performance improved, as expected (see Appendix \ref{sec:increasedworkload}).
Better scaling also went together with a higher ratio of compute to communication ($\Rcompcomm$, Eq.~\ref{eq:Compute-comm}) as shown in Appendix \ref{sec:tcomm} but ultimately I/O seemed to be the key determinant for performance.
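A minimal sketch of how the computational load per frame can be increased without changing I/O, continuing the notation of the sketch in section~\ref{sec:RMSD}; the repeat factor \texttt{n\_x} is a hypothetical free parameter that directly scales $\RcompIO$:
\begin{verbatim}
n_x = 10   # example repeat factor (n_x = 1 reproduces the original task)
# inside the per-frame loop: repeat the in-memory RMSD computation n_x times
for _ in range(n_x):
    value = rmsd(ca.positions, ref, superposition=True)
\end{verbatim}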
\begin{figure}[!htb]
\centering
\begin{subfigure}{.25\textwidth}
\includegraphics[width=\linewidth]{figures/Comparison_Speed_UP_with_and_without_IO.pdf}
\caption{Speed-up comparison}
\label{fig:MPIspeedup-no-IO}
\end{subfigure}
\begin{subfigure}{0.3\textwidth}
\includegraphics[width=\linewidth]{figures/time_comp_IO_comparison_no-IO.pdf}
\caption{Scaling for different components}
\label{fig:MPIScalingCompute-Comm-no-IO}
\end{subfigure}
\begin{subfigure}{.4\textwidth}
\includegraphics[width=\linewidth]{figures/BarPlot-rank-comparison-no-IO.pdf}
\captionsetup{format=hang}
\caption{Time comparison on different parts of the calculations per MPI rank when I/O is removed}
\label{fig:MPIranks-no-IO}
\end{subfigure}
\caption{Comparison of the performance of the RMSD task with I/O ($\RcompIO \approx 0.3$) and without I/O ($\RcompIO = +\infty$) on \emph{SDSC Comet}.
Five repeats were performed to collect statistics.
(a-b) The error bars show standard deviation with respect to the mean.
(b) Only compute \tcomp and communication \tcomm are included; there are no timings related to I/O (\tIO, $t_{\text{opening\_trajectory}}$) as in Figure~\protect{\ref{fig:ScalingComputeIO}}.
(c) Compute \tcomp, read I/O $\tIO=0$, communication \tcomm, ending the for loop \text{$t_{\text{end\_loop}}=0$},
opening the trajectory \text{$t_{\text{opening\_trajectory}}=0$}, and overheads \text{$t_{\text{overhead1}}$}, \text{$t_{\text{overhead2}}$} per MPI rank.
(See Table \ref{tab:notation} for definitions.)}
\label{fig:MPIwithoutIO}
\end{figure}
In order to study an extreme case of a compute-bound task that would demonstrate the effect of ``ideal'' read I/O, we eliminated all I/O from the RMSD task by randomly generating artificial trajectory data in memory; the data had the same size as if they had been obtained from the trajectory file.
The time for the data generation was excluded and no file access was necessary.
Without any I/O, performance improved markedly (Figure~\ref{fig:MPIwithoutIO}), with reasonable scaling up to 72 cores (3 nodes).
No stragglers were observed because overall communication time decreased dramatically by more than a factor of ten and showed less variability (Figure~\ref{fig:MPIScalingCompute-Comm-no-IO}) compared to the same runs with I/O (Figure~\ref{fig:ScalingComputeIO}), even though exactly the same amount of data were communicated.
Scaling performance only degraded somewhat beyond 40 processes because the cost of communication \tcomm became comparable to the compute time \tcomp and did not decrease further.
Although in practice I/O cannot be avoided, this experiment demonstrated that the way the trajectory file was accessed on the Lustre file system was at least one cause of the observed stragglers.
It also showed that the communication cost for the \emph{same amount of data transfer} could be dramatically higher in the presence of I/O than in its absence.
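A hedged sketch of this no-I/O variant, continuing the notation of the sketch in section~\ref{sec:RMSD}: the per-frame file reads are replaced by synthetic coordinate arrays of identical shape that are generated up front, with the generation time excluded from all timings:
\begin{verbatim}
rng = np.random.default_rng()
# synthetic per-rank block of coordinates with the same shape as the real data
fake = rng.random((stop - start, ca.n_atoms, 3), dtype=np.float32)

results = np.zeros((stop - start, 2))
for i in range(stop - start):
    results[i, :] = i, rmsd(fake[i], ref, superposition=True)  # no file access
all_results = comm.gather(results, root=0)  # same amount of data communicated
\end{verbatim}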
\subsection{Reducing I/O Cost}
\label{sec:I/O}
In order to improve performance, we needed strategies that avoid competition for file access among different ranks when the $\RcompIO$ ratio is small.
One obvious approach when using the Lustre parallel file system is to increase the number of stripes, i.e., the number of object storage targets (OSTs) across which the blocks of a file are distributed.
However, because varying the stripe count had not improved scaling performance in our previous work \cite{Khoshlessan:2017ab}, we used the system default of one stripe per file.
Instead we experimented with two different ways for reducing the I/O cost:
\begin{inparaenum}[1)]
\item splitting the trajectory file into as many segments as the number of processes (subfiling), thus using file-per-process access, and
\item using the HDF5 file format together with MPI-IO parallel reads instead of the XTC trajectory format.
\end{inparaenum}
We discuss these two approaches and their performance improvements in detail in the following sections.
\subsubsection{Splitting the Trajectories (subfiling)}
\label{sec:splitting-traj}
\emph{Subfiling} is a mechanism previously used to split a large multi-dimensional global array into a number of smaller subarrays, each of which is saved in a separate file. Subfiling reduces file system control overhead by decreasing the number of processes that concurrently access a shared file~\cite{scalable-IO, scalable-IO1}.
Because subfiling is known to improve performance relative to parallel shared-file I/O, we investigated splitting our trajectory file into as many trajectory segments as there were processes.
The trajectory file was split into $N$ segments, one for each process, with each segment having $N_{b}$ frames.
This way, each process would access its own trajectory segment file without competing for file accesses.
We ran a benchmark on up to 8 nodes (192 cores) and observed markedly better scaling behavior, with efficiencies above 0.6 (Figures~\ref{fig:MPIscaling-split} and~\ref{fig:MPIspeedup-split}) and the straggler delay reduced from 65~s to about 10~s for 72 processes.
However, scaling was still far from ideal due to the MPI communication costs.
Although the delay due to communication was much smaller than for parallel RMSD with shared-file I/O [compare Figure~\ref{fig:MPIranks-split} ($\tcomm^{\text{Straggler}} \gg \tcomp+\tIO$) to Figure~\ref{fig:MPIranks} ($\tcomm^{\text{Straggler}} \approx \tcomp+\tIO$)], it still delayed several processes and resulted in longer job completion times (Figure~\ref{fig:MPIranks-split}).
These delayed tasks impacted performance so that speed-up remained far from ideal (Figure~\ref{fig:MPIspeedup-split}).
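A minimal sketch of the subfiling preprocessing step, assuming MDAnalysis and the block boundaries of the sketch in section~\ref{sec:RMSD}; in the subsequent analysis run, each rank then opens only its own segment file:
\begin{verbatim}
# write the block of frames owned by this rank to its own XTC segment
segname = "segment_{:04d}.xtc".format(rank)
with mda.Writer(segname, u.atoms.n_atoms) as w:
    for ts in u.trajectory[start:stop]:
        w.write(u.atoms)
# the analysis then uses the per-rank segment instead of the shared file:
# u = mda.Universe("topology.psf", segname)
\end{verbatim}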
\begin{figure}[!htb]
\centering
\begin{subfigure}{.4\textwidth}
\includegraphics[width=\linewidth]{figures/tot_time_traj_splitting.pdf}
\caption{Scaling total}
\label{fig:MPIscaling-split}
\end{subfigure}
\hfill
\begin{subfigure}{.4\textwidth}
\includegraphics[width=\linewidth]{figures/Speed_UP_traj_splitting.pdf}
\captionsetup{format=hang}
\caption{Speed-up}
\label{fig:MPIspeedup-split}
\end{subfigure}
\bigskip
\begin{subfigure}{.45\textwidth}
\includegraphics[width=\linewidth]{figures/IO_compute_scaling_traj_splitting.pdf}
\captionsetup{format=hang}
\caption{Scaling for different components}
\label{fig:ScalingComputeIO-split}
\end{subfigure}
\hfill
\begin{subfigure}{.5\textwidth}
\includegraphics[width=\linewidth]{figures/split-BarPlot-rank-comparison_192_5.pdf}
\captionsetup{format=hang}
\caption{Time comparison on different parts of the calculations per MPI rank.}
\label{fig:MPIranks-split}
\end{subfigure}
\caption{Performance of the RMSD task on \emph{SDSC Comet} when the trajectories are split into one trajectory segment per process (\emph{subfiling}).
Five repeats were performed to collect statistics.
In serial, there is no communication and no data points are shown for $N=1$ in (c).
(a-c) The error bars show standard deviation with respect to the mean.
(d) Compute \tcomp, read I/O \tIO, communication \tcomm, opening the trajectory $t_{\text{opening\_trajectory}}$, ending the for loop $t_{\text{end\_loop}}$ (includes closing the trajectory), and overheads $t_{\text{overhead1}}$, $t_{\text{overhead2}}$ per MPI rank; see Table \protect\ref{tab:notation} for the definitions.
}
\label{fig:MPIwithIO-split}
\end{figure}
The subfiling approach appeared promising but it required preprocessing of trajectory files and additional storage space for the segments.
We benchmarked the time needed to split the trajectory in parallel using different numbers of MPI processes (Table~\ref{tab:timing-splitting}); in general the operation scaled well, with efficiencies $> 0.8$, although performance fluctuated, as seen for the run on six nodes, where the efficiency dropped to $0.34$.
These preprocessing times were not included in the estimates because we focused on better understanding the principal causes of stragglers and we wanted to make the results directly comparable to the results of the previous sections.
Nevertheless, from an end user perspective, preprocessing of trajectories can be integrated in workflows (especially as the data in Table~\ref{tab:timing-splitting} indicated good scaling) and the preprocessing time can be quickly amortized if the trajectories are analyzed repeatedly.
However, the requirement of needing as many segments as processes makes the approach somewhat inflexible as a new set of trajectory segments must be produced when a different level of parallelization is needed.
Finally, the performance of parallel file systems generally suffers when too many files are processed and so there exists a limit as to how far the subfiling approach can be pushed.
\begin{SCtable}[1.0][!htb]
\centering
\caption[Time necessary for writing the trajectory segments]
{The wall-clock time spent writing $N_{\text{seg}}$ trajectory segments with $N_{\text{p}}$ MPI processes on \emph{SDSC Comet}.
One set of runs was performed for the timings.
Scaling $S$ and efficiency $E$ are relative to the 1 node case (24 MPI processes).}
\label{tab:timing-splitting}
\begin{tabular}{rrrrrr}
\toprule
\thead{$N_{\text{seg}}$} & \thead{$N_{\text{p}}$} & \thead{nodes}
& \thead{time (s)} & \thead{$S$} & \thead{$E$}\\
\midrule
24 & 24 & 1 & 89.9 & 1.0 & 1.0\\
48 & 48 & 2 & 46.8 & 1.9 & 0.96\\
72 & 72 & 3 & 33.7 & 2.7 & 0.89\\
96 & 96 & 4 & 25.1 & 3.6 & 0.89\\
144 & 144 & 6 & 43.7 & 2.1 & 0.34\\
192 & 192 & 8 & 13.5 & 6.7 & 0.83\\
\bottomrule
\end{tabular}
\end{SCtable}
Often trajectories from MD simulations on HPC machines are produced and kept in smaller files or segments that can be concatenated to form a full continuous trajectory file.
These trajectory segments could be used for the subfiling approach.
However, it might not be feasible to have exactly one segment per MPI rank, with all segments of equal size, which constitutes the ideal case for subfiling.
MDAnalysis can create virtual trajectories from separate trajectory segment files that appear to the user as a single trajectory.
In Appendix~\ref{sec:chainreader} we investigated if this so-called \emph{ChainReader} functionality could benefit from the subfiling approach.
We found some improvements in performance but discovered limitations in the design of the ChainReader (namely that all segment files are initially opened) that will have to be addressed before equivalent performance can be reached.
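For reference, the ChainReader is selected implicitly whenever a list of trajectory files is passed to a \texttt{Universe}, e.g.:
\begin{verbatim}
import glob
import MDAnalysis as mda

# all segment files appear to the user as one continuous virtual trajectory
u = mda.Universe("topology.psf", sorted(glob.glob("segment_*.xtc")))
print(len(u.trajectory))   # total number of frames across all segments
\end{verbatim}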
\subsubsection{MPI-based Parallel HDF5}
\label{sec:HDF5}
In the HPC community, parallel I/O with MPI-IO is widely used in order to address I/O limitations.
We investigated MPI-based Parallel HDF5 to improve I/O scaling.
We converted our XTC trajectory file into a simple custom HDF5 format so that we could test the performance of parallel I/O with the HDF5 file format.
The time it took to convert our XTC file with $2,512,200$ frames into HDF5 format was about 5,400~s on a local workstation with network file system (NFS).
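The exact layout of our custom format is not essential for the argument; a minimal sketch of such a conversion with \texttt{h5py}, storing all coordinates in a single three-dimensional dataset (file and dataset names are placeholders), could look as follows:
\begin{verbatim}
import h5py
import MDAnalysis as mda

u = mda.Universe("topology.psf", "trajectory.xtc")
with h5py.File("trajectory.h5", "w") as h5:
    pos = h5.create_dataset("positions",
                            shape=(len(u.trajectory), u.atoms.n_atoms, 3),
                            dtype="float32")
    for i, ts in enumerate(u.trajectory):
        pos[i, :, :] = ts.positions    # one frame per row of the dataset
\end{verbatim}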
\begin{figure}[!htb]
\centering
\begin{subfigure}{.4\textwidth}
\includegraphics[width=\linewidth]{figures/hdf5-t_total-combined.pdf}
\caption{Scaling total}
\label{fig:MPIscaling-hdf5}
\end{subfigure}
\hfill
\begin{subfigure}{.4\textwidth}
\includegraphics[width=\linewidth]{figures/hdf5-speed_up-combined.pdf}
\caption{Speed-up}
\label{fig:MPIspeedup-hdf5}
\end{subfigure}
\bigskip
\begin{subfigure}{.45\textwidth}
\centering
\includegraphics[width=\linewidth]{figures/hdf5-time_comp_IO_comparison-combined.pdf}
\captionsetup{format=hang}
\caption{Scaling for different components}
\label{fig:ScalingComputeIO-hdf5}
\end{subfigure}
\hfill
\begin{subfigure} {.5\textwidth}
\includegraphics[width=\linewidth]{figures/hdf5-BarPlot-rank-comparison_192_4.pdf}
\captionsetup{format=hang}
\caption{Time comparison on different parts of the calculations per MPI rank}
\label{fig:MPIranks-hdf5}
\end{subfigure}
\caption{Performance of the RMSD task with MPI-based parallel HDF5 (MPI-IO) on \emph{SDSC Comet}.
Data are read from a shared HDF5 file (independent I/O) instead of the XTC file, and results are communicated back to rank 0.
Five repeats were performed to collect statistics; one repeat for 288 processes had abnormally high \tIO and was treated as an outlier and excluded from the averages but is shown as ``x'' in the graphs.
(a-c) The error bars show standard deviation with respect to the mean.
In serial, there is no communication and no data points are shown for $N=1$ in (c).
(d) Compute \tcomp, read I/O \tIO, communication \tcomm, ending the for loop $t_{\text{end\_loop}}$,
opening the trajectory $t_{\text{opening\_trajectory}}$, and overheads $t_{\text{overhead1}}$, $t_{\text{overhead2}}$ per MPI rank; see Table \ref{tab:notation} for definitions.
These are typical data from one run of the five repeats.
}
\label{fig:MPIwithIO-hdf5}
% I am reporting the slowest rank per timing, and that is averaged over all repeats.
\end{figure}
We ran our benchmark on up to 16 nodes (384 cores) on \emph{SDSC Comet} and observed near-ideal scaling behavior (Figures~\ref{fig:MPIscaling-hdf5} and~\ref{fig:MPIspeedup-hdf5}), with parallel efficiencies above 0.8 on up to 8 nodes (Figure~\ref{fig:comparison_efficiency}) and no straggler tasks (Figure~\ref{fig:MPIranks-hdf5}).
The trajectory reading I/O (\tIO) was the dominant contribution, followed by compute (\tcomp), but because both contributions scaled well, overall scaling performance remained good, even for 384 cores.
Amongst the five repeats for 12 nodes (288 cores) we observed one run with much slower I/O than typical (Figure~\ref{fig:ScalingComputeIO-hdf5}) but we did not further investigate this spurious case and classified it as an outlier that was excluded from the statistics.
Importantly, the trajectory opening cost remained negligible (in the millisecond range) and the cost for MPI communication also remained small (below 0.1 s, even for 16 nodes).
Overall, parallel MPI with HDF5 appeared to be a robust approach to obtain good scaling, even for I/O-bound tasks.
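A hedged sketch of the corresponding read path, assuming \texttt{h5py} built against parallel HDF5 (MPI-IO) and the dataset layout of the conversion sketch above; each rank performs independent reads of its own block of frames:
\begin{verbatim}
import h5py
from mpi4py import MPI

comm = MPI.COMM_WORLD
with h5py.File("trajectory.h5", "r", driver="mpio", comm=comm) as h5:
    pos = h5["positions"]
    for i in range(start, stop):   # block of frames owned by this rank
        xyz = pos[i, :, :]         # independent (non-collective) read
        # ... RMSD computation on xyz as before ...
\end{verbatim}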
\subsection{Potential Causes of Stragglers}
\label{sec:likelycauses}
The data indicated that an increase in the duration of both MPI communication and trajectory file access led to large variability in the run time of individual MPI processes and ultimately poor scaling performance beyond a single node.
A discussion of likely causes for stragglers begins with the observation that opening and reading a single trajectory file from multiple MPI processes appeared to be at the center of the problem.
In MDAnalysis, individual trajectory frames are loaded into memory, which ensures that even systems with tens of millions of atoms can be efficiently analyzed on resources with moderate RAM sizes.
The test trajectory (file size 30 GB) had $2,512,200$ frames in total so each frame was about 0.011 MB in size.
With $\tIO \approx 0.3~\text{ms}$ per frame, the data were ingested at a rate of about $40$~MB/s for a single process.
For 24 MPI ranks (one \emph{SDSC Comet} node), the aggregated reading rate would have been about 1 GB/s, well within the available bandwidth of 56 Gb/s of the InfiniBand network interface that served the Lustre file system, but nevertheless sufficient to produce substantial constant network traffic.
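In round numbers,
\[
  \frac{0.011\ \text{MB}}{0.3\ \text{ms}} \approx 40\ \text{MB/s per rank},
  \qquad
  24 \times 40\ \text{MB/s} \approx 1\ \text{GB/s}
  \ll 56\ \text{Gb/s} \approx 7\ \text{GB/s}.
\]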
Furthermore, in our study the default Lustre stripe size value was 1~MB, i.e., the amount of contiguous data stored on a single Lustre OST.
Each I/O request read a single Lustre stripe but because the I/O size (0.011~MB) was smaller than the stripe size, many of these I/O requests were likely just accessing the same stripe on the same OST but nevertheless had to acquire a new reading lock for each request.
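As a rough estimate, a single 1~MB stripe holds
\[
  \frac{1\ \text{MB}}{0.011\ \text{MB/frame}} \approx 90\ \text{frames},
\]
i.e., on the order of a hundred consecutive read requests targeted the same stripe, each acquiring its own lock.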
This behavior is rooted in the requirement to provide POSIX consistency, i.e., to adhere to the POSIX I/O API and its semantics, which can have adverse effects on scalability and performance.
Parallel file systems like Lustre implement sophisticated distributed locking mechanisms to ensure consistency.
For example, locking ensures that a node cannot read from a file or part of a file while it is potentially being modified by another node.
When the application I/O is not designed to avoid scenarios in which multiple nodes contend for locks on overlapping extents, Lustre can suffer from scalability limitations~\cite{optimize_lustre}.
Continuously keeping metadata up to date in order to provide fully consistent reads and writes (POSIX metadata management) requires writing a new value for the file's last-accessed time (POSIX atime) every time a file is read, which imposes a significant burden on the parallel file system~\cite{POSIX2017}.
Mache \textit{et al.} observed that contention for the interconnect between OSTs and compute nodes due to MPI communication may lead to variable performance in I/O measurements~\cite{Mache:2005aa}.
Conversely, our data suggest that single-shared-file I/O on Lustre can negatively affect MPI communication as well, even at moderate numbers (tens to hundreds) of concurrent requests, similar to the results from recent network simulations that predicted interference between MPI and I/O traffic~\cite{Brown:2018ab}.
Brown \textit{et al.}'s work \cite{Brown:2018ab} indicated that MPI traffic (inter-process communication) could be affected by increasing I/O.
In particular, in their simulations a few MPI processes were always delayed by one to two orders of magnitude relative to the median time, behavior that we would classify as stragglers.
In summary, our observations, seen in the context of the work by \citet{Brown:2018ab}, suggest that the stragglers with their large variance in the communication step could be due to interference of MPI communication with I/O requests on the same network.
Further detailed work will be needed to test this hypothesis.
Our results clearly showed that reading a single shared file is an inefficient way to use the Lustre parallel file system; instead, parallel I/O via MPI-IO and HDF5 emerged as the most promising approach to avoid stragglers and obtain good strong scaling behavior on hundreds of cores, even for I/O-bound analysis tasks.