\documentclass[5p,times,sort&compress]{elsarticle}
%\usepackage[nomarkers,figuresonly]{endfloat}
\usepackage{subcaption}
\DeclareGraphicsExtensions{.pdf,.png,.jpg}
\begin{document}
\title{GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers}
\author[scilifelab]{Mark James Abraham\corref{cor}}
\author{Teemu Murtola}
\author[ornl,utk]{Roland Schulz}
\author[scilifelab]{Szil\'ard P\'all}
\author[ornl,utk]{Jeremy C Smith}
\author[scilifelab]{Berk Hess}
\author[scilifelab,cbr]{Erik Lindahl}
\cortext[cor]{Corresponding author}
\address[scilifelab]{Theoretical biophysics, Science for Life Laboratory, KTH Royal Institute of Technology, 17121 Solna, Sweden}
\address[ornl]{Oak Ridge National Laboratory, 1 Bethel Valley Road, Oak Ridge, Tennessee 37831}
\address[utk]{Dept.\ Biochemistry and Cellular and Molecular Biology, University of Tennessee, M407 Walters Life Sciences, 1414 Cumberland Avenue, Knoxville TN 37996}
\address[cbr]{Center for Biomembrane Research, Dept. Biochemistry \& Biophysics, Stockholm University, SE-10691 Stockholm, Sweden}
\begin{abstract}
% Max ~100 words
GROMACS is one of the most widely used open-source and free software codes in chemistry. As of version 5, it achieves better performance than ever before through parallelization on every level: SIMD registers inside cores, multithreading, heterogeneous CPU-GPU acceleration, and state-of-the-art 3D domain decomposition. Several new algorithms provide highly efficient simulations on any available hardware, highly compressed data storage, and a rich set of analysis tools. The code provides advanced techniques for free-energy calculations, and supports ensemble-level parallelization both through built-in replica exchange and the separate Copernicus framework.
\end{abstract}
\begin{keyword}
molecular dynamics \sep MD \sep GPU \sep domain decomposition \sep SIMD \sep PME \sep free energy
\end{keyword}
\maketitle
\section{Motivation and Significance}
Molecular dynamics (MD) has conquered chemistry and
several other fields by providing spatial and temporal resolution
not available in experiments. Simulations have become
more accurate with better force fields, they easily sample molecular motions
on the $\mu$s scale, and ensemble techniques make it possible to study
millisecond scale processes such as protein folding. This is achieved by
evaluating the interactions of millions of particles for billions of
time steps, which can require extraordinary amounts
of computational hardware and time, since the scientific quality of the result is
often proportional to the amount of sampling.
The huge application potential has led to implementations of MD in
many software packages, including
GROMACS,\cite{ProPalSch2013} AMBER,\cite{CasCheDar2005}
NAMD,\cite{Phillips2005}
CHARMM,\cite{BroBruOla1983} LAMMPS,\cite{LAMMPS} and Desmond.\cite{Desmond}
The commoditization of advanced computational techniques by these
packages is an important reason for the wide adoption of MD today.
Many implementations (including ours) provide high performance
when using large numbers of processors on supercomputers, but a key focus for
the development of GROMACS is the fundamental assumption from economic science
that {\em resources are scarce}: No matter how many cores are available, minimizing
resource usage makes it possible to run more simulations, e.g. through ensemble methods.
GROMACS aims to address this by providing the highest possible absolute performance, but also
by combining it with the highest possible efficiency on any hardware to make
best use of scarce resources. The package runs fast on every single architecture present in the Top500
supercomputer list, as well as on embedded systems and everyday laptop computers.
In contrast to many other computational challenges, applications in MD typically have an
intrinsically fixed problem size.
When studying a protein system with 30,000 atoms it is not relevant that a virus
comprising 10 million atoms would scale better. Therefore, weak scaling
performance is typically not of primary concern. However, it is critically important
to reduce the amount of computer time needed per unit of simulated time, either through optimization or by improving strong scaling.
This improves ``time to solution,'' and also the efficiency, measured as the science
performed per amount of hardware or power consumed.
While some applications need long individual trajectories, there are also
many scientific questions that can be answered by using several trajectories,
and for these the overall efficiency will be higher by executing independent simulations in parallel.
However, strong scaling of MD is a very difficult
software engineering challenge that requires synchronization
of computation, coordination, and communication phases
down to 100 $\mu$s for hundreds of thousands of
cores.
Development of GROMACS has long been characterized
by dedication to high absolute simulation throughput on commodity
hardware; we want to improve the strong scaling so that both the
{\em maximum achievable} and {\em real world} throughput increases. Hardware
advances have been breathtaking, but re-engineering the software to
use the new capabilities has been very challenging, and even forces
us to reconsider some of the most fundamental MD concepts including
the neighbor lists used to track spatial interactions.
This paper will briefly recount some historical properties of GROMACS,
and then report on recent improvements in GROMACS 4.6 and 5.
\section{Software Description}
GROMACS has grown into a very large software project
with almost two million lines of code. For a detailed description
of the historical development and many algorithms included in the engine,
we refer the interested reader to the previous papers published
\cite{PaAb2015,ProPalSch2013,Bekker93a,Bervanvan1995,LinHesvan2001,vanLinHes2005,HesKutvan2008}.
For developers, one of the most important changes is that GROMACS 5
is the first release that has moved to C++. While many parts of the code remain in C
and it will take a few years to complete the transition, this has led to improvements in code modularity and
handling of memory and errors, and has enabled much better Doxygen developer documentation and unit testing.
\subsection{Software Architecture}
The efficient parallelization in GROMACS is based on an
``eighth shell'' spatial domain decomposition\cite{BowDroSha2005,BowDroSha2007} to partition the
simulation system over multiple hardware units in a way that
preserves locality of reference within each domain.
Using this data parallelization, a single program multiple
data (SPMD) implementation maps each domain to
an MPI rank, each of which can in practice have access to
various kinds of hardware. Internally, all systems are described
with triclinic unit cells, which means complex geometries such
as the rhombic dodecahedron, truncated octahedron, or hexagonal boxes
are supported in all parts of the code. This can improve performance by up to 40\%
compared to maintaining the same water thickness around a solute in a rectangular box
(Fig.~\ref{lysozyme-triclinic-dd}). Dynamic load balancing between domains is
performed in all three dimensions in triclinic geometry; this is critical for high
performance. Fig.~\ref{lysozyme-triclinic-dd} shows how the larger computational
load of the protein, due to its torsions and angles, compared to water leads to significantly
smaller domains in the upper-left part of the figure.
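The gain from the compact cell follows directly from the cell volumes: a rhombic dodecahedron with minimum image distance $d$ has volume
\[
V_{\mathrm{rd}} = \frac{\sqrt{2}}{2}\,d^{3} \approx 0.71\,d^{3},
\]
so the same solute with the same solvent shell requires roughly 29\% fewer atoms than a cubic box of edge $d$, consistent with the up-to-40\% throughput gain ($1/0.71 \approx 1.4$).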
\begin{figure}[t]
\centerline{\includegraphics[width=3.25in]{figures/lysozyme-triclinic-dd}}
\caption{{\em Left:} The protein lysozyme (24,119 atoms) in a compact unit cell representation
corresponding to a rhombic dodecahedron with close-to-spherical shape.
{\em Right:} Internally, this is represented as a triclinic cell and a load-balanced staggered $6 \times 4 \times 3$
domain decomposition grid. The PME lattice sum is calculated on a uniform grid of
$6 \times 4$ MPI ranks (not shown).}
\label{lysozyme-triclinic-dd}
\end{figure}
Long-range electrostatics is handled by the particle-mesh Ewald (PME) method
\cite{DarYorPed1993} by using dedicated MPI ranks for the lattice summation
and a two-dimensional pencil decomposition\cite{ProPalSch2013} for the required 3D-FFT.
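The division of labor between particle--particle and PME ranks follows the standard Ewald decomposition of the Coulomb interaction,
\[
\frac{1}{r} = \frac{\mathrm{erfc}(\beta r)}{r} + \frac{\mathrm{erf}(\beta r)}{r},
\]
where the first, short-ranged term is evaluated within the cut-off by the particle--particle ranks, and the smooth second term is evaluated on the mesh by the dedicated PME ranks via the 3D-FFT.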
Historically, GROMACS has made use of MPI for domain-level parallel
decomposition across nodes, and later CPU cores too, and supplied hand-tuned assembly
kernels to access SIMD (single instruction, multiple data) units where available. However, the run-time
overheads of the former and the development-time cost of the latter
were not sustainable, and there was also the need to incorporate
accelerators (such as GPUs) into the parallelization strategy.
GROMACS 4.6 introduced a native heterogeneous parallelization setup
using both CPUs and GPUs. There are two important
reasons for still including the CPU: First, the advanced domain decomposition and load
balancing would be very difficult to implement efficiently on GPUs (which would hurt scaling).
Second, we see it as a huge advantage that {\em all} algorithms are available in all simulations,
even the esoteric or new ones not yet ported to GPUs, and the heterogeneous acceleration
makes it possible to completely hide the hardware from the user. There is only a single
GROMACS binary, and it will use all available hardware as efficiently as possible.
To make this possible, a new algorithm for evaluating short-ranged non-bonded
interactions was implemented, based on Verlet lists with automatic
buffering.\cite{PalHes2013} This recasts the traditional Verlet
algorithm to suit modern computer hardware, which permits highly
efficient offload of short-ranged work to SIMT-based (single instruction, multiple threads) GPUs, as well as
efficient SIMD-based CPU implementations. This works well, since
the essential qualities of data locality and reuse are similar on both
kinds of hardware. This is an important architectural advance, since the same code base
and algorithms can be used for all hardware.
The key innovation was to transform the standard formulation of the Verlet algorithm
that uses lists of particle-particle pairs into lists of
interacting small clusters of nearby particles, and to
choose the sizes of those clusters at compile time to match the
characteristics of the SIMT or SIMD target hardware. This means
there is no requirement for the compiler to recognize the opportunity
for vectorization; it is intrinsic to the algorithm and its
implementation. Additionally, the cluster sizes are easily
adjustable parameters, making it possible to target new hardware with relatively
low effort. Fig.~\ref{neighbor-lists} shows how the Verlet scheme
re-casts the idea of a particle-based pair list into a list of
interacting clusters. Fig.~\ref{kernel-setup} illustrates the flow of
data in kernels executing on processors of different SIMD widths or GPUs.
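For illustration, the resulting pair list can be pictured as a simple data structure. The following C++ sketch is hypothetical (the names do not correspond to actual GROMACS types) and assumes 4-particle clusters:
\begin{verbatim}
// Hypothetical sketch of an M x N cluster pair list
// (not the actual GROMACS data structures).
#include <vector>

const int kClusterSize = 4;  // N, matched to SIMD/SIMT width

struct ClusterPair {
    int iCluster;                // index of the i-cluster
    std::vector<int> jClusters;  // j-clusters within the
                                 // buffered radius
};

// The search emits one entry per i-cluster; the kernel then
// evaluates all 4 x 4 particle pairs of each (i,j) cluster
// pair and masks out pairs beyond the actual cut-off.
typedef std::vector<ClusterPair> ClusterPairList;
\end{verbatim}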
\begin{figure}[t]
\centerline{\includegraphics[width=3.25in]{figures/1x1-beside-4x4}}
\caption{{\em Left:} A classical Verlet implementation treats all
j-particles within an interaction radius (red) of the central red
i-particle, and adds a buffer, also called ``skin'' (dashed red). Particles outside the
buffer (unfilled) are omitted. {\em Right:} The $M \times N$ scheme
builds lists of clusters of $N$ particles (blue), where at least one
particle in each cluster is within the buffered interaction radius
of any particle in the central red cluster. This envelope has an
irregular shape (dashed red), and has an implicit additional buffer
(unfilled blue circles) from those particles in clusters where only
some particles are within the nominal buffer range. Actual
interactions are based on particle distances (red
circle, only one shown).}
\label{neighbor-lists}
\end{figure}
Unlike the old ``group'' scheme, there is no need for special kernels optimized
for common molecules such as water. The searching can schedule kernels that will evaluate van
der Waals interactions only on the first half of atoms in a given
cluster; this runs faster on domains where only some atoms have such
interactions, which includes most water models in current
use. Naturally, if whole clusters do not have van der Waals or Coulomb interactions, their interactions
are evaluated by kernels that skip the corresponding computation
entirely. Conditional logic is unavoidable in short-ranged MD kernels,
because the model physics only allows
interactions within a certain distance to contribute. Implementing such
code is most efficient when using selection or branch instructions
that produce null results when the interaction should not
contribute. This is also useful for other kinds of interaction
exclusions used in MD.
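The selection-based handling of the cut-off can be illustrated with a scalar C++ sketch (hypothetical and for illustration only; the actual kernels use SIMD blend/select instructions to the same effect, and excluded pairs are masked analogously):
\begin{verbatim}
// Scalar illustration of a branch-free cut-off: pairs beyond
// the cut-off contribute a null result instead of branching.
// Excluded and overlapping pairs are assumed to have been
// masked out already.
static inline double ljPairEnergy(double r2, double c6,
                                  double c12, double rcut2)
{
    const double inRange = (r2 < rcut2) ? 1.0 : 0.0;
    const double rinv2   = 1.0 / r2;
    const double rinv6   = rinv2 * rinv2 * rinv2;
    return inRange * (c12 * rinv6 * rinv6 - c6 * rinv6);
}
\end{verbatim}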
\begin{figure}
\centerline{\includegraphics[width=3.25in]{figures/kernel-setup}}
\caption{Illustration of the classical $1 \times 1$ neighborlist and the $4 \times 4$
non-bonded cluster algorithm setups
on processors with different SIMD widths. All numbers are particle
indices in a pair list, black dots are interaction calculations, and gray
arrows indicate loads. The contents of SIMD registers for i- and j-particles (cf.\ Fig.\ \ref{neighbor-lists})
are shown in red and blue, respectively. Dashed black lines show the computation units, with
dotted black arrows indicating their order of execution.
The $4 \times 4$ setup calculates 4 times as
many interactions per memory operation. Unlike the $1 \times 1$ setup, the $4
\times 4$ setup does not require data shuffling in registers.}
\label{kernel-setup}
\end{figure}
The maturation of SIMD intrinsics in compilers made it possible to
implement this in a new, higher-level fashion that retains the performance of the
previous raw hand-tuned assembly. To achieve
this, we have implemented a SIMD abstraction module that permits us to
develop CPU non-bonded kernels in a way that is nearly agnostic about
the SIMD width and feature set of the hardware upon which they will
run. In particular, we have designed an extensive new internal SIMD math library
in both single and double precision that completely avoids both table lookups and integer
instructions (which are not available for all SIMD instruction sets).
This means that porting to new CPU architectures is a
straightforward job of implementing the interface of the SIMD module
using the intrinsics suitable for the new CPU, after which the existing non-bonded
kernels work correctly without modification. Further, several other modules in GROMACS
now use the same SIMD layer and derive the same benefits for performance
portability. Crucially, this has reduced the total size of the non-bonded kernels
to only a few hundred lines of C, while simultaneously
supporting many more SIMD instruction sets.
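As a minimal illustration of the approach (not the actual GROMACS SIMD module; the names and the tiny subset of operations shown are hypothetical), a thin wrapper type can hide both the register width and the intrinsics, so that kernels are written once against the wrapper:
\begin{verbatim}
// Minimal sketch of a width-agnostic SIMD wrapper: one
// specialization per instruction set, plus a scalar
// fallback. Kernels only ever see SimdReal.
#if defined(__AVX__)
#include <immintrin.h>
struct SimdReal { __m256 v; };            // 8 floats
static inline SimdReal simdLoad(const float* p)
{ SimdReal r; r.v = _mm256_loadu_ps(p); return r; }
static inline SimdReal operator*(SimdReal a, SimdReal b)
{ SimdReal r; r.v = _mm256_mul_ps(a.v, b.v); return r; }
static inline SimdReal operator+(SimdReal a, SimdReal b)
{ SimdReal r; r.v = _mm256_add_ps(a.v, b.v); return r; }
#else
struct SimdReal { float v; };             // scalar fallback
static inline SimdReal simdLoad(const float* p)
{ SimdReal r; r.v = *p; return r; }
static inline SimdReal operator*(SimdReal a, SimdReal b)
{ SimdReal r; r.v = a.v * b.v; return r; }
static inline SimdReal operator+(SimdReal a, SimdReal b)
{ SimdReal r; r.v = a.v + b.v; return r; }
#endif
\end{verbatim}
A kernel written against such a type compiles to full-width vector instructions where they exist and to scalar code elsewhere; porting to a new architecture then amounts to adding one more specialization.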
In particular for SIMD and GPU acceleration, GROMACS makes extensive use of
strength-reduction algorithms to enable single precision, including a single-sum
implementation of the virial calculation.\cite{Bekker93b} Some molecular simulation
packages always compute in double precision; this is available in
GROMACS for the few kinds of simulations that require it, but, by
default, a {\em mixed precision} mode is used, in which a few critical
reductions are performed in double precision. Other high-performance
implementations\cite{CasCheDar2005} use mixed precision to a larger extent.
The offload model for acceleration creates the need for large groups
of atoms in the same spatial region to be treated in the same neighbor
search. This conflicts with the former GROMACS model of mapping each
CPU core to an MPI rank, and thus a separate domain of atoms close in
space. A CPU typically has many more cores than there are accelerators attached,
so the inefficiency of scheduling separate work for each domain on
the accelerator was high, and the existing limits on the minimum
sizes of domains were also problematic. To alleviate this,
OpenMP-based multi-threading support was added to GROMACS to permit multiple CPU cores to
work on a single domain of atoms. This allows domains to be larger,
which greatly improves overall efficiency. The resulting
OpenMP parallelism is also useful for running on CPU-only hardware,
thus extending the strong-scaling limit with hybrid
MPI/OpenMP.
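A rough C++/OpenMP sketch of the work sharing within one domain (hypothetical names; the loop body stands in for the real SIMD kernels) conveys the structure. A common pattern, shown here, is to let each thread accumulate into its own force buffer and reduce the buffers afterwards, which avoids write conflicts:
\begin{verbatim}
// Sketch of OpenMP work sharing within one MPI rank's
// domain. Per-thread force buffers avoid races; a
// reduction over the buffers follows.
#include <omp.h>
#include <vector>

void computeDomainForces(int numPairs, int numAtoms,
                         std::vector<double>& force)
{
    const int nth = omp_get_max_threads();
    std::vector< std::vector<double> > buf(
        nth, std::vector<double>(numAtoms, 0.0));

    #pragma omp parallel for schedule(static)
    for (int p = 0; p < numPairs; ++p) {
        const int t = omp_get_thread_num();
        // placeholder for evaluating cluster pair p:
        buf[t][p % numAtoms] += 1.0;
    }
    for (int t = 0; t < nth; ++t)     // reduce buffers
        for (int a = 0; a < numAtoms; ++a)
            force[a] += buf[t][a];
}
\end{verbatim}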
The PME algorithm commonly used for molecular simulations is able to
shift workload between the short- and long-ranged components with
moderate efficiency, while preserving the quality of the model
physics. This permits GROMACS 5 to automatically balance the
workload for optimal performance. This is particularly useful for the
offload model implemented in GROMACS 5
because best throughput is
typically obtained when few resources lie idle. Further, when using
multiple nodes for a single simulation, the long-ranged component of
the PME calculation needs to do global communication for the 3D
FFT. This partly drives our choice of which work to offload; doing PME
work on a current-generation accelerator in a simulation spanning
multiple nodes would accrue latency from data transfers both
across the network and between host and device, which would eliminate
any performance gain.
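The balancing exploits the fact that, to a good approximation, Ewald accuracy is preserved when the splitting parameter $\beta$, the direct-space cut-off $r_c$ and the mesh spacing $h$ are varied together such that
\[
\beta r_c \approx \mathrm{const}, \qquad h \propto r_c ,
\]
so scaling $r_c$ up while coarsening the PME grid by the same factor moves work from the communication-heavy mesh part to the highly optimized short-ranged kernels without degrading the model physics; the scaling factor is tuned automatically at run time.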
One major weakness of the current accelerator-offload implementation
in GROMACS is that accelerators are idle once the forces are computed
and transferred back to the CPU. Typical schemes for integrating the
forces to update the positions often need to enforce holonomic
constraints on degrees of freedom such as bond lengths, and such implementations
normally feature either iteration or inter-rank communication, which
does not suit an offload model of accelerator usage. Overcoming such
limitations is a key target for future improvements.
The end result in GROMACS 5 is an elaborate multi-level parallelism
(Fig.~\ref{multi-level-parallelism}) that maps individual
particle-particle interactions to SIMD or SIMT lanes and offloads
short-ranged work during force calculations to GPU accelerators.
Long-range and bonded interactions are evaluated on the CPU, which
uses OpenMP to split work over multiple cores for a single spatial
domain, and MPI to communicate between these domains.
This design is able
to make effective use of all of the available resources for typical
PME simulations on typical hardware. However, the design works less well if the hardware is too
unbalanced; GROMACS 5 performance will typically be optimal with
comparable expenditure on CPU and accelerator, and as many CPU sockets
as accelerators.
\section{Software Functionalities}
GROMACS is free software distributed under LGPLv2.1, which even
allows linking into commercial applications. It requires only standards-conforming
C99 and C++98 compilers, and CMake version 2.8.8. Various external libraries are either
bundled for convenience, or can be detected (e.g.\ MKL) or
even downloaded automatically (e.g. FFTW) by CMake. Portability is assured via extensive automated
continuous integration testing using Jenkins, deployed on Windows,
MacOS and Linux, using multiple versions of
compilers including those from Intel, GNU, Microsoft, LLVM and
IBM. GROMACS supports NVIDIA GPU
architectures with compute capabilities 2.0 and later, and the new SIMD
module provides native support for a total of 13 different architectures including all x86
flavors (from SSE2 through Xeon Phi, AVX2 and the still unreleased AVX-512F/ER),
PowerPC A2 (BlueGene/Q), Sparc VIIIfx (K computer), ARM Neon, IBM VMX (Power7) and VSX (Power8).
The latest version can even run inside a browser supporting Google Native Client.
Every single commit during GROMACS development is subject to mandatory
code review and automatic regression tests, unit tests and static code
analysis before it is added to the public git repository. While the released code is
tested on an even larger set of architectures, this makes even the
rapidly moving development branch uniquely stable.
\subsection{Simulation capabilities}
Simulations with leap-frog Verlet, velocity Verlet,
Brownian and stochastic dynamics are supported, as well as
calculations that do energy minimization, normal-mode analysis and
simulated annealing. Several techniques are available for regulating
temperature and/or pressure. Both SHAKE\cite{RycCicBer1977} and
P-LINCS\cite{HesBekBer1997,Hes2008} are available for enforcing
holonomic constraints, and the latter can be combined with virtual interaction
sites\cite{Berendsen84b} to eliminate enough fast degrees of
freedom to allow 5 fs time steps. All widely used molecular mechanics force
fields are supported, including a total of 15 flavours of AMBER,
CHARMM, GROMOS and OPLS. Several community-supported force
fields are also available.
Non-standard functional forms are supported through user tables.
Simulations may employ several kinds of
geometric restraints, use explicit or implicit solvent, and can be
atomistic or coarse-grained. {\tt mdrun} can run multiple simulations as
part of the same executable, which permits generalized ensemble
methods\cite{MitSugOka2001} such as
replica-exchange.\cite{HanOka1997,SugOka1999} Non-equilibrium methods,
such as pulling and umbrella sampling, are available, as well as
highly powerful alchemical free-energy transformations, and
essential dynamics.\cite{AmaLinBer1993} Many popular simulation file formats can be read natively,
or via a VMD plug-in.\cite{HumDalSch1996}
\begin{figure}
\centerline{\includegraphics[width=3.25in]{figures/multi-level-parallelism}}
\caption{Multi-level parallelism in GROMACS. SIMD registers are used
to parallelize cluster interaction kernels or bonded interactions in each core,
and OpenMP multithreading is used for parallelism inside spatial domains while
non-bonded interactions are handled by GPUs or other accelerators. MPI and load balancing
are used to decompose a single simulation into domains over many nodes in a cluster,
and ensemble approaches are used to parallelize with loosely coupled simulations.
High performance requires that software targets each level explicitly.}
\label{multi-level-parallelism}
\end{figure}
The code scales {\em down} to a few tens of atoms per core (when only using CPUs),
but there will always be practical limits on the degree of parallelism achievable. A typical
150,000-atom system has about thirty million particle-particle
interactions per MD step, which will not scale to a million-core system
because communication and book-keeping costs will dominate. Thus, as
core counts continue to grow, we expect ensemble-level parallelism to
play an increasingly important role in MD algorithm development. The
Copernicus framework has been developed alongside
GROMACS to serve this need and scale to tens of thousands
of simulations.\cite{ProPouRot2015} It currently supports free-energy
calculations, Markov state modeling, and the string
method using swarms\cite{PanRou2008} (\verb+http://copernicus.gromacs.org+).
\subsection{A parallel analysis framework}
A new C++ framework for developing GROMACS analysis tools has been
introduced, which makes it easy to write new tools
that require only a simple loop over all trajectory frames. The framework
also provides reusable components for grid-based neighbor searching
and common data processing tasks like histograms. Some
tools for computing basic geometric properties (distances and angles),
as well as surface area calculation, have been converted to the new
framework, though much work remains to achieve the full benefits of
the new scheme. Future development also aims to support analysing
trajectory frames in parallel, and to add Python bindings.
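Structurally, a tool in the new framework boils down to a per-frame callback plus a finishing step; the following C++ sketch is purely illustrative (the class and member names are hypothetical and do not reproduce the actual GROMACS analysis API):
\begin{verbatim}
// Illustrative frame-loop analysis tool (hypothetical
// names). The real framework adds selections, neighbor
// searching and data-processing components around this
// same basic structure.
#include <cstdio>
#include <vector>

struct Frame {
    double time;
    std::vector<double> x;   // flattened coordinates
};

class DistanceTool {
public:
    void analyzeFrame(const Frame& fr) {
        // ... compute per-frame quantities from fr.x ...
        std::printf("frame at t = %g ps\n", fr.time);
    }
    void finish() {
        // ... write averages, histograms, etc. ...
    }
};
\end{verbatim}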
\subsection{New simulation features}
Just as PME has eliminated cut-off artefacts for electrostatics,
there has been increasing attention to cut-off problems with
van der Waals interactions. While dispersion corrections alleviate
some issues, the fundamental problem is that complex systems
such as membranes are neither homogeneous nor isotropic.
GROMACS 5 includes a new, very accurate Lennard-Jones PME
implementation\cite{Wennberg13} that is only
10--20\% more expensive than plain cut-offs in GROMACS, and to the
best of our knowledge about an order of magnitude faster than any other
alternative. It works for both geometric and Lorentz-Berthelot combination
rules, and should enable much more accurate membrane simulations,
free energies, and improved force-field parameterization.
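Analogously to the electrostatic case, Lennard-Jones PME rests on the standard Ewald decomposition of the $r^{-6}$ dispersion term,
\[
\frac{1}{r^{6}} = \frac{g(\beta r)}{r^{6}} + \frac{1-g(\beta r)}{r^{6}},
\qquad g(x) = e^{-x^{2}}\left(1 + x^{2} + x^{4}/2\right),
\]
where the first, short-ranged term is evaluated together with the other non-bonded kernels and the smooth remainder is handled on the PME mesh.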
Other new features include Andersen-style thermostats, the Adaptive Resolution Sampling
scheme\cite{Fritsch2012} for multi-scale models,
Hamiltonian replica exchange, simulated tempering and
expanded-ensemble methods,\cite{Lyubartsev1992}
rotation of groups with the non-equilibrium pulling module,\cite{Kutzner2011} a new
computational electrophysiology module\cite{Kutzner2011b} that can swap molecules from one side of a
membrane to the other, and support for the Interactive
Molecular Dynamics\cite{IMD} protocol to view and manipulate ongoing
simulations. The high-quality ``counter-based'' parallel random number generator
Random123\cite{Random123} is now used. New bonded interactions were
introduced for coarse-grained
simulations.\cite{MonicaGoga2013} Flat-bottomed position restraints
were added to avoid perturbing models unnecessarily.
GROMACS 5 also comes with a new, highly flexible
and efficient compressed file format, TNG.\cite{TNG} It
improves on the previous best-in-class XTC trajectory compression by further
exploiting domain knowledge and multi-frame compression, adds features such as
containers for general simulation data and digital signatures,
and provides a library to which tool developers may link.
\section{Impact}
In addition to the thousands of publications using GROMACS every year,
one of the most exciting parts of free software is how other people put it to use
in ways not anticipated. GROMACS has long been deployed in the
Folding@Home distributed computing project,\cite{ShiPan2000} and
it is frequently used for metadynamics together with PLUMED.\cite{PLUMED2}
Coarse-grained force fields such as MARTINI\cite{MARTINI} use the GROMACS
infrastructure to implement mesoscale physics models that access otherwise impossible
scales of time and distance. Databases of topology file inputs and associated
thermochemical results have begun to appear\cite{CalvanHon2012,vanvanCal2012},
and several online services can produce coordinates,
parameters and topologies for GROMACS simulations.\cite{SwissParam,ATB}
Extending and reusing parts or all of GROMACS is explicitly encouraged.
\section{Performance \& Scaling}
GROMACS can scale to the largest machines in the world when using gigantic
systems, and detailed benchmarks are available on the website. For this work,
we want to illustrate the efficiency with more challenging heterogeneous benchmarks
used in recent studies: first, a very small voltage-sensor domain (VSD) embedded in a united-atom lipid bilayer in
a hexagonal box (45,700 atoms)\cite{Henrion2012}, and second, a complete
ion channel (GluCl) embedded in a larger united-atom bilayer (142,000 atoms).\cite{Yoluk2013}
All simulations use PME and initial cut-offs of 1.0 nm. All bonds were constrained with the
LINCS\cite{Hes2008} algorithm, and the VSD uses virtual interaction sites constructed every
step to extend the time step to 5 fs. A stochastic
velocity-rescaling thermostat was used.\cite{BusDonPar2007} Fig.~\ref{fig-benchmark} shows how
the fraction of CPU cycles spent on force evaluation in the VSD system drops dramatically when adding
SIMD and GPUs. Case ``E'' illustrates the cost of running multiple MPI ranks on the CPU that effectively
time-share the GPU, which introduces decomposition overhead.
With GPUs, there is only a small component left for bonded forces; the CPU
primarily evaluates the PME mesh part.
The absolute performance is much higher with acceleration (explaining the larger fractions for constraints and PME),
as illustrated in the second panel. These results were obtained on a single-socket desktop
with an 8-core Core i7-5960X and a single NVIDIA GTX 980 GPU. With SIMD, GPU, and OpenMP
acceleration combined, the desktop achieves close to 200 ns/day for the VSD.
Fig.~\ref{fig-performance} shows absolute performance for the larger GluCl system
on CPU-only and GPU-equipped Cray clusters. Despite the much faster CPUs with AVX2 support
on the XC40, the older XC30 nodes paired with K20X GPUs beat them handily. With similar CPUs
and K40 GPUs, we expect accelerated clusters to deliver about three times the performance of CPU-only ones.
\begin{figure}
\centering
\includegraphics[width=0.5\textwidth]{figures/gmx5-paper-data-singlenode_combined_5969X-980.pdf}
\caption{The anatomy of GROMACS performance. {\em Left:} The reference
setup (A) spends the majority of wall-time in short-range force evaluations, but as SIMD (B), OpenMP (C, D), and
GPU (E, F) acceleration are enabled this drops tremendously even on a workstation.
With GPUs (E, F), the CPU computes a relatively small amount of bonded interactions.
Relying on multi-threading (F) when using GPUs is more efficient than SPMD parallelization
with domain decomposition (E), even though multi-threading cache-intensive code like PME is challenging.
{\em Right:} The absolute performance
is orders of magnitude higher with acceleration, which explains the larger fraction
of CPU time spent on constraints, update, and reciprocal-space PME.}
\label{fig-benchmark}
\end{figure}
\begin{figure}
\includegraphics[width=0.5\textwidth]{figures/all-machines-ion_channel}
\caption{Absolute scaling performance on a GluCl ion channel of
142,000 atoms, 2 fs time steps without virtual sites, with PME and
initial cutoffs 1.0 nm, running on Beskow (32 Haswell CPU cores per
node, Cray XC40) and Piz Daint (8 Sandy Bridge CPU cores + Tesla
K20X per node, Cray XC30).}
\label{fig-performance}
\end{figure}
\section{Conclusions}
The recent efforts to maximize single-core and
single-node performance show benefits at all levels, which makes it even
more impressive that the relative scaling has also improved substantially.
The enhancements described above boost user throughput on all kinds and
quantities of hardware, but they were focused on mature technologies.
There are many other approaches open for future performance improvements
to GROMACS.
In particular, an implementation that expresses the algorithms as fine-grained tasks,
which can be preempted when high-priority work is available and
automatically balanced across otherwise-idle executors, seems very attractive.
Task parallelism has been quite successful e.g.\ in NAMD with the Charm++
programming model,\cite{Phillips2005} but with the latency constraints
we are targeting we need a lower-level implementation. Currently
the most promising one is Intel's Threading
Building Blocks,\cite{TBB} because it, too, is C++98-compatible and open
source. With TBB, we expect to be able to retire our home-grown
thread-MPI implementation and avoid requiring OpenMP support,
as well as opening the door to even higher absolute performance
and scaling. To get started, download the code from \verb+http://www.gromacs.org+.
\section*{Acknowledgements}
This work was supported by the European Research Council (258980, BH),
the Swedish Research Council (2013-5901, EL), the Swedish e-Science
Research Center, the ScalaLife EU infrastructure (261523), the EU
FP7 CRESTA project (287703), Intel Corporation, and the ORNL Adaptive Biosystems
Imaging project funded by the Office of Biological and Environmental Research of the Office of
Science of the U.S. Department of Energy (RS, JCS).
Computational resources were provided by
the Swedish National Infrastructure for Computing (2014/1-30 \& 2014/11-33),
and the Swiss National Supercomputing Centre CSCS (g43). GROMACS would not be possible
without a large and loyal team working on code review, triage, and
contributing code and fixes.
\bibliographystyle{softwarex}
\bibliography{bibliography}
\end{document}