Imagine we have n particles in our "universe". Each particle starts with random initial x, y, and z coordinates. As described by Newton's law of universal gravitation, each particle attracts every other particle with a force that is directly proportional to the product of their masses and inversely proportional to the square of the distance between their centers. As a result, these particles gain (and lose) velocity and change position over time. Modelling this physical mechanism is called an N-body simulation.
Many N-body simulation algorithms exist. Some are less advanced and computationally costly (execution time on the order of O(N^2)) but simple and easy to understand. Others are more advanced and significantly more efficient (execution time on the order of O(N log N)) but not as simple. This article focuses on the implementation aspect of the less advanced toy algorithm, for ease of understanding.
This article consists of two parts:
- Part 1: Optimize the single-node implementation for a fixed problem size of 16,384 particles (i.e. 268,419,072 particle-to-particle interactions) to boost the baseline code performance from 0.07% to ~62% efficiency.
- Part 2: Scale the implementation to handle an increased problem size of 1,048,576 particles (i.e. 1,099,510,579,200 particle-to-particle interactions) with efficient distributed computing, while keeping in mind the need to complete each time step within a millisecond-to-sub-second timeframe.
We run all our N-body simulations exclusively on Colfax Cluster nodes with Intel Xeon Phi (Knights Landing / KNL) 7210 host processors, while bearing in mind modern code practices for ease of adaptation to other Intel architectures (such as Knights Corner / KNC, or Xeon / IVB). We use an established industry rule-of-thumb that each particle-to-particle interaction requires 20 FLOP (Floating-point Operations) to estimate our code performance before versus after tuning. Our goal is to get as close as possible to the theoretical peak performance, which for a Xeon Phi 7210 host processor is 4,505 GFLOPS (Giga Floating-point Operations Per Second). For coding convenience, we assume all particles have the same mass of 1 unit.
All implementation source codes and performance logs are available in this GitHub Repository.
Our initial (un-tuned) baseline implementation, although very human-friendly and easy to understand, achieves only 3.2 GFLOPS (0.07% efficiency). In other words, the execution time is much higher than the "at-best" O(N^2) timeframe the algorithm should allow. We show that with code optimization it is possible to increase performance to ~2,831 GFLOPS (62.85% efficiency) on a single node. That's an ~884x speed-up.
The following table summarises our code optimization journey using a combination of techniques including Multi-threading Parallelism, Scalar Tuning, Vectorization Tuning, and Memory Tuning.
Implementation | Average Performance (GFLOPS) | Speed-up | Efficiency |
---|---|---|---|
1 - Baseline - No Optimization | 3.2 +- 0.0 | 1.00x | 0.07% |
2 - Thread Tuning (Parallelize Level-2 (Inner) For Loop) | 12.5 +- 0.0 | 3.91x | 0.31% |
3 - Thread Tuning (Parallelize Level-1 (Outer) For Loop) | 357.6 +- 9.8 | 111.75x | 7.94% |
4 - Scalar Tuning (Avoid Precision Conversion Overhead) | 444.5 +- 0.7 | 138.91x | 9.87% |
5 - Scalar Tuning (Replace Divide with Multiply) | 580.3 +- 1.2 | 181.34x | 12.88% |
6 - Scalar Tuning (Replace Power of Half with Square Root) | 623.3 +- 1.5 | 194.78x | 13.84 % |
7 - Scalar Tuning (Enables Aggressive Optimizations on Floating-point Data with Compiler Option) | 654.4 +- 3.1 | 204.5x | 14.53% |
8 - Vectorization Tuning (Restructure Data from AoS to SoA) | 1889.7 +- 11.6 | 590.53x | 41.95% |
9 - Vectorization Tuning (Remove Peel Loop Overhead by Aligning Array Access) | 2091.9 +- 13.7 | 653.72x | 46.44% |
Memory Access Bottleneck - Implementation 9 tests on varying NSIZE | Performance degrades as NSIZE increases | (See Section) | (See Section) |
10 - Memory Tuning (Loop Tiling with Register Blocking). Performance stays constant regardless of NSIZE increases | 2831.4 +- 8.6 | 884.81x | 62.85% |
We see that each optimization technique contributes to the overall performance enhancement - some more significantly than others.
We then evolve our application to Implementation 11, which distributes the workload across multiple cluster nodes using Intel MPI (Message Passing Interface). We apply cluster communication tuning to reduce inefficiencies resulting from Ethernet communication among cluster nodes. At the end we perform a series of (strong and weak) scalability tests and draw out some trends.
Strong Scaling Tests:
- Given a small fixed problem size (16,384 particles, i.e. 268,419,072 interactions), the more nodes we distribute the workload across, the longer the job takes to complete. Overall performance degrades as nodes are added, likely because the distributed computing benefit is outweighed by the more significant per-node performance degradation.
- Given a large fixed problem size (1,048,576 particles, i.e. 1,099,510,579,200 interactions), the more nodes we distribute the workload across, the shorter the job takes to complete. Overall performance increases with more nodes, likely because the distributed computing benefit far outweighs the less significant per-node performance degradation.
Weak Scaling Tests:
- Given a small fixed problem size per node (16,384 particles per node, i.e. 268,419,072 interactions per node), the rate of per-node performance degradation is significant as nodes are added. Distributed computing is less robust.
- Given a large fixed problem size per node (262,144 particles per node, i.e. 68,719,214,592 interactions per node), the rate of per-node performance degradation is less significant as nodes are added. Distributed computing is more robust (up to a certain point).
At the end of Part 2 we perform a bonus "push our limit" test. We show that it is possible to complete one time step of an N-body simulation for over 1 million (1,048,576) particles, or over 1 trillion (1,099,510,579,200) particle-to-particle interactions, in a sub-second (~662 ms) timeframe, by distributing the (semi-) optimized code to run across 16 nodes in parallel. That's equivalent to an overall performance of 33,208 GFLOPS.
Given our toy algorithm's execution time is of order O(N^2), imagine replacing it with a more advanced O(N log N) algorithm - that could yield an even greater performance boost and scalability, and may deserve a separate research article.
Given we have n particles in our "universe", each particle experiences a net gravitational attractive force (with magnitude and direction) caused by all other surrounding particles. This net force is described by Newton's law of universal gravitation:
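In symbols (notation mine; $G$ is the gravitational constant, $m_i$, $m_j$ are the particle masses, and $\vec{r}_i$, $\vec{r}_j$ the particle positions), the net force on particle $i$ is:

$$\vec{F}_i \;=\; G \, m_i \sum_{j \neq i} m_j \, \frac{\vec{r}_j - \vec{r}_i}{\lVert \vec{r}_j - \vec{r}_i \rVert^{3}}$$

With all masses set to 1 unit (and $G$ folded into the units), this is exactly the sum the code accumulates later on: each term contributes dx / drPower32 (and likewise for y and z), where drPower32 = (dx^2 + dy^2 + dz^2)^(3/2).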
There are many existing algorithms. For the purpose of this article, we will use the most basic (and easiest to understand) toy algorithm that models the N-body simulation. The algorithm has an execution time of, at best, the order of O(N^2) (i.e. the order of the number of particles squared). It should be noted that there are much more efficient algorithms out there (with execution time of order O(N log N)). This article focuses on the efficient code implementation aspect: if the code were 100% efficient, we should expect our toy algorithm to have an execution time of around O(N^2). An inefficient implementation will have an execution time much higher than this expected "at-best" value.
For simplicity, we shall assume all particles have a mass of 1 unit, with initial positions randomized (via a uniform distribution). We will also assume only 16,384 particles in our universe (i.e. 268,419,072 particle-to-particle interactions, or N*(N-1)) to ensure our job completes within a reasonably short timeframe (milliseconds to a few seconds per time step) for practical reasons.
We will start with our most basic and easy to understand Implementation 1, which as we will see later is only 0.07% efficient. We will then incrementally tune our code to make it more efficient, following the 5 key optimization areas: Threading Parallelism, Scalar Tuning, Vectorization Tuning, Memory Tuning, and Cluster Communication Tuning. After the 10th implementation, our code performance will have improved to around 62.85% efficiency.
In the 11th implementation, we will apply distributed computing to perform an N-body simulation with over 1 million (1,048,576) particles, or equivalently, over 1 trillion (1,099,510,579,200) particle-to-particle interactions, at a speed of 33,208.9 GFLOPS - completing one time step in a sub-second (~662 ms) timeframe. We will appreciate the power of code tuning, parallel programming, and distributed computing, as well as their limitations (such as per-node performance loss due to the Ethernet communication used in the cluster, among other factors).
The source files and performance logs for all implementations are available in the GitHub Repository.
All experiments described in this article were run exclusively on Colfax Cluster nodes, each fitted with an Intel Xeon Phi (Knights Landing / KNL) 7210 host processor. At the time of writing I have access to up to 60 of these nodes. The code may be adapted easily to other Intel architectures (such as Xeon and Xeon Phi Knights Corner), but that is out of scope for this article. I would recommend the Colfax How-to Series for a more comprehensive introduction and demonstrations.
For the Xeon Phi 7210 host processor used in this article, the theoretical peak performance is estimated to be around 4,505 GFLOPS (Giga Floating-point Operations Per Second) for single floating-point precision (SP). See our other article on how we derive this estimate.
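As a rough sanity check on that figure (the per-core breakdown below is my assumption of how the estimate is built up, not a quote from the linked article): a KNL 7210 has 64 cores, each with 2 AVX-512 VPUs processing 16 single-precision lanes per cycle, each lane able to issue a fused multiply-add (2 FLOP), at roughly the 1.1 GHz clock used later in this article:

$$64 \times 1.1\,\text{GHz} \times 2 \times 16 \times 2 \;\approx\; 4{,}505\ \text{GFLOPS}$$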
Also make a note of these terms:
- GFLOP = Giga Floating-point Operations. We use this to describe the amount of computational work.
- GFLOPS (same as GFLOP/s), stands for Giga Floating-point Operations Per Second. We use this to describe performance.
The initial baseline code may be easy to read and understand - unfortunately at the costly trade-off of poor performance due to no optimization. This slide from Colfax Research provides a neat overview of the 5 core optimization areas we may look into to increase performance:
At Node-level:
- Scalar Tuning
- Vectorization Tuning
- Memory Tuning
- Threading Tuning
At Cluster-level:
- Communication Tuning
We will start from the baseline (non-optimized) code, gradually apply these tunings (stacking them on top of each other), and eventually arrive at much higher performance efficiency and scalability.
The sample codes may be found at this GitHub repository.
nbody.cc is our source code. Makefile enables us to compile the code and submit the job to run on the Colfax Cluster. It is where we define important parameters, including (1) the number of particles in our n-body simulation and (2) the number of nodes to run on.
- Navigate to the implementation directory imp-xx.
- (optional) Review and update nbody.cc as needed.
- (optional) Review and tweak Makefile as needed, e.g. set the number of nodes to run on, the number of particles, etc.
- Issue make queue-knl to compile and run the application on the Colfax Cluster node (as defined by Makefile).
- Review the output file nbody-knl.oXXXXX at ./log, where XXXXX is the job number.
- Review the optimization report nbody.oKNL.optrpt for potential optimization opportunities.
A Makefile makes the code compilation process scalable and repeatable. This YouTube video by Paul Programming is a good starting point for the basics. At a very high level, this is how a Makefile usually works:
- Compile each source file xxx.cc into an intermediate object file xxx.o (or xxx.oYYY).
- Link the intermediate object file xxx.o into the output binary xxx (the eventual executable).
The syntax of a Makefile may look something like this - read the comments in this sample Makefile for understanding:
# Set environmental variable to be used within makefile like this: $(ENVIRONMENTAL_VARIABLE_1)
ENVIRONMENTAL_VARIABLE_1 = 123
ENVIRONMENTAL_VARIABLE_2 = Hello
# -- Compilation and Linking Syntax (space sensitive) --
# Target: Dependencies
# Compilation / Linking command
# ------------------------------------------------------
# Compile source to object file
# If ``hello.cc`` (source file) is newer than ``hello.o`` (intermediate object file), then compile ``hello.cc`` to ``hello.o``
# - this example uses icpc as the compiler.
# - The ``-c`` flag says we are compiling the source file `hello.cc` into intermediate object file `hello.o`,
# instead of a binary.
# - The ``-o hello.o`` says, output to ``hello.o``
# - The ``hello.cc`` is the source file
hello.o : hello.cc
icpc -c -o hello.o hello.cc
# Link object file to binary
# If ``hello.o`` (intermediate object file) is newer than ``hello`` (binary executable), then link ``hello.o`` into ``hello``
# - this example uses ``icpc`` as the compiler driver for linking.
# - Note there is no ``-c`` flag here: we are linking, not compiling.
# - The ``-o hello`` says, output to the binary ``hello``
# - The ``hello.o`` is the intermediate object file being linked
hello: hello.o
icpc -o hello hello.o
Our Makefile will look something like this.
There will be some more advanced syntax in our Makefile, such as:
- the use of suffix rules. See Suffix Rules
- the funny $@ and $< symbols. See Stackoverflow
- the wildcard % symbol. See Static Pattern Rule Usage
Have a go at reading through the Makefile and see if you understand it. I've added some comments within the Makefile to aid understanding. See also some notes below the Makefile.
Notes:
- How did we come up with the starting NSIZE of 16,384? See the next section, Use High Bandwidth Mode, for more info.
- Regarding the -DKNLTILE compiler flag: though this is not used and won't impact our implementation code, this flag may be used to inject a preprocessor macro KNLTILE into the nbody.cc code, which we could use within our code for "if this then that" type logic (e.g. if KNL, then tile size is this; if KNC, then tile size is that, etc.). Look for something like #ifdef KNLTILE in the code (though I might have removed these as we didn't use them).
Consider this:
- A Xeon Phi MCDRAM (NUMA node 1) has up to 16 GiB of RAM (16 * 2^30 bytes).
- MCDRAM is high bandwidth (up to ~480 GB/s). This is roughly 5 times the bandwidth of DDR4 RAM (up to ~90 GB/s).
- If our application can fit into 16 GiB of RAM, we should consider using it.
Now the next question: roughly how many particles in our n-body simulation can fit into the 16 GiB MCDRAM? Let's do some rough math:
- As we will see in our baseline implementation shortly, each particle has 6 float variables (object properties): 3 for the 3D position (x, y, z) and 3 for the velocity in each direction (vx, vy, vz).
- A float variable takes up 4 bytes.
- Each particle therefore takes up 24 bytes (6 float variables * 4 bytes per float = 24 bytes).
- We can fit roughly (16 * 2^30 / 24) = 715,827,882 particles in MCDRAM. The actual number may be less due to other factors. Still, this is a big number. Considering our processor runs at around 1.1 GHz (1.1 billion cycles per second), and the N-body algorithm's execution time is O(N^2), should we wish our baseline (serial) job to complete within the "seconds" timeframe, it would be wise to restrict our number of particles to a maximum of around sqrt(1e9) = 31,623 to start with. But just to be safe, we can halve our initial starting particle count - to around 15,811. Interestingly, the original Makefile from the Colfax HPC Session 6 exercise starts with 16,384 - not too far from our rough math at all! Note: it turns out that when we get to Implementation 10, we will see the need to "dice" the particles into tiles of 16 particles, and 16,384 particles divided by 16 (particles per tile) is a nice round number, 1024 (i.e. 2^10). That probably explains the choice. (Note to self: requires validation. This so-called "rough math" could have been, sadly, pure fluke!) A small code sketch of this arithmetic follows the list.
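A minimal sketch of the rough math above, assuming the 24-byte-per-particle figure and the 16 GiB MCDRAM size stated in the list:

#include <cstdio>
#include <cmath>

int main() {
  const double mcdramBytes = 16.0 * (1 << 30);        // 16 GiB of MCDRAM
  const double bytesPerParticle = 6 * sizeof(float);  // x, y, z, vx, vy, vz
  std::printf("Particles that fit in MCDRAM: ~%.0f\n", mcdramBytes / bytesPerParticle);

  // Rough serial-time budget: ~1e9 cycles/s of useful work and O(N^2) interactions
  // => keep N around sqrt(1e9) to stay within the "seconds" timeframe per step.
  std::printf("Rough serial N for ~1 s/step: ~%.0f\n", std::sqrt(1e9));
  return 0;
}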
Part 1: Single Node Application Optimization - Threading, Scalar Tuning, Vectorization Tuning, and Memory Access Tuning
Implementation 1 Source Files.
This is the starting point baseline implementation of the N-body simulation.
Algorithm Explanation:
- We obtain the start and end times of running the N-body simulation (i.e. MoveParticles(...)) by wrapping the simulation with a pair of omp_get_wtime() calls, assigned to tStart (start time) and tEnd (end time). The duration (in seconds) may be computed as tEnd - tStart.
- HztoInts is the total number of particle interactions within the step. Since we have two nested for loops computing the interaction between all particles (including the interaction of a particle with itself), there would be a total of nParticles * nParticles interactions. Convention note: there is a convention of using nParticles * (nParticles - 1) instead, to describe the total number of particle-to-particle interactions without a particle interacting with itself. This is a purely conventional choice. The original code that Colfax Research presents in the video uses nParticles * (nParticles - 1), so let's stick to that for consistency. Appreciate that as nParticles becomes large, the effect of that "minus 1" becomes very insignificant - it's just convention.
- HztoGFLOPs is the total estimated number of Giga Floating-point Operations within the step. We assume each particle-to-particle interaction requires 20 operations (a well-established rule-of-thumb in the n-body simulation community). To convince ourselves, if we do a quick count of all the operations (add, subtract, multiply, divide, etc.) within the innermost nested loop, we come to around 20 operations - more or less. There are nParticles * (nParticles - 1) interactions, so roughly 20 * nParticles * (nParticles - 1) floating-point operations (FLOP). To make the number smaller, we use GFLOP (instead of FLOP); 1 GFLOP = 1e9 FLOP. So in GFLOP: 20 * nParticles * (nParticles - 1) * 1e-9 GFLOP for all interactions. We assign this to HztoGFLOPs.
- We use the per-step HztoInts to compute the per-second version with HztoInts / (tEnd - tStart), i.e. the number of particle-to-particle interactions per second in a time step.
- We use the per-step HztoGFLOPs to compute the per-second version with HztoGFLOPs / (tEnd - tStart), i.e. the number of GFLOP per second (GFLOPS) in a time step. We assign this to the variable rate. This is analogous to the assumed mean D in the short-cut method standard deviation formula. A sketch of the baseline kernel follows this list.
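For readers who have not opened the repository yet, here is a minimal sketch of what the baseline (AoS, serial) kernel looks like. The exact code lives in the Implementation 1 directory of the repository; treat the details below (e.g. the softening constant added to drSquared) as illustrative rather than a verbatim copy:

#include <math.h>   // global pow(), as discussed later in Implementation 4

struct ParticleType { float x, y, z, vx, vy, vz; };

void MoveParticles(const int nParticles, ParticleType* const particle, const float dt) {
  // Force accumulation: O(N^2) pair interactions
  for (int i = 0; i < nParticles; i++) {
    float Fx = 0.0f, Fy = 0.0f, Fz = 0.0f;
    for (int j = 0; j < nParticles; j++) {
      const float softening = 1e-20f;                   // avoids divide-by-zero when i == j (assumption)
      const float dx = particle[j].x - particle[i].x;
      const float dy = particle[j].y - particle[i].y;
      const float dz = particle[j].z - particle[i].z;
      const float drSquared = dx*dx + dy*dy + dz*dz + softening;
      const float drPower32 = pow(drSquared, 3.0/2.0);  // the line tuned in Implementations 4-6
      Fx += dx / drPower32;
      Fy += dy / drPower32;
      Fz += dz / drPower32;
    }
    particle[i].vx += dt * Fx;                          // integrate velocity
    particle[i].vy += dt * Fy;
    particle[i].vz += dt * Fz;
  }
  // Position update: a second, separate loop
  for (int i = 0; i < nParticles; i++) {
    particle[i].x += particle[i].vx * dt;
    particle[i].y += particle[i].vy * dt;
    particle[i].z += particle[i].vz * dt;
  }
}

The main loop then wraps each call in omp_get_wtime() and converts the measured duration to interactions per second and GFLOPS using the HztoInts and HztoGFLOPs factors described above.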
Implementation 1 Performance Log
The average step performance: 3.2 +- 0.0 GFLOPS (0.07% of theoretical peak 4,505 GFLOPS).
This is very poor.
Can we apply some quick wins, with multi-threading parallelism, to achieve higher efficiency? We will do just that in Implementation 2.
We parallelize the inner nested loop with OpenMP multi-threading. We specify Fx, Fy, and Fz in the reduction clause to prevent a data race: each thread accumulates its own private copies of Fx, Fy, and Fz, and OpenMP combines them safely at the end of the loop. We learnt about this reduction technique in our previous article, Optimize a Numerical Integration Implementation with Parallel Programming and Distributed Computing.
Note the following one-liner addition just before the inner nested for loop:
#pragma omp parallel for reduction(+: Fx, Fy, Fz)
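In context, the pragma sits directly above the j-loop, roughly like this (a sketch based on the baseline kernel shown earlier, not a verbatim copy of the Implementation 2 source):

for (int i = 0; i < nParticles; i++) {
  float Fx = 0.0f, Fy = 0.0f, Fz = 0.0f;
  // Each thread accumulates private partial sums of Fx/Fy/Fz;
  // OpenMP combines them when the loop finishes.
  #pragma omp parallel for reduction(+: Fx, Fy, Fz)
  for (int j = 0; j < nParticles; j++) {
    const float dx = particle[j].x - particle[i].x;
    const float dy = particle[j].y - particle[i].y;
    const float dz = particle[j].z - particle[i].z;
    const float drSquared = dx*dx + dy*dy + dz*dz + 1e-20f;
    const float drPower32 = pow(drSquared, 3.0/2.0);
    Fx += dx / drPower32;
    Fy += dy / drPower32;
    Fz += dz / drPower32;
  }
  particle[i].vx += dt * Fx;
  particle[i].vy += dt * Fy;
  particle[i].vz += dt * Fz;
}

Note the parallel region is created and torn down once per i iteration - the overhead we observe below.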
Implementation 2 Performance Log
The average step performance: 12.5 +- 0.0 GFLOPS (0.31% of theoretical peak 4,505 GFLOPS).
This is better than the 0.07% in Implementation 1, but still very poor.
Observations:
- We have ~16,000 particles
- We have placed the parallelism at the inner j-loop. (not the outer i-loop).
- Inefficiency 1: We perform ~16,000 iterations in the outer i-loop, i.e. we start and stop a multi-threaded parallel region ~16,000 times. This is significant overhead.
- Inefficiency 2: For single precision (SP) floating-point operations, the VPU (Vector Processing Unit) can perform 16 computations at once (we call these 16 concurrent computations "1 vectorized computation"). In other words, at each i iteration we perform roughly (16,000 / 16 =) 1,000 vectorized computations within the inner j-loop. We distribute 1,000 vectorized computations across 256 threads, i.e. each thread performs only roughly 3-4 vectorized computations. It turns out that 3-4 vectorized computations per thread is too little work for good load balancing on a Knights Landing processor (the cost of multi-thread coordination may outweigh the benefit gained). Note to self: this statement may need validating.
To remediate the inefficiencies noted in Implementation 2, here in Implementation 3 we apply parallelism at the level-1 (outer) loop. Note we now have two #pragma omp parallel for directives, one at each of the two outer loops. There are no data races, so no reduction is needed.
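A sketch of the change (again based on the baseline kernel shown earlier; the position-update loop gets its own parallel for as well):

// Each thread owns a disjoint range of i, so Fx/Fy/Fz are thread-private
// locals and no reduction clause is needed.
#pragma omp parallel for
for (int i = 0; i < nParticles; i++) {
  float Fx = 0.0f, Fy = 0.0f, Fz = 0.0f;
  for (int j = 0; j < nParticles; j++) {
    // ... same force accumulation on the (i, j) pair as in the earlier sketches ...
  }
  particle[i].vx += dt * Fx;
  particle[i].vy += dt * Fy;
  particle[i].vz += dt * Fz;
}

#pragma omp parallel for
for (int i = 0; i < nParticles; i++) {
  particle[i].x += particle[i].vx * dt;
  particle[i].y += particle[i].vy * dt;
  particle[i].z += particle[i].vz * dt;
}

The parallel region is now entered once per loop per time step rather than ~16,000 times, and each thread gets a much larger chunk of work.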
Implementation 3 Performance Log
The average step performance: 357.6 +- 9.8 GFLOPS (7.94% of theoretical peak 4,505 GFLOPS).
A big jump from the previous 0.31% efficiency. Can we do better?
Having a look at the optimization report we see some suspicious remarks (which we will explain below the report):
LOOP BEGIN at nbody.cc(16,3) inlined into nbody.cc(106,5)
remark #25096: Loop Interchange not done due to: Imperfect Loop Nest (Either at Source or due to other Compiler Transformations)
remark #25451: Advice: Loop Interchange, if possible, might help loopnest. Suggested Permutation : ( 1 2 ) --> ( 2 1 )
remark #15542: loop was not vectorized: inner loop was already vectorized
LOOP BEGIN at nbody.cc(22,5) inlined into nbody.cc(106,5)
remark #15415: vectorization support: non-unit strided load was generated for the variable <particle[j]>, stride is 6 [ nbody.cc(28,24) ]
remark #15415: vectorization support: non-unit strided load was generated for the variable <particle[j]>, stride is 6 [ nbody.cc(29,24) ]
remark #15415: vectorization support: non-unit strided load was generated for the variable <particle[j]>, stride is 6 [ nbody.cc(30,24) ]
remark #15305: vectorization support: vector length 16
remark #15309: vectorization support: normalized vectorization overhead 0.344
remark #15417: vectorization support: number of FP up converts: single precision to double precision 2 [ nbody.cc(32,30) ]
remark #15418: vectorization support: number of FP down converts: double precision to single precision 1 [ nbody.cc(32,30) ]
remark #15300: LOOP WAS VECTORIZED
remark #15452: unmasked strided loads: 3
remark #15475: --- begin vector cost summary ---
remark #15476: scalar cost: 170
remark #15477: vector cost: 20.000
remark #15478: estimated potential speedup: 7.830
remark #15486: divides: 3
remark #15487: type converts: 3
remark #15488: --- end vector cost summary ---
LOOP END
Good signs:
- vector length 16 (the compiler automatically identified the vectorization opportunity and went ahead and vectorized the loop)
Bad signs:
- 2 floating-point (FP) up-conversions from single precision (SP) to double precision (DP), and 1 FP down-conversion from DP to SP. This seems like unnecessary overhead. Scalar tuning required.
- divides: 3 (the division operation is slower than multiplication. Can we replace the duplicate divisions with a single reciprocal multiplication and reuse it multiple times instead?). Scalar tuning required.
- non-unit strided load: the way we structure our data / array of particle objects might not be optimized for fast memory access. We can improve this by restructuring our data, possibly at the trade-off of reduced code readability / good object-oriented programming style. In this article, however, performance is the priority. Vectorization tuning required.
- The estimated potential speedup: 7.830. If the code is vectorization-tuned, this number should get close to the vector length of 16. Vectorization tuning required.
Now that we have identified some scalar tuning opportunities, let's focus on these first. We will do just that in our next Implementation 4.
In the optimization report we identified a bad sign: 2 single-to-double precision conversions, and 1 double-to-single precision conversion. These correspond to the following line of code:
const float drPower32 = pow(drSquared, 3.0/2.0);
Where the precision conversions come from:
- The C++ global pow (power) function by default takes double precision inputs and produces a double precision output; overloading does not occur for this global function. drSquared is a float (single precision), so the compiler automatically converts it to double (double precision). This is the first single-to-double precision conversion.
- The exponent 3.0/2.0 is likewise evaluated in double precision (roughly as if it were 3.0/(double)2.0), rather than as the float value we intended. This contributes the second single-to-double precision conversion.
- Now that both inputs are double precision, pow returns a double precision output. But because we have declared drPower32 as a float (single precision), the compiler converts the result from double back to float. This is our first double-to-single precision conversion. Check out this Stackoverflow post relating to this.
To summarize, we have 2 SP-to-DP conversions. 1 DP-to-SP conversion. None of these are necessary. This can be easily prevented with some simple scalar tunings.
The Scalar Tunings:
- To enable overloading (point 1 above), we use std::pow (instead of the global pow). If we supply float (single precision) inputs, it will produce a float (single precision) output.
- To ensure both our inputs are float (single precision): we already know that drSquared is a float, so we don't need to do anything about that. The 3.0/2.0, however, needs to be written explicitly with float literals - we can do this with 3.0f/2.0f. Simply add the f at the end to ensure each literal is interpreted as a float.
Applying the above scalar tuning, the line of code becomes:
const float drPower32 = std::pow(drSquared, 3.0f/2.0f);
Implementation 4 Performance Log
The average step performance: 444.5 +- 0.7 GFLOPS (9.87% of the theoretical peak 4,505 GFLOPS).
This is slightly better than the 7.94% in the previous Implementation 3. But can we do better?
If we look at the optimization report again, we notice that the precision conversion remarks have gone:
LOOP BEGIN at nbody.cc(22,5) inlined into nbody.cc(106,5)
remark #15415: vectorization support: non-unit strided load was generated for the variable <particle[j]>, stride is 6 [ nbody.cc(28,24) ]
remark #15415: vectorization support: non-unit strided load was generated for the variable <particle[j]>, stride is 6 [ nbody.cc(29,24) ]
remark #15415: vectorization support: non-unit strided load was generated for the variable <particle[j]>, stride is 6 [ nbody.cc(30,24) ]
remark #15305: vectorization support: vector length 16
remark #15309: vectorization support: normalized vectorization overhead 0.458
remark #15300: LOOP WAS VECTORIZED
remark #15452: unmasked strided loads: 3
remark #15475: --- begin vector cost summary ---
remark #15476: scalar cost: 167
remark #15477: vector cost: 15.000
remark #15478: estimated potential speedup: 10.200
remark #15486: divides: 3
remark #15488: --- end vector cost summary ---
LOOP END
For Implementation 5, let's tackle another scalar tuning opportunity: replacing the 3 "divides" with multiplies (because division is slower than multiplication).
It turns out that divides are slower operations than multiplies, so let's try replacing them with a multiply instead. According to the optimization report, here are the opportunities:
Fx += dx / drPower32; // divide
Fy += dy / drPower32; // divide
Fz += dz / drPower32; // divide
If we do the following instead, we will likely realise some efficiency gain:
float oodrPower32 = 1.0f / drPower32; // one over drPower32
Fx += dx * oodrPower32; // multiply
Fy += dy * oodrPower32; // multiply
Fz += dz * oodrPower32; // multiply
A note on the power function:
const float drPower32 = std::pow(drSquared, 3.0f/2.0f);
Even though we have a "divide" within the std::pow (power) function, it turns out (according to the How-to series) that the Intel C++ compiler (with the MIC-AVX512 instruction set) is clever enough to optimize this, using a combination of optimized power and square-root functions. So we need not worry about it. (That said, we will keep an open mind and prove the point by getting rid of that "divide" using the square-root function in Implementation 6. As we will see, it makes hardly any difference.)
Implementation 5 Performance Log
Average Performance: 581.5 +- 0.7 GFLOPS (12.88 % of peak 4505 GFLOP/s).
This is a slight improvement from our previous 9.87 % efficiency.
If we check the optimization report again, the 3 divides have gone (as we expected):
LOOP BEGIN at nbody.cc(16,3)
remark #25096: Loop Interchange not done due to: Imperfect Loop Nest (Either at Source or due to other Compiler Transformations)
remark #25451: Advice: Loop Interchange, if possible, might help loopnest. Suggested Permutation : ( 1 2 ) --> ( 2 1 )
remark #15542: loop was not vectorized: inner loop was already vectorized
LOOP BEGIN at nbody.cc(22,5)
remark #15415: vectorization support: non-unit strided load was generated for the variable <particle->x[j]>, stride is 6 [ nbody.cc(28,24) ]
remark #15415: vectorization support: non-unit strided load was generated for the variable <particle->y[j]>, stride is 6 [ nbody.cc(29,24) ]
remark #15415: vectorization support: non-unit strided load was generated for the variable <particle->z[j]>, stride is 6 [ nbody.cc(30,24) ]
remark #15305: vectorization support: vector length 16
remark #15309: vectorization support: normalized vectorization overhead 0.683
remark #15300: LOOP WAS VECTORIZED
remark #15452: unmasked strided loads: 3
remark #15475: --- begin vector cost summary ---
remark #15476: scalar cost: 99
remark #15477: vector cost: 10.250
remark #15478: estimated potential speedup: 8.750
remark #15488: --- end vector cost summary ---
LOOP END
LOOP BEGIN at nbody.cc(22,5)
<Remainder loop for vectorization>
remark #15305: vectorization support: vector length 16
remark #15309: vectorization support: normalized vectorization overhead 0.646
remark #15301: REMAINDER LOOP WAS VECTORIZED
LOOP END
LOOP END
It turns out that instead of computing "x to the power of 3/2", it is slightly more efficient to compute "1 over the square root of x, then raise it to the power of 3" (in particular, the sqrtf() square-root function is well optimized on a KNL node with the AVX512 instruction set). In other words, we try to push our performance a wee bit further by eliminating that "divide" within the power.
We replace the previous block:
const float drPower32 = std::pow(drSquared, 3.0f/2.0f);
float oodrPower32 = 1.0f / drPower32; // one over drPower32
Fx += dx * oodrPower32; // scalar tuning
Fy += dy * oodrPower32; // scalar tuning
Fz += dz * oodrPower32; // scalar tuning
with this new block:
const float oodr = 1.0f/sqrtf(drSquared);
const float drPowerN3 = oodr*oodr*oodr;
Fx += dx * drPowerN3; // scalar tuning
Fy += dy * drPowerN3; // scalar tuning
Fz += dz * drPowerN3; // scalar tuning
Implementation 6 Performance Log
Average Performance: 623.3 +- 1.5 GFLOPS (13.84% of peak 4505 GFLOP/s).
This is only slightly better than the 12.88% efficiency we obtained in the previous Implementation 5. (This is disappointingly insignificant given the reduced readability of the code. One may give it a second thought to decide whether this change is worth the trade-off. Nevertheless, as our priority is performance, we are going to keep it for now.)
OK, enough scalar tuning for now. In Implementation 7 we will introduce a simple tweak to our Makefile (without changing the code) to boost our performance even further.
Implementation 7 - (Scalar Tuning) Enables Aggressive Optimizations on Floating-point Data with Compiler Option
(Focus only on the Makefile
)
Assuming our code is now optimized from a scalar tuning viewpoint (let's just assume it is for now - otherwise we will be at it forever), there is one more trick we can apply, without changing any code at all, to push scalar tuning to the limit. The magic is to add an extra flag to our Intel C++ (icpc) compilation step: -fp-model fast=2. This option relaxes floating-point precision slightly with the aim of enabling aggressive floating-point optimizations. See this Intel doc.
Here we borrow a slide from Colfax Research, on Intel C++ compiler options:
In our Makefile
, simply add the -fp-model fast=2
option to our compiler. i.e.
Change this line:
OPTFLAGS = -qopt-report=5 -qopt-report-file=$@.optrpt
to
OPTFLAGS = -qopt-report=5 -qopt-report-file=$@.optrpt -fp-model fast=2
Notes:
- Ensure you do a make clean to remove all the previously compiled files (necessary because we have not changed the source code nbody.cc, so make would otherwise see nothing to rebuild).
- Then do a make queue-knl to recompile - note the -fp-model fast=2 flag.
Implementation 7 Performance Log
Average Performance: 654.4 +- 3.1 GFLOPS (14.53% of peak 4505 GFLOP/s).
This is slightly better than the 13.84% we previously had in Implementation 6. Let's move on to our next tuning category - vectorization tuning (which, as we shall see very shortly, gives a significant performance boost!)
Use Structure of Arrays (SoA), instead of Array of Structures (AoS), to make vectorization more efficient (at the trade-off of a potentially less elegant object-oriented programming (OOP) style). Here, borrowing a slide from Colfax Research:
It turns out the SoA style (e.g. particle.x[i], which uses unit stride) is more efficient than the AoS style (e.g. particle[i].x, which uses non-unit stride). This implementation replaces our original AoS data structure with the more efficient SoA.
We create a new data struct
that resembles the more efficient SoA style:
// (Optimum) Structure of Array - bad OOP. Good performance.
struct ParticleSetType {
float *x, *y, *z;
float *vx, *vy, *vz;
};
We replace all indexing of this style:
Old AoS style: particle[i].x
New SoA style: particle.x[i]
We replace all uses of the (old AoS type) ParticleType with the (new SoA type) ParticleSetType.
We also change the MoveParticles() signature a bit - replacing the old AoS style with the new SoA style as follows.
Before: old AoS style - we have a giant container (our array) of N particles (structures) on the free store, hence Array of Structures (AoS). The only way to locate this giant array is via a local pointer variable particle on the stack that points to that free-store container.
void MoveParticles(const int nParticles, ParticleType *const particle, const float dt) {//...};
After: new SoA style (less OOP but more efficient) - we have an object on the stack (our structure) that contains only 6 pointers (to our arrays): x, y, z, vx, vy, vz. Hence Structure of Arrays (SoA). This structure is very small, and we pass a reference to it into the function. We locate each array like this: particle.x, particle.vx, etc.
void MoveParticles(const int nParticles, ParticleSetType & particle, const float dt) {//...};
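A quick usage sketch of the new container (variable names mirror the article; the use of new[] here is illustrative and gets replaced by _mm_malloc in Implementation 9):

ParticleSetType particle;

// One contiguous array per property. Consecutive j indices are consecutive
// floats in memory, so the inner j-loop reads each array with unit stride.
particle.x  = new float[nParticles];
particle.y  = new float[nParticles];
particle.z  = new float[nParticles];
particle.vx = new float[nParticles];
particle.vy = new float[nParticles];
particle.vz = new float[nParticles];

// Inside MoveParticles the access pattern becomes, e.g.:
//   const float dx = particle.x[j] - particle.x[i];   // unit-stride over j
// compared with the AoS form particle[j].x, which strides over whole 24-byte structs.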
Implementation 8 Performance Log
Average Performance: 1889.7 +- 11.6 GFLOPS (41.95% of peak 4505 GFLOP/s).
This is a significant boost from the 14.53% of our previous Implementation 7. If we look at the optimization report again, we see that the "non-unit strided load" warnings have gone away. We do, however, observe two new sets of warning messages:
reference particle[j] has unaligned access
PEEL LOOP WAS VECTORIZED
The messages are printed below; we will address them in the next Implementation 9.
On Unaligned access:
LOOP BEGIN at nbody.cc(29,5) inlined into nbody.cc(124,5)
remark #15389: vectorization support: reference particle[j] has unaligned access [ nbody.cc(35,24) ]
remark #15389: vectorization support: reference particle[j] has unaligned access [ nbody.cc(36,24) ]
remark #15389: vectorization support: reference particle[j] has unaligned access [ nbody.cc(37,24) ]
remark #15381: vectorization support: unaligned access used inside loop body
remark #15305: vectorization support: vector length 16
remark #15309: vectorization support: normalized vectorization overhead 1.476
remark #15300: LOOP WAS VECTORIZED
remark #15442: entire loop may be executed in remainder
remark #15450: unmasked unaligned unit stride loads: 3
remark #15475: --- begin vector cost summary ---
remark #15476: scalar cost: 91
remark #15477: vector cost: 5.250
remark #15478: estimated potential speedup: 14.450
remark #15488: --- end vector cost summary ---
LOOP END
On Peel Loop:
LOOP BEGIN at nbody.cc(29,5) inlined into nbody.cc(124,5)
<Peeled loop for vectorization>
remark #15389: vectorization support: reference particle[j] has unaligned access [ nbody.cc(35,24) ]
remark #15389: vectorization support: reference particle[j] has unaligned access [ nbody.cc(36,24) ]
remark #15389: vectorization support: reference particle[j] has unaligned access [ nbody.cc(37,24) ]
remark #15381: vectorization support: unaligned access used inside loop body
remark #15305: vectorization support: vector length 16
remark #15309: vectorization support: normalized vectorization overhead 1.093
remark #15301: PEEL LOOP WAS VECTORIZED
LOOP END
We will address this in our next Implementation 9.
In our previous Implementation 8 we saw two remarks in the optimization report (these are related):
remark #15389: vectorization support: reference particle[j] has unaligned access [ nbody.cc(35,24) ]
remark #15301: PEEL LOOP WAS VECTORIZED
This Implementation 9 resolves both remarks. In a picture it looks like this (explanation to follow).
Say we have a loop that iterates through an array particle.x[j]. On a KNL processor, vector loads are most efficient when the starting memory address of the array (i.e. particle.x[0]) is a multiple of 64 bytes (i.e. the array starts at an aligned memory boundary). If the compiler cannot guarantee this alignment (say the array happens to start at the 90th byte), it works around it by generating an extra loop that handles the first few iterations until an aligned boundary is reached (as if the vector loop started at the 128th byte). This "magic" extra loop is called a "peel loop". It creates a bit of overhead and may slow the program down a little.
Quick notes:
- for Knights Corner (KNC) Processors: use 64 byte alignment (same as Knights Landing / KNL).
- for Xeon Processors: use 32 byte alignment
To clean this up, we can use the Intel _mm_malloc when creating a new array (to guarantee the array starts at an address that is a multiple of 64 bytes), and the associated _mm_free to release the memory (to avoid a memory leak).
Note to self: I wonder if there is an equivalent smart pointer that avoids memory leaks altogether (i.e. one that doesn't require us to write an explicit "free memory" statement yet still performs the required cleanup to eliminate the risk of a memory leak).
Before: to implement this change we need to replace lines from the current version (which may cause unaligned access and peel-loop vectorization overhead):
// create new instance
particle.x = new float[nParticles];
// release memory
delete[] particle.x;
After: with the new version (which ensures aligned access and eliminates the peel loop overhead):
// create new instance
particle.x = (float*) _mm_malloc(sizeof(float)*nParticles, 64);
// release memory
_mm_free(particle.x);
We also need to tell the compiler that we have manually performed the alignment, by specifying a #pragma vector aligned line just in front of the (potential) peel loop - so the compiler doesn't generate that (magical) peel loop and its overhead.
#pragma vector aligned
for (int j = 0; j < nParticles; j++) {
xxx = particle.x[j];
}
IMPORTANT NOTE: Here we are "promising" the compiler that we have manually (and correctly) aligned our arrays (i.e. the first element of each array starts at a multiple of 64 bytes). If we make this promise without implementing _mm_malloc correctly, unexpected things may occur (I have not tried it; maybe it will compile and crash at run time, or maybe it won't compile at all). For now, just make sure we only specify this pragma when we are sure we have used _mm_malloc correctly.
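Putting the pieces together, a minimal sketch of the aligned version (the header providing _mm_malloc/_mm_free varies by toolchain; with the Intel compiler it is commonly available via <malloc.h> or the intrinsics headers - treat that as an assumption to verify):

// Allocate each SoA array on a 64-byte boundary
particle.x = (float*) _mm_malloc(sizeof(float) * nParticles, 64);
// ... same for particle.y, particle.z, particle.vx, particle.vy, particle.vz ...

// The "promise" to the compiler: arrays touched in this loop start on
// 64-byte boundaries, so no peel loop is needed.
#pragma vector aligned
for (int j = 0; j < nParticles; j++) {
  // ... unit-stride, aligned loads of particle.x[j], particle.y[j], particle.z[j] ...
}

// Release with the matching free
_mm_free(particle.x);
// ... same for the other five arrays ...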
Implementation 9 Performance Log
Average Performance: 2091.9 +- 13.7 GFLOPS (46.44% of peak 4505 GFLOP/s).
This is a boost from the 41.95% in our previous Implementation 8.
Now look at the optimization report. Note that access is now aligned (instead of unaligned), and peel loop vectorization is no longer performed (less overhead). The estimated potential speedup is now 15.61, which is very close to the maximum of vector length 16.
LOOP BEGIN at nbody.cc(30,5) inlined into nbody.cc(129,5)
remark #15388: vectorization support: reference particle[j] has aligned access [ nbody.cc(36,24) ]
remark #15388: vectorization support: reference particle[j] has aligned access [ nbody.cc(37,24) ]
remark #15388: vectorization support: reference particle[j] has aligned access [ nbody.cc(38,24) ]
remark #15305: vectorization support: vector length 16
remark #15309: vectorization support: normalized vectorization overhead 1.317
remark #15300: LOOP WAS VECTORIZED
remark #15448: unmasked aligned unit stride loads: 3
remark #15475: --- begin vector cost summary ---
remark #15476: scalar cost: 91
remark #15477: vector cost: 5.120
remark #15478: estimated potential speedup: 15.610
remark #15488: --- end vector cost summary ---
LOOP END
There are no more vectorization remarks to act on. Let's do one more test - what if we increase our problem size? (As we shall see shortly, GFLOPS degrades, which prompts further optimization later on.)
Implementation 9 Performance Log
It turns out our Implementation 9 doesn't make the most of the fast cache memory. We can visualize this by increasing NSIZE and observing the drop in performance.
multiple | NSIZE | Average Performance (GFLOPS) | Performance Multiple |
---|---|---|---|
1 | 16,384 | 2091.9 +- 13.7 | 1.00x |
2 | 32,768 | 2197.0 +- 110.4 | 1.05x |
4 | 65,536 | 1960.6 +- 23.6 | 0.94x |
8 | 131,072 | 1729.0 +- 12.9 | 0.93x |
16 | 262,144 | 1621.7 +- 0.7 | 0.78x |
32 | 524,288 | 1561.1 +- 57.8 | 0.75x |
Given the same application running on a single node, an increased problem size (NSIZE) results in a degradation in GFLOPS. This is a problem that requires fixing (in Implementation 10).
It turns out the performance degradation seen in the Performance view can also be viewed in a (more or less) equivalent Duration view, i.e. a decrease in performance equates to an increase in duration.
Here are the results we get after running the job at multiples of NSIZE. We've included an explanation below the table illustrating how we computed the expected duration and the ratio.
Multiple (M) | NSIZE | Expected Duration | Actual Duration | Ratio of Expected vs Actual Duration |
---|---|---|---|---|
1 | 16,384 | (same as actual) | ~2.567 ms | 1.00x |
2 | 32,768 | 10.268 ms | ~9.576 ms | 1.07x |
4 | 65,536 | 41.072 ms | ~45.130 ms | 0.91x |
8 | 131,072 | 164.288 ms | ~197.800 ms | 0.83x |
16 | 262,144 | 657.152 ms | ~848.000 ms | 0.77x |
32 | 524,288 | 2628.608 ms | ~3435.0 ms | 0.76x |
How we compute expected duration:
- We know that the expected duration is of the order of NSIZE squared, i.e. O(NSIZE^2).
- Denote NSIZE = 16,384 (our baseline NSIZE) as N0. We run the job to get the actual duration at N0.
- We initialize the expected duration at N0 to be the same as the actual duration at N0.
- We then try out multiples of NSIZE by doubling it. For instance, the multiple M at baseline is 1. When M is 2, our NSIZE becomes 2*N0. When M is 4, our NSIZE becomes 4*N0, etc.
- We compute the subsequent expected duration as M^2 * (duration at N0). For example, if M = 4 and the duration at N0 is 2.567 ms, the expected duration = 4^2 * 2.567 ms = 41.072 ms (as per the table above). Likewise for other M.
- We finally compute the ratio of expected duration to actual duration. If performance were constant with respect to (w.r.t.) NSIZE, this ratio would stay at 1x.
- The reality is that performance decreases as we increase NSIZE (as we see in the table). The actual duration therefore turns out to be greater than the expected duration, and the ratio of expected w.r.t. actual duration drops below 1x. The interesting part: this ratio turns out to be fairly close to the ratio we computed in the Performance view. (Because of this, we can more or less treat the Duration view as "equivalent" to the Performance view.)
- One more note: the values in the table are fairly rough. In our next implementation we will enhance the code a bit to report average duration (alongside the existing average performance in GFLOPS). This will make reporting easier.
It turns out that the reason for the decreased performance (with respect to increased NSIZE) is that the current implementation is not able to use cache memory for faster memory access. A solution is a technique called Loop Tiling, which we will discuss and implement in our next Implementation 10.
Implementation 10 - Memory Tuning (Loop Tiling with Register Blocking to take advantage of L1 and L2 Cache)
Implementation 10 Source Files
We learnt from our Implementation 9 tests that code performance decreases as NSIZE increases. It turns out this is because the current implementation does not take advantage of the L1 and L2 caches effectively.
To understand why, let's recall our Xeon Phi architecture:
- L1 cache is 32 KiB per core. That's 32 * 1024 = 32,768 bytes per core.
- L2 cache is 512 KiB per core (1 MiB per tile = 1024 KiB per tile; each tile has 2 cores, so 1024 KiB per tile / 2 cores per tile = 512 KiB per core). That's 512 * 1024 = 524,288 bytes per core.
When NSIZE is very small, the data touched by the inner j-loop may fit into the L1/L2 cache for high/medium performance. When NSIZE becomes too large, the data no longer fits in cache and "spills out" into (much slower) main RAM. This causes a degradation in performance.
Note to self: below is my own illustration to convince myself how the cache works. I have yet to validate these examples; my knowledge of caches is limited at this stage.
For example:
- When NSIZE is very small, say 1,000 particles: in the inner j-loop (where we read the x, y, z positions of each particle), we touch 1,000 * 3 float variables * 4 bytes per float = 12,000 bytes. This easily fits in the L1 cache for high performance. Our cache hit rate will be high.
- When NSIZE is medium, say 16,384 particles (our M = 1 case): in the inner j-loop we touch 16,384 * 3 float variables * 4 bytes per float = 196,608 bytes. This won't fit in the L1 cache, but may fit in the L2 cache for medium performance. Our cache hit rate will still be reasonably high.
- When NSIZE is 262,144 particles (the M = 16 case): in the inner j-loop we touch 262,144 * 3 float variables * 4 bytes per float = 3,145,728 bytes. This fits in neither the L1 nor the L2 cache. Our cache hit rate will be pretty much zero.
It turns out that in practice, we can improve our cache performance with a technique called Loop Tiling.
Instead of explaining how loop tiling works here (we will cover loop tiling in a separate article for fuller understanding), for now let's just apply the following loop tiling pattern. The "after" syntax produces the same result as the "before" syntax; it just does it in a more machine-efficient way (making the most of the fast cache), albeit in a potentially less human-readable way (unfortunately). The before vs after pattern looks like this:
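A minimal sketch of the idea (the tile width of 16 matches the SP vector length mentioned earlier, but treat the exact shape and names as illustrative rather than a copy of the Implementation 10 source):

// Before: one long j-sweep per particle i. For large NSIZE the x/y/z arrays
// stream through the cache once per i with essentially no reuse.
for (int i = 0; i < nParticles; i++) {
  float Fx = 0.0f, Fy = 0.0f, Fz = 0.0f;
  for (int j = 0; j < nParticles; j++) {
    // ... accumulate force of j on i ...
  }
  // ... update velocity of i ...
}

// After: tile the i-loop. The TILE accumulators stay in registers, and every
// particle j fetched from memory is reused TILE times ("register blocking").
const int TILE = 16;                          // assumes nParticles % TILE == 0
for (int ii = 0; ii < nParticles; ii += TILE) {
  float Fx[TILE] = {0.0f}, Fy[TILE] = {0.0f}, Fz[TILE] = {0.0f};
  for (int j = 0; j < nParticles; j++) {
    for (int i = ii; i < ii + TILE; i++) {
      // ... accumulate force of j on i into Fx[i-ii], Fy[i-ii], Fz[i-ii] ...
    }
  }
  for (int i = ii; i < ii + TILE; i++) {
    // ... update velocity of i from Fx[i-ii], Fy[i-ii], Fz[i-ii] ...
  }
}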
Implementation 10 Performance Log
Average Performance: 2831.4 +- 8.6 GFLOPS (62.85% of peak 4505 GFLOP/s).
Notice the average performance is now far more robust to increases in NSIZE, averaging around 2874 +- 24.7 GFLOPS across the NSIZE sweep below. Also note the significant performance boost from the 46.44% efficiency of Implementation 9.
multiple | NSIZE | Average Performance (GFLOPS) | Performance Multiple |
---|---|---|---|
1 | 16,384 | 2831.4 +- 8.6 | 1.00x |
2 | 32,768 | 2907.3 +- 6.0 | 1.03x |
4 | 65,536 | 2911.9 +- 12.9 | 1.03x |
8 | 131,072 | 2907.9 +- 10.2 | 1.03x |
16 | 262,144 | 2892.5 +- 2.5 | 1.02x |
32 | 524,288 | 2793.5 +- 107.8 | 0.99x |
This section is purely for added intuition; this Duration view carries the same information as the Performance view.
- Notice the actual duration is now much more predictable - it stays around 1x of the expected duration, where Expected Duration = M^2 * Duration(M=1) = M^2 * 1.896 ms.
- Again, the ratio of "expected w.r.t. actual duration" closely matches the ratio we get in the Performance view.
Multiple (M) | NSIZE | Expected Duration | Actual Duration (with tiling) | Expected to Actual Duration Ratio |
---|---|---|---|---|
1 | 16,384 | 1.896 ms | 1.896 +- 0.006 ms | 1.00x |
2 | 32,768 | 7.584 ms | 7.386 +- 0.015 ms | 1.03x |
4 | 65,536 | 30.336 ms | 29.499 +- 0.131 ms | 1.03x |
8 | 131,072 | 121.344 ms | 118.162 +- 0.418 ms | 1.03x |
16 | 262,144 | 485.376 ms | 475.148 +- 0.405 ms | 1.02x |
32 | 524,288 | 1941.504 ms | 1971.192 +- 82.462 ms | 0.98x |
The Implementation 10 results suggest that average performance stays (more or less) constant as NSIZE increases, and duration is more predictable. It is also a significant performance boost over our previous Implementation 9.
It is worth reminding ourselves this toy N-body algorithm has an execution time of order O(N^2). There are alternative N-body algorithms that run in O(N log N). For the purpose of this article, which is tuning the implementation of a chosen (inefficient) algorithm, what we have done so far is fit for purpose. In fact, now that we have the knowledge of implementation performance tuning, we could probably take a more efficient O(N log N) algorithm, apply a similar tuning procedure, and obtain increased performance (but this is out of scope for this article).
In the next Implementation 11, we will apply distributed computing to handle scalability (i.e. increased problem size), and perform some scalability tests (weak scaling, strong scaling, and sustain duration test).
Part 2: Distributed Computing and Scalability - Cluster Communication Tuning
Implementation 11 Source Files
In Implementation 10, we achieved an average performance of 2831.4 +- 8.6 GFLOPS (62.85% of the theoretical peak performance of 4505 GFLOPS) running parallel code on 1 node. To handle larger problems, however, we may distribute the workload to run in parallel across multiple machines. This technique is called distributed computing. This is pretty much the same as saying: let's boost our performance beyond the theoretical single-node peak (of 4505 GFLOPS in our case).
We now modify our code to enable distributed computing with MPI (Message Passing Interface). Recall our Hello World MPI code - our implementation will be somewhat similar. We will also make use of MPI_Allgather to make the communication among cluster nodes more efficient (or, put another way, less inefficient). Borrowing some slides from Colfax Research on MPI_Allgather and the code implementation guide:
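A minimal sketch of the pattern the slides describe (the rank/size variables and the even split of particles across ranks are my assumptions; the real Implementation 11 code also has to handle remainders and more than just the position arrays):

#include <mpi.h>

// Each rank computes forces for its own slice of particles (against ALL
// particles), then every rank re-shares its updated slice so the next time
// step starts from a consistent global state.
const int perRank = nParticles / worldSize;   // assume it divides evenly
const int myFirst = myRank * perRank;

// ... update x/y/z (and velocities) for indices [myFirst, myFirst + perRank) ...

// Gather the updated position slices from all ranks into the full arrays.
MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
              particle.x, perRank, MPI_FLOAT, MPI_COMM_WORLD);
MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
              particle.y, perRank, MPI_FLOAT, MPI_COMM_WORLD);
MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
              particle.z, perRank, MPI_FLOAT, MPI_COMM_WORLD);

With the in-place variant, each rank's own slice of the array is left untouched and only the other ranks' slices travel over the wire - this per-step exchange is the communication cost behind the Ethernet discussion later in this part.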
We will perform both strong and weak scaling studies and learn how well our application scales. Some definitions, adapted from Wikipedia, on strong and weak scaling:
- Strong scaling is defined as how the solution time varies with the number of processors for a fixed total problem size. In this study, we see how our job duration and performance vary given a fixed total problem size.
- Weak scaling is defined as how the solution time varies with the number of processors for a fixed problem size per processor. In our case, instead of "per processor" we use "per node"; this is equivalent because each node contains one Xeon Phi processor, even though each processor has many cores and threads.
We will go ahead and perform 3 tests
- Strong Scaling Study
- Weak Scaling Study
- (Bonus) Sustain Duration with Extra Nodes
Notes:
- Speed-up and Performance Multiple are equivalent in concept; their values should be (more or less) the same, as the worked example below shows.
- Speed-up = Duration (when Nodes = 1) / Duration (when Nodes = N)
- Performance Multiple = Performance (when Nodes = N) / Performance (when Nodes = 1)
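To make the definitions concrete, here is the arithmetic for one row of the first strong-scaling table below (the Nodes = 2 row; numbers taken directly from that table):

Performance Multiple = 1974.0 GFLOPS / 2987.1 GFLOPS = 0.66x
Performance Multiple (per node) = 0.66 / 2 nodes = 0.33x
Speed-up = Duration(1 node) / Duration(2 nodes) = 1.797 ms / 2.723 ms = 0.66x (i.e. a net slow-down)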
Given a fixed total problem size of 16,384 particles, how do the duration and overall performance vary with the number of nodes?
Implementation 11 Job Log - Strong Scaling Study - Small n-body case
(Note: may need to scroll right to see more columns in table)
Nodes | Particles | Duration (ms) | Performance (GFLOPS) | Duration Multiple | Performance Multiple | Performance Multiple (per node) |
---|---|---|---|---|---|---|
1 | 16,384 | 1.797 +- 0.009 | 2987.1 +- 15.2 | 1.00x | 1.00x | 1.000x |
2 | 16,384 | 2.723 +- 0.100 | 1974.0 +- 70.8 | 1.52x | 0.66x | 0.330x |
4 | 16,384 | 4.038 +- 0.168 | 1331.5 +- 53.1 | 2.25x | 0.45x | 0.111x |
8 | 16,384 | 7.012 +- 0.111 | 765.8 +- 12.0 | 3.90x | 0.26x | 0.032x |
16 | 16,384 | 6.050 +- 1.181 | 919.4 +- 165.5 | 3.37x | 0.31x | 0.019x |
32 | 16,384 | 13.088 +- 0.810 | 411.7 +- 24.7 | 7.28x | 0.14x | 0.004x |
Observation: Given a small fixed problem size (16,384 particles, i.e. 268,419,072 interactions), the more nodes we distribute the workload across, the longer the job takes to complete. Overall performance degrades as nodes are added, likely because the distributed computing benefit is outweighed by the more significant per-node performance degradation.
Given a fixed total problem size of 1,048,576 particles (i.e. 1,099,510,579,200 interactions), how do the duration and overall performance vary with the number of nodes? (Note: this is 64 times bigger than our small n-body case.)
Implementation 11 Job Log - Strong Scaling Study - Big n-body case
(Note: may need to scroll right to see more columns in table)
Nodes | Particles | Duration (ms) | Performance (GFLOPS) | Duration Multiple | Performance Multiple | Performance Multiple (per node) |
---|---|---|---|---|---|---|
1 | 1,048,576 | 7249.285 +- 78.787 | 3033.8 +- 32.6 | 1.00x | 1.00x | 1.000x |
2 | 1,048,576 | 3686.337 +- 69.256 | 5967.4 +- 108.0 | 0.51x | 1.97x | 0.983x |
4 | 1,048,576 | 2037.912 +- 65.578 | 10801.6 +- 342.2 | 0.28x | 3.56x | 0.890x |
8 | 1,048,576 | 1028.011 +- 35.690 | 21415.4 +- 702.2 | 0.14x | 7.06x | 0.882x |
16 | 1,048,576 | 652.174 +- 10.386 | 33726.9 +- 538.1 | 0.09x | 11.12x | 0.695x |
32 | 1,048,576 | 388.851 +- 11.889 | 56604.7 +- 1731.5 | 0.05x | 18.66x | 0.583x |
Observation: Given a large fixed problem size (1,048,576 particles, i.e. 1,099,510,579,200 interactions), the more nodes we distribute the workload across, the shorter the job takes to complete. Overall performance increases with more nodes, likely because the distributed computing benefit far outweighs the less significant per-node performance degradation.
Given a fixed problem size per node (16,384 particles per node, i.e. 268,419,072 interactions per node), how do the duration and overall performance vary with the number of nodes?
Implementation 11 Job Log - Weak Scaling Study - Small n-body case
(Note: may need to scroll right to see more columns in table)
Nodes | Particles | Duration (ms) | Performance (GFLOPS) | Duration Multiple | Performance Multiple | Performance Multiple (per node) |
---|---|---|---|---|---|---|
1 | 16,384 | 1.808 +- 0.066 | 2972.5 +- 102.6 | 1.00x | 1.00x | 1.000x |
2 | 32,768 | 6.941 +- 0.300 | 3099.6 +- 132.8 | 3.84x | 1.04x | 0.521x |
4 | 65,536 | 17.775 +- 1.006 | 4847.9 +- 271.6 | 9.83x | 1.63x | 0.408x |
8 | 131,072 | 31.867 +- 0.676 | 10787.0 +- 231.5 | 17.63x | 3.63x | 0.454x |
16 | 262,144 | 69.882 +- 1.853 | 19681.0 +- 519.8 | 38.65x | 6.62x | 0.41x |
32 | 524,288 | 151.732 +- 9.016 | 36362.6 +- 2199.2 | 83.92x | 12.23x | 0.382x |
Observation: Given a small fixed problem size per node (16,384 particles per node, i.e. 268,419,072 interactions per node), the rate of per-node performance degradation is significant as nodes are added. Distributed computing is less robust.
Given a fixed problem size per node (262,144 particles per node, i.e. 68,719,214,592 interactions per node), how do the duration and overall performance vary with the number of nodes?
Implementation 11 Job Log - Weak Scaling Study - Big n-body case
(Note: may need to scroll right to see more columns in table)
Nodes | Particles | Duration (ms) | Performance (GFLOPS) | Duration Multiple | Performance Multiple | Performance Multiple (per node) |
---|---|---|---|---|---|---|
1 | 262,144 | 441.428 +- 1.394 | 3113.5 +- 9.8 | 1.00x | 1.00x | 1.000x |
2 | 524,288 | 949.323 +- 31.113 | 5797.0 +- 183.4 | 2.15x | 1.86x | 0.930x |
4 | 1,048,576 | 2023.533 +- 82.803 | 10884.8 +- 430.2 | 4.58x | 3.50x | 0.875x |
8 | 2,097,152 | 3933.780 +- 72.012 | 22367.9 +- 409.8 | 8.91x | 7.18x | 0.898x |
16 | 4,194,304 | 8839.475 +- 702.602 | 40041.8 +- 3001.9 | 20.02x | 12.86x | 0.804x |
32 | 8,388,608 | 27900.579 +- 536.408 | 50461.1 +- 968.9 | 63.21x | 16.21x | 0.506x |
Observation: Given a large fixed problem size per node (262,144 particles per node, i.e. 68,719,214,592 interactions per node), the rate of per-node performance degradation is less significant as nodes are added. Distributed computing is more robust (up to a certain point).
Let's be curious and do a mini research experiment:
- We identified previously that running the N-body simulation with 262,144 particles takes roughly 0.5 seconds per step (on average). The current 10-step simulation would therefore take roughly 0.5 x 10 = 5 seconds to run.
- Since execution time is O(N^2), if we double (x2) the problem size to 524,288 particles, it should take 2^2 = 4 times longer to run, i.e. roughly 2 seconds per step, or 20 seconds in total.
- We form the hypothesis that if we distribute the job across 4 nodes, we should be able to complete the run in the same timeframe (5 seconds) - or, in other words, increase performance by 4 times.
Interactions = NSIZE * (NSIZE - 1)
Let's do this test.
Implementation 11 Job Log - Sustain Duration
Multiple (M) | Particles (NSIZE) | Interactions | Nodes | Duration (ms) | Performance (GFLOPS) | Time Multiple | Performance Multiple | Performance Multiple (Per Node) |
---|---|---|---|---|---|---|---|---|
1 | 262,144 | 68,719,214,592 | 1 | 482.901 +- 61.725 | 2887.0 +- 321.0 | 1.00x | 1.00x | 1.00x |
2 | 524,288 | 274,877,382,656 | 4 | 533.371 +- 9.144 | 10310.2 +- 177.2 | 1.10x | 3.57x | 0.893x |
~3 | 786,384 | 618,399,009,072 | 9 | 576.369 +- 31.414 | 21518.5 +- 1103.7 | 1.19x | 7.45x | 0.828x |
4 | 1,048,576 | 1,099,510,579,200 | 16 | 662.437 +- 13.248 | 33208.9 +- 647.6 | 1.37x | 11.50x | 0.719x |
~5 | 1,310,400 | 1,717,146,849,600 | 25 | 2262 | 15187 | 4.68x | 5.26x | 0.210x |
~6 | 1,572,480 | 2,472,691,777,920 | 36 | 2728 | 18218 | 5.65x | 6.31x | 0.175x |
~7 | 1,834,560 | 3,365,608,559,040 | 49 | 4480 | 15036 | 9.27x | 5.21x | 0.106x |
Notes:
- Let NSIZE at M=1 be N0, where N0 is 262,144 particles.
- NSIZE (particles) at each M (multiple) is M*N0 (a multiple of the baseline 262,144 particles).
- Particle-to-particle interactions: NSIZE * (NSIZE - 1).
- Nodes = number of nodes to distribute the job across (to sustain a similar timeframe).
- Duration (Time) Multiple = multiple of the duration at M = 1. e.g. for the M = 2 case, Duration Multiple = 533.371 ms / 482.901 ms = 1.10.
- Performance Multiple = multiple of the performance at M = 1. e.g. for the M = 2 case, Performance Multiple = 10310.2 GFLOPS / 2887.0 GFLOPS = 3.57.
- Performance Multiple (per node) = multiple of the performance at M = 1, divided by Nodes. e.g. for the M = 2 case, Performance Multiple (per node) = (10310.2 GFLOPS / 2887.0 GFLOPS) / 4 nodes = 0.893.
Observations:
- We see that it is possible to process 1 million particles (the M=4 case: ~1 trillion interactions) in roughly the same time as processing 262,144 particles (the M=1 case: ~69 billion interactions) - on the order of half a second.
- We see a gradual decline in per-node performance as we increase the number of nodes. This is likely due to the communication delay of the Ethernet interconnect. The loss becomes significant when we have 25 or more nodes - the duration is no longer sub-second (but instead in the region of 2-5 seconds).
According to this Colfax Research slide, it appears that the Intel Omni-Path Fabric 100 (low latency) communication solution may be more robust than the 1 Gigabit Ethernet solution on which the experiments in this article are based:
If we look at the optimization report again we shall see some additional potential optimization opportunities.
LOOP BEGIN at nbody.cc(74,20)
remark #15389: vectorization support: reference particle->vx[ii+$i2] has unaligned access
remark #15389: vectorization support: reference particle->vx[?+$i2] has unaligned access
remark #15388: vectorization support: reference Fx[$i2] has aligned access
remark #15381: vectorization support: unaligned access used inside loop body
remark #15305: vectorization support: vector length 16
remark #15427: loop was completely unrolled
remark #15309: vectorization support: normalized vectorization overhead 0.700
remark #15300: LOOP WAS VECTORIZED
remark #15448: unmasked aligned unit stride loads: 1
remark #15450: unmasked unaligned unit stride loads: 1
remark #15451: unmasked unaligned unit stride stores: 1
remark #15475: --- begin vector cost summary ---
remark #15476: scalar cost: 8
remark #15477: vector cost: 0.620
remark #15478: estimated potential speedup: 7.520
remark #15488: --- end vector cost summary ---
LOOP END
(similar remarks appear for vy and vz)
Probably something worth looking into in a potential future Implementation 12. For the purposes of this article, though, let's stop here and celebrate.
In Part 1 we started from a baseline N-body simulation code running on a single Xeon Phi enabled cluster node. We applied threading parallelism, scalar tuning, vectorization tuning, and memory tuning to take performance from the original 0.07% up to 62.85% of the theoretical peak. In Part 2 we applied cluster communication tuning and used distributed computing to complete over 1 trillion particle-to-particle interactions within a sub-second time step, by distributing the workload across 16 cluster nodes (semi-) efficiently. We appreciated the strengths (scalability and parallelism) and limitations (communication and other losses) of our implementations, which we made an attempt to minimize. It is my hope that this article helps others who are new to High Performance Computing (HPC) appreciate the sort of problem HPC helps to solve, and how good modern code practices can help do it well.
Many thanks to Intel for sponsoring my access to the Intel Xeon Phi cluster, and Colfax Research for the high quality DeepDive How-to Series, which has inspired me to perform the experiments and document the steps and results.
- Sample Code GitHub Repository.
- Colfax How to Series - session 06 on N-body
- Estimate Theoretical Peak FLOPS for Xeon Phi.
- Youtube Video on Makefile by Paul Programming
- Makefile Suffix Rules
- Stackoverflow - what do Makefile symbols mean
- Makefile Static Pattern Rule Usage
- Short-cut method standard deviation formula
- Optimize a Numerical Integration Implementation with Parallel Programming and Distributed Computing.
- Hello World MPI Code