High Performance Computing (HPC) with Intel Xeon Phi: N-body Simulation Example

Abstract

Imagine we have n particles in our "universe". These particles have random initial x, y, and z coordinates. As defined by Newton's law of universal gravitation, each particle attracts every other particle with a force that is directly proportional to the product of their masses and inversely proportional to the square of the distance between their centers. As a result, these particles gain (and lose) velocity and change position over time. The modelling of this physical mechanics is called an N-body simulation.

There currently exist many N-body simulation algorithms. Some are less advanced and highly computationally costly (execution time on the order of O(N^2)) - but simple and easy to understand. Others are more advanced and significantly more efficient (execution time on the order of O(n*log(n))) - but not as simple and easy to understand. This article focuses on the implementation aspect of the less advanced toy algorithm - for ease of understanding.

This article consists of two parts:

  • Part 1: Optimize the single-node implementation given a fixed problem size of 16,384 particles (i.e. 268,419,072 particle-to-particle interactions) to boost the baseline code performance from 0.07% to 62% efficiency.
  • Part 2: Scale the implementation to handle an increased problem size of 1,048,576 particles (i.e. 1,099,510,579,200 particle-to-particle interactions) with efficient distributed computing. We do this while keeping in mind the need to complete each time-step within a millisecond to sub-second timeframe.

We run all our N-body simulations exclusively on Colfax Cluster nodes with Intel Xeon Phi (Knights Landing / KNL) 7210 host processors - while bearing in mind modern code practice for ease of adaptation to other Intel architectures (such as Knights Corner / KNC, or Xeon / IVB). We use an established industry rule-of-thumb that each particle-to-particle interaction requires 20 FLOP (Floating Point Operations) to estimate our code performance before versus after tuning. Our goal is to get as close as possible to the theoretical peak performance, which for a Xeon Phi 7210 host processor is 4,505 GFLOPS (Giga Floating-point Operations Per Second). For coding convenience, we assume all particles have the same mass of 1 unit.

All implementation source codes and performance logs are available in this GitHub Repository.

Part 1 - Single Node Application Optimization

Our initial (un-tuned) baseline implementation, although very human-friendly and easy to understand, achieves only 3.2 GFLOPS (0.07% efficiency). In other words, the execution time is likely to be much higher than the expected O(N^2) timeframe. We show that with code optimization, it is possible to increase code performance to 2,831 GFLOPS (62.85% efficiency) on a single node. That's an ~884x speed-up.

The following table summarises our code optimization journey using a combination of techniques including Multi-threading Parallelism, Scalar Tuning, Vectorization Tuning, and Memory Tuning.

| Implementation | Average Performance (GFLOPS) | Speed-up | Efficiency |
|---|---|---|---|
| 1 - Baseline - No Optimization | 3.2 +- 0.0 | 1.00x | 0.07% |
| 2 - Thread Tuning (Parallelize Level-2 (Inner) For Loop) | 12.5 +- 0.0 | 3.91x | 0.31% |
| 3 - Thread Tuning (Parallelize Level-1 (Outer) For Loop) | 357.6 +- 9.8 | 111.75x | 7.94% |
| 4 - Scalar Tuning (Avoid Precision Conversion Overhead) | 444.5 +- 0.7 | 138.91x | 9.87% |
| 5 - Scalar Tuning (Replace Divide with Multiply) | 580.3 +- 1.2 | 181.34x | 12.88% |
| 6 - Scalar Tuning (Replace Power of Half with Square Root) | 623.3 +- 1.5 | 194.78x | 13.84% |
| 7 - Scalar Tuning (Enables Aggressive Optimizations on Floating-point Data with Compiler Option) | 654.4 +- 3.1 | 204.5x | 14.53% |
| 8 - Vectorization Tuning (Restructure Data from AoS to SoA) | 1889.7 +- 11.6 | 590.53x | 41.95% |
| 9 - Vectorization Tuning (Remove Peel Loop Overhead by Aligning Array Access) | 2091.9 +- 13.7 | 653.72x | 46.44% |
| Memory Access Bottleneck: see Implementation 9 tests on varying NSIZE | performance degrades as NSIZE increases | (see section) | (see section) |
| 10 - Memory Tuning (Loop Tiling with Register Blocking). Performance stays constant as NSIZE increases | 2831.4 +- 8.6 | 884.81x | 62.85% |

We see that each optimization step contributes to the overall performance enhancement - some more significantly than others.

Part 2 - Distributed Computing and Scalability

We evolve our application to Implementation 11, which distributes the workload across multiple cluster nodes using Intel MPI (Message Passing Interface). We apply cluster communication tuning to reduce inefficiencies resulting from ethernet communication among cluster nodes. Finally, we perform a series of (strong and weak) scalability tests and draw out some trends.

Strong Scaling Tests:

  • Given a small fixed problem size (16,384 particles, i.e. 268,419,072 interactions), the more nodes we distribute the workload across, the longer the job takes to complete. Overall performance degrades with increased nodes, likely because the distributed computing benefit is outweighed by the more significant per-node performance degradation.
  • Given a large fixed problem size (1,048,576 particles, i.e. 1,099,510,579,200 interactions), the more nodes we distribute the workload across, the shorter the job takes to complete. Overall performance increases with increased nodes, likely because the distributed computing benefit far outweighs the less significant per-node performance degradation.

Weak Scaling Tests:

  • Given a small fixed problem size per node (16,384 particles per node, i.e. 268,419,072 interactions per node), the rate of per-node performance degradation is significant as nodes increase. Distributed computing is less robust.
  • Given a large fixed problem size per node (262,144 particles per node, i.e. 68,719,214,592 interactions per node), the rate of per-node performance degradation is less significant as nodes increase. Distributed computing is more robust (up to a point).

At the end of Part 2 we perform a bonus "push our limit" test. We show that it is possible to complete one time-step of an N-body simulation for over 1 million (1,048,576) particles, or over 1 trillion (1,099,510,579,200) particle-to-particle interactions, in a sub-second (~662 ms) timeframe, by distributing the (semi-)optimized code to run across 16 nodes in parallel. That's equivalent to an overall performance of 33,208 GFLOPS.

Given that our toy algorithm's execution time is of order O(N^2), imagine replacing it with a more advanced O(n*log(n)) algorithm, which may result in an even greater performance boost and scalability - a topic that may deserve a separate research article.

Introduction

Given n particles in our "universe", each particle experiences a net gravitational attractive force (with magnitude and direction) caused by all other surrounding particles. This net force is described by Newton's law of universal gravitation:

nbody-formula.png
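
In symbols, the per-particle force (and hence acceleration) that the code accumulates reduces to the expression below - a sketch written under the article's own assumptions of unit masses and the gravitational constant folded into the time-step constant:

F_i = \sum_{j \neq i} \frac{\vec{r}_j - \vec{r}_i}{\left( \lVert \vec{r}_j - \vec{r}_i \rVert^2 \right)^{3/2}}

where \vec{r}_i is the position of particle i. The denominator is exactly the drPower32 = (drSquared)^(3/2) term that appears in the implementations later.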

Many algorithms exist for this problem. For the purpose of this article, we will use the most basic (and easy to understand) toy algorithm that models this N-body simulation. The algorithm has an execution time of, at best, on the order of O(N^2) (i.e. on the order of the number of particles squared). It should be noted that there are much more efficient algorithms out there (with execution time on the order of O(n*log(n))). This article focuses on the efficient code implementation aspect - if the code is 100% efficient, we should expect our toy algorithm to have an execution time of around O(N^2). An inefficient implementation will have an execution time much higher than this expected "at-best" value.

For simplicity, we shall assume all particles have a mass of 1 unit, with initial positions randomized (via a uniform distribution). We will also assume only 16,384 particles in our universe (i.e. 268,419,072 particle-to-particle interactions, or N*(N-1)) to ensure our job completes within a reasonably short timeframe (milliseconds to a few seconds per time step) for practical reasons.

We will start with our most basic and easy to understand Implementation 1, which, as we will see later, is only 0.07% efficient. We will then incrementally tune our code to make it more efficient following the 5 key optimization areas: Threading Parallelism, Scalar Tuning, Vectorization Tuning, Memory Tuning, and Cluster Communication Tuning. After the 10th implementation, we shall see our code performance improve to around 63% efficiency.

In the 11th implementation, we will apply distributed computing to enable us to perform an N-body simulation with over 1 million (1,048,576) particles, or equivalently, over 1 trillion (1,099,510,579,200) particle-to-particle interactions, at a speed of 33,208.9 GFLOPS - completing one time step in a sub-second (~662 ms) timeframe. We will appreciate the power of code tuning, parallel programming, and distributed computing, as well as their limitations (such as per-node performance loss due to the ethernet communication used in the cluster, among other factors).

The source files and performance logs for the implementations are available in the GitHub Repository.

All experiments described in this article were run exclusively on Colfax Cluster nodes, each installed with an Intel Xeon Phi (Knights Landing / KNL) 7210 host processor. At the time of writing I have access to up to 60 of these nodes. It should be noted that the code may be adapted easily to other Intel architectures (such as Xeon and the Xeon Phi Knights Corner edition) - this is out of scope for this article, however. I would recommend referring to the Colfax How-To Series for a more comprehensive introduction and demonstration.

Theoretical Xeon Phi Peak Performance Estimation

For the Xeon Phi 7210 host processor used in this article, the theoretical peak performance is estimated at around 4,505 GFLOPS (Giga Floating-point Operations Per Second) for single floating-point precision (SP). See our other article on how we derive this estimate.
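
As a quick sanity check, the figure likely comes from the usual per-core arithmetic. This is my own reconstruction, using the ~1.1 GHz clock mentioned later in this article - treat it as an assumption rather than the official derivation:

P_{\text{peak}} \approx 64\ \text{cores} \times 1.1\ \text{GHz} \times 2\ \tfrac{\text{VPUs}}{\text{core}} \times 16\ \tfrac{\text{SP lanes}}{\text{VPU}} \times 2\ \tfrac{\text{FLOP}}{\text{FMA}} \approx 4506\ \text{GFLOPS}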

Also make a note on these terms:

  • GFLOP = Giga Floating-point Operations. We use this to describe an amount of compute work.
  • GFLOPS (same as GFLOP/s) stands for Giga Floating-point Operations Per Second. We use this to describe performance.

Code Optimization Strategy

The initial baseline code may be easy to read and understand - unfortunately at the costly trade-off of inefficient, unoptimized performance. This slide from Colfax Research provides a neat overview of the 5 core optimization areas we can look into to increase performance:

optimization-areas.png

At Node-level:

  1. Scalar Tuning
  2. Vectorization Tuning
  3. Memory Tuning
  4. Threading Tuning

At Cluster-level:

  1. Communication Tuning

We will start from the baseline (non-optimized) code, gradually apply these tunings (stacking them on top of each other), and eventually arrive at much higher performance efficiency and scalability.

GitHub Repository

The sample codes may be found at this GitHub repository.

  • nbody.cc is our source code.
  • Makefile enables us to compile and submit the job to run on the Colfax Cluster. Here we define important parameters, including (1) the number of particles in our N-body simulation and (2) the number of nodes to run on.

Experimentation Pipeline

  1. Navigate to the implementation directory imp-xx.
  2. (optional) Review and possibly update nbody.cc as needed.
  3. (optional) Review and tweak Makefile as needed, e.g. set the number of nodes to run on, set the number of particles, etc.
  4. Issue make queue-knl to compile and run application on the Colfax Cluster Node (defined by Makefile).
  5. Review output file nbody-knl.oXXXXX at ./log, where XXXXX is the job number.
  6. Review optimization report nbody.oKNL.optrpt for potential optimization opportunities.

A Note On Generic Makefile Syntax

Makefile makes the code compilation process scalable and repeatable. This YouTube video by Paul Programming is a good starting point for the basics. At a very high level, this is how a Makefile usually works:

  • Compile source code xxx.cc into an intermediate object file xxx.o (or xxx.oYYY).
  • Link the intermediate object file xxx.o into the output binary xxx (the eventual executable).

The syntax of a Makefile may look something like this - read the comments in this sample Makefile for understanding:

# Set environmental variable to be used within makefile like this: $(ENVIRONMENTAL_VARIABLE_1)
ENVIRONMENTAL_VARIABLE_1 = 123
ENVIRONMENTAL_VARIABLE_2 = Hello

# -- Compilation and Linking Syntax (space sensitive) --
# Target: Dependencies
#    Compilation / Linking command
# ------------------------------------------------------

# Compile source to object file
# If ``hello.cc`` (source file) is newer than ``hello.o`` (intermediate object file), then compile ``hello.cc`` to ``hello.o``
#   - this example uses icpc as the compiler.
#   - The ``-c`` flag says we are compiling the source file `hello.cc` into intermediate object file `hello.o`,
#     instead of a binary.
#   - The ``-o hello.o`` says, output to ``hello.o``
#   - The ``hello.cc`` is the source file
hello.o : hello.cc
  icpc -c -o hello.o hello.cc  

# Link object file to binary
# If ``hello.o`` (intermediate object file) is newer than ``hello`` (binary executable), then link ``hello.o`` into ``hello``
#   - this example uses ``icpc`` as the linker driver.
#   - There is no ``-c`` flag here: we are linking an object file into a binary, not compiling a source file.
#   - The ``-o hello`` says, output the executable to ``hello``
#   - The ``hello.o`` is the intermediate object file being linked
#   - Note: recipe lines (the command lines) must be indented with a tab character
hello: hello.o
	icpc -o hello hello.o

Our Makefile

Our Makefile will look something like this.

There is some more advanced syntax in our Makefile as well.

Have a go at reading through the Makefile and see if you understand it. I've added some comments within the Makefile to aid understanding. See also some notes below the Makefile.

Notes:

  • How did we come up with the starting NSIZE of 16,384? See the next section Use High Bandwidth Mode for more info.
  • Regarding the -DKNLTILE compiler flag: though this is not used and won't impact our implementation code, the flag may be used to inject a preprocessor macro KNLTILE into nbody.cc, which we could use for "if this then that" type logic (e.g. if KNL, then the tile size is this; if KNC, then the tile size is that, etc.). Look for something like #ifdef KNLTILE in the code (though I might have removed these as we didn't use them). A hypothetical sketch of this pattern follows this list.
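
For illustration only, here is a hypothetical sketch (not taken from the actual nbody.cc) of how a -D macro such as KNLTILE could drive compile-time branching; the tile sizes shown are placeholders:

#ifdef KNLTILE
  const int TILE = 16;   // e.g. Knights Landing: 16 SP lanes per 512-bit vector
#else
  const int TILE = 8;    // e.g. fallback for a 256-bit (AVX2) architecture
#endif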

Use High Bandwidth Mode

Consider this:

  • The Xeon Phi MCDRAM (NUMA node 1) has up to 16 GiB of RAM (16 * 2^30 bytes).
  • MCDRAM is high bandwidth (up to 480 GB/s), roughly 5 times the bandwidth of DDR4 RAM (up to 90 GB/s).
  • If our application can fit into 16 GiB of RAM, we should consider using it.

Now the next question: roughly how many particles in our N-body simulation can the 16 GiB of MCDRAM hold? Let's do some (rough) math:

  • As we will see in our baseline implementation shortly, each particle has 6 float variables (or object properties): 3 correspond to the 3D position (x, y, z), and the other 3 correspond to the velocity in each direction (vx, vy, vz).
  • A float variable takes up 4 bytes.
  • Each particle therefore takes up 24 bytes (6 float variables * 4 bytes per float = 24 bytes).
  • We can fit roughly (16 * 2^30 / 24) = 715,827,882 particles in MCDRAM. The actual number may be less due to other factors; still, this is a big number. Considering our processor runs at around 1.1 GHz (1.1 billion cycles per second) and the N-body algorithm's execution time is O(N^2), if we wish our baseline (serial mode) job to complete within the "second" timeframe, it might be wise to restrict our number of particles to (a maximum of) around sqrt(1e9) = 31,623 to start with. Just to be safe, we can halve that initial count - to around 15,811. Interestingly, the original Makefile from the Colfax HPC Session 6 exercise starts with 16,384 - not far from our rough math at all! Note: it turns out that when we get to Implementation 10, we will need to "dice" the particles into tiles of 16 particles, and 16,384 particles divided by 16 (particles per tile) is a nice round number, 1,024 (i.e. 2^10). That probably explains the choice. (Note to self: requires validation. This so-called "rough math" could have been, sadly, pure fluke!)

Part 1: Single Node Application Optimization - Threading, Scalar Tuning, Vectorization Tuning, and Memory Access Tuning


Implementation 1 - Baseline

Implementation 1 Source Files.

This is the starting point baseline implementation of the N-body simulation.

Algorithm Explanation:

  • We obtain the start and end time of running the N-body simulation (i.e. MoveParticles(...)) by wrapping the simulation with a pair of omp_get_wtime() calls - assigned to tStart (start time) and tEnd (end time). The duration (in seconds) is then tEnd - tStart.
  • HztoInts computes the total number of particle interactions within the step. Since we have two nested for loops computing interactions between all particles (including the interaction of a particle with itself), there are a total of nParticles * nParticles interactions. Convention note: there is a convention of using nParticles * (nParticles - 1) instead, to describe the total number of particle-to-particle interactions without a particle interacting with itself. This is a purely conventional choice. The original code that Colfax Research presents in the video uses nParticles * (nParticles - 1), so let's stick with that for consistency. Do appreciate that as nParticles becomes larger, the effect of that "minus 1" becomes very insignificant - so let's not be too hung up on it.
  • HztoGFLOPs computes the total estimated number of Giga Floating-point Operations within the step. We assume each particle-to-particle interaction requires 20 operations (a well-established rule-of-thumb in the N-body simulation community). To convince ourselves, if we do a quick count of all the operations (add, subtract, multiply, divide, etc.) within the innermost nested loop, we should come to around 20 operations - more or less. There are nParticles * (nParticles - 1) interactions, so roughly 20 * nParticles * (nParticles - 1) floating-point operations (FLOP). To make the number smaller, we use GFLOP (instead of FLOP); 1 GFLOP = 1e9 FLOP. So in GFLOP: 20 * nParticles * (nParticles - 1) * 1e-9 GFLOP for all interactions. We assign this to HztoGFLOPs.
  • We use the per-step HztoInts to compute the per-second version with HztoInts / (tEnd - tStart), i.e. the number of particle-to-particle interactions per second in a time step.
  • We use the per-step HztoGFLOPs to compute the per-second version with HztoGFLOPs / (tEnd - tStart), i.e. the number of GFLOP per second (GFLOPS) in a time step. We assign this to the variable rate. (In the statistics that follow, this rate plays the role of the assumed mean D in the shortcut-method standard deviation formula.) A simplified sketch of the baseline kernel follows this list.
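
To make the above concrete, here is a simplified sketch of the baseline kernel - a reconstruction consistent with the snippets quoted later in this article, not the verbatim Colfax source. The small softening constant is my own assumption, added to avoid a division by zero when i == j; the position update (x += vx * dt, etc.) lives in a separate loop in the real code and is omitted here:

#include <cmath>

struct ParticleType { float x, y, z, vx, vy, vz; };    // AoS layout used by the baseline

void MoveParticles(const int nParticles, ParticleType *const particle, const float dt) {
  const float softening = 1e-20f;                      // assumed: avoids r = 0 when i == j
  for (int i = 0; i < nParticles; i++) {               // for each particle i...
    float Fx = 0.0f, Fy = 0.0f, Fz = 0.0f;
    for (int j = 0; j < nParticles; j++) {             // ...accumulate the force from every particle j
      const float dx = particle[j].x - particle[i].x;
      const float dy = particle[j].y - particle[i].y;
      const float dz = particle[j].z - particle[i].z;
      const float drSquared = dx*dx + dy*dy + dz*dz + softening;
      const float drPower32 = pow(drSquared, 3.0/2.0); // the global pow, revisited in Implementation 4
      Fx += dx / drPower32;                            // roughly 20 FLOP per (i, j) pair in this loop body
      Fy += dy / drPower32;
      Fz += dz / drPower32;
    }
    particle[i].vx += dt * Fx;                         // update velocity (unit mass, G folded into dt)
    particle[i].vy += dt * Fy;
    particle[i].vz += dt * Fz;
  }
}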

Implementation 1 Performance

Implementation 1 Performance Log

The average step performance: 3.2 +- 0.0 GFLOPS (0.07% of theoretical peak 4,505 GFLOPS).

Implementation 1 Remarks

This is very poor.

Can we apply some quick wins, with multi-threading parallelism, to achieve higher efficiency? We will do just that in Implementation 2.

Implementation 2 - Parallelize Level-2 (inner) loop

Implementation 2 Source Files

We parallelize the inner nested loop with OpenMP multi-threading. We specify Fx, Fy, and Fz in a reduction clause to prevent a data race: each thread accumulates its own private copies of Fx, Fy, and Fz, and OpenMP safely combines them at the end of the loop. We learnt about this reduction technique in our previous article Optimize a Numerical Integration Implementation with Parallel Programming and Distributed Computing.

Note the following one-liner addition just before the inner nested for loop:

#pragma omp parallel for reduction(+: Fx, Fy, Fz)
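
In context, the pragma sits on the inner j-loop. The sketch below is based on the baseline reconstruction above (an assumption, not the verbatim source); it also makes the upcoming inefficiency visible - threads are forked and joined once per outer iteration, i.e. roughly nParticles times per step:

for (int i = 0; i < nParticles; i++) {
  float Fx = 0.0f, Fy = 0.0f, Fz = 0.0f;
  #pragma omp parallel for reduction(+: Fx, Fy, Fz)
  for (int j = 0; j < nParticles; j++) {
    // ... force accumulation exactly as in the baseline sketch ...
  }
  particle[i].vx += dt * Fx;
  particle[i].vy += dt * Fy;
  particle[i].vz += dt * Fz;
}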

Implementation 2 Performance

Implementation 2 Performance Log

The average step performance: 12.5 +- 0.0 GFLOPS (0.31% of theoretical peak 4,505 GFLOPS).

Implementation 2 Remarks

This is better than the 0.07% in Implementation 1, but still very poor.

Observations:

  • We have ~16,000 particles.
  • We have placed the parallelism at the inner j-loop (not the outer i-loop).
  • Inefficiency 1: We perform ~16,000 iterations of the outer i-loop, i.e. we schedule the start and stop of a multi-threaded region ~16,000 times. This is significant overhead.
  • Inefficiency 2: For single precision (SP) floating-point operations, the VPU (Vector Processing Unit) may perform 16 computations at once (we call 16 concurrent computations "1 vectorized computation"). In other words, at each i iteration we perform roughly (16,000 / 16 =) 1,000 vectorized computations within the inner j-loop. We distribute these 1,000 vectorized computations across 256 threads (i.e. each thread performs only roughly 3-4 vectorized computations). It turns out that 3-4 vectorized computations per thread is too little work for good load balancing on a Knights Landing processor (the cost of multi-thread coordination may outweigh the benefit gained). Note to self: may need to validate this statement.

Implementation 3 - Parallelize Level-1 (outer) loop

Implementation 3 Source Files

To remediate the inefficiencies noted in Implementation 2, here in Implementation 3 we apply parallelism at the Level-1 (outer) loop level. Note we now have two #pragma omp parallel for directives on the two outer loops. There are no data races here, so no reduction is needed.

Implementation 3 Performance

Implementation 3 Performance Log

The average step performance: 357.6 +- 9.8 GFLOPS (7.94% of theoretical peak 4,505 GFLOPS).

Implementation 3 Remarks

A big jump from the previous 0.31% efficiency. Can we do better?

Having a look at the optimization report we see some suspicious remarks (which we will explain below the report):

LOOP BEGIN at nbody.cc(16,3) inlined into nbody.cc(106,5)
   remark #25096: Loop Interchange not done due to: Imperfect Loop Nest (Either at Source or due to other Compiler Transformations)
   remark #25451: Advice: Loop Interchange, if possible, might help loopnest. Suggested Permutation : ( 1 2 ) --> ( 2 1 )
   remark #15542: loop was not vectorized: inner loop was already vectorized

   LOOP BEGIN at nbody.cc(22,5) inlined into nbody.cc(106,5)
      remark #15415: vectorization support: non-unit strided load was generated for the variable <particle[j]>, stride is 6   [ nbody.cc(28,24) ]
      remark #15415: vectorization support: non-unit strided load was generated for the variable <particle[j]>, stride is 6   [ nbody.cc(29,24) ]
      remark #15415: vectorization support: non-unit strided load was generated for the variable <particle[j]>, stride is 6   [ nbody.cc(30,24) ]
      remark #15305: vectorization support: vector length 16
      remark #15309: vectorization support: normalized vectorization overhead 0.344
      remark #15417: vectorization support: number of FP up converts: single precision to double precision 2   [ nbody.cc(32,30) ]
      remark #15418: vectorization support: number of FP down converts: double precision to single precision 1   [ nbody.cc(32,30) ]
      remark #15300: LOOP WAS VECTORIZED
      remark #15452: unmasked strided loads: 3
      remark #15475: --- begin vector cost summary ---
      remark #15476: scalar cost: 170
      remark #15477: vector cost: 20.000
      remark #15478: estimated potential speedup: 7.830
      remark #15486: divides: 3
      remark #15487: type converts: 3
      remark #15488: --- end vector cost summary ---
   LOOP END

Good signs:

  • vector length 16 (the compiler automatically identified a vectorization opportunity and vectorized the loop)

Bad signs:

  • 2 floating-point (FP) up-conversions from single precision (SP) to double precision (DP), and 1 FP down-conversion from DP to SP. This seems to be unnecessary overhead. Scalar Tuning Required.
  • divides: 3 (the division operation is slower than multiplication - can we replace repeated divisions with a single reciprocal multiplication, reused multiple times?). Scalar Tuning Required.
  • non-unit strided load: the way we structure our data / array of particle objects might not be optimized for fast memory access. We can improve this by restructuring our data, possibly at the trade-off of reduced code readability / good object-oriented programming style. In this article, however, performance is the priority. Vectorization Tuning Required.
  • The estimated potential speedup is 7.830. If the code were vectorization-tuned, this number should get close to the vector length of 16. Vectorization Tuning Required.

Idea for next Implementation 4

Now that we have identified some scalar tuning opportunities, let's focus on these first. We will do just that in our next Implementation 4.

Implementation 4 - Scalar Tuning (Avoid Precision Conversion Overhead)

Implementation 4 Source Files

In the optimization report we identified a bad sign: 2 single-to-double precision conversions and 1 double-to-single precision conversion. These correspond to the following line of the code:

const float drPower32  = pow(drSquared, 3.0/2.0);

Where the precision conversions come from:

  1. The C++ global pow (power) function takes double-precision inputs and returns a double-precision output. Overloading does not occur for this global function.
  2. drSquared is a float (single precision). The compiler automatically converts it to double (double precision). This is the first single-to-double conversion.
  3. The 3.0/2.0 exponent is evaluated as a double (double-precision) value, which accounts for the second reported single-to-double conversion.
  4. Now that we have two double-precision inputs, pow returns a double-precision output. But because we have declared drPower32 as a float (single precision), the compiler converts the result from double back to float. This is our one double-to-single conversion. Check out this Stack Overflow question relating to this.

To summarize: we have 2 SP-to-DP conversions and 1 DP-to-SP conversion. None of these is necessary, and they can easily be prevented with some simple scalar tuning.

The Scalar Tunings:

  • To enable overloading (point 1 above), we use std::pow (instead of the global pow). If we supply float (single-precision) inputs, it will produce a float (single-precision) output.
  • To ensure both our inputs are float (single precision): we already know that drSquared is a float, so we don't need to do anything about that. The 3.0/2.0, however, needs to be written with float literals - we do this with 3.0f/2.0f. Simply adding the f suffix ensures each literal is interpreted as a float.

Applying the above scalar tuning the following line of code becomes:

const float drPower32  = std::pow(drSquared, 3.0f/2.0f);

Implementation 4 Performance

Implementation 4 Performance Log

The average step performance: 444.5 +- 0.7 GFLOPS (9.87% of the theoretical peak of 4,505 GFLOPS).

Implementation 4 Remarks

This is slightly better than the 7.94% in the previous Implementation 3. But can we do better?

If we look at the optimization report again, we notice that the precision conversions are gone:

   LOOP BEGIN at nbody.cc(22,5) inlined into nbody.cc(106,5)
      remark #15415: vectorization support: non-unit strided load was generated for the variable <particle[j]>, stride is 6   [ nbody.cc(28,24) ]
      remark #15415: vectorization support: non-unit strided load was generated for the variable <particle[j]>, stride is 6   [ nbody.cc(29,24) ]
      remark #15415: vectorization support: non-unit strided load was generated for the variable <particle[j]>, stride is 6   [ nbody.cc(30,24) ]
      remark #15305: vectorization support: vector length 16
      remark #15309: vectorization support: normalized vectorization overhead 0.458
      remark #15300: LOOP WAS VECTORIZED
      remark #15452: unmasked strided loads: 3
      remark #15475: --- begin vector cost summary ---
      remark #15476: scalar cost: 167
      remark #15477: vector cost: 15.000
      remark #15478: estimated potential speedup: 10.200
      remark #15486: divides: 3
      remark #15488: --- end vector cost summary ---
   LOOP END

For Implementation 5, let's tackle another scalar tuning opportunity: replacing the 3 "divides" with multiplies (because divides run slower than multiplies).

Implementation 5 - Scalar Tuning (Replace 3 Divides with Multiplies)

Implementation 5 Source Files

It turns out that divides are slower operations than multiplies, so let's try replacing them with multiplies instead. According to the optimization report, here are the opportunities:

Fx += dx / drPower32;  // divide
Fy += dy / drPower32;  // divide
Fz += dz / drPower32;  // divide

If we do the following instead, we will likely realise some efficiency gain:

float oodrPower32 = 1.0f / drPower32;  // one over drPower32
Fx += dx * oodrPower32;  // multiply
Fy += dy * oodrPower32;  // multiply
Fz += dz * oodrPower32;  // multiply

Implementation 5 - Remark on Power Function

A note on the power function:

const float drPower32  = std::pow(drSquared, 3.0f/2.0f);

Even though we have a "divide" within the std::pow (power) function, it turns out (according to the How-To series) that the Intel C++ compiler (with the MIC-AVX512 instruction set) is clever enough to optimize this - by using a combination of optimized power and square-root functions. So we need not worry about it. (That said, we are going to keep an open mind and prove the point by getting rid of that "divide" with the square-root function in Implementation 6. As we will see later, it makes hardly any difference.)

Implementation 5 Performance

Implementation 5 Performance Log

Average Performance: 581.5 +- 0.7 GFLOPS (12.88 % of peak 4505 GFLOP/s).

Implementation 5 Remarks

This is a slight improvement from our previous 9.87 % efficiency.

If we check the optimization report again, the 3 divides have gone (as we expected):

LOOP BEGIN at nbody.cc(16,3)
   remark #25096: Loop Interchange not done due to: Imperfect Loop Nest (Either at Source or due to other Compiler Transformations)
   remark #25451: Advice: Loop Interchange, if possible, might help loopnest. Suggested Permutation : ( 1 2 ) --> ( 2 1 )
   remark #15542: loop was not vectorized: inner loop was already vectorized

   LOOP BEGIN at nbody.cc(22,5)
      remark #15415: vectorization support: non-unit strided load was generated for the variable <particle->x[j]>, stride is 6   [ nbody.cc(28,24) ]
      remark #15415: vectorization support: non-unit strided load was generated for the variable <particle->y[j]>, stride is 6   [ nbody.cc(29,24) ]
      remark #15415: vectorization support: non-unit strided load was generated for the variable <particle->z[j]>, stride is 6   [ nbody.cc(30,24) ]
      remark #15305: vectorization support: vector length 16
      remark #15309: vectorization support: normalized vectorization overhead 0.683
      remark #15300: LOOP WAS VECTORIZED
      remark #15452: unmasked strided loads: 3
      remark #15475: --- begin vector cost summary ---
      remark #15476: scalar cost: 99
      remark #15477: vector cost: 10.250
      remark #15478: estimated potential speedup: 8.750
      remark #15488: --- end vector cost summary ---
   LOOP END

   LOOP BEGIN at nbody.cc(22,5)
   <Remainder loop for vectorization>
      remark #15305: vectorization support: vector length 16
      remark #15309: vectorization support: normalized vectorization overhead 0.646
      remark #15301: REMAINDER LOOP WAS VECTORIZED
   LOOP END
LOOP END

Implementation 6 - Scalar Tuning (Replace Power of Half with Square Root)

Implementation 6 Source Files

It turns out that instead of computing "x to the power of 3/2" (in the denominator), it is slightly more efficient to compute "1 over the square root of x, then raise that to the power of 3" - in particular, the sqrtf() (square-root) function is optimized to run on a KNL node with the AVX-512 instruction set. In other words, we try to push our performance a wee bit further by eliminating that "divide" within the power.

We replace the previous block:

const float drPower32  = std::pow(drSquared, 3.0f/2.0f);
float oodrPower32 = 1.0f / drPower32;   // one over drPower32

Fx += dx * oodrPower32;  // scalar tuning
Fy += dy * oodrPower32;  // scalar tuning
Fz += dz * oodrPower32;  // scalar tuning

with this new block:

const float oodr = 1.0f/sqrtf(drSquared);
const float drPowerN3 = oodr*oodr*oodr;

Fx += dx * drPowerN3;  // scalar tuning
Fy += dy * drPowerN3;  // scalar tuning
Fz += dz * drPowerN3;  // scalar tuning

Implementation 6 Performance

Implementation 6 Performance Log

Average Performance: 623.3 +- 1.5 GFLOPS (13.84% of peak 4505 GFLOP/s).

Implementation 6 Remarks

This is only slightly better than the 12.88% efficiency we obtained from the previous Implementation 5. (The gain is disappointingly small given the reduced readability of the code; one might give it a second thought as to whether this change is worth the trade-off. Nevertheless, as our priority is performance, we will keep it for now.)

OK. Enough Scalar tuning for now. In Implementation 7 we will introduce a simple tweak to our Makefile (without changing the code) to boost our performance even further.

Implementation 7 - (Scalar Tuning) Enables Aggressive Optimizations on Floating-point Data with Compiler Option

Implementation 7 Source Files

(Focus only on the Makefile)

Assuming our code is now optimized from a scalar tuning viewpoint (let's just assume it is for now - otherwise we would be at it forever), there is one more trick we can apply without changing any code at all, to push scalar tuning to the limit. The magic is to add an extra flag to our Intel C++ (icpc) compilation step: -fp-model fast=2. This option relaxes floating-point precision a little with the aim of maximizing scalar optimization - aggressively. See this Intel doc.

Here we borrow a slide from Colfax Research, on Intel C++ compiler options:

intel-compiler-options.png

In our Makefile, simply add the -fp-model fast=2 option to our compiler. i.e.

Change this line:

OPTFLAGS = -qopt-report=5 -qopt-report-file=$@.optrpt

to

OPTFLAGS = -qopt-report=5 -qopt-report-file=$@.optrpt -fp-model fast=2

Notes:

  • Be sure to do a make clean to remove all previously compiled files (since we have not changed the source code nbody.cc, make would otherwise not recompile it).
  • Then do a make queue-knl to recompile - note the -fp-model fast=2 flag.

Implementation 7 Performance

Implementation 7 Performance Log

Average Performance: 654.4 +- 3.1 GFLOPS (14.53% of peak 4505 GFLOP/s).

Implementation 7 Remarks

This is slightly better than the 13.84% we had in Implementation 6. Let's move on to our next tuning category - vectorization tuning (which, as we shall see very shortly, gives a significant performance boost!).

Implementation 8 - Vectorization Tuning (Restructure data from AoS to SoA)

Implementation 8 Source Files

We use Structure of Arrays (SoA) instead of Array of Structures (AoS) to make vectorization more efficient (at the trade-off of a potentially less elegant Object-Oriented Programming (OOP) style). Here, borrowing a slide from Colfax Research:

unit-stride.png

It turns out the SoA style (e.g. particle.x[i], which uses unit stride) is more efficient than the AoS style (e.g. particle[i].x, which uses non-unit stride). This implementation replaces our original AoS data structure with the more efficient SoA.

Implementation 8 - the change

We create a new data struct that resembles the more efficient SoA style:

// (Optimum) Structure of Array - bad OOP. Good performance.
struct ParticleSetType {
  float *x, *y, *z;
  float *vx, *vy, *vz;
};

We replace all the indexing style like this:

Old AoS style:

particle[i].x

new SoA style:

particle.x[i]

We replace all the (old AoS type) ParticleType with (the new SoA type) ParticleSetType.

Change the MoveParticles() definition input a bit - replace old AoS style with the new SoA style as follows.

Before: old AoS style - we have a giant container (our array) of N particle structures on the free store, hence Array of Structures (AoS). The only way to locate this giant array is via a local pointer variable *particle on the stack that points to it.

void MoveParticles(const int nParticles, ParticleType *const particle, const float dt) {//...};

After: new SoA style (less OOP but more efficient) - we have an object on the stack (our structure) that contains only 6 pointers (to our arrays): *x, *y, *z, *vx, *vy, *vz. Hence Structure of Arrays (SoA). This structure is very small, and we can pass a reference to it into the function. We locate each array like this: particle.x, particle.vx, etc.

void MoveParticles(const int nParticles, ParticleSetType & particle, const float dt) {//...};
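
For completeness, here is a hedged sketch (assumed, not the verbatim source) of how the SoA arrays might be allocated before calling MoveParticles; the plain new[] allocation shown here is exactly what Implementation 9 later replaces with _mm_malloc:

ParticleSetType particle;
particle.x  = new float[nParticles];   // one contiguous, unit-stride array per property
particle.y  = new float[nParticles];
particle.z  = new float[nParticles];
particle.vx = new float[nParticles];
particle.vy = new float[nParticles];
particle.vz = new float[nParticles];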

Implementation 8 Performance

Implementation 8 Performance Log

Average Performance: 1889.7 +- 11.6 GFLOPS (41.95% of peak 4505 GFLOP/s).

Implementation 8 Remarks

This is a significant boost from the 14.53% of our previous Implementation 7. If we look at the optimization report again, we see that the "non-unit stride" warning has gone away. We do, however, observe two new sets of warning messages:

  1. reference particle[j] has unaligned access
  2. PEEL LOOP WAS VECTORIZED

These messages are printed below; we will revisit them in Implementation 9.

On Unaligned access:

   LOOP BEGIN at nbody.cc(29,5) inlined into nbody.cc(124,5)
      remark #15389: vectorization support: reference particle[j] has unaligned access   [ nbody.cc(35,24) ]
      remark #15389: vectorization support: reference particle[j] has unaligned access   [ nbody.cc(36,24) ]
      remark #15389: vectorization support: reference particle[j] has unaligned access   [ nbody.cc(37,24) ]
      remark #15381: vectorization support: unaligned access used inside loop body
      remark #15305: vectorization support: vector length 16
      remark #15309: vectorization support: normalized vectorization overhead 1.476
      remark #15300: LOOP WAS VECTORIZED
      remark #15442: entire loop may be executed in remainder
      remark #15450: unmasked unaligned unit stride loads: 3
      remark #15475: --- begin vector cost summary ---
      remark #15476: scalar cost: 91
      remark #15477: vector cost: 5.250
      remark #15478: estimated potential speedup: 14.450
      remark #15488: --- end vector cost summary ---
   LOOP END

On Peel Loop:

   LOOP BEGIN at nbody.cc(29,5) inlined into nbody.cc(124,5)
   <Peeled loop for vectorization>
      remark #15389: vectorization support: reference particle[j] has unaligned access   [ nbody.cc(35,24) ]
      remark #15389: vectorization support: reference particle[j] has unaligned access   [ nbody.cc(36,24) ]
      remark #15389: vectorization support: reference particle[j] has unaligned access   [ nbody.cc(37,24) ]
      remark #15381: vectorization support: unaligned access used inside loop body
      remark #15305: vectorization support: vector length 16
      remark #15309: vectorization support: normalized vectorization overhead 1.093
      remark #15301: PEEL LOOP WAS VECTORIZED
   LOOP END

We will address this in our next Implementation 9.

Implementation 9 - Vectorization Tuning (Remove Peel Loop Overhead by Aligning Array Access)

Implementation 9 Source Files

In our previous Implementation 8 we saw two remarks in the optimization report (these are related):

remark #15389: vectorization support: reference particle[j] has unaligned access   [ nbody.cc(35,24) ]
remark #15301: PEEL LOOP WAS VECTORIZED

This Implementation 9 will resolve both remarks. In a picture, it looks like this (explanation follows):

peel-loop.png

Say we have a loop that iterates through an array particle.x[j]. On a KNL processor, vector loads are most efficient when the starting memory address of the array (i.e. &particle.x[0]) is a multiple of 64 bytes (an aligned memory boundary). When the compiler cannot guarantee this (e.g. the array happens to start at, say, the 90th byte), it works around it by generating an extra loop that processes the first few elements until the main vector loop can start on an aligned boundary (the 128th byte in this example). This "magic" extra loop is called a peel loop. It creates a bit of overhead and may slow the program down a little.


Quick notes:

  • for Knights Corner (KNC) Processors: use 64 byte alignment (same as Knights Landing / KNL).
  • for Xeon Processors: use 32 byte alignment

To clean this up, we can use the Intel _mm_malloc when creating a new array (to guarantee that the array starts at an address that is a multiple of 64 bytes), and the associated _mm_free to release the memory (to avoid a memory leak).


Note to self: I wonder if there is an equivalent smart-pointer approach that avoids memory leaks altogether (i.e. one that doesn't require an explicit "free memory" statement yet still guarantees the required cleanup, eliminating the risk of a leak). One possibility is sketched below.
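
As an aside (my own sketch, not from the original article or the Colfax material): the standard-library answer would be a std::unique_ptr with a custom deleter, so _mm_free is called automatically when the pointer goes out of scope.

#include <memory>
#include <immintrin.h>   // _mm_malloc / _mm_free

// Deleter that releases memory obtained from _mm_malloc.
struct MMFreeDeleter { void operator()(float* p) const { _mm_free(p); } };
using AlignedFloatArray = std::unique_ptr<float[], MMFreeDeleter>;

// 64-byte-aligned array that frees itself automatically (nParticles as in the article).
AlignedFloatArray x(static_cast<float*>(_mm_malloc(sizeof(float) * nParticles, 64)));
// x[i] can be used like a normal array; no explicit _mm_free call is needed.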


before: To implement this change we need to replace lines from this current version (that may cause unaligned access and peel loop vectorization overhead):

// create new instance
particle.x = new float[nParticles];

// release memory
delete[] particle.x;

after: to the new version (that will ensure aligned access and eliminate peel loop overhead):

// create new instance
particle.x = (float*) _mm_malloc(sizeof(float)*nParticles, 64);

// release memory
_mm_free(particle.x);

We also need to tell the compiler that we have manually performed the alignment, by placing a #pragma vector aligned line just in front of the (potential) peel loop - so the compiler doesn't generate that (magical) peel loop and its overhead.

#pragma vector aligned
for (int j = 0; j < nParticles; j++) {
  xxx = particle.x[j];
}

IMPORTANT NOTE: Here we are "promising" the compiler that we have manually (and correctly) aligned our arrays (i.e. the first element of each array starts at a multiple of 64 bytes). If we make this promise without implementing _mm_malloc correctly, unexpected things may occur. (I have not tried it; maybe it will compile and crash at run time, or maybe it won't compile at all.) For now, just make sure we only specify this pragma if we are sure we have used _mm_malloc correctly.

Implementation 9 Performance

Implementation 9 Performance Log

Average Performance: 2091.9 +- 13.7 GFLOPS (46.44% of peak 4505 GFLOP/s).

Implementation 9 Remarks

This is a boost from the 41.95% in our previous Implementation 8.

Now look at the optimization report. Note that access is now aligned (instead of unaligned), and peel-loop vectorization is no longer performed (less overhead). The estimated potential speedup is now 15.61, very close to the vector length of 16.

   LOOP BEGIN at nbody.cc(30,5) inlined into nbody.cc(129,5)
      remark #15388: vectorization support: reference particle[j] has aligned access   [ nbody.cc(36,24) ]
      remark #15388: vectorization support: reference particle[j] has aligned access   [ nbody.cc(37,24) ]
      remark #15388: vectorization support: reference particle[j] has aligned access   [ nbody.cc(38,24) ]
      remark #15305: vectorization support: vector length 16
      remark #15309: vectorization support: normalized vectorization overhead 1.317
      remark #15300: LOOP WAS VECTORIZED
      remark #15448: unmasked aligned unit stride loads: 3
      remark #15475: --- begin vector cost summary ---
      remark #15476: scalar cost: 91
      remark #15477: vector cost: 5.120
      remark #15478: estimated potential speedup: 15.610
      remark #15488: --- end vector cost summary ---
   LOOP END

There are no more vectorization optimization opportunities flagged. Let's do one more test - what if we increase our problem size? (As we shall see shortly, GFLOPS degrades, which prompts further optimization later on.)

Cache Memory Access Bottleneck: Test Implementation 9 with varying particle sizes

Implementation 9 Performance Log

It turns out our Implementation 9 doesn't make the most of the fast cache memory. We can visualize this by increasing NSIZE and observing the drop in performance.

Performance View

| Multiple | NSIZE | Average Performance (GFLOPS) | Performance Multiple |
|---|---|---|---|
| 1 | 16,384 | 2091.9 +- 13.7 | 1.00x |
| 2 | 32,768 | 2197.0 +- 110.4 | 1.05x |
| 4 | 65,536 | 1960.6 +- 23.6 | 0.94x |
| 8 | 131,072 | 1729.0 +- 12.9 | 0.83x |
| 16 | 262,144 | 1621.7 +- 0.7 | 0.78x |
| 32 | 524,288 | 1561.1 +- 57.8 | 0.75x |

Given the same application running on a single node, an increased problem size (NSIZE) results in degradation in GFLOPS. This is a problem that requires fixing (in Implementation 10).


Bonus: Duration View

It turns out the performance degradation seen in the Performance view can also be seen in a (more or less) equivalent Duration view, i.e. a decrease in performance equates to an increase in duration.

Here are the results we get after running the job at multiples of NSIZE. We've included an explanation below the table illustrating how we computed the Expected Duration and the ratio.

| Multiple (M) | NSIZE | Expected Duration | Actual Duration | Ratio of Expected vs Actual Duration |
|---|---|---|---|---|
| 1 | 16,384 | (same as actual) | ~2.567 ms | 1.00x |
| 2 | 32,768 | 10.268 ms | ~9.576 ms | 1.07x |
| 4 | 65,536 | 41.072 ms | ~45.130 ms | 0.91x |
| 8 | 131,072 | 164.288 ms | ~197.800 ms | 0.83x |
| 16 | 262,144 | 657.152 ms | ~848.000 ms | 0.77x |
| 32 | 524,288 | 2628.608 ms | ~3435.0 ms | 0.76x |

How we compute the expected duration:

  • We know that the expected duration is on the order of NSIZE squared, i.e. O(NSIZE^2).
  • Denote NSIZE = 16,384 (our baseline NSIZE) as N0. We run the job to get the actual duration at N0.
  • We initialize the expected duration at N0 to be the same as the actual duration at N0.
  • We then try out multiples of NSIZE by doubling it. For instance, the multiple M at baseline is 1; when M is 2, our NSIZE becomes 2*N0; when M is 4, our NSIZE becomes 4*N0, etc.
  • We compute the subsequent expected duration as M^2 * (duration at N0). For example, if M = 4 and the duration at N0 is 2.567 ms, the expected duration = 4^2 * 2.567 ms = 41.072 ms (as per the table above). Likewise for the other values of M.
  • We finally compute the ratio of expected to actual duration. If the performance were constant with respect to (w.r.t.) NSIZE, this ratio would stay at 1x.
  • In reality, performance decreases as we increase NSIZE (as we see in the table). The actual duration therefore turns out to be greater than the expected duration, and the ratio of expected to actual duration drops below 1x. The interesting part: this ratio turns out to be fairly close to the ratio we computed in the Performance view. (Because of this, we can more or less conclude that our Duration view is "equivalent" to our Performance view.)
  • One more note: the values in the table are pretty rough. In our next implementation we will enhance the code a bit to report the average duration (as well as the existing average performance in GFLOPS). This will make reporting easier.

Memory Access Bottleneck - Remarks

It turns out that the reason for the decreased performance (with respect to increased NSIZE) is that the current implementation is not able to use cache memory for faster access. A solution is a technique called Loop Tiling, which we discuss and implement in our next Implementation 10.

Implementation 10 - Memory Tuning (Loop Tiling with Register Blocking to take advantage of L1 and L2 Cache)

Implementation 10 Source Files

We learnt from our Implementation 9 tests that the code performance decreases with increased NSIZE. It turns out this is because the current implementation does not take advantage of the L1 and L2 caches effectively.

To understand why this is so let's recall our Xeon Phi architecture:

knl-numa-1.png

  • L1 Cache is 32 KiB per core. That's 32 * 1024 = 32,768 bytes per core.
  • L2 Cache is 512 KiB per core (1 MiB per tile = 1024 KiB per tile; each tile has 2 cores, so 1024 KiB per tile / 2 cores per tile = 512 KiB per core). That's 512 * 1024 = 524,288 bytes per core.

When NSIZE is small, the data touched by the inner j-loop may fit into the L1/L2 cache for high/medium performance. When NSIZE becomes too large, the data no longer fits in cache and "spills out" into non-cache RAM, which is much slower. This causes the degradation in performance.


Note to self: below is my own illustration to convince myself how caches work. I have yet to validate these examples; my knowledge of caches is limited at this stage.

For example:

  • When NSIZE is very small, say 1,000 particles: in the inner j-loop (where we read the x, y, z positions of each particle), we touch 1,000 particles * 3 float parameters * 4 bytes per float = 12,000 bytes. This easily fits in the L1 cache for high performance. Our cache hit rate will be high.
  • When NSIZE is medium, say 16,384 particles (our M = 1 case): the inner j-loop touches 16,384 * 3 floats * 4 bytes = 196,608 bytes. This won't fit in the L1 cache, but may fit in the L2 cache for medium performance. Our cache hit rate will still be reasonably high.
  • When NSIZE is 262,144 particles (the M = 16 case): the inner j-loop touches 262,144 * 3 floats * 4 bytes = 3,145,728 bytes. This fits in neither the L1 nor the L2 cache. Our cache hit rate will be pretty much zero.

It turns out that in practice, we can improve our cache performance with a technique called Loop Tiling.

loop-tiling.png

Instead of explaining how loop tiling works here (we will cover it in a separate article for fuller understanding), for now let's just apply the following loop-tiling pattern. The "after" syntax produces the same result as the "before" syntax - it just does it in a more machine-efficient way (making the most of the fast caches), if in a potentially less human-readable way (unfortunately). The before vs after syntax looks like this, and a hedged code sketch of the pattern follows the figure below:

loop-tiling-syntax.png
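
To make the pattern concrete, here is a minimal sketch of loop tiling with register blocking on the SoA layout - an assumed reconstruction (loop order, pragmas, and the softening name carried over from my earlier sketches), not the verbatim Implementation 10 source. TILE = 16 matches the earlier hint that particles are diced into tiles of 16:

#include <cmath>

void MoveParticlesTiled(const int nParticles, ParticleSetType& particle, const float dt) {
  const int TILE = 16;                             // assumed tile size (16 SP lanes on KNL)
  const float softening = 1e-20f;                  // assumed, as in the earlier baseline sketch
  #pragma omp parallel for
  for (int ii = 0; ii < nParticles; ii += TILE) {  // tile the outer i-loop
    // Per-tile force accumulators: small enough to stay in vector registers.
    float Fx[TILE] = {0.0f}, Fy[TILE] = {0.0f}, Fz[TILE] = {0.0f};
    for (int j = 0; j < nParticles; j++) {         // stream each particle j once per tile...
      for (int i = ii; i < ii + TILE; i++) {       // ...and reuse its data against all TILE particles i
        const float dx = particle.x[j] - particle.x[i];
        const float dy = particle.y[j] - particle.y[i];
        const float dz = particle.z[j] - particle.z[i];
        const float drSquared = dx*dx + dy*dy + dz*dz + softening;
        const float oodr = 1.0f / sqrtf(drSquared);
        const float drPowerN3 = oodr * oodr * oodr;
        Fx[i - ii] += dx * drPowerN3;
        Fy[i - ii] += dy * drPowerN3;
        Fz[i - ii] += dz * drPowerN3;
      }
    }
    for (int i = ii; i < ii + TILE; i++) {         // write back the accumulated forces
      particle.vx[i] += dt * Fx[i - ii];
      particle.vy[i] += dt * Fy[i - ii];
      particle.vz[i] += dt * Fz[i - ii];
    }
  }
}

The point of the reordering: the innermost loop is a fixed-length, register-resident block of 16, so each particle.x[j] (and friends) is loaded once and reused 16 times before moving on, which keeps the working set cache-friendly regardless of NSIZE.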

Implementation 10 Performance

Implementation 10 Performance Log

Average Performance: 2831.4 +- 8.6 GFLOPS (62.85% of peak 4505 GFLOP/s).

Notice the average performance is now much more robust to increases in NSIZE, averaging around 2874 +- 24.7 GFLOPS across the NSIZE multiples below. Also note the significant performance boost from the 46.44% efficiency of Implementation 9.

| Multiple | NSIZE | Average Performance (GFLOPS) | Performance Multiple |
|---|---|---|---|
| 1 | 16,384 | 2831.4 +- 8.6 | 1.00x |
| 2 | 32,768 | 2907.3 +- 6.0 | 1.03x |
| 4 | 65,536 | 2911.9 +- 12.9 | 1.03x |
| 8 | 131,072 | 2907.9 +- 10.2 | 1.03x |
| 16 | 262,144 | 2892.5 +- 2.5 | 1.02x |
| 32 | 524,288 | 2793.5 +- 107.8 | 0.99x |

Implementation 10 Output - Duration View

This section is purely for added intuition; the Duration view conveys the same information as the Performance view.

  • Notice the actual duration is now more predictable - it's around 1x of the expected duration.
  • Expected Duration = M^2 * Duration(M=1) = M^2 * 1.896 ms
  • Again, the ratio of expected to actual duration closely matches the ratio we get in the Performance view.
| Multiple (M) | NSIZE | Expected Duration | Actual Duration (with tiling) | Expected to Actual Duration Ratio |
|---|---|---|---|---|
| 1 | 16,384 | 1.896 ms | 1.896 +- 0.006 ms | 1.00x |
| 2 | 32,768 | 7.584 ms | 7.386 +- 0.015 ms | 1.03x |
| 4 | 65,536 | 30.336 ms | 29.499 +- 0.131 ms | 1.03x |
| 8 | 131,072 | 121.344 ms | 118.162 +- 0.418 ms | 1.03x |
| 16 | 262,144 | 485.376 ms | 475.148 +- 0.405 ms | 1.02x |
| 32 | 524,288 | 1941.504 ms | 1971.192 +- 82.462 ms | 0.98x |

Implementation 10 Remarks

The Implementation 10 results suggest that average performance stays (more or less) constant as NSIZE increases, and duration is now much more predictable. It is also a significant performance boost over our previous Implementation 9.

It is worth reminding ourselves that this toy N-body algorithm has an execution time on the order of O(N^2). There are alternative N-body algorithms that run on the order of O(n log n). For the purpose of this article - tuning the implementation of a chosen (inefficient) algorithm - what we have done so far is fit for purpose. In fact, now that we have the knowledge of implementation performance tuning, we could probably take a more efficient algorithm - one that runs in O(n log n) - apply a similar tuning procedure, and obtain increased performance (but this is out of scope for this article).

In the next Implementation 11, we will apply distributed computing to handle scalability (i.e. increased problem size), and perform some scalability tests (weak scaling, strong scaling, and sustain duration test).


Part 2: Distributed Computing and Scalability - Cluster Communication Tuning


Implementation 11 - Cluster Communication Tuning (for Scalability)

Implementation 11 Source Files

In Implementation 10, we achieved an average performance of 2831.4 +- 8.6 GFLOPS (62.85% of the theoretical peak performance of 4,505 GFLOPS) running parallel code on 1 node. To handle scalability, however, we may distribute the workload to run in parallel across multiple machines. This technique is called distributed computing. It is pretty much the same as saying: let's boost our performance well beyond the theoretical single-node peak (4,505 GFLOPS in our case).

We now modify our code to enable distributed computing with MPI (Message Passing Interface). Recall our Hello World MPI code - our implementation will be somewhat similar. We will also make use of MPI_Allgather to make the communication between cluster nodes more efficient (or, put another way, less inefficient). Borrowing some slides from Colfax Research on MPI_Allgather and the code implementation guide:

mpi-1

mpi-2

mpi-3

mpi-implementation.png
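
For intuition, here is a hedged sketch (assumed, not the verbatim Implementation 11 source) of the distribution scheme: each MPI rank updates only its own contiguous slice of particles, then MPI_Allgather shares the updated coordinates so every rank sees the full particle set before the next time step. It assumes nParticles is divisible by the number of ranks and reuses ParticleSetType from Implementation 8:

#include <mpi.h>

void MoveParticlesMPI(const int nParticles, ParticleSetType& particle, const float dt,
                      const int rank, const int nRanks) {
  const int chunk = nParticles / nRanks;      // particles owned by each rank
  const int first = rank * chunk;             // this rank's slice: [first, first + chunk)

  for (int i = first; i < first + chunk; i++) {
    // ... the single-node force accumulation, velocity and position update for particle i ...
  }

  // Share the updated coordinates: with MPI_IN_PLACE, each rank contributes its own block in place.
  MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                particle.x, chunk, MPI_FLOAT, MPI_COMM_WORLD);
  MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                particle.y, chunk, MPI_FLOAT, MPI_COMM_WORLD);
  MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                particle.z, chunk, MPI_FLOAT, MPI_COMM_WORLD);
}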

Strong and Weak Scaling Studies - on Implementation 11

We will perform both strong and weak scaling studies to learn how well our application scales. Some definitions, adapted from Wikipedia, on strong and weak scaling:

  • Strong scaling is defined as how the solution time varies with the number of processors for a fixed total problem size. In this study, we see how our job duration and performance vary given a fixed total problem size.
  • Weak scaling is defined as how the solution time varies with the number of processors for a fixed problem size per processor. In our case, instead of per processor we use "per node". It is equivalent because each node contains 1 Xeon Phi processor - even though each processor has many cores and threads.

We will go ahead and perform 3 tests

  1. Strong Scaling Study
  2. Weak Scaling Study
  3. (Bonus) Sustain Duration with Extra Nodes

Notes:

  • Note that Speed-up and Performance Multiple are equivalent in concept; their values should be (more or less) the same. In symbols, see the note after this list.
  • Speed-up = Duration (Nodes = 1) / Duration (Nodes = N)
  • Performance Multiple = Performance (Nodes = N) / Performance (Nodes = 1)
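
In symbols (my reading of the table columns that follow, where T_N and P_N denote the duration and GFLOPS measured on N nodes):

\text{Speed-up}(N) = \frac{T_1}{T_N}, \qquad \text{Performance Multiple}(N) = \frac{P_N}{P_1}, \qquad \text{Performance Multiple per node}(N) = \frac{P_N}{N \cdot P_1}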

Strong Scaling Study - Small n-body case (Less Robust)

Given a fixed total problem size of 16,384 particles, how do the duration and overall performance vary with the number of nodes?

Implementation 11 Job Log - Strong Scaling Study - Small n-body case

(Note: may need to scroll right to see more columns in table)

| Nodes | Particles | Duration (ms) | Performance (GFLOPS) | Duration Multiple | Performance Multiple | Performance Multiple (per node) |
|---|---|---|---|---|---|---|
| 1 | 16,384 | 1.797 +- 0.009 | 2987.1 +- 15.2 | 1.00x | 1.00x | 1.000x |
| 2 | 16,384 | 2.723 +- 0.100 | 1974.0 +- 70.8 | 1.52x | 0.66x | 0.330x |
| 4 | 16,384 | 4.038 +- 0.168 | 1331.5 +- 53.1 | 2.25x | 0.45x | 0.111x |
| 8 | 16,384 | 7.012 +- 0.111 | 765.8 +- 12.0 | 3.90x | 0.26x | 0.032x |
| 16 | 16,384 | 6.050 +- 1.181 | 919.4 +- 165.5 | 3.37x | 0.31x | 0.019x |
| 32 | 16,384 | 13.088 +- 0.810 | 411.7 +- 24.7 | 7.28x | 0.14x | 0.004x |

Observation: Given a small fixed problem size (16,384 particles, i.e. 268,419,072 interactions), the more nodes we distribute the workload across, the longer the job takes to complete. Overall performance degrades with increased nodes, likely because the distributed computing benefit is outweighed by the more significant per-node performance degradation.

Strong Scaling Study - Big n-body case (More Robust)

Given a fixed total problem size of 1,048,576 particles (i.e. 1,099,510,579,200 interactions), how do the duration and overall performance vary with the number of nodes? (Note: this is 64 times bigger than our small n-body case.)

Implementation 11 Job Log - Strong Scaling Study - Big n-body case

(Note: may need to scroll right to see more columns in table)

| Nodes | Particles | Duration (ms) | Performance (GFLOPS) | Duration Multiple | Performance Multiple | Performance Multiple (per node) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 1,048,576 | 7249.285 +- 78.787 | 3033.8 +- 32.6 | 1.00x | 1.00x | 1.000x |
| 2 | 1,048,576 | 3686.337 +- 69.256 | 5967.4 +- 108.0 | 0.51x | 1.97x | 0.983x |
| 4 | 1,048,576 | 2037.912 +- 65.578 | 10801.6 +- 342.2 | 0.28x | 3.56x | 0.890x |
| 8 | 1,048,576 | 1028.011 +- 35.690 | 21415.4 +- 702.2 | 0.14x | 7.06x | 0.882x |
| 16 | 1,048,576 | 652.174 +- 10.386 | 33726.9 +- 538.1 | 0.09x | 11.12x | 0.695x |
| 32 | 1,048,576 | 388.851 +- 11.889 | 56604.7 +- 1731.5 | 0.05x | 18.66x | 0.583x |

Observation: Given a large fixed problem size (1,048,576 particles, i.e. 1,099,510,579,200 interactions), the more nodes we distribute the workload across, the less time the job takes to complete. Overall performance increases with additional nodes, likely because the benefit of distributing the computation far outweighs the comparatively small per-node performance degradation.

Weak Scaling Study - Small n-body case (Less Robust)

Given a fixed problem size per node (16,384 particles per node, i.e. 268,419,072 interactions per node), how do the duration and overall performance vary with the number of nodes?

Implementation 11 Job Log - Weak Scaling Study - Small n-body case

(Note: may need to scroll right to see more columns in table)

| Nodes | Particles | Duration (ms) | Performance (GFLOPS) | Duration Multiple | Performance Multiple | Performance Multiple (per node) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 16,384 | 1.808 +- 0.066 | 2972.5 +- 102.6 | 1.00x | 1.00x | 1.000x |
| 2 | 32,768 | 6.941 +- 0.300 | 3099.6 +- 132.8 | 3.84x | 1.04x | 0.521x |
| 4 | 65,536 | 17.775 +- 1.006 | 4847.9 +- 271.6 | 9.83x | 1.63x | 0.408x |
| 8 | 131,072 | 31.867 +- 0.676 | 10787.0 +- 231.5 | 17.63x | 3.63x | 0.454x |
| 16 | 262,144 | 69.882 +- 1.853 | 19681.0 +- 519.8 | 38.65x | 6.62x | 0.414x |
| 32 | 524,288 | 151.732 +- 9.016 | 36362.6 +- 2199.2 | 83.92x | 12.23x | 0.382x |

Observation: Given a small fixed problem size per node (16,384 particles per node, i.e. 268,419,072 interactions per node), per-node performance degrades quickly as nodes increase. Distributed computing is less robust at this problem size.

Weak Scaling Study - Big n-body case (More Robust)

Given a fixed problem size per node (262,144 particles per node, i.e. 68,719,214,592 interactions per node), how do the duration and overall performance vary with the number of nodes?

Implementation 11 Job Log - Weak Scaling Study - Big n-body case

(Note: may need to scroll right to see more columns in table)

| Nodes | Particles | Duration (ms) | Performance (GFLOPS) | Duration Multiple | Performance Multiple | Performance Multiple (per node) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 262,144 | 441.428 +- 1.394 | 3113.5 +- 9.8 | 1.00x | 1.00x | 1.000x |
| 2 | 524,288 | 949.323 +- 31.113 | 5797.0 +- 183.4 | 2.15x | 1.86x | 0.930x |
| 4 | 1,048,576 | 2023.533 +- 82.803 | 10884.8 +- 430.2 | 4.58x | 3.50x | 0.875x |
| 8 | 2,097,152 | 3933.780 +- 72.012 | 22367.9 +- 409.8 | 8.91x | 7.18x | 0.898x |
| 16 | 4,194,304 | 8839.475 +- 702.602 | 40041.8 +- 3001.9 | 20.02x | 12.86x | 0.804x |
| 32 | 8,388,608 | 27900.579 +- 536.408 | 50461.1 +- 968.9 | 63.21x | 16.21x | 0.506x |

Observation: Given a large fixed problem size per node (262,144 particles per node, i.e. 68,719,214,592 interactions per node), per-node performance degrades much more slowly as nodes increase. Distributed computing is more robust (up to a certain point).

Bonus: Sustain Duration with Extra Nodes (Robust up to 16 Nodes)

Let's be curious and do a mini research experiment:

  • We identified previously that running the N-body simulation with 262,144 particles takes roughly 0.5 second per time-step (on average). The current 10-step simulation would therefore take roughly 0.5 x 10 = 5 seconds to run.
  • Since execution time is O(N^2), doubling (x2) the problem size to 524,288 particles should take 2^2 = 4 times longer to run, i.e. roughly 2 seconds per step, or 20 seconds in total.
  • We form the hypothesis that, if we distribute the job across 4 nodes, we should be able to complete the run in the same timeframe (5 seconds) - in other words, increase performance by 4 times. A short derivation of the required node counts follows this list.
  • Interactions = NSIZE * (NSIZE-1)
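
The node counts chosen in the table below follow from the O(N^2) reasoning above. A short derivation, writing T for the per-step duration, p for the number of nodes, N_0 = 262,144 for the baseline particle count, and c for a machine-dependent constant, and assuming ideal (loss-free) scaling:

$$T(N, p) \approx \frac{c\,N^2}{p}, \qquad N = M N_0 \;\Rightarrow\; T(M N_0, p) \approx \frac{c\,M^2 N_0^2}{p}$$

Holding the per-step duration at the single-node baseline $T(N_0, 1) \approx c\,N_0^2$ therefore requires $p \approx M^2$ nodes: 4 nodes at M = 2, 9 at M = 3, 16 at M = 4, 25 at M = 5, and so on.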

Let's do this test.

Implementation 11 Job Log - Sustain Duration

| Multiple (M) | Particles (NSIZE) | Interactions | Nodes | Duration (ms) | Performance (GFLOPS) | Duration Multiple | Performance Multiple | Performance Multiple (per Node) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 262,144 | 68,719,214,592 | 1 | 482.901 +- 61.725 | 2887.0 +- 321.0 | 1.00x | 1.00x | 1.000x |
| 2 | 524,288 | 274,877,382,656 | 4 | 533.371 +- 9.144 | 10310.2 +- 177.2 | 1.10x | 3.57x | 0.893x |
| ~3 | 786,384 | 618,399,009,072 | 9 | 576.369 +- 31.414 | 21518.5 +- 1103.7 | 1.19x | 7.45x | 0.828x |
| 4 | 1,048,576 | 1,099,510,579,200 | 16 | 662.437 +- 13.248 | 33208.9 +- 647.6 | 1.37x | 11.50x | 0.719x |
| ~5 | 1,310,400 | 1,717,146,849,600 | 25 | 2262 | 15187 | 4.68x | 5.26x | 0.210x |
| ~6 | 1,572,480 | 2,472,691,777,920 | 36 | 2728 | 18218 | 5.65x | 6.31x | 0.175x |
| ~7 | 1,834,560 | 3,365,608,559,040 | 49 | 4480 | 15036 | 9.27x | 5.21x | 0.106x |

Notes:

  • Let NSIZE at M=1 be N0, where N0 is 262144 particles.
  • NSIZE (particles) at each M (Multiple) is M*N0 (Multiple of baseline 262144 particles)
  • Particle-to-particle Interactions: NSIZE * (NSIZE-1)
  • Nodes = number of nodes to distribute the job across (to sustain similar timeframe)
  • Duration Multiple = multiple of Duration at M = 1. e.g. For M = 2 case, Duration Multiple = 533.371 ms / 482.901 ms = 1.10.
  • Performance Multiple = multiple of Performance at M = 1. e.g. For M = 2 case, Performance Multiple = (10310.2 GFLOPS / 2887.0 GFLOPS) = 3.57.
  • Performance Multiple (per Node) = multiple of Performance at M = 1, divided by Nodes. e.g. For M = 2 case, Performance Multiple (per Node) = (10310.2 GFLOPS / 2887.0 GFLOPS) / 4 nodes = 0.893.

Observations:

  • We see that it is possible to process 1 million particles (M=4 case: ~1.1 trillion interactions) in roughly the same time as 262,144 particles (M=1 case: ~69 billion interactions) - as short as ~0.5 second (500 ms) per step.
  • We see a gradual decline in per-node performance as we increase the number of nodes, likely due to the communication latency of the Gigabit Ethernet interconnect. This loss becomes significant at 25 or more nodes - the duration is no longer sub-second (but instead in the region of 2-5 seconds).

According to this Colfax Research slide, it appears that the Intel Omni-Path Fabric 100 (low latency) interconnect may be more robust than the 1 Gigabit Ethernet solution on which the experiments in this article are based:

(Figure: Colfax Research slide comparing MPI performance over different interconnects - performance-mpi.png)


Implementation 11 Remarks

If we look at the optimization report again, we see some additional potential optimization opportunities:

   LOOP BEGIN at nbody.cc(74,20)
      remark #15389: vectorization support: reference particle->vx[ii+$i2] has unaligned access
      remark #15389: vectorization support: reference particle->vx[?+$i2] has unaligned access
      remark #15388: vectorization support: reference Fx[$i2] has aligned access
      remark #15381: vectorization support: unaligned access used inside loop body
      remark #15305: vectorization support: vector length 16
      remark #15427: loop was completely unrolled
      remark #15309: vectorization support: normalized vectorization overhead 0.700
      remark #15300: LOOP WAS VECTORIZED
      remark #15448: unmasked aligned unit stride loads: 1 
      remark #15450: unmasked unaligned unit stride loads: 1 
      remark #15451: unmasked unaligned unit stride stores: 1 
      remark #15475: --- begin vector cost summary ---
      remark #15476: scalar cost: 8 
      remark #15477: vector cost: 0.620 
      remark #15478: estimated potential speedup: 7.520 
      remark #15488: --- end vector cost summary ---
   LOOP END

(similar for vy and vz)
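
These remarks say the loads and stores of particle->vx (and vy, vz) are unaligned, while Fx is aligned. Below is a minimal sketch of one way the accesses could be made aligned, assuming the structure-of-arrays fields are heap-allocated, the Intel compiler is used, and the tile offset ii can be kept a multiple of the 16-float vector length. The struct, function, and parameter names here (ParticleSetSoA, UpdateVelocities, dt, tile) are hypothetical, not the ones in nbody.cc:

    #include <mm_malloc.h>  // _mm_malloc / _mm_free

    // Hypothetical SoA container; the array names (vx, Fx, ...) follow the
    // optimization report above, but the real nbody.cc layout may differ.
    struct ParticleSetSoA {
      float *vx, *vy, *vz;
    };

    void AllocateAligned(ParticleSetSoA &p, const int n) {
      // 64-byte alignment matches the 512-bit (16-float) vector length
      // reported by the compiler for KNL.
      p.vx = (float*) _mm_malloc(n * sizeof(float), 64);
      p.vy = (float*) _mm_malloc(n * sizeof(float), 64);
      p.vz = (float*) _mm_malloc(n * sizeof(float), 64);
    }

    // Velocity update for one tile of particles starting at offset ii.
    // If ii is kept a multiple of 16 and the arrays are 64-byte aligned,
    // the compiler can drop the unaligned load/store code paths.
    void UpdateVelocities(ParticleSetSoA &p, const float *Fx, const float *Fy,
                          const float *Fz, const float dt,
                          const int ii, const int tile) {
      __assume_aligned(p.vx, 64);  // Intel compiler alignment hint
      __assume_aligned(p.vy, 64);
      __assume_aligned(p.vz, 64);
      for (int i = 0; i < tile; i++) {
        p.vx[ii + i] += dt * Fx[i];
        p.vy[ii + i] += dt * Fy[i];
        p.vz[ii + i] += dt * Fz[i];
      }
    }

Whether this actually pays off would need to be measured, since the compiler already estimates a 7.52x speedup for the vectorized loop as it stands.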

Probably something worth looking into in a potential future Implementation 12. For the purpose of this article, though, let's stop here and celebrate.

Conclusion

In Part 1 we started from a baseline N-body simulation code running on a single Xeon Phi enabled cluster node. We applied threading parallelism, scalar tuning, vectorization tuning, and memory tuning to take performance from the original 0.07% of theoretical peak up to 62.85%. In Part 2 we applied cluster communication tuning and used distributed computing to compute over 1 trillion particle-to-particle interactions within a sub-second time-step, by distributing the workload across 16 cluster nodes (semi-)efficiently. We appreciated both the strengths (scalability and parallelism) and the limitations (communication and other losses) of our implementations, and made an attempt to minimize the latter. It is my hope that this article helps others who are new to High Performance Computing (HPC) appreciate the kind of problem HPC solves, and how good modern code practices can help solve it well.

Acknowledgement

Many thanks to Intel for sponsoring my access to the Intel Xeon Phi cluster, and Colfax Research for the high quality DeepDive How-to Series, which has inspired me to perform the experiments and document the steps and results.

References


Intel Colfax Cluster - Notes - Index Page
