
# Intel Colfax Cluster - Perform 18 billion billion operations on Xeon Phi (Knights Landing) Cluster Node in Sub-millisecond

Intel Colfax Cluster - Notes - Index Page

# Disclaimers / Edits

Before reading this article, please read the following edits. The claim "run 18 billion billion operations in sub-millisecond" is very likely unrealistic, because there is a strong possibility that the compiler computed an analytical solution instead of performing all the operations in a loop as expected.

EDIT 1: 11th Aug 2017: performing 18 billion billion iterations in sub-millisecond may well not have been a realistic outcome. It is possible that the compiler cleverly skipped the for loop entirely and returned an analytical answer (i.e. a constant) for this particular example. It would be advisable to rerun the test with a non-linear operation (instead of the linear "add 1"). That said, if this hypothesis is true, it is quite nice (or the opposite) to know that the compiler is clever enough to dodge an unnecessary for loop and still return the same (analytical) answer. Validation of this hypothesis is still required.
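As a hypothetical sketch of that follow-up test (not part of the original experiment), here is a C++ loop whose update depends non-linearly on the running total, so there is no closed-form constant for the compiler to substitute:

```
#include <cstdio>

int main() {
    // a smaller n, so that even a serial run is feasible
    const unsigned long long int n = 1000000000ULL;
    unsigned long long int total = 0;

    // xorshift-style mixing: a loop-carried, non-linear dependence that
    // the compiler cannot reduce to a closed-form answer
    for (unsigned long long int i = 0; i < n; i++) {
        total ^= total << 13;
        total ^= total >> 7;
        total += i;
    }

    printf("Total: %llu\n", total);
    return 0;
}
```

If this version still reported a sub-millisecond time for a billion iterations, something other than the arithmetic would be being measured.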

EDIT 2: 11th Aug 2017: according to a back-of-envelope calculation, each Xeon Phi thread runs at 1.213 GHz, so performing 18 billion billion operations would theoretically take ~470 years on a single thread, or ~1.8 years on 256 threads. So for Configs 2 and 3, where we suggested the job finished within sub-millisecond, it is very likely that the compiler injected some "magic" and returned an analytical solution (i.e. a constant). Below is the Python 3 code that computes this theoretical time estimate:

```
# Estimated time to run 18 billion billion operations on a single thread
Node_GHz = 1.213   # each thread runs at 1.213 GHz
ops_per_second = Node_GHz * 1e9
total_ops = 18e18  # 18 billion billion ops
total_seconds_needed = total_ops / ops_per_second
total_minutes_needed = total_seconds_needed / 60
total_hours_needed = total_minutes_needed / 60
total_days_needed = total_hours_needed / 24
total_years_needed = total_days_needed / 365
print("1 Single Serial Thread: would take {0} years to perform 18 billion billion operations".format(total_years_needed))        # ~470 years
print("256 Parallel Threads: would take {0} years to perform 18 billion billion operations".format(total_years_needed / 256))    # ~1.8 years
```

Output:

```
1 Single Serial Thread: would take 470.5492627434151 years to perform 18 billion billion operations
256 Parallel Threads: would take 1.8380830575914653 years to perform 18 billion billion operations
```

# Abstract

As an experiment, we perform a simple "add 1" operation 18 billion billion times (18,000,000,000,000,000,000) on a Xeon Phi (Knights Landing) cluster node. The total number of grains of sand on Earth is roughly 7.5 billion billion (according to this NPR article), so imagine our experiment as trying to count every grain of sand on Earth, twice over!

We do this in the following 3 configurations and observe elapsed times that differ by many orders of magnitude:

1. Configuration 1 (Serial Mode - No OpenMP - 1 Thread), elapsed time: "forever" / 13+ hours (still running when killed)
2. Configuration 2 (Parallel Reduction Mode - OpenMP - 256 Threads), elapsed time: 5.160000 ms (Note: with `printf` turned off this reduces to 0.640000 ms)
3. Configuration 3 (Parallel Reduction Mode - OpenMP - 1 Thread), elapsed time: 0.050000 ms (Note: with `printf` turned off this reduces to 0.030000 ms)

We observe that in the absence of OpenMP and code optimisation, the program takes "forever" to run. With OpenMP, the job completes within milliseconds, or even sub-millisecond: a significant speed-up. Note also that with OpenMP, a 1-thread run can "beat" a 256-thread one, probably due to the reduced overhead of coordinating threads (though see the disclaimers above).

This post summarises the code and results in detail.

# Config 1: Serial Mode - No OpenMP - loop 18 billion billion times

Let's start with some very simple for-loop code that performs an operation 18 billion billion times. (For reference, the maximum value of a C++ `unsigned long long int` is 18,446,744,073,709,551,615; see the references section. I picked 18 billion billion to ensure our "add 1" operations do not push the total past this upper bound. It is an arbitrary choice with the goal of visualising a "BIG" loop.)
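That bound can be checked directly; a small standalone sketch (not part of the original experiment):

```
#include <cstdio>
#include <limits>

int main() {
    // prints 18446744073709551615 on platforms with a 64-bit unsigned long long
    printf("max unsigned long long: %llu\n",
           std::numeric_limits<unsigned long long>::max());
    return 0;
}
```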

Code `hello-serial-add.cc`:

```
#include <cstdio>
#include <time.h>
#include <locale.h>

int main() {

  // enable thousand-separator formatting in printf
  setlocale(LC_ALL, "");

  // start timer
  const clock_t begin_time = clock();

  // Set total number of increments to add up
  unsigned long long int n = 18000000000000000000ULL;

  // initialize total
  unsigned long long int total = 0;

  for (unsigned long long int i = 0; i < n; i++) {
    total += 1;
  }

  printf("----------\n");
  printf("Final Total: %'llu (must equal to %'llu)\n", total, n);
  printf("Overall elapsed time: %f ms\n",
         float(clock() - begin_time) / CLOCKS_PER_SEC * 1000.0f);  // scale seconds to ms

  return 0;
}
```

Compile to `hello-serial-add` with the Intel C++ compiler `icpc`:

```
icpc -o hello-serial-add hello-serial-add.cc
```

Shell Script `hello-serial-add.sh`:

```echo "hello-serial-add starts"
cd /home/u4443/deepdive/lec-04

Submit job to KNL node:

```
[u4443@c001 lec-04]$ qsub hello-serial-add.sh -l nodes=1:knl
21171
```

The job ran for more than 10 minutes and was still going, so I had to kill it.

```
[u4443@c001 lec-04]$ qstat
Job ID                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
21171.c001                 ...serial-add.sh u4443           00:10:05 R batch
```

Note to self: I resubmitted the job (as 21178) to see whether it would actually finish within the 24-hour walltime. When I checked the next day, the job had already run for 13+ hours and was still going, so I killed it to free up resources on the cluster node. Point proven, though: this code takes "forever" to run.

```
[u4443@c001 ~]$ qstat
Job ID                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
21178.c001                 ...serial-add.sh u4443           13:03:31 R batch
```

One more try: what about the same code, compiled with the `-qopenmp` option? Would that help? (I created a shell script accordingly, pointing to the newly compiled binary.)

```
icpc -qopenmp -o hello-serial-add-openmp hello-serial-add.cc
```

~90 minutes later it was still running, so again, point proven, and I killed it. (In hindsight this is unsurprising: `-qopenmp` only enables OpenMP directives, and this code contains none, so the loop still runs serially.)

```
[u4443@c001 lec-04]$ qstat
Job ID                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
21198.c001                 ...add-openmp.sh u4443           01:25:21 R batch
```

So it seems we have hit a problem: a for loop over 18 billion billion iterations simply runs forever. Let's try something else: parallel reduction with OpenMP (see Configs 2 and 3 below).

# Config 2: Parallel Reduction Mode - with 256 OpenMP threads - loop 18 billion billion times

As the Config 1 serial mode seems to take forever to run, we are going to optimise the code a little with the OpenMP parallel-reduction "modern code" pattern, distributing the addition operations over 256 threads on the KNL node.

Borrowing an idea from the Colfax Research How-To (Deep Dive) Series Lecture 4 on parallel reduction, the mental picture is: distribute the add operations across multiple threads, then add the per-thread totals together, rather like MapReduce. (The lecture slide illustrating this is not reproduced here.)

We effectively distribute the for loop over multiple threads with `#pragma omp parallel` and `#pragma omp for`, with each thread owning its own private thread total. Each thread does its share of the "add 1" work; once the threads have their totals, we add them into the final total without conflicts using `#pragma omp atomic`.
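As an aside, OpenMP can express this whole pattern with its built-in `reduction` clause; a minimal sketch of the equivalent loop (not the code used in this experiment, and subject to the same compiler caveats discussed in the disclaimers):

```
#include <omp.h>
#include <cstdio>

int main() {
    const unsigned long long int n = 18000000000000000000ULL;
    unsigned long long int total = 0;

    // reduction(+:total) gives each thread a private copy of total,
    // then combines the copies at the end of the loop: the same
    // map-reduce pattern as the manual private-total + atomic version
    #pragma omp parallel for reduction(+:total)
    for (unsigned long long int i = 0; i < n; i++) {
        total += 1;
    }

    printf("Final Total: %llu\n", total);
    return 0;
}
```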

This is what the code looks like (note the newly added `#pragma omp` directives).

Code `hello-parallel-map-reduce-add-3.cc`:

```
#include <omp.h>
#include <cstdio>
#include <time.h>
#include <locale.h>

int main() {

  // enable thousand-separator formatting in printf
  setlocale(LC_ALL, "");

  // start timer
  const clock_t begin_time = clock();

  // Set total number of increments to add up
  // 18,000,000,000,000,000,000
  unsigned long long int n = 18000000000000000000ULL;

  // Initialise final total
  unsigned long long int total = 0;

  printf("----------\n");

#pragma omp parallel
  {
    // each thread owns its own private running total
    unsigned long long int total_thr = 0;

#pragma omp for
    for (unsigned long long int i = 0; i < n; i++) {
      total_thr += 1;
    }

    // implicit barrier at the end of the omp for loop; atomic then
    // ensures per-thread totals are added into total without a data race
#pragma omp atomic
    total += total_thr;
  }

  printf("\n--------\n");
  printf("Final Total: %'llu (must equal to %'llu)\n", total, n);
  printf("Overall elapsed time: %f ms\n",
         float(clock() - begin_time) / CLOCKS_PER_SEC * 1000.0f);  // scale seconds to ms

  return 0;
}
```

Compile with OpenMP, to `hello-parallel-map-reduce-add-3`:

```
icpc -qopenmp -o hello-parallel-map-reduce-add-3 hello-parallel-map-reduce-add-3.cc
```

Create a Shell Script `hello-parallel-map-reduce-add-3.sh` to run the binary:

```echo "hello-parallel-map-reduce-add-3 starts"
cd /home/u4443/deepdive/lec-04

Submit job to KNL Cluser Node via `qsub`:

```
[u4443@c001 lec-04]$ qsub hello-parallel-map-reduce-add-3.sh -l nodes=1:knl
21173.c001
```

Check Output:

```
[u4443@c001 lec-04]$ cat hello-parallel-map-reduce-add-3.sh.o21173

########################################################################
# Colfax Cluster - https://colfaxresearch.com/
#      Date:           Thu Aug 10 12:10:53 PDT 2017
#    Job ID:           21173.c001
#      User:           u4443
# Resources:           neednodes=1:knl,nodes=1:knl,walltime=24:00:00
########################################################################

----------

--------
Final Total: 18,000,000,000,000,000,000 (must equal to 18,000,000,000,000,000,000)
Overall elapsed time: 5.160000 ms

########################################################################
# Colfax Cluster
# End of output for job 21173.c001
# Date: Thu Aug 10 12:10:54 PDT 2017
########################################################################
```

The job completes in ~5 milliseconds. This is pretty impressive, considering the Config 1 serial run was still going after 13+ hours.

# Config 3: Parallel Reduction Mode - with 1 OpenMP thread - loop 18 billion billion times

Inspired by the 5-millisecond run with 256 threads, we now wonder how the same code performs with just 1 thread. We simply reuse the Config 2 code, except that in our shell script we restrict the run to 1 OpenMP thread (via the environment variable `OMP_NUM_THREADS`).
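As a sanity check, the program itself can report how many threads the runtime grants it; a hypothetical snippet (not part of the original code):

```
#include <omp.h>
#include <cstdio>

int main() {
    // with OMP_NUM_THREADS=1 this should report 1 thread;
    // with the KNL node's defaults it would report 256
    #pragma omp parallel
    {
        #pragma omp single
        printf("Running with %d OpenMP thread(s)\n", omp_get_num_threads());
    }
    return 0;
}
```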

Shell Script `hello-parallel-map-reduce-add-3b.sh`:

```echo "hello-parallel-map-reduce-add-3 starts"
cd /home/u4443/deepdive/lec-04

Submit job to KNL node via `qsub`:

```
[u4443@c001 lec-04]$ qsub hello-parallel-map-reduce-add-3b.sh -l nodes=1:knl
21174.c001
```

Output:

```
[u4443@c001 lec-04]$ cat hello-parallel-map-reduce-add-3b.sh.o21174

########################################################################
# Colfax Cluster - https://colfaxresearch.com/
#      Date:           Thu Aug 10 12:18:35 PDT 2017
#    Job ID:           21174.c001
#      User:           u4443
# Resources:           neednodes=1:knl,nodes=1:knl,walltime=24:00:00
########################################################################

----------

--------
Final Total: 18,000,000,000,000,000,000 (must equal to 18,000,000,000,000,000,000)
Overall elapsed time: 0.050000 ms

########################################################################
# Colfax Cluster
# End of output for job 21174.c001
# Date: Thu Aug 10 12:18:36 PDT 2017
########################################################################

[u4443@c001 lec-04]$
```

Wow: 0.05 milliseconds! (Recall that in Config 2 the same code on 256 threads took 5 milliseconds.) This is a pretty interesting observation, because it appears that by optimising our code with OpenMP, it can run incredibly fast on just 1 thread!

Let's try one more thing. In Configs 2 and 3 there is a `printf` statement that prints each thread's total (it sits inside the parallel region; the line is not shown in the listing above). To make the comparison between Config 2 and Config 3 a little fairer (256 thread-total `printf`s vs one), let's comment that line out and rerun both configurations in the next two sections.

```
// comment out the per-thread printf of total_thr for the next 2 tests
```

# Config 2 - without printing thread total

Same code as Config 2, but with the per-thread `printf` commented out. We observe a reduction in run time.

```
[u4443@c001 lec-04]$ qsub hello-parallel-map-reduce-add-4.sh -l nodes=1:knl
21176.c001
cat: hello-parallel-map-reduce-add-4.sh.o21176: No such file or directory

########################################################################
# Colfax Cluster - https://colfaxresearch.com/
#      Date:           Thu Aug 10 12:53:51 PDT 2017
#    Job ID:           21176.c001
#      User:           u4443
# Resources:           neednodes=1:knl,nodes=1:knl,walltime=24:00:00
########################################################################

----------

--------
Final Total: 18,000,000,000,000,000,000 (must equal to 18,000,000,000,000,000,000)
Overall elapsed time: 0.640000 ms

########################################################################
# Colfax Cluster
# End of output for job 21176.c001
# Date: Thu Aug 10 12:53:52 PDT 2017
########################################################################
```

For Config 2, without the print statements the time drops from 5.16 ms to 0.64 ms. It is sub-millisecond, so we are pretty happy.

# Config 3 - without printing thread total

Same code as Config 3, but with the per-thread `printf` commented out. We observe a slight reduction in run time.

```
[u4443@c001 lec-04]$ qsub hello-parallel-map-reduce-add-4b.sh -l nodes=1:knl
21177.c001

########################################################################
# Colfax Cluster - https://colfaxresearch.com/
#      Date:           Thu Aug 10 12:58:44 PDT 2017
#    Job ID:           21177.c001
#      User:           u4443
# Resources:           neednodes=1:knl,nodes=1:knl,walltime=24:00:00
########################################################################

----------

--------
Final Total: 18,000,000,000,000,000,000 (must equal to 18,000,000,000,000,000,000)
Overall elapsed time: 0.030000 ms

########################################################################
# Colfax Cluster
# End of output for job 21177.c001
# Date: Thu Aug 10 12:58:45 PDT 2017
########################################################################
```

For Config 3, without the print statement the time drops from 0.05 ms to 0.03 ms: a slight reduction, as expected.

# A Note on Elapsed Time

The code samples here use the start and end times from the `clock()` utility provided by `<time.h>` to compute duration. Only afterwards did I learn that `omp_get_wtime()` from `<omp.h>` does a similar job and is arguably the better tool: `clock()` measures CPU time, which on Linux is summed across all threads of the process, whereas `omp_get_wtime()` measures elapsed wall-clock time. Worth trying out next time.
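A minimal sketch of what that would look like (hypothetical; not the code used in the experiments above):

```
#include <omp.h>
#include <cstdio>

int main() {
    // omp_get_wtime() returns elapsed wall-clock time in seconds
    const double t0 = omp_get_wtime();

    // ... work to be timed goes here ...

    const double t1 = omp_get_wtime();
    printf("Overall elapsed time: %f ms\n", (t1 - t0) * 1000.0);
    return 0;
}
```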

# A Note on the OpenMP Compilation Report

To see which part of the code was optimised in what way, we could add the `-qopt-report` option in our compilation script:

```
icpc -qopenmp -o hello-parallel-map-reduce-add-3 hello-parallel-map-reduce-add-3.cc -qopt-report
```

This will output a report called `hello-parallel-map-reduce-add-3.optrpt`. It looks like this (note that the for loop was not vectorized):

```
[u4443@c001 lec-04]$ cat hello-parallel-map-reduce-add-3.optrpt
Intel(R) Advisor can now assist with vectorization and show optimization
report messages with your source code.

Report from: Interprocedural optimizations [ipo]

INLINING OPTION VALUES:
-inline-factor: 100
-inline-min-size: 30
-inline-max-size: 230
-inline-max-total-size: 2000
-inline-max-per-routine: 10000
-inline-max-per-compile: 500000

Begin optimization report for: main()

Report from: Interprocedural optimizations [ipo]

Report from: OpenMP optimizations [openmp]

hello-parallel-map-reduce-add-3.cc(28:1-28:1):OMP:main:  OpenMP DEFINED REGION WAS PARALLELIZED

Report from: Loop nest, Vector & Auto-parallelization optimizations [loop, vec, par]

remark #15335: loop was not vectorized: vectorization possible but seems inefficient. Use vector always directive or -vec-threshold0 to override
remark #25456: Number of Array Refs Scalar Replaced In Loop: 1
LOOP END

Report from: Code generation optimizations [cg]

Hardware registers
Reserved     :    2[ rsp rip]
Available    :   39[ rax rdx rcx rbx rbp rsi rdi r8-r15 mm0-mm7 zmm0-zmm15]
Callee-save  :    6[ rbx rbp r12-r15]
Assigned     :   14[ rax rdx rcx rbx rsi rdi r8-r10 r12-r15 zmm0]

Routine temporaries
Total         :     155
Global    :      25
Local     :     130
Regenerable   :      84
Spilled       :       0

Routine stack
Variables     :      60 bytes*
Reads     :       8 [2.00e+00 ~ 1.2%]
Writes    :       9 [2.30e+01 ~ 13.5%]
Spills        :      40 bytes*
Reads     :      10 [5.00e+00 ~ 2.9%]
Writes    :      10 [0.00e+00 ~ 0.0%]

Notes

*Non-overlapping variables and spills may share stack space,
so the total stack size might be less than this.

===========================================================================
[u4443@c001 lec-04]$
```

I was hoping this report would shed some light on why 1 thread "beats" 256 threads by such a margin. I may come back to this later.

# A Note on Warming up Threads

It was also later realised that we could "warm up" the threads by adding an artificial empty parallel block like this to the code:

```
// warm up threads: an empty parallel region makes the OpenMP runtime
// spawn its thread pool before the timed section
#pragma omp parallel
{
}
```

I did so and re-ran Configs 2 and 3; there was not much change in performance.

# Conclusion

We performed an experiment: the "add 1" operation, 18 billion billion times. We learnt that with the OpenMP framework and code optimisation (with `#pragma omp`) we can apparently perform the task in a sub-millisecond timeframe (Config 2: ~0.64 ms with 256 threads; Config 3: ~0.03 ms with 1 thread). The fact that the optimised code can finish in sub-millisecond on just 1 thread is an interesting one, though per the disclaimers at the top, the compiler may simply have substituted an analytical answer. Without OpenMP and code optimisation, the Config 1 job runs "forever" (more than 13 hours). Based on this, OpenMP and code optimisation can significantly improve code performance; when performing an operation 18 billion billion times, this is not just a "nice" feature but a necessary requirement.

# References

Intel Colfax Cluster - Notes - Index Page