Intel Colfax Cluster - Perform 18 billion billion operations on Xeon Phi (Knights Landing) Cluster Node in Sub-millisecond

Intel Colfax Cluster - Notes - Index Page


Disclaimers / Edits

Before reading this article, please read the following edits. The statement "run 18 billion billion operations in sub-millisecond" is very likely unrealistic, due to a strong possibility that the compiler computed an analytical solution instead of performing all the operations in a loop as written.

EDIT 1: 11th Aug 2017: performing 18 billion billion iterations in sub-millisecond is probably not a realistic outcome. It is possible that the compiler cleverly skipped the for loop entirely and returned an analytical answer (i.e. a constant) for this particular example. It would be advisable to repeat the test with a more non-linear operation (instead of the linear "add 1"). That said, if this hypothesis is true it is quite nice (or the opposite) to know that the compiler is clever enough to dodge an unnecessary for loop and return the same (analytical) answer. Validation of this hypothesis is still required.
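
For illustration only, a hypothetical sketch of such a non-linear test (not run on the cluster): the idea is to make each iteration depend on the running total in a way the compiler is unlikely to be able to replace with a closed-form constant, unlike the plain "add 1".

#include <cstdio>

int main() {
  // hypothetical smaller trial size, chosen only for illustration
  unsigned long long int n = 1000000000ULL;
  unsigned long long int total = 0;

  for (unsigned long long int i = 0; i < n; i++) {
    // loop-carried, non-linear update: unlike "total += 1", this has no
    // simple closed-form sum the compiler can substitute for the loop
    total += (total ^ i) % 7;
  }

  printf("Total: %llu\n", total);
  return 0;
}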

EDIT 2: 11th Aug 2017: according to a back-of-envelope calculation of the time estimate: each Xeon Phi thread runs at 1.213 GHz, so running 18 billion billion operations would theoretically take ~470 years on a single thread, or ~2 years on 256 threads. So for Config 2 and 3, where we suggested the job finished within sub-millisecond, it is very likely that the compiler has injected some "magic" and returned an analytical solution (i.e. a constant). Below is the Python 3 code that computes this theoretical time estimate:

# Estimated time to run 18 billion billion operations on a single thread
Node_GHz = 1.213   # each thread is 1.213 GHz
ops_per_second = Node_GHz * 1e9
total_ops = 18e18      # 18 billion billion ops
total_seconds_needed = total_ops / ops_per_second
total_minutes_needed = total_seconds_needed / 60
total_hours_needed = total_minutes_needed / 60
total_days_needed = total_hours_needed / 24
total_years_needed = total_days_needed / 365
print("1 Single Serial Thread: would take {0} years to perform 18 billion billion operations".format(total_years_needed))   # ~190 years
print("256 Parallel Threads: would take {0} years to perform 18 billion billion operations".format(total_years_needed/256))    # ~2 years

Output:

1 Single Serial Thread: would take 470.5492627434151 years to perform 18 billion billion operations
256 Parallel Threads: would take 1.8380830575914653 years to perform 18 billion billion operations

Abstract

As an experiment, we perform a simple "add 1" operation 18 billion billion times (18,000,000,000,000,000,000) on a Xeon Phi (Knights Landing) cluster node. The total number of grains of sand on Earth is roughly 7.5 billion billion (according to this NPR article), so imagine our experiment as trying to "count" every grain of sand on Earth, twice!

We do this in 3 configurations, as listed below, and observe very different magnitudes of elapsed time:

  1. Configuration 1 (Serial Mode - No OpenMP - 1 Thread), elapsed time: "forever" / 13+ hours (and still running when killed)
  2. Configuration 2 (Parallel Reduction Mode - OpenMP - 256 Threads), elapsed time: 5.160000 ms (Note: with printf turned off this reduces to 0.640000 ms)
  3. Configuration 3 (Parallel Reduction Mode - OpenMP - 1 Thread), elapsed time: 0.050000 ms (Note: with printf turned off this reduces to 0.030000 ms)

We observe that in the absence of OpenMP and code optimisation, the program takes "forever" to run. With OpenMP, the job completes within milliseconds, or even sub-millisecond: a significant speed-up. Note also that with OpenMP, a 1-thread run can "beat" a 256-thread one, probably due to the reduced overhead of coordinating threads.

This post summarises the codes and results in detail.

Config 1: Serial Mode - No OpenMP - loop 18 billion billion times

Let's start with a very simple for loop to perform an operation 18 billion billion times. (For reference, the maximum value of a C++ unsigned long long int is 18,446,744,073,709,551,615; see the references section. I picked 18 billion billion to ensure our "add 1" operations do not push the total past this upper bound. It is an arbitrary choice, with the goal of visualising a "BIG" loop.)
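
As a quick sanity check (not part of the original experiment), this upper bound can be confirmed with std::numeric_limits:

#include <cstdio>
#include <limits>

int main() {
  // print the largest value an unsigned long long int can hold
  // (expected: 18,446,744,073,709,551,615)
  printf("unsigned long long int max: %llu\n",
         std::numeric_limits<unsigned long long int>::max());
  return 0;
}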

Code hello-serial-add.cc:

#include <cstdio>
#include <time.h>
#include <locale.h>

int main() {

  // enable thousands separator formatting in printf
  setlocale(LC_ALL, "");

  // start timer
  const clock_t begin_time = clock();

  // Set total number of increments to add up
  unsigned long long int n = 18000000000000000000ULL;

  // initialize total
  unsigned long long int total = 0;

  for (unsigned long long int i = 0; i < n; i++) {
    total += 1;
  }

  printf("Numbers to add: %'llu\n", n);
  printf("----------\n");
  printf("Final Total: %'llu (must equal to %'llu)\n", total, n);
  printf("Overall elapsed time: %f ms\n", float( clock () - begin_time ) / (CLOCKS_PER_SEC));

}

Compile to hello-serial-add with the Intel C++ compiler icpc:

icpc -o hello-serial-add hello-serial-add.cc

Shell Script hello-serial-add.sh:

echo "hello-serial-add starts"
cd /home/u4443/deepdive/lec-04
./hello-serial-add
echo "hello-serial-add ends"

Submit job to KNL node:

[u4443@c001 lec-04]$ qsub hello-serial-add.sh -l nodes=1:knl
21171

The job ran for more than 10 minutes and was still running, so I had to kill it.

[u4443@c001 lec-04]$ qstat
Job ID                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
21171.c001                 ...serial-add.sh u4443           00:10:05 R batch

Note to self: I have just resubmitted the job as 21178, to see whether it will actually finish within the 24-hour walltime. (When I checked the next day the job had already run for 13+ hours and was still running, so I decided to kill it to free up resources on the cluster node. Point proven though: this code takes "forever" to run.)

[u4443@c001 ~]$ qstat
Job ID                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
21178.c001                 ...serial-add.sh u4443           13:03:31 R batch

One more try: what about the same code, but compiled with the -qopenmp option? Would that help? (Create a shell script accordingly to point to this newly compiled binary.)

icpc -qopenmp -o hello-serial-add-openmp hello-serial-add.cc

~90 minutes later and it is still running... (so again, point proven, I am killing it).

[u4443@c001 lec-04]$ qstat
Job ID                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
21198.c001                 ...add-openmp.sh u4443           01:25:21 R batch

So it seems we have hit a problem: a serial for loop over 18 billion billion iterations just runs forever. Let's try something else: Parallel Reduction with OpenMP (see Config 2 and 3 below).

Config 2: Parallel Reduction Mode - with 256 OpenMP threads - loop 18 billion billion times

As the Config 1 serial mode seems to take forever to run, we are going to try to optimise the code a little with OpenMP Parallel Reduction "modern code", distributing the addition operations over 256 threads on the KNL node.

Borrowing a slide from the Colfax Research How-to (Deep Dive) Series Lecture 4 on Parallel Reduction, this is what the mental picture looks like (please ignore the numbers/values in the diagram, as they correspond to a different arithmetic operation than our example; just focus on "distribute the add operations across threads, then add the thread totals together", somewhat like MapReduce):

parallel-reduction.png

We effectively distribute the for loop over multiple threads with #pragma omp parallel and #pragma omp for, with each thread owning its own private thread total. Each thread does its share of the "add 1" operations. Once the thread totals are ready, each thread adds its total to the shared final total without data races, using #pragma omp atomic.

This is what the code looks like (note the newly added #pragma omp directives).

Code hello-parallel-map-reduce-add-3.cc:

#include <omp.h>
#include <cstdio>
#include <time.h>
#include <locale.h>

int main() {

  // enable thousands separator formatting in printf
  setlocale(LC_ALL, "");

  // start timer
  const clock_t begin_time = clock();

  // Set total number of increments to add up
  // 18,000,000,000,000,000,000
  unsigned long long int n = 18000000000000000000ULL;

  // Initialise Final Total
  unsigned long long int total = 0;

  // Display total number of threads used and numbers to add
  const unsigned int num_threads = omp_get_max_threads();
  printf("\nOpenMP with %lu threads\n", num_threads);
  printf("Numbers to add: %'llu\n", n);
  printf("----------\n");

#pragma omp parallel
  {
    // Initialise the thread total in thread-private memory
    unsigned long long int total_thr = 0;

    // Distribute numbers to add across threads
#pragma omp for
    for (unsigned long long int i = 0; i < n; i++) {
      total_thr += 1;
    };

    printf("Thread %u total: %'llu\n ", omp_get_thread_num(), total_thr);

    // atomically add this thread's total to the shared final total
#pragma omp atomic
      total += total_thr;
  }

  printf("\n--------\n");
  printf("Final Total: %'llu (must equal to %'llu)\n", total, n);
  printf("Overall elapsed time: %f ms\n", float( clock () - begin_time ) / (CLOCKS_PER_SEC));

}
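
As an aside (this is not the code run in this experiment), OpenMP also provides a built-in reduction clause that expresses the same pattern more concisely; a minimal sketch:

#include <omp.h>
#include <cstdio>

int main() {
  unsigned long long int n = 18000000000000000000ULL;
  unsigned long long int total = 0;

  // OpenMP gives each thread a private copy of 'total' and combines the
  // copies with '+' at the end of the parallel region
#pragma omp parallel for reduction(+: total)
  for (unsigned long long int i = 0; i < n; i++) {
    total += 1;
  }

  printf("Final Total: %llu\n", total);
  return 0;
}

The hand-rolled version above is kept for this experiment because it makes the per-thread totals visible.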

Compile with OpenMP, to hello-parallel-map-reduce-add-3:

icpc -qopenmp -o hello-parallel-map-reduce-add-3 hello-parallel-map-reduce-add-3.cc

Create a Shell Script hello-parallel-map-reduce-add-3.sh to run the binary:

echo "hello-parallel-map-reduce-add-3 starts"
cd /home/u4443/deepdive/lec-04
./hello-parallel-map-reduce-add-3
echo "hello-parallel-map-reduce-add-3 ends"

Submit job to the KNL cluster node via qsub:

[u4443@c001 lec-04]$ qsub hello-parallel-map-reduce-add-3.sh -l nodes=1:knl
21173.c001

Check Output:

[u4443@c001 lec-04]$ cat hello-parallel-map-reduce-add-3.sh.o21173

########################################################################
# Colfax Cluster - https://colfaxresearch.com/
#      Date:           Thu Aug 10 12:10:53 PDT 2017
#    Job ID:           21173.c001
#      User:           u4443
# Resources:           neednodes=1:knl,nodes=1:knl,walltime=24:00:00
########################################################################

hello-parallel-map-reduce-add-3 starts

OpenMP with 256 threads
Numbers to add: 18,000,000,000,000,000,000
----------
Thread 252 total: 70,312,500,000,000,000
 Thread 233 total: 70,312,500,000,000,000
 Thread 51 total: 70,312,500,000,000,000
 Thread 244 total: 70,312,500,000,000,000
 Thread 1 total: 70,312,500,000,000,000
 Thread 227 total: 70,312,500,000,000,000
 Thread 157 total: 70,312,500,000,000,000
 Thread 23 total: 70,312,500,000,000,000
 Thread 63 total: 70,312,500,000,000,000
 Thread 141 total: 70,312,500,000,000,000
 Thread 86 total: 70,312,500,000,000,000
 Thread 220 total: 70,312,500,000,000,000
 Thread 221 total: 70,312,500,000,000,000
 Thread 214 total: 70,312,500,000,000,000
 Thread 156 total: 70,312,500,000,000,000
 Thread 60 total: 70,312,500,000,000,000
 Thread 208 total: 70,312,500,000,000,000
 Thread 228 total: 70,312,500,000,000,000
 Thread 132 total: 70,312,500,000,000,000
 Thread 93 total: 70,312,500,000,000,000
 Thread 0 total: 70,312,500,000,000,000
 Thread 20 total: 70,312,500,000,000,000
 Thread 21 total: 70,312,500,000,000,000
 Thread 146 total: 70,312,500,000,000,000
 Thread 145 total: 70,312,500,000,000,000
 Thread 100 total: 70,312,500,000,000,000
 Thread 152 total: 70,312,500,000,000,000
 Thread 27 total: 70,312,500,000,000,000
 Thread 18 total: 70,312,500,000,000,000
 Thread 71 total: 70,312,500,000,000,000
 Thread 198 total: 70,312,500,000,000,000
 Thread 137 total: 70,312,500,000,000,000
 Thread 222 total: 70,312,500,000,000,000
 Thread 169 total: 70,312,500,000,000,000
 Thread 139 total: 70,312,500,000,000,000
 Thread 199 total: 70,312,500,000,000,000
 Thread 7 total: 70,312,500,000,000,000
 Thread 97 total: 70,312,500,000,000,000
 Thread 107 total: 70,312,500,000,000,000
 Thread 105 total: 70,312,500,000,000,000
 Thread 166 total: 70,312,500,000,000,000
 Thread 67 total: 70,312,500,000,000,000
 Thread 171 total: 70,312,500,000,000,000
 Thread 104 total: 70,312,500,000,000,000
 Thread 94 total: 70,312,500,000,000,000
 Thread 219 total: 70,312,500,000,000,000
 Thread 192 total: 70,312,500,000,000,000
 Thread 43 total: 70,312,500,000,000,000
 Thread 95 total: 70,312,500,000,000,000
 Thread 40 total: 70,312,500,000,000,000
 Thread 122 total: 70,312,500,000,000,000
 Thread 52 total: 70,312,500,000,000,000
 Thread 58 total: 70,312,500,000,000,000
 Thread 238 total: 70,312,500,000,000,000
 Thread 180 total: 70,312,500,000,000,000
 Thread 39 total: 70,312,500,000,000,000
 Thread 48 total: 70,312,500,000,000,000
 Thread 112 total: 70,312,500,000,000,000
 Thread 50 total: 70,312,500,000,000,000
 Thread 103 total: 70,312,500,000,000,000
 Thread 175 total: 70,312,500,000,000,000
 Thread 128 total: 70,312,500,000,000,000
 Thread 162 total: 70,312,500,000,000,000
 Thread 253 total: 70,312,500,000,000,000
 Thread 226 total: 70,312,500,000,000,000
 Thread 163 total: 70,312,500,000,000,000
 Thread 125 total: 70,312,500,000,000,000
 Thread 188 total: 70,312,500,000,000,000
 Thread 28 total: 70,312,500,000,000,000
 Thread 124 total: 70,312,500,000,000,000
 Thread 144 total: 70,312,500,000,000,000
 Thread 129 total: 70,312,500,000,000,000
 Thread 80 total: 70,312,500,000,000,000
 Thread 254 total: 70,312,500,000,000,000
 Thread 41 total: 70,312,500,000,000,000
 Thread 62 total: 70,312,500,000,000,000
 Thread 204 total: 70,312,500,000,000,000
 Thread 34 total: 70,312,500,000,000,000
 Thread 255 total: 70,312,500,000,000,000
 Thread 206 total: 70,312,500,000,000,000
 Thread 82 total: 70,312,500,000,000,000
 Thread 140 total: 70,312,500,000,000,000
 Thread 134 total: 70,312,500,000,000,000
 Thread 13 total: 70,312,500,000,000,000
 Thread 151 total: 70,312,500,000,000,000
 Thread 202 total: 70,312,500,000,000,000
 Thread 70 total: 70,312,500,000,000,000
 Thread 251 total: 70,312,500,000,000,000
 Thread 79 total: 70,312,500,000,000,000
 Thread 56 total: 70,312,500,000,000,000
 Thread 47 total: 70,312,500,000,000,000
 Thread 2 total: 70,312,500,000,000,000
 Thread 142 total: 70,312,500,000,000,000
 Thread 143 total: 70,312,500,000,000,000
 Thread 165 total: 70,312,500,000,000,000
 Thread 154 total: 70,312,500,000,000,000
 Thread 85 total: 70,312,500,000,000,000
 Thread 17 total: 70,312,500,000,000,000
 Thread 84 total: 70,312,500,000,000,000
 Thread 149 total: 70,312,500,000,000,000
 Thread 89 total: 70,312,500,000,000,000
 Thread 225 total: 70,312,500,000,000,000
 Thread 191 total: 70,312,500,000,000,000
 Thread 4 total: 70,312,500,000,000,000
 Thread 211 total: 70,312,500,000,000,000
 Thread 81 total: 70,312,500,000,000,000
 Thread 77 total: 70,312,500,000,000,000
 Thread 99 total: 70,312,500,000,000,000
 Thread 3 total: 70,312,500,000,000,000
 Thread 135 total: 70,312,500,000,000,000
 Thread 92 total: 70,312,500,000,000,000
 Thread 138 total: 70,312,500,000,000,000
 Thread 150 total: 70,312,500,000,000,000
 Thread 185 total: 70,312,500,000,000,000
 Thread 153 total: 70,312,500,000,000,000
 Thread 24 total: 70,312,500,000,000,000
 Thread 74 total: 70,312,500,000,000,000
 Thread 33 total: 70,312,500,000,000,000
 Thread 32 total: 70,312,500,000,000,000
 Thread 155 total: 70,312,500,000,000,000
 Thread 216 total: 70,312,500,000,000,000
 Thread 14 total: 70,312,500,000,000,000
 Thread 148 total: 70,312,500,000,000,000
 Thread 16 total: 70,312,500,000,000,000
 Thread 109 total: 70,312,500,000,000,000
 Thread 130 total: 70,312,500,000,000,000
 Thread 9 total: 70,312,500,000,000,000
 Thread 232 total: 70,312,500,000,000,000
 Thread 106 total: 70,312,500,000,000,000
 Thread 66 total: 70,312,500,000,000,000
 Thread 38 total: 70,312,500,000,000,000
 Thread 201 total: 70,312,500,000,000,000
 Thread 11 total: 70,312,500,000,000,000
 Thread 223 total: 70,312,500,000,000,000
 Thread 90 total: 70,312,500,000,000,000
 Thread 133 total: 70,312,500,000,000,000
 Thread 194 total: 70,312,500,000,000,000
 Thread 42 total: 70,312,500,000,000,000
 Thread 57 total: 70,312,500,000,000,000
 Thread 224 total: 70,312,500,000,000,000
 Thread 240 total: 70,312,500,000,000,000
 Thread 159 total: 70,312,500,000,000,000
 Thread 197 total: 70,312,500,000,000,000
 Thread 161 total: 70,312,500,000,000,000
 Thread 113 total: 70,312,500,000,000,000
 Thread 200 total: 70,312,500,000,000,000
 Thread 73 total: 70,312,500,000,000,000
 Thread 236 total: 70,312,500,000,000,000
 Thread 131 total: 70,312,500,000,000,000
 Thread 158 total: 70,312,500,000,000,000
 Thread 176 total: 70,312,500,000,000,000
 Thread 172 total: 70,312,500,000,000,000
 Thread 243 total: 70,312,500,000,000,000
 Thread 242 total: 70,312,500,000,000,000
 Thread 183 total: 70,312,500,000,000,000
 Thread 110 total: 70,312,500,000,000,000
 Thread 239 total: 70,312,500,000,000,000
 Thread 117 total: 70,312,500,000,000,000
 Thread 111 total: 70,312,500,000,000,000
 Thread 87 total: 70,312,500,000,000,000
 Thread 127 total: 70,312,500,000,000,000
 Thread 5 total: 70,312,500,000,000,000
 Thread 209 total: 70,312,500,000,000,000
 Thread 246 total: 70,312,500,000,000,000
 Thread 213 total: 70,312,500,000,000,000
 Thread 249 total: 70,312,500,000,000,000
 Thread 31 total: 70,312,500,000,000,000
 Thread 234 total: 70,312,500,000,000,000
 Thread 230 total: 70,312,500,000,000,000
 Thread 170 total: 70,312,500,000,000,000
 Thread 54 total: 70,312,500,000,000,000
 Thread 30 total: 70,312,500,000,000,000
 Thread 179 total: 70,312,500,000,000,000
 Thread 59 total: 70,312,500,000,000,000
 Thread 53 total: 70,312,500,000,000,000
 Thread 174 total: 70,312,500,000,000,000
 Thread 245 total: 70,312,500,000,000,000
 Thread 46 total: 70,312,500,000,000,000
 Thread 64 total: 70,312,500,000,000,000
 Thread 108 total: 70,312,500,000,000,000
 Thread 98 total: 70,312,500,000,000,000
 Thread 35 total: 70,312,500,000,000,000
 Thread 61 total: 70,312,500,000,000,000
 Thread 189 total: 70,312,500,000,000,000
 Thread 229 total: 70,312,500,000,000,000
 Thread 101 total: 70,312,500,000,000,000
 Thread 215 total: 70,312,500,000,000,000
 Thread 147 total: 70,312,500,000,000,000
 Thread 37 total: 70,312,500,000,000,000
 Thread 83 total: 70,312,500,000,000,000
 Thread 205 total: 70,312,500,000,000,000
 Thread 126 total: 70,312,500,000,000,000
 Thread 29 total: 70,312,500,000,000,000
 Thread 121 total: 70,312,500,000,000,000
 Thread 212 total: 70,312,500,000,000,000
 Thread 96 total: 70,312,500,000,000,000
 Thread 36 total: 70,312,500,000,000,000
 Thread 69 total: 70,312,500,000,000,000
 Thread 68 total: 70,312,500,000,000,000
 Thread 184 total: 70,312,500,000,000,000
 Thread 76 total: 70,312,500,000,000,000
 Thread 26 total: 70,312,500,000,000,000
 Thread 196 total: 70,312,500,000,000,000
 Thread 65 total: 70,312,500,000,000,000
 Thread 12 total: 70,312,500,000,000,000
 Thread 22 total: 70,312,500,000,000,000
 Thread 160 total: 70,312,500,000,000,000
 Thread 168 total: 70,312,500,000,000,000
 Thread 25 total: 70,312,500,000,000,000
 Thread 193 total: 70,312,500,000,000,000
 Thread 15 total: 70,312,500,000,000,000
 Thread 91 total: 70,312,500,000,000,000
 Thread 218 total: 70,312,500,000,000,000
 Thread 120 total: 70,312,500,000,000,000
 Thread 10 total: 70,312,500,000,000,000
 Thread 164 total: 70,312,500,000,000,000
 Thread 203 total: 70,312,500,000,000,000
 Thread 207 total: 70,312,500,000,000,000
 Thread 19 total: 70,312,500,000,000,000
 Thread 210 total: 70,312,500,000,000,000
 Thread 195 total: 70,312,500,000,000,000
 Thread 173 total: 70,312,500,000,000,000
 Thread 75 total: 70,312,500,000,000,000
 Thread 49 total: 70,312,500,000,000,000
 Thread 8 total: 70,312,500,000,000,000
 Thread 186 total: 70,312,500,000,000,000
 Thread 123 total: 70,312,500,000,000,000
 Thread 181 total: 70,312,500,000,000,000
 Thread 235 total: 70,312,500,000,000,000
 Thread 72 total: 70,312,500,000,000,000
 Thread 118 total: 70,312,500,000,000,000
 Thread 182 total: 70,312,500,000,000,000
 Thread 247 total: 70,312,500,000,000,000
 Thread 55 total: 70,312,500,000,000,000
 Thread 119 total: 70,312,500,000,000,000
 Thread 88 total: 70,312,500,000,000,000
 Thread 102 total: 70,312,500,000,000,000
 Thread 44 total: 70,312,500,000,000,000
 Thread 45 total: 70,312,500,000,000,000
 Thread 177 total: 70,312,500,000,000,000
 Thread 115 total: 70,312,500,000,000,000
 Thread 114 total: 70,312,500,000,000,000
 Thread 178 total: 70,312,500,000,000,000
 Thread 167 total: 70,312,500,000,000,000
 Thread 231 total: 70,312,500,000,000,000
 Thread 237 total: 70,312,500,000,000,000
 Thread 190 total: 70,312,500,000,000,000
 Thread 136 total: 70,312,500,000,000,000
 Thread 217 total: 70,312,500,000,000,000
 Thread 78 total: 70,312,500,000,000,000
 Thread 248 total: 70,312,500,000,000,000
 Thread 250 total: 70,312,500,000,000,000
 Thread 6 total: 70,312,500,000,000,000
 Thread 116 total: 70,312,500,000,000,000
 Thread 241 total: 70,312,500,000,000,000
 Thread 187 total: 70,312,500,000,000,000

--------
Final Total: 18,000,000,000,000,000,000 (must equal to 18,000,000,000,000,000,000)
Overall elapsed time: 5.160000 ms
hello-parallel-map-reduce-add-3 ends

########################################################################
# Colfax Cluster
# End of output for job 21173.c001
# Date: Thu Aug 10 12:10:54 PDT 2017
########################################################################

The job completes in ~5 milliseconds. This is pretty impressive, considering that the Config 1 serial mode seems to run forever (13+ hours).

Config 3: Parallel Reduction Mode - with 1 OpenMP thread - loop 18 billion billion times

Inspired by the 5-millisecond run with 256 threads, we now wonder how the same code performs with just 1 thread. We simply reuse the Config 2 code, except that in our shell script we restrict OpenMP to 1 thread (via the environment variable OMP_NUM_THREADS).

Shell Script hello-parallel-map-reduce-add-3b.sh:

echo "hello-parallel-map-reduce-add-3 starts"
export OMP_NUM_THREADS=1
cd /home/u4443/deepdive/lec-04
./hello-parallel-map-reduce-add-3
echo "hello-parallel-map-reduce-add-3 ends"

Submit job to KNL node via qsub:

[u4443@c001 lec-04]$ qsub hello-parallel-map-reduce-add-3b.sh -l nodes=1:knl
21174.c001

Output:

[u4443@c001 lec-04]$ cat hello-parallel-map-reduce-add-3b.sh.o21174

########################################################################
# Colfax Cluster - https://colfaxresearch.com/
#      Date:           Thu Aug 10 12:18:35 PDT 2017
#    Job ID:           21174.c001
#      User:           u4443
# Resources:           neednodes=1:knl,nodes=1:knl,walltime=24:00:00
########################################################################

hello-parallel-map-reduce-add-3 starts

OpenMP with 1 threads
Numbers to add: 18,000,000,000,000,000,000
----------
Thread 0 total: 18,000,000,000,000,000,000

--------
Final Total: 18,000,000,000,000,000,000 (must equal to 18,000,000,000,000,000,000)
Overall elapsed time: 0.050000 ms
hello-parallel-map-reduce-add-3 ends

########################################################################
# Colfax Cluster
# End of output for job 21174.c001
# Date: Thu Aug 10 12:18:36 PDT 2017
########################################################################

[u4443@c001 lec-04]$

Wow, 0.05 milliseconds! (Recall that in Config 2 the same code with 256 threads took ~5 milliseconds.) This is a pretty interesting observation: it appears that by restructuring our code for OpenMP, it can run incredibly fast with just 1 thread!

Let's try one more thing. In Config 2 and 3 we have a printf statement that prints each thread's total. To make the comparison between Config 2 and Config 3 a bit fairer (256 thread-total printfs vs 1), let's comment out the following line in the code and try again (we do this in the next 2 sections):

// comment out this line for next 2 tests
//printf("Thread %u total: %'llu\n ", omp_get_thread_num(), total_thr);

Config 2 - without printing thread total

Same code as Config 2, with the thread-total printf inside the parallel region commented out. We observe a reduction in run time.

[u4443@c001 lec-04]$ qsub hello-parallel-map-reduce-add-4.sh -l nodes=1:knl
21176.c001
[u4443@c001 lec-04]$ cat hello-parallel-map-reduce-add-4.sh.o21176
cat: hello-parallel-map-reduce-add-4.sh.o21176: No such file or directory
[u4443@c001 lec-04]$ cat hello-parallel-map-reduce-add-4.sh.o21176

########################################################################
# Colfax Cluster - https://colfaxresearch.com/
#      Date:           Thu Aug 10 12:53:51 PDT 2017
#    Job ID:           21176.c001
#      User:           u4443
# Resources:           neednodes=1:knl,nodes=1:knl,walltime=24:00:00
########################################################################

hello-parallel-map-reduce-add-4 starts

OpenMP with 256 threads
Numbers to add: 18,000,000,000,000,000,000
----------

--------
Final Total: 18,000,000,000,000,000,000 (must equal to 18,000,000,000,000,000,000)
Overall elapsed time: 0.640000 ms
hello-parallel-map-reduce-add-4 ends

########################################################################
# Colfax Cluster
# End of output for job 21176.c001
# Date: Thu Aug 10 12:53:52 PDT 2017
########################################################################

For Config 2, without the print statements the elapsed time drops from ~5 milliseconds to 0.64 milliseconds. It's sub-millisecond, so we are pretty happy.

Config 3 - without printing thread total

Same code as Config 3, with the thread-total printf inside the parallel region commented out. We observe a slight reduction in run time.

[u4443@c001 lec-04]$ qsub hello-parallel-map-reduce-add-4b.sh -l nodes=1:knl
21177.c001
[u4443@c001 lec-04]$ cat hello-parallel-map-reduce-add-4b.sh.o21177

########################################################################
# Colfax Cluster - https://colfaxresearch.com/
#      Date:           Thu Aug 10 12:58:44 PDT 2017
#    Job ID:           21177.c001
#      User:           u4443
# Resources:           neednodes=1:knl,nodes=1:knl,walltime=24:00:00
########################################################################

hello-parallel-map-reduce-add-4 starts

OpenMP with 1 threads
Numbers to add: 18,000,000,000,000,000,000
----------

--------
Final Total: 18,000,000,000,000,000,000 (must equal to 18,000,000,000,000,000,000)
Overall elapsed time: 0.030000 ms
hello-parallel-map-reduce-add-4 ends

########################################################################
# Colfax Cluster
# End of output for job 21177.c001
# Date: Thu Aug 10 12:58:45 PDT 2017
########################################################################

For Config 3, without the print statement the elapsed time drops from 0.05 milliseconds to 0.03 milliseconds. A slight reduction, as expected.

A Note on Elapsed Time

The code samples here use the start and end times from the clock() utility provided by <time.h> to compute duration. Only afterwards did I learn that omp_get_wtime() from <omp.h> can do something similar; it returns wall-clock time, which is generally more appropriate for timing multithreaded code, since clock() measures CPU time for the whole process rather than elapsed wall time. Worth trying out next time.
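
A minimal sketch of what the timing might look like with omp_get_wtime() (illustrative only, not the code that produced the results above):

#include <omp.h>
#include <cstdio>

int main() {
  // omp_get_wtime() returns elapsed wall-clock time in seconds as a double
  const double t_start = omp_get_wtime();

  // ... the work to be timed would go here ...

  const double t_end = omp_get_wtime();
  printf("Overall elapsed time: %f ms\n", (t_end - t_start) * 1000.0);
  return 0;
}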

A Note on OpenMP compilation Report

To see which part of the code was optimised in what way, we could add the -qopt-report option in our compilation script:

icpc -qopenmp -o hello-parallel-map-reduce-add-3 hello-parallel-map-reduce-add-3.cc -qopt-report

This will output a report called hello-parallel-map-reduce-add-3.optrpt. It looks like this (note that the for loop was not vectorized):

[u4443@c001 lec-04]$ cat hello-parallel-map-reduce-add-3.optrpt
Intel(R) Advisor can now assist with vectorization and show optimization
  report messages with your source code.
See "https://software.intel.com/en-us/intel-advisor-xe" for details.


    Report from: Interprocedural optimizations [ipo]

INLINING OPTION VALUES:
  -inline-factor: 100
  -inline-min-size: 30
  -inline-max-size: 230
  -inline-max-total-size: 2000
  -inline-max-per-routine: 10000
  -inline-max-per-compile: 500000


Begin optimization report for: main()

    Report from: Interprocedural optimizations [ipo]

INLINE REPORT: (main()) [1] hello-parallel-map-reduce-add-3.cc(6,13)


    Report from: OpenMP optimizations [openmp]

hello-parallel-map-reduce-add-3.cc(43:7-43:7):OMP:main:  OpenMP multithreaded code generation for ATOMIC was successful
hello-parallel-map-reduce-add-3.cc(28:1-28:1):OMP:main:  OpenMP DEFINED REGION WAS PARALLELIZED

    Report from: Loop nest, Vector & Auto-parallelization optimizations [loop, vec, par]


LOOP BEGIN at hello-parallel-map-reduce-add-3.cc(34,1)
   remark #15335: loop was not vectorized: vectorization possible but seems inefficient. Use vector always directive or -vec-threshold0 to override
   remark #25456: Number of Array Refs Scalar Replaced In Loop: 1
LOOP END

    Report from: Code generation optimizations [cg]

hello-parallel-map-reduce-add-3.cc(6,13):remark #34051: REGISTER ALLOCATION : [main] hello-parallel-map-reduce-add-3.cc:6

    Hardware registers
        Reserved     :    2[ rsp rip]
        Available    :   39[ rax rdx rcx rbx rbp rsi rdi r8-r15 mm0-mm7 zmm0-zmm15]
        Callee-save  :    6[ rbx rbp r12-r15]
        Assigned     :   14[ rax rdx rcx rbx rsi rdi r8-r10 r12-r15 zmm0]

    Routine temporaries
        Total         :     155
            Global    :      25
            Local     :     130
        Regenerable   :      84
        Spilled       :       0

    Routine stack
        Variables     :      60 bytes*
            Reads     :       8 [2.00e+00 ~ 1.2%]
            Writes    :       9 [2.30e+01 ~ 13.5%]
        Spills        :      40 bytes*
            Reads     :      10 [5.00e+00 ~ 2.9%]
            Writes    :      10 [0.00e+00 ~ 0.0%]

    Notes

        *Non-overlapping variables and spills may share stack space,
         so the total stack size might be less than this.


===========================================================================
[u4443@c001 lec-04]$

I was hoping this report would shed some light on why 1 thread "beats" 256 threads by such a margin. I may come back to this later.

A Note on Warming up Threads

It was also later realised that we could "warm up" the threads by adding an artificial empty parallel block like this in the code:

  // warm up threads
#pragma omp parallel
  {
  }

I did so and re-ran Config 2 and 3. Not much change in performance.

Conclusion

We performed an experiment to execute the "add 1" operation 18 billion billion times. The experiment suggests that, with the OpenMP framework and code optimisation (with #pragma omp), we can apparently perform the task in a sub-millisecond timeframe (Config 2: ~0.64 ms with 256 threads; Config 3: ~0.03 ms with 1 thread). The fact that the optimised code can finish in sub-millisecond on just 1 thread is interesting: it suggests that with code optimisation the program can run extremely fast even on a single thread. Without OpenMP and code optimisation, the job (Config 1) runs "forever" (more than 13 hours). Based on this, OpenMP and code optimisation appear to improve code performance significantly. When running an operation 18 billion billion times, this is not just a "nice" feature but a necessary requirement.

References


Intel Colfax Cluster - Notes - Index Page
