@goyalankit
Last active August 29, 2015 13:57
Results using the macpo compiler. Trace comparison.
simple loop with single variable being written + vectorization.
for(i=0; i<n; i++){
  c[i] = a[i] + b[i];
}
  • Same compiler (Rose compiler, no vectorization): the valgrind and macpo traces were exactly the same.

  • icc (vectorization): code was vectorized. Traces were the same.


simple loop with multiple variables being written + vectorization.
int n = 10;
for(i=0; i<n; i++){
  c[i] = a[i] + b[i];
  a[i] = c[i] + b[i];
}
  • Same compiler (Rose compiler): traces were the same. Code was not vectorized.

    • Levenshtein distance: a=0, b=0, c=0
  • icc (vectorization): traces don't match.

    • Levenshtein distance: a=18, b=10, c=10
  • icc (vectorization explicitly off): traces were different.

    • Levenshtein distance: a=0, b=10, c=10 (macpo reads b twice and also reads c)
  • No vectorization.
  • macpo reads variable b twice.
  • icc never reads variable c.
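
The Levenshtein distances above compare the access sequence of one variable in the two traces. The notes do not say how accesses were encoded, so the following is only a sketch: `access_t` and `trace_levenshtein` are hypothetical names, and treating each access as a (kind, index) pair is an assumption.

```c
#include <stddef.h>
#include <stdlib.h>

/* One traced access: 'R' or 'W' plus the array index touched.
 * This encoding is an assumption made for illustration. */
typedef struct {
    char kind;
    int index;
} access_t;

static int same_access(access_t x, access_t y) {
    return x.kind == y.kind && x.index == y.index;
}

/* Two-row dynamic-programming edit distance between traces s and t.
 * Returns -1 if memory cannot be allocated. */
long trace_levenshtein(const access_t *s, size_t n,
                       const access_t *t, size_t m) {
    long *prev = malloc((m + 1) * sizeof *prev);
    long *cur = malloc((m + 1) * sizeof *cur);
    if (prev == NULL || cur == NULL) {
        free(prev);
        free(cur);
        return -1;
    }
    for (size_t j = 0; j <= m; j++)
        prev[j] = (long)j;              /* distance from the empty prefix */
    for (size_t i = 1; i <= n; i++) {
        cur[0] = (long)i;
        for (size_t j = 1; j <= m; j++) {
            long sub = prev[j - 1] + (same_access(s[i - 1], t[j - 1]) ? 0 : 1);
            long del = prev[j] + 1;
            long ins = cur[j - 1] + 1;
            long best = sub < del ? sub : del;
            cur[j] = best < ins ? best : ins;
        }
        long *tmp = prev;               /* reuse the two rows */
        prev = cur;
        cur = tmp;
    }
    long d = prev[m];
    free(prev);
    free(cur);
    return d;
}
```

Under this encoding, a trace that reads b[0]..b[9] once each and a trace that reads each element twice in a row differ by 10 insertions, which is consistent with b=10 in the vectorization-off case.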
ICC Compiler(valgrind trace) with vectorization.

icc   macpo
40    60
Ra0   Ra0
Ra1   Rb0
Ra2   Wc0
Rb0   Rc0
Rb1   Rb0
Rb2   Wa0
Wc0   Ra1
Wc1   Rb1
Wc2   Wc1
Wa0   Rc1
Wa1   Rb1
Wa2   Wa1

n = 10:
All "a" values are loaded in batches of 2.
All "b" values are loaded in batches of 2.
All "c" values are written in batches of 2.
All "a" values are written in batches of 2.

The compiler is vectorizing here: each load is 16 bytes, i.e. two double values.
I think the compiler keeps the results in registers and then writes them all at once.

n = 20:
4 "a" values are loaded in batches of 2.
4 "b" values are loaded in batches of 2.
4 "c" values are written in batches of 2.
4 "a" values are written in batches of 2.
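
To make the register guess concrete, here is a small sketch that generates the two access patterns for this loop. It is a model, not output from either tool: `source_level_trace` emits one access per array reference in the source (the shape of the macpo column), and `register_cached_trace` assumes b[i] and c[i] stay in registers between the two statements. Both function names are made up for this note, and the batching order of the vectorized trace is not modelled.

```c
#include <stdio.h>
#include <stddef.h>

#define LABEL_LEN 16

/* Store a label such as "Ra0" if there is room; always advance the count. */
static size_t emit(char out[][LABEL_LEN], size_t cap, size_t pos,
                   char kind, char var, int idx) {
    if (pos < cap)
        snprintf(out[pos], LABEL_LEN, "%c%c%d", kind, var, idx);
    return pos + 1;
}

/* Every array reference in the source becomes one access. */
size_t source_level_trace(int n, char out[][LABEL_LEN], size_t cap) {
    size_t pos = 0;
    for (int i = 0; i < n; i++) {
        pos = emit(out, cap, pos, 'R', 'a', i);  /* c[i] = a[i] + b[i]; */
        pos = emit(out, cap, pos, 'R', 'b', i);
        pos = emit(out, cap, pos, 'W', 'c', i);
        pos = emit(out, cap, pos, 'R', 'c', i);  /* a[i] = c[i] + b[i]; */
        pos = emit(out, cap, pos, 'R', 'b', i);
        pos = emit(out, cap, pos, 'W', 'a', i);
    }
    return pos;
}

/* Assumed model: b[i] and c[i] are kept in registers, so the second
 * statement issues no extra reads. */
size_t register_cached_trace(int n, char out[][LABEL_LEN], size_t cap) {
    size_t pos = 0;
    for (int i = 0; i < n; i++) {
        pos = emit(out, cap, pos, 'R', 'a', i);
        pos = emit(out, cap, pos, 'R', 'b', i);
        pos = emit(out, cap, pos, 'W', 'c', i);
        pos = emit(out, cap, pos, 'W', 'a', i);
    }
    return pos;
}
```

For n = 10 the two functions return 60 and 40. These coincide with the 60 and 40 at the top of the macpo and icc columns, assuming that row is the trace length.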

---------------------------------------------------------------

ICC Vectorization OFF

 R:b:0
 R:b:1
 R:b:2
 R:b:3
 
 R:a:0
 W:c:0
 W:a:0
 
 R:a:1
 W:c:1
 W:a:1
 
 R:a:2
 W:c:2
 W:a:2



simple loop with single variable being written + Loop invariant.
for (i = 0; i < n; i++) {
  a[i] = b[0];
}

  • Rose compiler: traces were the same.

  • icc (with vectorization): traces were different. Code was vectorized. Loop-invariant optimization not performed.

    • Levenshtein distance: b=9, a=0 (b was read 10 times). No loop-invariant optimization.
    • icc needs a total of 5 write instructions because of vectorization.
  • icc (without vectorization): traces were different. Loop-invariant optimization not performed.

    • Levenshtein distance: b=9, a=0 (b was read 10 times). No loop-invariant optimization.
    • icc needs a total of 5 write instructions because of vectorization.
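
For reference, a sketch of what the loop looks like with and without the invariant load of b[0] hoisted. The read counters are bookkeeping added for illustration and the function names are hypothetical. With n = 10 the first form reads b ten times and the second reads it once, a difference of 9.

```c
#include <stddef.h>

/* Source-level form: b[0] is referenced in every iteration.
 * Returns how many times b was read. */
size_t fill_naive(double *a, const double *b, size_t n) {
    size_t reads_of_b = 0;
    for (size_t i = 0; i < n; i++) {
        a[i] = b[0];
        reads_of_b++;
    }
    return reads_of_b;
}

/* Loop-invariant form: b[0] is loaded once before the loop.
 * Returns how many times b was read. */
size_t fill_hoisted(double *a, const double *b, size_t n) {
    if (n == 0)
        return 0;
    double b0 = b[0];   /* the single read of b */
    for (size_t i = 0; i < n; i++)
        a[i] = b0;
    return 1;
}
```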

Matrix-matrix multiplication
int n = 100;
for (i = 0; i < n; i++) {
  for (k = 0; k < n; k++) {
    for (j = 0; j < n; j++) {
      c[i][j] += a[i][k] * b[k][j];
    }
  }
}
  • Rose compiler: the valgrind and macpo traces were exactly the same. Note: a macpo RW record is counted as an R followed by a W.

  • icc (w/o vectorization): traces were not the same.

    • Levenshtein distance: b=200, a=100, c=1800
  • icc: 2100 R/W operations; macpo: 4000 RW operations. Note: a macpo RW record is counted as an R followed by a W.
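
As a sanity check on the macpo count, the sketch below counts source-level accesses for the triple loop, with `c[i][j] += ...` taken as one read and one write of c plus one read each of a and b. `mm_counts` and `count_matmul_accesses` are hypothetical helpers. Under this model n = 10 gives 4 * 10^3 = 4000 accesses, which matches the macpo figure, while n = 100 would give 4,000,000. Whether the traces were collected with a smaller n than the listing shows is not stated, so treat that as an assumption.

```c
#include <stddef.h>

/* Per-array access counts for the i-k-j matrix multiplication loop. */
typedef struct {
    unsigned long reads_a;
    unsigned long reads_b;
    unsigned long reads_c;
    unsigned long writes_c;
} mm_counts;

/* Count one access per array reference in the source:
 * c[i][j] += a[i][k] * b[k][j] reads c, a, b and writes c. */
mm_counts count_matmul_accesses(size_t n) {
    mm_counts k = {0, 0, 0, 0};
    for (size_t i = 0; i < n; i++)
        for (size_t kk = 0; kk < n; kk++)
            for (size_t j = 0; j < n; j++) {
                k.reads_c++;
                k.reads_a++;
                k.reads_b++;
                k.writes_c++;
            }
    return k;
}

/* Total number of accesses across all three arrays. */
unsigned long mm_total(mm_counts k) {
    return k.reads_a + k.reads_b + k.reads_c + k.writes_c;
}
```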

for(i=0; i<n; i++){
  c[i] = a[i] + b[i];
}
Sampling: Disabled; Vectorization: OFF
Compiler: ICC; Trace: Valgrind; 

[macpo] Reuse distances:
var: a: 2 (8 times) 0 (7 times) inf. (2 times).
var: b: 2 (8 times) inf. (3 times) 3 (3 times).
var: c: 2 (10 times) 1 (5 times) inf. (3 times).

[macpo] Cache conflicts:
var: a, conflict ratio: 0%.
var: b, conflict ratio: 0%.
var: c, conflict ratio: 0%.

[macpo] Analyzing records for stride values.
var: a: 1 (19 times).
var: b: 1 (19 times).
var: c: 1 (19 times).

Sampling: Disabled; Vectorization: OFF
Compiler: Rose; Trace: Valgrind;

[macpo] Reuse distances:
var: a: 2 (16 times) inf. (3 times) 8 (1 times).
var: b: 2 (16 times) inf. (2 times) 37 (1 times).
var: c: 2 (16 times) inf. (3 times) 8 (1 times).

[macpo] Cache conflicts:
var: a, conflict ratio: 0%.
var: b, conflict ratio: 0%.
var: c, conflict ratio: 0%.

[macpo] Analyzing records for stride values.
var: a: 1 (19 times).
var: b: 1 (19 times).
var: c: 1 (19 times).
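
A rough way to read the stride lines: if the stride is the difference between consecutive accesses to the same variable, a sequential walk over 20 elements yields stride 1 exactly 19 times. The sketch below counts strides in units of array elements. That unit, the value n = 20, and the name `count_stride` are assumptions, since macpo's own definition is not given in these notes.

```c
#include <stddef.h>

/* Count how often consecutive accesses in `index` differ by `stride`.
 * `index` holds the element index touched by each access, in trace order. */
size_t count_stride(const long *index, size_t n, long stride) {
    size_t hits = 0;
    for (size_t i = 1; i < n; i++) {
        if (index[i] - index[i - 1] == stride)
            hits++;
    }
    return hits;
}
```

A trace with fewer than two accesses has no strides, so the function returns 0 for it.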


int n = 10;
for(i=0; i<n; i++){
  c[i] = a[i] + b[i];
  a[i] = c[i] + b[i];
}
Sampling: Disabled
Compiler: ICC
Trace: Valgrind
Vectorization: OFF

[macpo] Analyzing logs created from the binary /work/0268 at Tue Mar 18 19:57:00 2014

[macpo] Analyzing records for latency.

[macpo] Reuse distances:
var: b: 0 (13 times) inf. (2 times) 16 (2 times).
var: a: 0 (14 times) 1 (8 times) 2 (8 times).
var: c: 2 (13 times) inf. (3 times) 10 (2 times).

[macpo] Cache conflicts:
var: b, conflict ratio: 0%.
var: a, conflict ratio: 0%.
var: c, conflict ratio: 0%.

[macpo] Analyzing records for stride values.
var: b: 1 (19 times).
var: a: 0 (20 times) 1 (19 times).
var: c: 1 (19 times).

[macpo] Analyzing records for vector stride values.

Sampling: Disabled
Compiler: Rose
Trace: Valgrind

[macpo] Analyzing logs created from the binary /work/0268 at Tue Mar 18 20:03:35 2014

[macpo] Analyzing records for latency.

[macpo] Reuse distances:
var: b: 0 (10 times) 6 (6 times) inf. (2 times).
var: a: 0 (27 times) 6 (4 times) inf. (3 times).
var: c: 0 (10 times) 6 (7 times) inf. (3 times).

[macpo] Cache conflicts:
var: b, conflict ratio: 0%.
var: a, conflict ratio: 0%.
var: c, conflict ratio: 0%.

[macpo] Analyzing records for stride values.
var: b: 0 (10 times) 2 (9 times).
var: a: 0 (30 times) 2 (9 times).
var: c: 0 (10 times) 2 (9 times).

[macpo] Analyzing records for vector stride values.

 [macpo] Analyzing records for latency.

 [macpo] Reuse distances:
 var: c: 3 (20 times) 1 (17 times) inf. (3 times).
 var: a: 1 (20 times) 5 (10 times) 9 (5 times).
 var: b: 2 (32 times) 8 (5 times) inf. (2 times).

 [macpo] Cache conflicts:
 var: c, conflict ratio: 0%.
 var: a, conflict ratio: 0%.
 var: b, conflict ratio: 0%.

 [macpo] Analyzing records for stride values.
 var: c: 0 (20 times) 1 (19 times) 127 (1 times).
 var: a: 0 (20 times) 1 (19 times).
 var: b: 0 (20 times) 1 (19 times).

 [macpo] Analyzing records for vector stride values.



for (i = 0; i < n; i++) {
  a[i] = b[0];
}

Sampling: Disabled
Compiler: ICC
Trace: Valgrind

[macpo] Analyzing logs created from the binary /work/0268 at Tue Mar 18 20:14:39 2014

[macpo] Analyzing records for latency.

[macpo] Reuse distances:
var: b: inf. (1 times).
var: a: 0 (17 times) inf. (3 times).

[macpo] Cache conflicts:
var: b, conflict ratio: 0%.
var: a, conflict ratio: 0%.

[macpo] Analyzing records for stride values.
var: a: 1 (19 times).

[macpo] Analyzing records for vector stride values.

Sampling: Disabled
Compiler: Rose
Trace: Valgrind

[macpo] Analyzing logs created from the binary /work/0268 at Tue Mar 18 20:15:39 2014

[macpo] Analyzing records for latency.

[macpo] Reuse distances:
var: b: 1 (10 times) 3 (9 times) inf. (1 times).
var: a: 1 (17 times) inf. (3 times).

[macpo] Cache conflicts:
var: b, conflict ratio: 0%.
var: a, conflict ratio: 0%.

[macpo] Analyzing records for stride values.
var: b: 0 (19 times).
var: a: 1 (19 times).

[macpo] Analyzing records for vector stride values.
