for (i = 0; i < n; i++) {
    c[i] = a[i] + b[i];
}
- Same compiler (Rose compiler, no vectorization): the valgrind and macpo traces were exactly the same.
- icc (vectorization): the code was vectorized. The traces were the same.
int n = 10;
for (i = 0; i < n; i++) {
    c[i] = a[i] + b[i];
    a[i] = c[i] + b[i];
}
- Same compiler (Rose compiler): the traces were the same. The code was not vectorized.
  - Levenshtein distance: a=0, b=0, c=0
- icc (vectorization): the traces don't match.
  - Levenshtein distance: a=18, b=10, c=10
- icc (vectorization explicitly off): the traces were different.
  - Levenshtein distance: a=0, b=10, c=10 (b is read twice and c is also read by macpo.)
  - No vectorization.
  - Variable b is read twice in the macpo trace.
  - Variable c is never read in the icc trace.
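The per-variable distances above can be reproduced with a small sketch. It assumes each variable's trace is reduced to a string of `R` and `W` characters in access order; that encoding is an assumption about how the traces were compared, not something these notes state, and `trace_distance` is a hypothetical helper name.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Smallest of three values. */
static size_t min3(size_t x, size_t y, size_t z) {
    size_t m = x < y ? x : y;
    return m < z ? m : z;
}

/*
 * Levenshtein distance between two access traces, each encoded as a
 * string of 'R' and 'W' characters (assumed encoding, one char per access).
 * Uses a single rolling row, so extra memory is O(strlen(t)).
 * Returns (size_t)-1 if allocation fails.
 */
size_t trace_distance(const char *s, const char *t) {
    assert(s != NULL && t != NULL);
    size_t n = strlen(s), m = strlen(t);
    size_t *row = malloc((m + 1) * sizeof *row);
    if (row == NULL) return (size_t)-1;

    /* Distance from the empty prefix of s to each prefix of t. */
    for (size_t j = 0; j <= m; j++) row[j] = j;

    for (size_t i = 1; i <= n; i++) {
        size_t diag = row[0];          /* d[i-1][j-1] */
        row[0] = i;
        for (size_t j = 1; j <= m; j++) {
            size_t up = row[j];        /* d[i-1][j] */
            size_t cost = (s[i - 1] == t[j - 1]) ? 0 : 1;
            row[j] = min3(up + 1, row[j - 1] + 1, diag + cost);
            diag = up;
        }
    }
    size_t d = row[m];
    free(row);
    return d;
}
```

Under that encoding, with n = 10 and vectorization off, the c trace would be ten `W`s for icc against `WR` repeated ten times for macpo, and the b trace ten `R`s against twenty; both pairs are 10 edits apart, which is consistent with the b=10, c=10 figures.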
ICC compiler (valgrind trace) with vectorization.
icc    macpo
40     60      (total accesses, n = 10)
Ra0    Ra0
Ra1    Rb0
Ra2    Wc0
Rb0    Rc0
Rb1    Rb0
Rb2    Wa0
Wc0    Ra1
Wc1    Rb1
Wc2    Wc1
Wa0    Rc1
Wa1    Rb1
Wa2    Wa1
n = 10:
All "a" values are loaded in batches of 2.
All "b" values are loaded in batches of 2.
All "c" values are written in batches of 2.
All "a" values are written in batches of 2.
The compiler is vectorizing here: each load is 16 bytes, i.e. two double values.
I think the compiler keeps the results in registers and then writes them all at once.
n = 20:
4 "a" values are loaded in batches of 2.
4 "b" values are loaded in batches of 2.
4 "c" values are written in batches of 2.
4 "a" values are written in batches of 2.
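The 40 versus 60 access counts for n = 10 are consistent with a simple model, sketched below: a source-level trace records every array reference in the loop body (6 per iteration), while a compiler that reuses `c[i]` and `b[i]` from registers touches memory only 4 times per element. This is one possible explanation in line with the register guess above, not a confirmed description of either tool; it models which accesses occur, not their batching or order. `access_t`, `emit` and the two `trace_*` functions are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* One trace record: op is 'R' or 'W', var is 'a', 'b' or 'c'. */
typedef struct {
    char op;
    char var;
    int  idx;
} access_t;

/* Store a record if there is room; always returns the updated count. */
static size_t emit(access_t *out, size_t cap, size_t len,
                   char op, char var, int idx) {
    assert(out != NULL || cap == 0);
    if (len < cap) {
        out[len].op = op;
        out[len].var = var;
        out[len].idx = idx;
    }
    return len + 1;
}

/* Every array reference in the source is a memory access: 6 per iteration. */
size_t trace_source_level(int n, access_t *out, size_t cap) {
    size_t len = 0;
    for (int i = 0; i < n; i++) {
        /* c[i] = a[i] + b[i]; */
        len = emit(out, cap, len, 'R', 'a', i);
        len = emit(out, cap, len, 'R', 'b', i);
        len = emit(out, cap, len, 'W', 'c', i);
        /* a[i] = c[i] + b[i]; */
        len = emit(out, cap, len, 'R', 'c', i);
        len = emit(out, cap, len, 'R', 'b', i);
        len = emit(out, cap, len, 'W', 'a', i);
    }
    return len;
}

/* c[i] and b[i] are reused from registers: 4 accesses per iteration. */
size_t trace_register_reuse(int n, access_t *out, size_t cap) {
    size_t len = 0;
    for (int i = 0; i < n; i++) {
        len = emit(out, cap, len, 'R', 'a', i);
        len = emit(out, cap, len, 'R', 'b', i);
        len = emit(out, cap, len, 'W', 'c', i);
        len = emit(out, cap, len, 'W', 'a', i);   /* no reload of c[i] or b[i] */
    }
    return len;
}
```

Both functions return the full trace length even when the buffer is too small, so calling them with `cap = 0` gives just the count.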
---------------------------------------------------------------
ICC Vectorization OFF
R:b:0
R:b:1
R:b:2
R:b:3
R:a:0
W:c:0
W:a:0
R:a:1
W:c:1
W:a:1
R:a:2
W:c:2
W:a:2
for (i = 0; i < n; i++) {
    a[i] = b[0];
}
- Rose compiler: the traces were the same.
- icc (with vectorization): the traces were different. The code was vectorized. Loop-invariant optimization was not performed.
  - Levenshtein distance: b=9, a=0 (b was read 10 times.) No loop-invariant optimization.
  - Total of 5 write instructions in the icc case, due to vectorization.
- icc (without vectorization): the traces were different. Loop-invariant optimization was not performed.
  - Levenshtein distance: b=9, a=0 (b was read 10 times.) No loop-invariant optimization.
  - Total of 5 write instructions in the icc case, due to vectorization.
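For reference, the loop-invariant optimization these results look for can be written by hand. A minimal sketch, assuming `double` arrays and hypothetical function names; `restrict` promises the compiler that `a` and `b` do not overlap, which is what makes moving the read of `b[0]` out of the loop legal.

```c
#include <assert.h>
#include <stddef.h>

/* As in the test loop: the source reads b[0] on every iteration. */
void fill_naive(double *a, const double *b, size_t n) {
    assert(a != NULL && b != NULL);
    for (size_t i = 0; i < n; i++) {
        a[i] = b[0];
    }
}

/* Hand-hoisted: b[0] is read once before the loop. */
void fill_hoisted(double *restrict a, const double *restrict b, size_t n) {
    assert(a != NULL && b != NULL);
    if (n == 0) return;            /* do not touch b[0] for an empty range */
    const double b0 = b[0];
    for (size_t i = 0; i < n; i++) {
        a[i] = b0;
    }
}
```

With n = 10 the first form performs ten reads of b at source level and the second performs one, a difference of 9 `R` records in a per-variable trace.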
int n = 100;
for (i = 0; i < n; i++) {
    for (k = 0; k < n; k++) {
        for (j = 0; j < n; j++) {
            c[i][j] += a[i][k] * b[k][j];
        }
    }
}
- Rose compiler: both the valgrind and macpo traces were exactly the same. Note: macpo RW -> R, W (each macpo RW record is split into an R and a W).
- icc (without vectorization): the traces were not the same.
  - Levenshtein distance: b=200, a=100, c=1800
- icc: 2100 R/W operations; macpo: 4000 RW operations. Note: macpo RW -> R, W
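The note "macpo RW -> R, W" describes a normalization applied before the traces are compared. A minimal sketch of that step, assuming a read-modify-write record (such as the `+=` on `c[i][j]`) is expanded into a read followed by a write; the `rec_t` enum and `expand_rw` are hypothetical and not part of macpo's interface.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record kinds; macpo's real trace format may differ. */
typedef enum { REC_R, REC_W, REC_RW } rec_t;

/*
 * Expand records into a NUL-terminated string of 'R'/'W' characters,
 * turning each REC_RW into "RW". Returns the number of characters the
 * full expansion needs (excluding the NUL); the output is truncated
 * if cap is too small.
 */
size_t expand_rw(const rec_t *in, size_t n, char *out, size_t cap) {
    assert(in != NULL || n == 0);
    size_t len = 0;
    for (size_t i = 0; i < n; i++) {
        const char *ops = (in[i] == REC_R) ? "R"
                        : (in[i] == REC_W) ? "W"
                        : "RW";
        for (const char *p = ops; *p != '\0'; p++) {
            /* Keep one slot free for the terminating NUL. */
            if (out != NULL && len + 1 < cap) out[len] = *p;
            len++;
        }
    }
    if (out != NULL && cap > 0) out[len < cap ? len : cap - 1] = '\0';
    return len;
}
```

The resulting string can then be compared against the valgrind trace for the same variable on an access-by-access basis.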