@yuriks
Created April 10, 2017 17:04
dp4 timings
a*x + b*y + c*z + d*w
1 MUL A, X (4, 0.5)
2 MOV X, A
3 SHUFPS A, A (1, 1)
4 ADD A, X (4, 0.5)
5 MOV X, A
6 SHUFPS A, A (1, 1)
7 ADD A, X (4, 0.5)
end-to-end latency: 14
max tput: 1 -> 1 dot
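The shuffle/add sequence above is a horizontal reduction of a single 4-lane register. A minimal Python model of the lane arithmetic follows. The notes do not give the SHUFPS immediates, so the lane orders used here (swap adjacent lanes, then swap the two halves) are an assumption, and the helper names are purely illustrative:

```python
def shuffle(v, order):
    """Model SHUFPS with both sources equal: pick lanes of v by index."""
    return [v[i] for i in order]

def add(u, v):
    """Lane-wise ADD of two 4-lane registers."""
    return [p + q for p, q in zip(u, v)]

def dp4_shuffle_add(A, X):
    # 1: MUL A, X
    A = [p * q for p, q in zip(A, X)]
    # 2-4: MOV X, A / SHUFPS A, A / ADD A, X
    # (assumed shuffle pattern: swap adjacent lanes)
    A = add(shuffle(A, [1, 0, 3, 2]), A)
    # 5-7: MOV X, A / SHUFPS A, A / ADD A, X
    # (assumed shuffle pattern: swap the two halves)
    A = add(shuffle(A, [2, 3, 0, 1]), A)
    return A  # every lane now holds a*x + b*y + c*z + d*w

result = dp4_shuffle_add([1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0])
```

With these inputs every lane of `result` ends up holding the same dot product, which is why this variant yields only one dot per pass.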
1 MUL a, x (4, 0.5)
2 FMA a, b, y (4, 0.5)
3 FMA a, c, z (4, 0.5)
4 FMA a, d, w (4, 0.5)
end-to-end latency: 16
max tput: 0.5 (2 per cycle) -> 8 dots
can trivially (just by changing the emitted instructions) use AVX to get 16 dots, assuming full-rate AVX
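A rough Python model of the MUL/FMA chain, reading the lowercase operands as structure-of-arrays registers in which each lane belongs to a different dot product. That reading, and the operand order `FMA acc, u, v` meaning `acc = u * v + acc`, are interpretations of the notes rather than something they state:

```python
def mul(u, v):
    """Lane-wise MUL."""
    return [p * q for p, q in zip(u, v)]

def fma(acc, u, v):
    """Model `FMA acc, u, v` as acc = u * v + acc, per lane (assumed operand order)."""
    return [p * q + r for p, q, r in zip(u, v, acc)]

def dp4_soa(a, b, c, d, x, y, z, w):
    acc = mul(a, x)        # 1: MUL a, x
    acc = fma(acc, b, y)   # 2: FMA a, b, y
    acc = fma(acc, c, z)   # 3: FMA a, c, z
    acc = fma(acc, d, w)   # 4: FMA a, d, w
    return acc             # lane i holds the i-th dot product

# Four independent dot products, one per lane.
a, b, c, d = [1.0, 0.0, 2.0, 1.0], [2.0, 1.0, 0.0, 1.0], [3.0, 0.0, 0.0, 1.0], [4.0, 0.0, 1.0, 1.0]
x, y, z, w = [5.0, 9.0, 1.0, 1.0], [6.0, 7.0, 1.0, 2.0], [7.0, 9.0, 1.0, 3.0], [8.0, 9.0, 3.0, 4.0]
dots = dp4_soa(a, b, c, d, x, y, z, w)
```

No shuffles are needed in this layout, since no lane ever has to be combined with another lane of the same register.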
1 MUL a, x
2 MUL c, z
3 FMA a, b, y
4 FMA c, d, w
5 ADD a, c
end-to-end latency: 12
max tput: 1 -> 4 dots
1 MUL A, X (4, 0.5)
2 HADD A, A (6, 2)
3 HADD A, A (6, 2)
end-to-end latency: 16
max tput: 2 -> 0.5 dots
1 DPPS A, X (13, 1.5)
end-to-end latency: 13
tput: 1.5 -> 0.66 dots
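The end-to-end latencies quoted above can be checked by walking each sequence's dependency chain. The sketch below does that with the latencies listed in the notes. It assumes MOV costs 0 cycles (the notes give it no timing, and the total of 14 only works out that way) and it ignores throughput and port limits entirely. The function and variable names are illustrative:

```python
def critical_path(instrs):
    """instrs: list of (dest, sources, latency). Returns the longest dependency chain."""
    ready = {}   # register -> cycle at which its value becomes available
    finish = 0
    for dest, sources, latency in instrs:
        start = max((ready.get(s, 0) for s in sources), default=0)
        ready[dest] = start + latency
        finish = max(finish, ready[dest])
    return finish

shuffle_add = [
    ("A", ("A", "X"), 4), ("X", ("A",), 0), ("A", ("A",), 1), ("A", ("A", "X"), 4),
    ("X", ("A",), 0), ("A", ("A",), 1), ("A", ("A", "X"), 4),
]
serial_fma = [
    ("a", ("a", "x"), 4), ("a", ("a", "b", "y"), 4),
    ("a", ("a", "c", "z"), 4), ("a", ("a", "d", "w"), 4),
]
split_fma = [
    ("a", ("a", "x"), 4), ("c", ("c", "z"), 4),
    ("a", ("a", "b", "y"), 4), ("c", ("c", "d", "w"), 4),
    ("a", ("a", "c"), 4),
]
hadd = [("A", ("A", "X"), 4), ("A", ("A",), 6), ("A", ("A",), 6)]
dpps = [("A", ("A", "X"), 13)]

latencies = [critical_path(v) for v in (shuffle_add, serial_fma, split_fma, hadd, dpps)]
```

The split MUL/FMA variant comes out shortest because its two chains run in parallel and only meet at the final ADD.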