@gnzlbg
Created August 31, 2018 09:19
Performance of all/any mask reductions on SSE2 and SSE4.1
start:
mov ebx, 111 ; Start marker bytes
db 0x64, 0x67, 0x90 ; Start marker bytes
.L2:
;; the all_sse2 implementation starts here:
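movdqa xmm0, [rdi]   ; load the 128-bit mask (rdi must be 16-byte aligned)
pmovmskb eax, xmm0   ; eax[15:0] = most significant bit of each of the 16 bytes
cmp eax, 65535       ; 0xffff: are all 16 bits set?
sete al              ; al = 1 if every byte lane is set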
mov ebx, 222 ; End marker bytes
db 0x64, 0x67, 0x90 ; End marker bytes
Intel(R) Architecture Code Analyzer Version - v3.0-28-g1ba2cbb build date: 2017-10-30;16:57:45
Analyzed File - all_sse2.o
Binary Format - 64Bit
Architecture - SKL
Analysis Type - Throughput
Throughput Analysis Report
--------------------------
Block Throughput: 1.00 Cycles Throughput Bottleneck: Dependency chains
Loop Count: 23
Port Binding In Cycles Per Iteration:
--------------------------------------------------------------------------------------------------
| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
--------------------------------------------------------------------------------------------------
| Cycles | 1.0 0.0 | 0.5 | 0.5 0.5 | 0.5 0.5 | 0.0 | 0.5 | 1.0 | 0.0 |
--------------------------------------------------------------------------------------------------
DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3)
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion occurred
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256/AVX512 instruction, dozens of cycles penalty is expected
X - instruction not supported, was not accounted in Analysis
| Num Of |              Ports pressure in cycles               |     |
|  Uops  | 0 - DV  |  1  |  2 - D  |  3 - D  |  4  |  5  |  6  |  7  |
----------------------------------------------------------------------
|   1    |         |     | 0.5 0.5 | 0.5 0.5 |     |     |     |     | movdqa xmm0, xmmword ptr [rdi]
|   1    | 1.0     |     |         |         |     |     |     |     | pmovmskb eax, xmm0
|   1    |         | 0.5 |         |         |     | 0.5 |     |     | cmp eax, 0xffff
|   1    |         |     |         |         |     |     | 1.0 |     | setz al
Total Num Of Uops: 4
start:
mov ebx, 111 ; Start marker bytes
db 0x64, 0x67, 0x90 ; Start marker bytes
.L2:
;; the all_sse41 implementation starts here:
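movdqa xmm0, [rdi]   ; load the 128-bit mask (rdi must be 16-byte aligned)
pcmpeqd xmm1, xmm1   ; xmm1 = all ones
ptest xmm0, xmm1     ; CF = ((~xmm0 & xmm1) == 0)
setb al              ; al = CF: 1 if every bit of xmm0 is set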
mov ebx, 222 ; End marker bytes
db 0x64, 0x67, 0x90 ; End marker bytes
Intel(R) Architecture Code Analyzer Version - v3.0-28-g1ba2cbb build date: 2017-10-30;16:57:45
Analyzed File - all_sse41.o
Binary Format - 64Bit
Architecture - SKL
Analysis Type - Throughput
Throughput Analysis Report
--------------------------
Block Throughput: 1.24 Cycles Throughput Bottleneck: Dependency chains
Loop Count: 36
Port Binding In Cycles Per Iteration:
--------------------------------------------------------------------------------------------------
| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
--------------------------------------------------------------------------------------------------
| Cycles | 1.0 0.0 | 1.0 | 0.5 0.5 | 0.5 0.5 | 0.0 | 1.0 | 1.0 | 0.0 |
--------------------------------------------------------------------------------------------------
DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3)
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion occurred
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256/AVX512 instruction, dozens of cycles penalty is expected
X - instruction not supported, was not accounted in Analysis
| Num Of |              Ports pressure in cycles               |     |
|  Uops  | 0 - DV  |  1  |  2 - D  |  3 - D  |  4  |  5  |  6  |  7  |
----------------------------------------------------------------------
|   1    |         |     | 0.5 0.5 | 0.5 0.5 |     |     |     |     | movdqa xmm0, xmmword ptr [rdi]
|   1    |         | 1.0 |         |         |     |     |     |     | pcmpeqd xmm1, xmm1
|   2    | 1.0     |     |         |         |     | 1.0 |     |     | ptest xmm0, xmm1
|   1    |         |     |         |         |     |     | 1.0 |     | setb al
Total Num Of Uops: 5
start:
mov ebx, 111 ; Start marker bytes
db 0x64, 0x67, 0x90 ; Start marker bytes
.L2:
;; the any_sse2 implementation starts here:
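movdqa xmm0, [rdi]   ; load the 128-bit mask (rdi must be 16-byte aligned)
pmovmskb eax, xmm0   ; eax[15:0] = most significant bit of each of the 16 bytes
test eax, eax        ; ZF = 1 if none of the 16 bits is set
setne al             ; al = 1 if at least one byte lane is set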
mov ebx, 222 ; End marker bytes
db 0x64, 0x67, 0x90 ; End marker bytes
Intel(R) Architecture Code Analyzer Version - v3.0-28-g1ba2cbb build date: 2017-10-30;16:57:45
Analyzed File - any_sse2.o
Binary Format - 64Bit
Architecture - SKL
Analysis Type - Throughput
Throughput Analysis Report
--------------------------
Block Throughput: 1.00 Cycles Throughput Bottleneck: Dependency chains
Loop Count: 23
Port Binding In Cycles Per Iteration:
--------------------------------------------------------------------------------------------------
| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
--------------------------------------------------------------------------------------------------
| Cycles | 1.0 0.0 | 0.5 | 0.5 0.5 | 0.5 0.5 | 0.0 | 0.5 | 1.0 | 0.0 |
--------------------------------------------------------------------------------------------------
DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3)
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion occurred
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256/AVX512 instruction, dozens of cycles penalty is expected
X - instruction not supported, was not accounted in Analysis
| Num Of |              Ports pressure in cycles               |     |
|  Uops  | 0 - DV  |  1  |  2 - D  |  3 - D  |  4  |  5  |  6  |  7  |
----------------------------------------------------------------------
|   1    |         |     | 0.5 0.5 | 0.5 0.5 |     |     |     |     | movdqa xmm0, xmmword ptr [rdi]
|   1    | 1.0     |     |         |         |     |     |     |     | pmovmskb eax, xmm0
|   1    |         | 0.5 |         |         |     | 0.5 |     |     | test eax, eax
|   1    |         |     |         |         |     |     | 1.0 |     | setnz al
Total Num Of Uops: 4
start:
mov ebx, 111 ; Start marker bytes
db 0x64, 0x67, 0x90 ; Start marker bytes
.L2:
;; the any_sse41 implementation starts here:
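movdqa xmm0, [rdi]   ; load the 128-bit mask (rdi must be 16-byte aligned)
ptest xmm0, xmm0     ; ZF = ((xmm0 & xmm0) == 0), i.e. ZF = 1 if xmm0 is all zeros
setne al             ; al = !ZF: 1 if any bit of xmm0 is set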
mov ebx, 222 ; End marker bytes
db 0x64, 0x67, 0x90 ; End marker bytes
Intel(R) Architecture Code Analyzer Version - v3.0-28-g1ba2cbb build date: 2017-10-30;16:57:45
Analyzed File - any_sse41.o
Binary Format - 64Bit
Architecture - SKL
Analysis Type - Throughput
Throughput Analysis Report
--------------------------
Block Throughput: 1.00 Cycles Throughput Bottleneck: Dependency chains
Loop Count: 23
Port Binding In Cycles Per Iteration:
--------------------------------------------------------------------------------------------------
| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
--------------------------------------------------------------------------------------------------
| Cycles | 1.0 0.0 | 0.0 | 0.5 0.5 | 0.5 0.5 | 0.0 | 1.0 | 1.0 | 0.0 |
--------------------------------------------------------------------------------------------------
DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3)
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion occurred
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256/AVX512 instruction, dozens of cycles penalty is expected
X - instruction not supported, was not accounted in Analysis
| Num Of |              Ports pressure in cycles               |     |
|  Uops  | 0 - DV  |  1  |  2 - D  |  3 - D  |  4  |  5  |  6  |  7  |
----------------------------------------------------------------------
|   1    |         |     | 0.5 0.5 | 0.5 0.5 |     |     |     |     | movdqa xmm0, xmmword ptr [rdi]
|   2    | 1.0     |     |         |         |     | 1.0 |     |     | ptest xmm0, xmm0
|   1    |         |     |         |         |     |     | 1.0 |     | setnz al
Total Num Of Uops: 4
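The SSE2 and SSE4.1 sequences above are only comparable if they compute the same result. The sketch below is a rough bit-level model of the four listings in Python, not part of the original gist. The helper names (`pmovmskb`, `ptest`, `mask_from_lanes`) are made up for illustration, and the model assumes the input is a proper mask, meaning every lane is either all ones or all zeros.

```python
MASK128 = (1 << 128) - 1  # a 128-bit register modelled as a Python int


def pmovmskb(x):
    """Model of PMOVMSKB: gather the top bit of each of the 16 bytes."""
    return sum(((x >> (8 * i + 7)) & 1) << i for i in range(16))


def ptest(dst, src):
    """Model of PTEST dst, src: return the (ZF, CF) flags."""
    zf = (dst & src) == 0
    cf = (~dst & src & MASK128) == 0
    return zf, cf


def all_sse2(x):
    return pmovmskb(x) == 0xFFFF       # cmp eax, 65535 ; sete al


def all_sse41(x):
    return ptest(x, MASK128)[1]        # ptest xmm0, <all ones> ; setb al


def any_sse2(x):
    return pmovmskb(x) != 0            # test eax, eax ; setne al


def any_sse41(x):
    return not ptest(x, x)[0]          # ptest xmm0, xmm0 ; setne al


def mask_from_lanes(lanes, lane_bits=32):
    """Hypothetical helper: build a mask with all-ones lanes where `lanes` is True."""
    lane_ones = (1 << lane_bits) - 1
    x = 0
    for i, on in enumerate(lanes):
        if on:
            x |= lane_ones << (i * lane_bits)
    return x


if __name__ == "__main__":
    m = mask_from_lanes([True, False, True, True])
    print(all_sse2(m), all_sse41(m), any_sse2(m), any_sse41(m))
```

For inputs that are not proper masks the two families can disagree, because `pmovmskb` inspects only the top bit of each byte while `ptest` inspects every bit. In this model, `any_sse2(0x01)` is false while `any_sse41(0x01)` is true.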