Created August 31, 2018 09:19
Performance of all/any mask reductions on SSE2 and SSE4.1
start:
    mov ebx, 111          ; Start marker bytes
    db 0x64, 0x67, 0x90   ; Start marker bytes
.L2:
    ;; the all_sse2 implementation starts here:
    movdqa xmm0, [rdi]    ; load the 128-bit mask (16-byte aligned)
    pmovmskb eax, xmm0    ; eax[15:0] = top bit of each byte lane
    cmp eax, 65535        ; are all 16 bits set (0xffff)?
    sete al               ; al = 1 if every lane is set
    mov ebx, 222          ; End marker bytes
    db 0x64, 0x67, 0x90   ; End marker bytes
Intel(R) Architecture Code Analyzer Version - v3.0-28-g1ba2cbb build date: 2017-10-30;16:57:45
Analyzed File - all_sse2.o
Binary Format - 64Bit
Architecture  - SKL
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 1.00 Cycles       Throughput Bottleneck: Dependency chains
Loop Count: 23
Port Binding In Cycles Per Iteration:
--------------------------------------------------------------------------------------------------
|  Port  |   0   -  DV   |   1   |   2   -  D    |   3   -  D    |   4   |   5   |   6   |   7   |
--------------------------------------------------------------------------------------------------
| Cycles |  1.0     0.0  |  0.5  |  0.5     0.5  |  0.5     0.5  |  0.0  |  0.5  |  1.0  |  0.0  |
--------------------------------------------------------------------------------------------------

DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3)
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion occurred
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256/AVX512 instruction, dozens of cycles penalty is expected
X - instruction not supported, was not accounted in Analysis

| Num Of   |                          Ports pressure in cycles                          |      |
|  Uops    |  0  - DV    |  1   |  2  -  D    |  3  -  D    |  4   |  5   |  6   |  7   |
-----------------------------------------------------------------------------------------
|   1      |             |      | 0.5     0.5 | 0.5     0.5 |      |      |      |      | movdqa xmm0, xmmword ptr [rdi]
|   1      | 1.0         |      |             |             |      |      |      |      | pmovmskb eax, xmm0
|   1      |             | 0.5  |             |             |      | 0.5  |      |      | cmp eax, 0xffff
|   1      |             |      |             |             |      |      | 1.0  |      | setz al
Total Num Of Uops: 4
start:
    mov ebx, 111          ; Start marker bytes
    db 0x64, 0x67, 0x90   ; Start marker bytes
.L2:
    ;; the all_sse41 implementation starts here:
    movdqa xmm0, [rdi]    ; load the 128-bit mask (16-byte aligned)
    pcmpeqd xmm1, xmm1    ; xmm1 = all ones
    ptest xmm0, xmm1      ; CF = 1 if (NOT xmm0) AND xmm1 == 0
    setb al               ; al = CF, i.e. 1 if every bit of xmm0 is set
    mov ebx, 222          ; End marker bytes
    db 0x64, 0x67, 0x90   ; End marker bytes
Intel(R) Architecture Code Analyzer Version - v3.0-28-g1ba2cbb build date: 2017-10-30;16:57:45
Analyzed File - all_sse41.o
Binary Format - 64Bit
Architecture  - SKL
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 1.24 Cycles       Throughput Bottleneck: Dependency chains
Loop Count: 36
Port Binding In Cycles Per Iteration:
--------------------------------------------------------------------------------------------------
|  Port  |   0   -  DV   |   1   |   2   -  D    |   3   -  D    |   4   |   5   |   6   |   7   |
--------------------------------------------------------------------------------------------------
| Cycles |  1.0     0.0  |  1.0  |  0.5     0.5  |  0.5     0.5  |  0.0  |  1.0  |  1.0  |  0.0  |
--------------------------------------------------------------------------------------------------

DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3)
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion occurred
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256/AVX512 instruction, dozens of cycles penalty is expected
X - instruction not supported, was not accounted in Analysis

| Num Of   |                          Ports pressure in cycles                          |      |
|  Uops    |  0  - DV    |  1   |  2  -  D    |  3  -  D    |  4   |  5   |  6   |  7   |
-----------------------------------------------------------------------------------------
|   1      |             |      | 0.5     0.5 | 0.5     0.5 |      |      |      |      | movdqa xmm0, xmmword ptr [rdi]
|   1      |             | 1.0  |             |             |      |      |      |      | pcmpeqd xmm1, xmm1
|   2      | 1.0         |      |             |             |      | 1.0  |      |      | ptest xmm0, xmm1
|   1      |             |      |             |             |      |      | 1.0  |      | setb al
Total Num Of Uops: 5
start:
    mov ebx, 111          ; Start marker bytes
    db 0x64, 0x67, 0x90   ; Start marker bytes
.L2:
    ;; the any_sse2 implementation starts here:
    movdqa xmm0, [rdi]    ; load the 128-bit mask (16-byte aligned)
    pmovmskb eax, xmm0    ; eax[15:0] = top bit of each byte lane
    test eax, eax         ; ZF = 1 if no bit is set
    setne al              ; al = 1 if at least one lane is set
    mov ebx, 222          ; End marker bytes
    db 0x64, 0x67, 0x90   ; End marker bytes
Intel(R) Architecture Code Analyzer Version - v3.0-28-g1ba2cbb build date: 2017-10-30;16:57:45
Analyzed File - any_sse2.o
Binary Format - 64Bit
Architecture  - SKL
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 1.00 Cycles       Throughput Bottleneck: Dependency chains
Loop Count: 23
Port Binding In Cycles Per Iteration:
--------------------------------------------------------------------------------------------------
|  Port  |   0   -  DV   |   1   |   2   -  D    |   3   -  D    |   4   |   5   |   6   |   7   |
--------------------------------------------------------------------------------------------------
| Cycles |  1.0     0.0  |  0.5  |  0.5     0.5  |  0.5     0.5  |  0.0  |  0.5  |  1.0  |  0.0  |
--------------------------------------------------------------------------------------------------

DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3)
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion occurred
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256/AVX512 instruction, dozens of cycles penalty is expected
X - instruction not supported, was not accounted in Analysis

| Num Of   |                          Ports pressure in cycles                          |      |
|  Uops    |  0  - DV    |  1   |  2  -  D    |  3  -  D    |  4   |  5   |  6   |  7   |
-----------------------------------------------------------------------------------------
|   1      |             |      | 0.5     0.5 | 0.5     0.5 |      |      |      |      | movdqa xmm0, xmmword ptr [rdi]
|   1      | 1.0         |      |             |             |      |      |      |      | pmovmskb eax, xmm0
|   1      |             | 0.5  |             |             |      | 0.5  |      |      | test eax, eax
|   1      |             |      |             |             |      |      | 1.0  |      | setnz al
Total Num Of Uops: 4
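
The two SSE2 listings (all_sse2 and any_sse2) map closely onto compiler intrinsics. As a rough sketch that is not taken from the gist, the same reductions could be written in Rust with `std::arch`. The names `all_sse2` and `any_sse2` mirror the listing comments; `mask_from_lanes` is a hypothetical helper added only to build inputs. The sketch assumes an x86_64 target, where SSE2 is part of the baseline, and takes the mask by value instead of loading it from `[rdi]`.

```rust
use std::arch::x86_64::{__m128i, _mm_loadu_si128, _mm_movemask_epi8};

/// "all" reduction: true if the top bit of every byte lane is set.
/// Corresponds to pmovmskb + cmp 0xffff + sete.
pub fn all_sse2(mask: __m128i) -> bool {
    // SSE2 is always available on x86_64, so the intrinsic is sound to call.
    unsafe { _mm_movemask_epi8(mask) == 0xffff }
}

/// "any" reduction: true if the top bit of at least one byte lane is set.
/// Corresponds to pmovmskb + test + setne.
pub fn any_sse2(mask: __m128i) -> bool {
    unsafe { _mm_movemask_epi8(mask) != 0 }
}

/// Hypothetical helper (assumption, not in the gist): builds a byte mask
/// with 0xff in every lane that is `true` and 0x00 elsewhere.
pub fn mask_from_lanes(lanes: [bool; 16]) -> __m128i {
    let bytes: [u8; 16] = lanes.map(|set| if set { 0xff } else { 0x00 });
    // Unaligned load, so the array needs no special alignment.
    unsafe { _mm_loadu_si128(bytes.as_ptr() as *const __m128i) }
}
```

For example, `all_sse2(mask_from_lanes([true; 16]))` is true, while a mask with a single lane set satisfies only `any_sse2`. A compiler may or may not emit exactly the instruction sequences shown in the listings.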
start:
    mov ebx, 111          ; Start marker bytes
    db 0x64, 0x67, 0x90   ; Start marker bytes
.L2:
    ;; the any_sse41 implementation starts here:
    movdqa xmm0, [rdi]    ; load the 128-bit mask (16-byte aligned)
    ptest xmm0, xmm0      ; ZF = 1 if xmm0 is all zeros
    setne al              ; al = 1 if any bit of xmm0 is set
    mov ebx, 222          ; End marker bytes
    db 0x64, 0x67, 0x90   ; End marker bytes
Intel(R) Architecture Code Analyzer Version - v3.0-28-g1ba2cbb build date: 2017-10-30;16:57:45
Analyzed File - any_sse41.o
Binary Format - 64Bit
Architecture  - SKL
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 1.00 Cycles       Throughput Bottleneck: Dependency chains
Loop Count: 23
Port Binding In Cycles Per Iteration:
--------------------------------------------------------------------------------------------------
|  Port  |   0   -  DV   |   1   |   2   -  D    |   3   -  D    |   4   |   5   |   6   |   7   |
--------------------------------------------------------------------------------------------------
| Cycles |  1.0     0.0  |  0.0  |  0.5     0.5  |  0.5     0.5  |  0.0  |  1.0  |  1.0  |  0.0  |
--------------------------------------------------------------------------------------------------

DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3)
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion occurred
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256/AVX512 instruction, dozens of cycles penalty is expected
X - instruction not supported, was not accounted in Analysis

| Num Of   |                          Ports pressure in cycles                          |      |
|  Uops    |  0  - DV    |  1   |  2  -  D    |  3  -  D    |  4   |  5   |  6   |  7   |
-----------------------------------------------------------------------------------------
|   1      |             |      | 0.5     0.5 | 0.5     0.5 |      |      |      |      | movdqa xmm0, xmmword ptr [rdi]
|   2      | 1.0         |      |             |             |      | 1.0  |      |      | ptest xmm0, xmm0
|   1      |             |      |             |             |      |      | 1.0  |      | setnz al
Total Num Of Uops: 4
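
The two SSE4.1 listings (all_sse41 and any_sse41) both rely on `ptest`, which sets CF and ZF from a pair of 128-bit operands. A minimal Rust sketch of the same reductions, again not taken from the gist, could use the `_mm_testc_si128` and `_mm_testz_si128` intrinsics from `std::arch`. The function names follow the listing comments. The functions are marked `unsafe` on the assumption that the caller has already checked for SSE4.1 support, for instance with `is_x86_feature_detected!("sse4.1")`.

```rust
use std::arch::x86_64::{__m128i, _mm_set1_epi8, _mm_testc_si128, _mm_testz_si128};

/// "all" reduction: true if every bit of `mask` is set.
/// Corresponds to pcmpeqd xmm1, xmm1 + ptest xmm0, xmm1 + setb.
///
/// Safety: the CPU must support SSE4.1.
#[target_feature(enable = "sse4.1")]
pub unsafe fn all_sse41(mask: __m128i) -> bool {
    unsafe {
        let ones = _mm_set1_epi8(-1); // all-ones vector
        // testc returns CF, which is 1 when (NOT mask) AND ones == 0.
        _mm_testc_si128(mask, ones) == 1
    }
}

/// "any" reduction: true if at least one bit of `mask` is set.
/// Corresponds to ptest xmm0, xmm0 + setne.
///
/// Safety: the CPU must support SSE4.1.
#[target_feature(enable = "sse4.1")]
pub unsafe fn any_sse41(mask: __m128i) -> bool {
    unsafe {
        // testz returns ZF, which is 1 when mask AND mask == 0.
        _mm_testz_si128(mask, mask) == 0
    }
}
```

Unlike the `pmovmskb` variants, which inspect only the top bit of each byte, these test all 128 bits. The two approaches agree when every lane is either all ones or all zeros, which this sketch assumes for mask vectors.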