gnzlbg/all_sse2.asm

## all_sse2.asm
start:
  mov ebx, 111          ; Start marker bytes
  db 0x64, 0x67, 0x90   ; Start marker bytes
.L2:
  ;;  the all_sse2 implementation starts here:
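  movdqa xmm0, [rdi]    ; load 16 bytes from a 16-byte-aligned address
  pmovmskb eax, xmm0    ; eax[15:0] = most significant bit of each byte
  cmp eax, 65535        ; are all 16 mask bits set (0xFFFF)?
  sete al               ; al = 1 if all are set, else 0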
  mov ebx, 222          ; End marker bytes
  db 0x64, 0x67, 0x90   ; End marker bytes

## all_sse2.iaca
Intel(R) Architecture Code Analyzer Version -  v3.0-28-g1ba2cbb build date: 2017-10-30;16:57:45
Analyzed File -  all_sse2.o
Binary Format - 64Bit
Architecture  -  SKL
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 1.00 Cycles       Throughput Bottleneck: Dependency chains
Loop Count:  23
Port Binding In Cycles Per Iteration:
--------------------------------------------------------------------------------------------------
|  Port  |   0   -  DV   |   1   |   2   -  D    |   3   -  D    |   4   |   5   |   6   |   7   |
--------------------------------------------------------------------------------------------------
| Cycles |  1.0     0.0  |  0.5  |  0.5     0.5  |  0.5     0.5  |  0.0  |  0.5  |  1.0  |  0.0  |
--------------------------------------------------------------------------------------------------

DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3)
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion occurred
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256/AVX512 instruction, dozens of cycles penalty is expected
X - instruction not supported, was not accounted in Analysis

| Num Of   |                    Ports pressure in cycles                         |      |
|  Uops    |  0  - DV    |  1   |  2  -  D    |  3  -  D    |  4   |  5   |  6   |  7   |
-----------------------------------------------------------------------------------------
|   1      |             |      | 0.5     0.5 | 0.5     0.5 |      |      |      |      | movdqa xmm0, xmmword ptr [rdi]
|   1      | 1.0         |      |             |             |      |      |      |      | pmovmskb eax, xmm0
|   1      |             | 0.5  |             |             |      | 0.5  |      |      | cmp eax, 0xffff
|   1      |             |      |             |             |      |      | 1.0  |      | setz al
Total Num Of Uops: 4

## all_sse41.asm
start:
  mov ebx, 111          ; Start marker bytes
  db 0x64, 0x67, 0x90   ; Start marker bytes
.L2:
  ;;  the all_sse41 implementation starts here:
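  movdqa xmm0, [rdi]    ; load 16 bytes from a 16-byte-aligned address
  pcmpeqd xmm1, xmm1    ; xmm1 = all ones
  ptest xmm0, xmm1      ; CF = 1 if (xmm1 AND NOT xmm0) == 0, i.e. every bit of xmm0 is set
  setb al               ; al = CF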
  mov ebx, 222          ; End marker bytes
  db 0x64, 0x67, 0x90   ; End marker bytes

## all_sse41.iaca
Intel(R) Architecture Code Analyzer Version -  v3.0-28-g1ba2cbb build date: 2017-10-30;16:57:45
Analyzed File -  all_sse41.o
Binary Format - 64Bit
Architecture  -  SKL
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 1.24 Cycles       Throughput Bottleneck: Dependency chains
Loop Count:  36
Port Binding In Cycles Per Iteration:
--------------------------------------------------------------------------------------------------
|  Port  |   0   -  DV   |   1   |   2   -  D    |   3   -  D    |   4   |   5   |   6   |   7   |
--------------------------------------------------------------------------------------------------
| Cycles |  1.0     0.0  |  1.0  |  0.5     0.5  |  0.5     0.5  |  0.0  |  1.0  |  1.0  |  0.0  |
--------------------------------------------------------------------------------------------------

DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3)
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion occurred
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256/AVX512 instruction, dozens of cycles penalty is expected
X - instruction not supported, was not accounted in Analysis

| Num Of   |                    Ports pressure in cycles                         |      |
|  Uops    |  0  - DV    |  1   |  2  -  D    |  3  -  D    |  4   |  5   |  6   |  7   |
-----------------------------------------------------------------------------------------
|   1      |             |      | 0.5     0.5 | 0.5     0.5 |      |      |      |      | movdqa xmm0, xmmword ptr [rdi]
|   1      |             | 1.0  |             |             |      |      |      |      | pcmpeqd xmm1, xmm1
|   2      | 1.0         |      |             |             |      | 1.0  |      |      | ptest xmm0, xmm1
|   1      |             |      |             |             |      |      | 1.0  |      | setb al
Total Num Of Uops: 5

## any_sse2.asm
start:
  mov ebx, 111          ; Start marker bytes
  db 0x64, 0x67, 0x90   ; Start marker bytes
.L2:
  ;;  the any_sse2 implementation starts here:
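  movdqa xmm0, [rdi]    ; load 16 bytes from a 16-byte-aligned address
  pmovmskb eax, xmm0    ; eax[15:0] = most significant bit of each byte
  test eax, eax         ; ZF = 1 if no mask bit is set
  setne al              ; al = 1 if at least one mask bit is set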
  mov ebx, 222          ; End marker bytes
  db 0x64, 0x67, 0x90   ; End marker bytes
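
As a reading aid, the two `pmovmskb`-based snippets (`all_sse2` and `any_sse2`) can be modeled in scalar Python. This is only a sketch of the instruction semantics, not code from the gist: the function names mirror the file names, and representing an XMM register as a 16-byte `bytes` object (index 0 is the lowest byte) is an assumption made for illustration.

```python
def pmovmskb(xmm: bytes) -> int:
    """Model PMOVMSKB: pack the most significant bit of each byte into a 16-bit mask."""
    if len(xmm) != 16:
        raise ValueError("an XMM register holds exactly 16 bytes")
    mask = 0
    for i, byte in enumerate(xmm):
        mask |= (byte >> 7) << i  # bit i of the mask = top bit of byte i
    return mask


def all_sse2(xmm: bytes) -> bool:
    """Model of: pmovmskb eax, xmm0 / cmp eax, 65535 / sete al."""
    return pmovmskb(xmm) == 0xFFFF


def any_sse2(xmm: bytes) -> bool:
    """Model of: pmovmskb eax, xmm0 / test eax, eax / setne al."""
    return pmovmskb(xmm) != 0


all_true = bytes([0xFF] * 16)
one_false = bytes([0xFF] * 15 + [0x00])
print(all_sse2(all_true), all_sse2(one_false))   # True False
print(any_sse2(one_false), any_sse2(bytes(16)))  # True False
```

Only the top bit of each byte takes part in the result, so these two snippets give the intended answer when every lane is a proper boolean mask (all ones or all zeros).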

## any_sse2.iaca
Intel(R) Architecture Code Analyzer Version -  v3.0-28-g1ba2cbb build date: 2017-10-30;16:57:45
Analyzed File -  any_sse2.o
Binary Format - 64Bit
Architecture  -  SKL
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 1.00 Cycles       Throughput Bottleneck: Dependency chains
Loop Count:  23
Port Binding In Cycles Per Iteration:
--------------------------------------------------------------------------------------------------
|  Port  |   0   -  DV   |   1   |   2   -  D    |   3   -  D    |   4   |   5   |   6   |   7   |
--------------------------------------------------------------------------------------------------
| Cycles |  1.0     0.0  |  0.5  |  0.5     0.5  |  0.5     0.5  |  0.0  |  0.5  |  1.0  |  0.0  |
--------------------------------------------------------------------------------------------------

DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3)
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion occurred
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256/AVX512 instruction, dozens of cycles penalty is expected
X - instruction not supported, was not accounted in Analysis

| Num Of   |                    Ports pressure in cycles                         |      |
|  Uops    |  0  - DV    |  1   |  2  -  D    |  3  -  D    |  4   |  5   |  6   |  7   |
-----------------------------------------------------------------------------------------
|   1      |             |      | 0.5     0.5 | 0.5     0.5 |      |      |      |      | movdqa xmm0, xmmword ptr [rdi]
|   1      | 1.0         |      |             |             |      |      |      |      | pmovmskb eax, xmm0
|   1      |             | 0.5  |             |             |      | 0.5  |      |      | test eax, eax
|   1      |             |      |             |             |      |      | 1.0  |      | setnz al
Total Num Of Uops: 4

## any_sse41.asm
start:
  mov ebx, 111          ; Start marker bytes
  db 0x64, 0x67, 0x90   ; Start marker bytes
.L2:
  ;;  the any_sse41 implementation starts here:
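  movdqa xmm0, [rdi]    ; load 16 bytes from a 16-byte-aligned address
  ptest xmm0, xmm0      ; ZF = 1 if xmm0 is all zeros
  setne al              ; al = 1 if any bit of xmm0 is set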
  mov ebx, 222          ; End marker bytes
  db 0x64, 0x67, 0x90   ; End marker bytes
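
The two `ptest`-based snippets (`all_sse41` and `any_sse41`) rely on different flags of the same instruction, which a small scalar model may make easier to see. The sketch below is not part of the gist: it assumes a 128-bit register is represented as a Python integer, and the function names simply mirror the file names.

```python
MASK128 = (1 << 128) - 1  # a register with every bit set


def ptest(dest: int, src: int):
    """Model PTEST dest, src and return the (ZF, CF) flags.

    ZF = 1 if (dest AND src) == 0
    CF = 1 if ((NOT dest) AND src) == 0
    """
    zf = (dest & src) == 0
    cf = (~dest & src & MASK128) == 0
    return zf, cf


def all_sse41(xmm0: int) -> bool:
    """Model of: pcmpeqd xmm1, xmm1 / ptest xmm0, xmm1 / setb al."""
    xmm1 = MASK128            # pcmpeqd of a register with itself yields all ones
    _, cf = ptest(xmm0, xmm1)
    return cf                 # setb reads CF


def any_sse41(xmm0: int) -> bool:
    """Model of: ptest xmm0, xmm0 / setne al."""
    zf, _ = ptest(xmm0, xmm0)
    return not zf             # setne reads ZF == 0


print(all_sse41(MASK128), all_sse41(MASK128 ^ 1))  # True False
print(any_sse41(0), any_sse41(1))                  # False True
```

Unlike the `pmovmskb` variants, `ptest` looks at all 128 bits rather than only the top bit of each byte. The two approaches agree when every lane is all ones or all zeros, but can differ on other inputs: a register holding just the value 1 counts as "any" here, while it has no byte with its top bit set.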

## any_sse41.iaca
Intel(R) Architecture Code Analyzer Version -  v3.0-28-g1ba2cbb build date: 2017-10-30;16:57:45
Analyzed File -  any_sse41.o
Binary Format - 64Bit
Architecture  -  SKL
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 1.00 Cycles       Throughput Bottleneck: Dependency chains
Loop Count:  23
Port Binding In Cycles Per Iteration:
--------------------------------------------------------------------------------------------------
|  Port  |   0   -  DV   |   1   |   2   -  D    |   3   -  D    |   4   |   5   |   6   |   7   |
--------------------------------------------------------------------------------------------------
| Cycles |  1.0     0.0  |  0.0  |  0.5     0.5  |  0.5     0.5  |  0.0  |  1.0  |  1.0  |  0.0  |
--------------------------------------------------------------------------------------------------

DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3)
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion occurred
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256/AVX512 instruction, dozens of cycles penalty is expected
X - instruction not supported, was not accounted in Analysis

| Num Of   |                    Ports pressure in cycles                         |      |
|  Uops    |  0  - DV    |  1   |  2  -  D    |  3  -  D    |  4   |  5   |  6   |  7   |
-----------------------------------------------------------------------------------------
|   1      |             |      | 0.5     0.5 | 0.5     0.5 |      |      |      |      | movdqa xmm0, xmmword ptr [rdi]
|   2      | 1.0         |      |             |             |      | 1.0  |      |      | ptest xmm0, xmm0
|   1      |             |      |             |             |      |      | 1.0  |      | setnz al
Total Num Of Uops: 4