@nicolasvasilache
Created November 21, 2021 13:29
[llvm-dev] Understanding and controlling some of the AVX shuffle emission paths
Hi everyone,
I am experimenting with LLVM lowering, intrinsics and shufflevector in general.
Here is an IR that I produce with the objective of emitting some vblendps instructions: https://gist.github.com/nicolasvasilache/0fe30c83cbfe5b4776ec9f0ee465611a.
I compile this further with
clang -x ir -emit-llvm -S -mcpu=haswell -O3 -o - | llc -O3 -mcpu=haswell - -o -
to obtain:
https://gist.github.com/nicolasvasilache/2c773b86fcda01cc28711828a0a9ce0a
At this point, I would expect to see vblendps instructions generated for the pieces of IR that produce %48/%49, %51/%52, %54/%55, and %57/%58, to reduce pressure on port 5 (vblendps can also issue on ports 0 and 1). However, the expected instructions do not get generated, and llvm-mca continues to show me high port 5 contention.
Could people suggest some steps / commands to help better understand why my expectation is not met and whether I can do something to make the compiler generate what I want? Thanks in advance!
I have verified independently that, in isolation, a single such shuffle creates a vblendps. I see the shuffles being recombined in the produced assembly, and I am experimenting with ways to avoid having vshufps + vblendps + vblendps recombined into vunpckxxx + vunpckxxx instructions.
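For reference, the kind of isolated shuffle I mean looks like this (a minimal sketch; the function name and the exact interleaving mask are illustrative, not taken from the gist):

```llvm
; An interleaving blend of two <8 x float> values: even lanes from %a,
; odd lanes from %b. In isolation, llc -mcpu=haswell typically lowers
; this to a single vblendps $0xaa.
define <8 x float> @blend(<8 x float> %a, <8 x float> %b) {
  %r = shufflevector <8 x float> %a, <8 x float> %b,
       <8 x i32> <i32 0, i32 9, i32 2, i32 11, i32 4, i32 13, i32 6, i32 15>
  ret <8 x float> %r
}
```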
--
N
Simon Pilgrim via llvm-dev
Nov 9, 2021, 9:44:43 PM
to llvm...@lists.llvm.org
On 09/11/2021 08:57, Nicolas Vasilache via llvm-dev wrote:
> Hi everyone,
>
> I am experimenting with LLVM lowering, intrinsics and shufflevector in
> general.
>
> Here is an IR that I produce with the objective of emitting some
> vblendps instructions:
> https://gist.github.com/nicolasvasilache/0fe30c83cbfe5b4776ec9f0ee465611a.
>
From what I can see, the original IR code was (effectively):
8 x UNPCKLPS/UNPCKHPS
4 x SHUFPS
8 x BLENDPS
4 x INSERTF128
4 x PERM2F128
> I compile this further with
>
> clang -x ir -emit-llvm -S -mcpu=haswell -O3 -o - | llc -O3
> -mcpu=haswell - -o -
>
> to obtain:
>
> https://gist.github.com/nicolasvasilache/2c773b86fcda01cc28711828a0a9ce0a
and after the x86 shuffle combines:
8 x UNPCKLPS/UNPCKHPS
8 x UNPCKLPD/UNPCKHPD
4 x INSERTF128
4 x PERM2F128
Starting from each BLENDPS, they've combined with the SHUFPS to create
the UNPCK*PD nodes. We nearly always benefit from folding shuffle chains
to reduce total instruction counts, even if some inner nodes have
multiple uses (like the SHUFPS), and I'd hate to lose that.
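The fold in question can be sketched in IR terms (a hedged sketch; the masks are illustrative, not taken from the gist):

```llvm
; shufps(a, b, 0x4e): per 128-bit lane, <a2, a3, b0, b1>
%s = shufflevector <8 x float> %a, <8 x float> %b,
     <8 x i32> <i32 2, i32 3, i32 8, i32 9, i32 6, i32 7, i32 12, i32 13>
; blendps(a, s, 0xcc): take lanes 2,3 (and 6,7) from %s
%r = shufflevector <8 x float> %a, <8 x float> %s,
     <8 x i32> <i32 0, i32 1, i32 10, i32 11, i32 4, i32 5, i32 14, i32 15>
; %r is <a0, a1, b0, b1, a4, a5, b4, b5>, i.e. exactly unpcklpd(a, b),
; so the combiner can replace the SHUFPS+BLENDPS pair with one vunpcklpd.
```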
> At this point, I would expect to see some vblendps instructions
> generated for the pieces of IR that produce %48/%49 %51/%52 %54/%55
> and %57/%58 to reduce pressure on port 5 (vblendps can also go on
> ports 0 and 1). However the expected instruction does not get
> generated and llvm-mca continues to show me high port 5 contention.
>
> Could people suggest some steps / commands to help better understand
> why my expectation is not met and whether I can do something to make
> the compiler generate what I want? Thanks in advance!
So on Haswell, we've gained 4 extra Port5-only shuffles but removed the
8 Port015 blends.
We have very few arch-specific shuffle combines: just the fast-variable-shuffle tuning flags to avoid unnecessary shuffle mask loads. Otherwise, the shuffle combiner simply aims to reduce the number of simple target shuffle nodes. And to be honest, I'm reluctant to add to this, as shuffle combining is already complex.
We should be preferring to lower/combine to BLENDPS in more circumstances (it's commutable and never slower than any other target shuffle, although demanded-elts analysis can do less with 'undef' elements), but that won't help us here.
So far I've failed to find a BLEND-based 8x8 transpose pattern that the
shuffle combiner doesn't manage to combine back to the 8xUNPCK/SHUFPS ops :(
> I have verified independently that in isolation, a single such shuffle
> creates a vblendps. I see them being recombined in the produced
> assembly and I am looking for experimenting with avoiding that vshufps
> + vblendps + vblendps get recombined into vunpckxxx + vunpckxxx
> instructions.
>
> --
_______________________________________________
LLVM Developers mailing list
llvm...@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
Simon Pilgrim via llvm-dev
Nov 9, 2021, 10:32:16 PM
to llvm...@lists.llvm.org
The only thing I can think of is that you might want to see if you can reorder the INSERTF128/PERM2F128 shuffles in between the UNPCK*PS and the SHUFPS/BLENDPS:
8 x UNPCKLPS/UNPCKHPS
4 x INSERTF128
4 x PERM2F128
4 x SHUFPS
8 x BLENDPS
Splitting the per-lane shuffles with the subvector-shuffles could help
stop the shuffle combiner.
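In IR terms, that cross-lane stage between the per-lane steps might look like this (a sketch; %x/%y and the masks are illustrative):

```llvm
; Subvector shuffles inserted between the per-lane UNPCK and
; SHUFPS/BLENDPS stages:
; %lo concatenates the low 128-bit halves (vinsertf128 / vperm2f128 $0x20),
; %hi concatenates the high halves (vperm2f128 $0x31).
%lo = shufflevector <8 x float> %x, <8 x float> %y,
      <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 8, i32 9, i32 10, i32 11>
%hi = shufflevector <8 x float> %x, <8 x float> %y,
      <8 x i32> <i32 4, i32 5, i32 6, i32 7, i32 12, i32 13, i32 14, i32 15>
```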
Diego Caballero via llvm-dev
Nov 10, 2021, 10:31:12 AM
to Simon Pilgrim, Nicolas Vasilache, llvm...@lists.llvm.org
+Nicolas Vasilache :)
Nicolas Vasilache via llvm-dev
Nov 10, 2021, 10:46:44 AM
to Diego Caballero, llvm...@lists.llvm.org
On Wed, Nov 10, 2021 at 10:30 AM Diego Caballero <diegoca...@google.com> wrote:
+Nicolas Vasilache :)
Thanks Diego. Email is hard: I could not find a way to inject myself into my own discussion...
If you are referring to this specific code, yes same for me.
If you are thinking about the general 8x8 transpose problem, I have a version with vector<4xf32> loads that ends up using blends; as expected, the port 5 pressure reduction helps, and both llvm-mca and runtime measurements agree that it is 20-30% faster.
The only thing I can think of is that you might want to see if you can reorder the INSERTF128/PERM2F128 shuffles in between the UNPCK*PS and the SHUFPS/BLENDPS:
8 x UNPCKLPS/UNPCKHPS
4 x INSERTF128
4 x PERM2F128
4 x SHUFPS
8 x BLENDPS
Splitting the per-lane shuffles with the subvector-shuffles could help
stop the shuffle combiner.
Right, I tried different variations here but invariably got the same result.
The vector<4xf32> based version is something that I also want to target for a bunch of orthogonal reasons.
I'll note that my use case is MLIR codegen with explicit vectors and intrinsics lowered to LLVM, so I have quite a lot of flexibility.
But it feels unnatural in the compiler flow to have to branch off at a significantly higher level of abstraction to sidestep concerns related to X86 microarchitecture details.
As I am very new to this part of LLVM, I am not sure what is feasible or not. Would it be envisionable to either:
1. have a way to inject some numeric cost to influence the value of some resulting combinations?
2. revive some form of intrinsic and guarantee that the instruction would be generated?
I realize point 2 runs contrary to the evolution of LLVM, as these intrinsics were removed around 2015 in favor of the combiner-based approach.
Still, it seems that `we have very little arch-specific shuffle combines` could be a signal that such intrinsics are needed?
>> I have verified independently that in isolation, a single such
>> shuffle creates a vblendps. I see them being recombined in the
>> produced assembly and I am looking for experimenting with avoiding
>> that vshufps + vblendps + vblendps get recombined into vunpckxxx +
>> vunpckxxx instructions.
>>
>> --
--
N
Wang, Pengfei via llvm-dev
Nov 11, 2021, 9:35:01 AM
to Nicolas Vasilache, Diego Caballero, llvm...@lists.llvm.org
>As I am very new to this part of LLVM, I am not sure what is feasible or not. Would it be envisionable to either:
>1. have a way to inject some numeric cost to influence the value of some resulting combinations?
>2. revive some form of intrinsic and guarantee that the instruction would be generated?
I think a feasible way is to add a new tuningXXX feature for the given targets and have the combine behave differently under that flag.
1) seems like overengineering, and 2) seems like overkill for the potential opportunities the combine can find.
Thanks
Phoebe
Simon Pilgrim via llvm-dev
Nov 14, 2021, 4:53:04 PM
to Wang, Pengfei, Nicolas Vasilache, Diego Caballero, llvm...@lists.llvm.org
Nicolas - have you investigated just using inline asm instead?
Nicolas Vasilache via llvm-dev
Nov 14, 2021, 11:17:04 PM
to Simon Pilgrim, llvm...@lists.llvm.org
Not yet, the InlineAsmOp in MLIR is still generally unused.
It has been used a bit in the IREE project though (https://github.com/google/iree/blob/49a81c60329437e64791ee1abd09d47fe1cde205/iree/compiler/Codegen/LLVMCPU/VectorContractToAArch64InlineAsmOp.cpp#L103).
I should indeed be able to intersperse my lowering (https://github.com/llvm/llvm-project/blob/main/mlir/lib/Dialect/X86Vector/Transforms/AVXTranspose.cpp#L124) with some InlineAsmOp uses.
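At the LLVM IR level, pinning a specific instruction with inline asm might look like this (an untested sketch; the constraint string and AT&T operand order would need verification):

```llvm
; Force a vblendps with immediate 0xaa, bypassing shuffle combining.
; "=x,x,x" asks for SSE/AVX registers for the result and both inputs.
define <8 x float> @forced_blend(<8 x float> %a, <8 x float> %b) {
  %r = call <8 x float> asm "vblendps $$0xaa, $2, $1, $0", "=x,x,x"
       (<8 x float> %a, <8 x float> %b)
  ret <8 x float> %r
}
```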
I'll report back when I have something.
--
N