rygorous/gist:159aa1c4573077126169

## gistfile1.txt
So I would like to be able to write functions that use SSEx intrinsics (that are called
via some CPU dispatch mechanism) without allowing the compiler to use SSEx instructions
everywhere (because that's not under control of the runtime CPU dispatch we have).

On VC++, this is easy. I get to use whatever intrinsics I want, and the compiler will
emit the corresponding instructions. It will not use these same instructions in code
that wasn't written with intrinsics unless I specifically allow it to with a
command-line option. In GCC and Clang, this turns out to be pretty hard (or at least
there's no good way I know of). And yes, I fully realize that the under-the-hood semantics
of this are tricky, since in a modern compiler these vector intrinsics turn into an IL that
undergoes several transforms, and it may not be obvious to the back-end where they came
from and whether it's allowed to (say) match a codegen pattern that uses an SSE3 instruction
in a particular context.

So you need to define precisely what the desired behavior actually is to decide what
should happen in such a case. A reasonable formalization is this: unless I have specified
some option that allows the compiler to use some instruction set extension everywhere
(for example, "/arch:SSE2" for VC++ lets the compiler use SSE2 instructions wherever it
wants), the compiler may only emit SSE2 instructions within functions that use SSE2
intrinsics (and hence implicitly require SSE2 anyway). Thus, even if I didn't specify
"/arch:SSE2", I would be okay with the compiler using SSE2 instructions for general-purpose
code in such functions. On x64, this particular example is somewhat moot (since x64 includes
SSE2); but the actual *behavior* of VC++ in such cases is very convenient and
programmer-friendly, and I would like to see more compilers adopt it.

I *would* like to be able to use SSSE3 intrinsics in arbitrary (x86) code, without at the
same time allowing the compiler to use SSSE3 code everywhere else in that translation unit
as a consequence of automatic transformations (say, replacing a sequence of permuted integer
loads and stores with a MOVDQU, PSHUFB, MOVDQU). I get that behavior in VC++ but not in other
compilers. The problem is that while this behavior is easy to describe at the source level,
it's not necessarily obvious at the IL level.

Thus, here is my formal, precise, source language-agnostic definition of the behavior I
would like to see: I am okay with the compiler emitting (say) SSSE3 instructions in any
block that is dominated by a block containing SSSE3 intrinsics (that is, source language
statements that require SSSE3). (I am of course also okay with the compiler using SSSE3
instructions when their usage was globally enabled using a command-line switch, but I
would prefer something more selective for code that needs to run on older machines and
can't just be compiled with "ZOMG use SSE4.2 *everywhere*").
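
To make the dominance rule concrete, here is a minimal sketch of how a pass could compute
the set of blocks where SSSE3 is fair game. Everything in it is hypothetical and only for
illustration: the `Cfg` struct, the 32-block limit and the `ssse3_allowed_blocks` name are
mine, and it uses the textbook iterative dominator computation rather than whatever a real
compiler would do. It assumes block 0 is the entry and every block is reachable from it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_BLOCKS 32 /* one bit per basic block in a uint32_t set */

/* Hypothetical CFG description. Block 0 is the entry block. */
typedef struct {
    int      num_blocks;
    uint32_t preds[MAX_BLOCKS];               /* bit p set => p is a predecessor */
    bool     has_ssse3_intrinsic[MAX_BLOCKS]; /* block contains an SSSE3 intrinsic */
} Cfg;

/* Returns a bitmask of the blocks in which SSSE3 instructions may be emitted:
   every block dominated by some block that contains an SSSE3 intrinsic.
   A block dominates itself, so the intrinsic-bearing blocks are included. */
uint32_t ssse3_allowed_blocks(const Cfg *cfg)
{
    assert(cfg->num_blocks >= 1 && cfg->num_blocks <= MAX_BLOCKS);

    const int n = cfg->num_blocks;
    const uint32_t all = (n == 32) ? 0xffffffffu : ((1u << n) - 1u);
    uint32_t dom[MAX_BLOCKS]; /* dom[b] = set of blocks that dominate b */

    /* Start from "everything dominates everything" except at the entry. */
    dom[0] = 1u;
    for (int b = 1; b < n; b++)
        dom[b] = all;

    /* Iterate dom[b] = {b} | intersection of dom[p] over predecessors p
       until nothing changes. */
    bool changed = true;
    while (changed) {
        changed = false;
        for (int b = 1; b < n; b++) {
            uint32_t meet = all;
            for (int p = 0; p < n; p++)
                if (cfg->preds[b] & (1u << p))
                    meet &= dom[p];
            const uint32_t next = meet | (1u << b);
            if (next != dom[b]) {
                dom[b] = next;
                changed = true;
            }
        }
    }

    uint32_t intrinsic_blocks = 0;
    for (int b = 0; b < n; b++)
        if (cfg->has_ssse3_intrinsic[b])
            intrinsic_blocks |= 1u << b;

    uint32_t allowed = 0;
    for (int b = 0; b < n; b++)
        if (dom[b] & intrinsic_blocks)
            allowed |= 1u << b;
    return allowed;
}
```

For a CFG with edges 0->1, 0->2, 1->3, 2->4, 3->4 and an intrinsic only in block 1, this
allows SSSE3 in blocks 1 and 3 but not in the join block 4, since 4 can also be reached
through block 2 without ever executing the intrinsic.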

-

This *sounds* more complicated than say "don't automatically introduce SSSE3 instructions
at all" or "just give me a function-level annotation", but both of these approaches have
problems: the former is actually tricky when intrinsics are rewritten to a generic form in
the IL, and needlessly restrictive besides; the latter is fine in principle, but in practice
tends to break frequently as soon as inlining or link-time optimization is involved.
So my hope is that expressing the property I want purely in terms of things that are
available in a low-level, basic-blocks-plus-CFG form is helpful.
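
For reference, the function-level annotation route looks roughly like the sketch below on
GCC and Clang. It is only an illustration under stated assumptions: the `reverse16_*`
names and the dispatch wrapper are made up for this example, the `target("ssse3")`
attribute and `__builtin_cpu_supports` are GCC/Clang extensions (the latter needs a
reasonably recent compiler and runtime), and non-x86 builds simply fall back to the scalar
path. The SSSE3 version is the MOVDQU, PSHUFB, MOVDQU pattern mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#include <tmmintrin.h> /* SSSE3 intrinsics */
#define HAVE_SSSE3_PATH 1

/* Only this function is compiled with SSSE3 enabled; the rest of the
   translation unit keeps the baseline instruction set. */
__attribute__((target("ssse3")))
static void reverse16_ssse3(uint8_t *dst, const uint8_t *src)
{
    const __m128i rev = _mm_setr_epi8(15, 14, 13, 12, 11, 10, 9, 8,
                                      7, 6, 5, 4, 3, 2, 1, 0);
    const __m128i v = _mm_loadu_si128((const __m128i *)src); /* MOVDQU */
    _mm_storeu_si128((__m128i *)dst, _mm_shuffle_epi8(v, rev)); /* PSHUFB, MOVDQU */
}
#endif

/* Baseline version: reverses 16 bytes without any intrinsics. */
static void reverse16_scalar(uint8_t *dst, const uint8_t *src)
{
    uint8_t tmp[16];
    for (int i = 0; i < 16; i++)
        tmp[i] = src[15 - i];
    memcpy(dst, tmp, sizeof tmp);
}

typedef void (*reverse16_fn)(uint8_t *dst, const uint8_t *src);

/* Runtime CPU dispatch: pick the SSSE3 path only if the CPU reports it. */
static reverse16_fn pick_reverse16(void)
{
#ifdef HAVE_SSSE3_PATH
    if (__builtin_cpu_supports("ssse3"))
        return reverse16_ssse3;
#endif
    return reverse16_scalar;
}

/* Public entry point; resolves the implementation on first use.
   (Not thread-safe as written; fine for a sketch.) */
void reverse16(uint8_t *dst, const uint8_t *src)
{
    static reverse16_fn impl;
    assert(dst != NULL && src != NULL);
    if (!impl)
        impl = pick_reverse16();
    impl(dst, src);
}
```

Callers just use `reverse16(out, in)`. The annotation is tied to the function boundary
rather than to where the intrinsics actually sit in the control flow, and that boundary
is exactly what inlining and link-time optimization move around.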