Created
April 23, 2015 21:25
-
-
Save rygorous/f26f5f60284d9d9246f6 to your computer and use it in GitHub Desktop.
Problems with GCCs per-function target specifiers (at least in 4.6.3)
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
So in RADFFT we have some SSE3 kernel functions that have been the source of a lot | |
of compilation "fun". With VC++ it's no problem, with GCC it evidently is. We don't | |
want to specify "-msse3" since that allows the compiler to use SSE3 outside these | |
functions - all of which are called within a CPU feature dispach block - and that | |
not only can but has caused breakage for customers in the past. | |
GCC says "put the SSE3 code in a separate translation unit and compile just that with | |
-msse3". There's been at least three problems with this that I'm aware of: | |
a) It forces us to make these small internal DSP kernels public (currently they're | |
static functions), which is extra namespace pollution, which is not a huge deal | |
but still blows. | |
b) We've had fun issues with function attributes "bleeding out" or getting ignored | |
in the past, mostly involving inlining. This is less of an issue when we're | |
talking about a separate translation unit - unless there's LTO involved, | |
which it often is. Yay. | |
c) Having to add extra command line options, only on some files, only when using | |
GCC, and only on some platforms (namely x86 targets) is, in that exact | |
combination, a pretty giant pain in the ass in the build process and super | |
error-prone. Just not instantiating say SSE3 functions when the compiler doesn't | |
support the intrinsics is easy. Messing with the build flags on specific files | |
and under specific circumstances is something of a PITA in all build systems | |
I've dealt with, but CDep is worse than most. | |
Okay. So I was pretty excited to see that GCC has this: | |
https://gcc.gnu.org/onlinedocs/gcc/Function-Attributes.html#index-g_t_0040code_007bsyscall_005flinkage_007d-function-attribute_002c-IA-64-3240 | |
(the "target" bit, not the Itanium thing that link is actually to, but I the | |
anchor for the "target" thing is on the line just below the "target" heading | |
so you have no idea what it's talking about) | |
Andh you can also set that via a #pragma | |
https://gcc.gnu.org/onlinedocs/gcc/Function-Specific-Option-Pragmas.html | |
which is all supported on GCC 4.6 and up (probably earlier, didn't check). | |
So I thought, great, I can just use that, right? Gather all the SSE3 functions | |
together in one contiguous block, do this before the block: | |
#pragma GCC push_options | |
#pragma GCC target("sse3") | |
#include <pmmintrin.h> // so I actually get the SSE3 intrinsics | |
and this right after: | |
#pragma GCC pop_options | |
Cool. That's reasonable. I can certainly live with that! | |
And of course it was too good to be true, because if you try that (at least with | |
the GCC 4.6.3 on our Ubuntu 12.04 LTS Linux builders), what you'll find is that | |
including "pmmintrin.h" does not in fact work. The target("sse3") thing enables | |
the builtins, but you still can't include that header. (It fails testing for a | |
__SSE3__ #define. You can try #defining that manually, but will quickly discover | |
that this merely causes more interesting breakage further down the line.) | |
It turns out that the smallest example I can get to produce the behavior I actually | |
want is this: | |
#pragma GCC target("no-sse3") // at the top of the file | |
#include <emmintrin.h> // for SSE2 intrinsics; required to include this here! | |
#pragma GCC push_options | |
#pragma GCC target("sse3") | |
#include <pmmintrin.h> // need to include this *in* the sse3 block (fair enough) | |
__m128 silly_sse3(float *x) | |
{ | |
return _mm_moveldup_ps(_mm_load_ps(x)); | |
} | |
#pragma GCC pop_options | |
__m128 silly_sse2(float *x) | |
{ | |
__m128 v = _mm_load_ps(x); | |
return _mm_shuffle_ps(v, v, 0xa0); | |
} | |
And then compile that with "-msse3". Anything less and it either doesn't | |
compile or doesn't link. Having to do this pretty thoroughly dismantles the | |
benefit of being able to specify this per-function. |
Compiles fine with gcc (GCC) 4.9.2 20150304 (prerelease)
$ gcc -S -fverbose-asm -c sse.c
$ cat sse.s
.file "sse.c"
# GNU C (GCC) version 4.9.2 20150304 (prerelease) (x86_64-unknown-linux-gnu)
# compiled by GNU C version 4.9.2 20150304 (prerelease), GMP version 6.0.0, MPFR version 3.1.2-p11, MPC version 1.0.3
# GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072
# options passed: sse.c -mtune=generic -march=x86-64 -fverbose-asm
# options enabled: -faggressive-loop-optimizations
# -fasynchronous-unwind-tables -fauto-inc-dec -fcommon
# -fdelete-null-pointer-checks -fdwarf2-cfi-asm -fearly-inlining
# -feliminate-unused-debug-types -ffunction-cse -fgcse-lm -fgnu-runtime
# -fgnu-unique -fident -finline-atomics -fira-hoist-pressure
# -fira-share-save-slots -fira-share-spill-slots -fivopts
# -fkeep-static-consts -fleading-underscore -flifetime-dse -fmath-errno
# -fmerge-debug-strings -fpeephole -fprefetch-loop-arrays
# -freg-struct-return -fsched-critical-path-heuristic
# -fsched-dep-count-heuristic -fsched-group-heuristic -fsched-interblock
# -fsched-last-insn-heuristic -fsched-rank-heuristic -fsched-spec
# -fsched-spec-insn-heuristic -fsched-stalled-insns-dep -fshow-column
# -fsigned-zeros -fsplit-ivs-in-unroller -fstrict-volatile-bitfields
# -fsync-libcalls -ftrapping-math -ftree-coalesce-vars -ftree-cselim
# -ftree-forwprop -ftree-loop-if-convert -ftree-loop-im -ftree-loop-ivcanon
# -ftree-loop-optimize -ftree-parallelize-loops= -ftree-phiprop
# -ftree-reassoc -ftree-scev-cprop -funit-at-a-time -funwind-tables
# -fverbose-asm -fzero-initialized-in-bss -m128bit-long-double -m64 -m80387
# -malign-stringops -mavx256-split-unaligned-load
# -mavx256-split-unaligned-store -mfancy-math-387 -mfp-ret-in-387 -mfxsr
# -mglibc -mieee-fp -mlong-double-80 -mmmx -mno-sse4 -mpush-args -mred-zone
# -msse -msse2 -mtls-direct-seg-refs
.text
.globl silly_sse3
.type silly_sse3, @function
silly_sse3:
.LFB512:
.cfi_startproc
pushq %rbp #
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp #,
.cfi_def_cfa_register 6
movq %rdi, -40(%rbp) # x, x
movq -40(%rbp), %rax # x, tmp86
movq %rax, -8(%rbp) # tmp86, __P
movq -8(%rbp), %rax # __P, tmp87
movaps (%rax), %xmm0 # MEM[(__v4sf *)__P_4], D.4423
movaps %xmm0, -32(%rbp) # D.4423, __X
movsldup -32(%rbp), %xmm0 # __X, D.4424
nop
popq %rbp #
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE512:
.size silly_sse3, .-silly_sse3
.globl silly_sse2
.type silly_sse2, @function
silly_sse2:
.LFB513:
.cfi_startproc
pushq %rbp #
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp #,
.cfi_def_cfa_register 6
movq %rdi, -40(%rbp) # x, x
movq -40(%rbp), %rax # x, tmp86
movq %rax, -24(%rbp) # tmp86, __P
movq -24(%rbp), %rax # __P, tmp87
movaps (%rax), %xmm0 # MEM[(__v4sf *)__P_4], D.4425
movaps %xmm0, -16(%rbp) # D.4425, v
movaps -16(%rbp), %xmm0 # v, tmp88
movaps -16(%rbp), %xmm1 # v, tmp89
shufps $160, %xmm1, %xmm0 #, tmp89, D.4425
popq %rbp #
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE513:
.size silly_sse2, .-silly_sse2
.ident "GCC: (GNU) 4.9.2 20150304 (prerelease)"
.section .note.GNU-stack,"",@progbits
However,
$ gcc -msse3 -o sse.o -c sse.c
In file included from sse.c:6:0:
/usr/lib/gcc/x86_64-unknown-linux-gnu/4.9.2/include/pmmintrin.h: In function ‘silly_sse3’:
/usr/lib/gcc/x86_64-unknown-linux-gnu/4.9.2/include/pmmintrin.h:76:3: error: ‘__builtin_ia32_movsldup’ needs isa option -m32 -msse3
return (__m128) __builtin_ia32_movsldup ((__v4sf)__X);
Adding -m32
gives the same error.
Does that help?
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
+1
Doing this in MSVC is just so much saner, although less "optimized" because compiler can not use SSEn instructions for regular code.
Last llvm devmeeting had a talk about this: http://llvm.org/devmtg/2014-10/Slides/Christopher-Function%20Multiversioning%20Talk.pdf (video also available)
They are working to make this easier, but afaik currently the focus is for optimizer to generate arch-specific functions and automatically choose function at runtime. But they are thinking how to make life easier for including intrinsic headers.