Create a gist now

Instantly share code, notes, and snippets.

Problems with GCCs per-function target specifiers (at least in 4.6.3)
So in RADFFT we have some SSE3 kernel functions that have been the source of a lot
of compilation "fun". With VC++ it's no problem, with GCC it evidently is. We don't
want to specify "-msse3" since that allows the compiler to use SSE3 outside these
functions - all of which are called within a CPU feature dispach block - and that
not only can but has caused breakage for customers in the past.
GCC says "put the SSE3 code in a separate translation unit and compile just that with
-msse3". There's been at least three problems with this that I'm aware of:
a) It forces us to make these small internal DSP kernels public (currently they're
static functions), which is extra namespace pollution, which is not a huge deal
but still blows.
b) We've had fun issues with function attributes "bleeding out" or getting ignored
in the past, mostly involving inlining. This is less of an issue when we're
talking about a separate translation unit - unless there's LTO involved,
which it often is. Yay.
c) Having to add extra command line options, only on some files, only when using
GCC, and only on some platforms (namely x86 targets) is, in that exact
combination, a pretty giant pain in the ass in the build process and super
error-prone. Just not instantiating say SSE3 functions when the compiler doesn't
support the intrinsics is easy. Messing with the build flags on specific files
and under specific circumstances is something of a PITA in all build systems
I've dealt with, but CDep is worse than most.
Okay. So I was pretty excited to see that GCC has this:
(the "target" bit, not the Itanium thing that link is actually to, but I the
anchor for the "target" thing is on the line just below the "target" heading
so you have no idea what it's talking about)
Andh you can also set that via a #pragma
which is all supported on GCC 4.6 and up (probably earlier, didn't check).
So I thought, great, I can just use that, right? Gather all the SSE3 functions
together in one contiguous block, do this before the block:
#pragma GCC push_options
#pragma GCC target("sse3")
#include <pmmintrin.h> // so I actually get the SSE3 intrinsics
and this right after:
#pragma GCC pop_options
Cool. That's reasonable. I can certainly live with that!
And of course it was too good to be true, because if you try that (at least with
the GCC 4.6.3 on our Ubuntu 12.04 LTS Linux builders), what you'll find is that
including "pmmintrin.h" does not in fact work. The target("sse3") thing enables
the builtins, but you still can't include that header. (It fails testing for a
__SSE3__ #define. You can try #defining that manually, but will quickly discover
that this merely causes more interesting breakage further down the line.)
It turns out that the smallest example I can get to produce the behavior I actually
want is this:
#pragma GCC target("no-sse3") // at the top of the file
#include <emmintrin.h> // for SSE2 intrinsics; required to include this here!
#pragma GCC push_options
#pragma GCC target("sse3")
#include <pmmintrin.h> // need to include this *in* the sse3 block (fair enough)
__m128 silly_sse3(float *x)
return _mm_moveldup_ps(_mm_load_ps(x));
#pragma GCC pop_options
__m128 silly_sse2(float *x)
__m128 v = _mm_load_ps(x);
return _mm_shuffle_ps(v, v, 0xa0);
And then compile that with "-msse3". Anything less and it either doesn't
compile or doesn't link. Having to do this pretty thoroughly dismantles the
benefit of being able to specify this per-function.

Doing this in MSVC is just so much saner, although less "optimized" because compiler can not use SSEn instructions for regular code.

Last llvm devmeeting had a talk about this: (video also available)
They are working to make this easier, but afaik currently the focus is for optimizer to generate arch-specific functions and automatically choose function at runtime. But they are thinking how to make life easier for including intrinsic headers.

gpakosz commented Apr 30, 2015

Compiles fine with gcc (GCC) 4.9.2 20150304 (prerelease)

$ gcc -S -fverbose-asm  -c sse.c
$ cat sse.s
        .file   "sse.c"
# GNU C (GCC) version 4.9.2 20150304 (prerelease) (x86_64-unknown-linux-gnu)
#       compiled by GNU C version 4.9.2 20150304 (prerelease), GMP version 6.0.0, MPFR version 3.1.2-p11, MPC version 1.0.3
# GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072
# options passed:  sse.c -mtune=generic -march=x86-64 -fverbose-asm
# options enabled:  -faggressive-loop-optimizations
# -fasynchronous-unwind-tables -fauto-inc-dec -fcommon
# -fdelete-null-pointer-checks -fdwarf2-cfi-asm -fearly-inlining
# -feliminate-unused-debug-types -ffunction-cse -fgcse-lm -fgnu-runtime
# -fgnu-unique -fident -finline-atomics -fira-hoist-pressure
# -fira-share-save-slots -fira-share-spill-slots -fivopts
# -fkeep-static-consts -fleading-underscore -flifetime-dse -fmath-errno
# -fmerge-debug-strings -fpeephole -fprefetch-loop-arrays
# -freg-struct-return -fsched-critical-path-heuristic
# -fsched-dep-count-heuristic -fsched-group-heuristic -fsched-interblock
# -fsched-last-insn-heuristic -fsched-rank-heuristic -fsched-spec
# -fsched-spec-insn-heuristic -fsched-stalled-insns-dep -fshow-column
# -fsigned-zeros -fsplit-ivs-in-unroller -fstrict-volatile-bitfields
# -fsync-libcalls -ftrapping-math -ftree-coalesce-vars -ftree-cselim
# -ftree-forwprop -ftree-loop-if-convert -ftree-loop-im -ftree-loop-ivcanon
# -ftree-loop-optimize -ftree-parallelize-loops= -ftree-phiprop
# -ftree-reassoc -ftree-scev-cprop -funit-at-a-time -funwind-tables
# -fverbose-asm -fzero-initialized-in-bss -m128bit-long-double -m64 -m80387
# -malign-stringops -mavx256-split-unaligned-load
# -mavx256-split-unaligned-store -mfancy-math-387 -mfp-ret-in-387 -mfxsr
# -mglibc -mieee-fp -mlong-double-80 -mmmx -mno-sse4 -mpush-args -mred-zone
# -msse -msse2 -mtls-direct-seg-refs

        .globl  silly_sse3
        .type   silly_sse3, @function
        pushq   %rbp    #
        .cfi_def_cfa_offset 16
        .cfi_offset 6, -16
        movq    %rsp, %rbp      #,
        .cfi_def_cfa_register 6
        movq    %rdi, -40(%rbp) # x, x
        movq    -40(%rbp), %rax # x, tmp86
        movq    %rax, -8(%rbp)  # tmp86, __P
        movq    -8(%rbp), %rax  # __P, tmp87
        movaps  (%rax), %xmm0   # MEM[(__v4sf *)__P_4], D.4423
        movaps  %xmm0, -32(%rbp)        # D.4423, __X
        movsldup        -32(%rbp), %xmm0        # __X, D.4424
        popq    %rbp    #
        .cfi_def_cfa 7, 8
        .size   silly_sse3, .-silly_sse3
        .globl  silly_sse2
        .type   silly_sse2, @function
        pushq   %rbp    #
        .cfi_def_cfa_offset 16
        .cfi_offset 6, -16
        movq    %rsp, %rbp      #,
        .cfi_def_cfa_register 6
        movq    %rdi, -40(%rbp) # x, x
        movq    -40(%rbp), %rax # x, tmp86
        movq    %rax, -24(%rbp) # tmp86, __P
        movq    -24(%rbp), %rax # __P, tmp87
        movaps  (%rax), %xmm0   # MEM[(__v4sf *)__P_4], D.4425
        movaps  %xmm0, -16(%rbp)        # D.4425, v
        movaps  -16(%rbp), %xmm0        # v, tmp88
        movaps  -16(%rbp), %xmm1        # v, tmp89
        shufps  $160, %xmm1, %xmm0      #, tmp89, D.4425
        popq    %rbp    #
        .cfi_def_cfa 7, 8
        .size   silly_sse2, .-silly_sse2
        .ident  "GCC: (GNU) 4.9.2 20150304 (prerelease)"
        .section        .note.GNU-stack,"",@progbits


$ gcc -msse3 -o sse.o -c sse.c
In file included from sse.c:6:0:
/usr/lib/gcc/x86_64-unknown-linux-gnu/4.9.2/include/pmmintrin.h: In function ‘silly_sse3’:
/usr/lib/gcc/x86_64-unknown-linux-gnu/4.9.2/include/pmmintrin.h:76:3: error: ‘__builtin_ia32_movsldup’ needs isa option -m32 -msse3
  return (__m128) __builtin_ia32_movsldup ((__v4sf)__X);

Adding -m32 gives the same error.

Does that help?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment