
@alexcrichton
Created January 29, 2018 15:31

This is intended to be a tracking issue for implementing all vendor intrinsics in this repository, and also a guide documenting the process of adding new vendor intrinsics to this crate.

If you decide to implement a set of vendor intrinsics, please check the list below to make sure somebody else isn't already working on them. If an intrinsic isn't checked off and doesn't have a name next to it, feel free to comment that you'd like to implement it!

At a high level, each vendor intrinsic should correspond to a single exported Rust function with an appropriate target_feature attribute. Here's an example for _mm_adds_epi16:

/// Add packed 16-bit integers in `a` and `b` using saturation.
#[inline(always)]
#[target_feature = "+sse2"]
#[cfg_attr(test, assert_instr(paddsw))]
pub unsafe fn _mm_adds_epi16(a: i16x8, b: i16x8) -> i16x8 {
    // `paddsw` is a compiler intrinsic declared in an `extern` block (sketched below).
    paddsw(a, b)
}

Let's break this down:

  • The #[inline(always)] attribute is added because a vendor intrinsic is intended to correspond to a single particular CPU instruction; a vendor intrinsic that is compiled into an actual function call instead of being inlined could be quite disastrous for performance.
  • The #[target_feature = "+sse2"] attribute instructs the compiler to generate code with the sse2 target feature enabled, regardless of the target platform. That is, even if you're compiling for a platform that doesn't support sse2, the compiler will still generate code for _mm_adds_epi16 as if sse2 support existed. Without this attribute, the compiler might not generate the intended CPU instruction.
  • The #[cfg_attr(test, assert_instr(paddsw))] attribute indicates that when we're testing the crate we'll assert that the paddsw instruction is generated inside this function, ensuring that the SIMD intrinsic truly is an intrinsic for the instruction!
  • The types of the vectors given to the intrinsic should generally match the types as provided in the vendor interface. We'll talk about this more below.
  • The implementation of the vendor intrinsic is generally very simple. Remember, the goal is to compile a call to _mm_adds_epi16 down to a single particular CPU instruction. As such, the implementation typically defers to a compiler intrinsic (in this case, paddsw) when one is available; a sketch of the extern block that brings paddsw into scope follows this list. More on this below as well.
  • The intrinsic itself is unsafe due to the usage of #[target_feature]: calling it on a CPU that doesn't actually support the feature is undefined behavior.
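
For reference, here's what the deferral to paddsw looks like in practice. This is a hedged sketch of the extern block that brings the compiler intrinsic into scope; the exact link_name below is my best guess at the LLVM intrinsic name, so double-check it against the LLVM intrinsic list in the References section below:

#[allow(improper_ctypes)]
extern "C" {
    // Saturating add of packed signed 16-bit integers: the compiler
    // intrinsic backing `_mm_adds_epi16`. The `link_name` here is an
    // assumption; verify it against the LLVM intrinsic list.
    #[link_name = "llvm.x86.sse2.padds.w"]
    fn paddsw(a: i16x8, b: i16x8) -> i16x8;
}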

Once a function has been added, you should also add at least one test for basic functionality. Here's an example for _mm_adds_epi16:

#[simd_test = "sse2"]
unsafe fn _mm_adds_epi16() {
    let a = i16x8::new(0, 1, 2, 3, 4, 5, 6, 7);
    let b = i16x8::new(8, 9, 10, 11, 12, 13, 14, 15);
    let r = sse2::_mm_adds_epi16(a, b);
    let e = i16x8::new(8, 10, 12, 14, 16, 18, 20, 22);
    assert_eq!(r, e);
}

Note that #[simd_test] is essentially the same as #[test]; it's just a custom macro that enables the target feature in the test and generates a wrapper to ensure the feature is actually available on the local CPU as well (a rough sketch of that wrapper is below).
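
Conceptually, the generated wrapper behaves something like the following. This is only a rough sketch of the idea, not the macro's actual expansion, and sse2_supported() is a hypothetical stand-in for whatever runtime feature detection the macro really emits:

// Rough sketch of the wrapper #[simd_test = "sse2"] conceptually generates.
// `sse2_supported()` is a hypothetical placeholder for the macro's real
// runtime feature detection; the actual expansion differs in the details.
#[test]
fn _mm_adds_epi16_wrapper() {
    if sse2_supported() {
        // The annotated test body runs with the sse2 target feature enabled.
        unsafe { _mm_adds_epi16() }
    } else {
        println!("skipping _mm_adds_epi16 test: sse2 not available on this CPU");
    }
}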

Finally, once that's done, send a PR!

Determining types

Determining the function signature of each vendor intrinsic can be tricky depending on the specificity of the vendor API. For SSE, Intel generally has three types in their interface:

  • __m128 consists of 4 single-precision (32-bit) floating point numbers.
  • __m128d consists of 2 double-precision (64-bit) floating point numbers.
  • __m128i consists of N integers, where N can be 16, 8, 4 or 2. The corresponding bit sizes for each value of N are 8-bit, 16-bit, 32-bit and 64-bit, respectively. Finally, there are signed and unsigned variants for each value of N, which means __m128i can be mapped to one of eight possible concrete integer types.

In terms of the stdsimd crate, the first two floating point types have a straightforward translation: __m128 maps to f32x4, while __m128d maps to f64x2.

Unfortunately, since __m128i can correspond to any number of integer types, we need to actually inspect the vendor intrinsic to determine the type. Sometimes this is hinted at in the name of the intrinsic itself. Continuing with our previous example, _mm_adds_epi16, we can infer that it is a signed operation on an integer vector consisting of eight 16-bit integers. Namely, the epi means signed (whereas epu means unsigned) and 16 means 16-bit.
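
Following that same reading, here's a hedged sketch of the signature you'd arrive at for _mm_adds_epu8 (epu8: unsigned, 8-bit lanes, hence u8x16), which corresponds to the paddusb instruction. The paddusb binding itself would come from an extern block just like paddsw above:

/// Add packed unsigned 8-bit integers in `a` and `b` using saturation.
/// (Sketch only: the `epu8` suffix tells us the element type is u8, and a
/// 128-bit vector therefore holds 16 of them, i.e. u8x16.)
#[inline(always)]
#[target_feature = "+sse2"]
#[cfg_attr(test, assert_instr(paddusb))]
pub unsafe fn _mm_adds_epu8(a: u8x16, b: u8x16) -> u8x16 {
    paddusb(a, b)
}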

Fortunately, Clang (and LLVM) have already determined the specific concrete integer types for most of the vendor intrinsics, but they aren't available in any easily accessible form (as far as this author knows). For example, you can see the types for _mm_adds_epi16 in Clang's emmintrin.h header file.

Writing the implementation

An implementation of an intrinsic (so far) generally has one of three shapes:

  1. The vendor intrinsic does not have any corresponding compiler intrinsic, so you must write the implementation in such a way that the compiler will recognize it and produce the desired codegen. For example, the _mm_add_epi16 intrinsic (note the missing s in add) is implemented via a + b, which compiles down to LLVM's cross-platform SIMD vector API (see the first sketch after this list).
  2. The vendor intrinsic does have a corresponding compiler intrinsic, so you must write an extern block to bring that intrinsic into scope and then call it. The example above (_mm_adds_epi16) uses this approach.
  3. The vendor intrinsic has a parameter that must be a constant value when given to the CPU instruction, where that constant is often a parameter that impacts the operation of the intrinsic. This means the implementation of the vendor intrinsic must guarantee that a particular parameter be a constant. This is tricky because Rust doesn't (yet) have a stable way of doing this, so we have to do it ourselves. How you do it can vary, but one particularly gnarly example is _mm_cmpestri (make sure to look at the constify_imm8! macro); the second sketch after this list illustrates the general pattern.
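
To make the first and third shapes a bit more concrete, here are two hedged sketches (neither is copied from the crate, so treat the details as assumptions):

// Shape 1 (sketch): no compiler intrinsic is needed; plain `a + b` on the
// SIMD vector types compiles down to LLVM's vector add, which lowers to the
// paddw instruction for _mm_add_epi16.
#[inline(always)]
#[target_feature = "+sse2"]
#[cfg_attr(test, assert_instr(paddw))]
pub unsafe fn _mm_add_epi16(a: i16x8, b: i16x8) -> i16x8 {
    a + b
}

// Shape 3 (sketch): the CPU instruction takes a constant immediate, so the
// wrapper matches on the runtime value and passes a literal in every arm.
// `intrinsic_with_imm` is a hypothetical compiler intrinsic used purely for
// illustration; the crate's constify_imm8! macro automates generating a
// match like this with one arm per possible immediate value.
pub unsafe fn sketch_with_immediate(a: i16x8, imm2: i32) -> i16x8 {
    macro_rules! call {
        ($imm:expr) => {
            intrinsic_with_imm(a, $imm)
        };
    }
    match imm2 & 0b11 {
        0 => call!(0),
        1 => call!(1),
        2 => call!(2),
        _ => call!(3),
    }
}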

References

The compiler intrinsics available to us through LLVM can be found here: https://gist.github.com/anonymous/a25d3e3b4c14ee68d63bd1dcb0e1223c

The Intel vendor intrinsic API can be found here: https://gist.github.com/anonymous/25d752fda8521d29699a826b980218fc

The Clang header files for vendor intrinsics can also be incredibly useful. When in doubt, Do What Clang Does: https://github.com/llvm-mirror/clang/tree/master/lib/Headers

TODO

MMX

mmx

  • _mm_add_pi16 (a, b) // paddw
  • _mm_add_pi32 (a, b) // paddd
  • _mm_add_pi8 (a, b) // paddb
  • _mm_adds_pi16 (a, b) // paddsw
  • _mm_adds_pi8 (a, b) // paddsb
  • _mm_adds_pu16 (a, b) // paddusw
  • _mm_adds_pu8 (a, b) // paddusb
  • _mm_and_si64 (a, b) // pand
  • _mm_andnot_si64 (a, b) // pandn
  • _mm_cmpeq_pi16 (a, b) // pcmpeqw
  • _mm_cmpeq_pi32 (a, b) // pcmpeqd
  • _mm_cmpeq_pi8 (a, b) // pcmpeqb
  • _mm_cmpgt_pi16 (a, b) // pcmpgtw
  • _mm_cmpgt_pi32 (a, b) // pcmpgtd
  • _mm_cmpgt_pi8 (a, b) // pcmpgtb
  • __int64 _mm_cvtm64_si64 (a) // movq
  • _mm_cvtsi32_si64 (int a) // movd
  • _mm_cvtsi64_m64 (__int64 a) // movq
  • int _mm_cvtsi64_si32 (a) // movd
  • void _m_empty (void) // emms
  • void _mm_empty (void) // emms
  • _m_from_int (int a) // movd
  • _m_from_int64 (__int64 a) // movq
  • _mm_madd_pi16 (a, b) // pmaddwd
  • _mm_mulhi_pi16 (a, b) // pmulhw
  • _mm_mullo_pi16 (a, b) // pmullw
  • _mm_or_si64 (a, b) // por
  • _mm_packs_pi16 (a, b) // packsswb
  • _mm_packs_pi32 (a, b) // packssdw
  • _mm_packs_pu16 (a, b) // packuswb
  • _m_packssdw (a, b) // packssdw
  • _m_packsswb (a, b) // packsswb
  • _m_packuswb (a, b) // packuswb
  • _m_paddb (a, b) // paddb
  • _m_paddd (a, b) // paddd
  • _m_paddsb (a, b) // paddsb
  • _m_paddsw (a, b) // paddsw
  • _m_paddusb (a, b) // paddusb
  • _m_paddusw (a, b) // paddusw
  • _m_paddw (a, b) // paddw
  • _m_pand (a, b) // pand
  • _m_pandn (a, b) // pandn
  • _m_pcmpeqb (a, b) // pcmpeqb
  • _m_pcmpeqd (a, b) // pcmpeqd
  • _m_pcmpeqw (a, b) // pcmpeqw
  • _m_pcmpgtb (a, b) // pcmpgtb
  • _m_pcmpgtd (a, b) // pcmpgtd
  • _m_pcmpgtw (a, b) // pcmpgtw
  • _m_pmaddwd (a, b) // pmaddwd
  • _m_pmulhw (a, b) // pmulhw
  • _m_pmullw (a, b) // pmullw
  • _m_por (a, b) // por
  • _m_pslld (a, count) // pslld
  • _m_pslldi (a, int imm8) // pslld
  • _m_psllq (a, count) // psllq
  • _m_psllqi (a, int imm8) // psllq
  • _m_psllw (a, count) // psllw
  • _m_psllwi (a, int imm8) // psllw
  • _m_psrad (a, count) // psrad
  • _m_psradi (a, int imm8) // psrad
  • _m_psraw (a, count) // psraw
  • _m_psrawi (a, int imm8) // psraw
  • _m_psrld (a, count) // psrld
  • _m_psrldi (a, int imm8) // psrld
  • _m_psrlq (a, count) // psrlq
  • _m_psrlqi (a, int imm8) // psrlq
  • _m_psrlw (a, count) // psrlw
  • _m_psrlwi (a, int imm8) // psrlw
  • _m_psubb (a, b) // psubb
  • _m_psubd (a, b) // psubd
  • _m_psubsb (a, b) // psubsb
  • _m_psubsw (a, b) // psubsw
  • _m_psubusb (a, b) // psubusb
  • _m_psubusw (a, b) // psubusw
  • _m_psubw (a, b) // psubw
  • _m_punpckhbw (a, b) // punpckhbw
  • _m_punpckhdq (a, b) // punpckhdq
  • _m_punpckhwd (a, b) // punpckhwd
  • _m_punpcklbw (a, b) // punpcklbw
  • _m_punpckldq (a, b) // punpckldq
  • _m_punpcklwd (a, b) // punpcklwd
  • _m_pxor (a, b) // pxor
  • _mm_set_pi16 (short e3, short e2, short e1, short e0) // ...
  • _mm_set_pi32 (int e1, int e0) // ...
  • _mm_set_pi8 (char e7, char e6, char e5, char e4, char e3, char e2, char e1, char e0) // ...
  • _mm_set1_pi16 (short e0) // ...
  • _mm_set1_pi32 (int a) // ...
  • _mm_set1_pi8 (char a) // ...
  • _mm_setr_pi16 (short e3, short e2, short e1, short e0) // ...
  • _mm_setr_pi32 (int e1, int e0) // ...
  • _mm_setr_pi8 (char e7, char e6, char e5, char e4, char e3, char e2, char e1, char e0) // ...
  • _mm_setzero_si64 (void) // pxor
  • _mm_sll_pi16 (a, count) // psllw
  • _mm_sll_pi32 (a, count) // pslld
  • _mm_sll_si64 (a, count) // psllq
  • _mm_slli_pi16 (a, int imm8) // psllw
  • _mm_slli_pi32 (a, int imm8) // pslld
  • _mm_slli_si64 (a, int imm8) // psllq
  • _mm_sra_pi16 (a, count) // psraw
  • _mm_sra_pi32 (a, count) // psrad
  • _mm_srai_pi16 (a, int imm8) // psraw
  • _mm_srai_pi32 (a, int imm8) // psrad
  • _mm_srl_pi16 (a, count) // psrlw
  • _mm_srl_pi32 (a, count) // psrld
  • _mm_srl_si64 (a, count) // psrlq
  • _mm_srli_pi16 (a, int imm8) // psrlw
  • _mm_srli_pi32 (a, int imm8) // psrld
  • _mm_srli_si64 (a, int imm8) // psrlq
  • _mm_sub_pi16 (a, b) // psubw
  • _mm_sub_pi32 (a, b) // psubd
  • _mm_sub_pi8 (a, b) // psubb
  • _mm_subs_pi16 (a, b) // psubsw
  • _mm_subs_pi8 (a, b) // psubsb
  • _mm_subs_pu16 (a, b) // psubusw
  • _mm_subs_pu8 (a, b) // psubusb
  • int _m_to_int (a) // movd
  • __int64 _m_to_int64 (a) // movq
  • _mm_unpackhi_pi16 (a, b) // punpckhwd
  • _mm_unpackhi_pi32 (a, b) // punpckhdq
  • _mm_unpackhi_pi8 (a, b) // punpckhbw
  • _mm_unpacklo_pi16 (a, b) // punpcklwd
  • _mm_unpacklo_pi32 (a, b) // punpckldq
  • _mm_unpacklo_pi8 (a, b) // punpcklbw
  • _mm_xor_si64 (a, b) // pxor

SSE (complete)

sse

  • _MM_TRANSPOSE4_PS
  • _mm_getcsr
  • _mm_setcsr
  • _MM_GET_EXCEPTION_STATE
  • _MM_SET_EXCEPTION_STATE
  • _MM_GET_EXCEPTION_MASK
  • _MM_SET_EXCEPTION_MASK
  • _MM_GET_ROUNDING_MODE
  • _MM_SET_ROUNDING_MODE
  • _MM_GET_FLUSH_ZERO_MODE
  • _MM_SET_FLUSH_ZERO_MODE
  • _mm_prefetch
  • _mm_sfence
  • _mm_max_pi16
  • _m_pmaxsw
  • _mm_max_pu8
  • _m_pmaxub
  • _mm_min_pi16
  • _m_pminsw
  • _mm_min_pu8
  • _m_pminub
  • _mm_mulhi_pu16
  • _m_pmulhuw
  • _mm_avg_pu8
  • _m_pavgb
  • _mm_avg_pu16
  • _m_pavgw
  • _mm_sad_pu8
  • _m_psadbw
  • _mm_cvtsi32_ss
  • _mm_cvt_si2ss
  • _mm_cvtsi64_ss
  • _mm_cvtpi32_ps
  • _mm_cvt_pi2ps
  • _mm_cvtpi16_ps
  • _mm_cvtpu16_ps
  • _mm_cvtpi8_ps
  • _mm_cvtpu8_ps
  • _mm_cvtpi32x2_ps
  • _mm_stream_pi
  • _mm_maskmove_si64
  • _m_maskmovq
  • _mm_extract_pi16
  • _m_pextrw
  • _mm_insert_pi16
  • _m_pinsrw
  • _mm_movemask_pi8
  • _m_pmovmskb
  • _mm_shuffle_pi16
  • _m_pshufw
  • _mm_add_ss
  • _mm_add_ps
  • _mm_sub_ss
  • _mm_sub_ps
  • _mm_mul_ss
  • _mm_mul_ps
  • _mm_div_ss
  • _mm_div_ps
  • _mm_sqrt_ss
  • _mm_sqrt_ps
  • _mm_rcp_ss
  • _mm_rcp_ps
  • _mm_rsqrt_ss
  • _mm_rsqrt_ps
  • _mm_min_ss
  • _mm_min_ps
  • _mm_max_ss
  • _mm_max_ps
  • _mm_and_ps
  • _mm_andnot_ps
  • _mm_or_ps
  • _mm_xor_ps
  • _mm_cmpeq_ss
  • _mm_cmpeq_ps
  • _mm_cmplt_ss
  • _mm_cmplt_ps
  • _mm_cmple_ss
  • _mm_cmple_ps
  • _mm_cmpgt_ss
  • _mm_cmpgt_ps
  • _mm_cmpge_ss
  • _mm_cmpge_ps
  • _mm_cmpneq_ss
  • _mm_cmpneq_ps
  • _mm_cmpnlt_ss
  • _mm_cmpnlt_ps
  • _mm_cmpnle_ss
  • _mm_cmpnle_ps
  • _mm_cmpngt_ss
  • _mm_cmpngt_ps
  • _mm_cmpnge_ss
  • _mm_cmpnge_ps
  • _mm_cmpord_ss
  • _mm_cmpord_ps
  • _mm_cmpunord_ss
  • _mm_cmpunord_ps
  • _mm_comieq_ss
  • _mm_comilt_ss
  • _mm_comile_ss
  • _mm_comigt_ss
  • _mm_comige_ss
  • _mm_comineq_ss
  • _mm_ucomieq_ss
  • _mm_ucomilt_ss
  • _mm_ucomile_ss
  • _mm_ucomigt_ss
  • _mm_ucomige_ss
  • _mm_ucomineq_ss
  • _mm_cvtss_si32
  • _mm_cvt_ss2si
  • _mm_cvtss_si64
  • _mm_cvtss_f32
  • _mm_cvtps_pi32
  • _mm_cvt_ps2pi
  • _mm_cvttss_si32
  • _mm_cvtt_ss2si
  • _mm_cvttss_si64
  • _mm_cvttps_pi32
  • _mm_cvtt_ps2pi
  • _mm_cvtps_pi16
  • _mm_cvtps_pi8
  • _mm_set_ss
  • _mm_set1_ps
  • _mm_set_ps1
  • _mm_set_ps
  • _mm_setr_ps
  • _mm_setzero_ps
  • _mm_loadh_pi
  • _mm_loadl_pi
  • _mm_load_ss
  • _mm_load1_ps
  • _mm_load_ps1
  • _mm_load_ps
  • _mm_loadu_ps
  • _mm_loadr_ps
  • _mm_stream_ps
  • _mm_storeh_pi
  • _mm_storel_pi
  • _mm_store_ss
  • _mm_store1_ps
  • _mm_store_ps1
  • _mm_store_ps
  • _mm_storeu_ps
  • _mm_storer_ps
  • _mm_move_ss
  • _mm_shuffle_ps
  • _mm_unpackhi_ps
  • _mm_unpacklo_ps
  • _mm_movehl_ps
  • _mm_movelh_ps
  • _mm_movemask_ps
  • _mm_undefined_ps

SSE2

sse2

  • _mm_pause
  • _mm_clflush
  • _mm_lfence
  • _mm_mfence
  • _mm_add_epi8
  • _mm_add_epi16
  • _mm_add_epi32
  • _mm_add_si64
  • _mm_add_epi64
  • _mm_adds_epi8
  • _mm_adds_epi16
  • _mm_adds_epu8
  • _mm_adds_epu16
  • _mm_avg_epu8
  • _mm_avg_epu16
  • _mm_madd_epi16
  • _mm_max_epi16
  • _mm_max_epu8
  • _mm_min_epi16
  • _mm_min_epu8
  • _mm_mulhi_epi16
  • _mm_mulhi_epu16
  • _mm_mullo_epi16
  • _mm_mul_su32
  • _mm_mul_epu32
  • _mm_sad_epu8
  • _mm_sub_epi8
  • _mm_sub_epi16
  • _mm_sub_epi32
  • _mm_sub_si64
  • _mm_sub_epi64
  • _mm_subs_epi8
  • _mm_subs_epi16
  • _mm_subs_epu8
  • _mm_subs_epu16
  • _mm_slli_si128
  • _mm_bslli_si128
  • _mm_bsrli_si128
  • _mm_slli_epi16
  • _mm_sll_epi16
  • _mm_slli_epi32
  • _mm_sll_epi32
  • _mm_slli_epi64
  • _mm_sll_epi64
  • _mm_srai_epi16
  • _mm_sra_epi16
  • _mm_srai_epi32
  • _mm_sra_epi32
  • _mm_srli_si128
  • _mm_srli_epi16
  • _mm_srl_epi16
  • _mm_srli_epi32
  • _mm_srl_epi32
  • _mm_srli_epi64
  • _mm_srl_epi64
  • _mm_and_si128
  • _mm_andnot_si128
  • _mm_or_si128
  • _mm_xor_si128
  • _mm_cmpeq_epi8
  • _mm_cmpeq_epi16
  • _mm_cmpeq_epi32
  • _mm_cmpgt_epi8
  • _mm_cmpgt_epi16
  • _mm_cmpgt_epi32
  • _mm_cmplt_epi8
  • _mm_cmplt_epi16
  • _mm_cmplt_epi32
  • _mm_cvtepi32_pd
  • _mm_cvtsi32_sd
  • _mm_cvtsi64_sd
  • _mm_cvtsi64x_sd
  • _mm_cvtepi32_ps
  • _mm_cvtpi32_pd
  • _mm_cvtsi32_si128
  • _mm_cvtsi64_si128
  • _mm_cvtsi64x_si128
  • _mm_cvtsi128_si32
  • _mm_cvtsi128_si64
  • _mm_cvtsi128_si64x
  • _mm_set_epi64
  • _mm_set_epi64x
  • _mm_set_epi32
  • _mm_set_epi16
  • _mm_set_epi8
  • _mm_set1_epi64
  • _mm_set1_epi64x
  • _mm_set1_epi32
  • _mm_set1_epi16
  • _mm_set1_epi8
  • _mm_setr_epi64
  • _mm_setr_epi32
  • _mm_setr_epi16
  • _mm_setr_epi8
  • _mm_setzero_si128
  • _mm_loadl_epi64
  • _mm_load_si128
  • _mm_loadu_si128
  • _mm_maskmoveu_si128
  • _mm_store_si128
  • _mm_storeu_si128
  • _mm_storel_epi64
  • _mm_stream_si128
  • _mm_stream_si32
  • _mm_stream_si64
  • _mm_movepi64_pi64
  • _mm_movpi64_epi64
  • _mm_move_epi64
  • _mm_packs_epi16
  • _mm_packs_epi32
  • _mm_packus_epi16
  • _mm_extract_epi16
  • _mm_insert_epi16
  • _mm_movemask_epi8
  • _mm_shuffle_epi32
  • _mm_shufflehi_epi16
  • _mm_shufflelo_epi16
  • _mm_unpackhi_epi8
  • _mm_unpackhi_epi16
  • _mm_unpackhi_epi32
  • _mm_unpackhi_epi64
  • _mm_unpacklo_epi8
  • _mm_unpacklo_epi16
  • _mm_unpacklo_epi32
  • _mm_unpacklo_epi64
  • _mm_add_sd
  • _mm_add_pd
  • _mm_div_sd
  • _mm_div_pd
  • _mm_max_sd
  • _mm_max_pd
  • _mm_min_sd
  • _mm_min_pd
  • _mm_mul_sd
  • _mm_mul_pd
  • _mm_sqrt_sd
  • _mm_sqrt_pd
  • _mm_sub_sd
  • _mm_sub_pd
  • _mm_and_pd
  • _mm_andnot_pd
  • _mm_or_pd
  • _mm_xor_pd
  • _mm_cmpeq_sd
  • _mm_cmplt_sd
  • _mm_cmple_sd
  • _mm_cmpgt_sd
  • _mm_cmpge_sd
  • _mm_cmpord_sd
  • _mm_cmpunord_sd
  • _mm_cmpneq_sd
  • _mm_cmpnlt_sd
  • _mm_cmpnle_sd
  • _mm_cmpngt_sd
  • _mm_cmpnge_sd
  • _mm_cmpeq_pd
  • _mm_cmplt_pd
  • _mm_cmple_pd
  • _mm_cmpgt_pd
  • _mm_cmpge_pd
  • _mm_cmpord_pd
  • _mm_cmpunord_pd
  • _mm_cmpneq_pd
  • _mm_cmpnlt_pd
  • _mm_cmpnle_pd
  • _mm_cmpngt_pd
  • _mm_cmpnge_pd
  • _mm_comieq_sd
  • _mm_comilt_sd
  • _mm_comile_sd
  • _mm_comigt_sd
  • _mm_comige_sd
  • _mm_comineq_sd
  • _mm_ucomieq_sd
  • _mm_ucomilt_sd
  • _mm_ucomile_sd
  • _mm_ucomigt_sd
  • _mm_ucomige_sd
  • _mm_ucomineq_sd
  • _mm_cvtpd_ps
  • _mm_cvtps_pd
  • _mm_cvtpd_epi32
  • _mm_cvtsd_si32
  • _mm_cvtsd_si64
  • _mm_cvtsd_si64x
  • _mm_cvtsd_ss
  • _mm_cvtsd_f64
  • _mm_cvtss_sd
  • _mm_cvttpd_epi32
  • _mm_cvttsd_si32
  • _mm_cvttsd_si64
  • _mm_cvttsd_si64x
  • _mm_cvtps_epi32
  • _mm_cvttps_epi32
  • _mm_cvtpd_pi32
  • _mm_cvttpd_pi32
  • _mm_set_sd
  • _mm_set1_pd
  • _mm_set_pd1
  • _mm_set_pd
  • _mm_setr_pd
  • _mm_setzero_pd
  • _mm_load_pd
  • _mm_load1_pd
  • _mm_load_pd1
  • _mm_loadr_pd
  • _mm_loadu_pd
  • _mm_load_sd
  • _mm_loadh_pd
  • _mm_loadl_pd
  • _mm_stream_pd
  • _mm_store_sd
  • _mm_store1_pd
  • _mm_store_pd1
  • _mm_store_pd
  • _mm_storeu_pd
  • _mm_storer_pd
  • _mm_storeh_pd
  • _mm_storel_pd
  • _mm_unpackhi_pd
  • _mm_unpacklo_pd
  • _mm_movemask_pd
  • _mm_shuffle_pd
  • _mm_move_sd
  • _mm_castpd_ps
  • _mm_castpd_si128
  • _mm_castps_pd
  • _mm_castps_si128
  • _mm_castsi128_pd
  • _mm_castsi128_ps
  • _mm_undefined_pd
  • _mm_undefined_si128

SSE3 (complete)

sse3

  • _mm_addsub_ps
  • _mm_addsub_pd
  • _mm_hadd_pd
  • _mm_hadd_ps
  • _mm_hsub_pd
  • _mm_hsub_ps
  • _mm_lddqu_si128
  • _mm_movedup_pd
  • _mm_loaddup_pd
  • _mm_movehdup_ps
  • _mm_moveldup_ps

SSSE3 (complete)

ssse3

  • _mm_abs_pi8
  • _mm_abs_epi8
  • _mm_abs_pi16
  • _mm_abs_epi16
  • _mm_abs_pi32
  • _mm_abs_epi32
  • _mm_shuffle_epi8
  • _mm_shuffle_pi8
  • _mm_alignr_epi8
  • _mm_alignr_pi8
  • _mm_hadd_epi16
  • _mm_hadds_epi16
  • _mm_hadd_epi32
  • _mm_hadd_pi16
  • _mm_hadd_pi32
  • _mm_hadds_pi16
  • _mm_hsub_epi16
  • _mm_hsubs_epi16
  • _mm_hsub_epi32
  • _mm_hsub_pi16
  • _mm_hsub_pi32
  • _mm_hsubs_pi16
  • _mm_maddubs_epi16
  • _mm_maddubs_pi16
  • _mm_mulhrs_epi16
  • _mm_mulhrs_pi16
  • _mm_sign_epi8
  • _mm_sign_epi16
  • _mm_sign_epi32
  • _mm_sign_pi8
  • _mm_sign_pi16
  • _mm_sign_pi32

SSE4.1

sse4.1

  • _mm_blend_pd
  • _mm_blend_ps
  • _mm_blendv_pd
  • _mm_blendv_ps
  • _mm_blendv_epi8
  • _mm_blend_epi16
  • _mm_dp_pd
  • _mm_dp_ps
  • _mm_extract_ps
  • _mm_extract_epi8
  • _mm_extract_epi32
  • _mm_extract_epi64
  • _mm_insert_ps
  • _mm_insert_epi8
  • _mm_insert_epi32
  • _mm_insert_epi64
  • _mm_max_epi8
  • _mm_max_epi32
  • _mm_max_epu32
  • _mm_max_epu16
  • _mm_min_epi8
  • _mm_min_epi32
  • _mm_min_epu32
  • _mm_min_epu16
  • _mm_packus_epi32
  • _mm_cmpeq_epi64
  • _mm_cvtepi8_epi16
  • _mm_cvtepi8_epi32
  • _mm_cvtepi8_epi64
  • _mm_cvtepi16_epi32
  • _mm_cvtepi16_epi64
  • _mm_cvtepi32_epi64
  • _mm_cvtepu8_epi16
  • _mm_cvtepu8_epi32
  • _mm_cvtepu8_epi64
  • _mm_cvtepu16_epi32
  • _mm_cvtepu16_epi64
  • _mm_cvtepu32_epi64
  • _mm_mul_epi32
  • _mm_mullo_epi32
  • _mm_testz_si128
  • _mm_testc_si128
  • _mm_testnzc_si128
  • _mm_test_all_zeros
  • _mm_test_mix_ones_zeros
  • _mm_test_all_ones
  • _mm_round_pd
  • _mm_floor_pd
  • _mm_ceil_pd
  • _mm_round_ps
  • _mm_floor_ps
  • _mm_ceil_ps
  • _mm_round_sd
  • _mm_floor_sd
  • _mm_ceil_sd
  • _mm_round_ss
  • _mm_floor_ss
  • _mm_ceil_ss
  • _mm_minpos_epu16
  • _mm_mpsadbw_epu8
  • _mm_stream_load_si128

SSE4.2 (complete)

sse4.2

  • _mm_cmpistrm
  • _mm_cmpistri
  • _mm_cmpistrz
  • _mm_cmpistrc
  • _mm_cmpistrs
  • _mm_cmpistro
  • _mm_cmpistra
  • _mm_cmpestrm
  • _mm_cmpestri
  • _mm_cmpestrz
  • _mm_cmpestrc
  • _mm_cmpestrs
  • _mm_cmpestro
  • _mm_cmpestra
  • _mm_cmpgt_epi64
  • _mm_crc32_u8
  • _mm_crc32_u16
  • _mm_crc32_u32
  • _mm_crc32_u64

SSE4a (blocked by #249)

sse4a

  • _mm_extracti_si64(x, len, idx) // EXTRQ
  • _mm_extract_si64(__m128i __x, __m128i __y) // EXTRQ
  • _mm_inserti_si64(x, y, len, idx) // INSERTQ
  • _mm_insert_si64(__m128i __x, __m128i __y) // INSERTQ
  • _mm_stream_sd(double *__p, __m128d __a) // MOVNTSD
  • _mm_stream_ss(float *__p, __m128 __a) // MOVNTSS

AVX

avx

  • _mm256_add_pd
  • _mm256_add_ps
  • _mm256_addsub_pd
  • _mm256_addsub_ps
  • _mm256_and_pd
  • _mm256_and_ps
  • _mm256_andnot_pd
  • _mm256_andnot_ps
  • _mm256_blend_pd
  • _mm256_blend_ps
  • _mm256_blendv_pd
  • _mm256_blendv_ps
  • _mm256_div_pd
  • _mm256_div_ps
  • _mm256_dp_ps
  • _mm256_hadd_pd
  • _mm256_hadd_ps
  • _mm256_hsub_pd
  • _mm256_hsub_ps
  • _mm256_max_pd
  • _mm256_max_ps
  • _mm256_min_pd
  • _mm256_min_ps
  • _mm256_mul_pd
  • _mm256_mul_ps
  • _mm256_or_pd
  • _mm256_or_ps
  • _mm256_shuffle_pd
  • _mm256_shuffle_ps
  • _mm256_sub_pd
  • _mm256_sub_ps
  • _mm256_xor_pd
  • _mm256_xor_ps
  • _mm_cmp_pd
  • _mm256_cmp_pd
  • _mm_cmp_ps
  • _mm256_cmp_ps
  • _mm_cmp_sd
  • _mm_cmp_ss
  • _mm256_cvtepi32_pd
  • _mm256_cvtepi32_ps
  • _mm256_cvtpd_ps
  • _mm256_cvtps_epi32
  • _mm256_cvtps_pd
  • _mm256_cvttpd_epi32
  • _mm256_cvtpd_epi32
  • _mm256_cvttps_epi32
  • _mm256_extractf128_ps
  • _mm256_extractf128_pd
  • _mm256_extractf128_si256
  • _mm256_extract_epi8
  • _mm256_extract_epi16
  • _mm256_extract_epi32
  • _mm256_extract_epi64
  • _mm256_zeroall
  • _mm256_zeroupper
  • _mm256_permutevar_ps
  • _mm_permutevar_ps
  • _mm256_permute_ps
  • _mm_permute_ps
  • _mm256_permutevar_pd
  • _mm_permutevar_pd
  • _mm256_permute_pd
  • _mm_permute_pd
  • _mm256_permute2f128_ps
  • _mm256_permute2f128_pd
  • _mm256_permute2f128_si256
  • _mm256_broadcast_ss
  • _mm_broadcast_ss
  • _mm256_broadcast_sd
  • _mm256_broadcast_ps
  • _mm256_broadcast_pd
  • _mm256_insertf128_ps
  • _mm256_insertf128_pd
  • _mm256_insertf128_si256
  • _mm256_insert_epi8
  • _mm256_insert_epi16
  • _mm256_insert_epi32
  • _mm256_insert_epi64
  • _mm256_load_pd
  • _mm256_store_pd
  • _mm256_load_ps
  • _mm256_store_ps
  • _mm256_loadu_pd
  • _mm256_storeu_pd
  • _mm256_loadu_ps
  • _mm256_storeu_ps
  • _mm256_load_si256
  • _mm256_store_si256
  • _mm256_loadu_si256
  • _mm256_storeu_si256
  • _mm256_maskload_pd
  • _mm256_maskstore_pd
  • _mm_maskload_pd
  • _mm_maskstore_pd
  • _mm256_maskload_ps
  • _mm256_maskstore_ps
  • _mm_maskload_ps
  • _mm_maskstore_ps
  • _mm256_movehdup_ps
  • _mm256_moveldup_ps
  • _mm256_movedup_pd
  • _mm256_lddqu_si256
  • _mm256_stream_si256
  • _mm256_stream_pd
  • _mm256_stream_ps
  • _mm256_rcp_ps
  • _mm256_rsqrt_ps
  • _mm256_sqrt_pd
  • _mm256_sqrt_ps
  • _mm256_round_pd
  • _mm256_round_ps
  • _mm256_unpackhi_pd
  • _mm256_unpackhi_ps
  • _mm256_unpacklo_pd
  • _mm256_unpacklo_ps
  • _mm256_testz_si256
  • _mm256_testc_si256
  • _mm256_testnzc_si256
  • _mm256_testz_pd
  • _mm256_testc_pd
  • _mm256_testnzc_pd
  • _mm_testz_pd
  • _mm_testc_pd
  • _mm_testnzc_pd
  • _mm256_testz_ps
  • _mm256_testc_ps
  • _mm256_testnzc_ps
  • _mm_testz_ps
  • _mm_testc_ps
  • _mm_testnzc_ps
  • _mm256_movemask_pd
  • _mm256_movemask_ps
  • _mm256_setzero_pd
  • _mm256_setzero_ps
  • _mm256_setzero_si256
  • _mm256_set_pd
  • _mm256_set_ps
  • _mm256_set_epi8
  • _mm256_set_epi16
  • _mm256_set_epi32
  • _mm256_set_epi64x
  • _mm256_setr_pd
  • _mm256_setr_ps
  • _mm256_setr_epi8
  • _mm256_setr_epi16
  • _mm256_setr_epi32
  • _mm256_setr_epi64x
  • _mm256_set1_pd
  • _mm256_set1_ps
  • _mm256_set1_epi8
  • _mm256_set1_epi16
  • _mm256_set1_epi32
  • _mm256_set1_epi64x
  • _mm256_castpd_ps
  • _mm256_castps_pd
  • _mm256_castps_si256
  • _mm256_castpd_si256
  • _mm256_castsi256_ps
  • _mm256_castsi256_pd
  • _mm256_castps256_ps128
  • _mm256_castpd256_pd128
  • _mm256_castsi256_si128
  • _mm256_castps128_ps256
  • _mm256_castpd128_pd256
  • _mm256_castsi128_si256
  • _mm256_zextps128_ps256
  • _mm256_zextpd128_pd256
  • _mm256_zextsi128_si256
  • _mm256_floor_ps
  • _mm256_ceil_ps
  • _mm256_floor_pd
  • _mm256_ceil_pd
  • _mm256_undefined_ps
  • _mm256_undefined_pd
  • _mm256_undefined_si256
  • _mm256_set_m128
  • _mm256_set_m128d
  • _mm256_set_m128i
  • _mm256_setr_m128
  • _mm256_setr_m128d
  • _mm256_setr_m128i
  • _mm256_loadu2_m128
  • _mm256_loadu2_m128d
  • _mm256_loadu2_m128i
  • _mm256_storeu2_m128
  • _mm256_storeu2_m128d
  • _mm256_storeu2_m128i

AVX2

avx2

  • _mm256_abs_epi8
  • _mm256_abs_epi16
  • _mm256_abs_epi32
  • _mm256_add_epi8
  • _mm256_add_epi16
  • _mm256_add_epi32
  • _mm256_add_epi64
  • _mm256_adds_epi8
  • _mm256_adds_epi16
  • _mm256_adds_epu8
  • _mm256_adds_epu16
  • _mm256_alignr_epi8
  • _mm256_and_si256
  • _mm256_andnot_si256
  • _mm256_avg_epu8
  • _mm256_avg_epu16
  • _mm256_blend_epi16
  • _mm_blend_epi32
  • _mm256_blend_epi32
  • _mm256_blendv_epi8
  • _mm_broadcastb_epi8
  • _mm256_broadcastb_epi8
  • _mm_broadcastd_epi32
  • _mm256_broadcastd_epi32
  • _mm_broadcastq_epi64
  • _mm256_broadcastq_epi64
  • _mm_broadcastsd_pd
  • _mm256_broadcastsd_pd
  • _mm_broadcastsi128_si256
  • _mm256_broadcastsi128_si256
  • _mm_broadcastss_ps
  • _mm256_broadcastss_ps
  • _mm_broadcastw_epi16
  • _mm256_broadcastw_epi16
  • _mm256_cmpeq_epi8
  • _mm256_cmpeq_epi16
  • _mm256_cmpeq_epi32
  • _mm256_cmpeq_epi64
  • _mm256_cmpgt_epi8
  • _mm256_cmpgt_epi16
  • _mm256_cmpgt_epi32
  • _mm256_cmpgt_epi64
  • _mm256_cvtepi16_epi32
  • _mm256_cvtepi16_epi64
  • _mm256_cvtepi32_epi64
  • _mm256_cvtepi8_epi16
  • _mm256_cvtepi8_epi32
  • _mm256_cvtepi8_epi64
  • _mm256_cvtepu16_epi32
  • _mm256_cvtepu16_epi64
  • _mm256_cvtepu32_epi64
  • _mm256_cvtepu8_epi16
  • _mm256_cvtepu8_epi32
  • _mm256_cvtepu8_epi64
  • _mm256_extracti128_si256
  • _mm256_hadd_epi16
  • _mm256_hadd_epi32
  • _mm256_hadds_epi16
  • _mm256_hsub_epi16
  • _mm256_hsub_epi32
  • _mm256_hsubs_epi16
  • _mm_i32gather_pd
  • _mm256_i32gather_pd
  • _mm_i32gather_ps
  • _mm256_i32gather_ps
  • _mm_i32gather_epi32
  • _mm256_i32gather_epi32
  • _mm_i32gather_epi64
  • _mm256_i32gather_epi64
  • _mm_i64gather_pd
  • _mm256_i64gather_pd
  • _mm_i64gather_ps
  • _mm256_i64gather_ps
  • _mm_i64gather_epi32
  • _mm256_i64gather_epi32
  • _mm_i64gather_epi64
  • _mm256_i64gather_epi64
  • _mm256_inserti128_si256
  • _mm256_madd_epi16
  • _mm256_maddubs_epi16
  • _mm_mask_i32gather_pd
  • _mm256_mask_i32gather_pd
  • _mm_mask_i32gather_ps
  • _mm256_mask_i32gather_ps
  • _mm_mask_i32gather_epi32
  • _mm256_mask_i32gather_epi32
  • _mm_mask_i32gather_epi64
  • _mm256_mask_i32gather_epi64
  • _mm_mask_i64gather_pd
  • _mm256_mask_i64gather_pd
  • _mm_mask_i64gather_ps
  • _mm256_mask_i64gather_ps
  • _mm_mask_i64gather_epi32
  • _mm256_mask_i64gather_epi32
  • _mm_mask_i64gather_epi64
  • _mm256_mask_i64gather_epi64
  • _mm_maskload_epi32
  • _mm256_maskload_epi32
  • _mm_maskload_epi64
  • _mm256_maskload_epi64
  • _mm_maskstore_epi32
  • _mm256_maskstore_epi32
  • _mm_maskstore_epi64
  • _mm256_maskstore_epi64
  • _mm256_max_epi8
  • _mm256_max_epi16
  • _mm256_max_epi32
  • _mm256_max_epu8
  • _mm256_max_epu16
  • _mm256_max_epu32
  • _mm256_min_epi8
  • _mm256_min_epi16
  • _mm256_min_epi32
  • _mm256_min_epu8
  • _mm256_min_epu16
  • _mm256_min_epu32
  • _mm256_movemask_epi8
  • _mm256_mpsadbw_epu8
  • _mm256_mul_epi32
  • _mm256_mul_epu32
  • _mm256_mulhi_epi16
  • _mm256_mulhi_epu16
  • _mm256_mulhrs_epi16
  • _mm256_mullo_epi16
  • _mm256_mullo_epi32
  • _mm256_or_si256
  • _mm256_packs_epi16
  • _mm256_packs_epi32
  • _mm256_packus_epi16
  • _mm256_packus_epi32
  • _mm256_permute2x128_si256
  • _mm256_permute4x64_epi64
  • _mm256_permute4x64_pd
  • _mm256_permutevar8x32_epi32
  • _mm256_permutevar8x32_ps
  • _mm256_sad_epu8
  • _mm256_shuffle_epi32
  • _mm256_shuffle_epi8
  • _mm256_shufflehi_epi16
  • _mm256_shufflelo_epi16
  • _mm256_sign_epi8
  • _mm256_sign_epi16
  • _mm256_sign_epi32
  • _mm256_slli_si256
  • _mm256_bslli_epi128
  • _mm256_sll_epi16
  • _mm256_slli_epi16
  • _mm256_sll_epi32
  • _mm256_slli_epi32
  • _mm256_sll_epi64
  • _mm256_slli_epi64
  • _mm_sllv_epi32
  • _mm256_sllv_epi32
  • _mm_sllv_epi64
  • _mm256_sllv_epi64
  • _mm256_sra_epi16
  • _mm256_srai_epi16
  • _mm256_sra_epi32
  • _mm256_srai_epi32
  • _mm_srav_epi32
  • _mm256_srav_epi32
  • _mm256_srli_si256
  • _mm256_bsrli_epi128
  • _mm256_srl_epi16
  • _mm256_srli_epi16
  • _mm256_srl_epi32
  • _mm256_srli_epi32
  • _mm256_srl_epi64
  • _mm256_srli_epi64
  • _mm_srlv_epi32
  • _mm256_srlv_epi32
  • _mm_srlv_epi64
  • _mm256_srlv_epi64
  • _mm256_stream_load_si256
  • _mm256_sub_epi8
  • _mm256_sub_epi16
  • _mm256_sub_epi32
  • _mm256_sub_epi64
  • _mm256_subs_epi8
  • _mm256_subs_epi16
  • _mm256_subs_epu8
  • _mm256_subs_epu16
  • _mm256_xor_si256
  • _mm256_unpackhi_epi8
  • _mm256_unpackhi_epi16
  • _mm256_unpackhi_epi32
  • _mm256_unpackhi_epi64
  • _mm256_unpacklo_epi8
  • _mm256_unpacklo_epi16
  • _mm256_unpacklo_epi32
  • _mm256_unpacklo_epi64
