This is intended to be a tracking issue for implementing all vendor intrinsics in this repository. This issue is also intended to be a guide for documenting the process of adding new vendor intrinsics to this crate.
If you decide to implement a set of vendor intrinsics, please check the list below to make sure somebody else isn't already working on them. If it's not checked off or has a name next to it, feel free to comment that you'd like to implement it!
At a high level, each vendor intrinsic should correspond to a single exported Rust function with an appropriate `target_feature` attribute. Here's an example for `_mm_adds_epi16`:
```rust
/// Add packed 16-bit integers in `a` and `b` using saturation.
#[inline(always)]
#[target_feature = "+sse2"]
#[cfg_attr(test, assert_instr(paddsw))]
pub unsafe fn _mm_adds_epi16(a: i16x8, b: i16x8) -> i16x8 {
    paddsw(a, b)
}
```
Let's break this down:
- The `#[inline(always)]` attribute is added because vendor intrinsics generally should always be inlined: the intent of a vendor intrinsic is to correspond to a single particular CPU instruction, and a vendor intrinsic that is compiled into an actual function call could be quite disastrous for performance.
- The `#[target_feature = "+sse2"]` attribute instructs the compiler to generate code with the `sse2` target feature enabled, regardless of the target platform. That is, even if you're compiling for a platform that doesn't support `sse2`, the compiler will still generate code for `_mm_adds_epi16` as if `sse2` support existed. Without this attribute, the compiler might not generate the intended CPU instruction.
- The `#[cfg_attr(test, assert_instr(paddsw))]` attribute indicates that when we're testing the crate, we'll assert that the `paddsw` instruction is generated inside this function, ensuring that the SIMD intrinsic truly is an intrinsic for the instruction!
- The types of the vectors given to the intrinsic should generally match the types as provided in the vendor interface. We'll talk about this more below.
- The implementation of the vendor intrinsic is generally very simple. Remember, the goal is to compile a call to `_mm_adds_epi16` down to a single particular CPU instruction. As such, the implementation typically defers to a compiler intrinsic (in this case, `paddsw`) when one is available. More on this below as well.
- The intrinsic itself is `unsafe` due to the usage of `#[target_feature]`.
Once a function has been added, you should also add at least one test for basic functionality. Here's an example for `_mm_adds_epi16`:
```rust
#[simd_test = "sse2"]
unsafe fn _mm_adds_epi16() {
    let a = i16x8::new(0, 1, 2, 3, 4, 5, 6, 7);
    let b = i16x8::new(8, 9, 10, 11, 12, 13, 14, 15);
    let r = sse2::_mm_adds_epi16(a, b);
    let e = i16x8::new(8, 10, 12, 14, 16, 18, 20, 22);
    assert_eq!(r, e);
}
```
Note that `#[simd_test]` is the same as `#[test]`; it's just a custom macro that enables the target feature in the test and generates a wrapper to ensure the feature is available on the local CPU as well.
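For intuition, here is a rough sketch of the kind of wrapper `#[simd_test = "sse2"]` could generate. This is illustrative only, not the actual macro expansion, and the `cfg_feature_enabled!` runtime check is an assumption about how feature detection is exposed:

```rust
// Hypothetical expansion of `#[simd_test = "sse2"]` -- a sketch only.
#[test]
fn _mm_adds_epi16_wrapper() {
    // Skip the test when the local CPU lacks the feature (assumes a
    // runtime-detection macro like `cfg_feature_enabled!`).
    if !cfg_feature_enabled!("sse2") {
        return;
    }
    // The body is `unsafe` because it calls `#[target_feature]` functions.
    unsafe { _mm_adds_epi16() }
}
```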
Finally, once that's done, send a PR!
Determining the function signature of each vendor intrinsic can be tricky depending on the specificity of the vendor API. For SSE, Intel generally has three types in their interface:
- `__m128` consists of 4 single-precision (32-bit) floating point numbers.
- `__m128d` consists of 2 double-precision (64-bit) floating point numbers.
- `__m128i` consists of `N` integers, where `N` can be 16, 8, 4 or 2. The corresponding bit sizes for each value of `N` are 8-bit, 16-bit, 32-bit and 64-bit, respectively. Finally, there are signed and unsigned variants for each value of `N`, which means `__m128i` can be mapped to one of eight possible concrete integer types.
In terms of the `stdsimd` crate, the first two floating point types have a straightforward translation: `__m128` maps to `f32x4` while `__m128d` maps to `f64x2`.
Unfortunately, since `__m128i` can correspond to any number of integer types, we need to actually inspect the vendor intrinsic to determine the type. Sometimes this is hinted at in the name of the intrinsic itself. Continuing with our previous example, `_mm_adds_epi16`, we can infer that it is a signed operation on an integer vector consisting of eight 16-bit integers: `epi` means signed (whereas `epu` means unsigned) and `16` means 16-bit.
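The same decoding drives the Rust signature. As an illustration, here are the signatures the convention implies (bodies are placeholders; `_mm_adds_epu8` is a parallel example assumed to follow the same pattern):

```rust
// `epi16` => signed 16-bit lanes, eight of which fill 128 bits => i16x8
pub unsafe fn _mm_adds_epi16(a: i16x8, b: i16x8) -> i16x8 { unimplemented!() }

// `epu8` => unsigned 8-bit lanes, sixteen of which fill 128 bits => u8x16
pub unsafe fn _mm_adds_epu8(a: u8x16, b: u8x16) -> u8x16 { unimplemented!() }
```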
Fortunately, Clang (and LLVM) have determined the specific concrete integer types for most of the vendor intrinsics already, but they aren't available in any easily accessible way (as far as this author knows). For example, you can see the types for `_mm_adds_epi16` in Clang's `emmintrin.h` header file.
An implementation of an intrinsic (so far) generally has one of three shapes:
- The vendor intrinsic does not have any corresponding compiler intrinsic, so you must write the implementation in such a way that the compiler will recognize it and produce the desired codegen. For example, the `_mm_add_epi16` intrinsic (note the missing `s` in `add`) is implemented via `a + b`, which compiles down to LLVM's cross-platform SIMD vector API (see the sketch after this list).
- The vendor intrinsic does have a corresponding compiler intrinsic, so you must write an `extern` block to bring that intrinsic into scope and then call it. The example above (`_mm_adds_epi16`) uses this approach.
- The vendor intrinsic has a parameter that must be a constant value when given to the CPU instruction, where that constant often controls the operation of the intrinsic. This means the implementation of the vendor intrinsic must guarantee that a particular parameter be a constant. This is tricky because Rust doesn't (yet) have a stable way of doing this, so we have to do it ourselves. How you do it can vary, but one particularly gnarly example is `_mm_cmpestri` (make sure to look at the `constify_imm8!` macro).
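To make the first two shapes concrete, here's a minimal sketch of both in one place. The `link_name` below is the LLVM name for the saturating 16-bit add; treat the exact layout (and the `assert_instr` plumbing) as an approximation of the crate's source rather than a quote from it:

```rust
// Shape 2: bind the compiler intrinsic via an `extern` block, then defer to it.
#[allow(improper_ctypes)]
extern "C" {
    #[link_name = "llvm.x86.sse2.padds.w"]
    fn paddsw(a: i16x8, b: i16x8) -> i16x8;
}

/// Add packed 16-bit integers in `a` and `b` using saturation.
#[inline(always)]
#[target_feature = "+sse2"]
#[cfg_attr(test, assert_instr(paddsw))]
pub unsafe fn _mm_adds_epi16(a: i16x8, b: i16x8) -> i16x8 {
    paddsw(a, b)
}

// Shape 1: no compiler intrinsic needed; plain `+` lowers to LLVM's portable
// vector addition, for which codegen then selects `paddw`.
/// Add packed 16-bit integers in `a` and `b` (wrapping on overflow).
#[inline(always)]
#[target_feature = "+sse2"]
#[cfg_attr(test, assert_instr(paddw))]
pub unsafe fn _mm_add_epi16(a: i16x8, b: i16x8) -> i16x8 {
    a + b
}
```

For the third shape, the usual trick is to branch on the runtime value and pass a literal in every arm, so that each call site hands the compiler a true constant. A heavily shortened sketch of the `constify_imm8!` idea (the real macro enumerates all 256 immediate values):

```rust
// Shortened to a 2-bit immediate for illustration only.
macro_rules! constify_imm2 {
    ($imm2:expr, $expand:ident) => {
        match $imm2 & 0b11 {
            0 => $expand!(0),
            1 => $expand!(1),
            2 => $expand!(2),
            _ => $expand!(3),
        }
    };
}
```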
The compiler intrinsics available to us through LLVM can be found here: https://gist.github.com/anonymous/a25d3e3b4c14ee68d63bd1dcb0e1223c
The Intel vendor intrinsic API can be found here: https://gist.github.com/anonymous/25d752fda8521d29699a826b980218fc
The Clang header files for vendor intrinsics can also be incredibly useful. When in doubt, Do What Clang Does: https://github.com/llvm-mirror/clang/tree/master/lib/Headers
## MMX

- `_mm_add_pi16 (a, b)` // `paddw`
- `_mm_add_pi32 (a, b)` // `paddd`
- `_mm_add_pi8 (a, b)` // `paddb`
- `_mm_adds_pi16 (a, b)` // `paddsw`
- `_mm_adds_pi8 (a, b)` // `paddsb`
- `_mm_adds_pu16 (a, b)` // `paddusw`
- `_mm_adds_pu8 (a, b)` // `paddusb`
- `_mm_and_si64 (a, b)` // `pand`
- `_mm_andnot_si64 (a, b)` // `pandn`
- `_mm_cmpeq_pi16 (a, b)` // `pcmpeqw`
- `_mm_cmpeq_pi32 (a, b)` // `pcmpeqd`
- `_mm_cmpeq_pi8 (a, b)` // `pcmpeqb`
- `_mm_cmpgt_pi16 (a, b)` // `pcmpgtw`
- `_mm_cmpgt_pi32 (a, b)` // `pcmpgtd`
- `_mm_cmpgt_pi8 (a, b)` // `pcmpgtb`
- `__int64 _mm_cvtm64_si64 (a)` // `movq`
- `_mm_cvtsi32_si64 (int a)` // `movd`
- `_mm_cvtsi64_m64 (__int64 a)` // `movq`
- `int _mm_cvtsi64_si32 (a)` // `movd`
- `void _m_empty (void)` // `emms`
- `void _mm_empty (void)` // `emms`
- `_m_from_int (int a)` // `movd`
- `_m_from_int64 (__int64 a)` // `movq`
- `_mm_madd_pi16 (a, b)` // `pmaddwd`
- `_mm_mulhi_pi16 (a, b)` // `pmulhw`
- `_mm_mullo_pi16 (a, b)` // `pmullw`
- `_mm_or_si64 (a, b)` // `por`
- `_mm_packs_pi16 (a, b)` // `packsswb`
- `_mm_packs_pi32 (a, b)` // `packssdw`
- `_mm_packs_pu16 (a, b)` // `packuswb`
- `_m_packssdw (a, b)` // `packssdw`
- `_m_packsswb (a, b)` // `packsswb`
- `_m_packuswb (a, b)` // `packuswb`
- `_m_paddb (a, b)` // `paddb`
- `_m_paddd (a, b)` // `paddd`
- `_m_paddsb (a, b)` // `paddsb`
- `_m_paddsw (a, b)` // `paddsw`
- `_m_paddusb (a, b)` // `paddusb`
- `_m_paddusw (a, b)` // `paddusw`
- `_m_paddw (a, b)` // `paddw`
- `_m_pand (a, b)` // `pand`
- `_m_pandn (a, b)` // `pandn`
- `_m_pcmpeqb (a, b)` // `pcmpeqb`
- `_m_pcmpeqd (a, b)` // `pcmpeqd`
- `_m_pcmpeqw (a, b)` // `pcmpeqw`
- `_m_pcmpgtb (a, b)` // `pcmpgtb`
- `_m_pcmpgtd (a, b)` // `pcmpgtd`
- `_m_pcmpgtw (a, b)` // `pcmpgtw`
- `_m_pmaddwd (a, b)` // `pmaddwd`
- `_m_pmulhw (a, b)` // `pmulhw`
- `_m_pmullw (a, b)` // `pmullw`
- `_m_por (a, b)` // `por`
- `_m_pslld (a, count)` // `pslld`
- `_m_pslldi (a, int imm8)` // `pslld`
- `_m_psllq (a, count)` // `psllq`
- `_m_psllqi (a, int imm8)` // `psllq`
- `_m_psllw (a, count)` // `psllw`
- `_m_psllwi (a, int imm8)` // `psllw`
- `_m_psrad (a, count)` // `psrad`
- `_m_psradi (a, int imm8)` // `psrad`
- `_m_psraw (a, count)` // `psraw`
- `_m_psrawi (a, int imm8)` // `psraw`
- `_m_psrld (a, count)` // `psrld`
- `_m_psrldi (a, int imm8)` // `psrld`
- `_m_psrlq (a, count)` // `psrlq`
- `_m_psrlqi (a, int imm8)` // `psrlq`
- `_m_psrlw (a, count)` // `psrlw`
- `_m_psrlwi (a, int imm8)` // `psrlw`
- `_m_psubb (a, b)` // `psubb`
- `_m_psubd (a, b)` // `psubd`
- `_m_psubsb (a, b)` // `psubsb`
- `_m_psubsw (a, b)` // `psubsw`
- `_m_psubusb (a, b)` // `psubusb`
- `_m_psubusw (a, b)` // `psubusw`
- `_m_psubw (a, b)` // `psubw`
- `_m_punpckhbw (a, b)` // `punpckhbw`
- `_m_punpckhdq (a, b)` // `punpckhdq`
- `_m_punpckhwd (a, b)` // `punpckhwd`
- `_m_punpcklbw (a, b)` // `punpcklbw`
- `_m_punpckldq (a, b)` // `punpckldq`
- `_m_punpcklwd (a, b)` // `punpcklwd`
- `_m_pxor (a, b)` // `pxor`
- `_mm_set_pi16 (short e3, short e2, short e1, short e0)` // ...
- `_mm_set_pi32 (int e1, int e0)` // ...
- `_mm_set_pi8 (char e7, char e6, char e5, char e4, char e3, char e2, char e1, char e0)` // ...
- `_mm_set1_pi16 (short e0)` // ...
- `_mm_set1_pi32 (int a)` // ...
- `_mm_set1_pi8 (char a)` // ...
- `_mm_setr_pi16 (short e3, short e2, short e1, short e0)` // ...
- `_mm_setr_pi32 (int e1, int e0)` // ...
- `_mm_setr_pi8 (char e7, char e6, char e5, char e4, char e3, char e2, char e1, char e0)` // ...
- `_mm_setzero_si64 (void)` // `pxor`
- `_mm_sll_pi16 (a, count)` // `psllw`
- `_mm_sll_pi32 (a, count)` // `pslld`
- `_mm_sll_si64 (a, count)` // `psllq`
- `_mm_slli_pi16 (a, int imm8)` // `psllw`
- `_mm_slli_pi32 (a, int imm8)` // `pslld`
- `_mm_slli_si64 (a, int imm8)` // `psllq`
- `_mm_sra_pi16 (a, count)` // `psraw`
- `_mm_sra_pi32 (a, count)` // `psrad`
- `_mm_srai_pi16 (a, int imm8)` // `psraw`
- `_mm_srai_pi32 (a, int imm8)` // `psrad`
- `_mm_srl_pi16 (a, count)` // `psrlw`
- `_mm_srl_pi32 (a, count)` // `psrld`
- `_mm_srl_si64 (a, count)` // `psrlq`
- `_mm_srli_pi16 (a, int imm8)` // `psrlw`
- `_mm_srli_pi32 (a, int imm8)` // `psrld`
- `_mm_srli_si64 (a, int imm8)` // `psrlq`
- `_mm_sub_pi16 (a, b)` // `psubw`
- `_mm_sub_pi32 (a, b)` // `psubd`
- `_mm_sub_pi8 (a, b)` // `psubb`
- `_mm_subs_pi16 (a, b)` // `psubsw`
- `_mm_subs_pi8 (a, b)` // `psubsb`
- `_mm_subs_pu16 (a, b)` // `psubusw`
- `_mm_subs_pu8 (a, b)` // `psubusb`
- `int _m_to_int (a)` // `movd`
- `__int64 _m_to_int64 (a)` // `movq`
- `_mm_unpackhi_pi16 (a, b)` // `punpckhwd`
- `_mm_unpackhi_pi32 (a, b)` // `punpckhdq`
- `_mm_unpackhi_pi8 (a, b)` // `punpckhbw`
- `_mm_unpacklo_pi16 (a, b)` // `punpcklwd`
- `_mm_unpacklo_pi32 (a, b)` // `punpckldq`
- `_mm_unpacklo_pi8 (a, b)` // `punpcklbw`
- `_mm_xor_si64 (a, b)` // `pxor`
## SSE (complete)

- `_MM_TRANSPOSE4_PS`
- `_mm_getcsr`
- `_mm_setcsr`
- `_MM_GET_EXCEPTION_STATE`
- `_MM_SET_EXCEPTION_STATE`
- `_MM_GET_EXCEPTION_MASK`
- `_MM_SET_EXCEPTION_MASK`
- `_MM_GET_ROUNDING_MODE`
- `_MM_SET_ROUNDING_MODE`
- `_MM_GET_FLUSH_ZERO_MODE`
- `_MM_SET_FLUSH_ZERO_MODE`
- `_mm_prefetch`
- `_mm_sfence`
- `_mm_max_pi16`
- `_m_pmaxsw`
- `_mm_max_pu8`
- `_m_pmaxub`
- `_mm_min_pi16`
- `_m_pminsw`
- `_mm_min_pu8`
- `_m_pminub`
- `_mm_mulhi_pu16`
- `_m_pmulhuw`
- `_mm_avg_pu8`
- `_m_pavgb`
- `_mm_avg_pu16`
- `_m_pavgw`
- `_mm_sad_pu8`
- `_m_psadbw`
- `_mm_cvtsi32_ss`
- `_mm_cvt_si2ss`
- `_mm_cvtsi64_ss`
- `_mm_cvtpi32_ps`
- `_mm_cvt_pi2ps`
- `_mm_cvtpi16_ps`
- `_mm_cvtpu16_ps`
- `_mm_cvtpi8_ps`
- `_mm_cvtpu8_ps`
- `_mm_cvtpi32x2_ps`
- `_mm_stream_pi`
- `_mm_maskmove_si64`
- `_m_maskmovq`
- `_mm_extract_pi16`
- `_m_pextrw`
- `_mm_insert_pi16`
- `_m_pinsrw`
- `_mm_movemask_pi8`
- `_m_pmovmskb`
- `_mm_shuffle_pi16`
- `_m_pshufw`
- `_mm_add_ss`
- `_mm_add_ps`
- `_mm_sub_ss`
- `_mm_sub_ps`
- `_mm_mul_ss`
- `_mm_mul_ps`
- `_mm_div_ss`
- `_mm_div_ps`
- `_mm_sqrt_ss`
- `_mm_sqrt_ps`
- `_mm_rcp_ss`
- `_mm_rcp_ps`
- `_mm_rsqrt_ss`
- `_mm_rsqrt_ps`
- `_mm_min_ss`
- `_mm_min_ps`
- `_mm_max_ss`
- `_mm_max_ps`
- `_mm_and_ps`
- `_mm_andnot_ps`
- `_mm_or_ps`
- `_mm_xor_ps`
- `_mm_cmpeq_ss`
- `_mm_cmpeq_ps`
- `_mm_cmplt_ss`
- `_mm_cmplt_ps`
- `_mm_cmple_ss`
- `_mm_cmple_ps`
- `_mm_cmpgt_ss`
- `_mm_cmpgt_ps`
- `_mm_cmpge_ss`
- `_mm_cmpge_ps`
- `_mm_cmpneq_ss`
- `_mm_cmpneq_ps`
- `_mm_cmpnlt_ss`
- `_mm_cmpnlt_ps`
- `_mm_cmpnle_ss`
- `_mm_cmpnle_ps`
- `_mm_cmpngt_ss`
- `_mm_cmpngt_ps`
- `_mm_cmpnge_ss`
- `_mm_cmpnge_ps`
- `_mm_cmpord_ss`
- `_mm_cmpord_ps`
- `_mm_cmpunord_ss`
- `_mm_cmpunord_ps`
- `_mm_comieq_ss`
- `_mm_comilt_ss`
- `_mm_comile_ss`
- `_mm_comigt_ss`
- `_mm_comige_ss`
- `_mm_comineq_ss`
- `_mm_ucomieq_ss`
- `_mm_ucomilt_ss`
- `_mm_ucomile_ss`
- `_mm_ucomigt_ss`
- `_mm_ucomige_ss`
- `_mm_ucomineq_ss`
- `_mm_cvtss_si32`
- `_mm_cvt_ss2si`
- `_mm_cvtss_si64`
- `_mm_cvtss_f32`
- `_mm_cvtps_pi32`
- `_mm_cvt_ps2pi`
- `_mm_cvttss_si32`
- `_mm_cvtt_ss2si`
- `_mm_cvttss_si64`
- `_mm_cvttps_pi32`
- `_mm_cvtt_ps2pi`
- `_mm_cvtps_pi16`
- `_mm_cvtps_pi8`
- `_mm_set_ss`
- `_mm_set1_ps`
- `_mm_set_ps1`
- `_mm_set_ps`
- `_mm_setr_ps`
- `_mm_setzero_ps`
- `_mm_loadh_pi`
- `_mm_loadl_pi`
- `_mm_load_ss`
- `_mm_load1_ps`
- `_mm_load_ps1`
- `_mm_load_ps`
- `_mm_loadu_ps`
- `_mm_loadr_ps`
- `_mm_stream_ps`
- `_mm_storeh_pi`
- `_mm_storel_pi`
- `_mm_store_ss`
- `_mm_store1_ps`
- `_mm_store_ps1`
- `_mm_store_ps`
- `_mm_storeu_ps`
- `_mm_storer_ps`
- `_mm_move_ss`
- `_mm_shuffle_ps`
- `_mm_unpackhi_ps`
- `_mm_unpacklo_ps`
- `_mm_movehl_ps`
- `_mm_movelh_ps`
- `_mm_movemask_ps`
- `_mm_undefined_ps`
## SSE2

- `_mm_pause`
- `_mm_clflush`
- `_mm_lfence`
- `_mm_mfence`
- `_mm_add_epi8`
- `_mm_add_epi16`
- `_mm_add_epi32`
- `_mm_add_si64`
- `_mm_add_epi64`
- `_mm_adds_epi8`
- `_mm_adds_epi16`
- `_mm_adds_epu8`
- `_mm_adds_epu16`
- `_mm_avg_epu8`
- `_mm_avg_epu16`
- `_mm_madd_epi16`
- `_mm_max_epi16`
- `_mm_max_epu8`
- `_mm_min_epi16`
- `_mm_min_epu8`
- `_mm_mulhi_epi16`
- `_mm_mulhi_epu16`
- `_mm_mullo_epi16`
- `_mm_mul_su32`
- `_mm_mul_epu32`
- `_mm_sad_epu8`
- `_mm_sub_epi8`
- `_mm_sub_epi16`
- `_mm_sub_epi32`
- `_mm_sub_si64`
- `_mm_sub_epi64`
- `_mm_subs_epi8`
- `_mm_subs_epi16`
- `_mm_subs_epu8`
- `_mm_subs_epu16`
- `_mm_slli_si128`
- `_mm_bslli_si128`
- `_mm_bsrli_si128`
- `_mm_slli_epi16`
- `_mm_sll_epi16`
- `_mm_slli_epi32`
- `_mm_sll_epi32`
- `_mm_slli_epi64`
- `_mm_sll_epi64`
- `_mm_srai_epi16`
- `_mm_sra_epi16`
- `_mm_srai_epi32`
- `_mm_sra_epi32`
- `_mm_srli_si128`
- `_mm_srli_epi16`
- `_mm_srl_epi16`
- `_mm_srli_epi32`
- `_mm_srl_epi32`
- `_mm_srli_epi64`
- `_mm_srl_epi64`
- `_mm_and_si128`
- `_mm_andnot_si128`
- `_mm_or_si128`
- `_mm_xor_si128`
- `_mm_cmpeq_epi8`
- `_mm_cmpeq_epi16`
- `_mm_cmpeq_epi32`
- `_mm_cmpgt_epi8`
- `_mm_cmpgt_epi16`
- `_mm_cmpgt_epi32`
- `_mm_cmplt_epi8`
- `_mm_cmplt_epi16`
- `_mm_cmplt_epi32`
- `_mm_cvtepi32_pd`
- `_mm_cvtsi32_sd`
- `_mm_cvtsi64_sd`
- `_mm_cvtsi64x_sd`
- `_mm_cvtepi32_ps`
- `_mm_cvtpi32_pd`
- `_mm_cvtsi32_si128`
- `_mm_cvtsi64_si128`
- `_mm_cvtsi64x_si128`
- `_mm_cvtsi128_si32`
- `_mm_cvtsi128_si64`
- `_mm_cvtsi128_si64x`
- `_mm_set_epi64`
- `_mm_set_epi64x`
- `_mm_set_epi32`
- `_mm_set_epi16`
- `_mm_set_epi8`
- `_mm_set1_epi64`
- `_mm_set1_epi64x`
- `_mm_set1_epi32`
- `_mm_set1_epi16`
- `_mm_set1_epi8`
- `_mm_setr_epi64`
- `_mm_setr_epi32`
- `_mm_setr_epi16`
- `_mm_setr_epi8`
- `_mm_setzero_si128`
- `_mm_loadl_epi64`
- `_mm_load_si128`
- `_mm_loadu_si128`
- `_mm_maskmoveu_si128`
- `_mm_store_si128`
- `_mm_storeu_si128`
- `_mm_storel_epi64`
- `_mm_stream_si128`
- `_mm_stream_si32`
- `_mm_stream_si64`
- `_mm_movepi64_pi64`
- `_mm_movpi64_epi64`
- `_mm_move_epi64`
- `_mm_packs_epi16`
- `_mm_packs_epi32`
- `_mm_packus_epi16`
- `_mm_extract_epi16`
- `_mm_insert_epi16`
- `_mm_movemask_epi8`
- `_mm_shuffle_epi32`
- `_mm_shufflehi_epi16`
- `_mm_shufflelo_epi16`
- `_mm_unpackhi_epi8`
- `_mm_unpackhi_epi16`
- `_mm_unpackhi_epi32`
- `_mm_unpackhi_epi64`
- `_mm_unpacklo_epi8`
- `_mm_unpacklo_epi16`
- `_mm_unpacklo_epi32`
- `_mm_unpacklo_epi64`
- `_mm_add_sd`
- `_mm_add_pd`
- `_mm_div_sd`
- `_mm_div_pd`
- `_mm_max_sd`
- `_mm_max_pd`
- `_mm_min_sd`
- `_mm_min_pd`
- `_mm_mul_sd`
- `_mm_mul_pd`
- `_mm_sqrt_sd`
- `_mm_sqrt_pd`
- `_mm_sub_sd`
- `_mm_sub_pd`
- `_mm_and_pd`
- `_mm_andnot_pd`
- `_mm_or_pd`
- `_mm_xor_pd`
- `_mm_cmpeq_sd`
- `_mm_cmplt_sd`
- `_mm_cmple_sd`
- `_mm_cmpgt_sd`
- `_mm_cmpge_sd`
- `_mm_cmpord_sd`
- `_mm_cmpunord_sd`
- `_mm_cmpneq_sd`
- `_mm_cmpnlt_sd`
- `_mm_cmpnle_sd`
- `_mm_cmpngt_sd`
- `_mm_cmpnge_sd`
- `_mm_cmpeq_pd`
- `_mm_cmplt_pd`
- `_mm_cmple_pd`
- `_mm_cmpgt_pd`
- `_mm_cmpge_pd`
- `_mm_cmpord_pd`
- `_mm_cmpunord_pd`
- `_mm_cmpneq_pd`
- `_mm_cmpnlt_pd`
- `_mm_cmpnle_pd`
- `_mm_cmpngt_pd`
- `_mm_cmpnge_pd`
- `_mm_comieq_sd`
- `_mm_comilt_sd`
- `_mm_comile_sd`
- `_mm_comigt_sd`
- `_mm_comige_sd`
- `_mm_comineq_sd`
- `_mm_ucomieq_sd`
- `_mm_ucomilt_sd`
- `_mm_ucomile_sd`
- `_mm_ucomigt_sd`
- `_mm_ucomige_sd`
- `_mm_ucomineq_sd`
- `_mm_cvtpd_ps`
- `_mm_cvtps_pd`
- `_mm_cvtpd_epi32`
- `_mm_cvtsd_si32`
- `_mm_cvtsd_si64`
- `_mm_cvtsd_si64x`
- `_mm_cvtsd_ss`
- `_mm_cvtsd_f64`
- `_mm_cvtss_sd`
- `_mm_cvttpd_epi32`
- `_mm_cvttsd_si32`
- `_mm_cvttsd_si64`
- `_mm_cvttsd_si64x`
- `_mm_cvtps_epi32`
- `_mm_cvttps_epi32`
- `_mm_cvtpd_pi32`
- `_mm_cvttpd_pi32`
- `_mm_set_sd`
- `_mm_set1_pd`
- `_mm_set_pd1`
- `_mm_set_pd`
- `_mm_setr_pd`
- `_mm_setzero_pd`
- `_mm_load_pd`
- `_mm_load1_pd`
- `_mm_load_pd1`
- `_mm_loadr_pd`
- `_mm_loadu_pd`
- `_mm_load_sd`
- `_mm_loadh_pd`
- `_mm_loadl_pd`
- `_mm_stream_pd`
- `_mm_store_sd`
- `_mm_store1_pd`
- `_mm_store_pd1`
- `_mm_store_pd`
- `_mm_storeu_pd`
- `_mm_storer_pd`
- `_mm_storeh_pd`
- `_mm_storel_pd`
- `_mm_unpackhi_pd`
- `_mm_unpacklo_pd`
- `_mm_movemask_pd`
- `_mm_shuffle_pd`
- `_mm_move_sd`
- `_mm_castpd_ps`
- `_mm_castpd_si128`
- `_mm_castps_pd`
- `_mm_castps_si128`
- `_mm_castsi128_pd`
- `_mm_castsi128_ps`
- `_mm_undefined_pd`
- `_mm_undefined_si128`
## SSE3 (complete)

- `_mm_addsub_ps`
- `_mm_addsub_pd`
- `_mm_hadd_pd`
- `_mm_hadd_ps`
- `_mm_hsub_pd`
- `_mm_hsub_ps`
- `_mm_lddqu_si128`
- `_mm_movedup_pd`
- `_mm_loaddup_pd`
- `_mm_movehdup_ps`
- `_mm_moveldup_ps`
## SSSE3 (complete)

- `_mm_abs_pi8`
- `_mm_abs_epi8`
- `_mm_abs_pi16`
- `_mm_abs_epi16`
- `_mm_abs_pi32`
- `_mm_abs_epi32`
- `_mm_shuffle_epi8`
- `_mm_shuffle_pi8`
- `_mm_alignr_epi8`
- `_mm_alignr_pi8`
- `_mm_hadd_epi16`
- `_mm_hadds_epi16`
- `_mm_hadd_epi32`
- `_mm_hadd_pi16`
- `_mm_hadd_pi32`
- `_mm_hadds_pi16`
- `_mm_hsub_epi16`
- `_mm_hsubs_epi16`
- `_mm_hsub_epi32`
- `_mm_hsub_pi16`
- `_mm_hsub_pi32`
- `_mm_hsubs_pi16`
- `_mm_maddubs_epi16`
- `_mm_maddubs_pi16`
- `_mm_mulhrs_epi16`
- `_mm_mulhrs_pi16`
- `_mm_sign_epi8`
- `_mm_sign_epi16`
- `_mm_sign_epi32`
- `_mm_sign_pi8`
- `_mm_sign_pi16`
- `_mm_sign_pi32`
## SSE4.1

- `_mm_blend_pd`
- `_mm_blend_ps`
- `_mm_blendv_pd`
- `_mm_blendv_ps`
- `_mm_blendv_epi8`
- `_mm_blend_epi16`
- `_mm_dp_pd`
- `_mm_dp_ps`
- `_mm_extract_ps`
- `_mm_extract_epi8`
- `_mm_extract_epi32`
- `_mm_extract_epi64`
- `_mm_insert_ps`
- `_mm_insert_epi8`
- `_mm_insert_epi32`
- `_mm_insert_epi64`
- `_mm_max_epi8`
- `_mm_max_epi32`
- `_mm_max_epu32`
- `_mm_max_epu16`
- `_mm_min_epi8`
- `_mm_min_epi32`
- `_mm_min_epu32`
- `_mm_min_epu16`
- `_mm_packus_epi32`
- `_mm_cmpeq_epi64`
- `_mm_cvtepi8_epi16`
- `_mm_cvtepi8_epi32`
- `_mm_cvtepi8_epi64`
- `_mm_cvtepi16_epi32`
- `_mm_cvtepi16_epi64`
- `_mm_cvtepi32_epi64`
- `_mm_cvtepu8_epi16`
- `_mm_cvtepu8_epi32`
- `_mm_cvtepu8_epi64`
- `_mm_cvtepu16_epi32`
- `_mm_cvtepu16_epi64`
- `_mm_cvtepu32_epi64`
- `_mm_mul_epi32`
- `_mm_mullo_epi32`
- `_mm_testz_si128`
- `_mm_testc_si128`
- `_mm_testnzc_si128`
- `_mm_test_all_zeros`
- `_mm_test_mix_ones_zeros`
- `_mm_test_all_ones`
- `_mm_round_pd`
- `_mm_floor_pd`
- `_mm_ceil_pd`
- `_mm_round_ps`
- `_mm_floor_ps`
- `_mm_ceil_ps`
- `_mm_round_sd`
- `_mm_floor_sd`
- `_mm_ceil_sd`
- `_mm_round_ss`
- `_mm_floor_ss`
- `_mm_ceil_ss`
- `_mm_minpos_epu16`
- `_mm_mpsadbw_epu8`
- `_mm_stream_load_si128`
## SSE4.2 (complete)

- `_mm_cmpistrm`
- `_mm_cmpistri`
- `_mm_cmpistrz`
- `_mm_cmpistrc`
- `_mm_cmpistrs`
- `_mm_cmpistro`
- `_mm_cmpistra`
- `_mm_cmpestrm`
- `_mm_cmpestri`
- `_mm_cmpestrz`
- `_mm_cmpestrc`
- `_mm_cmpestrs`
- `_mm_cmpestro`
- `_mm_cmpestra`
- `_mm_cmpgt_epi64`
- `_mm_crc32_u8`
- `_mm_crc32_u16`
- `_mm_crc32_u32`
- `_mm_crc32_u64`
## SSE4a (blocked by #249)

- `_mm_extracti_si64(x, len, idx)` // `EXTRQ`
- `_mm_extract_si64(__m128i __x, __m128i __y)` // `EXTRQ`
- `_mm_inserti_si64(x, y, len, idx)` // `INSERTQ`
- `_mm_insert_si64(__m128i __x, __m128i __y)` // `INSERTQ`
- `_mm_stream_sd(double *__p, __m128d __a)` // `MOVNTSD`
- `_mm_stream_ss(float *__p, __m128 __a)` // `MOVNTSS`
## AVX

- `_mm256_add_pd`
- `_mm256_add_ps`
- `_mm256_addsub_pd`
- `_mm256_addsub_ps`
- `_mm256_and_pd`
- `_mm256_and_ps`
- `_mm256_andnot_pd`
- `_mm256_andnot_ps`
- `_mm256_blend_pd`
- `_mm256_blend_ps`
- `_mm256_blendv_pd`
- `_mm256_blendv_ps`
- `_mm256_div_pd`
- `_mm256_div_ps`
- `_mm256_dp_ps`
- `_mm256_hadd_pd`
- `_mm256_hadd_ps`
- `_mm256_hsub_pd`
- `_mm256_hsub_ps`
- `_mm256_max_pd`
- `_mm256_max_ps`
- `_mm256_min_pd`
- `_mm256_min_ps`
- `_mm256_mul_pd`
- `_mm256_mul_ps`
- `_mm256_or_pd`
- `_mm256_or_ps`
- `_mm256_shuffle_pd`
- `_mm256_shuffle_ps`
- `_mm256_sub_pd`
- `_mm256_sub_ps`
- `_mm256_xor_pd`
- `_mm256_xor_ps`
- `_mm_cmp_pd`
- `_mm256_cmp_pd`
- `_mm_cmp_ps`
- `_mm256_cmp_ps`
- `_mm_cmp_sd`
- `_mm_cmp_ss`
- `_mm256_cvtepi32_pd`
- `_mm256_cvtepi32_ps`
- `_mm256_cvtpd_ps`
- `_mm256_cvtps_epi32`
- `_mm256_cvtps_pd`
- `_mm256_cvttpd_epi32`
- `_mm256_cvtpd_epi32`
- `_mm256_cvttps_epi32`
- `_mm256_extractf128_ps`
- `_mm256_extractf128_pd`
- `_mm256_extractf128_si256`
- `_mm256_extract_epi8`
- `_mm256_extract_epi16`
- `_mm256_extract_epi32`
- `_mm256_extract_epi64`
- `_mm256_zeroall`
- `_mm256_zeroupper`
- `_mm256_permutevar_ps`
- `_mm_permutevar_ps`
- `_mm256_permute_ps`
- `_mm_permute_ps`
- `_mm256_permutevar_pd`
- `_mm_permutevar_pd`
- `_mm256_permute_pd`
- `_mm_permute_pd`
- `_mm256_permute2f128_ps`
- `_mm256_permute2f128_pd`
- `_mm256_permute2f128_si256`
- `_mm256_broadcast_ss`
- `_mm_broadcast_ss`
- `_mm256_broadcast_sd`
- `_mm256_broadcast_ps`
- `_mm256_broadcast_pd`
- `_mm256_insertf128_ps`
- `_mm256_insertf128_pd`
- `_mm256_insertf128_si256`
- `_mm256_insert_epi8`
- `_mm256_insert_epi16`
- `_mm256_insert_epi32`
- `_mm256_insert_epi64`
- `_mm256_load_pd`
- `_mm256_store_pd`
- `_mm256_load_ps`
- `_mm256_store_ps`
- `_mm256_loadu_pd`
- `_mm256_storeu_pd`
- `_mm256_loadu_ps`
- `_mm256_storeu_ps`
- `_mm256_load_si256`
- `_mm256_store_si256`
- `_mm256_loadu_si256`
- `_mm256_storeu_si256`
- `_mm256_maskload_pd`
- `_mm256_maskstore_pd`
- `_mm_maskload_pd`
- `_mm_maskstore_pd`
- `_mm256_maskload_ps`
- `_mm256_maskstore_ps`
- `_mm_maskload_ps`
- `_mm_maskstore_ps`
- `_mm256_movehdup_ps`
- `_mm256_moveldup_ps`
- `_mm256_movedup_pd`
- `_mm256_lddqu_si256`
- `_mm256_stream_si256`
- `_mm256_stream_pd`
- `_mm256_stream_ps`
- `_mm256_rcp_ps`
- `_mm256_rsqrt_ps`
- `_mm256_sqrt_pd`
- `_mm256_sqrt_ps`
- `_mm256_round_pd`
- `_mm256_round_ps`
- `_mm256_unpackhi_pd`
- `_mm256_unpackhi_ps`
- `_mm256_unpacklo_pd`
- `_mm256_unpacklo_ps`
- `_mm256_testz_si256`
- `_mm256_testc_si256`
- `_mm256_testnzc_si256`
- `_mm256_testz_pd`
- `_mm256_testc_pd`
- `_mm256_testnzc_pd`
- `_mm_testz_pd`
- `_mm_testc_pd`
- `_mm_testnzc_pd`
- `_mm256_testz_ps`
- `_mm256_testc_ps`
- `_mm256_testnzc_ps`
- `_mm_testz_ps`
- `_mm_testc_ps`
- `_mm_testnzc_ps`
- `_mm256_movemask_pd`
- `_mm256_movemask_ps`
- `_mm256_setzero_pd`
- `_mm256_setzero_ps`
- `_mm256_setzero_si256`
- `_mm256_set_pd`
- `_mm256_set_ps`
- `_mm256_set_epi8`
- `_mm256_set_epi16`
- `_mm256_set_epi32`
- `_mm256_set_epi64x`
- `_mm256_setr_pd`
- `_mm256_setr_ps`
- `_mm256_setr_epi8`
- `_mm256_setr_epi16`
- `_mm256_setr_epi32`
- `_mm256_setr_epi64x`
- `_mm256_set1_pd`
- `_mm256_set1_ps`
- `_mm256_set1_epi8`
- `_mm256_set1_epi16`
- `_mm256_set1_epi32`
- `_mm256_set1_epi64x`
- `_mm256_castpd_ps`
- `_mm256_castps_pd`
- `_mm256_castps_si256`
- `_mm256_castpd_si256`
- `_mm256_castsi256_ps`
- `_mm256_castsi256_pd`
- `_mm256_castps256_ps128`
- `_mm256_castpd256_pd128`
- `_mm256_castsi256_si128`
- `_mm256_castps128_ps256`
- `_mm256_castpd128_pd256`
- `_mm256_castsi128_si256`
- `_mm256_zextps128_ps256`
- `_mm256_zextpd128_pd256`
- `_mm256_zextsi128_si256`
- `_mm256_floor_ps`
- `_mm256_ceil_ps`
- `_mm256_floor_pd`
- `_mm256_ceil_pd`
- `_mm256_undefined_ps`
- `_mm256_undefined_pd`
- `_mm256_undefined_si256`
- `_mm256_set_m128`
- `_mm256_set_m128d`
- `_mm256_set_m128i`
- `_mm256_setr_m128`
- `_mm256_setr_m128d`
- `_mm256_setr_m128i`
- `_mm256_loadu2_m128`
- `_mm256_loadu2_m128d`
- `_mm256_loadu2_m128i`
- `_mm256_storeu2_m128`
- `_mm256_storeu2_m128d`
- `_mm256_storeu2_m128i`
## AVX2

- `_mm256_abs_epi8`
- `_mm256_abs_epi16`
- `_mm256_abs_epi32`
- `_mm256_add_epi8`
- `_mm256_add_epi16`
- `_mm256_add_epi32`
- `_mm256_add_epi64`
- `_mm256_adds_epi8`
- `_mm256_adds_epi16`
- `_mm256_adds_epu8`
- `_mm256_adds_epu16`
- `_mm256_alignr_epi8`
- `_mm256_and_si256`
- `_mm256_andnot_si256`
- `_mm256_avg_epu8`
- `_mm256_avg_epu16`
- `_mm256_blend_epi16`
- `_mm_blend_epi32`
- `_mm256_blend_epi32`
- `_mm256_blendv_epi8`
- `_mm_broadcastb_epi8`
- `_mm256_broadcastb_epi8`
- `_mm_broadcastd_epi32`
- `_mm256_broadcastd_epi32`
- `_mm_broadcastq_epi64`
- `_mm256_broadcastq_epi64`
- `_mm_broadcastsd_pd`
- `_mm256_broadcastsd_pd`
- `_mm_broadcastsi128_si256`
- `_mm256_broadcastsi128_si256`
- `_mm_broadcastss_ps`
- `_mm256_broadcastss_ps`
- `_mm_broadcastw_epi16`
- `_mm256_broadcastw_epi16`
- `_mm256_cmpeq_epi8`
- `_mm256_cmpeq_epi16`
- `_mm256_cmpeq_epi32`
- `_mm256_cmpeq_epi64`
- `_mm256_cmpgt_epi8`
- `_mm256_cmpgt_epi16`
- `_mm256_cmpgt_epi32`
- `_mm256_cmpgt_epi64`
- `_mm256_cvtepi16_epi32`
- `_mm256_cvtepi16_epi64`
- `_mm256_cvtepi32_epi64`
- `_mm256_cvtepi8_epi16`
- `_mm256_cvtepi8_epi32`
- `_mm256_cvtepi8_epi64`
- `_mm256_cvtepu16_epi32`
- `_mm256_cvtepu16_epi64`
- `_mm256_cvtepu32_epi64`
- `_mm256_cvtepu8_epi16`
- `_mm256_cvtepu8_epi32`
- `_mm256_cvtepu8_epi64`
- `_mm256_extracti128_si256`
- `_mm256_hadd_epi16`
- `_mm256_hadd_epi32`
- `_mm256_hadds_epi16`
- `_mm256_hsub_epi16`
- `_mm256_hsub_epi32`
- `_mm256_hsubs_epi16`
- `_mm_i32gather_pd`
- `_mm256_i32gather_pd`
- `_mm_i32gather_ps`
- `_mm256_i32gather_ps`
- `_mm_i32gather_epi32`
- `_mm256_i32gather_epi32`
- `_mm_i32gather_epi64`
- `_mm256_i32gather_epi64`
- `_mm_i64gather_pd`
- `_mm256_i64gather_pd`
- `_mm_i64gather_ps`
- `_mm256_i64gather_ps`
- `_mm_i64gather_epi32`
- `_mm256_i64gather_epi32`
- `_mm_i64gather_epi64`
- `_mm256_i64gather_epi64`
- `_mm256_inserti128_si256`
- `_mm256_madd_epi16`
- `_mm256_maddubs_epi16`
- `_mm_mask_i32gather_pd`
- `_mm256_mask_i32gather_pd`
- `_mm_mask_i32gather_ps`
- `_mm256_mask_i32gather_ps`
- `_mm_mask_i32gather_epi32`
- `_mm256_mask_i32gather_epi32`
- `_mm_mask_i32gather_epi64`
- `_mm256_mask_i32gather_epi64`
- `_mm_mask_i64gather_pd`
- `_mm256_mask_i64gather_pd`
- `_mm_mask_i64gather_ps`
- `_mm256_mask_i64gather_ps`
- `_mm_mask_i64gather_epi32`
- `_mm256_mask_i64gather_epi32`
- `_mm_mask_i64gather_epi64`
- `_mm256_mask_i64gather_epi64`
- `_mm_maskload_epi32`
- `_mm256_maskload_epi32`
- `_mm_maskload_epi64`
- `_mm256_maskload_epi64`
- `_mm_maskstore_epi32`
- `_mm256_maskstore_epi32`
- `_mm_maskstore_epi64`
- `_mm256_maskstore_epi64`
- `_mm256_max_epi8`
- `_mm256_max_epi16`
- `_mm256_max_epi32`
- `_mm256_max_epu8`
- `_mm256_max_epu16`
- `_mm256_max_epu32`
- `_mm256_min_epi8`
- `_mm256_min_epi16`
- `_mm256_min_epi32`
- `_mm256_min_epu8`
- `_mm256_min_epu16`
- `_mm256_min_epu32`
- `_mm256_movemask_epi8`
- `_mm256_mpsadbw_epu8`
- `_mm256_mul_epi32`
- `_mm256_mul_epu32`
- `_mm256_mulhi_epi16`
- `_mm256_mulhi_epu16`
- `_mm256_mulhrs_epi16`
- `_mm256_mullo_epi16`
- `_mm256_mullo_epi32`
- `_mm256_or_si256`
- `_mm256_packs_epi16`
- `_mm256_packs_epi32`
- `_mm256_packus_epi16`
- `_mm256_packus_epi32`
- `_mm256_permute2x128_si256`
- `_mm256_permute4x64_epi64`
- `_mm256_permute4x64_pd`
- `_mm256_permutevar8x32_epi32`
- `_mm256_permutevar8x32_ps`
- `_mm256_sad_epu8`
- `_mm256_shuffle_epi32`
- `_mm256_shuffle_epi8`
- `_mm256_shufflehi_epi16`
- `_mm256_shufflelo_epi16`
- `_mm256_sign_epi8`
- `_mm256_sign_epi16`
- `_mm256_sign_epi32`
- `_mm256_slli_si256`
- `_mm256_bslli_epi128`
- `_mm256_sll_epi16`
- `_mm256_slli_epi16`
- `_mm256_sll_epi32`
- `_mm256_slli_epi32`
- `_mm256_sll_epi64`
- `_mm256_slli_epi64`
- `_mm_sllv_epi32`
- `_mm256_sllv_epi32`
- `_mm_sllv_epi64`
- `_mm256_sllv_epi64`
- `_mm256_sra_epi16`
- `_mm256_srai_epi16`
- `_mm256_sra_epi32`
- `_mm256_srai_epi32`
- `_mm_srav_epi32`
- `_mm256_srav_epi32`
- `_mm256_srli_si256`
- `_mm256_bsrli_epi128`
- `_mm256_srl_epi16`
- `_mm256_srli_epi16`
- `_mm256_srl_epi32`
- `_mm256_srli_epi32`
- `_mm256_srl_epi64`
- `_mm256_srli_epi64`
- `_mm_srlv_epi32`
- `_mm256_srlv_epi32`
- `_mm_srlv_epi64`
- `_mm256_srlv_epi64`
- `_mm256_stream_load_si256`
- `_mm256_sub_epi8`
- `_mm256_sub_epi16`
- `_mm256_sub_epi32`
- `_mm256_sub_epi64`
- `_mm256_subs_epi8`
- `_mm256_subs_epi16`
- `_mm256_subs_epu8`
- `_mm256_subs_epu16`
- `_mm256_xor_si256`
- `_mm256_unpackhi_epi8`
- `_mm256_unpackhi_epi16`
- `_mm256_unpackhi_epi32`
- `_mm256_unpackhi_epi64`
- `_mm256_unpacklo_epi8`
- `_mm256_unpacklo_epi16`
- `_mm256_unpacklo_epi32`
- `_mm256_unpacklo_epi64`