# SSE to BQN dictionary
# This is an implementation of SSE SIMD intrinsics in BQN. Its purpose is
# understanding: it looks at SIMD instructions from an array programmer's point of view.
# There are limits to what you can represent in a programming language without
# losing clarity, and bit hackery is one of them. So instead of converting
# from floats to bits and back, we just assume that the data *already* is
# represented the way we need.
# As they say, all models are wrong but some are useful.
# For a complete reference on all kinds of SIMD intrinsics for x86_64, see the
# Intel® C++ Compiler Classic Developer Guide and Reference.
# For a quick reference, see the Intel® Intrinsics Guide.
# The all-ones mask produced by comparisons. Its actual bit pattern is 0xFFFFFFFF,
# which is a NaN when read as a float, so we model it as NaN.
mf ← 0÷0
# Two floating-point numbers are called *ordered* if neither of them is NaN
Ord ← ∧´·=˜∾
# Round to the nearest integer (ties to even) and round toward zero (truncate)
Rti ← ⌊⊢+2÷˜0.5≠2|⊢ ⋄ Trn ← ××⌊∘|
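# A quick check of both rounding helpers on hand-picked values. Ties go to the even integer:
#   Rti 0.5‿1.5‿2.5‿¯1.5   → 0‿2‿2‿¯2
#   Trn 2.7‿¯2.7           → 2‿¯2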
## Arithmetic
Add_ps ← + ⋄ Sub_ps ← - ⋄ Mul_ps ← × ⋄ Div_ps ← ÷
Sqrt_ps ← √ ⋄ Rcp_ps ← ÷ ⋄ Rsqrt_ps ← ÷∘√
Min_ps ← ⌊ ⋄ Max_ps ← ⌈
Add_ss ← +⌾⊏˜ ⋄ Sub_ss ← -˜⌾⊏˜ ⋄ Mul_ss ← ×⌾⊏˜ ⋄ Div_ss ← ÷˜⌾⊏˜
Sqrt_ss ← √⌾⊏ ⋄ Rcp_ss ← ÷⌾⊏ ⋄ Rsqrt_ss ← ÷∘√⌾⊏
Min_ss ← ⌊⌾⊏˜ ⋄ Max_ss ← ⌈⌾⊏˜
# Most other _ss intrinsics follow the same `F⌾⊏` pattern, so we skip them
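# An example with illustrative values. The _ss forms change only element 0 and copy
# the other elements from the first (or only) argument:
#   1‿2‿3‿4 Add_ss 10‿20‿30‿40   → 11‿2‿3‿4
#   Sqrt_ss 16‿4‿9‿1             → 4‿4‿9‿1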
## Logic
# Here values are 4‿32 arrays of bits
And_ps ← ∧ ⋄ Andnot_ps ← ¬⊸∧   # (NOT 𝕨) AND 𝕩, as in _mm_andnot_ps
Or_ps ← ∨ ⋄ Xor_ps ← ≠
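# For brevity, these examples use 4 bits instead of a 4‿32 array. The operations work bit by bit either way:
#   1‿1‿0‿0 And_ps 1‿0‿1‿0   → 1‿0‿0‿0
#   1‿1‿0‿0 Or_ps  1‿0‿1‿0   → 1‿1‿1‿0
#   1‿1‿0‿0 Xor_ps 1‿0‿1‿0   → 0‿1‿1‿0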
## Comparisons
Cmpeq_ps ← 0‿mf⊏˜= ⋄ Cmple_ps ← 0‿mf⊏˜≤ ⋄ Cmpge_ps ← 0‿mf⊏˜≥
Cmpneq_ps ← 0‿mf⊏˜≠ ⋄ Cmplt_ps ← 0‿mf⊏˜< ⋄ Cmpgt_ps ← 0‿mf⊏˜>
Cmpnlt_ps ← 0‿mf⊏˜¬∘< ⋄ Cmpnle_ps ← 0‿mf⊏˜¬∘≤
Cmpngt_ps ← 0‿mf⊏˜¬∘> ⋄ Cmpnge_ps ← 0‿mf⊏˜¬∘≥
Cmpord_ps ← 0‿mf⊏˜Ord¨ ⋄ Cmpunord_ps ← 0‿mf⊏˜¬∘Ord¨
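# For example (sample values), lanes where the comparison holds become mf (the all-ones
# mask, printed as NaN) and the other lanes become 0:
#   1‿5‿3‿7 Cmplt_ps 2‿4‿3‿8   → mf‿0‿0‿mf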
# These return an i32 (0 or 1)
Comieq_ss ← =○⊑ ⋄ Comilt_ss ← <○⊑ ⋄ Comigt_ss ← >○⊑
Comineq_ss ← ≠○⊑ ⋄ Comile_ss ← ≤○⊑ ⋄ Comige_ss ← ≥○⊑
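# Only element 0 of each argument is compared, so for example:
#   1‿9‿9‿9 Comilt_ss 2‿0‿0‿0   → 1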
# ucomi* instructions are the same as comi*, except they do not signal
# an exception for QNaNs
## Conversions
# Conversions are the most poorly modeled part, because BQN has only a single number type
# These return a single i32 or packed i32s
Cvtss_si32 ← Rti ⊑ ⋄ Cvtps_pi32 ← Rti 2⊸↑
Cvttss_si32 ← Trn ⊑ ⋄ Cvttps_pi32 ← Trn 2⊸↑
# Like cvtps2pi, these round with the current rounding mode (saturation is not modeled)
Cvtps_pi16 ← Rti ⋄ Cvtps_pi8 ← Rti
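# The difference between rounding and truncation, shown on sample values. Only the lower two lanes are converted:
#   Cvtps_pi32  2.5‿¯1.5‿7‿8   → 2‿¯2
#   Cvttps_pi32 2.5‿¯1.5‿7‿8   → 2‿¯1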
# These convert integers to floats
Cvtsi32_ss ← ⊣⌾⊑˜ ⋄ Cvtpi32_ps ← ⊣⌾(2⊸↑)˜
Cvtpi16_ps ← ⊢ ⋄ Cvtpu16_ps ← ⊢
# Only the lower four of the eight 8-bit integers are converted
Cvtpi8_ps ← 4⊸↑ ⋄ Cvtpu8_ps ← 4⊸↑ ⋄ Cvtpi32x2_ps ← ⊢∾
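# Examples with hand-picked values. For Cvtsi32_ss and Cvtpi32_ps, the lanes that are not
# converted come from 𝕨. Cvtpi32x2_ps simply joins its two arguments:
#   1‿2‿3‿4 Cvtsi32_ss 9     → 9‿2‿3‿4
#   1‿2‿3‿4 Cvtpi32_ps 8‿9   → 8‿9‿3‿4
#   1‿2 Cvtpi32x2_ps 3‿4     → 1‿2‿3‿4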
# This one returns an f32
Cvtss_f32 ← ⊑
## Load intrinsics
# Load intrinsics take a pointer as an argument, but we assume it's an array or a value
Loadh_pi ← (2↑⊣)∾⊢ ⋄ Loadl_pi ← ⊢∾(2↓⊣)
Load_ss ← 4↑⊢ ⋄ Load1_ps ← 4⥊⊢
Load_ps ← ⊢ ⋄ Loadu_ps ← ⊢ ⋄ Loadr_ps ← ⌽
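# In these examples the "pointer" is just the array being loaded (values are illustrative):
#   1‿2‿3‿4 Loadh_pi 8‿9   → 1‿2‿8‿9
#   1‿2‿3‿4 Loadl_pi 8‿9   → 8‿9‿3‿4
#   Load1_ps 5             → 5‿5‿5‿5
#   Loadr_ps 1‿2‿3‿4       → 4‿3‿2‿1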
## Set intrinsics
# Set intrinsics are like load intrinsics, except they take values instead of a pointer
# _mm_set_ps lists the elements from the highest down, _mm_setr_ps lists them in memory order
Set_ss ← 4↑⊢ ⋄ Set1_ps ← 4⥊⊢
Set_ps ← ⌽ ⋄ Setr_ps ← ⊢ ⋄ setzero_ps ← 4⥊0
## Store intrinsics
# Store intrinsics are essentially the inverse of load intrinsics
Storeh_pi ← 2⊸↓ ⋄ Storel_pi ← 2⊸↑
Store_ss ← ⊑ ⋄ Store1_ps ← 4⥊⊑
Store_ps ← ⊢ ⋄ Storeu_ps ← ⊢ ⋄ Storer_ps ← ⌽
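# In this model, the result is what would end up in memory. For example:
#   Storeh_pi 1‿2‿3‿4   → 3‿4
#   Store1_ps 1‿2‿3‿4   → 1‿1‿1‿1
#   Storer_ps 1‿2‿3‿4   → 4‿3‿2‿1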
## Integer Intrinsics
# Integer intrinsics treat __m64 as an array of integers
# Integer out, integer in
Extract_pi16 ← ⊑˜ ⋄ Insert_pi16 ← {d‿n←𝕩 ⋄ d⌾(n⊑⊢)𝕨}
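# The immediate operand is passed as 𝕩. For Insert_pi16, 𝕩 is the pair d‿n. With sample values:
#   1‿2‿3‿4 Extract_pi16 2    → 3
#   1‿2‿3‿4 Insert_pi16 9‿2   → 1‿2‿9‿4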
# Element-wise maximum and minimum
Max_pi16 ← ⌈ ⋄ Max_pu8 ← ⌈ ⋄ Min_pi16 ← ⌊ ⋄ Min_pu8 ← ⌊
# Interpret the input as 8‿i8. Result is an 8‿u1 bitmask
Movemask_pi8 ← <⟜0
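# For example, only the negative bytes set their bit:
#   Movemask_pi8 5‿¯1‿0‿¯128‿127‿¯3‿2‿1   → 0‿1‿0‿1‿0‿1‿0‿0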
# Keep the high 16 bits of each 32-bit product
Mulhi_pu16 ← (2⋆16) ⌊∘÷˜ ×
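# A worked example with sample values: 1000×1000 = 1000000 = 15×65536 + 16960,
# and 40000×3 = 120000 = 1×65536 + 54464, so
#   1000‿40000 Mulhi_pu16 1000‿3   → 15‿1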
# Indices are 4‿u2 as int
Shuffle_pi16 ← ⊏˜
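# Each index picks a lane of 𝕨 (sample values):
#   10‿20‿30‿40 Shuffle_pi16 3‿3‿0‿1   → 40‿40‿10‿20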
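# The intrinsic actually writes to memory at d. Here we return the updated d instead.
# The mask m is modeled as booleans (the intrinsic tests the high bit of each mask byte)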
Maskmove_si64 ← {m‿d←𝕩 ⋄ (m/𝕨)⌾(m/⊢)d}
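# An example using 4 lanes instead of 8. Wherever m is 1, d takes the lane from 𝕨:
#   1‿2‿3‿4 Maskmove_si64 ⟨1‿0‿1‿0, 9‿9‿9‿9⟩   → 1‿9‿3‿9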
# Averages, rounded up: (a+b+1)>>1
Avg_pu8 ← 2 ⌈∘÷˜ + ⋄ Avg_pu16 ← 2 ⌈∘÷˜ +
# Input is 8‿u8, output is 4‿u16: the sum of absolute differences, then three zeros
Sad_pu8 ← 4↑·+´|∘- # pu8 is sad :(
## Miscellaneous
# 𝕨 is the pair a‿b, and the indices 𝕩 are 4‿u2 as uint
Shuffle_ps ← {a‿b←𝕨 ⋄ (a⊏˜2↑𝕩)∾b⊏˜2↓𝕩}
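# The first two indices select from a and the last two from b. For example (sample values):
#   ⟨1‿2‿3‿4, 5‿6‿7‿8⟩ Shuffle_ps 3‿0‿1‿2   → 4‿1‿6‿7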
Unpackhi_ps ← ⥊∘⍉ ≍○(2↓⊢) ⋄ Unpacklo_ps ← ⥊∘⍉ ≍○(2↑⊢)
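# The unpacks interleave the low (or high) halves of both arguments:
#   1‿2‿3‿4 Unpacklo_ps 5‿6‿7‿8   → 1‿5‿2‿6
#   1‿2‿3‿4 Unpackhi_ps 5‿6‿7‿8   → 3‿7‿4‿8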
Move_ss ← ⊣⌾⊑˜ ⋄ Movehl_ps ← (2↓⊢)∾(2↓⊣)
Movelh_ps ← (2↑⊣)∾(2↑⊢) ⋄ Movemask_ps ← <⟜0
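# Movemask_ps keeps one sign bit per lane. With sample values:
#   Movemask_ps 1‿¯2‿3‿¯4   → 0‿1‿0‿1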
# `Undefined_ps` is, well, undefined
# That's all of the first SSE. But there are also SSE2, SSE3, SSSE3 and even SSE4...
