@toivoh
Created November 2, 2014 09:08
Trying to get Julia to emit an efficient sequence of SIMD instructions with the aid of llvmcall
module TestSIMD2
# requires Julia 0.4 for llvmcall
typealias Uint64x2 NTuple{2, Uint64}
function ($)(x::Uint64x2, y::Uint64x2)
    Base.llvmcall("""%3 = xor <2 x i64> %1, %0
                     ret <2 x i64> %3""",
                  Uint64x2, (Uint64x2, Uint64x2), x, y)
end

function innerloop!{T}(dest::Vector{T}, dest_ofs, src::Vector{T}, src_ofs)
    @inbounds s = ( src[1 + 2*src_ofs],   src[2 + 2*src_ofs])
    @inbounds d = (dest[1 + 2*dest_ofs], dest[2 + 2*dest_ofs])
    d $= s
    @inbounds (dest[1 + 2*dest_ofs], dest[2 + 2*dest_ofs]) = d
end
T = Uint64
code_native(innerloop!, (Vector{T}, Int, Vector{T}, Int))
end

toivoh commented Nov 2, 2014

I'm trying to get Julia to emit an unrolled inner loop that uses SIMD instructions. (Trying to optimize https://gist.github.com/toivoh/c9a1f1e064396bdf3447)
Building on a comment in JuliaLang/julia#8786, I can get pretty far (thanks, @GunnarFarneback!) by using llvmcall (Julia 0.4 only) and unrolling manually.

Running the example I get SIMD instructions:

Source line: 12
    push    RBP
    mov RBP, RSP
Source line: 12
    shl RSI, 4
    mov RAX, QWORD PTR [RDI + 8]
    movq    XMM1, QWORD PTR [RAX + RSI + 8]
    movq    XMM0, QWORD PTR [RAX + RSI]
    punpcklqdq  XMM0, XMM1      # xmm0 = xmm0[0],xmm1[0]
Source line: 11
    shl RCX, 4
    mov RDX, QWORD PTR [RDX + 8]
    movq    XMM2, QWORD PTR [RDX + RCX + 8]
    movq    XMM1, QWORD PTR [RDX + RCX]
    punpcklqdq  XMM1, XMM2      # xmm1 = xmm1[0],xmm2[0]
    pxor    XMM1, XMM0
Source line: 14
    movq    QWORD PTR [RAX + RSI], XMM1
    mov RAX, QWORD PTR [RDI + 8]
    pextrq  QWORD PTR [RAX + RSI + 8], XMM1, 1
    mov EAX, 7257360
    pop RBP
    ret

But it appears that LLVM doesn't realize that the memory accesses are always 16-byte aligned: it emits a lot of packing and unpacking instructions to load/store only 8 bytes at a time. Arrays always seem to be 16-byte aligned in Julia; I wonder if we can make LLVM realize this?


Keno commented Nov 2, 2014

Unfortunately, Julia's arrays can also hold unaligned data. For fun, I told LLVM that they were aligned anyway, but that does not seem to make a difference in the generated code.
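
As a rough illustration of how a Julia array can end up holding unaligned data, here is a minimal sketch in current Julia syntax (1.x, not the 0.4 used in the gist). The helper name `is_aligned16` is made up for this example; `pointer`, `unsafe_wrap` and `GC.@preserve` are standard Base functionality. It wraps memory that starts one element (8 bytes) into an existing array, so the two arrays cannot both start on a 16-byte boundary:

```julia
# Hypothetical helper: true if the array's data starts on a 16-byte boundary.
is_aligned16(v::Array) = UInt(pointer(v)) % 16 == 0

a = zeros(UInt64, 4)

GC.@preserve a begin
    # Wrap elements 2:3 of `a` as a separate Array without copying.
    # Its data pointer is 8 bytes past the data pointer of `a`.
    b = unsafe_wrap(Array, pointer(a, 2), 2)

    println("a aligned to 16 bytes: ", is_aligned16(a))
    println("b aligned to 16 bytes: ", is_aligned16(b))

    # The two pointers differ by 8 bytes, so exactly one is 16-byte aligned.
    @assert is_aligned16(a) != is_aligned16(b)
end
```

Both `a` and `b` have the same array type, so a method compiled for that type presumably has to handle either one, which would explain why alignment cannot be assumed from the type alone.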


Keno commented Nov 2, 2014

Note: if you really want to, you can use llvmcall with a load <2xi64> ... align 16 to get the desired behavior.
