nlw0/SIMD_demo.ipynb

## SIMD_demo.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "SIMD merge-sort\n",
    "====\n",
    "\n",
    "This is an experiment and exercise where we'll implement a merge-sort algorithm in Julia using SIMD instructions. \n",
    "\n",
    "\n",
    "### The idea\n",
    "The ideas used here can be found in the articles:\n",
    " - _Efficient Implementation of Sorting on Multi-Core SIMD CPU Architecture_, Jatin Chhugani et al, 2008\n",
    " - _SIMD- and Cache-Friendly Algorithm for Sorting an Array of Structures_, Hiroshi Inoue and Kenjiro Taura, 2007\n",
    "\n",
    "Radix sort whith infinite buckets is \"the best\" sorting algorithm for simple integers in theory. In practice, the existence of memory bandwidth, cache memory and SIMD instructions makes everything non-trivial, and something more in the lines of a merge sort can pay off. This notebook illustrates some ideas taken from those articles.\n",
    "\n",
    "### The code\n",
    "\n",
    "The very first step of the process is to take a sequence of 16 items from memory, split into four groups where we can sort in a vectorized fashion over each \"lane\" of the registers. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[32m\u001b[1m Resolving\u001b[22m\u001b[39m package versions...\n",
      "\u001b[32m\u001b[1m  Updating\u001b[22m\u001b[39m `~/.julia/Project.toml`\n",
      "\u001b[90m [no changes]\u001b[39m\n",
      "\u001b[32m\u001b[1m  Updating\u001b[22m\u001b[39m `~/.julia/Manifest.toml`\n",
      "\u001b[90m [no changes]\u001b[39m\n"
     ]
    }
   ],
   "source": [
    "import Pkg; Pkg.add(\"SIMD\")\n",
    "using SIMD"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "@inline function in_register_sorting_t(a1::Vec{4, T}, a2::Vec{4, T}, a3::Vec{4, T}, a4::Vec{4, T}) where T\n",
    "    b1 = min(a1,a2)\n",
    "    b2 = max(a1,a2)\n",
    "    b3 = min(a3,a4)\n",
    "    b4 = max(a3,a4)\n",
    "\n",
    "    c1 = min(b1,b3)\n",
    "    c3 = max(b1,b3)\n",
    "    c2 = min(b2,b4)\n",
    "    c4 = max(b2,b4)\n",
    "\n",
    "    d2 = min(c2,c3)\n",
    "    d3 = max(c2,c3)\n",
    "\n",
    "    (c1, d2, d3, c4)\n",
    "end;\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "An example data set with 4x4 random integers."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "4-element Array{Vec{4,Int8},1}:\n",
       " <4 x Int8>[65, -48, 95, -18]  \n",
       " <4 x Int8>[19, -100, 117, -18]\n",
       " <4 x Int8>[80, 5, 63, 19]     \n",
       " <4 x Int8>[-128, 5, 120, -108]"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "T = Int8\n",
    "aa = [Vec(tuple(rand(T, 4)...)) for _ in 1:4]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we have the smallet values on top and largest on bottom, and each \"lane\" is a sorted sequence of elements.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "4-element Array{Vec{4,Int8},1}:\n",
       " <4 x Int8>[-128, -100, 63, -108]\n",
       " <4 x Int8>[19, -48, 95, -18]    \n",
       " <4 x Int8>[65, 5, 117, -18]     \n",
       " <4 x Int8>[80, 5, 120, 19]      "
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "[in_register_sorting_t(aa...)...]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "How does the machine code look like? It should have lots of `vpmin` and `vpmax`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\t.text\n",
      "; Function in_register_sorting_t {\n",
      "; Location: In[2]:2\n",
      "; Function min; {\n",
      "; Location: SIMD.jl:934\n",
      "; Function >=; {\n",
      "; Location: SIMD.jl:833\n",
      "; Function llvmwrap; {\n",
      "; Location: SIMD.jl:564\n",
      "; Function macro expansion; {\n",
      "; Location: In[2]:2\n",
      "\tvpmovsxbd\t(%rsi), %xmm0\n",
      "\tvpmovsxbd\t(%rdx), %xmm1\n",
      ";}}}\n",
      "; Function vifelse; {\n",
      "; Location: SIMD.jl:863\n",
      "; Function macro expansion; {\n",
      "; Location: SIMD.jl:886\n",
      "\tvpminsd\t%xmm1, %xmm0, %xmm2\n",
      ";}}}\n",
      "; Location: In[2]:3\n",
      "; Function max; {\n",
      "; Location: SIMD.jl:932\n",
      "; Function vifelse; {\n",
      "; Location: SIMD.jl:863\n",
      "; Function macro expansion; {\n",
      "; Location: SIMD.jl:886\n",
      "\tvpmaxsd\t%xmm1, %xmm0, %xmm0\n",
      ";}}}\n",
      "; Location: In[2]:4\n",
      "; Function min; {\n",
      "; Location: SIMD.jl:934\n",
      "; Function >=; {\n",
      "; Location: SIMD.jl:833\n",
      "; Function llvmwrap; {\n",
      "; Location: SIMD.jl:564\n",
      "; Function macro expansion; {\n",
      "; Location: SIMD.jl:591\n",
      "\tvpmovsxbd\t(%rcx), %xmm1\n",
      "\tvpmovsxbd\t(%r8), %xmm3\n",
      ";}}}\n",
      "; Function vifelse; {\n",
      "; Location: SIMD.jl:863\n",
      "; Function macro expansion; {\n",
      "; Location: SIMD.jl:886\n",
      "\tvpminsd\t%xmm3, %xmm1, %xmm4\n",
      ";}}}\n",
      "; Location: In[2]:5\n",
      "; Function max; {\n",
      "; Location: SIMD.jl:932\n",
      "; Function vifelse; {\n",
      "; Location: SIMD.jl:863\n",
      "; Function macro expansion; {\n",
      "; Location: SIMD.jl:886\n",
      "\tvpmaxsd\t%xmm3, %xmm1, %xmm1\n",
      ";}}}\n",
      "; Location: In[2]:7\n",
      "; Function min; {\n",
      "; Location: SIMD.jl:934\n",
      "; Function vifelse; {\n",
      "; Location: SIMD.jl:863\n",
      "; Function macro expansion; {\n",
      "; Location: SIMD.jl:886\n",
      "\tvpminsd\t%xmm4, %xmm2, %xmm3\n",
      ";}}}\n",
      "; Location: In[2]:8\n",
      "; Function max; {\n",
      "; Location: SIMD.jl:932\n",
      "; Function vifelse; {\n",
      "; Location: SIMD.jl:863\n",
      "; Function macro expansion; {\n",
      "; Location: SIMD.jl:886\n",
      "\tvpmaxsd\t%xmm4, %xmm2, %xmm2\n",
      ";}}}\n",
      "; Location: In[2]:9\n",
      "; Function min; {\n",
      "; Location: SIMD.jl:934\n",
      "; Function vifelse; {\n",
      "; Location: SIMD.jl:863\n",
      "; Function macro expansion; {\n",
      "; Location: SIMD.jl:886\n",
      "\tvpminsd\t%xmm1, %xmm0, %xmm4\n",
      ";}}}\n",
      "; Location: In[2]:10\n",
      "; Function max; {\n",
      "; Location: SIMD.jl:932\n",
      "; Function vifelse; {\n",
      "; Location: SIMD.jl:863\n",
      "; Function macro expansion; {\n",
      "; Location: SIMD.jl:886\n",
      "\tvpmaxsd\t%xmm1, %xmm0, %xmm0\n",
      ";}}}\n",
      "; Location: In[2]:12\n",
      "; Function min; {\n",
      "; Location: SIMD.jl:934\n",
      "; Function vifelse; {\n",
      "; Location: SIMD.jl:863\n",
      "; Function macro expansion; {\n",
      "; Location: SIMD.jl:886\n",
      "\tvpminsd\t%xmm2, %xmm4, %xmm1\n",
      ";}}}\n",
      "; Location: In[2]:13\n",
      "; Function max; {\n",
      "; Location: SIMD.jl:932\n",
      "; Function vifelse; {\n",
      "; Location: SIMD.jl:863\n",
      "; Function macro expansion; {\n",
      "; Location: SIMD.jl:886\n",
      "\tvpmaxsd\t%xmm2, %xmm4, %xmm2\n",
      "\tmovabsq\t$140358466906096, %rax  # imm = 0x7FA7C08FA7F0\n",
      ";}}}\n",
      "; Location: In[2]:15\n",
      "\tvpshufb\t(%rax), %xmm2, %xmm2\n",
      "\tmovabsq\t$140358466906112, %rax  # imm = 0x7FA7C08FA800\n",
      "\tvpshufb\t(%rax), %xmm0, %xmm0\n",
      "\tvpblendd\t$8, %xmm0, %xmm2, %xmm0 # xmm0 = xmm2[0,1,2],xmm0[3]\n",
      "\tmovabsq\t$140358466906128, %rax  # imm = 0x7FA7C08FA810\n",
      "\tvmovdqa\t(%rax), %xmm2\n",
      "\tvpshufb\t%xmm2, %xmm3, %xmm3\n",
      "\tvpshufb\t%xmm2, %xmm1, %xmm1\n",
      "\tvpunpckldq\t%xmm1, %xmm3, %xmm1 # xmm1 = xmm3[0],xmm1[0],xmm3[1],xmm1[1]\n",
      "\tvpblendd\t$12, %xmm0, %xmm1, %xmm0 # xmm0 = xmm1[0,1],xmm0[2,3]\n",
      "\tvmovdqu\t%xmm0, (%rdi)\n",
      "\tmovq\t%rdi, %rax\n",
      "\tretq\n",
      "\tnopw\t%cs:(%rax,%rax)\n",
      ";}\n"
     ]
    }
   ],
   "source": [
    "@code_native in_register_sorting_t(aa...)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we need to transpose these values so each register holds a sorted sequence of 4 values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "@inline function transpose_registers(a1::Vec{4, T}, a2::Vec{4, T}, a3::Vec{4, T}, a4::Vec{4, T}) where T\n",
    "    b1 = shufflevector(a1,a3,Val{(0,1,4,5)})\n",
    "    b2 = shufflevector(a2,a4,Val{(0,1,4,5)})\n",
    "    b3 = shufflevector(a1,a3,Val{(2,3,6,7)})\n",
    "    b4 = shufflevector(a2,a4,Val{(2,3,6,7)})\n",
    "\n",
    "    c1 = shufflevector(b1,b2,Val{(0,4,2,6)})\n",
    "    c2 = shufflevector(b1,b2,Val{(1,5,3,7)})\n",
    "    c3 = shufflevector(b3,b4,Val{(0,4,2,6)})\n",
    "    c4 = shufflevector(b3,b4,Val{(1,5,3,7)})\n",
    "\n",
    "    (c1, c2, c3, c4)\n",
    "end\n",
    "\n",
    "@inline function in_register_sorting(a1::Vec{4, T}, a2::Vec{4, T}, a3::Vec{4, T}, a4::Vec{4, T}) where T\n",
    "    transpose_registers(in_register_sorting_t(a1, a2, a3, a4)...)\n",
    "end;"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "4-element Array{Vec{4,Int8},1}:\n",
       " <4 x Int8>[-128, 19, 65, 80]  \n",
       " <4 x Int8>[-100, -48, 5, 5]   \n",
       " <4 x Int8>[63, 95, 117, 120]  \n",
       " <4 x Int8>[-108, -18, -18, 19]"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "[in_register_sorting(aa...)...]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The transposition is done by instructions such as `vpunpckl` and `vpshuf`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\t.text\n",
      "; Function in_register_sorting {\n",
      "; Location: In[6]:16\n",
      "; Function in_register_sorting_t; {\n",
      "; Location: In[2]:2\n",
      "; Function min; {\n",
      "; Location: SIMD.jl:934\n",
      "; Function >=; {\n",
      "; Location: SIMD.jl:833\n",
      "; Function llvmwrap; {\n",
      "; Location: SIMD.jl:564\n",
      "; Function macro expansion; {\n",
      "; Location: In[6]:16\n",
      "\tvpmovsxbd\t(%rsi), %xmm0\n",
      "\tvpmovsxbd\t(%rdx), %xmm1\n",
      ";}}}\n",
      "; Location: SIMD.jl:886\n",
      "\tvpminsd\t%xmm1, %xmm0, %xmm2\n",
      ";}\n",
      "; Location: In[2]:3\n",
      "; Function max; {\n",
      "; Location: SIMD.jl:932\n",
      "; Function vifelse; {\n",
      "; Location: SIMD.jl:863\n",
      "; Function macro expansion; {\n",
      "; Location: SIMD.jl:886\n",
      "\tvpmaxsd\t%xmm1, %xmm0, %xmm0\n",
      ";}}}\n",
      "; Location: In[2]:4\n",
      "; Function min; {\n",
      "; Location: SIMD.jl:934\n",
      "; Function >=; {\n",
      "; Location: SIMD.jl:833\n",
      "; Function llvmwrap; {\n",
      "; Location: SIMD.jl:564\n",
      "; Function macro expansion; {\n",
      "; Location: SIMD.jl:591\n",
      "\tvpmovsxbd\t(%rcx), %xmm1\n",
      "\tvpmovsxbd\t(%r8), %xmm3\n",
      ";}}}\n",
      "; Location: SIMD.jl:886\n",
      "\tvpminsd\t%xmm3, %xmm1, %xmm4\n",
      ";}}\n",
      "; Function in_register_sorting_t; {\n",
      "; Location: SIMD.jl:886\n",
      "\tvpmaxsd\t%xmm3, %xmm1, %xmm1\n",
      ";}\n",
      "; Function in_register_sorting_t; {\n",
      "; Location: In[2]:7\n",
      "; Function min; {\n",
      "; Location: SIMD.jl:934\n",
      "; Function vifelse; {\n",
      "; Location: SIMD.jl:863\n",
      "; Function macro expansion; {\n",
      "; Location: SIMD.jl:886\n",
      "\tvpminsd\t%xmm4, %xmm2, %xmm3\n",
      ";}}}\n",
      "; Location: In[2]:8\n",
      "; Function max; {\n",
      "; Location: SIMD.jl:932\n",
      "; Function vifelse; {\n",
      "; Location: SIMD.jl:863\n",
      "; Function macro expansion; {\n",
      "; Location: SIMD.jl:886\n",
      "\tvpmaxsd\t%xmm4, %xmm2, %xmm2\n",
      ";}}}}\n",
      "; Function in_register_sorting_t; {\n",
      "; Location: SIMD.jl:886\n",
      "\tvpminsd\t%xmm1, %xmm0, %xmm4\n",
      ";}\n",
      "; Function in_register_sorting_t; {\n",
      "; Location: In[2]:10\n",
      "; Function max; {\n",
      "; Location: SIMD.jl:932\n",
      "; Function vifelse; {\n",
      "; Location: SIMD.jl:863\n",
      "; Function macro expansion; {\n",
      "; Location: SIMD.jl:886\n",
      "\tvpmaxsd\t%xmm1, %xmm0, %xmm0\n",
      ";}}}}\n",
      "; Function in_register_sorting_t; {\n",
      "; Location: SIMD.jl:886\n",
      "\tvpminsd\t%xmm2, %xmm4, %xmm1\n",
      ";}\n",
      "; Function in_register_sorting_t; {\n",
      "; Location: In[2]:13\n",
      "; Function max; {\n",
      "; Location: SIMD.jl:932\n",
      "; Function vifelse; {\n",
      "; Location: SIMD.jl:863\n",
      "; Function macro expansion; {\n",
      "; Location: SIMD.jl:886\n",
      "\tvpmaxsd\t%xmm2, %xmm4, %xmm2\n",
      ";}}}}\n",
      "; Function transpose_registers; {\n",
      "; Location: In[6]:2\n",
      "; Function shufflevector; {\n",
      "; Location: SIMD.jl:1443\n",
      "; Function macro expansion; {\n",
      "; Location: SIMD.jl:1446\n",
      "\tvpunpcklqdq\t%xmm2, %xmm3, %xmm4 # xmm4 = xmm3[0],xmm2[0]\n",
      ";}}}\n",
      "; Function transpose_registers; {\n",
      "; Location: SIMD.jl:1446\n",
      "\tvpunpcklqdq\t%xmm0, %xmm1, %xmm5 # xmm5 = xmm1[0],xmm0[0]\n",
      ";}\n",
      "; Function transpose_registers; {\n",
      "; Location: In[6]:4\n",
      "; Function shufflevector; {\n",
      "; Location: SIMD.jl:1443\n",
      "; Function macro expansion; {\n",
      "; Location: SIMD.jl:1446\n",
      "\tvpunpckhqdq\t%xmm2, %xmm3, %xmm2 # xmm2 = xmm3[1],xmm2[1]\n",
      ";}}}\n",
      "; Function transpose_registers; {\n",
      "; Location: SIMD.jl:1446\n",
      "\tvpunpckhqdq\t%xmm0, %xmm1, %xmm0 # xmm0 = xmm1[1],xmm0[1]\n",
      "\tmovabsq\t$140358466911312, %rax  # imm = 0x7FA7C08FBC50\n",
      ";}\n",
      "\tvmovdqa\t(%rax), %xmm1\n",
      "\tvpshufb\t%xmm1, %xmm5, %xmm3\n",
      "\tvpshufb\t%xmm1, %xmm4, %xmm6\n",
      "\tvpunpcklbw\t%xmm3, %xmm6, %xmm3 # xmm3 = xmm6[0],xmm3[0],xmm6[1],xmm3[1],xmm6[2],xmm3[2],xmm6[3],xmm3[3],xmm6[4],xmm3[4],xmm6[5],xmm3[5],xmm6[6],xmm3[6],xmm6[7],xmm3[7]\n",
      "\tmovabsq\t$140358466911328, %rax  # imm = 0x7FA7C08FBC60\n",
      "\tvmovdqa\t(%rax), %xmm6\n",
      "\tvpshufb\t%xmm6, %xmm5, %xmm5\n",
      "\tvpshufb\t%xmm6, %xmm4, %xmm4\n",
      "\tvpunpcklbw\t%xmm5, %xmm4, %xmm4 # xmm4 = xmm4[0],xmm5[0],xmm4[1],xmm5[1],xmm4[2],xmm5[2],xmm4[3],xmm5[3],xmm4[4],xmm5[4],xmm4[5],xmm5[5],xmm4[6],xmm5[6],xmm4[7],xmm5[7]\n",
      "\tvpunpckldq\t%xmm4, %xmm3, %xmm3 # xmm3 = xmm3[0],xmm4[0],xmm3[1],xmm4[1]\n",
      "\tvpshufb\t%xmm6, %xmm0, %xmm4\n",
      "\tvpshufb\t%xmm6, %xmm2, %xmm5\n",
      "\tvpunpcklbw\t%xmm4, %xmm5, %xmm4 # xmm4 = xmm5[0],xmm4[0],xmm5[1],xmm4[1],xmm5[2],xmm4[2],xmm5[3],xmm4[3],xmm5[4],xmm4[4],xmm5[5],xmm4[5],xmm5[6],xmm4[6],xmm5[7],xmm4[7]\n",
      "\tvpbroadcastd\t%xmm4, %xmm4\n",
      "\tvpshufb\t%xmm1, %xmm0, %xmm0\n",
      "\tvpshufb\t%xmm1, %xmm2, %xmm1\n",
      "\tvpunpcklbw\t%xmm0, %xmm1, %xmm0 # xmm0 = xmm1[0],xmm0[0],xmm1[1],xmm0[1],xmm1[2],xmm0[2],xmm1[3],xmm0[3],xmm1[4],xmm0[4],xmm1[5],xmm0[5],xmm1[6],xmm0[6],xmm1[7],xmm0[7]\n",
      "\tvpbroadcastq\t%xmm0, %xmm0\n",
      "\tvpblendd\t$8, %xmm4, %xmm0, %xmm0 # xmm0 = xmm0[0,1,2],xmm4[3]\n",
      "\tvpblendd\t$12, %xmm0, %xmm3, %xmm0 # xmm0 = xmm3[0,1],xmm0[2,3]\n",
      "\tvmovdqu\t%xmm0, (%rdi)\n",
      "\tmovq\t%rdi, %rax\n",
      "\tretq\n",
      "\tnopl\t(%rax)\n",
      ";}\n"
     ]
    }
   ],
   "source": [
    "@code_native in_register_sorting(aa...)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once we have a bunch of small sequences loaded in the registers we can perform our first merge. The _bitonic merge_ achieves this using alternating steps of swapping (via min/max) and shuffling the values from two registers. Each step of swap and shuffle is also commonly known as a _swuffle_. For two sequences with four values the merge is finished after three swuffles."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "@inline function bitonic_merge_2x4(a::Vec{4, T}, b::Vec{4, T}) where T\n",
    "    a1 = a\n",
    "    b1 = shufflevector(b, Val{(3,2,1,0)})\n",
    "\n",
    "    L1 = min(a1, b1)\n",
    "    H1 = max(a1, b1)\n",
    "\n",
    "    a2 = shufflevector(L1, H1, Val{(2,3,6,7)})\n",
    "    b2 = shufflevector(L1, H1, Val{(0,1,4,5)})\n",
    "\n",
    "    L2 = min(a2, b2)\n",
    "    H2 = max(a2, b2)\n",
    "\n",
    "    a3 = shufflevector(L2, H2, Val{(1,5,3,7)})\n",
    "    b3 = shufflevector(L2, H2, Val{(0,4,2,6)})\n",
    "\n",
    "    L3 = min(a3, b3)\n",
    "    H3 = max(a3, b3)\n",
    "\n",
    "    c1 = shufflevector(L3, H3, Val{(0,4,1,5)})\n",
    "    c2 = shufflevector(L3, H3, Val{(2,6,3,7)})\n",
    "\n",
    "    (c1, c2)\n",
    "end;"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(<4 x Int64>[11, 12, 13, 14], <4 x Int64>[21, 22, 23, 24])"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bitonic_merge_2x4(Vec((11,12,22,23)), Vec((13,14,21,24)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To merge four sequences we can first merge two pairs separately. Then we merge the first part of the two larger sequences, and the last part of that is an intermediate sequence that we merge with whatever second half started with the smallest value, and so on."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "bitonic_merge_4x4 (generic function with 1 method)"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "@inline function bitonic_merge_4x4(a::Vec{4, T}, b::Vec{4, T}, c::Vec{4, T}, d::Vec{4, T}) where T\n",
    "    ab1, ab2 = bitonic_merge_2x4(a, b)\n",
    "    cd1, cd2 = bitonic_merge_2x4(c, d)\n",
    "\n",
    "    e1, ei2 = bitonic_merge_2x4(ab1, cd1)\n",
    "\n",
    "    if ab2[1] < cd2[1]\n",
    "        e2, ei3 = bitonic_merge_2x4(ei2, ab2)\n",
    "        e3, e4 = bitonic_merge_2x4(ei3, cd2)\n",
    "    else\n",
    "        e2, ei3 = bitonic_merge_2x4(ei2, cd2)\n",
    "        e3, e4 = bitonic_merge_2x4(ei3, ab2)\n",
    "    end\n",
    "    (e1, e2, e3, e4)\n",
    "end"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "4-element Array{Vec{4,Int8},1}:\n",
       " <4 x Int8>[-128, -108, -100, -48]\n",
       " <4 x Int8>[-18, -18, 5, 5]       \n",
       " <4 x Int8>[19, 19, 63, 65]       \n",
       " <4 x Int8>[80, 95, 117, 120]     "
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "[bitonic_merge_4x4(aa...)...]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can now traverse an array and sort every interval of 16 elements in-place."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [],
   "source": [
    "function simd_sort_in_place(a::Array{T,1}) where T\n",
    "    N = size(a, 1) >> 4\n",
    "    for n in 1:N\n",
    "        aa = vload(Vec{16, T}, a, n * 16 - 15)\n",
    "        qq = simd_sort_16(aa)\n",
    "        vstore(qq, a, n*16-15)\n",
    "    end\n",
    "end\n",
    "\n",
    "\n",
    "function simd_sort_16(a::Vec{16, T}) where T\n",
    "    a1 = Vec((a[1], a[2], a[3], a[4]))\n",
    "    a2 = Vec((a[5], a[6], a[7], a[8]))\n",
    "    a3 = Vec((a[9], a[10], a[11], a[12]))\n",
    "    a4 = Vec((a[13], a[14], a[15], a[16]))\n",
    "\n",
    "    bb = in_register_sorting(a1, a2, a3, a4)\n",
    "    c1,c2,c3,c4 = bitonic_merge_4x4(bb...)\n",
    "\n",
    "    Vec((c1[1],c1[2],c1[3],c1[4],\n",
    "         c2[1],c2[2],c2[3],c2[4],\n",
    "         c3[1],c3[2],c3[3],c3[4],\n",
    "         c4[1],c4[2],c4[3],c4[4]))\n",
    "end;"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "16×16 Array{Int8,2}:\n",
       " -104   77   20   108  -115    13   -32  …   -64    71  114   -81    67   109\n",
       "  102    3  -67   121    58    49  -122      103    13   97    51    85    69\n",
       "   39  -38   67  -125    28   117    73      -70   120  -44   113   -16    19\n",
       "   43  -87   53    66   -98    33    64       85   -14   83   -83   -20   -77\n",
       "  -51  -70  -57    94    66   -26   105        0   -16   49  -108    69   119\n",
       "  -13  -94  -93    57    43  -125    35  …    67  -103   78   115    17    58\n",
       " -108   68  -26    15   -26    82   105       74    97  -95   -66  -122   -48\n",
       "   -2   98  -33    57   112   -33   -29       78  -128  106   -99    48     8\n",
       "  -15   89  -69    26   -24    95   -79      115    -9   97  -115     6    86\n",
       "   88   15  -20  -113  -102     1   -72       81   -25  -78    22    86    37\n",
       "   36  -95  -28   -68   116   -65   -28  …    35    26  -83    43   -25    -2\n",
       "  -59  108  -94   -47    93   -55   -96      121    37  -33    77   -14   -23\n",
       "   54  -66   -8   -39   -47    -5   -82      -42   123   -1    92    -2    16\n",
       "   13   36   45    -7    58  -115   -54      -91   -55  -88   -33   -30    18\n",
       "  -75  -65   37   -37  -101    51    97     -114   -38  -42  -102   -70   -21\n",
       "    7  -73   12     3  -100  -108   120  …   -24   -12   11   107   -78  -101"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "16×16 Array{Int8,2}:\n",
       " -108  -95  -94  -125  -115  -125  -122  …  -114  -128  -95  -115  -122  -101\n",
       " -104  -94  -93  -113  -102  -115   -96      -91  -103  -88  -108   -78   -77\n",
       "  -75  -87  -69   -68  -101  -108   -82      -70   -55  -83  -102   -70   -48\n",
       "  -59  -73  -67   -47  -100   -65   -79      -64   -38  -78   -99   -30   -23\n",
       "  -51  -70  -57   -39   -98   -55   -72      -42   -25  -44   -83   -25   -21\n",
       "  -15  -66  -33   -37   -47   -33   -54  …   -24   -16  -42   -81   -20    -2\n",
       "  -13  -65  -28    -7   -26   -26   -32        0   -14  -33   -66   -16     8\n",
       "   -2  -38  -26     3   -24    -5   -29       35   -12   -1   -33   -14    16\n",
       "    7    3  -20    15    28     1   -28       67    -9   11    22    -2    18\n",
       "   13   15   -8    26    43    13    35       74    13   49    43     6    19\n",
       "   36   36   12    57    58    33    64  …    78    26   78    51    17    37\n",
       "   39   68   20    57    58    49    73       81    37   83    77    48    58\n",
       "   43   77   37    66    66    51    97       85    71   97    92    67    69\n",
       "   54   89   45    94    93    82   105      103    97   97   107    69    86\n",
       "   88   98   53   108   112    95   105      115   120  106   113    85   109\n",
       "  102  108   67   121   116   117   120  …   121   123  114   115    86   119"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "T = Int8\n",
    "ab = rand(T, 16*16)\n",
    "display(reshape(ab,16,:))\n",
    "simd_sort_in_place(ab)\n",
    "display(reshape(ab,16,:))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Conclusion\n",
    "\n",
    "One may ask:_Is it worth it?_  I don't know! I still have to finish implementing a complete algorithm, application and alternative approaches to make a meaningful test. I am only interested here in seeing if I can do this kind of low-level programming in Julia, and it has exceeded my expectations (again). I was thinking at first I might have to write some small C snippets, but the SIMD.jl package already gave me a lot of what I needed. And even if I need something else it seems easy to use intrinsics in Julia via [ccall](http://kristofferc.github.io/post/intrinsics/) or `llvmcall`.\n",
    "\n",
    "Next steps are to extend this to 8-element sequences using the 256 bit registers, implement a function that can \n",
    "merge long streams read from memory, and also experiment with [non-temporal memory I/O](https://lwn.net/Articles/255364/).\n",
    "\n",
    "It should also be made clear that this isn't just about sorting arrays of numbers, but potentially structures and key-value pairs, and performing aggregation. Running whatever `sort` function the standard library provides may not be the best approach for every task and every architecture, and it may well be worth it to investigate this kind of thing in detail in a specialized application."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Julia 1.0.3",
   "language": "julia",
   "name": "julia-1.0"
  },
  "language_info": {
   "file_extension": ".jl",
   "mimetype": "application/julia",
   "name": "julia",
   "version": "1.0.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}