Skip to content

Instantly share code, notes, and snippets.

@nlw0
Created February 9, 2019 19:10
Show Gist options
  • Save nlw0/a684f82c5c3c0f0c725f9e853e5a2fc7 to your computer and use it in GitHub Desktop.
Save nlw0/a684f82c5c3c0f0c725f9e853e5a2fc7 to your computer and use it in GitHub Desktop.
SIMD based merge sort in Julia
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"SIMD merge-sort\n",
"====\n",
"\n",
"This is an experiment and exercise where we'll implement a merge-sort algorithm in Julia using SIMD instructions. \n",
"\n",
"\n",
"### The idea\n",
"The ideas used here can be found in the articles:\n",
" - _Efficient Implementation of Sorting on Multi-Core SIMD CPU Architecture_, Jatin Chhugani et al, 2008\n",
" - _SIMD- and Cache-Friendly Algorithm for Sorting an Array of Structures_, Hiroshi Inoue and Kenjiro Taura, 2007\n",
"\n",
"Radix sort whith infinite buckets is \"the best\" sorting algorithm for simple integers in theory. In practice, the existence of memory bandwidth, cache memory and SIMD instructions makes everything non-trivial, and something more in the lines of a merge sort can pay off. This notebook illustrates some ideas taken from those articles.\n",
"\n",
"### The code\n",
"\n",
"The very first step of the process is to take a sequence of 16 items from memory, split into four groups where we can sort in a vectorized fashion over each \"lane\" of the registers. "
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[32m\u001b[1m Resolving\u001b[22m\u001b[39m package versions...\n",
"\u001b[32m\u001b[1m Updating\u001b[22m\u001b[39m `~/.julia/Project.toml`\n",
"\u001b[90m [no changes]\u001b[39m\n",
"\u001b[32m\u001b[1m Updating\u001b[22m\u001b[39m `~/.julia/Manifest.toml`\n",
"\u001b[90m [no changes]\u001b[39m\n"
]
}
],
"source": [
"import Pkg; Pkg.add(\"SIMD\")\n",
"using SIMD"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"@inline function in_register_sorting_t(a1::Vec{4, T}, a2::Vec{4, T}, a3::Vec{4, T}, a4::Vec{4, T}) where T\n",
" b1 = min(a1,a2)\n",
" b2 = max(a1,a2)\n",
" b3 = min(a3,a4)\n",
" b4 = max(a3,a4)\n",
"\n",
" c1 = min(b1,b3)\n",
" c3 = max(b1,b3)\n",
" c2 = min(b2,b4)\n",
" c4 = max(b2,b4)\n",
"\n",
" d2 = min(c2,c3)\n",
" d3 = max(c2,c3)\n",
"\n",
" (c1, d2, d3, c4)\n",
"end;\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"An example data set with 4x4 random integers."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"4-element Array{Vec{4,Int8},1}:\n",
" <4 x Int8>[65, -48, 95, -18] \n",
" <4 x Int8>[19, -100, 117, -18]\n",
" <4 x Int8>[80, 5, 63, 19] \n",
" <4 x Int8>[-128, 5, 120, -108]"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"T = Int8\n",
"aa = [Vec(tuple(rand(T, 4)...)) for _ in 1:4]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we have the smallet values on top and largest on bottom, and each \"lane\" is a sorted sequence of elements.\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"4-element Array{Vec{4,Int8},1}:\n",
" <4 x Int8>[-128, -100, 63, -108]\n",
" <4 x Int8>[19, -48, 95, -18] \n",
" <4 x Int8>[65, 5, 117, -18] \n",
" <4 x Int8>[80, 5, 120, 19] "
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"[in_register_sorting_t(aa...)...]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"How does the machine code look like? It should have lots of `vpmin` and `vpmax`."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\t.text\n",
"; Function in_register_sorting_t {\n",
"; Location: In[2]:2\n",
"; Function min; {\n",
"; Location: SIMD.jl:934\n",
"; Function >=; {\n",
"; Location: SIMD.jl:833\n",
"; Function llvmwrap; {\n",
"; Location: SIMD.jl:564\n",
"; Function macro expansion; {\n",
"; Location: In[2]:2\n",
"\tvpmovsxbd\t(%rsi), %xmm0\n",
"\tvpmovsxbd\t(%rdx), %xmm1\n",
";}}}\n",
"; Function vifelse; {\n",
"; Location: SIMD.jl:863\n",
"; Function macro expansion; {\n",
"; Location: SIMD.jl:886\n",
"\tvpminsd\t%xmm1, %xmm0, %xmm2\n",
";}}}\n",
"; Location: In[2]:3\n",
"; Function max; {\n",
"; Location: SIMD.jl:932\n",
"; Function vifelse; {\n",
"; Location: SIMD.jl:863\n",
"; Function macro expansion; {\n",
"; Location: SIMD.jl:886\n",
"\tvpmaxsd\t%xmm1, %xmm0, %xmm0\n",
";}}}\n",
"; Location: In[2]:4\n",
"; Function min; {\n",
"; Location: SIMD.jl:934\n",
"; Function >=; {\n",
"; Location: SIMD.jl:833\n",
"; Function llvmwrap; {\n",
"; Location: SIMD.jl:564\n",
"; Function macro expansion; {\n",
"; Location: SIMD.jl:591\n",
"\tvpmovsxbd\t(%rcx), %xmm1\n",
"\tvpmovsxbd\t(%r8), %xmm3\n",
";}}}\n",
"; Function vifelse; {\n",
"; Location: SIMD.jl:863\n",
"; Function macro expansion; {\n",
"; Location: SIMD.jl:886\n",
"\tvpminsd\t%xmm3, %xmm1, %xmm4\n",
";}}}\n",
"; Location: In[2]:5\n",
"; Function max; {\n",
"; Location: SIMD.jl:932\n",
"; Function vifelse; {\n",
"; Location: SIMD.jl:863\n",
"; Function macro expansion; {\n",
"; Location: SIMD.jl:886\n",
"\tvpmaxsd\t%xmm3, %xmm1, %xmm1\n",
";}}}\n",
"; Location: In[2]:7\n",
"; Function min; {\n",
"; Location: SIMD.jl:934\n",
"; Function vifelse; {\n",
"; Location: SIMD.jl:863\n",
"; Function macro expansion; {\n",
"; Location: SIMD.jl:886\n",
"\tvpminsd\t%xmm4, %xmm2, %xmm3\n",
";}}}\n",
"; Location: In[2]:8\n",
"; Function max; {\n",
"; Location: SIMD.jl:932\n",
"; Function vifelse; {\n",
"; Location: SIMD.jl:863\n",
"; Function macro expansion; {\n",
"; Location: SIMD.jl:886\n",
"\tvpmaxsd\t%xmm4, %xmm2, %xmm2\n",
";}}}\n",
"; Location: In[2]:9\n",
"; Function min; {\n",
"; Location: SIMD.jl:934\n",
"; Function vifelse; {\n",
"; Location: SIMD.jl:863\n",
"; Function macro expansion; {\n",
"; Location: SIMD.jl:886\n",
"\tvpminsd\t%xmm1, %xmm0, %xmm4\n",
";}}}\n",
"; Location: In[2]:10\n",
"; Function max; {\n",
"; Location: SIMD.jl:932\n",
"; Function vifelse; {\n",
"; Location: SIMD.jl:863\n",
"; Function macro expansion; {\n",
"; Location: SIMD.jl:886\n",
"\tvpmaxsd\t%xmm1, %xmm0, %xmm0\n",
";}}}\n",
"; Location: In[2]:12\n",
"; Function min; {\n",
"; Location: SIMD.jl:934\n",
"; Function vifelse; {\n",
"; Location: SIMD.jl:863\n",
"; Function macro expansion; {\n",
"; Location: SIMD.jl:886\n",
"\tvpminsd\t%xmm2, %xmm4, %xmm1\n",
";}}}\n",
"; Location: In[2]:13\n",
"; Function max; {\n",
"; Location: SIMD.jl:932\n",
"; Function vifelse; {\n",
"; Location: SIMD.jl:863\n",
"; Function macro expansion; {\n",
"; Location: SIMD.jl:886\n",
"\tvpmaxsd\t%xmm2, %xmm4, %xmm2\n",
"\tmovabsq\t$140358466906096, %rax # imm = 0x7FA7C08FA7F0\n",
";}}}\n",
"; Location: In[2]:15\n",
"\tvpshufb\t(%rax), %xmm2, %xmm2\n",
"\tmovabsq\t$140358466906112, %rax # imm = 0x7FA7C08FA800\n",
"\tvpshufb\t(%rax), %xmm0, %xmm0\n",
"\tvpblendd\t$8, %xmm0, %xmm2, %xmm0 # xmm0 = xmm2[0,1,2],xmm0[3]\n",
"\tmovabsq\t$140358466906128, %rax # imm = 0x7FA7C08FA810\n",
"\tvmovdqa\t(%rax), %xmm2\n",
"\tvpshufb\t%xmm2, %xmm3, %xmm3\n",
"\tvpshufb\t%xmm2, %xmm1, %xmm1\n",
"\tvpunpckldq\t%xmm1, %xmm3, %xmm1 # xmm1 = xmm3[0],xmm1[0],xmm3[1],xmm1[1]\n",
"\tvpblendd\t$12, %xmm0, %xmm1, %xmm0 # xmm0 = xmm1[0,1],xmm0[2,3]\n",
"\tvmovdqu\t%xmm0, (%rdi)\n",
"\tmovq\t%rdi, %rax\n",
"\tretq\n",
"\tnopw\t%cs:(%rax,%rax)\n",
";}\n"
]
}
],
"source": [
"@code_native in_register_sorting_t(aa...)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we need to transpose these values so each register holds a sorted sequence of 4 values."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"@inline function transpose_registers(a1::Vec{4, T}, a2::Vec{4, T}, a3::Vec{4, T}, a4::Vec{4, T}) where T\n",
" b1 = shufflevector(a1,a3,Val{(0,1,4,5)})\n",
" b2 = shufflevector(a2,a4,Val{(0,1,4,5)})\n",
" b3 = shufflevector(a1,a3,Val{(2,3,6,7)})\n",
" b4 = shufflevector(a2,a4,Val{(2,3,6,7)})\n",
"\n",
" c1 = shufflevector(b1,b2,Val{(0,4,2,6)})\n",
" c2 = shufflevector(b1,b2,Val{(1,5,3,7)})\n",
" c3 = shufflevector(b3,b4,Val{(0,4,2,6)})\n",
" c4 = shufflevector(b3,b4,Val{(1,5,3,7)})\n",
"\n",
" (c1, c2, c3, c4)\n",
"end\n",
"\n",
"@inline function in_register_sorting(a1::Vec{4, T}, a2::Vec{4, T}, a3::Vec{4, T}, a4::Vec{4, T}) where T\n",
" transpose_registers(in_register_sorting_t(a1, a2, a3, a4)...)\n",
"end;"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"4-element Array{Vec{4,Int8},1}:\n",
" <4 x Int8>[-128, 19, 65, 80] \n",
" <4 x Int8>[-100, -48, 5, 5] \n",
" <4 x Int8>[63, 95, 117, 120] \n",
" <4 x Int8>[-108, -18, -18, 19]"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"[in_register_sorting(aa...)...]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The transposition is done by instructions such as `vpunpckl` and `vpshuf`."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\t.text\n",
"; Function in_register_sorting {\n",
"; Location: In[6]:16\n",
"; Function in_register_sorting_t; {\n",
"; Location: In[2]:2\n",
"; Function min; {\n",
"; Location: SIMD.jl:934\n",
"; Function >=; {\n",
"; Location: SIMD.jl:833\n",
"; Function llvmwrap; {\n",
"; Location: SIMD.jl:564\n",
"; Function macro expansion; {\n",
"; Location: In[6]:16\n",
"\tvpmovsxbd\t(%rsi), %xmm0\n",
"\tvpmovsxbd\t(%rdx), %xmm1\n",
";}}}\n",
"; Location: SIMD.jl:886\n",
"\tvpminsd\t%xmm1, %xmm0, %xmm2\n",
";}\n",
"; Location: In[2]:3\n",
"; Function max; {\n",
"; Location: SIMD.jl:932\n",
"; Function vifelse; {\n",
"; Location: SIMD.jl:863\n",
"; Function macro expansion; {\n",
"; Location: SIMD.jl:886\n",
"\tvpmaxsd\t%xmm1, %xmm0, %xmm0\n",
";}}}\n",
"; Location: In[2]:4\n",
"; Function min; {\n",
"; Location: SIMD.jl:934\n",
"; Function >=; {\n",
"; Location: SIMD.jl:833\n",
"; Function llvmwrap; {\n",
"; Location: SIMD.jl:564\n",
"; Function macro expansion; {\n",
"; Location: SIMD.jl:591\n",
"\tvpmovsxbd\t(%rcx), %xmm1\n",
"\tvpmovsxbd\t(%r8), %xmm3\n",
";}}}\n",
"; Location: SIMD.jl:886\n",
"\tvpminsd\t%xmm3, %xmm1, %xmm4\n",
";}}\n",
"; Function in_register_sorting_t; {\n",
"; Location: SIMD.jl:886\n",
"\tvpmaxsd\t%xmm3, %xmm1, %xmm1\n",
";}\n",
"; Function in_register_sorting_t; {\n",
"; Location: In[2]:7\n",
"; Function min; {\n",
"; Location: SIMD.jl:934\n",
"; Function vifelse; {\n",
"; Location: SIMD.jl:863\n",
"; Function macro expansion; {\n",
"; Location: SIMD.jl:886\n",
"\tvpminsd\t%xmm4, %xmm2, %xmm3\n",
";}}}\n",
"; Location: In[2]:8\n",
"; Function max; {\n",
"; Location: SIMD.jl:932\n",
"; Function vifelse; {\n",
"; Location: SIMD.jl:863\n",
"; Function macro expansion; {\n",
"; Location: SIMD.jl:886\n",
"\tvpmaxsd\t%xmm4, %xmm2, %xmm2\n",
";}}}}\n",
"; Function in_register_sorting_t; {\n",
"; Location: SIMD.jl:886\n",
"\tvpminsd\t%xmm1, %xmm0, %xmm4\n",
";}\n",
"; Function in_register_sorting_t; {\n",
"; Location: In[2]:10\n",
"; Function max; {\n",
"; Location: SIMD.jl:932\n",
"; Function vifelse; {\n",
"; Location: SIMD.jl:863\n",
"; Function macro expansion; {\n",
"; Location: SIMD.jl:886\n",
"\tvpmaxsd\t%xmm1, %xmm0, %xmm0\n",
";}}}}\n",
"; Function in_register_sorting_t; {\n",
"; Location: SIMD.jl:886\n",
"\tvpminsd\t%xmm2, %xmm4, %xmm1\n",
";}\n",
"; Function in_register_sorting_t; {\n",
"; Location: In[2]:13\n",
"; Function max; {\n",
"; Location: SIMD.jl:932\n",
"; Function vifelse; {\n",
"; Location: SIMD.jl:863\n",
"; Function macro expansion; {\n",
"; Location: SIMD.jl:886\n",
"\tvpmaxsd\t%xmm2, %xmm4, %xmm2\n",
";}}}}\n",
"; Function transpose_registers; {\n",
"; Location: In[6]:2\n",
"; Function shufflevector; {\n",
"; Location: SIMD.jl:1443\n",
"; Function macro expansion; {\n",
"; Location: SIMD.jl:1446\n",
"\tvpunpcklqdq\t%xmm2, %xmm3, %xmm4 # xmm4 = xmm3[0],xmm2[0]\n",
";}}}\n",
"; Function transpose_registers; {\n",
"; Location: SIMD.jl:1446\n",
"\tvpunpcklqdq\t%xmm0, %xmm1, %xmm5 # xmm5 = xmm1[0],xmm0[0]\n",
";}\n",
"; Function transpose_registers; {\n",
"; Location: In[6]:4\n",
"; Function shufflevector; {\n",
"; Location: SIMD.jl:1443\n",
"; Function macro expansion; {\n",
"; Location: SIMD.jl:1446\n",
"\tvpunpckhqdq\t%xmm2, %xmm3, %xmm2 # xmm2 = xmm3[1],xmm2[1]\n",
";}}}\n",
"; Function transpose_registers; {\n",
"; Location: SIMD.jl:1446\n",
"\tvpunpckhqdq\t%xmm0, %xmm1, %xmm0 # xmm0 = xmm1[1],xmm0[1]\n",
"\tmovabsq\t$140358466911312, %rax # imm = 0x7FA7C08FBC50\n",
";}\n",
"\tvmovdqa\t(%rax), %xmm1\n",
"\tvpshufb\t%xmm1, %xmm5, %xmm3\n",
"\tvpshufb\t%xmm1, %xmm4, %xmm6\n",
"\tvpunpcklbw\t%xmm3, %xmm6, %xmm3 # xmm3 = xmm6[0],xmm3[0],xmm6[1],xmm3[1],xmm6[2],xmm3[2],xmm6[3],xmm3[3],xmm6[4],xmm3[4],xmm6[5],xmm3[5],xmm6[6],xmm3[6],xmm6[7],xmm3[7]\n",
"\tmovabsq\t$140358466911328, %rax # imm = 0x7FA7C08FBC60\n",
"\tvmovdqa\t(%rax), %xmm6\n",
"\tvpshufb\t%xmm6, %xmm5, %xmm5\n",
"\tvpshufb\t%xmm6, %xmm4, %xmm4\n",
"\tvpunpcklbw\t%xmm5, %xmm4, %xmm4 # xmm4 = xmm4[0],xmm5[0],xmm4[1],xmm5[1],xmm4[2],xmm5[2],xmm4[3],xmm5[3],xmm4[4],xmm5[4],xmm4[5],xmm5[5],xmm4[6],xmm5[6],xmm4[7],xmm5[7]\n",
"\tvpunpckldq\t%xmm4, %xmm3, %xmm3 # xmm3 = xmm3[0],xmm4[0],xmm3[1],xmm4[1]\n",
"\tvpshufb\t%xmm6, %xmm0, %xmm4\n",
"\tvpshufb\t%xmm6, %xmm2, %xmm5\n",
"\tvpunpcklbw\t%xmm4, %xmm5, %xmm4 # xmm4 = xmm5[0],xmm4[0],xmm5[1],xmm4[1],xmm5[2],xmm4[2],xmm5[3],xmm4[3],xmm5[4],xmm4[4],xmm5[5],xmm4[5],xmm5[6],xmm4[6],xmm5[7],xmm4[7]\n",
"\tvpbroadcastd\t%xmm4, %xmm4\n",
"\tvpshufb\t%xmm1, %xmm0, %xmm0\n",
"\tvpshufb\t%xmm1, %xmm2, %xmm1\n",
"\tvpunpcklbw\t%xmm0, %xmm1, %xmm0 # xmm0 = xmm1[0],xmm0[0],xmm1[1],xmm0[1],xmm1[2],xmm0[2],xmm1[3],xmm0[3],xmm1[4],xmm0[4],xmm1[5],xmm0[5],xmm1[6],xmm0[6],xmm1[7],xmm0[7]\n",
"\tvpbroadcastq\t%xmm0, %xmm0\n",
"\tvpblendd\t$8, %xmm4, %xmm0, %xmm0 # xmm0 = xmm0[0,1,2],xmm4[3]\n",
"\tvpblendd\t$12, %xmm0, %xmm3, %xmm0 # xmm0 = xmm3[0,1],xmm0[2,3]\n",
"\tvmovdqu\t%xmm0, (%rdi)\n",
"\tmovq\t%rdi, %rax\n",
"\tretq\n",
"\tnopl\t(%rax)\n",
";}\n"
]
}
],
"source": [
"@code_native in_register_sorting(aa...)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once we have a bunch of small sequences loaded in the registers we can perform our first merge. The _bitonic merge_ achieves this using alternating steps of swapping (via min/max) and shuffling the values from two registers. Each step of swap and shuffle is also commonly known as a _swuffle_. For two sequences with four values the merge is finished after three swuffles."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"@inline function bitonic_merge_2x4(a::Vec{4, T}, b::Vec{4, T}) where T\n",
" a1 = a\n",
" b1 = shufflevector(b, Val{(3,2,1,0)})\n",
"\n",
" L1 = min(a1, b1)\n",
" H1 = max(a1, b1)\n",
"\n",
" a2 = shufflevector(L1, H1, Val{(2,3,6,7)})\n",
" b2 = shufflevector(L1, H1, Val{(0,1,4,5)})\n",
"\n",
" L2 = min(a2, b2)\n",
" H2 = max(a2, b2)\n",
"\n",
" a3 = shufflevector(L2, H2, Val{(1,5,3,7)})\n",
" b3 = shufflevector(L2, H2, Val{(0,4,2,6)})\n",
"\n",
" L3 = min(a3, b3)\n",
" H3 = max(a3, b3)\n",
"\n",
" c1 = shufflevector(L3, H3, Val{(0,4,1,5)})\n",
" c2 = shufflevector(L3, H3, Val{(2,6,3,7)})\n",
"\n",
" (c1, c2)\n",
"end;"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(<4 x Int64>[11, 12, 13, 14], <4 x Int64>[21, 22, 23, 24])"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"bitonic_merge_2x4(Vec((11,12,22,23)), Vec((13,14,21,24)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To merge four sequences we can first merge two pairs separately. Then we merge the first part of the two larger sequences, and the last part of that is an intermediate sequence that we merge with whatever second half started with the smallest value, and so on."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"bitonic_merge_4x4 (generic function with 1 method)"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"@inline function bitonic_merge_4x4(a::Vec{4, T}, b::Vec{4, T}, c::Vec{4, T}, d::Vec{4, T}) where T\n",
" ab1, ab2 = bitonic_merge_2x4(a, b)\n",
" cd1, cd2 = bitonic_merge_2x4(c, d)\n",
"\n",
" e1, ei2 = bitonic_merge_2x4(ab1, cd1)\n",
"\n",
" if ab2[1] < cd2[1]\n",
" e2, ei3 = bitonic_merge_2x4(ei2, ab2)\n",
" e3, e4 = bitonic_merge_2x4(ei3, cd2)\n",
" else\n",
" e2, ei3 = bitonic_merge_2x4(ei2, cd2)\n",
" e3, e4 = bitonic_merge_2x4(ei3, ab2)\n",
" end\n",
" (e1, e2, e3, e4)\n",
"end"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"4-element Array{Vec{4,Int8},1}:\n",
" <4 x Int8>[-128, -108, -100, -48]\n",
" <4 x Int8>[-18, -18, 5, 5] \n",
" <4 x Int8>[19, 19, 63, 65] \n",
" <4 x Int8>[80, 95, 117, 120] "
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"[bitonic_merge_4x4(aa...)...]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can now traverse an array and sort every interval of 16 elements in-place."
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [],
"source": [
"function simd_sort_in_place(a::Array{T,1}) where T\n",
" N = size(a, 1) >> 4\n",
" for n in 1:N\n",
" aa = vload(Vec{16, T}, a, n * 16 - 15)\n",
" qq = simd_sort_16(aa)\n",
" vstore(qq, a, n*16-15)\n",
" end\n",
"end\n",
"\n",
"\n",
"function simd_sort_16(a::Vec{16, T}) where T\n",
" a1 = Vec((a[1], a[2], a[3], a[4]))\n",
" a2 = Vec((a[5], a[6], a[7], a[8]))\n",
" a3 = Vec((a[9], a[10], a[11], a[12]))\n",
" a4 = Vec((a[13], a[14], a[15], a[16]))\n",
"\n",
" bb = in_register_sorting(a1, a2, a3, a4)\n",
" c1,c2,c3,c4 = bitonic_merge_4x4(bb...)\n",
"\n",
" Vec((c1[1],c1[2],c1[3],c1[4],\n",
" c2[1],c2[2],c2[3],c2[4],\n",
" c3[1],c3[2],c3[3],c3[4],\n",
" c4[1],c4[2],c4[3],c4[4]))\n",
"end;"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"16×16 Array{Int8,2}:\n",
" -104 77 20 108 -115 13 -32 … -64 71 114 -81 67 109\n",
" 102 3 -67 121 58 49 -122 103 13 97 51 85 69\n",
" 39 -38 67 -125 28 117 73 -70 120 -44 113 -16 19\n",
" 43 -87 53 66 -98 33 64 85 -14 83 -83 -20 -77\n",
" -51 -70 -57 94 66 -26 105 0 -16 49 -108 69 119\n",
" -13 -94 -93 57 43 -125 35 … 67 -103 78 115 17 58\n",
" -108 68 -26 15 -26 82 105 74 97 -95 -66 -122 -48\n",
" -2 98 -33 57 112 -33 -29 78 -128 106 -99 48 8\n",
" -15 89 -69 26 -24 95 -79 115 -9 97 -115 6 86\n",
" 88 15 -20 -113 -102 1 -72 81 -25 -78 22 86 37\n",
" 36 -95 -28 -68 116 -65 -28 … 35 26 -83 43 -25 -2\n",
" -59 108 -94 -47 93 -55 -96 121 37 -33 77 -14 -23\n",
" 54 -66 -8 -39 -47 -5 -82 -42 123 -1 92 -2 16\n",
" 13 36 45 -7 58 -115 -54 -91 -55 -88 -33 -30 18\n",
" -75 -65 37 -37 -101 51 97 -114 -38 -42 -102 -70 -21\n",
" 7 -73 12 3 -100 -108 120 … -24 -12 11 107 -78 -101"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"16×16 Array{Int8,2}:\n",
" -108 -95 -94 -125 -115 -125 -122 … -114 -128 -95 -115 -122 -101\n",
" -104 -94 -93 -113 -102 -115 -96 -91 -103 -88 -108 -78 -77\n",
" -75 -87 -69 -68 -101 -108 -82 -70 -55 -83 -102 -70 -48\n",
" -59 -73 -67 -47 -100 -65 -79 -64 -38 -78 -99 -30 -23\n",
" -51 -70 -57 -39 -98 -55 -72 -42 -25 -44 -83 -25 -21\n",
" -15 -66 -33 -37 -47 -33 -54 … -24 -16 -42 -81 -20 -2\n",
" -13 -65 -28 -7 -26 -26 -32 0 -14 -33 -66 -16 8\n",
" -2 -38 -26 3 -24 -5 -29 35 -12 -1 -33 -14 16\n",
" 7 3 -20 15 28 1 -28 67 -9 11 22 -2 18\n",
" 13 15 -8 26 43 13 35 74 13 49 43 6 19\n",
" 36 36 12 57 58 33 64 … 78 26 78 51 17 37\n",
" 39 68 20 57 58 49 73 81 37 83 77 48 58\n",
" 43 77 37 66 66 51 97 85 71 97 92 67 69\n",
" 54 89 45 94 93 82 105 103 97 97 107 69 86\n",
" 88 98 53 108 112 95 105 115 120 106 113 85 109\n",
" 102 108 67 121 116 117 120 … 121 123 114 115 86 119"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"T = Int8\n",
"ab = rand(T, 16*16)\n",
"display(reshape(ab,16,:))\n",
"simd_sort_in_place(ab)\n",
"display(reshape(ab,16,:))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Conclusion\n",
"\n",
"One may ask:_Is it worth it?_ I don't know! I still have to finish implementing a complete algorithm, application and alternative approaches to make a meaningful test. I am only interested here in seeing if I can do this kind of low-level programming in Julia, and it has exceeded my expectations (again). I was thinking at first I might have to write some small C snippets, but the SIMD.jl package already gave me a lot of what I needed. And even if I need something else it seems easy to use intrinsics in Julia via [ccall](http://kristofferc.github.io/post/intrinsics/) or `llvmcall`.\n",
"\n",
"Next steps are to extend this to 8-element sequences using the 256 bit registers, implement a function that can \n",
"merge long streams read from memory, and also experiment with [non-temporal memory I/O](https://lwn.net/Articles/255364/).\n",
"\n",
"It should also be made clear that this isn't just about sorting arrays of numbers, but potentially structures and key-value pairs, and performing aggregation. Running whatever `sort` function the standard library provides may not be the best approach for every task and every architecture, and it may well be worth it to investigate this kind of thing in detail in a specialized application."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Julia 1.0.3",
"language": "julia",
"name": "julia-1.0"
},
"language_info": {
"file_extension": ".jl",
"mimetype": "application/julia",
"name": "julia",
"version": "1.0.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment