Skip to content

Instantly share code, notes, and snippets.

@ohno
Last active March 26, 2022 20:34
Show Gist options
  • Save ohno/86f1ba4c9a237b17a10cb88904e283e6 to your computer and use it in GitHub Desktop.
Save ohno/86f1ba4c9a237b17a10cb88904e283e6 to your computer and use it in GitHub Desktop.
BenchmarkTools.jl Tutorial
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Juliaで速いコードを書く方法(キャッシュチューニング)\n",
"\n",
"[Juliaで速いコードを書くコツはいくつかあります](https://myenigma.hatenablog.com/entry/2017/08/22/093953)が, 今回はキャッシュチューニングについて解説していきます. BenchmarkTools.jlを用いて, JuliaでもFortranと同じ方法でのキャッシュチューニングが有効か検討しました."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 動作環境"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Julia Version 1.7.1\n",
"Commit ac5cc99908 (2021-12-22 19:35 UTC)\n",
"Platform Info:\n",
" OS: Windows (x86_64-w64-mingw32)\n",
" CPU: Intel(R) Core(TM) i7-4650U CPU @ 1.70GHz\n",
" WORD_SIZE: 64\n",
" LIBM: libopenlibm\n",
" LLVM: libLLVM-12.0.1 (ORCJIT, haswell)\n"
]
}
],
"source": [
"versioninfo()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## パッケージ"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# using Pkg\n",
"# Pkg.add(\"BenchmarkTools\")\n",
"using BenchmarkTools"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## キャッシュチューニング\n",
"\n",
"雑に説明すると, コンピュータはメモリ⇄キャッシュ⇄CPU(レジスタ)へとデータを受け渡します. メモリ⇄キャッシュ間の通信は遅いので, できるだけキャッシュ⇄CPU(レジスタ)間の通信だけで済むようにプログラムを工夫することをキャッシュチューニングといいます.\n",
"\n",
"[青山幸也『チューニング技法虎の巻』](https://www.hpci-office.jp/pages/seminar_texts) p.4-4 より\n",
"\n",
">【重要】多次元配列(多重ループ)の場合、キャッシュミスを少なくするためには、<br>\n",
"    Fortranでは、図4-2-3(1)に示すように、内側のループを配列の左側の添字で反復させること。\n",
" \n",
" \n",
"```fortran:図4-2-3(1)\n",
"DIMENSION A(4,9)\n",
"DO J=1,9\n",
" DO I=1,4\n",
" A(I,J) = A(I,J) + 1.0\n",
" ENDDO\n",
"ENDDO\n",
"```\n",
"\n",
"```fortran:図4-2-4(1)\n",
"DIMENSION A(4,9)\n",
"DO I=1,4\n",
" DO J=1,9\n",
" A(I,J) = A(I,J) + 1.0\n",
" ENDDO\n",
"ENDDO\n",
"```\n",
"\n",
"\n",
"メモリ上の多次元配列の格納方向は, CとFortranでは逆, JuliaとFortranでは同じであるため, JuliaでもFortranと同じ方法でのキャッシュチューニングが有効であろうと予想できます."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## ベンチマーク\n",
"\n",
"2次元配列の総和の計算します. 比較する関数は以下の`bad()`と`good()`に加えて標準ライブラリの`sum()`の3つです."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"bad (generic function with 1 method)"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# 遅い\n",
"function bad(arr)\n",
" sum = 0\n",
" for i in 1:size(arr)[1]\n",
" for j in 1:size(arr)[2]\n",
" sum += arr[i,j]\n",
" end\n",
" end\n",
" return sum\n",
"end"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"good (generic function with 1 method)"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# 速い\n",
"function good(arr)\n",
" sum = 0\n",
" for j in 1:size(arr)[2]\n",
" for i in 1:size(arr)[1]\n",
" sum += arr[i,j]\n",
" end\n",
" end\n",
" return sum\n",
"end"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"次の配列`A`の総和を計算する. 配列`A`は倍精度浮動小数点数(Float64)の1000×1000配列です."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1000×1000 Matrix{Float64}:\n",
" 0.730536 0.12352 0.219293 … 0.0675552 0.296471 0.280385\n",
" 0.567086 0.381602 0.296314 0.690441 0.14324 0.40432\n",
" 0.984433 0.536352 0.655172 0.178301 0.460183 0.184478\n",
" 0.920644 0.896053 0.236717 0.868032 0.371098 0.577329\n",
" 0.186625 0.948587 0.615851 0.813894 0.256957 0.567861\n",
" 0.689388 0.450486 0.526836 … 0.0893588 0.835153 0.668836\n",
" 0.960888 0.77027 0.0195096 0.63292 0.780756 0.727685\n",
" 0.455635 0.582899 0.781037 0.536568 0.364438 0.398084\n",
" 0.929348 0.0185914 0.989302 0.270407 0.397853 0.0639801\n",
" 0.0719316 0.0236982 0.17726 0.888011 0.491766 0.555778\n",
" 0.662934 0.276135 0.894759 … 0.0072779 0.998696 0.431419\n",
" 0.663759 0.805376 0.504657 0.061532 0.295655 0.888306\n",
" 0.401665 0.528259 0.667585 0.150353 0.97375 0.563587\n",
" ⋮ ⋱ \n",
" 0.525315 0.613083 0.695356 0.223117 0.332436 0.382263\n",
" 0.0251914 0.692443 0.23277 0.620846 0.97116 0.187392\n",
" 0.946732 0.612567 0.858092 … 0.762704 0.557224 0.811968\n",
" 0.558617 0.669468 0.627795 0.694151 0.549284 0.1481\n",
" 0.115266 0.205627 0.546792 0.0178188 0.400527 0.954316\n",
" 0.320129 0.196223 0.317189 0.317803 0.487448 0.752907\n",
" 0.176393 0.576737 0.814849 0.319391 0.400982 0.677275\n",
" 0.741602 0.536035 0.695513 … 0.839493 0.106422 0.6837\n",
" 0.81777 0.0515903 0.531702 0.521287 0.0290636 0.224369\n",
" 0.308238 0.327846 0.548927 0.784832 0.210909 0.250473\n",
" 0.923223 0.918547 0.917423 0.103074 0.552284 0.930064\n",
" 0.863915 0.407121 0.794925 0.152373 0.323071 0.455128"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"A = rand(1000,1000)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"各要素は$[0,1)$の一様分布に従う疑似乱数であるので, 総和は概ね500000になるはずです."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"bad(A) = 500432.86382117023\n",
"good(A) = 500432.8638211867\n",
"sum(A) = 500432.86382120027\n"
]
}
],
"source": [
"println(\"bad(A) = \", bad(A))\n",
"println(\"good(A) = \", good(A))\n",
"println(\"sum(A) = \", sum(A))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"浮動小数点の計算であるため, 和の順序によって僅かに結果が異なりますが, 概ね500000になっています."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 結果\n",
"\n",
"ベンチマークテストの結果は以下の通りです."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"BenchmarkTools.Trial: 1479 samples with 1 evaluation.\n",
" Range \u001b[90m(\u001b[39m\u001b[36m\u001b[1mmin\u001b[22m\u001b[39m … \u001b[35mmax\u001b[39m\u001b[90m): \u001b[39m\u001b[36m\u001b[1m3.153 ms\u001b[22m\u001b[39m … \u001b[35m 11.431 ms\u001b[39m \u001b[90m┊\u001b[39m GC \u001b[90m(\u001b[39mmin … max\u001b[90m): \u001b[39m0.00% … 0.00%\n",
" Time \u001b[90m(\u001b[39m\u001b[34m\u001b[1mmedian\u001b[22m\u001b[39m\u001b[90m): \u001b[39m\u001b[34m\u001b[1m3.327 ms \u001b[22m\u001b[39m\u001b[90m┊\u001b[39m GC \u001b[90m(\u001b[39mmedian\u001b[90m): \u001b[39m0.00%\n",
" Time \u001b[90m(\u001b[39m\u001b[32m\u001b[1mmean\u001b[22m\u001b[39m ± \u001b[32mσ\u001b[39m\u001b[90m): \u001b[39m\u001b[32m\u001b[1m3.363 ms\u001b[22m\u001b[39m ± \u001b[32m419.905 μs\u001b[39m \u001b[90m┊\u001b[39m GC \u001b[90m(\u001b[39mmean ± σ\u001b[90m): \u001b[39m0.00% ± 0.00%\n",
"\n",
" \u001b[39m▆\u001b[39m▇\u001b[39m \u001b[39m \u001b[39m▃\u001b[39m▅\u001b[39m▅\u001b[39m▆\u001b[34m█\u001b[39m\u001b[39m▆\u001b[32m▇\u001b[39m\u001b[39m▃\u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \n",
" \u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[34m█\u001b[39m\u001b[39m█\u001b[32m█\u001b[39m\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m▇\u001b[39m▅\u001b[39m▅\u001b[39m▃\u001b[39m▃\u001b[39m▃\u001b[39m▃\u001b[39m▂\u001b[39m▂\u001b[39m▂\u001b[39m▂\u001b[39m▂\u001b[39m▂\u001b[39m▂\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▂\u001b[39m▂\u001b[39m▁\u001b[39m▂\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▂\u001b[39m▂\u001b[39m▁\u001b[39m▂\u001b[39m▁\u001b[39m▁\u001b[39m▂\u001b[39m▂\u001b[39m▁\u001b[39m▁\u001b[39m▁\u001b[39m▂\u001b[39m▁\u001b[39m▂\u001b[39m▁\u001b[39m▁\u001b[39m▂\u001b[39m▂\u001b[39m \u001b[39m▃\n",
" 3.15 ms\u001b[90m Histogram: frequency by time\u001b[39m 4.44 ms \u001b[0m\u001b[1m<\u001b[22m\n",
"\n",
" Memory estimate\u001b[90m: \u001b[39m\u001b[33m16 bytes\u001b[39m, allocs estimate\u001b[90m: \u001b[39m\u001b[33m1\u001b[39m."
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"@benchmark bad(A)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"BenchmarkTools.Trial: 4180 samples with 1 evaluation.\n",
" Range \u001b[90m(\u001b[39m\u001b[36m\u001b[1mmin\u001b[22m\u001b[39m … \u001b[35mmax\u001b[39m\u001b[90m): \u001b[39m\u001b[36m\u001b[1m979.800 μs\u001b[22m\u001b[39m … \u001b[35m 11.564 ms\u001b[39m \u001b[90m┊\u001b[39m GC \u001b[90m(\u001b[39mmin … max\u001b[90m): \u001b[39m0.00% … 0.00%\n",
" Time \u001b[90m(\u001b[39m\u001b[34m\u001b[1mmedian\u001b[22m\u001b[39m\u001b[90m): \u001b[39m\u001b[34m\u001b[1m 1.151 ms \u001b[22m\u001b[39m\u001b[90m┊\u001b[39m GC \u001b[90m(\u001b[39mmedian\u001b[90m): \u001b[39m0.00%\n",
" Time \u001b[90m(\u001b[39m\u001b[32m\u001b[1mmean\u001b[22m\u001b[39m ± \u001b[32mσ\u001b[39m\u001b[90m): \u001b[39m\u001b[32m\u001b[1m 1.179 ms\u001b[22m\u001b[39m ± \u001b[32m307.241 μs\u001b[39m \u001b[90m┊\u001b[39m GC \u001b[90m(\u001b[39mmean ± σ\u001b[90m): \u001b[39m0.00% ± 0.00%\n",
"\n",
" \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m▄\u001b[39m \u001b[39m█\u001b[39m \u001b[39m▇\u001b[34m \u001b[39m\u001b[32m \u001b[39m\u001b[39m▄\u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \n",
" \u001b[39m▃\u001b[39m▂\u001b[39m▄\u001b[39m▃\u001b[39m█\u001b[39m▃\u001b[39m█\u001b[39m▄\u001b[39m█\u001b[39m▄\u001b[39m█\u001b[34m▅\u001b[39m\u001b[32m▆\u001b[39m\u001b[39m█\u001b[39m▅\u001b[39m▆\u001b[39m▅\u001b[39m▄\u001b[39m▄\u001b[39m▃\u001b[39m▃\u001b[39m▃\u001b[39m▃\u001b[39m▂\u001b[39m▂\u001b[39m▂\u001b[39m▂\u001b[39m▂\u001b[39m▂\u001b[39m▂\u001b[39m▂\u001b[39m▂\u001b[39m▂\u001b[39m▂\u001b[39m▁\u001b[39m▂\u001b[39m▂\u001b[39m▂\u001b[39m▂\u001b[39m▁\u001b[39m▂\u001b[39m▂\u001b[39m▂\u001b[39m▂\u001b[39m▂\u001b[39m▁\u001b[39m▁\u001b[39m▂\u001b[39m▂\u001b[39m▁\u001b[39m▁\u001b[39m▂\u001b[39m▁\u001b[39m▂\u001b[39m▂\u001b[39m▁\u001b[39m▂\u001b[39m▂\u001b[39m▂\u001b[39m▁\u001b[39m▂\u001b[39m \u001b[39m▃\n",
" 980 μs\u001b[90m Histogram: frequency by time\u001b[39m 1.96 ms \u001b[0m\u001b[1m<\u001b[22m\n",
"\n",
" Memory estimate\u001b[90m: \u001b[39m\u001b[33m16 bytes\u001b[39m, allocs estimate\u001b[90m: \u001b[39m\u001b[33m1\u001b[39m."
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"@benchmark good(A)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"BenchmarkTools.Trial: 9209 samples with 1 evaluation.\n",
" Range \u001b[90m(\u001b[39m\u001b[36m\u001b[1mmin\u001b[22m\u001b[39m … \u001b[35mmax\u001b[39m\u001b[90m): \u001b[39m\u001b[36m\u001b[1m478.700 μs\u001b[22m\u001b[39m … \u001b[35m 5.138 ms\u001b[39m \u001b[90m┊\u001b[39m GC \u001b[90m(\u001b[39mmin … max\u001b[90m): \u001b[39m0.00% … 0.00%\n",
" Time \u001b[90m(\u001b[39m\u001b[34m\u001b[1mmedian\u001b[22m\u001b[39m\u001b[90m): \u001b[39m\u001b[34m\u001b[1m509.500 μs \u001b[22m\u001b[39m\u001b[90m┊\u001b[39m GC \u001b[90m(\u001b[39mmedian\u001b[90m): \u001b[39m0.00%\n",
" Time \u001b[90m(\u001b[39m\u001b[32m\u001b[1mmean\u001b[22m\u001b[39m ± \u001b[32mσ\u001b[39m\u001b[90m): \u001b[39m\u001b[32m\u001b[1m525.046 μs\u001b[22m\u001b[39m ± \u001b[32m104.547 μs\u001b[39m \u001b[90m┊\u001b[39m GC \u001b[90m(\u001b[39mmean ± σ\u001b[90m): \u001b[39m0.00% ± 0.00%\n",
"\n",
" \u001b[39m \u001b[39m▄\u001b[39m█\u001b[39m▄\u001b[39m█\u001b[39m▆\u001b[34m▅\u001b[39m\u001b[39m▄\u001b[39m▂\u001b[32m▁\u001b[39m\u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \u001b[39m \n",
" \u001b[39m▂\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[39m█\u001b[34m█\u001b[39m\u001b[39m█\u001b[39m█\u001b[32m█\u001b[39m\u001b[39m▇\u001b[39m▅\u001b[39m▆\u001b[39m▄\u001b[39m▄\u001b[39m▄\u001b[39m▃\u001b[39m▃\u001b[39m▃\u001b[39m▃\u001b[39m▃\u001b[39m▃\u001b[39m▂\u001b[39m▂\u001b[39m▃\u001b[39m▂\u001b[39m▂\u001b[39m▂\u001b[39m▂\u001b[39m▂\u001b[39m▂\u001b[39m▂\u001b[39m▂\u001b[39m▂\u001b[39m▂\u001b[39m▂\u001b[39m▂\u001b[39m▂\u001b[39m▂\u001b[39m▂\u001b[39m▂\u001b[39m▂\u001b[39m▂\u001b[39m▂\u001b[39m▂\u001b[39m▂\u001b[39m▂\u001b[39m▂\u001b[39m▂\u001b[39m▂\u001b[39m▁\u001b[39m▂\u001b[39m▂\u001b[39m▂\u001b[39m▂\u001b[39m▂\u001b[39m▂\u001b[39m▂\u001b[39m▁\u001b[39m▂\u001b[39m▂\u001b[39m \u001b[39m▃\n",
" 479 μs\u001b[90m Histogram: frequency by time\u001b[39m 783 μs \u001b[0m\u001b[1m<\u001b[22m\n",
"\n",
" Memory estimate\u001b[90m: \u001b[39m\u001b[33m16 bytes\u001b[39m, allocs estimate\u001b[90m: \u001b[39m\u001b[33m1\u001b[39m."
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"@benchmark sum(A)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"表にまとめると以下のような計算時間でした.\n",
"\n",
"|関数|Time(mean)|\n",
"|:---|:---|\n",
"|`bad()`|3327µs|\n",
"|`good()`|1151µs|\n",
"|`sum()`|509µs|"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## まとめ\n",
"\n",
"JuliaでもFortranと同じ方法でキャッシュチューニングが可能であることがわかりました. しかし, `sum(A)`が圧倒的に速かったので, できるだけ標準ライブラリの関数を利用することをお勧めします."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 参考文献\n",
"\n",
"- [青山幸也『チューニング技法虎の巻』](https://www.hpci-office.jp/pages/seminar_texts) p.4-4\n",
"- https://julialang.org/blog/2013/09/fast-numeric/#write_cache-friendly_codes\n",
"\n",
"この記事の元になったJupyter Notebookのデータは下記のリンクにある.\n",
"\n",
"https://gist.github.com/ohno/86f1ba4c9a237b17a10cb88904e283e6"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"© 2021 Shuhei Ohno\n",
"<br>Source: https://gist.github.com/ohno\n",
"<br>License: https://opensource.org/licenses/MIT"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Julia 1.7.1",
"language": "julia",
"name": "julia-1.7"
},
"language_info": {
"file_extension": ".jl",
"mimetype": "application/julia",
"name": "julia",
"version": "1.7.1"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment