cia-rana/memo.md

## memo.md

      
    Raw
  

              memo.md
            
          
    Halideのチュートリアルまとめ

Halideのチュートリアルを読んでまとめる。
1. Getting started with Funcs, Vars, and Exprs

Func、Var、Exprの定義について見ていく。

`Func` それぞれの画素がどのような値を持つか定義する。
`Var` `Func`で使う変数名を定義する。Halideでは慣例的に、`x`は列のインデックス、`y`は行のインデックスを表す。32-bitの符号付き整数。
`Expr` `Var`で定義した変数を用いて、`Func`の具体的な内容を定義する。

Func、Var、Exprはそれぞれ画像処理におけるパイプラインを定めただけで、まだ処理を行っていない。

Funcに対しrealizeを呼ぶことで、パイプライン処理に必要な諸々のコードを実行する。
2. Processing images

入力画像がどのように処理されていくのか見ていく。  
例として、画像を明るくするシングルステージパイプラインの処理を挙げる。
png画像を読み込むためのコードをロードする。
#include "halide_image_io.h"
using namespace Halide::Tools;
uint8_tで入力画像を読み込む。
Halide::Buffer<uint8_t> input = load_image("images/rgb.png");
画像を明るくするFuncを定義する。
Halide::Func brighter;
Funcで使う座標x,yおよびカラーチャンネルを表すcを定義する。
Halide::Var x, y, c;
入力画像input(x, y, c)をExprのvalueに対応付ける
Halide::Expr value = input(x, y, c);
valueの型をfloatにキャストする。
value = Halide::cast<float>(value);
valueを1.5倍する。
value = value * 1.5f;
valueを255.0以下に丸める。
value = Halide::min(value, 255.0f);
valueをuint8_tに戻す。
value = Halide::cast<uint8_t>(value);
valueをbrighter(x, y, c)に対応付ける。
brighter(x, y, c) = value;
Funcであるbrighterを実行（realize）する。
Halide::Buffer<uint8_t> output = brighter.realize(input.width(), input.height(), input.channels());
処理済みの画像を保存する。
save_image(output, "brighter.png");
3. Inspecting the generated code

Halideコンパイラの生成物をどのように調べれば良いか見ていく。
C++ではオブジェクトがオブジェクト自身の名前を覚えておくことができない。
そのため、Func,Varそれぞれのコンストラクタに文字列を渡すことによって、自身の名前を認識させる。
環境変数HL_DEBUG_CODEGENを1にしてFuncを実行することで、コンパイルの途中経過や最終的なパイプラインを擬似コードで生成する。
この環境変数の値を上げることで、より詳細な擬似コードを生成する。HL_DEBUG_CODEGENが2のとき、コンパイルのそれぞれの過程におけるHalideのコードや最終的なllvmのビットコードを生成する。  
以下のようにして、Halideコンパイラの生成物をHTMLで出力することができる。
gradient.compile_to_lowered_stmt("gradient.html", {}, HTML);
4. Debugging with tracing, print, and print_when

Halideがランタイムで行っていることをどのように追っていけば良いか見ていく。  
全ての評価結果を標準出力するように、Halideに示す。
gradient.trace_stores();
y方向に並列で処理するように、Halideに示す。
parallel_gradient.parallel(y);
printを使うことで、特定の変数の評価結果だけを標準出力することができる。
g(x, y) = sin(x) + print(cos(y));
以下のようにすれば、標準出力を編集することができる。printの第一引数だけが後の計算に利用される。
f(x, y) = sin(x) + print(cos(y), "<- this is cos(", y, ") when x =", x);
print_whenを使うことで、第一引数がtrueの時だけ標準出力するように出力を制御できる。第二引数は第一引数の真偽に関わらず後の計算に利用される。
e = print_when(x == 37 && y == 42, e, "<- this is cos(y) at x, y == (37, 42)");
以下のようにすれば、コンパイル時の「式」を出力できる。
Var fizz("fizz"), buzz("buzz");
Expr e = 1;
for (int i = 2; i < 100; i++) {
    if (i % 3 == 0 && i % 5 == 0) e += fizz*buzz;
    else if (i % 3 == 0) e += fizz;
    else if (i % 5 == 0) e += buzz;
    else e += i;
}
std::cout << "Printing a complex Expr: " << e << "\n";
5. Vectorize, parallelize, unroll and tile your code

ここではFuncがどのような順番でピクセルを評価していくのか、ベクトル化（vectorize）、並列化（parallelize）、タイル化（tile）の観点からみていく。
まずはおなじみの記述。gradientを定義して全ての評価結果を標準出力するように、Halideに促す。
Func gradient("gradient");
gradient(x, y) = x + y;
gradient.trace_stores();
評価結果の標準出力は、デフォルトではrow-majorで処理される。
printf("Evaluating gradient row-major\n");
Buffer<int> output = gradient.realize(4, 4);
/*
> Begin pipeline gradient.0()
> Store gradient.0(0, 0) = 0
> Store gradient.0(1, 0) = 1
> Store gradient.0(2, 0) = 2
> Store gradient.0(3, 0) = 3
> Store gradient.0(0, 1) = 1
> Store gradient.0(1, 1) = 2
> Store gradient.0(2, 1) = 3
> Store gradient.0(3, 1) = 4
> Store gradient.0(0, 2) = 2
> Store gradient.0(1, 2) = 3
> Store gradient.0(2, 2) = 4
> Store gradient.0(3, 2) = 5
> Store gradient.0(0, 3) = 3
> Store gradient.0(1, 3) = 4
> Store gradient.0(2, 3) = 5
> Store gradient.0(3, 3) = 6
> End pipeline gradient.0()
*/
Halideがどのようなネスト構造のコードを生成するのかも見ることができる。
gradient.print_loop_nest();
/*
> compute gradient:
>   for y:
>     for x:
>       gradient(...) = ...
*/
Func::reoder

ネストされた処理の順序を変えることもできる。Funcの関数reoderは2つの引数をとり、第一引数が内側のループ、第二引数が外側のループとすることができる。
gradient.reorder(y, x);
この関数により、column-majorで処理することができる。
gradient.print_loop_nest();
/*
> compute gradient_col_major:
>   for x:
>     for y:
>       gradient_col_major(...) = ...
*/
Func::split

Funcのsplit関数によりVarを複数のVarに分割することができる。以下はxを2つのネストしたループ、すなわちアウターループとインナーループに分ける例である。

インナーループのインデックスは0からsplitの(第三引数-1)までであり、これにより元のループはouter * factor + innerと定義することができる。この分割によって処理の順番が変わることはない。
Var x_outer, x_inner;
gradient.split(x, x_outer, x_inner, 2);

gradient.print_loop_nest();
/*
> compute gradient_split:
>   for y:
>     for x.x_outer:
>       for x.x_inner in [0, 1]:
>         gradient_split(...) = ...
*/
Func::fuse

逆に2つの変数を1つに合わせることもできる。
Var fused;
gradient.fuse(x, y, fused);
gradient.print_loop_nest();
/*
> compute gradient_fused:
>   for x.fused:
>     gradient_fused(...) = ...
*/
Func::split + Func::reoder => Func::tile

Funcのsplitとreoderを組合わせることでtileを実現することも可能である。以下は8x8の画像を4x4の4つのタイルに分割して処理する例である。
Var x_outer, x_inner, y_outer, y_inner;
gradient.split(x, x_outer, x_inner, 4);
gradient.split(y, y_outer, y_inner, 4);
gradient.reorder(x_inner, y_inner, x_outer, y_outer);

Buffer<int> output = gradient.realize(8, 8);

gradient.print_loop_nest();

/*
> compute gradient_tiled:
>   for y.y_outer:
>     for x.x_outer:
>       for y.y_inner in [0, 3]:
>         for x.x_inner in [0, 3]:
>           gradient_tiled(...) = ...
*/
split,split,reorderとしてtileを構成する代わりに、Func::tileと短縮できる。
gradient.tile(x, y, x_outer, y_outer, x_inner, y_inner, 4, 4);
Func::tileによって分割されたループの順番は内側から第五引数、第六引数、第三引数、第四引数となる。
Func::vectorize

次はVarをベクトル化する例である。ベクトル化された変数は同時に処理される。
Var x_outer, x_inner;
gradient.split(x, x_outer, x_inner, 4);
gradient.vectorize(x_inner);

Buffer<int> output = gradient.realize(8, 4);
/*
> Begin pipeline gradient_in_vectors.0()
> Store gradient_in_vectors.0(<0, 1, 2, 3>, <0, 0, 0, 0>) = <0, 1, 2, 3>
> Store gradient_in_vectors.0(<4, 5, 6, 7>, <0, 0, 0, 0>) = <4, 5, 6, 7>
> Store gradient_in_vectors.0(<0, 1, 2, 3>, <1, 1, 1, 1>) = <1, 2, 3, 4>
> Store gradient_in_vectors.0(<4, 5, 6, 7>, <1, 1, 1, 1>) = <5, 6, 7, 8>
> Store gradient_in_vectors.0(<0, 1, 2, 3>, <2, 2, 2, 2>) = <2, 3, 4, 5>
> Store gradient_in_vectors.0(<4, 5, 6, 7>, <2, 2, 2, 2>) = <6, 7, 8, 9>
> Store gradient_in_vectors.0(<0, 1, 2, 3>, <3, 3, 3, 3>) = <3, 4, 5, 6>
> Store gradient_in_vectors.0(<4, 5, 6, 7>, <3, 3, 3, 3>) = <7, 8, 9, 10>
> End pipeline gradient_in_vectors.0()
*/

gradient.print_loop_nest();
> compute gradient_in_vectors:
>   for y:
>     for x.x_outer:
>       vectorized x.x_inner in [0, 3]:
>         gradient_in_vectors(...) = ...
Func::unroll

Funcのunrollによってループを展開することができる。
Var x_outer, x_inner;
gradient.split(x, x_outer, x_inner, 2);
gradient.unroll(x_inner);

/*
> compute gradient_unroll:
>   for y:
>     for x.x_outer:
>       unrolled x.x_inner in [0, 1]:
>         gradient_unroll(...) = ...
*/
Func::split（画像(のrow or col)が第三引数で割れないとき）

Funcのsplit関数が画像の行数または列数で割り切れないとき、もとのループは次のように定義される。
x = min(x_outer * factor, x_extent - factor) + x_inner

例えば、7x2の画像に対し、xをそのインナーループ数が3となるように分割した場合の擬似コードは次の通りである。
Var x_outer, x_inner;
gradient.split(x, x_outer, x_inner, 3);

printf("Evaluating gradient over a 7x2 box with x split by three \n");
Buffer<int> output = gradient.realize(7, 2);

printf("Equivalent C:\n");
for (int y = 0; y < 2; y++) {
    for (int x_outer = 0; x_outer < (7 + 3 - 1) / 3 / ; x_outer++) { // Now runs from 0 to 2
        for (int x_inner = 0; x_inner < 3; x_inner++) {
            // Before we add x_inner, make sure we don't
            // evaluate points outside of the 7x2 box. We'll
            // clamp x to be at most 4 (7 minus the split
            // factor).
            int x = min(x_outer * 3, 7 - 3) + x_inner
            
            printf("Evaluating at x = %d, y = %d: %d\n", x, y, x + y);
        }
    }
}
まとめ

このセクションでは、Func::reoder,Func::split,Func::fuse,Func::tile,Func::vectorize,Func::unrollを組合わせることによりベクトル化（vectorize）、並列化（parallelize）、タイル化（tile）が可能であることを示した。
6. Realizing funcs over arbitrary domains

Buffer<int> shifted(5, 7); // In the constructor we tell it the size.
shifted.set_min(100, 50);
とすることで、画像の原点の位置を変更できる。
7. Multi-stage pipelines

Funcの中でFuncを使うことも可能。
8. Scheduling multi-stage pipelines

ここでは、Funcのスケージュールの仕方によって処理のパフォーマンスに差が出ることを示す。
まずはプレーンの処理が次の通り。
Var x("x"), y("y");
Func producer("producer_default"), consumer("consumer_default");

producer(x, y) = sin(x * y);
consumer(x, y) = (producer(x, y) +
                  producer(x, y+1) +
                  producer(x+1, y) +
                  producer(x+1, y+1))/4;

consumer.realize(4, 4);
consumerの各座標を計算するたびにproducerを呼んでいるため、sinのコールは4x4x4=64回、メモリストアは4x4=16回。一方で一時的に値を退避させていないので、テンポラリメモリアロケーション、メモリロードは共に0回である。
次はFunc::compute_rootを使った例で、producer.compute_root()とすることで、producerの計算が全て完了した後に他の処理を行う。
Func producer("producer_default"), consumer("consumer_default");

producer(x, y) = sin(x * y);
consumer(x, y) = (producer(x, y) +
                  producer(x, y+1) +
                  producer(x+1, y) +
                  producer(x+1, y+1))/4;

producer.compute_root();

consumer.realize(4, 4);
Func::compute_rootにより、事前にproducerの計算を行うのでsinのコールは矩形の端1ピクセルを考慮して5x5=25回、テンポラリメモリアロケーションも25回。メモリロードは4x4x4=64回で、メモリストアは39回？。
最初のスケジュールはテンポラリメモリアロケーションを行わない点では優れているが、consumerの計算の度にsinを計算している。一方2番目のスケジューリングはsinの計算を最小限に抑えることが出きるが、テンポラリメモリアロケーションが必要である。このようにどのスケジューリングにもメモリアロケーションや計算回数などのトレードオフがあるので、処理ごとにスケジューリングを考えることが重要である。
producerのy'に関する計算が全て終わったときに初めて、y'を必要とするconsumerの計算が行われる。
producer.compute_at(consumer, y);
/*
> compute consumer_y:
>   for y:
>     compute producer_y:
>       for y:
>         for x:
>           producer_y(...) = ...
>     for x:
>       consumer_y(...) = ...
*/
producerの計算結果を全てストアする。
producer.store_root();
producer.compute_at(consumer, y);
/*
> store producer_root_y:
>   compute consumer_root_y:
>     for y:
>       compute producer_root_y:
>         for y:
>           for x:
>             producer_root_y(...) = ...
>       for x:
>         consumer_root_y(...) = ...
*/
producerの計算結果を全てストアする。また、producerのx'に関する計算が全て終わったときに初めて、x'を必要とするconsumerの計算が行われる。
producer.store_root().compute_at(consumer, x);
/*
> store producer_root_x:
>   compute consumer_root_x:
>     for y:
>       for x:
>         compute producer_root_x:
>           for y:
>             for x:
>               producer_root_x(...) = ...
>         consumer_root_x(...) = ...
*/
9. Multi-pass Funcs, undate definitions, and reductions

10.1. AOT compilation part 1

HalideはデフォルトではJIT（just-in-time）コンパイルをするが、Func::realizeの代わりにFunc::compile_to_static_libraryを使うことで、AOT（ahead-of-time）コンパイルを行うことができる。

Func::compile_to_static_libraryの第一引数はAOTコンパイルで出力する各種ファイル名に使われ、第二引数は実行時に必要となるパラメータ（型はParamやImageParamがよく使われる）、第三引数は外部に公開する関数名である。
参考


JITとかAOTとか.NET Nativeとか

10.2. AOT compilation part 2

AOTコンパイルによって公開された関数の最後の引数は関数戻り値になる。
brighter.compile_to_static_library("lesson_10_halide", {input, offset}, "brighter");
// ↓
// int brighter(halide_buffer_t *_input_buffer, uint8_t _offset, halide_buffer_t *_brighter_buffer);
int error = brighter(input, offset, output);
11. Cross-compilation

12. Using the GPU

13. Tuples

14. The Halide type system

Halideで特定の型の変数を使用したい場合は以下のように初期化する。
Var x;
Expr u8 = cast<uint8_t>(x);
Expr u16 = cast<uint16_t>(x);
Expr u32 = cast<uint32_t>(x);
Expr u64 = cast<uint64_t>(x);
Expr s8 = cast<int8_t>(x);
Expr s16 = cast<int16_t>(x);
Expr s32 = cast<int32_t>(x);
Expr s64 = cast<int64_t>(x);
Expr f32 = cast<float>(x);
Expr f64 = cast<double>(x);
また、使用可能な変数の型は以下の通り。
Type valid_halide_types[] = {
    UInt(8), UInt(16), UInt(32), UInt(64),
    Int(8), Int(16), Int(32), Int(64),
    Float(32), Float(64),
    Handle()
};
Handle()は不明瞭なTypeでvoid *に相当し、ビット幅は64bitである。
15.1. Generators part 1

15.2. Generators part 2

16.1. RGB images and memory layouts part 1

16.2. RGB images and memory layouts part 2

17. Reductions over non-rectangular domains

18. Factoring an associative reduction using rfactor

19. Wrapper Funcs

20. Cloning Funcs

21. Auto-Scheduler(generate)

21. Auto-Scheduler(run)
`Func`	それぞれの画素がどのような値を持つか定義する。
`Var`	`Func`で使う変数名を定義する。Halideでは慣例的に、`x`は列のインデックス、`y`は行のインデックスを表す。32-bitの符号付き整数。
`Expr`	`Var`で定義した変数を用いて、`Func`の具体的な内容を定義する。