[TOC]

# How to Optimize GPU Convolution with TVM

## Preparation and Algorithm

```python
import numpy as np
import tvm

# Input and filter dimensions
batch = 256
in_channel = 256
out_channel = 512
in_size = 14
kernel = 3
pad = 1
stride = 1

# Algorithm
A = tvm.placeholder((in_size, in_size, in_channel, batch), name='A')
W = tvm.placeholder((kernel, kernel, in_channel, out_channel), name='W')
# Output feature map size
out_size = (in_size - kernel + 2 * pad) // stride + 1
# Pad the input
Apad = tvm.compute(
    (in_size + 2 * pad, in_size + 2 * pad, in_channel, batch),
    lambda yy, xx, cc, nn: tvm.select(
        tvm.all(yy >= pad, yy - pad < in_size, xx >= pad, xx - pad < in_size),
        A[yy - pad, xx - pad, cc, nn], tvm.const(0., "float32")),
    name='Apad')
# Create the reduction axes
rc = tvm.reduce_axis((0, in_channel), name='rc')
ry = tvm.reduce_axis((0, kernel), name='ry')
rx = tvm.reduce_axis((0, kernel), name='rx')
# Compute the convolution
B = tvm.compute(
    (out_size, out_size, out_channel, batch),
    lambda yy, xx, ff, nn: tvm.sum(
        Apad[yy * stride + ry, xx * stride + rx, rc, nn] * W[ry, rx, rc, ff],
        axis=[ry, rx, rc]),
    name='B')
```
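For intuition, the same HWCN-layout convolution can be written as a direct NumPy loop. This is a hedged reference sketch (the function name `conv_hwcn` and the small test sizes are ours, not part of the tutorial), useful for sanity-checking the TVM result on small inputs:

```python
# Minimal NumPy reference for the HWCN-layout convolution above.
# conv_hwcn and the tiny test shapes are illustrative assumptions.
import numpy as np

def conv_hwcn(a, w, stride=1, pad=1):
    """Direct convolution; a is (H, W, C, N), w is (KH, KW, C, F)."""
    in_size, _, in_channel, batch = a.shape
    kernel, _, _, out_channel = w.shape
    out_size = (in_size - kernel + 2 * pad) // stride + 1
    # Zero-pad the spatial dimensions, mirroring the Apad stage
    apad = np.zeros((in_size + 2 * pad, in_size + 2 * pad, in_channel, batch),
                    dtype=a.dtype)
    apad[pad:pad + in_size, pad:pad + in_size] = a
    b = np.zeros((out_size, out_size, out_channel, batch), dtype=a.dtype)
    for yy in range(out_size):
        for xx in range(out_size):
            patch = apad[yy * stride:yy * stride + kernel,
                         xx * stride:xx * stride + kernel]  # (KH, KW, C, N)
            # Sum over ry, rx, rc, exactly as in the tvm.sum above
            b[yy, xx] = np.einsum('yxcn,yxcf->fn', patch, w)
    return b

a = np.random.uniform(size=(5, 5, 3, 2)).astype('float32')
w = np.random.uniform(size=(3, 3, 3, 4)).astype('float32')
out = conv_hwcn(a, w)
print(out.shape)  # (5, 5, 4, 2)
```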

## Memory Hierarchy

```python
# Designate the memory hierarchy
s = tvm.create_schedule(B.op)
s[Apad].compute_inline()  # compute Apad inline
AA = s.cache_read(Apad, 'shared', [B])
WW = s.cache_read(W, 'shared', [B])
AL = s.cache_read(AA, 'local', [B])
WL = s.cache_read(WW, 'local', [B])
BL = s.cache_write(B, 'local')
```

## Blocking

```python
# Tiling constants
tile = 8
num_thread = 8
block_factor = tile * num_thread
step = 8
vthread = 2

# Get the GPU thread indices (extent, thread tag)
block_x = tvm.thread_axis("blockIdx.x")
block_y = tvm.thread_axis("blockIdx.y")
block_z = tvm.thread_axis("blockIdx.z")
thread_x = tvm.thread_axis((0, num_thread), "threadIdx.x")
thread_y = tvm.thread_axis((0, num_thread), "threadIdx.y")
thread_xz = tvm.thread_axis((0, vthread), "vthread", name="vx")
thread_yz = tvm.thread_axis((0, vthread), "vthread", name="vy")

# Split the workload
hi, wi, fi, ni = s[B].op.axis
bz = s[B].fuse(hi, wi)
by, fi = s[B].split(fi, factor=block_factor)
bx, ni = s[B].split(ni, factor=block_factor)

# Bind the iteration variables to GPU block indices
s[B].bind(bz, block_z)
s[B].bind(by, block_y)
s[B].bind(bx, block_x)
```
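The index arithmetic behind `fuse` and `split` can be illustrated in plain Python. This is only a sketch of the mapping (the helper names `fuse` and `split_by_factor` are ours, not TVM APIs):

```python
# Plain-Python sketch of the index arithmetic behind fuse and split
# (illustration only; these are not TVM APIs).

def fuse(i, j, extent_j):
    """fuse(hi, wi): two axes collapse into one, k = i * extent_j + j."""
    return i * extent_j + j

def split_by_factor(k, factor):
    """split(axis, factor=f): one index becomes (outer, inner) = (k // f, k % f)."""
    return k // factor, k % factor

# With tile = 8 and num_thread = 8, block_factor is 64, so every 64
# consecutive output channels land in one block along blockIdx.y.
block_factor = 8 * 8
outer, inner = split_by_factor(137, block_factor)
print(outer, inner)  # 2 9
assert fuse(outer, inner, block_factor) == 137
```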

## Virtual Thread Split

```python
tyz, fi = s[B].split(fi, nparts=vthread)  # virtual thread split
txz, ni = s[B].split(ni, nparts=vthread)  # virtual thread split
ty, fi = s[B].split(fi, nparts=num_thread)
tx, ni = s[B].split(ni, nparts=num_thread)
s[B].reorder(bz, by, bx, tyz, txz, ty, tx, fi, ni)

s[B].bind(tyz, thread_yz)
s[B].bind(txz, thread_xz)
s[B].bind(ty, thread_y)
s[B].bind(tx, thread_x)
```

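Note that this step splits with `nparts` rather than `factor`. The difference in the resulting extents can be shown in plain Python (a sketch with hypothetical helper names, not the TVM API):

```python
# Sketch of split(factor=...) vs split(nparts=...); illustration only.

def split_factor(extent, factor):
    """factor fixes the inner extent; outer extent is ceil(extent / factor)."""
    return (extent + factor - 1) // factor, factor

def split_nparts(extent, nparts):
    """nparts fixes the outer extent; inner extent is ceil(extent / nparts)."""
    return nparts, (extent + nparts - 1) // nparts

# fi has extent block_factor = 64 after the blocking step; vthread = 2
print(split_nparts(64, 2))   # (2, 32): two virtual threads, 32 channels each
print(split_factor(64, 2))   # (32, 2): 32 outer iterations of 2 channels
```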
## Cooperative Fetching

```python
# Schedule the local write of BL
s[BL].compute_at(s[B], tx)
yi, xi, fi, ni = s[BL].op.axis
ry, rx, rc = s[BL].op.reduce_axis
rco, rci = s[BL].split(rc, factor=step)
s[BL].reorder(rco, ry, rx, rci, fi, ni)

# Attach computation to iteration variables
s[AA].compute_at(s[BL], rx)
s[WW].compute_at(s[BL], rx)
s[AL].compute_at(s[BL], rci)
s[WL].compute_at(s[BL], rci)

# Schedule for A's shared memory load
yi, xi, ci, ni = s[AA].op.axis
ty, ci = s[AA].split(ci, nparts=num_thread)
tx, ni = s[AA].split(ni, nparts=num_thread)
_, ni = s[AA].split(ni, factor=4)
s[AA].reorder(ty, tx, yi, xi, ci, ni)
s[AA].bind(ty, thread_y)
s[AA].bind(tx, thread_x)
s[AA].vectorize(ni)  # vectorize the memory load

# Schedule for W's shared memory load
yi, xi, ci, fi = s[WW].op.axis
ty, ci = s[WW].split(ci, nparts=num_thread)
tx, fi = s[WW].split(fi, nparts=num_thread)
_, fi = s[WW].split(fi, factor=4)
s[WW].reorder(ty, tx, yi, xi, ci, fi)
s[WW].bind(ty, thread_y)
s[WW].bind(tx, thread_x)
s[WW].vectorize(fi)  # vectorize the memory load
```

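The idea of cooperative fetching is that all threads in a block jointly copy one shared-memory tile, each taking a distinct slice. A plain NumPy sketch of the coverage pattern, using hypothetical tile extents of our own choosing:

```python
# Sketch of cooperative fetching: an 8 x 8 grid of threads jointly copies
# one shared-memory tile, each thread owning a distinct contiguous slice,
# so every element is loaded exactly once. Tile extents are assumptions.
import numpy as np

num_thread = 8
tile_c, tile_n = 8, 64                     # hypothetical (ci, ni) tile extents
src = np.arange(tile_c * tile_n, dtype=np.float32).reshape(tile_c, tile_n)
shared = np.zeros_like(src)

chunk = tile_n // num_thread               # columns owned by each threadIdx.x
for ty in range(num_thread):               # threadIdx.y owns one row here
    for tx in range(num_thread):           # threadIdx.x owns one column chunk
        shared[ty, tx * chunk:(tx + 1) * chunk] = \
            src[ty, tx * chunk:(tx + 1) * chunk]

assert (shared == src).all()               # the tile is covered exactly once
```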
## Generate the CUDA Kernel

```python
func = tvm.build(s, [A, W, B], 'cuda')
ctx = tvm.gpu(0)
a_np = np.random.uniform(size=(in_size, in_size, in_channel, batch)).astype(A.dtype)
w_np = np.random.uniform(size=(kernel, kernel, in_channel, out_channel)).astype(W.dtype)
a = tvm.nd.array(a_np, ctx)
w = tvm.nd.array(w_np, ctx)
b = tvm.nd.array(np.zeros((out_size, out_size, out_channel, batch), dtype=B.dtype), ctx)
func(a, w, b)
evaluator = func.time_evaluator(func.entry_name, ctx, number=1)
print('Convolution: %f ms' % (evaluator(a, w, b).mean * 1e3))
```

```
Convolution: 37.071140 ms  # 1066
Convolution: 16.331274 ms  # 1080TI
```