Authored-by: Eric Lunderberg
Notes summarizing discussion between @Lunderberg and @csullivan on 2022_10_25
From a previous conversation: the possibility of representing pad/crop separately from the layout transform. This would allow algebraic proofs to be done in the simpler coordinate system, before applying the layout transform.
However, not all types of padding can be represented in this manner. As an example, suppose we want to pad a buffer such that every 8th value is padding. This could be represented in one step with a padded transform, but would require three steps when pad/crop is kept separate from the layout transform.
# With padded transforms
transform_layout(index_map = lambda i: [i//7, (i%7)%8], pad_value=0)
# With pad/crop and bijective transforms
insert_pad_crop(new_shape = [7*ceildiv(A.shape[0], 7)])
transform_layout(index_map = lambda i: [i//7, i%7])
insert_pad_crop(new_shape = [buf.shape[0], 8])
Any cancellation of the second pad/crop would need to be done after the layout transform. Therefore, we can't get away from performing algebraic proofs within the transformed layout.
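As a rough check that the two formulations above describe the same data layout, here is a minimal sketch in plain Python. It does not use TVM; `padded_transform` and `three_step` are hypothetical helpers that emulate the schedule primitives on ordinary lists, and the 10-element input is an arbitrary choice for illustration.

```python
def ceildiv(a, b):
    """Integer division rounding up."""
    return -(-a // b)


def padded_transform(data, pad_value=0):
    """One step: place element i at [i // 7, i % 7] in rows of width 8."""
    rows = ceildiv(len(data), 7)
    out = [[pad_value] * 8 for _ in range(rows)]
    for i, value in enumerate(data):
        out[i // 7][i % 7] = value
    return out


def three_step(data, pad_value=0):
    """Three steps: pad, bijective transform, pad again."""
    # Step 1: pad the flat buffer up to a multiple of 7.
    padded_len = 7 * ceildiv(len(data), 7)
    flat = list(data) + [pad_value] * (padded_len - len(data))
    # Step 2: bijective transform i -> [i // 7, i % 7].
    rows = [flat[r * 7:(r + 1) * 7] for r in range(padded_len // 7)]
    # Step 3: pad the inner dimension from 7 to 8.
    return [row + [pad_value] for row in rows]


data = list(range(1, 11))  # arbitrary 10-element buffer
one_step = padded_transform(data)
multi_step = three_step(data)
```

Both helpers put a pad value in the last slot of every row. In the three-step version that slot only exists after the layout transform, which is why the second pad/crop cannot be reasoned about in the flat coordinate system.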
While this is a somewhat contrived example, it could easily occur in
practice. Suppose a conv1d with filter size 2 uses vector operations
of size 8. The implementation uses a sliding window of size 8, which
advances by 7 elements at a time. (Assume alignment restrictions are
handled by a cache_read.) Each application of the vector operations
would produce 8 values, the last of which is junk. If the output of
the conv1d is then matrix multiplied by a constant matrix, the above example
could be applied to the constant matrix. This would result in a pad
value (zero) at every location corresponding to a junk value, which
could be used to vectorize the matrix multiplication.
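The scenario above can be sketched in plain Python. This is an illustration under stated assumptions, not TVM code: `JUNK` stands in for whatever the 8th vector lane happens to hold, the helper names are hypothetical, and the "matrix multiply" is reduced to a vector-matrix product with integer data so that results compare exactly.

```python
JUNK = 999  # stand-in for the unspecified value in the 8th lane


def conv1d_reference(x, w):
    """Plain conv1d with filter size 2 (no junk values)."""
    return [x[i] * w[0] + x[i + 1] * w[1] for i in range(len(x) - 1)]


def conv1d_windowed(x, w):
    """Emulate size-8 vector ops on a window that advances by 7 elements."""
    rows = []
    for start in range(0, len(x) - 1, 7):
        window = x[start:start + 8]
        row = [window[j] * w[0] + window[j + 1] * w[1]
               for j in range(len(window) - 1)]
        row += [JUNK] * (8 - len(row))  # the last lane (at least) is junk
        rows.append(row)
    return rows


def pad_constant_rows(m):
    """Move row i of m to row 8*(i // 7) + i % 7, with zero rows as padding."""
    n_blocks = -(-len(m) // 7)
    padded = [[0] * len(m[0]) for _ in range(8 * n_blocks)]
    for i, row in enumerate(m):
        padded[8 * (i // 7) + i % 7] = list(row)
    return padded


def vecmat(v, m):
    """Vector-matrix product: out[k] = sum_i v[i] * m[i][k]."""
    return [sum(v[i] * m[i][k] for i in range(len(v)))
            for k in range(len(m[0]))]


x = list(range(15))   # 15 inputs -> 14 valid outputs -> 2 windows
w = [2, 3]
m = [[i + k for k in range(3)] for i in range(14)]  # constant matrix

expected = vecmat(conv1d_reference(x, w), m)
flat_with_junk = [v for row in conv1d_windowed(x, w) for v in row]
actual = vecmat(flat_with_junk, pad_constant_rows(m))
```

Because every junk lane lines up with a zero row of the padded constant matrix, the product can run over all 8 lanes without masking and still match the reference result (assuming the junk values are finite).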
For me, the key bit is that crop cannot be hoisted without first proving that the compact representation is a valid representation of the previous TIR. The cancellations listed would cancel out, but writing in that format assumes that we already made it past the step of extracting the compact representation. When hoisting multiple stages from a TIR function, they must be hoisted from the outside-in, so y_crop_1 must be hoisted first. In order to extract the compact representation of y_crop_1, we need to propagate the known values across the compute function, the TIR stage that will later be hoisted as y_crop_0, and the TIR stage that will later be hoisted as y_transform. Since this requires data-flow analysis across a layout transformation stage, we haven't avoided the propagation of buffer values across a layout transform.
(Chris and I came up with this example last week, which is why I didn't mention it during our earlier conversation about separating pad/crop from bijective transformations.)