Complete code for the example below is available here: https://github.com/moderngpu/moderngpu/blob/master/src/moderngpu/kernel_segreduce.hxx This is meant to facilitate the discussion of viewing streams as iterators, and I thought moderngpu had a nice example of it, in contrast to how iterators are used within CUTLASS.
Notice how the lambda below generates the values that are later consumed by transform_segreduce.
template<typename launch_arg_t = empty_t, typename matrix_it,
  typename columns_it, typename vector_it, typename segments_it,
  typename output_it>
void spmv(matrix_it matrix, columns_it columns, vector_it vector,
  int count, segments_it segments, int num_segments, output_it output,
  context_t& context) {

  typedef typename std::iterator_traits<matrix_it>::value_type type_t;

  transform_segreduce<launch_arg_t>([=]MGPU_DEVICE(int index) {
    return matrix[index] * ldg(vector + columns[index]);    // sparse m * v.
  }, count, segments, num_segments, output, plus_t<type_t>(),
    (type_t)0, context);
}
The key element here is the make_load_iterator wrapper around the lambda function f, which yields the type_t values we care about.
template<typename launch_arg_t = empty_t, typename func_t,
  typename segments_it, typename output_it, typename op_t, typename type_t>
void transform_segreduce(func_t f, int count, segments_it segments,
  int num_segments, output_it output, op_t op, type_t init,
  context_t& context) {

  segreduce<launch_arg_t>(make_load_iterator<type_t>(f), count, segments,
    num_segments, output, op, init, context);
}
Values produced by make_load_iterator<type_t>(f) are now accessible through an input iterator. Notice the load pattern appearing again in segreduce below, this time operating on that input iterator.
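Before the listing, here is a simplified view of what such a load looks like from the consuming side (a sketch loosely modeled on moderngpu's mem_to_reg/mem_to_shared helpers; the name and exact shape are ours). The point is that the loop only ever subscripts its input, so a raw pointer and a lambda-backed load iterator take exactly the same path:

// Sketch of a cooperative strided load into registers, assuming
// moderngpu's MGPU_DEVICE macro; load_to_reg_sketch is a hypothetical name.
template<int nt, int vt, typename input_it, typename type_t>
MGPU_DEVICE void load_to_reg_sketch(input_it input, int tid, int count,
  type_t (&reg)[vt]) {
  #pragma unroll
  for(int i = 0; i < vt; ++i) {
    int index = nt * i + tid;                 // strided access across the CTA.
    if(index < count) reg[i] = input[index];  // pointer or generator alike.
  }
}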
template<typename launch_arg_t = empty_t, typename input_it,
  typename segments_it, typename output_it, typename op_t, typename type_t>
void segreduce(input_it input, int count, segments_it segments,
  int num_segments, output_it output, op_t op, type_t init,
  context_t& context) {
  ...
    merge_range_t merge_range = compute_merge_range(count, num_segments,
      cta, nt * vt, mp_data[cta], mp_data[cta + 1]);

    // Cooperatively load values from input into shared.
    mem_to_shared<nt, vt, vt0>(input + merge_range.a_begin, tid,
      merge_range.a_count(), shared.segreduce.values);
  ...
  };
  cta_launch<launch_t>(k_reduce, num_ctas, context);

  if(num_ctas > 1)
    detail::segreduce_fixup(output, carry_out_data, codes_data, num_ctas,
      op, init, context);
}
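To make the call pattern concrete, here is a hypothetical host-side driver for spmv (the sample matrix, vectors, and variable names are ours; standard_context_t, mem_t, to_mem, and from_mem are moderngpu's context and memory helpers):

#include <moderngpu/kernel_segreduce.hxx>
#include <moderngpu/memory.hxx>
#include <vector>
using namespace mgpu;

int main(int argc, char** argv) {
  standard_context_t context;

  // A 2x3 CSR matrix with 4 nonzeros. segments holds the offset of each
  // row's first nonzero: row 0 starts at 0, row 1 starts at 2.
  std::vector<float> values_host   { 1, 2, 3, 4 };
  std::vector<int>   columns_host  { 0, 2, 1, 2 };
  std::vector<int>   segments_host { 0, 2 };
  std::vector<float> x_host        { 1, 1, 1 };

  mem_t<float> values   = to_mem(values_host, context);
  mem_t<int>   columns  = to_mem(columns_host, context);
  mem_t<int>   segments = to_mem(segments_host, context);
  mem_t<float> x        = to_mem(x_host, context);
  mem_t<float> y(2, context);   // one output per row (segment).

  spmv(values.data(), columns.data(), x.data(), 4,
    segments.data(), 2, y.data(), context);

  std::vector<float> result = from_mem(y);   // expect { 3, 7 }.
  return 0;
}

Each row of the output is a segmented reduction over that row's nonzeros: y[0] = 1*x[0] + 2*x[2] = 3 and y[1] = 3*x[1] + 4*x[2] = 7.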