Complete code for the example below is available here: https://github.com/moderngpu/moderngpu/blob/master/src/moderngpu/kernel_segreduce.hxx This is meant to facilitate the discussion of viewing streams as iterators, and I thought moderngpu had a nice example of it, in contrast to how iterators are used within CUTLASS.
Notice how the lambda below generates the values that are later consumed by transform_segreduce.
template<typename launch_arg_t = empty_t, typename matrix_it,
  typename columns_it, typename vector_it, typename segments_it,
  typename output_it>
void spmv(matrix_it matrix, columns_it columns, vector_it vector,
  int count, segments_it segments, int num_segments, output_it output,
  context_t& context) {

  typedef typename std::iterator_traits<matrix_it>::value_type type_t;

  transform_segreduce<launch_arg_t>([=]MGPU_DEVICE(int index) {
    return matrix[index] * ldg(vector + columns[index]);    // sparse m * v.
  }, count, segments, num_segments, output, plus_t<type_t>(),
    (type_t)0, context);
}
The key element here is the make_load_iterator wrapper around the lambda function f, which yields the type_t values we care about.
template<typename launch_arg_t = empty_t, typename func_t,
  typename segments_it, typename output_it, typename op_t, typename type_t>
void transform_segreduce(func_t f, int count, segments_it segments,
  int num_segments, output_it output, op_t op, type_t init,
  context_t& context) {

  segreduce<launch_arg_t>(make_load_iterator<type_t>(f), count, segments,
    num_segments, output, op, init, context);
}
Values produced by make_load_iterator<type_t>(f) are now accessible through an input iterator. Notice the load pattern appearing again in segreduce below, this time operating on that input iterator.
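Before the listing, here is a simplified view of what such a load looks like from the consuming side (a sketch loosely modeled on moderngpu's mem_to_reg/mem_to_shared helpers; the name and exact shape are ours). The point is that the loop only ever subscripts its input, so a raw pointer and a lambda-backed load iterator take exactly the same path:

// Sketch of a cooperative strided load into registers, assuming
// moderngpu's MGPU_DEVICE macro; load_to_reg_sketch is a hypothetical name.
template<int nt, int vt, typename input_it, typename type_t>
MGPU_DEVICE void load_to_reg_sketch(input_it input, int tid, int count,
  type_t (&reg)[vt]) {
  #pragma unroll
  for(int i = 0; i < vt; ++i) {
    int index = nt * i + tid;                 // strided access across the CTA.
    if(index < count) reg[i] = input[index];  // pointer or generator alike.
  }
}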
template<typename launch_arg_t = empty_t, typename input_it,
  typename segments_it, typename output_it, typename op_t, typename type_t>
void segreduce(input_it input, int count, segments_it segments,
  int num_segments, output_it output, op_t op, type_t init,
  context_t& context) {
  ...
    merge_range_t merge_range = compute_merge_range(count, num_segments,
      cta, nt * vt, mp_data[cta], mp_data[cta + 1]);

    // Cooperatively load values from input into shared.
    mem_to_shared<nt, vt, vt0>(input + merge_range.a_begin, tid,
      merge_range.a_count(), shared.segreduce.values);
  ...
  };
  cta_launch<launch_t>(k_reduce, num_ctas, context);

  if(num_ctas > 1)
    detail::segreduce_fixup(output, carry_out_data, codes_data, num_ctas,
      op, init, context);
}
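To make the call pattern concrete, here is a hypothetical host-side driver for spmv (the sample matrix, vectors, and variable names are ours; standard_context_t, mem_t, to_mem, and from_mem are moderngpu's context and memory helpers):

#include <moderngpu/kernel_segreduce.hxx>
#include <moderngpu/memory.hxx>
#include <vector>
using namespace mgpu;

int main(int argc, char** argv) {
  standard_context_t context;

  // A 2x3 CSR matrix with 4 nonzeros. segments holds the offset of each
  // row's first nonzero: row 0 starts at 0, row 1 starts at 2.
  std::vector<float> values_host   { 1, 2, 3, 4 };
  std::vector<int>   columns_host  { 0, 2, 1, 2 };
  std::vector<int>   segments_host { 0, 2 };
  std::vector<float> x_host        { 1, 1, 1 };

  mem_t<float> values   = to_mem(values_host, context);
  mem_t<int>   columns  = to_mem(columns_host, context);
  mem_t<int>   segments = to_mem(segments_host, context);
  mem_t<float> x        = to_mem(x_host, context);
  mem_t<float> y(2, context);   // one output per row (segment).

  spmv(values.data(), columns.data(), x.data(), 4,
    segments.data(), 2, y.data(), context);

  std::vector<float> result = from_mem(y);   // expect { 3, 7 }.
  return 0;
}

Each row of the output is a segmented reduction over that row's nonzeros: y[0] = 1*x[0] + 2*x[2] = 3 and y[1] = 3*x[1] + 4*x[2] = 7.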