Brainstorming notes on the combined use of mdspan and SYCL, based on an example from Ronan's CppCon21 presentation, slide #56.
#include <iostream>
#include <sycl/sycl.hpp>

int main() {
  sycl::buffer<int> v{10};
  auto run = [&](auto sel, auto work) {
    sycl::queue{sel}.submit([&](auto& h) {
      auto a = sycl::accessor{v, h};
      h.parallel_for(a.size(), [=](auto i) { work(i, a); });
    });
  };
  run(sycl::host_selector{},        [](auto i, auto a) { a[i] = i; });        // CPU
  run(sycl::accelerator_selector{}, [](auto i, auto a) { a[i] = 2 * a[i]; }); // FPGA
  run(sycl::gpu_selector{},         [](auto i, auto a) { a[i] = a[i] + 3; }); // GPU
  sycl::host_accessor acc{v};
  for (std::size_t i = 0; i != v.size(); ++i)
    std::cout << acc[i] << " , ";
  std::cout << std::endl;
}
The code above creates SYCL queues that run on the CPU, the GPU, and an FPGA. However, since the work submitted to each queue is unambiguous and admits only a single implementation, we can use mdspan instead. Below, the code is converted into a hypothetical combined implementation of mdspan and SYCL.
#include <iostream>
#include <mdspan>
#include <sycl/sycl.hpp>

int main() {
  mdspan::md<sycl::cpu, int>  dcpu(10, 1);
  mdspan::md<sycl::fpga, int> dfpga(10, 1);
  mdspan::md<sycl::gpu, int>  dgpu(10, 1);

  dcpu  = mdspan::arange(1, dcpu.shape[0]); // CPU
  dfpga = dcpu * 2;                         // FPGA
  dgpu  = dfpga + 3;                        // GPU

  for (int i = 0; i < dgpu.shape[0]; i++)
    std::cout << dgpu[i] << " , ";
  std::cout << std::endl;
}
The code above is more concise, and its intent is clear:
mdspan::md<sycl::cpu, int> dcpu(10,1);
The mdspan would take two template parameters: one selecting the accelerator and one giving the element type. The array extents are passed to the constructor, much as a size is passed to a vector. In the case above, d(10,1) is a vector; d(10,10) would be a square matrix, and d(10,10,10) a rank-3 array.
dcpu = mdspan::arange(1, dcpu.shape[0]);
This is similar to np.arange() in NumPy and to Matlab's d = 1:10. Under the hood, this would submit a SYCL queue that performs d[i] = i. Notice that arange() always means d[i] = i, and there is no ambiguity about this across languages. Other functions could be defined, such as mdspan::fill(n), which means d[i] = n.
dfpga = dcpu * 2; // fpga
The line above would automatically transfer dcpu from system memory to the FPGA's memory and perform the work on the FPGA. The result stays on the FPGA until another transfer to the CPU or GPU is invoked.
dgpu = dfpga + 3; //gpu
This transfer occurs between the FPGA and the GPU. It could go through system memory as an intermediate step, but that would be part of the SYCL implementation, not the language itself. Once the work completes on the GPU, the results stay there.
std::cout << dgpu[i] << " , " ;
Printing with std::cout implies that dgpu must be transferred to system memory, so a GPU-to-CPU transfer is invoked.
Some of these operations are inspired by NVIDIA's Thrust library.
The mdspan layer makes the code simpler and the intent clear for numeric computations. The SYCL implementations would move under the hood and be used by library developers to implement custom algorithms, in the style of the standard algorithms library. In conclusion:
- mdspan provides the numeric implementation
- SYCL provides the algorithmic implementation
- LLVM, oneAPI, etc. provide the hardware-specific implementations