Brainstorming notes on the combined use of mdspan and SYCL, based on an example from Ronan's CppCon21 presentation, slide #56.
#include <iostream>
#include <sycl/sycl.hpp>

int main() {
  sycl::buffer<int> v{10};
  auto run = [&](auto sel, auto work) {
    sycl::queue{sel}.submit([&](auto& h) {
      auto a = sycl::accessor{v, h};
      h.parallel_for(a.size(), [=](auto i) { work(i, a); });
    });
  };
  run(sycl::host_selector{},        [](auto i, auto a) { a[i] = i; });        // CPU
  run(sycl::accelerator_selector{}, [](auto i, auto a) { a[i] = 2 * a[i]; }); // FPGA
  run(sycl::gpu_selector{},         [](auto i, auto a) { a[i] = a[i] + 3; }); // GPU
  sycl::host_accessor acc{v};
  for (std::size_t i = 0; i != v.size(); ++i)
    std::cout << acc[i] << " , ";
  std::cout << std::endl;
}
The code above creates SYCL queues that run on the CPU, the GPU, and an FPGA. However, since the work submitted to each queue is unambiguous and admits only a single implementation, we can use mdspan instead. Below, the code is converted into a hypothetical combined implementation of mdspan and SYCL.
#include <iostream>
#include <mdspan>
#include <sycl/sycl.hpp>

int main() {
  mdspan::md<sycl::cpu, int>  dcpu(10, 1);
  mdspan::md<sycl::fpga, int> dfpga(10, 1);
  mdspan::md<sycl::gpu, int>  dgpu(10, 1);

  dcpu  = mdspan::arange(1, dcpu.shape[0]); // CPU
  dfpga = dcpu * 2;                         // FPGA
  dgpu  = dfpga + 3;                        // GPU

  for (int i = 0; i < dgpu.shape[0]; i++)
    std::cout << dgpu[i] << " , ";
  std::cout << std::endl;
}
The code above is more concise, and its intent is clear:
mdspan::md<sycl::cpu, int> dcpu(10,1);
The mdspan would take two template parameters: one selecting the accelerator and one giving the element type. The array extents are passed to the constructor, much as a size is passed to a vector. In the case above, d(10,1) is a vector; d(10,10) would be a square matrix, and d(10,10,10) a rank-3 array.
dcpu = mdspan::arange(1, dcpu.shape[0]);
This is similar to np.arange() in NumPy and to Matlab's d = 1:10. Under the hood, this would submit a SYCL queue that performs d[i] = i. Notice that arange() always means d[i] = i, and there is no ambiguity about this across languages. Other functions could be defined, such as mdspan::fill(n), which means d[i] = n.
dfpga = dcpu * 2; // fpga
The line above would automatically transfer dcpu from system memory to the FPGA's memory and perform the work on the FPGA. The result stays on the FPGA until another transfer to the CPU or GPU is invoked.
dgpu = dfpga + 3; //gpu
This transfer occurs between the FPGA and the GPU. It could go through system memory as an intermediate step, but that would be part of the SYCL implementation, not the language itself. Once the work completes on the GPU, the results stay there.
std::cout << dgpu[i] << " , " ;
Printing with std::cout implies that dgpu must be transferred to system memory, so a GPU-to-CPU transfer is invoked.
Some of these operations are inspired by NVIDIA's Thrust library.
The mdspan layer makes the code simpler and the intent clear for numeric computations. The SYCL implementations would move under the hood and be used by library developers to implement custom algorithms, in the style of the standard algorithms library. In conclusion:
- mdspan provides the numeric implementation
- SYCL provides the algorithmic implementation
- LLVM, oneAPI, etc. provide the hardware-specific implementations