When making an algorithm I at some point wanted to measure just how better it is than the serial method. With SIMD you can process multiple elements of an array in parallel at the granularity of the size of the vector registers. So if I have an algorithm that can process 16
,8
,4
,2
,1
elements in parallel, what is the most optimal way I could fire off each algorithm to process an array? Since this is basically a greedy algorithm, divide N
(the size of the array) by 16
, and then divide whats left-over by 8
, and divide whats left-over by 4
, and so on until you've touched every part of the array. So you'd fire off the 16
algorithm one as much as you can, before having to resort to the smaller ones, and eventually reaching 1
where you're processing one element at a time.
Say you have an array of 91
elements, and can onl