@pavanky
Created April 17, 2012 17:54
ArrayFire vs Thrust
To be frank, this is going to be
- Criticism of Thrust
- Showing off ArrayFire (of which I am a core developer)
*Criticism of Thrust*
Thrust does a good job of optimizing parallel algorithms for vector inputs.
It uses data-level parallelism (among other things) to parallelize algorithms that work really well on a single large vector input.
But it fails to take the next step and handle the other kind of data-level parallelism, i.e. a large number of small problems.
This second case is useful in many real-world applications, and ArrayFire provides solutions for it (see gfor[1], a parallel for loop).
*Plug for ArrayFire[2]*
With Thrust, what should have been a simple call to a reduction and a scan instead becomes 4 algorithms (one of which is a costly sort) and 3 memory copies.
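To make the cost of a sort-based approach concrete, here is a CPU-only sketch in standard C++ of what such a pipeline could look like: sort, count runs of equal keys, scatter the counts, then scan. The exact Thrust calls live in [3]; the four steps, the function name offsets_via_sort, and the intermediate buffers below are assumptions for illustration, not a copy of that code.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical CPU analogue of a sort-based counting pipeline.
// Returns the inclusive running total of particles per cell.
std::vector<int> offsets_via_sort(const std::vector<int>& cell_indices, int num_cells) {
    // Step 1: sort a copy of the cell indices so equal cells become adjacent.
    std::vector<int> sorted(cell_indices);
    std::sort(sorted.begin(), sorted.end());

    // Step 2: reduce each run of equal keys to a (cell, count) pair.
    std::vector<int> keys, run_lengths;
    for (std::size_t i = 0; i < sorted.size();) {
        std::size_t j = i;
        while (j < sorted.size() && sorted[j] == sorted[i]) ++j;
        keys.push_back(sorted[i]);
        run_lengths.push_back(static_cast<int>(j - i));
        i = j;
    }

    // Step 3: scatter the run lengths into a dense per-cell array.
    // Cells that hold no particles keep a count of 0.
    std::vector<int> particle_counts(num_cells, 0);
    for (std::size_t k = 0; k < keys.size(); ++k)
        particle_counts[keys[k]] = run_lengths[k];

    // Step 4: inclusive scan of the per-cell counts.
    std::vector<int> particle_offsets(num_cells);
    std::partial_sum(particle_counts.begin(), particle_counts.end(),
                     particle_offsets.begin());
    return particle_offsets;
}
```

Note how the sort and the intermediate key/count buffers exist only to work around the lack of a direct per-cell reduction.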
Here is how the code works in ArrayFire:
array cell_indices(num_particles, 1, dev_particle_cell_indices, afDevicePointer);
array particle_counts = zeros(num_cells);
gfor(array i, num_cells) // Parallel for loop
    particle_counts(i) = sum(cell_indices == i);
array particle_offsets = accum(particle_counts); // Inclusive sum
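For readers without ArrayFire at hand, the following is a serial reference in standard C++ of what the snippet above computes: one count per cell, followed by an inclusive running sum. The struct CellTable and the function count_and_accumulate are hypothetical names introduced here; only the variable names come from the snippet.

```cpp
#include <numeric>
#include <vector>

// Hypothetical container for the two result arrays.
struct CellTable {
    std::vector<int> particle_counts;   // particles in each cell
    std::vector<int> particle_offsets;  // inclusive running sum of the counts
};

CellTable count_and_accumulate(const std::vector<int>& cell_indices, int num_cells) {
    CellTable t;
    t.particle_counts.assign(num_cells, 0);

    // Every iteration of the outer loop is independent of the others.
    // This is the loop that gfor expresses as a parallel for loop.
    for (int i = 0; i < num_cells; ++i) {
        int count = 0;
        for (int c : cell_indices)
            count += (c == i);  // serial form of sum(cell_indices == i)
        t.particle_counts[i] = count;
    }

    // Serial form of accum(particle_counts): an inclusive sum.
    t.particle_offsets.resize(num_cells);
    std::partial_sum(t.particle_counts.begin(), t.particle_counts.end(),
                     t.particle_offsets.begin());
    return t;
}
```

Each cell is a small, independent reduction, which is exactly the "large number of small problems" shape discussed earlier.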
--
Setup
I am using talonmies' code[3] to benchmark against ArrayFire.
I am using a similar graphics card (GTS 360M) on 64-bit Linux (CUDA 4.1 / gcc 4.7).
You can find the full benchmark code over here[4].
Benchmark 1
With num_particles = 2000 and num_cells = 1500 (like the original problem)
$ ./a.out
Thrust time taken: 0.002384
ArrayFire time taken: 0.000131
ArrayFire is 18 times faster
Benchmark 2
With num_particles = 10000 and num_cells = 2000 (like talonmies' test case)
$ ./a.out
Thrust time taken: 0.002920
ArrayFire time taken: 0.000132
ArrayFire is 22 times faster
Benchmark 3
With num_particles = 50000 and num_cells = 5000 (just a larger test case)
$ ./a.out
Thrust time taken: 0.003596
ArrayFire time taken: 0.000157
ArrayFire is 23 times faster
Notes
Thrust requires you to rewrite your code
Thrust provides a speedup of ~320x over the original code
ArrayFire requires little rewriting of code (change for to gfor)
ArrayFire is 18-23 times faster than Thrust (effectively up to ~7300x over the original code)
ArrayFire scales better (from benchmark 1 to benchmark 3, run time increased by 50% for Thrust and 20% for ArrayFire)
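The ratios in these notes can be rechecked from the timings printed in the three benchmarks. The sketch below does that arithmetic in standard C++; the Run struct and the helper names are hypothetical, the timing values are copied from the output above, and the 320x factor is the Thrust speedup quoted in the notes.

```cpp
#include <cstdio>

// One benchmark run, with timings copied from the output above.
struct Run {
    int num_particles;
    int num_cells;
    double thrust_time;
    double arrayfire_time;
};

static const Run kRuns[] = {
    {2000, 1500, 0.002384, 0.000131},
    {10000, 2000, 0.002920, 0.000132},
    {50000, 5000, 0.003596, 0.000157},
};

// How many times faster ArrayFire was than Thrust in a given run.
double speedup(const Run& r) { return r.thrust_time / r.arrayfire_time; }

// Speedup over the original code, assuming Thrust alone gives ~320x.
double effective_speedup(const Run& r) { return 320.0 * speedup(r); }

// Percentage increase in run time from the first value to the last.
double growth_percent(double first, double last) {
    return 100.0 * (last - first) / first;
}

void print_summary() {
    for (const Run& r : kRuns)
        std::printf("%d particles, %d cells: %.1fx faster, ~%.0fx over original\n",
                    r.num_particles, r.num_cells, speedup(r), effective_speedup(r));
    std::printf("Thrust run time growth: %.1f%%\n",
                growth_percent(kRuns[0].thrust_time, kRuns[2].thrust_time));
    std::printf("ArrayFire run time growth: %.1f%%\n",
                growth_percent(kRuns[0].arrayfire_time, kRuns[2].arrayfire_time));
}
```

Calling print_summary() from main prints the per-benchmark speedups and the growth in run time between the smallest and largest test case.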
Conclusion
Thrust does indeed provide a decent speedup if you can rewrite your problem.
But this is not always feasible and is nontrivial for more complex problems.
The numbers indicate there is scope for much higher performance (because of the high degree of data parallelism), which is simply not being achieved in Thrust.
ArrayFire utilizes the parallel resources much more efficiently, and the times indicate that the GPU is still not saturated.
You may want to write your own custom CUDA code or use ArrayFire.
I just wanted to point out that sometimes using Thrust is not an option because it is practically useless for a large number of small problems.
[1]: http://www.accelereyes.com/arrayfire/c/page_gfor.htm
[2]: http://www.accelereyes.com/arrayfire/c/
[3]: http://stackoverflow.com/a/10162898/535516
[4]: https://gist.github.com/2396436