Created
April 17, 2012 17:54
ArrayFire vs Thrust
I have to be frank here: this is going to be
- Criticism of Thrust
- Showing off ArrayFire (of which I am a core developer)
*Criticism of Thrust*
Thrust does a good job of optimizing parallel algorithms for vector inputs.
It uses data-level parallelism (among other things) to parallelize algorithms, which works really well for large vector inputs.
But it stops short of going all the way to the other kind of data-level parallelism, i.e. a large number of small problems.
This second case is useful in many real-world applications, and ArrayFire provides solutions for it (look at gfor[1], a parallel for loop).
*Plug for ArrayFire[2]*
With Thrust, what should have been a simple call to a reduction and a scan instead becomes 4 algorithms (one of which is a costly sort) and 3 memory copies.
Here is how the code works in ArrayFire:
array cell_indices(num_particles, 1, dev_particle_cell_indices, afDevicePointer);
array particle_counts = zeros(num_cells);
gfor(array i, num_cells) // Parallel for loop
    particle_counts(i) = sum(cell_indices == i);
array particle_offsets = accum(particle_counts); // Inclusive sum
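For reference, the result these lines compute can be written as plain serial C++. This is only a CPU sketch for checking outputs, not ArrayFire or Thrust code; the function names `count_particles_per_cell` and `inclusive_offsets` are made up for illustration, and it assumes every cell index lies in [0, num_cells). It also counts in a single pass rather than doing one reduction per cell as the gfor loop does.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// CPU reference (hypothetical helper): counts[c] is the number of particles
// whose cell index equals c.
std::vector<int> count_particles_per_cell(const std::vector<int>& cell_indices,
                                          std::size_t num_cells) {
    std::vector<int> counts(num_cells, 0);
    for (int c : cell_indices) {
        // Assumption: every index is a valid cell number.
        assert(c >= 0 && static_cast<std::size_t>(c) < num_cells);
        ++counts[static_cast<std::size_t>(c)];
    }
    return counts;
}

// Inclusive prefix sum: offsets[i] = counts[0] + ... + counts[i].
std::vector<int> inclusive_offsets(const std::vector<int>& counts) {
    std::vector<int> offsets(counts.size());
    std::partial_sum(counts.begin(), counts.end(), offsets.begin());
    return offsets;
}
```

Comparing the GPU results against a reference like this on a small input is a cheap way to confirm both implementations agree before timing them.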
--
Setup
I am using talonmies' code[3] as the Thrust baseline to benchmark against ArrayFire.
I am using a similar graphics card (GTS 360M) on 64-bit Linux (CUDA 4.1 / GCC 4.7).
You can find the full benchmark code here[4].
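A timing harness for this kind of comparison could look roughly like the sketch below. It is not taken from the benchmark code[4]; `average_seconds` is a hypothetical helper, and averaging over several runs is an assumption about how one might reduce noise. Because CUDA kernel launches are asynchronous, the callable passed in has to block until the device has finished, otherwise only the launch overhead is measured.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper (not from the benchmark source): runs `work` `repeats`
// times and returns the average wall-clock time per run in seconds.
template <typename Work>
double average_seconds(Work&& work, int repeats) {
    assert(repeats > 0);
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < repeats; ++i) {
        work();  // for GPU code, `work` must wait for the device to finish
    }
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double> elapsed = stop - start;
    return elapsed.count() / repeats;
}
```

A steady clock is used so that system clock adjustments during the run cannot distort the measurement.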
Benchmark 1
With num_particles = 2000 and num_cells = 1500 (like the original problem)
$ ./a.out
Thrust time taken: 0.002384
ArrayFire time taken: 0.000131
ArrayFire is 18 times faster
Benchmark 2
With num_particles = 10000 and num_cells = 2000 (like talonmies' test case)
$ ./a.out
Thrust time taken: 0.002920
ArrayFire time taken: 0.000132
ArrayFire is 22 times faster
Benchmark 3
With num_particles = 50000 and num_cells = 5000 (just a larger test case)
$ ./a.out
Thrust time taken: 0.003596
ArrayFire time taken: 0.000157
ArrayFire is 23 times faster
Notes
- Thrust requires you to rewrite your code
- Thrust provides a speedup of ~320x over the original code
- ArrayFire requires little rewriting of code (change for to gfor)
- ArrayFire is 18-23 times faster than Thrust (effectively up to ~7300x over the original code)
- ArrayFire scales better (from Benchmark 1 to Benchmark 3, run time increased by about 50% for Thrust and about 20% for ArrayFire)
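The ratios in these notes can be rechecked from the timings printed in the three benchmarks. The sketch below uses two hypothetical helpers, `speedup` and `percent_increase`, which are not part of the benchmark code. Fed the printed times, they give speedups of roughly 18.2, 22.1 and 22.9, and run-time growth from Benchmark 1 to Benchmark 3 of roughly 51% for Thrust and 20% for ArrayFire.

```cpp
#include <cassert>

// How many times faster the candidate is than the baseline.
double speedup(double baseline_seconds, double candidate_seconds) {
    assert(candidate_seconds > 0.0);
    return baseline_seconds / candidate_seconds;
}

// Percentage increase in run time going from the first to the last benchmark.
double percent_increase(double first_seconds, double last_seconds) {
    assert(first_seconds > 0.0);
    return (last_seconds / first_seconds - 1.0) * 100.0;
}
```

For example, `speedup(0.002384, 0.000131)` reproduces the Benchmark 1 figure, and `percent_increase(0.002384, 0.003596)` gives the Thrust growth between the smallest and largest test case.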
Conclusion
Thrust does indeed provide a decent speedup if you can rewrite your problem.
But this is not always feasible, and it is non-trivial for more complex problems.
The numbers indicate there is scope for much higher performance (because of the high degree of data parallelism), which is simply not being achieved in Thrust.
ArrayFire utilizes the parallel resources in a much more efficient manner, and the times indicate that the GPU is still not saturated.
You may want to write your own custom CUDA code or use ArrayFire.
I just wanted to point out that sometimes using Thrust is not an option because it is practically useless for a large number of small problems.
[1]: http://www.accelereyes.com/arrayfire/c/page_gfor.htm
[2]: http://www.accelereyes.com/arrayfire/c/
[3]: http://stackoverflow.com/a/10162898/535516
[4]: https://gist.github.com/2396436