pavanky/sotext

## sotext
I have to be frank here: this answer is going to be

- a criticism of Thrust
- a plug for ArrayFire (of which I am a core developer)

*Criticism of Thrust*

Thrust does a good job of optimizing parallel algorithms for vector inputs.
It uses data-level parallelism (among other things) to parallelize algorithms that work really well on large vector inputs.
But it stops short of true data-level parallelism, i.e. a large number of small problems solved at the same time.

This second case is useful in many real-world applications, and ArrayFire provides solutions for it (see gfor[1], a parallel for loop).

*Plug for ArrayFire[2]*

With Thrust, what should have been a simple reduction followed by a scan instead becomes four algorithm calls (one of which is a costly sort) and three memory copies.
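To make the four-pass shape concrete, here is a rough CPU-only sketch in plain C++. It is not the Thrust code from [3]: the function name `offsets_via_sort` and the exact split into passes (sort, reduce by key, scatter, inclusive scan) are assumptions made for illustration, so treat [3] as the authority on what the Thrust version really does.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical CPU analogue of a sort-based "count per cell, then scan".
// keys is taken by value so the caller's particle order is left untouched.
std::vector<int> offsets_via_sort(std::vector<int> keys, int num_cells) {
    assert(num_cells >= 0);

    // Pass 1: sort, so that equal cell indices become adjacent.
    std::sort(keys.begin(), keys.end());

    // Pass 2: reduce by key, collapsing each run of equal keys
    // into a (key, run length) pair.
    std::vector<int> unique_keys;
    std::vector<int> run_lengths;
    for (std::size_t i = 0; i < keys.size();) {
        std::size_t j = i;
        while (j < keys.size() && keys[j] == keys[i]) ++j;
        unique_keys.push_back(keys[i]);
        run_lengths.push_back(static_cast<int>(j - i));
        i = j;
    }

    // Pass 3: scatter the run lengths into a dense per-cell array.
    // Cells without particles keep a count of 0.
    std::vector<int> counts(static_cast<std::size_t>(num_cells), 0);
    for (std::size_t k = 0; k < unique_keys.size(); ++k) {
        assert(unique_keys[k] >= 0 && unique_keys[k] < num_cells);
        counts[static_cast<std::size_t>(unique_keys[k])] = run_lengths[k];
    }

    // Pass 4: inclusive scan turns counts into offsets (in place).
    std::partial_sum(counts.begin(), counts.end(), counts.begin());
    return counts;
}
```

Each pass touches the whole data set once more, which is where the extra cost relative to a single reduction plus a scan comes from.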

Here is how the code works in ArrayFire:

    array cell_indices(num_particles, 1, dev_particle_cell_indices, afDevicePointer);
    array particle_counts = zeros(num_cells);

    gfor(array i, num_cells) // Parallel for loop
        particle_counts(i) = sum(cell_indices == i);

    array particle_offsets = accum(particle_counts); // Inclusive sum
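For readers without ArrayFire at hand, the same computation can be sketched serially in standard C++. This is only a reference for what the snippet above computes, not how ArrayFire implements it; the helper names `count_per_cell` and `inclusive_offsets` are made up for this example. Each loop iteration is an independent small reduction, which is exactly the "large number of small problems" shape that gfor runs in parallel.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Serial counterpart of the gfor loop:
// counts[i] = number of particles whose cell index equals i.
std::vector<int> count_per_cell(const std::vector<int>& cell_indices, int num_cells) {
    assert(num_cells >= 0);
    std::vector<int> counts(static_cast<std::size_t>(num_cells), 0);
    for (int i = 0; i < num_cells; ++i) {
        // One small, independent reduction per cell.
        counts[static_cast<std::size_t>(i)] = static_cast<int>(
            std::count(cell_indices.begin(), cell_indices.end(), i));
    }
    return counts;
}

// Serial counterpart of the inclusive sum over the counts.
std::vector<int> inclusive_offsets(const std::vector<int>& counts) {
    std::vector<int> offsets(counts.size());
    std::partial_sum(counts.begin(), counts.end(), offsets.begin());
    return offsets;
}
```

Calling `inclusive_offsets(count_per_cell(indices, num_cells))` gives a result that the GPU output can be checked against on small inputs.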

--

Setup

- I am using talonmies' code[3] to benchmark against ArrayFire.
- I am using a similar graphics card (GTS 360M) on 64-bit Linux (CUDA 4.1 / GCC 4.7).
- You can find the full benchmark code here[4].

Benchmark 1

With num_particles = 2000 and num_cells = 1500 (like the original problem)

    $ ./a.out
    Thrust time taken: 0.002384
    ArrayFire time taken: 0.000131

ArrayFire is 18 times faster

Benchmark 2

With num_particles = 10000 and num_cells = 2000 (like talonmies' test case)

    $ ./a.out
    Thrust time taken: 0.002920
    ArrayFire time taken: 0.000132

ArrayFire is 22 times faster

Benchmark 3

With num_particles = 50000 and num_cells = 5000 (just a larger test case)

    $ ./a.out
    Thrust time taken: 0.003596
    ArrayFire time taken: 0.000157

ArrayFire is 23 times faster

Notes

- Thrust requires you to rewrite your code.
- Thrust provides a speed-up of ~320x over the original code.
- ArrayFire requires little rewriting of code (change for to gfor).
- ArrayFire is 18-23 times faster than Thrust (effectively up to ~7300x over the original code).
- ArrayFire scales better: from Benchmark 1 to Benchmark 3, run time increased by about 50% for Thrust and about 20% for ArrayFire.
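The ratios in these notes can be recomputed from the timings printed in the benchmarks. The sketch below is plain arithmetic in C++; the helper names `speedup` and `growth_percent` are hypothetical and exist only for this example.

```cpp
#include <cassert>

// Ratio of two timings (baseline / candidate).
// A value above 1 means the candidate is faster.
double speedup(double baseline_s, double candidate_s) {
    assert(candidate_s > 0.0);
    return baseline_s / candidate_s;
}

// Percentage increase in run time between two test cases.
double growth_percent(double first_s, double last_s) {
    assert(first_s > 0.0);
    return (last_s / first_s - 1.0) * 100.0;
}
```

For example, `speedup(0.002384, 0.000131)` gives the Benchmark 1 ratio, and `growth_percent(0.000131, 0.000157)` gives the growth in ArrayFire's run time from Benchmark 1 to Benchmark 3.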

Conclusion

Thrust does indeed provide a decent speed-up if you can rewrite your problem.
But this is not always feasible, and it is non-trivial for more complex problems.
The numbers indicate there is scope for much higher performance (because of the high degree of data parallelism), which is simply not being achieved in Thrust.

ArrayFire utilizes the parallel resources in a much more efficient manner, and its run time barely changes as the problem grows, which suggests the GPU is still not saturated.

You may want to write your own custom CUDA code or use ArrayFire.
I just wanted to point out that sometimes Thrust is not an option, because it is practically useless for a large number of small problems.

  [1]: http://www.accelereyes.com/arrayfire/c/page_gfor.htm
  [2]: http://www.accelereyes.com/arrayfire/c/
  [3]: http://stackoverflow.com/a/10162898/535516
  [4]: https://gist.github.com/2396436