dhermes/non_branching_max.txt

## non_branching_max.txt
http://stackoverflow.com/questions/427477/fastest-way-to-clamp-a-real-fixed-floating-point-value
https://devtalk.nvidia.com/default/topic/514408/min-max-and-sign-functions-in-cuda-do-they-exist-if-so-where-/

https://en.wikipedia.org/wiki/Algorithm_%28C%2B%2B%29
https://en.wikipedia.org/wiki/C_mathematical_functions#Overview_of_functions
http://en.cppreference.com/w/c/numeric/math/fmax


$ find /usr/ | grep 'algorithm\.h$'
/usr/include/CGAL/algorithm.h


http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#mathematical-functions-appendix

Amazing! Using fmin and fmax cut my computation time from 4.1-4.2 ms
to 3.1-3.2 ms, and their use isn't the major part of the computations!


http://stackoverflow.com/questions/16584558/the-difference-between-max-and-fmax-cross-platform-compiling


The actual difference is that fmin and fmax are mathematical functions
working on floating-point numbers and originating from C99 (and might be
implemented intrinsically by specialized CPU instructions where possible),
while min and max are general algorithms usable on any type
supporting < (and are probably just a simple (b<a) ? b : a instead of a
floating-point instruction, though an implementation could even do that
with a specialization of min and max, but I doubt this).
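
One observable consequence of the distinction quoted above is NaN handling.
The sketch below assumes a C99 compiler without fast-math style flags.
max_ternary and compare_max_forms are hypothetical names made up for
illustration, not taken from any of the linked posts.

```c
#include <assert.h>
#include <math.h>

/* The comparison-based form from the quote, written here as a max. */
static double max_ternary(double a, double b) {
    return (b < a) ? a : b;
}

/* Illustrative checks; assumes IEEE 754 doubles and no -ffast-math. */
static void compare_max_forms(void) {
    /* On ordinary values both forms agree. */
    assert(max_ternary(1.0, 2.0) == 2.0);
    assert(fmax(1.0, 2.0) == 2.0);

    /* fmax treats a single NaN as missing data and returns the other operand. */
    assert(fmax(NAN, 1.0) == 1.0);
    assert(fmax(1.0, NAN) == 1.0);

    /* Any comparison with NaN is false, so the ternary always falls through to b. */
    assert(max_ternary(NAN, 1.0) == 1.0);
    assert(isnan(max_ternary(1.0, NAN)));
}
```

So the two forms are not interchangeable when NaN can appear: the ternary
result depends on argument order, fmax does not.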


http://gpuray.blogspot.com/2009/07/cuda-warps-and-branching.html


http://www.informit.com/articles/article.aspx?p=2103809&seqNum=4

Some conditional operations are so common that they are supported natively
by the hardware. Minimum and maximum operations are supported for both
integer and floating-point operands and are translated to a single
instruction. Additionally, floating-point instructions include modifiers
that can negate or take the absolute value of a source operand.

The compiler does a good job of detecting when min/max operations
are being expressed, but if you want to take no chances, call the
min()/max() intrinsics for integers or fmin()/fmax()
for floating-point values.


======================================================
https://devtalk.nvidia.com/default/topic/496548/are-max-a-b-and-min-a-b-divergent-/

The standard CPU implementation seems to be:
(b<a) ? a : b;
which is clearly divergent, but I'd like to know if CUDA does anything
clever to get around it.
======================================================
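
For reference, a max can be written with no conditional at all. The sketch
below uses a well-known bit-mask trick for 32-bit integers. It only
illustrates the idea of a non-branching max and is not a claim about what
CUDA or any particular compiler emits. All function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* The ternary form from the question above, for 32-bit integers. */
static int32_t max_ternary_i32(int32_t a, int32_t b) {
    return (b < a) ? a : b;
}

/* Mask-based max: no conditional expression in the source.
 * Assumes the usual wrapping unsigned-to-signed conversion. */
static int32_t max_branchless_i32(int32_t a, int32_t b) {
    uint32_t ua = (uint32_t)a;
    uint32_t ub = (uint32_t)b;
    /* mask is all ones when a < b, all zeros otherwise. */
    uint32_t mask = (uint32_t)0 - (uint32_t)(a < b);
    /* With mask set: ua ^ (ua ^ ub) == ub. With mask clear: ua. */
    return (int32_t)(ua ^ ((ua ^ ub) & mask));
}

/* Illustrative checks that both forms agree. */
static void check_max_i32(void) {
    assert(max_branchless_i32(3, 7) == max_ternary_i32(3, 7));
    assert(max_branchless_i32(-5, -9) == max_ternary_i32(-5, -9));
    assert(max_branchless_i32(4, 4) == 4);
}
```

In practice the ternary is usually the better choice to write, since
compilers commonly lower it to a select or a native min/max instruction
rather than a jump, but that is not guaranteed.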

http://stackoverflow.com/a/16659263/1068170

maxsd   %xmm0, %xmm1    # d, min
movapd  %xmm2, %xmm0    # max, max
minsd   %xmm1, %xmm0    # min, max
ret

maxsd   %xmm0, %xmm1
minsd   %xmm1, %xmm2
movaps  %xmm2, %xmm0
ret

GENERATED ASSEMBLY (sm_1x, sm_2x)
======================================================

https://gist.github.com/dhermes/c79846c6074b938b2e10