serge-rgb/Milton SSE, first pass

## Milton SSE, first pass
I have spent a lot of time doing algorithmic optimizations on Milton's
rasterizer.  I am sure there must be many more things that I can do, but right
now it is at a state where it is almost Good Enough. With multi-threading
enabled, the renderer is fast enough for comfortable use on any reasonably
modern CPU.

That said, I would like it to be as smooth as possible, so it is time to do
some SSE optimizations.

I determined the bottleneck by profiling with Very Sleepy; the rasterizer
spends most of its time calculating the closest point in a stroke to the pixel
being rendered (the pixel has already been converted to a point in the canvas).

* Optimizing nearest-point calculation:

I have set up Milton so that every run does the same job: it renders a
reasonably complex drawing of a dog =)

Cycle counts are measured with the rdtsc instruction. I'm taking a simple
arithmetic mean over the number of times the calculation is done.

Every measurement I write down is the value the average converges to.
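
The measurement described above could look roughly like the sketch below. It
is an illustration, not Milton's actual profiling code: the CycleStats struct
and the cycle_stats_* helpers are hypothetical names. The only real API used
is the __rdtsc() intrinsic, which both MSVC and GCC/Clang provide.

```c
#include <assert.h>
#include <stdint.h>
#if defined(_MSC_VER)
#include <intrin.h>     /* __rdtsc on MSVC */
#else
#include <x86intrin.h>  /* __rdtsc on GCC/Clang */
#endif

/* Hypothetical accumulator for a running arithmetic mean of cycle counts. */
typedef struct {
    uint64_t total_cycles;
    uint64_t num_samples;
} CycleStats;

/* Read the time stamp counter. */
static uint64_t cycles_now(void)
{
    return __rdtsc();
}

/* Record one timed region given its begin/end counter values. */
static void cycle_stats_add(CycleStats *s, uint64_t begin, uint64_t end)
{
    assert(end >= begin);
    s->total_cycles += end - begin;
    s->num_samples  += 1;
}

/* Mean cycles per sample; 0 when nothing has been recorded yet. */
static uint64_t cycle_stats_mean(const CycleStats *s)
{
    return s->num_samples ? s->total_cycles / s->num_samples : 0;
}
```

Usage would be to call cycles_now() right before and right after the
nearest-point calculation, pass both values to cycle_stats_add(), and print
cycle_stats_mean() once the render is done.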

Also, I am making sure that the machine state doesn't change too much. No
opening extra browser tabs or anything. The only thing that is changing is the
code.

Compiler: MSVC Visual Studio 2013

==============================
==== Run #1  : 627 cycles ====
==============================

There are a lot of abstractions in the code, so I am going to remove all
function calls and flatten it out.

===============================
==== Run #2  : ~400 cycles ====
===============================

(Don't have the cycle count, but it was about a 1.5x speedup)

I was sort of expecting a speed up like this after watching the Handmade Hero
SSE episodes. All I did was copy and paste a function instead of calling it and
convert vector structs into floats...

This is my first time doing this sort of optimization, so I am going to follow
the way Casey did it and go with baby steps: Load four points at a time and
loop, then substitute the loops with SIMD instructions.


================================
==== Run #2  : >1000 cycles   ====
================================

Wow... This took a while. I had a nasty bug that I could not solve. Anyways.
It's slow because everything is in 4-wide for loops, ready to add SIMD.
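
To illustrate what "everything in 4-wide for loops" means, here is a sketch of
a nearest-point computation staged that way. It assumes the stroke is made of
line segments and that the quantity of interest is the squared distance from
the pixel's canvas point to each segment; the log does not show the real code,
so the function name and the structure-of-arrays layout are assumptions.

```c
#include <assert.h>

#define LANES 4

/* Squared distance from point (px, py) to each of four segments a->b.
   Inputs are structure-of-arrays so that index i maps to SIMD lane i later.
   Every step is its own 4-wide loop, ready to be replaced by one intrinsic. */
static void dist2_to_segments_x4(float px, float py,
                                 const float ax[LANES], const float ay[LANES],
                                 const float bx[LANES], const float by[LANES],
                                 float out[LANES])
{
    float abx[LANES], aby[LANES], t[LANES];
    int i;

    /* Segment direction vectors. */
    for (i = 0; i < LANES; ++i) {
        abx[i] = bx[i] - ax[i];
        aby[i] = by[i] - ay[i];
    }
    /* Projection parameter of the point onto each (infinite) line. */
    for (i = 0; i < LANES; ++i) {
        float len2 = abx[i] * abx[i] + aby[i] * aby[i];
        assert(len2 > 0.0f);  /* degenerate segments are not handled here */
        t[i] = ((px - ax[i]) * abx[i] + (py - ay[i]) * aby[i]) / len2;
    }
    /* Clamp to the segment. This is the branchy part. */
    for (i = 0; i < LANES; ++i) {
        if (t[i] < 0.0f)      t[i] = 0.0f;
        else if (t[i] > 1.0f) t[i] = 1.0f;
    }
    /* Squared distance to the closest point on each segment. */
    for (i = 0; i < LANES; ++i) {
        float dx = px - (ax[i] + t[i] * abx[i]);
        float dy = py - (ay[i] + t[i] * aby[i]);
        out[i] = dx * dx + dy * dy;
    }
}
```

Written like this, the scalar version does extra loads and stores through the
temporary arrays, which is one plausible reason for the slowdown at this
intermediate step.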


================================
==== Run #3  : 350 cycles   ====
================================

A lot of the code has been SIMDfied, but there's still some tougher stuff to
tackle. I'm going to do some more fine-grained profiling to see how much faster
I can get. There's an instruction similar to rdtsc (rdtscp) that waits for
earlier instructions to finish before reading the counter. I probably need
that one...
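
The instruction alluded to here is most likely rdtscp, which waits for earlier
instructions to execute before it reads the counter. A common way to bracket a
small region is sketched below; the helper names are hypothetical, and the
lfence placement is one conventional choice rather than anything this log
confirms was used.

```c
#include <assert.h>
#include <stdint.h>
#include <emmintrin.h>  /* _mm_lfence (SSE2) */
#if defined(_MSC_VER)
#include <intrin.h>     /* __rdtsc, __rdtscp on MSVC */
#else
#include <x86intrin.h>  /* __rdtsc, __rdtscp on GCC/Clang */
#endif

/* Start of a timed region: the fence keeps earlier work from drifting
   past the counter read. */
static uint64_t cycles_begin(void)
{
    _mm_lfence();
    return __rdtsc();
}

/* End of a timed region: rdtscp waits for preceding instructions to execute;
   the trailing fence keeps later work from starting before the read. */
static uint64_t cycles_end(void)
{
    unsigned int aux;  /* receives IA32_TSC_AUX; unused here */
    uint64_t t = __rdtscp(&aux);
    _mm_lfence();
    return t;
}

/* Elapsed cycles for one region. */
static uint64_t cycles_elapsed(uint64_t begin, uint64_t end)
{
    assert(end >= begin);
    return end - begin;
}
```

The fences add a fixed overhead of their own, so this is better suited to
comparing versions of the same code than to reading absolute costs.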

Shaved it to ~346 by avoiding many floating point conversions. Going to bed

================================
==== Run #4  : 145 cycles   ====
================================

Good morning, world! The thing that I thought was tough was removing a branch.
Today started with the benefits of a fresh mind. Removing the branch was
fortunately easy. Down to 145 cycles! This is a 4.32x speedup over the
original 627. It is higher than 4x because of the extra boost that comes from
removing "heavy abstractions" like vector structs ;-)
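
The log does not say which branch was removed. One typical candidate in a
nearest-point routine is the clamp of the projection parameter to [0, 1],
which SSE can do without branching via _mm_max_ps and _mm_min_ps. The sketch
below assumes that case; the function names and the segment-based formulation
are illustrative, not taken from Milton.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Squared distance from (px, py) to four segments a->b at once.
   Each __m128 holds one coordinate for four segments (one per lane). */
static __m128 dist2_to_segments_sse(__m128 px, __m128 py,
                                    __m128 ax, __m128 ay,
                                    __m128 bx, __m128 by)
{
    __m128 abx  = _mm_sub_ps(bx, ax);
    __m128 aby  = _mm_sub_ps(by, ay);
    __m128 apx  = _mm_sub_ps(px, ax);
    __m128 apy  = _mm_sub_ps(py, ay);
    __m128 dot  = _mm_add_ps(_mm_mul_ps(apx, abx), _mm_mul_ps(apy, aby));
    __m128 len2 = _mm_add_ps(_mm_mul_ps(abx, abx), _mm_mul_ps(aby, aby));
    __m128 traw = _mm_div_ps(dot, len2);
    /* Branchless clamp: t = min(max(traw, 0), 1) in all four lanes. */
    __m128 t    = _mm_min_ps(_mm_max_ps(traw, _mm_setzero_ps()),
                             _mm_set1_ps(1.0f));
    /* Vector from the closest point on each segment to the query point. */
    __m128 dx   = _mm_sub_ps(apx, _mm_mul_ps(t, abx));
    __m128 dy   = _mm_sub_ps(apy, _mm_mul_ps(t, aby));
    return _mm_add_ps(_mm_mul_ps(dx, dx), _mm_mul_ps(dy, dy));
}

/* Convenience wrapper over plain float arrays (unaligned loads/stores). */
static void dist2_to_segments_x4_sse(float px, float py,
                                     const float ax[4], const float ay[4],
                                     const float bx[4], const float by[4],
                                     float out[4])
{
    assert(ax && ay && bx && by && out);
    _mm_storeu_ps(out, dist2_to_segments_sse(
        _mm_set1_ps(px), _mm_set1_ps(py),
        _mm_loadu_ps(ax), _mm_loadu_ps(ay),
        _mm_loadu_ps(bx), _mm_loadu_ps(by)));
}
```

For conditions that are not a simple clamp, the usual alternative is a compare
intrinsic such as _mm_cmplt_ps to build a mask, followed by _mm_and_ps and
_mm_andnot_ps to select between two results.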


The battle is not over, though.  We optimized the bottleneck, but the function
that contains it is still a hot spot. Let's measure the whole thing.

The whole function, which is called many times, averages 1.48 million cycles
with SSE disabled. With SSE: 587,000 cycles. The total speedup is about 2.5x.
We can do better.

Focusing on the per-pixel loop:

Instruction cost per loop:
    SSE Disabled : 1383 cycles
    SSE Enabled  :  492 cycles

This is good news. The speedup for the pixel loop is 2.81x; a little more
than the overall 2.5x for this function.

Of the 492 cycles, 112 are spent doing anti-aliasing and alpha blending. That
code is very flat, so I expect to get a 4x speedup.


=====================================
==== MSAA Run #0  : 112 cycles   ====
=====================================

Now to convert stuff. The tough part was learning the "move and shuffle" trick
to add all the 32-bit elements inside a __m128.
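
For reference, one common SSE1-only form of the "move and shuffle" horizontal
add is sketched below. It assumes the four elements are floats and that the
result is used to average four MSAA samples; the function names are
hypothetical, and the exact instruction sequence used in Milton is not shown
in this log.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Sum the four floats of v = [v0, v1, v2, v3]. */
static float hsum_ps(__m128 v)
{
    /* Move: copy the high pair down, giving [v2, v3, v2, v3]. */
    __m128 hi   = _mm_movehl_ps(v, v);
    /* Lane 0 = v0+v2, lane 1 = v1+v3. */
    __m128 sum2 = _mm_add_ps(v, hi);
    /* Shuffle: broadcast lane 1 so it lines up with lane 0. */
    __m128 odd  = _mm_shuffle_ps(sum2, sum2, _MM_SHUFFLE(1, 1, 1, 1));
    /* Lane 0 = (v0+v2) + (v1+v3). */
    __m128 sum  = _mm_add_ss(sum2, odd);
    return _mm_cvtss_f32(sum);
}

/* Average of four coverage samples, e.g. one pixel's 4x MSAA taps. */
static float average_samples(const float samples[4])
{
    assert(samples);
    return hsum_ps(_mm_loadu_ps(samples)) * 0.25f;
}
```

The reduction itself is a chain of dependent operations, so a horizontal add
at the end of otherwise 4-wide code gives much less than a 4x gain for that
step.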


=====================================
==== MSAA Run #1  : 89 cycles   ====
=====================================

A 1.25x speedup :( Not what I had in mind.

The cost per loop is now at 461 cycles.


Tally: (extra means MSAA)

NO_SSE:
      [MEASURE] render_canvas total: 32158656. Avg: 3561000299
      [MEASURE] total meta avg: 1445986 (1413970)
      [MEASURE] block render meta avg: 1348
      [MEASURE] extra render meta avg: 147
SSE:
      [MEASURE] render_canvas total: 17665380. Avg: 1590474304
      [MEASURE] total meta avg: 565998 (559989)
      [MEASURE] block render meta avg: 463
      [MEASURE] extra render meta avg: 88

Almost a 3x speedup in block rendering and a 2.23x overall speedup in canvas
rendering. It is very noticeable. The app is more than comfortable at 2560x1600
(with my Core i7...). At more common resolutions it should run smoothly.
Enough SSE for now; there are new bottlenecks, now that this has been nailed
down =)