allanmac/float3 SoA to AoS

## float3 SoA to AoS
===============================================================================================

Load three arrays (x, y and z) in SoA order, repack them and store them in AoS order.

Strategy: each warp permutes its load lane with:

   (rowNum + (laneId() * 3)) & 31

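This converts SoA into AoS, but with x/y/z staggered across the rows of registers.

In the diagrams below, each column is a lane, each row is a register, and each entry is the AoS float index held there.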
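One reading of the formula, sketched on the host in C++: element i of SoA row rowNum has AoS float index 3*i + rowNum, and the formula gives the lane that owns that index. The names `dest_lane` and `src_index` are hypothetical, and the inverse mapping is an assumption derived from 3 * 11 = 33 = 1 (mod 32), not something the text states.

```cpp
#include <cassert>

constexpr int kWarpSize = 32;  // lanes per warp

// Lane that receives component `row` (0 = x, 1 = y, 2 = z) of vector `i`.
// The AoS float index is 3 * i + row; its lane is that index modulo 32.
constexpr int dest_lane(int row, int i) {
    return (row + i * 3) & (kWarpSize - 1);
}

// Inverse mapping (an assumption, not stated in the text): the SoA index
// that `lane` must read from array `row` so the value lands in its own lane.
// It relies on 3 * 11 == 33, which is 1 modulo 32.
constexpr int src_index(int row, int lane) {
    return ((lane - row + kWarpSize) * 11) & (kWarpSize - 1);
}
```

On a GPU the same arithmetic could drive either a permuted global load or a warp shuffle; which of the two the strategy intends is not specified here.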

===============================================================================================

0-31:

 0  -  -  3  -  -  6  -  -  9  -  - 12  -  - 15  -  - 18  -  - 21  -  - 24  -  - 27  -  - 30  -
 -  1  -  -  4  -  -  7  -  - 10  -  - 13  -  - 16  -  - 19  -  - 22  -  - 25  -  - 28  -  - 31
 -  -  2  -  -  5  -  -  8  -  - 11  -  - 14  -  - 17  -  - 20  -  - 23  -  - 26  -  - 29  -  -

0-63:

 0 33  -  3 36  -  6 39  -  9 42  - 12 45  - 15 48  - 18 51  - 21 54  - 24 57  - 27 60  - 30 63
 -  1 34  -  4 37  -  7 40  - 10 43  - 13 46  - 16 49  - 19 52  - 22 55  - 25 58  - 28 61  - 31
32  -  2 35  -  5 38  -  8 41  - 11 44  - 14 47  - 17 50  - 20 53  - 23 56  - 26 59  - 29 62  -

0-95:

 0 33 66  3 36 69  6 39 72  9 42 75 12 45 78 15 48 81 18 51 84 21 54 87 24 57 90 27 60 93 30 63
64  1 34 67  4 37 70  7 40 73 10 43 76 13 46 79 16 49 82 19 52 85 22 55 88 25 58 91 28 61 94 31
32 65  2 35 68  5 38 71  8 41 74 11 44 77 14 47 80 17 50 83 20 53 86 23 56 89 26 59 92 29 62 95
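The full diagram can be regenerated with a few lines of host-side C++. This is a minimal sketch, assuming each entry is an AoS float index k placed at row k % 3 (its component) and lane k % 32; `Table` and `staggered_table` are hypothetical names.

```cpp
#include <array>
#include <cassert>

using Table = std::array<std::array<int, 32>, 3>;  // [register row][lane]

// Place every AoS float index k (0..95) at row k % 3 (its component:
// x, y or z) and lane k % 32 (the lane that will eventually store it).
Table staggered_table() {
    Table t{};
    for (auto& row : t) row.fill(-1);  // -1 marks an empty cell
    for (int k = 0; k < 96; ++k) {
        t[k % 3][k & 31] = k;
    }
    return t;
}
```

Because 3 and 32 are coprime, every (row, lane) cell is written exactly once, which is why the completed diagram has no gaps.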

===============================================================================================

Permutation vector for each lane:

mod3=0 mod3=1 mod3=2
------ ------ ------
  0      1      2
  2      0      1
  1      2      0

if (laneIsMod0)  xchg(r1,r2);
if (laneIsMod1)  xchg(r0,r1);
if (laneIsMod2)  xchg(r0,r2);

At this point the 3 x 32 float rows are in float3 order.
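The exchange step can be checked on the host. The sketch below assumes registers hold AoS float indices rather than data, so that "float3 order" means register r of a lane holds index 32 * r + lane; `LaneRegs`, `staggered` and `unstagger` are hypothetical names.

```cpp
#include <cassert>
#include <utility>

struct LaneRegs { int r0, r1, r2; };  // one lane's three registers

// Staggered contents of `lane` after the permuted load, as AoS float
// indices: register c holds the index k with k % 3 == c and k % 32 == lane.
LaneRegs staggered(int lane) {
    LaneRegs regs{-1, -1, -1};
    for (int k = lane; k < 96; k += 32) {
        if (k % 3 == 0)      regs.r0 = k;
        else if (k % 3 == 1) regs.r1 = k;
        else                 regs.r2 = k;
    }
    return regs;
}

// Apply the per-lane exchange so that register r holds index 32 * r + lane.
LaneRegs unstagger(int lane, LaneRegs regs) {
    switch (lane % 3) {
        case 0:  std::swap(regs.r1, regs.r2); break;  // laneIsMod0
        case 1:  std::swap(regs.r0, regs.r1); break;  // laneIsMod1
        default: std::swap(regs.r0, regs.r2); break;  // laneIsMod2
    }
    return regs;
}
```

After `unstagger`, storing register row r at offset 32 * r + lane writes the 96 floats in AoS order.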

Alternatively, use two SELP ops (p ? a : b) to select which register to store out to device or host memory.
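A sketch of the select-based variant, written as host-side C++ ternaries standing in for SELP. It assumes two selects are needed for each stored row, which is an interpretation of the note above; `LaneRegs` and `select_for_row` are hypothetical names.

```cpp
#include <cassert>

struct LaneRegs { float r0, r1, r2; };  // staggered registers of one lane

// Pick the register holding the value for output row `row` (0..2) without
// exchanging registers. Each case is two nested selects, mirroring two SELPs.
float select_for_row(int row, int lane, LaneRegs regs) {
    const bool mod1 = (lane % 3) == 1;
    const bool mod2 = (lane % 3) == 2;
    switch (row) {
        case 0:  return mod1 ? regs.r1 : (mod2 ? regs.r2 : regs.r0);
        case 1:  return mod1 ? regs.r0 : (mod2 ? regs.r1 : regs.r2);
        default: return mod1 ? regs.r2 : (mod2 ? regs.r0 : regs.r1);
    }
}
```

The trade-off is that the registers stay staggered, so this form only suits the case where the values are stored immediately rather than reused in float3 order.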

===============================================================================================

If there is no need to expose the float3 you can simplify any future
load/store by packing a 3x32 float "block" into a 2x64 + 1x32 form.

If this is acceptable, then another variant of the above permutation
strategy would only permute the x/y rows and leave z intact.

All of this could have been avoided if the original source of the SoA
arrays had interleaved the x/y rows, followed by the z row.

For example:

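typedef union
{
  // Plain component view.
  struct {
    float   x;
    float   y;
    float   z;
  };

  // "Block" view: x and y travel together as one float2, z stays separate.
  // Note: CUDA's float2 is 8-byte aligned, so this view likely pads the
  // union to 16 bytes rather than 12.
  struct {
    float2  xy;
    float   z;
  } block;

} bfloat3;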
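One reading of the block form, sketched on the host in C++: 32 interleaved (x, y) pairs followed by 32 z values, so each lane reads one 64-bit pair and one 32-bit float with no permutation. The `Block`, `Vec3`, `pack_block` and `load_lane` names are hypothetical, and the exact layout is an assumption based on the description above.

```cpp
#include <array>
#include <cassert>

// Hypothetical container for one 32-vector block: 32 interleaved (x, y)
// pairs followed by 32 z values.
struct Block {
    std::array<float, 64> xy;
    std::array<float, 32> z;
};

struct Vec3 { float x, y, z; };

// Repack three 32-element SoA arrays into the block layout.
Block pack_block(const float* x, const float* y, const float* z) {
    Block b{};
    for (int i = 0; i < 32; ++i) {
        b.xy[2 * i]     = x[i];
        b.xy[2 * i + 1] = y[i];
        b.z[i]          = z[i];
    }
    return b;
}

// What a single lane would read: one (x, y) pair plus one z value.
Vec3 load_lane(const Block& b, int lane) {
    return Vec3{b.xy[2 * lane], b.xy[2 * lane + 1], b.z[lane]};
}
```

With this layout consecutive lanes touch consecutive addresses in both sub-arrays, so the accesses stay contiguous across the warp without any register shuffling.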