Andreas237/XoroshiroNOT.c

## XoroshiroNOT.c
/*
XoroshiroNOT PRELIMINARY BETA RELEASE v0.10 (for testing only, as defects unknown to the author may exist, thus all below claims must be independently verified).
XoroshiroNOT Pseudo-Random Number Generator, © Christopher Rutz, All Rights Reserved. Contact information below.
XoroshiroNOT is an extrapolation of xoroshiro++, by S. Vigna and D. Blackman.
XoroshiroNOT may also be referred to as Xoroshiro++- or Xoroshiroppm.
XoroshiroNOT purports to improve upon their work by over-provisioning the underlying xoroshiro state by 50% and flipping the rotate/shift direction, thus providing these many features and characteristics:
	1. Near-ideal pseudo randomness within a given period that is (perhaps uniquely) explicitly defined by, but greater than that inherent in, the underlying (xoroshiro) base generator randomness and period:
		A. Randomness period = wordsize^2 - 1, which is easily testable with 8-bit byte and 16-bit word size versions.
		B. 16-bit word, 32-bit dword and 64-bit qword versions pass TestU01 BigCrush with a near-ideal failure rate (for most values of D) of ~50 failures per 100 BigCrush runs, using both forward and reverse bits.
		C. 32-bit dword and 64-bit qword versions pass PractRand to at least 32TB (normal testing limit, fwd/rev), gjrand --ten-tera (fwd/rev), and Hamming Weight Dependency test to 5.0E14 (stopped testing, fwd only), all with no imminent sign of failure.
		D. 16-bit word version passes both PractRand and gjrand to 8GB (size defined by randomness period), without systematic error using forward and reverse bits.
		E. No excessive 'FREQ' warnings during PractRand testing that are endemic in many other algorithms.
	2. Provides two versions:
		A. An extremely fast, fully 1-Dimensional, protracted equidistributed version (i.e., no missing 0's over the defined full period) that is fully 2-D with respect to identical pairs (when the default E = 1 is used).
		B. A slightly slower, fully 1-D and near-fully 2-D, protracted equidistributed version (i.e., 2^(n/3) pairs occur once less often over the defined full period).
	3. Does not use variables or perform math with a word size larger than the underlying xoroshiro (e.g., no 128-bit math on a 64-bit variable).
	4. Only uses XOR, addition, subtraction and shift/rotate instructions:
		A. No multiplication (other than that implied by power of 2 bit-shift operations), thus easily implemented in hardware.
		B. No 'magic' constants (other than those used for shift/rotate and when the default E = 1 is used to complete the NOT).
	5. Extremely fast, compact and comprehensible code. (Code below and benchmarks in XoroshiroNOT_benchmarks.txt)
	6. Not accidentally or very-trivially invertible (though not intended or designed for use as a CSPRNG).
	7. Nearly eliminates all traces of sparse set-bit state recovery (even better with the 2-D version).
	8. Maximally optimized for Intel CPU pipelines under Linux GCC (-O3) and Windows MSVC x86/64 (/Ox) (e.g. 1-D version is nearly as fast as xoroshiro** in most usage scenarios).
	9. Stream selection by normal jump of the underlying xoroshiro state is the recommended method (jump code not yet available).
	10. Supports extended stream selection (via the extended state s[2]):
		A. Although correlated (thus not recommended for serious work), extended streams bias the correlation toward the least significant bits.
		B. Lower correlation between streams is generally achieved successively (e.g., by reaching the end of the underlying xoroshiro stream; jump code not yet available).
		C. Randomness tests encompassing two successive streams will fail (usually only) on a GAP test of the least significant bit (e.g. PractRand),
			whereas xoroshiro++ would exhibit multiple failures ~16x-32x sooner (in the 16-bit version).
		D. Randomness tests encompassing four successive streams will additionally fail on floating-point tests of the 4 least significant bits.
		E. The above mentioned failures are substantially fewer and later in comparison to those that occur in a default xoroshiro++ implementation.
	11. Extremely configurable (thus obfuscatable) via:
		A. Alternate A/B/C parameters (using other reversed xoroshiro triplets, if the un-reversed triplets are of sufficiently high-quality).
		B. The D parameter, with any value from 1 to a maximum value of W - 2 in the 2-D version, and odd from 1 to W - 3 in the 1-D version. The maximum value of D is normally limited
			by the quality of the underlying xoroshiro triplet.
		C. Non-inversion of high-quality triplets, reversing the B shift direction (however proper selection and full-testing of the D parameter is absolutely required to prevent a standing-wave effect on some bits).
		D. The E parameter (normally 1, but always odd, with maximum benefit to sparse state recovery when E >= 50% set bits).
		E. Swapping s[0] with s[1] in first two lines of the function code.
		F. The W parameter (and associated uintx) for different state / output word sizes.
XoroshiroNOT has these caveats:
	1. It is new and not independently verified (even though it is based on the well studied mathematics of xoroshiro).
	2. It uses an additional fully-seedable state variable that extends the full period (for speed, improving randomness, and maintaining/ehnancing equidistribution), but the period of verifiable randomness (which exceeds
		that of xoroshiro itself) extends only through to that of the underlying xoroshiro state.
	3. The ++, and therefore my ++- improvement to xoroshiro may prove to be mathematically intractable (though perhaps less likely so with the triplet inversion present).
	4. The 2-D version is slower on newer CPUs than comparable forms of xoshiro/xoroshiro released by Vigna and Blackman (though faster than many common PRNGs... see the XoroshiroNOT_benchmarks.txt file for details).
	5. The faster 1-D version is more prone to struggle to reach the defined randomness period at some (usually > W/2) D values:
		(e.g., 'A = 2, B = 2, C = 9, D = 11, E = 1, W = 16' 1-D fails PractRand 'stdin16 -tf 2 -te 1' at 4GB, not the expected 16G, but only when all bits are reversed, yet same data passes BigCrush with no endemic failures).
	6. Protracted equidistribution employed by this type of PRNG is a new, or at least non-traditional concept. It attempts to balance the double-edge sword of meeting accepted equidistribution requirements
		against the fact that such a generator cannot possibly meet randomness criteria at its full period. Thus, a compromise is made in the interest of maximizing randomness over a predictable period such
		that irrational judgments need not be made (e.g., only use SQRT(period) values, etc.).
	7. The 16-bit word output version, though passing BigCrush, shows weakness on a few statistical tests when a meta-analysis of multiple BigCrush runs is performed. However, this is a blessing-in-disguise, since BigCrush may be
		consuming more data than the defined randomess period can deliver, and thus makes a good case for the expected behavior of the 32-bit word output and greater versions.
XoroshiroNOT has this errata:
	1. The NOT in XoroshiroNOT has at least one obvious meaning, but is actually derived from the micro-transactional subtraction (used for performance reasons) that,
		when E = 1, simplifies to a ones-compliment bit inversion (i.e., a Boolean NOT, in some programming languages).
	2. XoroshiroNOT accecpts the logical proposition that there is a quantum-nature to xoroshiro, where randomness is present more so within some bits, but entangled more so between bit pairs in other bits,
		and thus attempts to share this bias across all bits by bubbling a net-effect of excess carry bits upward (i.e., from lsb to msb, which is why the inverted xoroshiro is used).
	3. Xoroshiro was chosen as the platform for my code base due to its amazing speed, complementary state variables, low average bit discard rate, and minimal mathematical artifacts.
	4. Questions, comments, suggestions and/or donations (via PayPal) regarding this work, or to support its further development, may be submitted to XoroshiroNOT@comcast.net.
*/

//Currently v0.10 with updated note on 8-bit output version
//Previously changed recommended D constant, updated docs, added enums, removed const,
//static/inline added back in properly and removed unnecessary parens, added notes on proper reversal of B shift direction

// XoroshiroNOT Pseudo-Polymorphic Code:
#include <stdint.h>

//8-bit byte output generator with defined full period = (2^24 - 2^8) and defined randomness period = (2^16 - 1)
//Both 1D/2D pass PractRand 0.94 {stdin8 -tf 2 -te 1 -tlmin 16KB} to at least 64KB output, per defined randomness period specification.
//Both 1D/2D pass gjrand mcp --tiny (10MB) on ~8 of 13 tests, which is testing well beyond the defined randomness period.
//#define uintx uint8_t
//enum {A = 4, B = 7, C = 5, D = 5, E = 1, W = 8};

//16-bit word output generator with defined full period = (2^48 - 2^16) and defined randomness period = (2^32 - 1)
//#define uintx uint16_t
//enum {A = 2, B = 2, C = 9, D = 9, E = 1, W = 16}; // preferred and recommended for most values of D from 1 to 14

//32-bit dword output generator with defined full period = (2^96 - 2^32) and defined randomness period = (2^64 - 1)
//#define uintx uint32_t
//enum {A = 6, B = 9, C = 19, D = 17, E = 1, W = 32};

//64-bit qword output generator with defined full period = (2^192 - 2^64) and defined randomness period = (2^128 - 1)
#define uintx uint64_t
enum {A = 40, B = 16, C = 27, D = 33, E = 1, W = 64}; // inverted v1.00
//enum {A =  9, B = 14, C = 28, D = 33, E = 1, W = 64}; // inverted v0.01

//128-bit oword output generator with defined full period = (2^384 - 2^128) and defined randomness period = (2^256 - 1)
//#define uintx uint128_t
//enum {A = 23, B = 36, C = 58, D = 65, E = 1, W = 128}; // untested

//Code section LOCKED for beta-testing
static uintx s[3] = {1, 0, 0}; // must set at least 1 bit in either of first two state variables
static uintx const_odd = E | 1; // per-instance odd decrement, not declared as an actual const or part of state (for better CPU pipeline performance)

static inline uintx rotl(const uintx x, int k) {
	return (x << k) | (x >> (W - k)); }

//Near-fully two-dimensionally equidistributed (over full period) version
static inline uintx xoroshiroNOT2d(void) {
	uintx s1 = s[0] ^ s[1];
	s[2] = rotl(s[0] + s[1], D) + (s[2] ^ s[0]) - s1;
	s[0] = rotl(s[0], A) ^ s1 ^ (s1 >> B);  //note reversed B shift direction required for inverted constants
	s[1] = rotl(s1, C);
	return s[2] -= const_odd; }

//Fully one-dimensionally equidistributed (over full period) version, D must be odd
static inline uintx xoroshiroNOT1d(void) {
	uintx tmp = s[0] ^ s[2];
	s[2] = rotl(s[0] + s[1], D) + tmp - s[1] - const_odd;
	uintx s1 = s[0] ^ s[1];
	s[0] = rotl(s[0], A) ^ s1 ^ (s1 >> B); //note reversed B shift direction required for inverted constants
	s[1] = rotl(s1, C);
	return s[2]; }
//Code section UNLOCKED

//Code section below reserved for future expansione (e.g. seed, jump, etc.)

## XoroshiroNOT_benchmarks.txt
XoroshiroNOT 64-bit output benchmark comparisons on Xeon W3690 @ 3.80GHz
MSVC My Benchmark (clustered executable)
Store result in array
xoroshiro128+ : 5.16019 GB/s in 22.6504 seconds
xoroshiro128**: 4.82434 GB/s in 24.2273 seconds
xoroshiroNOT2d: 3.90328 GB/s in 29.9441 seconds
xoroshiroNOT1d: 4.59434 GB/s in 25.4401 seconds
Add result to array (penalty to larger/complex code and/or poor pipeline optimization)
xoroshiro128+ : 5.13748 GB/s in 22.7506 seconds
xoroshiro128**: 4.69707 GB/s in 24.8837 seconds
xoroshiroNOT2d: 3.95083 GB/s in 29.5838 seconds
xoroshiroNOT1d: 4.96070 GB/s in 23.5613 seconds (this looks great!)

GCC Vigna Benchmark (individual executables, 10 Billion iterations, modified to add Static to e variable)
Add result to variable (penalty to larger/complex code and/or poor pipeline optimization)
xoroshiro128+ : 12.938 s, 6.184 GB/s, 0.773 words/ns, 1.294 ns/word
xoroshiro128**: 14.672 s, 5.453 GB/s, 0.682 words/ns, 1.467 ns/word
xoroshiroNOT1d: 14.891 s, 5.373 GB/s, 0.672 words/ns, 1.489 ns/word (fixed: accidently benchmarked 2d version)
xoroshiroNOT2d: 17.172 s, 4.659 GB/s, 0.582 words/ns, 1.717 ns/word
pcg64-speed   : 18.266 s, 4.380 GB/s, 0.547 words/ns, 1.827 ns/word (64-bit state version)

GCC Lemire Benchmark (clustered executable, modified to fix non-inlining of function calls and added Static to xoroshiro128+/xorshift128+ state)
Store result in array
lehmer64      : 0.70 cycles per byte
xorshift128+  : 0.76 cycles per byte
xoroshiro128+ : 0.61 cycles per byte
xoroshiro128**: 0.57 cycles per byte (this was unexpected!)
splitmix64    : 0.89 cycles per byte
pcg64         : 1.06 cycles per byte (128-bit state version)
xorshift1024* : 0.85 cycles per byte
xorshift1024+ : 0.80 cycles per byte
xoroshiroNOT2d: 0.84 cycles per byte
xoroshiroNOT1d: 0.66 cycles per byte (looks good!)
Add result to array (penalty to larger/complex code and/or poor pipeline optimization)
lehmer64      : 0.83 cycles per byte
xorshift128+  : 0.83 cycles per byte
xoroshiro128+ : 0.83 cycles per byte
xoroshiro128**: 0.77 cycles per byte (this was unexpected!)
splitmix64    : 1.03 cycles per byte
pcg64         : 1.16 cycles per byte (128-bit state version)
xorshift1024* : 0.99 cycles per byte
xorshift1024+ : 0.91 cycles per byte
xoroshiroNOT2d: 0.93 cycles per byte
xoroshiroNOT1d: 0.82 cycles per byte (looks good!)

MSVC 11 year old Xeon DP 2.8GHz
Add result to array
xoroshiro128+ : 1.232850 GB/s in  9.4805 seconds
xoroshiro128**: 0.823447 GB/s in 14.1941 seconds
xoroshiroNOT2d: 0.823637 GB/s in 14.1908 seconds
xoroshiroNOT1d: 0.994528 GB/s in 11.7524 seconds
	/*
	XoroshiroNOT PRELIMINARY BETA RELEASE v0.10 (for testing only, as defects unknown to the author may exist, thus all below claims must be independently verified).
	XoroshiroNOT Pseudo-Random Number Generator, © Christopher Rutz, All Rights Reserved. Contact information below.
	XoroshiroNOT is an extrapolation of xoroshiro++, by S. Vigna and D. Blackman.
	XoroshiroNOT may also be referred to as Xoroshiro++- or Xoroshiroppm.
	XoroshiroNOT purports to improve upon their work by over-provisioning the underlying xoroshiro state by 50% and flipping the rotate/shift direction, thus providing these many features and characteristics:
	1. Near-ideal pseudo randomness within a given period that is (perhaps uniquely) explicitly defined by, but greater than that inherent in, the underlying (xoroshiro) base generator randomness and period:
	A. Randomness period = wordsize^2 - 1, which is easily testable with 8-bit byte and 16-bit word size versions.
	B. 16-bit word, 32-bit dword and 64-bit qword versions pass TestU01 BigCrush with a near-ideal failure rate (for most values of D) of ~50 failures per 100 BigCrush runs, using both forward and reverse bits.
	C. 32-bit dword and 64-bit qword versions pass PractRand to at least 32TB (normal testing limit, fwd/rev), gjrand --ten-tera (fwd/rev), and Hamming Weight Dependency test to 5.0E14 (stopped testing, fwd only), all with no imminent sign of failure.
	D. 16-bit word version passes both PractRand and gjrand to 8GB (size defined by randomness period), without systematic error using forward and reverse bits.
	E. No excessive 'FREQ' warnings during PractRand testing that are endemic in many other algorithms.
	2. Provides two versions:
	A. An extremely fast, fully 1-Dimensional, protracted equidistributed version (i.e., no missing 0's over the defined full period) that is fully 2-D with respect to identical pairs (when the default E = 1 is used).
	B. A slightly slower, fully 1-D and near-fully 2-D, protracted equidistributed version (i.e., 2^(n/3) pairs occur once less often over the defined full period).
	3. Does not use variables or perform math with a word size larger than the underlying xoroshiro (e.g., no 128-bit math on a 64-bit variable).
	4. Only uses XOR, addition, subtraction and shift/rotate instructions:
	A. No multiplication (other than that implied by power of 2 bit-shift operations), thus easily implemented in hardware.
	B. No 'magic' constants (other than those used for shift/rotate and when the default E = 1 is used to complete the NOT).
	5. Extremely fast, compact and comprehensible code. (Code below and benchmarks in XoroshiroNOT_benchmarks.txt)
	6. Not accidentally or very-trivially invertible (though not intended or designed for use as a CSPRNG).
	7. Nearly eliminates all traces of sparse set-bit state recovery (even better with the 2-D version).
	8. Maximally optimized for Intel CPU pipelines under Linux GCC (-O3) and Windows MSVC x86/64 (/Ox) (e.g. 1-D version is nearly as fast as xoroshiro** in most usage scenarios).
	9. Stream selection by normal jump of the underlying xoroshiro state is the recommended method (jump code not yet available).
	10. Supports extended stream selection (via the extended state s[2]):
	A. Although correlated (thus not recommended for serious work), extended streams bias the correlation toward the least significant bits.
	B. Lower correlation between streams is generally achieved successively (e.g., by reaching the end of the underlying xoroshiro stream; jump code not yet available).
	C. Randomness tests encompassing two successive streams will fail (usually only) on a GAP test of the least significant bit (e.g. PractRand),
	whereas xoroshiro++ would exhibit multiple failures ~16x-32x sooner (in the 16-bit version).
	D. Randomness tests encompassing four successive streams will additionally fail on floating-point tests of the 4 least significant bits.
	E. The above mentioned failures are substantially fewer and later in comparison to those that occur in a default xoroshiro++ implementation.
	11. Extremely configurable (thus obfuscatable) via:
	A. Alternate A/B/C parameters (using other reversed xoroshiro triplets, if the un-reversed triplets are of sufficiently high-quality).
	B. The D parameter, with any value from 1 to a maximum value of W - 2 in the 2-D version, and odd from 1 to W - 3 in the 1-D version. The maximum value of D is normally limited
	by the quality of the underlying xoroshiro triplet.
	C. Non-inversion of high-quality triplets, reversing the B shift direction (however proper selection and full-testing of the D parameter is absolutely required to prevent a standing-wave effect on some bits).
	D. The E parameter (normally 1, but always odd, with maximum benefit to sparse state recovery when E >= 50% set bits).
	E. Swapping s[0] with s[1] in first two lines of the function code.
	F. The W parameter (and associated uintx) for different state / output word sizes.
	XoroshiroNOT has these caveats:
	1. It is new and not independently verified (even though it is based on the well studied mathematics of xoroshiro).
	2. It uses an additional fully-seedable state variable that extends the full period (for speed, improving randomness, and maintaining/ehnancing equidistribution), but the period of verifiable randomness (which exceeds
	that of xoroshiro itself) extends only through to that of the underlying xoroshiro state.
	3. The ++, and therefore my ++- improvement to xoroshiro may prove to be mathematically intractable (though perhaps less likely so with the triplet inversion present).
	4. The 2-D version is slower on newer CPUs than comparable forms of xoshiro/xoroshiro released by Vigna and Blackman (though faster than many common PRNGs... see the XoroshiroNOT_benchmarks.txt file for details).
	5. The faster 1-D version is more prone to struggle to reach the defined randomness period at some (usually > W/2) D values:
	(e.g., 'A = 2, B = 2, C = 9, D = 11, E = 1, W = 16' 1-D fails PractRand 'stdin16 -tf 2 -te 1' at 4GB, not the expected 16G, but only when all bits are reversed, yet same data passes BigCrush with no endemic failures).
	6. Protracted equidistribution employed by this type of PRNG is a new, or at least non-traditional concept. It attempts to balance the double-edge sword of meeting accepted equidistribution requirements
	against the fact that such a generator cannot possibly meet randomness criteria at its full period. Thus, a compromise is made in the interest of maximizing randomness over a predictable period such
	that irrational judgments need not be made (e.g., only use SQRT(period) values, etc.).
	7. The 16-bit word output version, though passing BigCrush, shows weakness on a few statistical tests when a meta-analysis of multiple BigCrush runs is performed. However, this is a blessing-in-disguise, since BigCrush may be
	consuming more data than the defined randomess period can deliver, and thus makes a good case for the expected behavior of the 32-bit word output and greater versions.
	XoroshiroNOT has this errata:
	1. The NOT in XoroshiroNOT has at least one obvious meaning, but is actually derived from the micro-transactional subtraction (used for performance reasons) that,
	when E = 1, simplifies to a ones-compliment bit inversion (i.e., a Boolean NOT, in some programming languages).
	2. XoroshiroNOT accecpts the logical proposition that there is a quantum-nature to xoroshiro, where randomness is present more so within some bits, but entangled more so between bit pairs in other bits,
	and thus attempts to share this bias across all bits by bubbling a net-effect of excess carry bits upward (i.e., from lsb to msb, which is why the inverted xoroshiro is used).
	3. Xoroshiro was chosen as the platform for my code base due to its amazing speed, complementary state variables, low average bit discard rate, and minimal mathematical artifacts.
	4. Questions, comments, suggestions and/or donations (via PayPal) regarding this work, or to support its further development, may be submitted to XoroshiroNOT@comcast.net.
	*/

	//Currently v0.10 with updated note on 8-bit output version
	//Previously changed recommended D constant, updated docs, added enums, removed const,
	//static/inline added back in properly and removed unnecessary parens, added notes on proper reversal of B shift direction

	// XoroshiroNOT Pseudo-Polymorphic Code:
	#include <stdint.h>

	//8-bit byte output generator with defined full period = (2^24 - 2^8) and defined randomness period = (2^16 - 1)
	//Both 1D/2D pass PractRand 0.94 {stdin8 -tf 2 -te 1 -tlmin 16KB} to at least 64KB output, per defined randomness period specification.
	//Both 1D/2D pass gjrand mcp --tiny (10MB) on ~8 of 13 tests, which is testing well beyond the defined randomness period.
	//#define uintx uint8_t
	//enum {A = 4, B = 7, C = 5, D = 5, E = 1, W = 8};

	//16-bit word output generator with defined full period = (2^48 - 2^16) and defined randomness period = (2^32 - 1)
	//#define uintx uint16_t
	//enum {A = 2, B = 2, C = 9, D = 9, E = 1, W = 16}; // preferred and recommended for most values of D from 1 to 14

	//32-bit dword output generator with defined full period = (2^96 - 2^32) and defined randomness period = (2^64 - 1)
	//#define uintx uint32_t
	//enum {A = 6, B = 9, C = 19, D = 17, E = 1, W = 32};

	//64-bit qword output generator with defined full period = (2^192 - 2^64) and defined randomness period = (2^128 - 1)
	#define uintx uint64_t
	enum {A = 40, B = 16, C = 27, D = 33, E = 1, W = 64}; // inverted v1.00
	//enum {A = 9, B = 14, C = 28, D = 33, E = 1, W = 64}; // inverted v0.01

	//128-bit oword output generator with defined full period = (2^384 - 2^128) and defined randomness period = (2^256 - 1)
	//#define uintx uint128_t
	//enum {A = 23, B = 36, C = 58, D = 65, E = 1, W = 128}; // untested

	//Code section LOCKED for beta-testing
	static uintx s[3] = {1, 0, 0}; // must set at least 1 bit in either of first two state variables
	static uintx const_odd = E \| 1; // per-instance odd decrement, not declared as an actual const or part of state (for better CPU pipeline performance)

	static inline uintx rotl(const uintx x, int k) {
	return (x << k) \| (x >> (W - k)); }

	//Near-fully two-dimensionally equidistributed (over full period) version
	static inline uintx xoroshiroNOT2d(void) {
	uintx s1 = s[0] ^ s[1];
	s[2] = rotl(s[0] + s[1], D) + (s[2] ^ s[0]) - s1;
	s[0] = rotl(s[0], A) ^ s1 ^ (s1 >> B); //note reversed B shift direction required for inverted constants
	s[1] = rotl(s1, C);
	return s[2] -= const_odd; }

	//Fully one-dimensionally equidistributed (over full period) version, D must be odd
	static inline uintx xoroshiroNOT1d(void) {
	uintx tmp = s[0] ^ s[2];
	s[2] = rotl(s[0] + s[1], D) + tmp - s[1] - const_odd;
	uintx s1 = s[0] ^ s[1];
	s[0] = rotl(s[0], A) ^ s1 ^ (s1 >> B); //note reversed B shift direction required for inverted constants
	s[1] = rotl(s1, C);
	return s[2]; }
	//Code section UNLOCKED

	//Code section below reserved for future expansione (e.g. seed, jump, etc.)
	XoroshiroNOT 64-bit output benchmark comparisons on Xeon W3690 @ 3.80GHz
	MSVC My Benchmark (clustered executable)
	Store result in array
	xoroshiro128+ : 5.16019 GB/s in 22.6504 seconds
	xoroshiro128**: 4.82434 GB/s in 24.2273 seconds
	xoroshiroNOT2d: 3.90328 GB/s in 29.9441 seconds
	xoroshiroNOT1d: 4.59434 GB/s in 25.4401 seconds
	Add result to array (penalty to larger/complex code and/or poor pipeline optimization)
	xoroshiro128+ : 5.13748 GB/s in 22.7506 seconds
	xoroshiro128**: 4.69707 GB/s in 24.8837 seconds
	xoroshiroNOT2d: 3.95083 GB/s in 29.5838 seconds
	xoroshiroNOT1d: 4.96070 GB/s in 23.5613 seconds (this looks great!)

	GCC Vigna Benchmark (individual executables, 10 Billion iterations, modified to add Static to e variable)
	Add result to variable (penalty to larger/complex code and/or poor pipeline optimization)
	xoroshiro128+ : 12.938 s, 6.184 GB/s, 0.773 words/ns, 1.294 ns/word
	xoroshiro128**: 14.672 s, 5.453 GB/s, 0.682 words/ns, 1.467 ns/word
	xoroshiroNOT1d: 14.891 s, 5.373 GB/s, 0.672 words/ns, 1.489 ns/word (fixed: accidently benchmarked 2d version)
	xoroshiroNOT2d: 17.172 s, 4.659 GB/s, 0.582 words/ns, 1.717 ns/word
	pcg64-speed : 18.266 s, 4.380 GB/s, 0.547 words/ns, 1.827 ns/word (64-bit state version)

	GCC Lemire Benchmark (clustered executable, modified to fix non-inlining of function calls and added Static to xoroshiro128+/xorshift128+ state)
	Store result in array
	lehmer64 : 0.70 cycles per byte
	xorshift128+ : 0.76 cycles per byte
	xoroshiro128+ : 0.61 cycles per byte
	xoroshiro128**: 0.57 cycles per byte (this was unexpected!)
	splitmix64 : 0.89 cycles per byte
	pcg64 : 1.06 cycles per byte (128-bit state version)
	xorshift1024* : 0.85 cycles per byte
	xorshift1024+ : 0.80 cycles per byte
	xoroshiroNOT2d: 0.84 cycles per byte
	xoroshiroNOT1d: 0.66 cycles per byte (looks good!)
	Add result to array (penalty to larger/complex code and/or poor pipeline optimization)
	lehmer64 : 0.83 cycles per byte
	xorshift128+ : 0.83 cycles per byte
	xoroshiro128+ : 0.83 cycles per byte
	xoroshiro128**: 0.77 cycles per byte (this was unexpected!)
	splitmix64 : 1.03 cycles per byte
	pcg64 : 1.16 cycles per byte (128-bit state version)
	xorshift1024* : 0.99 cycles per byte
	xorshift1024+ : 0.91 cycles per byte
	xoroshiroNOT2d: 0.93 cycles per byte
	xoroshiroNOT1d: 0.82 cycles per byte (looks good!)

	MSVC 11 year old Xeon DP 2.8GHz
	Add result to array
	xoroshiro128+ : 1.232850 GB/s in 9.4805 seconds
	xoroshiro128**: 0.823447 GB/s in 14.1941 seconds
	xoroshiroNOT2d: 0.823637 GB/s in 14.1908 seconds
	xoroshiroNOT1d: 0.994528 GB/s in 11.7524 seconds