@dosentmatter
Last active August 22, 2020 09:12
Leetcode 283. "Move Zeroes" Solutions 2/3 Analysis
I don't think Solution 2 is always better than Solution 3
First of all, it doesn't matter unless n is large, since both are O(n).
For both Solutions 2 and 3, you can skip operations when there are non-zeros at the beginning of the array by doing the following:
For Solution 2, you can skip moving when `lastNonZeroFoundAt == i`
For Solution 3, you can skip swapping when `lastNonZeroFoundAt == cur`
For Solution 2, even if you do a memset for the second loop, it is still O(n) work because it has to write each remaining element to 0.
For Solution 3, we can say that a swap is a cost-3 operation, since it does 3 writes, including the write to the temporary variable.
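To make the comparison concrete, here is a minimal sketch of the two solutions (my own Python translation of the LeetCode article's approach, with `lastNonZeroFoundAt` snake_cased and the skip checks described above added):

```python
def move_zeroes_solution2(nums):
    # Solution 2: copy every non-zero forward, then zero-fill the tail.
    last_non_zero_found_at = 0
    for i in range(len(nums)):
        if nums[i] != 0:
            if last_non_zero_found_at != i:  # skip moving when already in place
                nums[last_non_zero_found_at] = nums[i]
            last_non_zero_found_at += 1
    for i in range(last_non_zero_found_at, len(nums)):  # the "memset" loop
        nums[i] = 0


def move_zeroes_solution3(nums):
    # Solution 3: swap each non-zero into place in a single pass.
    last_non_zero_found_at = 0
    for cur in range(len(nums)):
        if nums[cur] != 0:
            if last_non_zero_found_at != cur:  # skip swapping when already in place
                nums[last_non_zero_found_at], nums[cur] = nums[cur], nums[last_non_zero_found_at]
            last_non_zero_found_at += 1
```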
---
Let's calculate the cost for the general case:
Assumptions:
1. Each loop iteration costs 1
2. Reading is 0 cost
3. Comparisons are 0 cost
4. Incrementing `lastNonZeroFoundAt` is 0 cost since both solutions increment it by the same amount (i.e. the number of non-zeros) anyway
5. Writing is 1 cost
6. Swapping costs 3 because it consists of 3 writes.
Let n be the number of elements in the array.
Let C be the number of zeros.
Let k be the number of non-zeros in the incorrect location.
To explain k:
[1, 2, 3] - k = 0 because there are no zeros
[0, 1, 2, 3] - k = 3 because 1, 2, 3 need to be moved to indices 0, 1, 2 respectively
[0, 0, 1, 2, 3] - k = 3, because 1, 2, 3 need to be moved to indices 0, 1, 2 respectively
[0, 1, 2, 0, 3] - k = 3, because 1, 2, 3 need to be moved to indices 0, 1, 2 respectively
Notice that another way to think about k is "the number of non-zeros with at
least one zero before it". This is because if there is at least one zero
before a non-zero, it is out of place. More than one zero before a non-zero
does not change anything. That will only put more zeros at the end. This
knowledge will come in handy later.
---
Cost of Solution 2 for the general case:
For the first loop, we have:
Cost(first loop) = Cost(first loop iterations) + Cost(first loop moves, i.e. writes)
since k of the elements need to be moved:
Cost(first loop) = Cost(n iterations) + Cost(k moves)
Cost(first loop) = n + k
For the second loop (memset), we have:
Cost(second loop) = Cost(second loop iterations) + Cost(second loop writes)
Cost(second loop) = Cost(C iterations) + Cost(C writes)
Cost(second loop) = C + C
Cost(second loop) = 2*C
So in total,
Cost(Solution 2) = Cost(first loop) + Cost(second loop)
Cost(Solution 2) = n + k + 2*C
Cost of Solution 3 for the general case:
Cost(Solution 3) = Cost(loop iterations) + Cost(loop swaps)
Cost(Solution 3) = Cost(n iterations) + Cost(k swaps)
Since a swap costs 3,
Cost(Solution 3) = n + 3*k
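As a quick sanity check, take [0, 1, 2, 0, 3] from above: n = 5, C = 2, k = 3, so
Cost(Solution 2) = 5 + 3 + 2*2 = 12
Cost(Solution 3) = 5 + 3*3 = 14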
---
So let's see when Solution 3 is better than or equal to Solution 2:
Cost(Solution 2) >= Cost(Solution 3)
n + k + 2*C >= n + 3*k
2*C >= 2*k
C >= k, i.e. k <= C
So this means that Cost(Solution 3) <= Cost(Solution 2) when k <= C.
k <= C means the number of non-zeros in the incorrect location is <= the number of zeros.
It would take O(n) to determine k or C.
For C, you do an O(n) loop to count the number of zeros.
For k, you do an O(n) loop to find the first zero and then count the number of non-zeros found afterwards.
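Both counts are straightforward single passes. A sketch in the same Python as above:

```python
def count_zeros(nums):
    # C: count the zeros in one pass.
    return sum(1 for x in nums if x == 0)


def count_misplaced_non_zeros(nums):
    # k: count non-zeros with at least one zero before them,
    # i.e. every non-zero after the first zero.
    k = 0
    seen_zero = False
    for x in nums:
        if x == 0:
            seen_zero = True
        elif seen_zero:
            k += 1
    return k
```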
Even though you would know the optimal solution to use after counting, it might not
be worth it to do an O(n) count, so let's do some probability.
---
So now let's do some probability and calculate what we expect k or C to be
given the following input:
1. We generate an array of length n.
2. Each element is generated independently with a probability p to be a non-zero, else zero.
What are the expected values of C and k, i.e. E[C] and E[k]?
If E[k] <= E[C], probability would suggest that Solution 3 is better.
---
E[C]:
Let C_i be a random variable on whether the element at index i is a zero.
Because of Linearity of Expectation,
E[C] = E[C_0] + E[C_1] + ... E[C_n-1]
There is a (1 - p) chance to generate a zero,
and each element can contribute at most 1 zero, so
E[C_i] = (1 - p) * 1
E[C] = (1 - p) * 1 + (1 - p) * 1 + ... (n times)
E[C] = (1 - p) * n
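For example, with p = 0.3 and n = 10, E[C] = (1 - 0.3) * 10 = 7, i.e. we expect 7 zeros.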
---
E[k]:
This one is a little trickier
Again, we can use Linearity of Expectation.
Let k_i be a random variable on whether the element at index i is a non-zero in the incorrect location.
As mentioned above, another way to think about this is
"a non-zero with at least one zero before it"
E[k] = E[k_0] + E[k_1] + ... E[k_n-1]
There is a p chance to generate a non-zero
Each element can contribute at most 1 incorrect non-zero, but what is the
probability of an incorrect non-zero?
The probability of an incorrect non-zero is the probability of generating a
non-zero with at least one zero before it.
Let P mean "the probability of"
P(having at least one zero) = (1 - P(having all non-zeros))
P(incorrect non-zero at index i)
= P(non-zero at index i with at least one zero before it)
= P(non-zero at index i AND having at least one zero before index i)
Since these two are independent events, the probability factors:
= P(non-zero at index i) * P(having at least one zero before index i)
Since i is indexed starting at 0, there are i elements before index i and we can plug into above formulas:
= P(non-zero) * P(having at least one zero in i elements)
= P(non-zero) * (1 - P(having i non-zeros))
= p * (1 - p**i)
Since each element can contribute at most 1 incorrect non-zero,
E[k_i] = (p * (1 - p**i)) * 1
E[k_i] = p * (1 - p**i)
E[k_i] = p - p**(i + 1)
E[k] = E[k_0] + E[k_1] + ... E[k_n-1]
E[k] = (p - p**(0 + 1)) + (p - p**(1+1)) + ... (p - p**(n - 1 + 1))
E[k] = (p - p**1) + (p - p**2) + ... (p - p**n)
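As a side note, summing the geometric series gives a closed form (for p != 1):
E[k] = n*p - p * (1 - p**n) / (1 - p)
Here is a quick Monte Carlo check of both formulas (my own sketch, not part of the original problem):

```python
import random

def expected_k_closed_form(n, p):
    # E[k] = n*p - (p + p**2 + ... + p**n), summed as a geometric series.
    if p == 1:
        return 0.0
    return n * p - p * (1 - p ** n) / (1 - p)

def simulate(n, p, trials=100_000):
    # Average C and k over many randomly generated arrays.
    total_c = total_k = 0
    for _ in range(trials):
        nums = [1 if random.random() < p else 0 for _ in range(n)]
        total_c += nums.count(0)
        seen_zero = False
        for x in nums:
            if x == 0:
                seen_zero = True
            elif seen_zero:
                total_k += 1
    return total_c / trials, total_k / trials

# With n = 20 and p = 0.4: E[C] = 12 and E[k] ~= 7.33, so E[k] <= E[C].
print(simulate(20, 0.4), expected_k_closed_form(20, 0.4))
```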
---
I took some time to analyze this with graphs, so I will just prove the key propositions:
p is a probability so it is in the range [0, 1]
Proposition: "If 0 <= p <= 0.5, then E[k] <= E[C]"
Let's prove this by induction on n:
Proof:
Let P(n), where n is the number of elements in the array, be the statement that E[k] <= E[C]
Base Case:
Let's special-case n=0, since the formula for E[k] above doesn't work for n=0
When n=0,
E[C] = (1 - p) * 0 = 0, or you can think of it as 0 because there are no zeros in an empty array
E[k] = 0, because there are no incorrect non-zeros in an empty array
So for n=0, E[k] <= E[C]
Let's start induction from n=1,
When n=1,
E[C] = (1 - p) * 1 = 1 - p
E[k] = (p - p**1) = 0
So for n=1, E[k] <= E[C]
Inductive step:
Assume that P(x) is true for some arbitrary x >= 1
We have to prove P(x + 1) is true.
P(x + 1) means
E[k_(x+1) elements] <= E[C_(x+1) elements]
(p - p**1) + (p - p**2) + ... + (p - p**x) + (p - p**(x + 1)) <= (1 - p) * (x + 1)
(p - p**1) + (p - p**2) + ... + (p - p**x) + (p - p**(x + 1)) <= (1 - p) * x + (1 - p)
The first chunk of each side is just P(x), I will put them in square brackets:
[ (p - p**1) + (p - p**2) + ... + (p - p**x) ] + (p - p**(x + 1)) <= [ (1 - p) * x ] + (1 - p)
The part in the square brackets,
[ (p - p**1) + (p - p**2) + ... + (p - p**x) ] <= [ (1 - p) * x ]
which is
E[k_x elements] <= E[C_x elements]
is just P(x), which we have assumed to be true.
How can we use P(x) to prove P(x + 1)?
We can use this fact:
If a <= b and c <= d are true,
then a + c <= b + d is also true
In our case,
a = E[k_x elements] = (p - p**1) + (p - p**2) + ... + (p - p**x) = left terms inside square brackets
b = E[C_x elements] = (1 - p) * x = right terms inside square brackets
c = (p - p**(x + 1)) = left term outside square brackets
d = (1 - p) = right term outside square brackets
This leaves us with having to prove the following in order to prove P(x + 1).
Note that this is just the expected value of the next term added when adding
one more element to the array.
(p - p**(x + 1)) <= (1 - p)
p - p**(x + 1) <= 1 - p
2*p <= 1 + p**(x + 1)
Since 0 <= p <= 0.5, the greatest 2*p can be is 1, and 1 + p**(x + 1) is always at least 1, so 2*p <= 1 + p**(x + 1).
This means that the original c <= d inequality is true and P(x + 1) is true.
Thus, P(n) is true for n being any integer >= 0.
In other words:
"If 0 <= p <= 0.5 and n is an integer >= 0, then E[k] <= E[C]"
This means that for 0 <= p <= 0.5, Solution 3 is always better.
---
How about 0.5 < p <= 1?
Assumptions:
1. n is very large. This is an okay assumption since we only care about
optimizing Solution 2/3 when n is large.
Proposition: "If 0.5 < p <= 1, then E[k] >= E[C]"
We will special case p=1.
E[C] = (1 - p) * n
= (1 - 1) * n
= (0) * n
= 0
E[k] = (p - p**1) + (p - p**2) + ... (p - p**n)
= (1 - 1**1) + (1 - 1**2) + ... (1 - 1**n)
= (1 - 1) + (1 - 1) + ... (1 - 1)
= (0) + (0) + ... (0)
= 0
When p=1, the array is filled with non-zeros. No work is done and E[k] = E[C],
but E[k] >= E[C] still holds true.
Now let's exclude p=1 and prove the rest where 0.5 < p < 1.
For this proposition, I will not do a formal proof. I will only analyze E[k] and E[C]
and show that E[k] grows faster than E[C] for large n.
Let's start with E[C]
E[C] = (1 - p) * n
Since 0.5 < p < 1, we know that 0 < (1 - p) < 0.5
As n grows large, we keep adding more 0 < (1 - p) < 0.5 terms.
How about E[k]?
E[k] = (p - p**1) + (p - p**2) + ... (p - p**n)
Keep in mind 0.5 < p < 1.
For earlier terms such as (p - p**1) or (p - p**2), the terms will be pretty small, because
the exponent is small, so p**i is still close to p and the amount subtracted is large.
However for later terms, and when n is large, (p - p**i) can be approximated as just (p),
since p**i ~= 0 when i is large. This is because 0.5 < p < 1, so each additional factor of p shrinks p**i.
When n is large, there will be many terms that will be approximated as just p.
As n grows large, we keep adding more approximate 0.5 < p < 1 terms.
Compare the statements above about E[C] and E[k].
For large n, E[C] keeps adding 0 < (1 - p) < 0.5 terms
while E[k] keeps adding 0.5 < p < 1 terms
The added E[k] terms are larger than the E[C] terms.
This means the proposition holds true for large n, since E[k] grows faster than E[C] as you increase n.
Note that for low n, it's possible for the opposite to be true. E[C] might be
greater. You can graph it out on a calculator to see visually. But for low n, optimization is unnecessary.
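For a concrete example, take p = 0.8 and n = 1000: E[C] = 0.2 * 1000 = 200, while the closed form above gives E[k] = 800 - 0.8 * (1 - 0.8**1000) / 0.2 ~= 800 - 4 = 796, so E[k] is far larger.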
This means for large n and 0.5 < p <= 1, Solution 2 is always better.
---
So what have we learned?
Use Solution 3 for 0 <= p <= 0.5
Use Solution 2 for 0.5 < p <= 1 (for large n)
---
How do I apply this to the general case?
In the general case, there are a lot of unknowns. If there are some knowns, you
can use the above findings to determine the optimal algorithm.
For the general case, you are given a large n-size array. You don't know the p
value used to generate the array. You don't know the value of C or k.
It's not worth determining C or k because that would take O(n).
You can approximate p by sampling a few values and computing the fraction of
non-zeros you see.
How many values should you sample? We can use Chebyshev's inequality:
See Chapter 8.1 of
https://www.dartmouth.edu/~chance/teaching_aids/books_articles/probability_book/amsbook.mac.pdf
and 5a of
https://inst.eecs.berkeley.edu/~ee126/fa19/hw/hw5-sol.pdf
P(|X - u| >= eps) <= Var(X) / eps**2, where u is the mean E[X]
Let each coin flip be a Bernoulli trial, with X_i as the random variable for flip i.
There is a p chance for the coin to flip heads (non-zero element)
and a (1 - p) chance for tails (zero element)
S_n = X_0 + X_1 + ... X_n-1
P(|S_n / n - p| >= eps) <= Var(S_n / n) / eps**2
P(|S_n / n - p| >= eps) <= Var(S_n) / (n**2 * eps**2)
P(|S_n / n - p| >= eps) <= n * Var(X_i) / (n**2 * eps**2)
P(|S_n / n - p| >= eps) <= Var(X_i) / (n * eps**2)
Var(X_i) = E[X_i**2] - E[X_i]**2
= [ (1 - p) * 0**2 + p * 1**2 ] - [ (1 - p) * 0 + p * 1 ] ** 2
= [ p * 1**2 ] - [ p * 1 ] ** 2
= [ p ] - [ p ] ** 2
= p - p**2
= p * (1 - p)
P(|S_n / n - p| >= eps) <= Var(X_i) / (n * eps**2)
P(|S_n / n - p| >= eps) <= p * (1 - p) / (n * eps**2)
We set d = p * (1 - p) / (n * eps**2)
n * eps**2 = p * (1 - p) / d
n = p * (1 - p) / (d * eps**2)
We can use this formula to determine a good value for n, the sample size.
eps is the error threshold of the estimated probability vs the actual p:
|S_n / n - p| >= eps
d is the probability of the estimated probability breaching the error threshold eps:
P(|S_n / n - p| >= eps) <= d
Formula:
n = p * (1 - p) / (d * eps**2)
To be more concrete, we can upper-bound p * (1 - p) by its maximum, which occurs at p = 0.5:
n = 0.5 * (0.5) / (d * eps**2)
n = 1 / (4 * d * eps**2)
We can set eps = 0.1 and d = 0.1:
n = 1 / (4 * 0.1 * 0.1**2)
n = 250 elements to sample
For large n, 250 elements isn't much. You can adjust eps and d depending on how accurate you need to be.
Keep in mind, the cutoff for deciding between Solution 2 and 3 is at p = 0.5.
So your eps should be small enough to distinguish the cutoff, but large enough that n isn't too large to sample.
Set d depending on how accurate you want your estimate to be.
If n gets too large, you can also cap the sample size at 10% of the array size.
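Putting it all together, here is a rough sketch of the adaptive dispatch (again my own Python, reusing the functions sketched earlier and the eps = 0.1, d = 0.1 sample size from above):

```python
import random

def move_zeroes_adaptive(nums):
    # Estimate p from a small random sample, then pick a solution.
    if not nums:
        return
    # 250 comes from the Chebyshev bound above, capped at 10% of the array.
    sample_size = min(250, max(1, len(nums) // 10))
    sample = random.sample(nums, min(sample_size, len(nums)))
    p_estimate = sum(1 for x in sample if x != 0) / len(sample)
    if p_estimate <= 0.5:
        move_zeroes_solution3(nums)  # swap-based wins for p <= 0.5
    else:
        move_zeroes_solution2(nums)  # copy + zero-fill wins for p > 0.5
```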
@HimanshuMittal01

WOW, such an in-depth analysis of the two approaches. Thanks!!

@dosentmatter
Author

@HimanshuMittal01, you're welcome, and thanks for the compliment. I'm glad you liked it 😀.
