Algorithm for sorting lists of items from high-latency blocking sources.
Problem: N independent sources of already-sorted elements that need to be merged into a
single sorted output. Each source has an unknown length, and reading the next element
takes a long time. Here are some approaches to sorting this:
1. Basic, naïve mergesort:
if (lists.size > 2) {
  merge_all(for ((list1, list2) <- lists.grouped_by_2) {
    yield merge(list1, list2)
  })
} else {
  yield min(list1.peek, list2.peek)
  get_the_next_item_from_the_consumed_list
}
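A minimal runnable sketch of this one in Scala, as a reference point for the variants below. It assumes each source is exposed as an Iterator whose next() blocks on the slow read; merge and mergeAll are names of my choosing, and BufferedIterator's head plays the role of peek:

// Sketch only: each source is an Iterator whose next() blocks on the
// slow read; .buffered gives us head (a non-consuming peek).
object NaiveMerge {
  // Two-way merge: peek both heads, always advance the smaller one.
  def merge[T](a: BufferedIterator[T], b: BufferedIterator[T])
              (implicit ord: Ordering[T]): Iterator[T] =
    new Iterator[T] {
      def hasNext: Boolean = a.hasNext || b.hasNext
      def next(): T =
        if (!b.hasNext) a.next()                    // b is consumed, drain a
        else if (!a.hasNext) b.next()               // a is consumed, drain b
        else if (ord.lteq(a.head, b.head)) a.next() // yield min(a.peek, b.peek)
        else b.next()
    }

  // Pair the lists up, merge each pair, and recurse, as in the pseudocode.
  def mergeAll[T](sources: Seq[BufferedIterator[T]])
                 (implicit ord: Ordering[T]): Iterator[T] =
    sources match {
      case Seq(only) => only
      case _ =>
        mergeAll(sources.grouped(2).map {
          case Seq(a, b) => merge(a, b).buffered
          case Seq(odd)  => odd // the unpaired list when the count is odd
        }.toSeq)
    }
}

Usage would look like NaiveMerge.mergeAll(sources.map(_.buffered)); every next() on the result can block on a slow read, which is what the buffered variants below try to avoid.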
2. Naïve buffered (X items):
// Always precaches X items from a list when its cache is empty.
// Only caches as required.
// Only advantage: the read overhead is amortized over X items at a time.
if (lists.size > 2) {
  merge_all(for ((list1, list2) <- lists.grouped_by_2) {
    yield merge(list1, list2)
  })
} else {
  yield min(list1.peek, list2.peek)
  if (consumed_list.is_empty) {
    consumed_list.preload_X_items
  }
  get_the_next_item_from_the_consumed_list
}
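A sketch of the buffered source itself, again in Scala. The fetch function is an assumption standing in for the real high-latency batch read (not part of the original notes); an empty batch marks the source as exhausted:

import scala.collection.mutable

// One slow round trip fills the cache with up to bufferSize (X) items;
// peeks and reads in between are served from memory.
class BufferedSource[T](fetch: Int => Seq[T], bufferSize: Int) {
  private val cache = mutable.Queue.empty[T]
  private var exhausted = false

  private def refill(): Unit =
    if (cache.isEmpty && !exhausted) {
      val batch = fetch(bufferSize) // pay the latency once per X items
      if (batch.isEmpty) exhausted = true else cache ++= batch
    }

  def peek: Option[T] = { refill(); cache.headOption }
  def hasNext: Boolean = peek.isDefined
  def next(): T = { refill(); cache.dequeue() }
}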
3. Grouped buffered (X items):
// Same as 2, but whenever one list is preloaded, preload all lists.
// Advantage: amortization??
if (lists.size > 2) {
  merge_all(for ((list1, list2) <- lists.grouped_by_2) {
    yield merge(list1, list2)
  })
} else {
  yield min(list1.peek, list2.peek)
  if (consumed_list.is_empty) {
    lists.filter_by_lists_with_fewer_than_X_items.each { preload_to_X_items }
  }
  get_the_next_item_from_the_consumed_list
}
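A sketch of the grouped refill, extending the BufferedSource idea with a top-up operation. shortfall and topUp are hypothetical names; fetch is the same assumed batch read as above:

// Variant of BufferedSource whose cache can be topped back up to X at any time.
class TopUpSource[T](fetch: Int => Seq[T], bufferSize: Int) {
  private val cache = scala.collection.mutable.Queue.empty[T]
  def shortfall: Int = bufferSize - cache.size
  def topUp(): Unit = if (shortfall > 0) cache ++= fetch(shortfall)
  def isEmpty: Boolean = cache.isEmpty
  def peek: Option[T] = cache.headOption
  def next(): T = cache.dequeue()
}

// When any one source runs dry, refill every source that is below X items.
def refillGroup[T](sources: Seq[TopUpSource[T]]): Unit =
  sources.filter(_.shortfall > 0).foreach(_.topUp())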
4. Parallel grouped buffered:
// Each time any list's buffer is depleted, preload all lists in parallel.
// This is where I expect to see the biggest performance increase: each refill round
// takes only as long as its slowest load, but you get all the other lists topped up for free.
// In a perfect scenario of 10 lists where they're all evenly striped (and X = 10), you
// only pay the price of preloading 1 list every 100 elements.
if (lists.size > 2) {
  merge_all(for ((list1, list2) <- lists.grouped_by_2) {
    yield merge(list1, list2)
  })
} else {
  yield min(list1.peek, list2.peek)
  if (consumed_list.is_empty) {
    lists.filter_by_lists_with_fewer_than_X_items.each_in_parallel {
      preload_to_X_items
    }
  }
  get_the_next_item_from_the_consumed_list
}
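The parallel variant only changes how the top-ups are issued. A sketch using Futures, assuming the TopUpSource from approach 3 and that fetch is safe to call concurrently against independent sources:

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Refill every under-full source concurrently; the merge resumes once the
// slowest load finishes, and every other buffer gets topped up "for free".
def refillGroupParallel[T](sources: Seq[TopUpSource[T]]): Unit = {
  val loads = sources.collect {
    case s if s.shortfall > 0 => Future(s.topUp())
  }
  Await.result(Future.sequence(loads), Duration.Inf)
}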
A large part of the algorithm is also determining which lists to pre-load, and by how
much. Since a pre-load is only initiated when a buffer is completely empty, you're
guaranteed to always pay the price of at least one full buffer load. Assuming all your
sources have roughly the same latency, you might as well load as many as possible every
time. And since you're always pre-loading at least one buffer all the way, you might as
well pre-load every buffer that is less than full, assuming preload_X_items takes at
most as long as preload_X+Y_items (where Y > 0).
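A back-of-the-envelope check of the "1 list every 100 elements" claim, assuming every batch fetch costs one fixed round trip regardless of batch size, with N = 10 evenly striped sources and X = 10 (all numbers hypothetical):

// Hypothetical numbers, only to ground the amortization argument.
val (n, x, total) = (10, 10, 10000)  // sources, buffer size, merged output elements
val serialLoads   = total / x        // approaches 2-3: every refill blocks the merge
val parallelLoads = total / (n * x)  // approach 4: one parallel round tops up all N
println(s"blocking loads -- serial: $serialLoads, parallel: $parallelLoads")
// blocking loads -- serial: 1000, parallel: 100 (one per 100 elements)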