
# nex3/problem.md Last active Jun 1, 2016

Define two sets of intervals to be equivalent relative to a set of numbers (the base) if both sets' intervals cover the exact same set of numbers in the base. For example, given the base [0, 1, 3, 4, 6], {[0, 1], [3, 6]} is equivalent to {[0, 3], [4, 6]} but not to {[1, 3], [4, 6]} because the latter doesn't cover 0.
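In code, equivalence just means covering the same subset of the base. A minimal Python sketch (treating the example intervals as closed):

```python
def covered(base, intervals):
    """The set of base elements covered by any interval (closed intervals)."""
    return {x for x in base for (lo, hi) in intervals if lo <= x <= hi}

base = [0, 1, 3, 4, 6]
a = [(0, 1), (3, 6)]
b = [(0, 3), (4, 6)]
c = [(1, 3), (4, 6)]
assert covered(base, a) == covered(base, b)  # both cover {0, 1, 3, 4, 6}
assert covered(base, a) != covered(base, c)  # c misses 0
```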

Define a set of intervals to be maximal relative to a base if there's no equivalent set with fewer intervals. Equivalently, a set of intervals is maximal iff, for any two intervals in the set, there's at least one number in the base strictly between the upper bound of the lower interval and the lower bound of the higher one.

For example, given the base [0, 1, 3, 4, 6], the set of ranges {[0, 1], [4, 6]} is maximal because 3 is between 1 and 4, but {[0, 1], [3, 6]} is not because there's nothing in the base between 1 and 3.
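The adjacency condition is easy to check directly. A Python sketch (closed intervals, assuming the input intervals are disjoint):

```python
def is_maximal(base, intervals):
    """True iff some base element lies strictly between each interval's
    upper bound and the next interval's lower bound."""
    ivs = sorted(intervals)
    return all(
        any(hi < x < next_lo for x in base)
        for (_, hi), (next_lo, _) in zip(ivs, ivs[1:])
    )

base = [0, 1, 3, 4, 6]
assert is_maximal(base, [(0, 1), (4, 6)])       # 3 lies between 1 and 4
assert not is_maximal(base, [(0, 1), (3, 6)])   # nothing between 1 and 3
```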

### Level 1

You are given a (potentially very large) ordered list of floats—the base—and a list of intervals, each of which may be open or closed. Return a maximal set of intervals that's equivalent to the input list. These intervals should be ordered first by minimum, then by maximum.

```
List<Interval> maximize(List<float> base, List<Interval> intervals);
```

Note: In practice, the input of this algorithm comes from humans and the output is intended to be viewed by them. While not strictly necessary for correctness, this means that there's a lot of value in preserving the original bounds of the ranges where possible rather than expanding them eagerly.

#### Best solution so far

For simplicity, we assume all intervals are open and that no interval properly contains any other. We can easily generalize the algorithm to avoid these assumptions.

```
maximize(base, intervals):
  sort intervals

  result = []
  previous = first element of intervals
  for interval in intervals[1..-1]:
    mid = binary search base for the least upper bound of previous
    if mid is null or below interval:
      append previous to result
      previous = interval
    else:
      previous = [previous.min, interval.max]

  append previous to result
  return result
```

This is O(|intervals|⋅log(|base|)) (technically it's O(|intervals|⋅log(|intervals|) + |intervals|⋅log(|base|)) but |intervals| << |base|).
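A runnable version of this pass might look like the following Python sketch, under the same assumptions (open intervals, sorted base, no interval properly contains another). `bisect_left` finds the least base element the previous interval doesn't cover:

```python
from bisect import bisect_left

def maximize(base, intervals):
    """Merge intervals that have no base element in the gap between them.
    Sketch: open intervals, `base` sorted, no proper containment."""
    intervals = sorted(intervals)
    result = []
    prev_lo, prev_hi = intervals[0]
    for lo, hi in intervals[1:]:
        # least base element not covered by the open interval (prev_lo, prev_hi)
        i = bisect_left(base, prev_hi)
        mid = base[i] if i < len(base) else None
        if mid is None or mid <= lo:  # mid sits in the gap (or there is none above)
            result.append((prev_lo, prev_hi))
            prev_lo, prev_hi = lo, hi
        else:                         # no base element separates them: merge
            prev_hi = hi
    result.append((prev_lo, prev_hi))
    return result
```

For example, over the base `[0, 1, 3, 4, 6]`, `(-1, 2)` and `(2.5, 7)` merge (no base element lies in the gap between 2 and 2.5), but `(-1, 2)` and `(3.5, 7)` don't, because 3 lies between them.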

### Level 2

The same base will be used with many different lists of intervals. See if you can make the algorithm more efficient by pre-processing the base somehow. Note that because the base may be very large, it would be great if maximize() weren't O(|base|).

```
class Maximizer {
  Maximizer(List<float> base);

  List<Interval> maximize(List<Interval> intervals);
}
```

#### Best solution so far

We make the same simplifying assumptions as above.

```
class Maximizer:
  upperBounds = empty map

  maximize(intervals):
    sort intervals

    result = []
    previous = first element of intervals
    for interval in intervals[1..-1]:
      if upperBounds contains key previous.max:
        mid = upperBounds[previous.max]
      else:
        mid = binary search base for the least upper bound of previous
        upperBounds[previous.max] = mid

      if mid is null: # previous has no upper bound in base
        break loop
      else if mid is below interval:
        append previous to result
        previous = interval
      else:
        previous = [previous.min, interval.max]

    append previous to result
    return result
```

This is O(|intervals|⋅log(|base|)) for all-new intervals, but if all the intervals are from outputs of maximize() (which they will often be in practice), it's only O(|intervals|⋅log(|intervals|)). In addition, the number of binary searches necessary scales with the number of new intervals only, meaning for example adding a single new interval to a maximized set is O(|intervals|⋅log(|intervals|) + log(|base|)).
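As a concrete sketch (Python, same assumptions as Level 1: open intervals, no proper containment), the cache maps each `previous.max` to its least upper bound in the base:

```python
from bisect import bisect_left

class Maximizer:
    """Level 2 caching sketch: the base is fixed at construction and
    binary-search results are memoized across maximize() calls."""

    def __init__(self, base):
        self.base = base
        self.upper_bounds = {}  # previous max -> least base element >= it

    def maximize(self, intervals):
        intervals = sorted(intervals)
        result = []
        prev_lo, prev_hi = intervals[0]
        for lo, hi in intervals[1:]:
            if prev_hi in self.upper_bounds:
                mid = self.upper_bounds[prev_hi]
            else:
                i = bisect_left(self.base, prev_hi)
                mid = self.base[i] if i < len(self.base) else None
                self.upper_bounds[prev_hi] = mid
            if mid is None:
                break  # nothing in base above previous; later intervals add no coverage
            if mid <= lo:  # mid lies in the gap: previous can't be merged
                result.append((prev_lo, prev_hi))
                prev_lo, prev_hi = lo, hi
            else:
                prev_hi = hi
        result.append((prev_lo, prev_hi))
        return result
```

A second call with intervals whose maxima were already seen skips the binary searches entirely.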

### Level 3

We often end up merging the results of previous calls to maximize(). See if you can make the algorithm more efficient by avoiding re-checking intervals that are known to already be part of the same maximized set. We represent this by passing a List<List<Interval>>, where each sub-list is guaranteed to be maximized.

```
class Maximizer {
  Maximizer(List<float> base);

  List<Interval> maximize(List<List<Interval>> intervalLists);
}
```

### Bonus

I also want to perform intersection and difference operations on lists of intervals. I suspect that as long as the inputs to these operations are maximized, the outputs from naive algorithms will be as well. Either prove that this is true, or give a counterexample and provide union() and difference() algorithms that preserve maximization.

### rchurchley commented May 19, 2016

I'm not entirely sure whether I'm thinking straight at the end of the day, but do I have these assumptions correct? By "ordered" you mean the list `values` comes pre-sorted. "As few distinct intervals as possible" is equivalent to: given any two intervals A and B in the output list, there is at least one number that lies "between" A and B which is in `values` but not in any of the intervals. I'm not sure if that second assumption is justified, but it's at least plausible to me. I'll go with it for now. :)

My first thought is to start by ordering the intervals by their left endpoints. If `values` were a list of consecutive integers, we could easily tell whether we can merge two consecutive intervals, and do something like this:

```
intervals.sort()
A = intervals.pop(0)  # leftmost input interval
while intervals:
    B = intervals.pop(0)
    # Assume all intervals are half-open of the form [i, j).
    # Similar tests work for other types of intervals.
    if B.left_endpoint <= A.right_endpoint:
        # A and B overlap or squeeze up to each other
        A = union(A, B)
    else:
        # there is a number after A and before B, so we can't merge them
        output_intervals.append(A)
        A = B
output_intervals.append(A)
return output_intervals
```

(Insert caveat about untested code here.)

Of course, the values don't have to be consecutive, so this code won't produce the optimal solution. For example, it fails to merge the intervals [1, 3) and [4, 7) even when `values` misses the number 3. But in this case, the intervals [1, 3) and [1, 4) are actually the same as far as we're concerned, as they both contain the same values. If we began with [1, 4) instead of [1, 3), we'd have gotten the right answer.

I think we might be able to use this idea to find an efficient solution in general. Given an interval of the form [a, b), let

```
aa = min(x for x in values if x >= a)
bb = min(x for x in values if x >= b)
```

We might as well replace [a, b) with [aa, bb) since they contain the same values.

Although I've written their definitions inefficiently, I'm pretty sure we can find the new endpoints aa and bb with binary search. So we can preprocess all the intervals into this form using only O(k log n) steps, where k is the number of intervals and n is the size of `values`. Assuming I haven't made an "ass out of u and me", we can now apply the same algorithm I described above for consecutive-integer values. If A and B are consecutive intervals in the left-endpoint ordering and the right endpoint of A is strictly smaller than the left endpoint of B, then there is in fact a number in `values` that is missed by A and B, showing that they can't be merged. The total runtime of this algorithm should be O(k (log k + log n)), a pretty efficient solution to at least Level 1.
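The snapping step can be done with `bisect`. A sketch (half-open intervals; snapping both endpoints up to the least base element ≥ them leaves the covered set of [a, b) unchanged, assuming the interval covers at least one element of `values`):

```python
from bisect import bisect_left

def snap(values, interval):
    """Snap a half-open [a, b) to base elements covering the same values.
    If nothing in `values` is >= b, the upper endpoint is left as-is."""
    a, b = interval
    aa = values[bisect_left(values, a)]       # least element >= a
    i = bisect_left(values, b)
    bb = values[i] if i < len(values) else b  # least element >= b
    return (aa, bb)
```

With `values = [1, 2, 4, 5, 6]`, this turns [1, 3) into [1, 4) and leaves [4, 7) alone, after which the consecutive-integer merge test succeeds.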

### nex3 commented May 20, 2016

 I updated the gist to be more clear about the exact bounds of the problem. @rchurchley your assumptions were correct, and I think your algorithm may work with some tweaks, although I'm worried that eagerly expanding intervals may produce worse output from a human perspective.

### rchurchley commented May 20, 2016 • edited

 Yes, if it's important in practice to maintain the bounds of the intervals for human readability, then it's best to have the algorithm do the binary searches in the comparisons rather than changing the intervals in preprocessing. Thanks @nex3 for checking, cleaning up, and implementing the idea! I'm not entirely sure at the moment how to preprocess the base to speed this up, although fortunately the algorithm already only has a logarithmic dependence on |base|. I'll have to think a bit more about Level 2. For Level 3, we should be able to adjust the algorithm to avoid the binary search when we already know that two consecutive intervals can't be merged. Instead of sorting all of the intervals, let's keep the intervalLists together and organize them into a min heap, so we can efficiently find the leftmost interval as well as the intervalList that contains it. Then, when we apply the above idea, we can tell if two intervals have already been checked and not bother with the binary search. 
Rough pseudocode sketch:

```
mergedInterval(A, B):
  mid = binary search base for the least upper bound of A
  if mid is below B:
    return null
  else:
    return [A.min, B.max]

maximize(base, intervalLists):
  create a min heap from intervalLists
  result = []
  lastInterval = null
  while heap is non-empty:
    # find and remove the leftmost interval in the collection
    currentList = pop from heap
    currentInterval = pop from currentList
    if currentList is non-empty:
      insert currentList back into heap

    if lastInterval == null:
      lastList = currentList
      lastInterval = currentInterval
      continue

    if currentList == lastList:
      # lastInterval and currentInterval are guaranteed to be unmergeable
      append lastInterval to result
      lastInterval = currentInterval
    else:
      merged = mergedInterval(lastInterval, currentInterval)
      if merged == null:
        append lastInterval to result
        lastInterval = currentInterval
      else:
        lastInterval = merged
      # either way, the pending interval's max now comes from currentList
      lastList = currentList

  append lastInterval to result
  return result
```

This takes O(|intervalLists| log |intervalLists|) time to construct the heap, and maintaining the heap involves at most O(|intervals| log |intervalLists|) time worth of re-insertions. In the worst-case scenario, we'll still take O(|intervals| log |base|) time, but the runtime can be as good as O(|intervalLists| log |base|) if the maximized sub-lists are arranged nicely.

I should probably point out that the above code assumes that no interval properly contains another (otherwise, the union of A and B is not necessarily [A.min, B.max]). But I think this can be dealt with either by preprocessing the intervals to remove redundancies or by adding a couple of extra cases to the algorithm. Does this look reasonable to you? I'm very curious to see if you (or others) know of a different solution, or have a target runtime in mind we might be able to achieve.
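The heap-driven "leftmost interval plus its source list" traversal can be sketched with Python's `heapq` (the index lets the caller detect when consecutive intervals come from the same pre-maximized sub-list and skip the binary search):

```python
import heapq

def leftmost_order(interval_lists):
    """Yield (interval, source-list index) pairs in left-to-right order.
    Assumes each sub-list is sorted; intervals are comparable tuples."""
    heap = [(lst[0], idx, 0) for idx, lst in enumerate(interval_lists) if lst]
    heapq.heapify(heap)
    while heap:
        interval, idx, pos = heapq.heappop(heap)
        yield interval, idx
        if pos + 1 < len(interval_lists[idx]):
            heapq.heappush(heap, (interval_lists[idx][pos + 1], idx, pos + 1))

list(leftmost_order([[(0, 1), (4, 6)], [(2, 3)]]))
# -> [((0, 1), 0), ((2, 3), 1), ((4, 6), 0)]
```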

### nex3 commented May 20, 2016

 Yeah, that does look good! It can probably also be combined with the upper-bound-caching in the solution I just posted.