@nex3
Last active June 1, 2016 23:08

Define two sets of intervals to be equivalent relative to a set of numbers (the base) if both sets' intervals cover the exact same set of numbers in the base. For example, given the base [0, 1, 3, 4, 6], {[0, 1], [3, 6]} is equivalent to {[0, 3], [4, 6]} but not to {[1, 3], [4, 6]} because the latter doesn't cover 0.

Define a set of intervals to be maximal relative to a base if there's no equivalent set with fewer intervals. Equivalently, a set of intervals is maximal iff, for any two intervals in the set, there exists at least one number in the base between the upper bound of one interval and the lower bound of the other.

For example, given the base [0, 1, 3, 4, 6], the set of ranges {[0, 1], [4, 6]} is maximal because 3 is between 1 and 4, but {[0, 1], [3, 6]} is not because there's nothing in the base between 1 and 3.
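To make these definitions concrete, here's a tiny Python sketch (not part of the problem statement) that checks equivalence by brute force. It assumes closed intervals represented as (min, max) tuples; covered and equivalent are hypothetical helper names.

def covered(base, intervals):
    # The base values that fall inside at least one interval.
    return {x for x in base if any(lo <= x <= hi for (lo, hi) in intervals)}

def equivalent(base, a, b):
    # Two sets of intervals are equivalent relative to base iff they cover
    # the same base values.
    return covered(base, a) == covered(base, b)

base = [0, 1, 3, 4, 6]
assert equivalent(base, [(0, 1), (3, 6)], [(0, 3), (4, 6)])
assert not equivalent(base, [(0, 1), (3, 6)], [(1, 3), (4, 6)])  # the latter misses 0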

Level 1

You are given a (potentially very large) ordered list of floats—the base—and a list of intervals, each of which may be open or closed. Return a maximal set of intervals that's equivalent to the input list. These intervals should be ordered first by minimum, then by maximum.

List<Interval> maximize(List<float> base, List<Interval> intervals);

Note: In practice, the input of this algorithm comes from humans and the output is intended to be viewed by them. While not strictly necessary for correctness, this means that there's a lot of value in preserving the original bounds of the ranges where possible rather than expanding them eagerly.

Best solution so far

For simplicity, we assume all intervals are open and that no interval properly contains any other. We can easily generalize this to avoid these assumptions.

maximize(base, intervals):
  sort intervals

  result = []
  previous = first element of intervals
  for interval in intervals[1..-1]:
    mid = binary search base for the least upper bound of previous
    if mid is not null and is below interval:
      append previous to result
      previous = interval
    else:
      previous = [previous.min, interval.max]
      
  append previous to result
  return result

This is O(|intervals|⋅log(|base|)) (technically it's O(|intervals|⋅log(|intervals|) + |intervals|⋅log(|base|)) but |intervals| << |base|).
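For concreteness, here's a rough Python rendering of the pseudocode above, under the same assumptions (all intervals open, none properly containing another, represented here as (min, max) tuples). It's a sketch rather than a vetted implementation.

import bisect

def maximize(base, intervals):
    intervals = sorted(intervals)
    result = []
    prev_min, prev_max = intervals[0]
    for lo, hi in intervals[1:]:
        # Least upper bound of the open interval (prev_min, prev_max):
        # the smallest base element >= prev_max, or None if there is none.
        i = bisect.bisect_left(base, prev_max)
        mid = base[i] if i < len(base) else None
        if mid is not None and mid <= lo:
            # mid lies between the two intervals, so they can't be merged.
            result.append((prev_min, prev_max))
            prev_min, prev_max = lo, hi
        else:
            # No base element separates them; merge.
            prev_max = hi
    result.append((prev_min, prev_max))
    return result

For example, maximize([0, 1, 3, 4, 6], [(-1, 1.5), (2.5, 6.5)]) merges the two intervals (no base value lies in [1.5, 2.5]), while maximize([0, 1, 3, 4, 6], [(-1, 1.5), (3.5, 6.5)]) keeps them separate because 3 lies between them.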

Level 2

The same base will be used with many different lists of intervals. See if you can make the algorithm more efficient by pre-processing the base somehow. Note that because the base may be very large, it would be great if maximize() weren't O(|base|).

class Maximizer {
  Maximizer(List<float> base);

  List<Interval> maximize(List<Interval> intervals);
}

Best solution so far

We make the same simplifying assumptions as above.

class Maximizer:
  base = given to the constructor
  upperBounds = empty map

  maximize(intervals):
    sort intervals
    
    result = []
    previous = first element of intervals
    for interval in intervals[1..-1]:
      if upperBounds contains key previous.max:
        mid = upperBounds[previous.max]
      else:
        mid = binary search base for the least upper bound of previous
        upperBounds[previous.max] = mid

      if mid is null: # previous has no upper bound in base
        break loop
      else if mid is below interval:
        append previous to result
        previous = interval
      else:
        previous = [previous.min, interval.max]
        
    append previous to result
    return result

This is O(|intervals|⋅log(|base|)) for all-new intervals, but if all the intervals are from outputs of maximize() (which they will often be in practice), it's only O(|intervals|⋅log(|intervals|)). In addition, the number of binary searches necessary scales with the number of new intervals only, meaning for example adding a single new interval to a maximized set is O(|intervals|⋅log(|intervals|) + log(|base|)).
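Here's a rough Python sketch of the cached variant, again under the same simplifying assumptions (open intervals as (min, max) tuples, no containment). The cache stores, for each upper bound already seen, the least base element at or above it, so repeated calls on previously-maximized intervals skip the binary search.

import bisect

class Maximizer:
    def __init__(self, base):
        self.base = base
        self.upper_bounds = {}  # interval max -> least base element >= it (or None)

    def _least_upper_bound(self, x):
        if x not in self.upper_bounds:
            i = bisect.bisect_left(self.base, x)
            self.upper_bounds[x] = self.base[i] if i < len(self.base) else None
        return self.upper_bounds[x]

    def maximize(self, intervals):
        intervals = sorted(intervals)
        result = []
        prev_min, prev_max = intervals[0]
        for lo, hi in intervals[1:]:
            mid = self._least_upper_bound(prev_max)
            if mid is None:
                # Nothing in the base lies above previous, so the remaining
                # intervals add no coverage.
                break
            if mid <= lo:
                result.append((prev_min, prev_max))
                prev_min, prev_max = lo, hi
            else:
                prev_max = hi
        result.append((prev_min, prev_max))
        return result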

Level 3

We often end up merging the results of previous calls to maximize(). See if you can make the algorithm more efficient by avoiding re-checking intervals that are known to already be part of the same maximized set. We represent this by passing a List<List<Interval>>, where each sub-list is guaranteed to be maximized.

class Maximizer {
  Maximizer(List<float> base);

  List<Interval> maximize(List<List<Interval>> intervalLists);
}

Bonus

I also want to perform intersection and difference operations on lists of intervals. I suspect that as long as the inputs to these operations are maximized, the outputs from naive algorithms will be as well. Either prove that this is true, or give a counterexample and provide intersection() and difference() algorithms that preserve maximization.

@rchurchley

I'm not entirely sure whether I'm thinking straight at the end of the day, but do I have these assumptions correct?

  1. By "ordered" you mean the list values comes pre-sorted.
  2. "As few distinct intervals as possible" is equivalent to: Given any two intervals A and B in the output list, there is at least one integer i that lies "between" A and B which is in values but not in any of the intervals.

I'm not sure if that second assumption is justified, but it's at least plausible to me. I'll go with it for now. :)

My first thought is to start by ordering the intervals by their left endpoints. If values were a list of consecutive integers, we could easily tell whether we can merge two consecutive intervals, and do something like this.

intervals.sort()
output_intervals = []

A = intervals.pop(0)  # leftmost input interval

while intervals:
    B = intervals.pop(0)

    # Assume all intervals are half-open of the form [i, j).
    # Similar tests work for other types of intervals.

    if B.left_endpoint <= A.right_endpoint:
        # A and B overlap or squeeze up to each other
        A = union(A, B)
    else:
        # there is a number after A and before B, so we can't merge them
        output_intervals.append(A)
        A = B

output_intervals.append(A)
return output_intervals

(Insert caveat about untested code here.)

Of course, the values don't have to be consecutive, so this code won't produce the optimal solution. For example, it fails to merge the intervals [1,3) and [4, 7) even when values misses the number 3. But in this case, the intervals [1, 3) and [1, 4) are actually the same as far as we're concerned, as they both contain the same values. If we began with [1, 4) instead of [1, 3), we'd have gotten the right answer.

I think we might be able to use this idea to find an efficient solution in general. Given an interval of the form [a, b), let

aa = min(x for x in values if x >= a)
bb = min(x for x in values if x >= b)

We might as well replace [a, b) with [aa, bb) since they contain the same values. Although I've written their definitions inefficiently, I'm pretty sure we can find the new endpoints aa and bb with binary search. So we can preprocess all the intervals into this form using only O(k log n) steps, where k is the number of intervals and n the number of values in the base.
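For instance, here's a small bisect-based sketch (assuming values is sorted and each interval covers at least one value; canonicalize is a hypothetical helper):

import bisect

def canonicalize(values, a, b):
    # Smallest value >= a (the leftmost value the interval [a, b) covers).
    aa = values[bisect.bisect_left(values, a)]
    # Smallest value >= b, if any; otherwise b already clears every value.
    i = bisect.bisect_left(values, b)
    bb = values[i] if i < len(values) else b
    return aa, bb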

Assuming I haven't made an "ass out of u and me", we can now apply the same algorithm I described above for consecutive-integer values. If A and B are consecutive intervals in the left-endpoint ordering and the right endpoint of A is strictly smaller than the left endpoint of B, then there is in fact a value in values that is missed by A and B, showing that they can't be merged. The total runtime of this algorithm should be O(k (log k + log n)), a pretty efficient solution to at least Level 1.

@nex3

nex3 commented May 20, 2016

I updated the gist to be more clear about the exact bounds of the problem. @rchurchley your assumptions were correct, and I think your algorithm may work with some tweaks, although I'm worried that eagerly expanding intervals may produce worse output from a human perspective.

@rchurchley

rchurchley commented May 20, 2016

Yes, if it's important in practice to maintain the bounds of the intervals for human readability, then it's best to have the algorithm do the binary searches in the comparisons rather than changing the intervals in preprocessing. Thanks @nex3 for checking, cleaning up, and implementing the idea!

I'm not entirely sure at the moment how to preprocess the base to speed this up, although fortunately the algorithm already only has a logarithmic dependence on |base|. I'll have to think a bit more about Level 2.

For Level 3, we should be able to adjust the algorithm to avoid the binary search when we already know that two consecutive intervals can't be merged. Instead of sorting all of the intervals, let's keep the intervalLists together and organize them into a min heap, so we can efficiently find the leftmost interval as well as the intervalList that contains it. Then, when we apply the above idea, we can tell if two intervals have already been checked and not bother with the binary search.

Rough pseudocode sketch:

mergedInterval(A, B):
  mid = binary search base for the least upper bound of A
  if mid is below B:
    return null
  else:
    return [A.min, B.max]


maximize(base, intervalLists):
  create a min heap from intervalLists

  result = []
  lastInterval = null
  while heap is non-empty:
    # find and remove the leftmost interval in the collection
    currentList = pop from heap
    currentInterval = pop from currentList
    insert currentList back into heap

    if lastInterval == null:
      lastList = currentList
      lastInterval = currentInterval
      continue

    if currentList == lastList:
      # lastInterval and currentInterval are guaranteed to be unmergeable
      append lastInterval to result
      lastInterval = currentInterval
    else:
      merged = mergedInterval(lastInterval, currentInterval)
      if merged == null:
        append lastInterval to result
        lastInterval = currentInterval
      else:
        lastInterval = merged
      lastList = currentList

  append lastInterval to result
  return result

This takes O(|intervalLists| log |intervalLists|) time to construct the heap, and maintaining the heap involves at most O(|intervals| log |intervalLists|) time worth of re-insertions. In the worst case scenario, we'll still take O(|intervals| log |base|) time, but the runtime can be as good as O(|intervalLists| log |base|) if the maximized sub-lists are arranged nicely.

I should probably point out that the above code assumes that no interval properly contains another (otherwise the union of A and B is not necessarily [A.min, B.max]). But I think this can be dealt with either by preprocessing the intervals to remove redundancies or by adding a couple extra cases to the algorithm.
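If it helps, here's a rough Python translation of the sketch using heapq, under the same assumptions (open intervals as (min, max) tuples, each sub-list sorted and already maximized, no containment). The (list id, index) bookkeeping is just one way to track which sub-list an interval came from.

import bisect
import heapq

def maximize_lists(base, interval_lists):
    # Heap entries: (next interval from a list, list id, index within that list).
    heap = [(lst[0], i, 0) for i, lst in enumerate(interval_lists) if lst]
    heapq.heapify(heap)

    def separated(prev_max, lo):
        # True if some base element lies between prev_max and lo.
        i = bisect.bisect_left(base, prev_max)
        return i < len(base) and base[i] <= lo

    result = []
    last = None       # the pending interval
    last_list = None  # the list its right endpoint came from
    while heap:
        (lo, hi), list_id, idx = heapq.heappop(heap)
        if idx + 1 < len(interval_lists[list_id]):
            heapq.heappush(heap, (interval_lists[list_id][idx + 1], list_id, idx + 1))

        if last is None:
            last, last_list = (lo, hi), list_id
        elif list_id == last_list or separated(last[1], lo):
            # Same maximized sub-list (known unmergeable) or a base element
            # separates them: emit the pending interval and start a new one.
            result.append(last)
            last, last_list = (lo, hi), list_id
        else:
            # Nothing in the base lies between them: merge.
            last, last_list = (last[0], hi), list_id

    if last is not None:
        result.append(last)
    return result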

Does this look reasonable to you? I'm very curious to see if you (or others) know of a different solution, or a target runtime in mind we might be able to achieve.

@nex3

nex3 commented May 20, 2016

Yeah, that does look good! It can probably also be combined with the upper-bound-caching in the solution I just posted.
