Skip to content

Instantly share code, notes, and snippets.

@softprops
Last active August 29, 2015 13:56
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save softprops/8873329 to your computer and use it in GitHub Desktop.
Save softprops/8873329 to your computer and use it in GitHub Desktop.
wanted: a way to collapse the longer part of a set of long tailed data
/** a shortener for things that may have `long tails` */
case class Tailed[A, B: Numeric](
population: Iterable[A], index: A => B, update: (A, B) => A) {
private val num = implicitly[Numeric[B]]
private lazy val indexed = population.groupBy(index)
def shorten(threshold: Double = .03) = {
val total = num.toDouble(indexed.keys.reduce(num.plus))
def accept(index: B) = (num.toDouble(index) / total) > threshold
val accepted = indexed.filterKeys(accept).values.flatten
val collapsed = indexed.filterKeys(!accept(_)).values.flatten.reduce {
(a, b) => update(a, num.plus(index(a), index(b)))
}
Tailed(accepted ++ Seq(collapsed), index, update)
}
}
@johnynek
Copy link

johnynek commented Feb 8, 2014

This algorithm is not entirely clear to me. For instance, if index is a trivial function and maps everything to one key,
def accept(index: B) = 1 > threshold which is true in this case. If on the other hand, index maps onto one of 1/(threshold - eps) keys, then accept(i: B) = false.

in the first case, you will accept everything, in the second, you will collapse everything in some general way.

how different is what you want from this?

def takeLowest(weight: Double, items: Iterable[T])(fn: T => Double): Iterable[T] = {
  val total = items.map(fn).sum
  items.map { t => (t, fn(t)) }
    .toList
    .sortBy(_._2)
    .scanLeft(None: Option[(T, Double)]) { case (optTsum, (t, w)) =>
      optTsum match {
        case Some((_, sum)) => (t, sum + w)
        case None => (t, w)
      }
    }
    .collect { case Some(tw) => tw }
    .takeWhile(_._2 < (total * weight))
    .map(_._1)
}

@softprops
Copy link
Author

For context a have a listing of things that have counted sums. What I want is to collapse the long tail of things that have a long tail of sums. B is a sum. Update takes the sums of two things and reduces something that will be collapsed into one. I then take the list of things with relevant sums and appended the collapsed long tail. The result should be a report with a shorter tail.

@softprops
Copy link
Author

Indexd is the list of things indexed by sum

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment