Last active
August 29, 2015 13:56
-
-
Save softprops/8873329 to your computer and use it in GitHub Desktop.
wanted: a way to collapse the longer part of a set of long tailed data
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
/** a shortener for things that may have `long tails` */ | |
case class Tailed[A, B: Numeric]( | |
population: Iterable[A], index: A => B, update: (A, B) => A) { | |
private val num = implicitly[Numeric[B]] | |
private lazy val indexed = population.groupBy(index) | |
def shorten(threshold: Double = .03) = { | |
val total = num.toDouble(indexed.keys.reduce(num.plus)) | |
def accept(index: B) = (num.toDouble(index) / total) > threshold | |
val accepted = indexed.filterKeys(accept).values.flatten | |
val collapsed = indexed.filterKeys(!accept(_)).values.flatten.reduce { | |
(a, b) => update(a, num.plus(index(a), index(b))) | |
} | |
Tailed(accepted ++ Seq(collapsed), index, update) | |
} | |
} |
For context a have a listing of things that have counted sums. What I want is to collapse the long tail of things that have a long tail of sums. B is a sum. Update takes the sums of two things and reduces something that will be collapsed into one. I then take the list of things with relevant sums and appended the collapsed long tail. The result should be a report with a shorter tail.
Indexd is the list of things indexed by sum
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
This algorithm is not entirely clear to me. For instance, if index is a trivial function and maps everything to one key,
def accept(index: B) = 1 > threshold
which is true in this case. If on the other hand, index maps onto one of 1/(threshold - eps) keys, thenaccept(i: B) = false
.in the first case, you will accept everything, in the second, you will collapse everything in some general way.
how different is what you want from this?