@orls
Created March 15, 2011 01:09
Determining freq. distribution buckets

How to figure out the number of buckets

A rough way to determine the number of buckets (classes) to split a dataset into, for frequency distribution analysis or histograms:

2ⁿ⁻¹ < α ≤ 2ⁿ

where α is the size of the dataset and n is the number of buckets.

In other words, find the smallest power of 2 that is at least as large as your dataset; the exponent n is the bucket count. Decrementing the exponent by one would yield a number smaller than the dataset.

So:

  • 2ⁿ ≈ α
  • n ln2 ≈ lnα
  • n ≈ lnα / ln2
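
The derivation above can be sketched in a few lines of Python. The helper names `bucket_count` and `bucket_count_log` are hypothetical, and rounding the estimate up to a whole number (with a floor of one bucket) is an assumption of this sketch rather than something the rule spells out.

```python
import math

def bucket_count(size: int) -> int:
    """Smallest n with 2**n >= size, with a floor of 1 (hypothetical helper)."""
    if size < 1:
        raise ValueError("size must be a positive integer")
    # (size - 1).bit_length() is an exact integer form of ceil(log2(size)),
    # so there is no floating-point rounding near exact powers of 2.
    # The floor of 1 is an assumption for the size == 1 edge case.
    return max(1, (size - 1).bit_length())

def bucket_count_log(size: int) -> float:
    """The unrounded estimate n ≈ ln(size) / ln(2) from the derivation."""
    return math.log(size) / math.log(2)

print(bucket_count(1000))      # 10, since 2**9 = 512 < 1000 <= 1024 = 2**10
print(bucket_count_log(1000))  # a little under 10
```

For a dataset whose size is exactly a power of 2, this sketch returns that exponent; how to treat that boundary case is a judgment call.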

Example, for a dataset of 345,000 data points:

  • n ≈ ln345000 / ln2
  • n ≈ 18.396236836327623

Rounding up, the number of buckets should therefore be 19: 2^19 = 524288 is larger than the dataset, while 2^18 = 262144 is smaller.
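
To see the bucket count in use, here is a rough sketch that splits a dataset into that many equal-width buckets and counts the points in each. The function name `equal_width_counts` and the synthetic Gaussian data are assumptions for illustration only; the rule itself says nothing about how the buckets are laid out.

```python
import random

def equal_width_counts(values):
    """Count values into ceil(log2(len(values))) equal-width buckets."""
    if not values:
        raise ValueError("values must not be empty")
    # Bucket count from the power-of-2 rule, with a floor of one bucket.
    n = max(1, (len(values) - 1).bit_length())
    lo, hi = min(values), max(values)
    # Fall back to a width of 1.0 when every value is identical.
    width = (hi - lo) / n or 1.0
    counts = [0] * n
    for v in values:
        # Clamp so the maximum value lands in the last bucket.
        i = min(int((v - lo) / width), n - 1)
        counts[i] += 1
    return counts

random.seed(0)  # fixed seed so the sketch is repeatable
data = [random.gauss(0, 1) for _ in range(345_000)]  # synthetic stand-in data
counts = equal_width_counts(data)
print(len(counts))  # 19
print(sum(counts))  # 345000
```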

Notes

  • Wish I could remember where I found this.
  • It's a good rule of thumb for ballparking, but I've always ended up adding one or two buckets to it for a tighter fit.
  • Somewhere there's a more 'correct' algorithm for buckets of uneven widths
    • Maybe based on clustering/density of points within each bucket?
    • e.g. keep each bar's 'area' the same, so bars in sparse parts of the dataset don't get very tall
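
One possible reading of the equal-area idea is equal-frequency (quantile) binning: pick bucket edges so that each bucket holds roughly the same number of points, which gives every bar the same area in a density-scaled histogram. This is only an interpretation, not necessarily the algorithm the note is reaching for, and the name `equal_frequency_edges` is hypothetical.

```python
def equal_frequency_edges(values, n_buckets):
    """Bucket edges such that each bucket holds roughly the same number of points."""
    if n_buckets < 1 or not values:
        raise ValueError("need at least one bucket and one value")
    ordered = sorted(values)
    last = len(ordered) - 1
    # Edge k sits at the k/n_buckets quantile, taken by rank without interpolation.
    return [ordered[(k * last) // n_buckets] for k in range(n_buckets + 1)]

# A small skewed dataset: most points are small, with two large outliers.
skewed = [1, 2, 3, 4, 5, 6, 7, 8, 100, 1000]
print(equal_frequency_edges(skewed, 3))  # [1, 4, 7, 1000]
```

The last bucket ends up very wide because the sparse tail needs a large range to collect its share of points, while the dense region gets narrow buckets.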