@massiung
Last active March 28, 2020 12:27
Presentation Kernel Density Estimation
---
title: Kernel Density Estimation
theme: solarized
revealOptions:
  transition: fade
---

Kernel Density Estimation


The Problem - Estimate Density

You have observed i.i.d. datapoints from an unknown distribution $X$ with probability density function $f$: $$x_1, \ldots, x_n \in \mathbb{R}.$$

You want to estimate the pdf $f$ to assess the likelihood of new observations (outlier detection) ...

or simply so you can plot it!


Attempt 1: Parametric

Assume you know at least the parametric family of $X$. For example, you assume that $X \sim \mathcal{N}(\mu, \sigma^2)$ for unknown $\mu$ and $\sigma$.




You then have methods like maximum likelihood estimation:

https://www.statlect.com/fundamentals-of-statistics/normal-distribution-maximum-likelihood
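For the normal family, the maximum likelihood estimates have a closed form: $\hat\mu$ is the sample mean and $\hat\sigma$ the biased ($1/n$) standard deviation. A minimal sketch (function name is illustrative):

```python
import math

def normal_mle(xs):
    """Closed-form MLE for a normal distribution:
    sample mean and the 1/n (biased) standard deviation."""
    n = len(xs)
    mu = sum(xs) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / n)
    return mu, sigma

normal_mle([1.0, 2.0, 3.0, 4.0])  # (2.5, sqrt(1.25))
```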


Attempt 2: Histogram

If you don't want to assume a distribution family, you can try plotting the values directly - not very useful, is it?

1d scatterplot


It is more useful to make a histogram: histogram


Problems with histograms:

  • Very coarse and not smooth
  • Heavily dependent on the number of bins
  • Outliers influence min/max and thus the bin choice
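The bin sensitivity is easy to see with a toy histogram (a hand-rolled sketch, not a plotting library):

```python
def histogram(xs, bins):
    """Count samples in equal-width bins spanning [min(xs), max(xs)].
    Note the bin edges depend on min/max - i.e. on outliers."""
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / bins
    counts = [0] * bins
    for x in xs:
        i = min(int((x - lo) / width), bins - 1)  # clamp x == max into the last bin
        counts[i] += 1
    return counts

data = [0.1, 0.2, 0.25, 0.9, 1.0]
histogram(data, 2)  # [3, 2]
histogram(data, 3)  # [3, 0, 2] - the shape changes with the bin count
```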

Attempt 3: Kernel Density Estimation

a.k.a. Parzen-Rosenblatt Method

Idea: density $\propto$ closeness to observed points $$ \hat{f}_h(x;\{x_i\}) = \frac{1}{nh}\sum_{i=1}^n K\left(\frac{x-x_i}{h}\right) $$

Here $K$ is a kernel - a probability density centered at the origin - and $h$ is the bandwidth, determining how smeared out each point's contribution gets.
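The estimator above is short enough to write out directly; a minimal sketch with a Gaussian kernel:

```python
import math

def gaussian_kernel(u):
    """Standard normal density - one common choice of K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def kde(x, samples, h, kernel=gaussian_kernel):
    """Parzen-Rosenblatt estimate: (1/(n*h)) * sum_i K((x - x_i) / h)."""
    n = len(samples)
    return sum(kernel((x - xi) / h) for xi in samples) / (n * h)

# With a single sample at 0 and h = 1 the estimate at 0 is K(0) = 1/sqrt(2*pi)
kde(0.0, [0.0], 1.0)
```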

Note: Time for a picture


Histogram vs KDE

histogram vs KDE


KDE - Examples of kernels

Empirically, the choice of kernel doesn't matter that much - the bandwidth is much more important!


<iframe height="600px" width="960px" src="https://en.wikipedia.org/wiki/Kernel_(statistics)#Kernel_functions_in_common_use"> </iframe>

KDE - Example uses




KDE - The importance of bandwidth

bandwidth


Bandwidth Selection - The plugin method

We might want to minimize the "squared error" - how close is $\hat{f_h}$ to $f$?

This is called the mean integrated squared error: $$ISE(h; \{x_i\}) = \int (\hat{f}_h(x;\{x_i\}) - f(x))^2 \, dx$$ $$MISE(h) = \mathbb{E} \, ISE(h; \{x_i\})$$

Warning: $f$ is unknown!
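For reference, the MISE admits the standard bias-variance decomposition, which is what the plugin method works with (stated here without proof):

$$MISE(h) = \int \mathrm{Bias}\big[\hat{f}_h(x)\big]^2 \, dx + \int \mathrm{Var}\big[\hat{f}_h(x)\big] \, dx,$$

where $\mathrm{Bias}[\hat{f}_h(x)] = \mathbb{E}\hat{f}_h(x) - f(x)$.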


Bandwidth Selection - The plugin method

Sketch

  1. Choose a pilot distribution for $f$, e.g. a quick KDE
  2. Taylor expand $MISE$
  3. Solve simpler optimization problem (polynomial)

This gives you a concrete formula!

Warning: This depends heavily on the choice of pilot distribution
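A well-known instance of this recipe is Silverman's rule of thumb, which uses a normal pilot and yields $h = 0.9 \min(\hat\sigma, IQR/1.34)\, n^{-1/5}$. A minimal sketch (the quantile helper is a simple linear-interpolation version, for illustration only):

```python
import math

def silverman_bandwidth(xs):
    """Silverman's rule of thumb: h = 0.9 * min(std, IQR/1.34) * n^(-1/5)."""
    n = len(xs)
    mu = sum(xs) / n
    std = math.sqrt(sum((x - mu) ** 2 for x in xs) / (n - 1))
    s = sorted(xs)

    def quantile(q):
        # linear interpolation between the two nearest order statistics
        pos = q * (n - 1)
        i = int(pos)
        frac = pos - i
        return s[i] if i + 1 >= n else s[i] * (1 - frac) + s[i + 1] * frac

    iqr = quantile(0.75) - quantile(0.25)
    return 0.9 * min(std, iqr / 1.34) * n ** (-0.2)
```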


Bandwidth Selection - Cross Validation

Sketch of algorithm

  1. Split examples into train $T$ and validation $V$
  2. Estimate $\hat{f_h}$ on the train set for a number of $h$'s.
  3. Compute likelihood of test for this model: $$L=\prod_{i\in V} \hat{f_h}(x_i ; \{x_j\}_{j\in T})$$
  4. Choose the $h$ with the highest likelihood

You might want to use leave-one-out or $K$-fold cross-validation. Note that the likelihood score is very sensitive to outliers.
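The steps above, specialized to leave-one-out validation, can be sketched as follows (reusing the `kde` function from earlier; all names are illustrative):

```python
import math

def gaussian_kernel(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def kde(x, samples, h):
    return sum(gaussian_kernel((x - xi) / h) for xi in samples) / (len(samples) * h)

def loo_log_likelihood(xs, h):
    """Score each x_i under the KDE built from all *other* points.
    Working in log-space avoids numerical underflow of the product."""
    return sum(math.log(kde(x, xs[:i] + xs[i + 1:], h))
               for i, x in enumerate(xs))

def select_bandwidth(xs, candidates):
    """Choose the candidate h with the highest leave-one-out likelihood."""
    return max(candidates, key=lambda h: loo_log_likelihood(xs, h))
```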


Tell me more!

There is so much more to say and learn!

  • How does this work in more dimensions?
  • How do you use this for outlier detection?
  • What are the confidence bounds?
  • ...

References
