@massiung
Last active March 28, 2020 12:27
Presentation Kernel Density Estimation
---
title: Kernel Density Estimation
theme: solarized
revealOptions:
  transition: fade
---

Kernel Density Estimation


The Problem - Estimate Density

You have observed i.i.d. datapoints from an unknown distribution $X$ with probability density function $f$: $$x_1, \ldots, x_n \in \mathbb{R}.$$

You want to estimate the pdf $f$ to assess the likelihood of new observations (outlier detection) ...

or simply so you can plot it!


Attempt 1: Parametric

Assume you know at least the parametric family of $X$. For example, you assume that $X \sim \mathcal{N}(\mu, \sigma^2)$ for unknown $\mu$ and $\sigma$.




You then have methods like maximum likelihood estimation:

https://www.statlect.com/fundamentals-of-statistics/normal-distribution-maximum-likelihood
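For the normal family, the maximum likelihood estimates have a closed form: $\hat\mu$ is the sample mean and $\hat\sigma$ the biased ($1/n$) standard deviation. A minimal sketch (function name is illustrative):

```python
import math

def normal_mle(xs):
    """Closed-form MLE for a normal distribution:
    sample mean and the 1/n (biased) standard deviation."""
    n = len(xs)
    mu = sum(xs) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / n)
    return mu, sigma

normal_mle([1.0, 2.0, 3.0, 4.0])  # (2.5, sqrt(1.25))
```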


Attempt 2: Histogram

If you don't want to assume a distribution family, you can try plotting the values directly - not very useful, is it?

1d scatterplot


It is more useful to make a histogram: histogram


Problems with histograms:

  • Very coarse and not smooth
  • Heavily dependent on the number of bins
  • Outliers influence min/max and thus the bin choice
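The bin sensitivity is easy to see with a toy histogram (a hand-rolled sketch, not a plotting library):

```python
def histogram(xs, bins):
    """Count samples in equal-width bins spanning [min(xs), max(xs)].
    Note the bin edges depend on min/max - i.e. on outliers."""
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / bins
    counts = [0] * bins
    for x in xs:
        i = min(int((x - lo) / width), bins - 1)  # clamp x == max into the last bin
        counts[i] += 1
    return counts

data = [0.1, 0.2, 0.25, 0.9, 1.0]
histogram(data, 2)  # [3, 2]
histogram(data, 3)  # [3, 0, 2] - the shape changes with the bin count
```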

Attempt 3: Kernel Density Estimation

a.k.a. Parzen-Rosenblatt Method

Idea: density $\propto$ closeness to observed points $$ \hat{f}_h(x;\{x_i\}) = \frac{1}{nh}\sum_{i=1}^n K\left(\frac{x-x_i}{h}\right) $$

Here $K$ is a kernel - a probability density centered at the origin - and $h$ is the bandwidth, determining how smeared out each point's contribution gets.
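The estimator above is short enough to write out directly; a minimal sketch with a Gaussian kernel:

```python
import math

def gaussian_kernel(u):
    """Standard normal density - one common choice of K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def kde(x, samples, h, kernel=gaussian_kernel):
    """Parzen-Rosenblatt estimate: (1/(n*h)) * sum_i K((x - x_i) / h)."""
    n = len(samples)
    return sum(kernel((x - xi) / h) for xi in samples) / (n * h)

# With a single sample at 0 and h = 1 the estimate at 0 is K(0) = 1/sqrt(2*pi)
kde(0.0, [0.0], 1.0)
```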

Note: Time for a picture


Histogram vs KDE

histogram vs KDE


KDE - Examples of kernels

Empirically, the choice of kernel doesn't matter that much - the bandwidth is much more important!


<iframe height="600px" width="960px" src="https://en.wikipedia.org/wiki/Kernel_(statistics)#Kernel_functions_in_common_use"> </iframe>

KDE - Example uses




KDE - The importance of bandwidth

bandwidth


Bandwidth Selection - The plugin method

We might want to minimize the "squared error" - how close is $\hat{f_h}$ to $f$?

This is called the mean integrated squared error: $$ISE(h; \{x_i\}) = \int (\hat{f}_h(x;\{x_i\}) - f(x))^2 \, dx$$ $$MISE(h) = \mathbb{E} \, ISE(h; \{x_i\})$$

Warning: $f$ is unknown!
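For reference, the MISE admits the standard bias-variance decomposition, which is what the plugin method works with (stated here without proof):

$$MISE(h) = \int \mathrm{Bias}\big[\hat{f}_h(x)\big]^2 \, dx + \int \mathrm{Var}\big[\hat{f}_h(x)\big] \, dx,$$

where $\mathrm{Bias}[\hat{f}_h(x)] = \mathbb{E}\hat{f}_h(x) - f(x)$.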


Bandwidth Selection - The plugin method

Sketch

  1. Choose a pilot distribution for $f$, e.g. a quick KDE
  2. Taylor expand $MISE$
  3. Solve simpler optimization problem (polynomial)

This gives you a concrete formula!

Warning: This depends heavily on the choice of pilot distribution
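A well-known instance of this recipe is Silverman's rule of thumb, which uses a normal pilot and yields $h = 0.9 \min(\hat\sigma, IQR/1.34)\, n^{-1/5}$. A minimal sketch (the quantile helper is a simple linear-interpolation version, for illustration only):

```python
import math

def silverman_bandwidth(xs):
    """Silverman's rule of thumb: h = 0.9 * min(std, IQR/1.34) * n^(-1/5)."""
    n = len(xs)
    mu = sum(xs) / n
    std = math.sqrt(sum((x - mu) ** 2 for x in xs) / (n - 1))
    s = sorted(xs)

    def quantile(q):
        # linear interpolation between the two nearest order statistics
        pos = q * (n - 1)
        i = int(pos)
        frac = pos - i
        return s[i] if i + 1 >= n else s[i] * (1 - frac) + s[i + 1] * frac

    iqr = quantile(0.75) - quantile(0.25)
    return 0.9 * min(std, iqr / 1.34) * n ** (-0.2)
```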


Bandwidth Selection - Cross Validation

Sketch of algorithm

  1. Split examples into train $T$ and validation $V$
  2. Estimate $\hat{f_h}$ on the train set for a number of $h$'s.
  3. Compute likelihood of test for this model: $$L=\prod_{i\in V} \hat{f_h}(x_i ; \{x_j\}_{j\in T})$$
  4. Choose the $h$ with the highest likelihood

You might want to use leave-one-out or $K$-fold cross-validation. Note that the likelihood score is very sensitive to outliers.
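The steps above, specialized to leave-one-out validation, can be sketched as follows (reusing the `kde` function from earlier; all names are illustrative):

```python
import math

def gaussian_kernel(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def kde(x, samples, h):
    return sum(gaussian_kernel((x - xi) / h) for xi in samples) / (len(samples) * h)

def loo_log_likelihood(xs, h):
    """Score each x_i under the KDE built from all *other* points.
    Working in log-space avoids numerical underflow of the product."""
    return sum(math.log(kde(x, xs[:i] + xs[i + 1:], h))
               for i, x in enumerate(xs))

def select_bandwidth(xs, candidates):
    """Choose the candidate h with the highest leave-one-out likelihood."""
    return max(candidates, key=lambda h: loo_log_likelihood(xs, h))
```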


Tell me more!

There is so much more to say and learn!

  • How does this work in more dimensions?
  • How do you use this for outlier detection?
  • What are the confidence bounds?
  • ...

References
