This plot shows 1000 different beers mapped to 2d with PCA:
No surprise: most beers in this database are rated pretty high (red and yellow), or pretty low (blue, purple).
The steps:
- get a csv of beers from http://docs.yhathq.com/scienceops/deploying-models/examples/python/deploy-a-beer-recommender.html . Each beer has 5 scores -- overall, aroma, appearance, palate, taste -- from many reviewers. (85 % of all scores are 3, 3.5, 4 or 4.5; scores are quantized.)
- take German Pilsener only: 22155 rows
- average each beer's scores over all reviewers -> 1373 beers x 5 scores each. Take the best 1000.
- compute cosine distances, 1000 x 1000. (Don't forget to centre the data first.)
  cosdist centred (1000, 5): quantiles: 0 0.16 0.42 1 1.6 1.8 2
- run PCA -> 1000 x 2 vecs
  PCA eigenvalues %: [ 50 79 86 92 97 100 ]
  PCA variance %:    [ 72 97 98 99 100 100 ]
  cosdist centred (1000, 2): quantiles: 0 7.9e-06 7.6e-05 0.99 2 2 2
- plot.
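The steps above can be sketched in a few lines of numpy. The beer csv itself isn't reproduced here, so the sketch uses synthetic quantized scores as a stand-in; the centring, cosine-distance and PCA steps are the same as in the pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for the averaged beer scores: 1000 beers x 5 scores
# (overall, aroma, appearance, palate, taste), quantized to half points.
scores = rng.choice([3.0, 3.5, 4.0, 4.5], size=(1000, 5))

# Centre the data (subtract column means) before cosine distance and PCA.
X = scores - scores.mean(axis=0)

# Cosine distance matrix, 1000 x 1000: 1 - cos(angle between rows), in [0, 2].
U = X / np.linalg.norm(X, axis=1, keepdims=True)
cosdist = 1 - U @ U.T

# PCA via SVD of the centred data -> 1000 x 2 coordinates.
_, s, Vt = np.linalg.svd(X, full_matrices=False)
pca2 = X @ Vt[:2].T

print(cosdist.shape, pca2.shape)
```

Since the rows are already centred, PCA here is just an SVD; the first two right singular vectors give the 2-D projection.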
Bottom line: only 145 of the 1000 points have the same nearest neighbor in the 2-D PCA space as with all 5 scores.
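A figure like "145 of 1000" can be checked by comparing each point's nearest neighbor before and after projection. A sketch on random data (the real scores would be substituted in):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))
X -= X.mean(axis=0)                      # centred 5-score data

# PCA to 2-D via SVD
_, _, Vt = np.linalg.svd(X, full_matrices=False)
Y = X @ Vt[:2].T

def nearest_neighbor(Z):
    # index of each point's nearest *other* point, by cosine distance
    U = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    D = 1 - U @ U.T
    np.fill_diagonal(D, np.inf)          # exclude self-matches
    return D.argmin(axis=1)

same = int((nearest_neighbor(X) == nearest_neighbor(Y)).sum())
print(same, "of", len(X), "points keep the same nearest neighbor")
```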
Comments
This beer database is not a good example, because the scores are all close together (3 .. 4.5) and (I believe) the different scores are highly correlated. (Other scores that would be useful: pH, sweetness, and how many beers the reviewer had already tasted.)
Some reviewers tend to give high scores, some tend to give low ones. How should one correct for this?
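One common correction (not from the post -- just a possible answer to the question) is to subtract each reviewer's own mean score before averaging per beer, so a generous reviewer and a harsh one are put on the same footing. A sketch with a hypothetical long-format table (column names assumed):

```python
import pandas as pd

# Hypothetical review table: one row per (reviewer, beer) pair.
df = pd.DataFrame({
    "reviewer": ["a", "a", "a", "b", "b", "b"],
    "beer":     ["X", "Y", "Z", "X", "Y", "Z"],
    "score":    [4.5, 4.0, 4.5, 3.0, 2.5, 3.5],
})

# Subtract each reviewer's mean score, then average per beer.
df["adj"] = df["score"] - df.groupby("reviewer")["score"].transform("mean")
per_beer = df.groupby("beer")["adj"].mean()
print(per_beer)
```

One could go further and divide by each reviewer's standard deviation (z-scoring), which also evens out reviewers who use a wide range of scores versus those who stay near one value.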
For nearest-neighbor search in dimension <= 20 or so, KD-trees are easy to understand, and make it easy to weight features.
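A minimal KD-tree sketch with scipy, where "weighting features" just means scaling each column before building the tree (the weights below are illustrative, not from the post):

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 5))          # 1000 points, 5 features

# Weight features by scaling columns, then use plain Euclidean distance.
w = np.array([2.0, 1.0, 1.0, 1.0, 0.5])  # illustrative weights
tree = cKDTree(X * w)

# 2 nearest neighbors of each point (the first match is the point itself).
dist, idx = tree.query(X * w, k=2)
print(idx[:5, 1])                       # nearest-neighbor indices, first 5 points
```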
See also:
Dimensionality_reduction
http://google.com/search?q=nearest+neighbor+site:stackexchange.com
Comments are welcome; test cases for high-d -> 2d most welcome.
cheers
-- denis
Last change: 2016-02-25