@denis-bz
Created February 25, 2016 16:46
1000 beers with PCA

This plot shows 1000 different beers mapped to 2d with PCA:

[Figure: pils-scores-pca]

No surprise: most beers in this database are rated pretty high (red and yellow), or pretty low (blue, purple).

The steps:

  1. get a csv of beers under http://docs.yhathq.com/scienceops/deploying-models/examples/python/deploy-a-beer-recommender.html . Each beer has 5 scores, for
    overall, aroma, appearance, palate, taste
    from many reviewers. (85 % of all scores are 3, 3.5, 4 or 4.5 -- scores are quantized.)

  2. get German Pilsener only, 22155 rows

  3. average each beer's scores over all reviewers -> 1373 beers x 5 scores each. Take the best 1000.

  4. compute cosine distances, 1000 x 1000. (Don't forget to centre the data first.)

    cosdist centred (1000, 5): quantiles: 0 0.16 0.42 1 1.6 1.8 2

  5. run PCA -> 1000 x 2 vecs

    PCA eigenvalues %: [ 50 79 86 92 97 100 ]
    PCA variance %: [ 72 97 98 99 100 100 ]
    cosdist centred (1000, 2): quantiles: 0 7.9e-06 7.6e-05 0.99 2 2 2

  6. plot.
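
Steps 4 and 5 can be sketched in NumPy roughly as follows. This is not the script behind the numbers above: the data is a synthetic stand-in for the 1000 x 5 table of averaged scores, the function names are made up for illustration, PCA is done with a plain SVD, and the quantile levels printed are a guess.

```python
import numpy as np

def centre(X):
    """Subtract the column means, so every score has mean 0."""
    return X - X.mean(axis=0)

def cosine_distances(X):
    """Pairwise cosine distance 1 - cos(angle) between rows, in [0, 2]."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    U = X / np.where(norms == 0, 1.0, norms)   # guard against all-zero rows
    return 1.0 - np.clip(U @ U.T, -1.0, 1.0)

def pca(X, ndim=2):
    """Project centred X onto its top `ndim` principal axes, via SVD."""
    _, s, Vt = np.linalg.svd(X, full_matrices=False)
    variance_pct = 100 * np.cumsum(s**2) / np.sum(s**2)   # cumulative
    return X @ Vt[:ndim].T, variance_pct

# Synthetic stand-in for the averaged scores (assumption, not the real data):
# one shared "quality" per beer plus a little noise per score.
rng = np.random.default_rng(0)
quality = rng.normal(3.8, 0.4, size=(1000, 1))
scores = np.clip(quality + rng.normal(0, 0.15, size=(1000, 5)), 1, 5)

X = centre(scores)
D5 = cosine_distances(X)            # 1000 x 1000, all 5 scores
Y, variance_pct = pca(X, ndim=2)    # 1000 x 2
D2 = cosine_distances(Y)            # 1000 x 1000, after PCA

q = [0, 0.1, 0.25, 0.5, 0.75, 0.9, 1]   # quantile levels are an assumption
print("PCA variance %:", np.round(variance_pct).astype(int))
print("cosdist centred", X.shape, "quantiles:", np.quantile(D5, q).round(2))
print("cosdist centred", Y.shape, "quantiles:", np.quantile(D2, q).round(2))
```

Because `X` is centred, the projected `Y` is centred too, so the same `cosine_distances` can be applied before and after PCA.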


Bottom line: 145 of 1000 points have the same nearest neighbor in the 2d PCA projection as with all 5 scores.
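
A count like this can be computed along the following lines. This is a minimal sketch on random data, assuming cosine distance in both spaces; `nearest_neighbour` and `nn_agreement` are hypothetical helper names, not from the original script.

```python
import numpy as np

def cosine_distances(X):
    """Pairwise cosine distance between the rows of X."""
    U = X / np.linalg.norm(X, axis=1, keepdims=True)
    return 1.0 - U @ U.T

def nearest_neighbour(D):
    """Index of each point's nearest *other* point, from a square distance matrix."""
    D = D.copy()
    np.fill_diagonal(D, np.inf)     # a point is not its own neighbour
    return D.argmin(axis=1)

def nn_agreement(X_full, X_low):
    """Number of points with the same nearest neighbour in both spaces."""
    nn_full = nearest_neighbour(cosine_distances(X_full))
    nn_low = nearest_neighbour(cosine_distances(X_low))
    return int(np.sum(nn_full == nn_low))

# Random 5-d test data, centred, then projected to 2-d with PCA (via SVD)
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
X -= X.mean(axis=0)
_, _, Vt = np.linalg.svd(X, full_matrices=False)
Y = X @ Vt[:2].T

print(nn_agreement(X, Y), "of", len(X), "points keep their nearest neighbour")
```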

Comments

This beer database is not a good example, because the scores are all close, 3 .. 4.5, and (I believe) the different scores correlate highly. (Other useful features would be: pH, sweetness, and a count of which beer this was for the taster, the Nth beer tasted.)

Some reviewers tend to give high scores, some tend to give low ones. How should one correct for this?
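
One simple possibility, offered only as a sketch and not as the answer: shift each reviewer's scores so that every reviewer has the same mean, then average per beer as in step 3. The toy reviews and the helper names below are made up for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy reviews: (reviewer, beer, score).
# "ann" scores high, "bob" scores low, and bob never rated beer C.
reviews = [
    ("ann", "A", 4.5), ("ann", "B", 4.0), ("ann", "C", 5.0),
    ("bob", "A", 3.0), ("bob", "B", 2.5),
]

def adjust_for_reviewer(reviews):
    """Shift each score by (global mean - that reviewer's own mean)."""
    by_reviewer = defaultdict(list)
    for reviewer, _, score in reviews:
        by_reviewer[reviewer].append(score)
    global_mean = mean(score for _, _, score in reviews)
    offset = {r: global_mean - mean(s) for r, s in by_reviewer.items()}
    return [(r, beer, score + offset[r]) for r, beer, score in reviews]

def average_per_beer(reviews):
    """Mean score of each beer over all its reviewers."""
    by_beer = defaultdict(list)
    for _, beer, score in reviews:
        by_beer[beer].append(score)
    return {beer: mean(s) for beer, s in by_beer.items()}

raw = average_per_beer(reviews)
adjusted = average_per_beer(adjust_for_reviewer(reviews))
for beer in sorted(raw):
    print(beer, "raw", round(raw[beer], 3), "adjusted", round(adjusted[beer], 3))
```

In this toy case beer C looks far ahead on raw averages only because its single review came from the generous reviewer; after the shift the gap shrinks. Whether a plain mean shift is enough (versus also rescaling each reviewer's spread) is an open choice.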

For nearest neighbor search in dim <= 20 or so, KD-trees are easy to understand, and make it easy to weight features.
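
A minimal sketch of weighted nearest-neighbor search with `scipy.spatial.cKDTree`, on random data. The weights are arbitrary, chosen only to show the idea: scaling each column by a weight before building the tree gives a weighted Euclidean distance.

```python
import numpy as np
from scipy.spatial import cKDTree

# Random stand-in for 1000 beers x 5 scores (not the real data)
rng = np.random.default_rng(2)
scores = rng.uniform(3.0, 4.5, size=(1000, 5))

# Hypothetical feature weights: a bigger weight makes that score count more
weights = np.array([2.0, 1.0, 0.5, 1.0, 2.0])
scaled = scores * weights

tree = cKDTree(scaled)

# k=2 because each point's closest match in its own tree is itself
dist, idx = tree.query(scaled, k=2)
nearest = idx[:, 1]          # nearest *other* beer for every beer
nearest_dist = dist[:, 1]

print("beer 0 is closest to beer", nearest[0], "at distance", round(nearest_dist[0], 3))
```

KD-trees use Euclidean-type distances, not cosine. If cosine neighbors are wanted, one option is to scale each row to unit length first: for unit vectors the squared Euclidean distance is twice the cosine distance, so the neighbor ranking is the same.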

See also:
Dimensionality_reduction
http://google.com/search?q=nearest+neighbor+site:stackexchange.com

Comments are welcome; test cases for high-d -> 2d most welcome.

cheers
-- denis

Last change: 2016-02-25