This plot shows 1000 different beers mapped to 2d with PCA:
No surprise: most beers in this database are rated pretty high (red and yellow), or pretty low (blue, purple).
The steps:
- get a csv of beers from http://docs.yhathq.com/scienceops/deploying-models/examples/python/deploy-a-beer-recommender.html . Each beer has 5 scores -- overall, aroma, appearance, palate, taste -- from many reviewers. (85 % of all scores are 3, 3.5, 4 or 4.5; scores are quantized.)
- take German Pilsener only: 22155 rows
- average each beer's scores over all reviewers -> 1373 beers x 5 scores each. Take the best 1000.
- compute cosine distances, 1000 x 1000. (Don't forget to centre the data first.)
  cosdist centred (1000, 5): quantiles: 0 0.16 0.42 1 1.6 1.8 2
- run PCA -> 1000 x 2 vecs
  PCA eigenvalues %: [ 50 79 86 92 97 100 ]
  PCA variance %:    [ 72 97 98 99 100 100 ]
  cosdist centred (1000, 2): quantiles: 0 7.9e-06 7.6e-05 0.99 2 2 2
- plot.
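The steps above can be sketched in a few lines of numpy. The beer csv itself isn't reproduced here, so the sketch uses synthetic quantized scores as a stand-in; the centring, cosine-distance and PCA steps are the same as in the pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for the averaged beer scores: 1000 beers x 5 scores
# (overall, aroma, appearance, palate, taste), quantized to half points.
scores = rng.choice([3.0, 3.5, 4.0, 4.5], size=(1000, 5))

# Centre the data (subtract column means) before cosine distance and PCA.
X = scores - scores.mean(axis=0)

# Cosine distance matrix, 1000 x 1000: 1 - cos(angle between rows), in [0, 2].
U = X / np.linalg.norm(X, axis=1, keepdims=True)
cosdist = 1 - U @ U.T

# PCA via SVD of the centred data -> 1000 x 2 coordinates.
_, s, Vt = np.linalg.svd(X, full_matrices=False)
pca2 = X @ Vt[:2].T

print(cosdist.shape, pca2.shape)
```

Since the rows are already centred, PCA here is just an SVD; the first two right singular vectors give the 2-D projection.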
Bottom line: only 145 of the 1000 points have the same nearest neighbor in the 2-D PCA space as with all 5 scores.
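A figure like "145 of 1000" can be checked by comparing each point's nearest neighbor before and after projection. A sketch on random data (the real scores would be substituted in):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))
X -= X.mean(axis=0)                      # centred 5-score data

# PCA to 2-D via SVD
_, _, Vt = np.linalg.svd(X, full_matrices=False)
Y = X @ Vt[:2].T

def nearest_neighbor(Z):
    # index of each point's nearest *other* point, by cosine distance
    U = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    D = 1 - U @ U.T
    np.fill_diagonal(D, np.inf)          # exclude self-matches
    return D.argmin(axis=1)

same = int((nearest_neighbor(X) == nearest_neighbor(Y)).sum())
print(same, "of", len(X), "points keep the same nearest neighbor")
```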
Comments
This beer database is not a good example, because the scores are all close together (3 .. 4.5) and (I believe) the different scores are highly correlated. (Other scores that would be useful: pH, sweetness, and how many beers the reviewer had already tasted.)
Some reviewers tend to give high scores, some tend to give low ones. How should one correct for this?
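One common correction (not from the post -- just a possible answer to the question) is to subtract each reviewer's own mean score before averaging per beer, so a generous reviewer and a harsh one are put on the same footing. A sketch with a hypothetical long-format table (column names assumed):

```python
import pandas as pd

# Hypothetical review table: one row per (reviewer, beer) pair.
df = pd.DataFrame({
    "reviewer": ["a", "a", "a", "b", "b", "b"],
    "beer":     ["X", "Y", "Z", "X", "Y", "Z"],
    "score":    [4.5, 4.0, 4.5, 3.0, 2.5, 3.5],
})

# Subtract each reviewer's mean score, then average per beer.
df["adj"] = df["score"] - df.groupby("reviewer")["score"].transform("mean")
per_beer = df.groupby("beer")["adj"].mean()
print(per_beer)
```

One could go further and divide by each reviewer's standard deviation (z-scoring), which also evens out reviewers who use a wide range of scores versus those who stay near one value.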
For nearest-neighbor search in dimension <= 20 or so, KD-trees are easy to understand, and make it easy to weight features.
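A minimal KD-tree sketch with scipy, where "weighting features" just means scaling each column before building the tree (the weights below are illustrative, not from the post):

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 5))          # 1000 points, 5 features

# Weight features by scaling columns, then use plain Euclidean distance.
w = np.array([2.0, 1.0, 1.0, 1.0, 0.5])  # illustrative weights
tree = cKDTree(X * w)

# 2 nearest neighbors of each point (the first match is the point itself).
dist, idx = tree.query(X * w, k=2)
print(idx[:5, 1])                       # nearest-neighbor indices, first 5 points
```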
See also:
Dimensionality_reduction
http://google.com/search?q=nearest+neighbor+site:stackexchange.com
Comments are welcome; test cases for high-d -> 2d most welcome.
cheers
-- denis
Last change: 2016-02-25