Here is where I will put my notes as I develop.
I did some vertical prototyping with rails. I haven't used rails before, so I'm slowly getting the hang of the framework.
I spent about four hours last night getting user authentication working. Users can now submit items. I also got the basics of voting done.
This morning I looked into building an item hierarchy so users can comment on submissions.
I'm also slowly getting the design down, although it's still pretty ugly.
I've also been thinking about how to abstract out the list view to do bandit testing on all of the different sort options.
Between last night and this morning I did about eight hours of work.
Spent about an hour refactoring the template files and making sure that pages had the correct data. I still need to make sure only the right people can delete and edit their posts.
Researched validations in rails and made some more UI tweaks.
Met and talked with Clements. We discussed the need to be able to generate data over time, a time model for our site if you will. If we had 1000 users, each submitting 5 links a day, each with 10 interests and voting on a selection (10) of the links in their interests, we would be in good shape.
Things to research:
- FFI - A method of calling libraries from other languages
- LDA - A possible algorithm based on vectors of interest.
- Vowpal Wabbit - An implementation of LDA
To have done by next meeting:
- A system generating the above-mentioned data, and possibly some look into the algorithms.
I also spent some time cleaning up the UI some more. Still needs a lot of work.
Wow, I really should have worked on this last week... Oh well. I didn't get much done today besides reading about how testing works in rails and learning about the Timecop gem.
Met with Clements. This week's goals: get the users voting (by basically having the model "cheat") and get Vowpal Wabbit inside a ruby gem.
rake data:model now generates users in a valid way.
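The generation target from the meeting notes (1000 users, 5 links a day, 10 interests, 10 votes each) can be sketched in plain Ruby. The model names and the interest pool here are hypothetical stand-ins for the real Rails models, and votes are sampled from all links rather than restricted to the voter's interests:

```ruby
# Hypothetical in-memory sketch of the data the rake task generates.
User = Struct.new(:id, :interests, :submissions, :votes)

INTEREST_POOL = (1..50).to_a # hypothetical pool of interest tags

def generate_users(count: 1000, links_per_day: 5, interests_per_user: 10, votes_per_user: 10)
  users = count.times.map do |i|
    User.new(i,
             INTEREST_POOL.sample(interests_per_user),
             links_per_day.times.map { |j| "link-#{i}-#{j}" },
             [])
  end
  # Simplification: votes are drawn from all submitted links, not just
  # links matching the voter's interests.
  all_links = users.flat_map(&:submissions)
  users.each { |u| u.votes = all_links.sample(votes_per_user) }
  users
end
```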
- https://github.com/ealdent/lda-ruby
- https://github.com/JohnLangford/vowpal_wabbit
- http://rb-gsl.rubyforge.org/
- http://classifier.rubyforge.org/
More work on the data generation.
More work on the data generation.
Got data generation finished.
Spent a few hours contemplating design changes.
- Maybe I should make comments their own class?
- I feel like I am over complicating things.
- What's the best way to do trees in ruby?
- and represent that in the db?
- Look at Reddit and HackerNews, how should I style the page?
- http://cl.ly/6Mih and http://cl.ly/6Lbs
- What are their problems, how can I improve on them?
- Need some color
Problems with moving comments:
- Votes: I need to make votes "smarter", aka more complicated.
I got vowpal wabbit compiled, but I can't figure out the input data it wants.
Met with Clements yesterday; basically we need a clustering algorithm and a distance metric. Here are some things I am reading today at SHDH.
- http://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm
- Winners of netflix comp: http://www2.research.att.com/~volinsky/papers/ieeecomputer.pdf
- http://en.wikipedia.org/wiki/K-means_clustering
- http://en.wikipedia.org/wiki/Cluster-weighted_modeling
I really need to stop putting such huge breaks between my work.
- asymmetric distances are probably what we will be using, aka the distance from A to B is not necessarily the distance from B to A.
- read http://www.devarticles.com/c/a/Ruby-on-Rails/Calculating-Statistics-with-Active-Record/3/, not really that useful.
- there is a really poorly documented ruby gem called hierclust, http://hierclust.rubyforge.org/
- there is also a decent looking implementation of k-means clustering https://github.com/reddavis/K-Means
- "A simple measure is Manhattan distance, equal to the sum of absolute differences for each variable."
- What this means for us:
- If we both voted for it, 0
- If you submitted it, and I voted for it, 1
- If I voted, but you did not, -1
- Hmm, may not be a good idea, because it is user centric. It would make every item have distances based on users.
- Maybe we want two clusters, one of items and one of users.
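The Manhattan idea above can be sketched in a few lines of Ruby, representing each user as a hash of item_id => score. The 2/1/0 scoring scheme here (submitted / voted / no interaction) is illustrative, not settled:

```ruby
# Manhattan distance between two users' vote vectors.
# Each user is a Hash of item_id => score; missing items count as 0.
def manhattan_distance(a, b)
  (a.keys | b.keys).sum { |item| (a.fetch(item, 0) - b.fetch(item, 0)).abs }
end

alice = { 1 => 2, 2 => 1, 3 => 1 } # submitted item 1, voted on 2 and 3
bob   = { 1 => 1, 2 => 1, 4 => 1 } # voted on items 1, 2, and 4
manhattan_distance(alice, bob) # items 1, 3, 4 each differ by 1 => 3
```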
- Was reading this http://biocomp.bioen.uiuc.edu/oscar/tools/Hierarchical_Clustering.html. It's an old copy of http://en.wikipedia.org/wiki/Cluster_analysis
- Apparently I need to store my clusters as trees
- http://en.wikipedia.org/wiki/Dendrogram - I like the idea, but I don't know if I can break up my data like this.
- Oh wait, this is only if I'm doing hierarchical clustering.
- http://en.wikipedia.org/wiki/Hierarchical_clustering
- fuzzy c-means clustering sounds useful, but requires me to know how many clusters I want.
- the same is true with k-means. c-means lets items belong to multiple clusters though.
- Other links I was reading:
- http://en.wikipedia.org/wiki/Unsupervised_learning
- http://en.wikipedia.org/wiki/Machine_learning
- http://en.wikipedia.org/wiki/Recommender_systems
- http://en.wikipedia.org/wiki/Kernel_principal_component_analysis
- http://en.wikipedia.org/wiki/Formal_concept_analysis
- http://en.wikipedia.org/wiki/Bipartite_graph
- Pearson Correlation?
- http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient
- Just a distance metric, we use manhattan instead.
- http://en.wikipedia.org/wiki/Metric_(mathematics)#Examples
- For now, I'm going to try building a system using item-based filtering. The algorithm I found is what delicious used to use to recommend links. If this doesn't provide good results, I'll do more research into clustering.
Alright, time to implement clustering. From more of my reading, it seems that collaborative filtering is just a type of clustering. The main thing is to make sure we implement an unsupervised learning algorithm.
- The distance is currently the number of voters we share.
- We need to make it so we can select items sorted by date and distance
- so add a join table that stores two item ids, their distance, and the date last computed. Cache for 15 minutes.
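A plain-Ruby sketch of that join-table idea, with the distance defined as the number of shared voters and a 15-minute expiry. The class and method names are hypothetical stand-ins for the eventual Rails models:

```ruby
# Cached pairwise item distances; entries expire after 15 minutes.
CACHE_TTL = 15 * 60 # seconds

DistanceEntry = Struct.new(:distance, :computed_at)

class DistanceCache
  def initialize
    @entries = {} # sorted [item_a, item_b] pair => DistanceEntry
  end

  # voters: Hash of item_id => array of user_ids who voted on it.
  # Distance = number of voters the two items share.
  def distance(item_a, item_b, voters)
    key = [item_a, item_b].sort
    entry = @entries[key]
    if entry.nil? || Time.now - entry.computed_at > CACHE_TTL
      shared = (voters[item_a] & voters[item_b]).size
      entry = @entries[key] = DistanceEntry.new(shared, Time.now)
    end
    entry.distance
  end
end
```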
I talked to Clements today. I'm an idiot.
Each point in our graph isn't a one-dimensional point, but rather n-dimensional. I then take the Euclidean distance (the norm of the difference between the two column vectors) and use that as the distance metric.
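As a quick sketch, the Euclidean distance between two n-dimensional vectors in Ruby:

```ruby
# Euclidean distance: square root of the summed squared
# component-wise differences of the two vectors.
def euclidean_distance(a, b)
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

euclidean_distance([1, 0, 1], [0, 0, 1]) # => 1.0
```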
Got the basics of k-means clustering working. The only issue I'm still having is figuring out how to pick the item closest to the centroid.
Met with Clements. He suggested storing the vector in the db instead of associating it with a point. I would do this by storing rows with user_id, cluster_id, and the mean for that cluster, then compute based off of that.
What I am currently doing is apparently called k-medoids.
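The "item closest to the centroid" step that k-medoids calls for can be sketched like this (helper names are mine, not from the codebase):

```ruby
def euclidean(a, b)
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

# Component-wise mean of a cluster's points.
def centroid(points)
  dims = points.first.size
  (0...dims).map { |i| points.sum { |p| p[i] } / points.size.to_f }
end

# The medoid is the cluster member nearest the centroid.
def medoid(points)
  c = centroid(points)
  points.min_by { |p| euclidean(p, c) }
end

medoid([[0, 0], [1, 1], [10, 10]]) # => [1, 1]
```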
To finish the project, I need to create a page that shows the reasonable recommendations the system is giving me. I am going to mail it to him, and then if there is anything he does not understand, we will talk.
Finishing up code based on yesterday's notes. Need to test and write the proof.