Here is where I will put my notes as I develop.
I did some vertical prototyping with rails. I haven't used rails before, so I'm slowly getting the hang of the framework.
I spent about four hours last night getting user authentication working. Users can now submit items. I also got the basics of voting done.
This morning I looked into building an item hierarchy so users can comment on submissions.
I'm also slowly getting the design down, although it's still pretty ugly.
I've also been thinking about how to abstract out the list view to do bandit testing on all of the different sort options.
Between last night and this morning I did about eight hours of work.
Spent about an hour refactoring the template files and making sure that pages had the correct data. I still need to make sure only the right people can delete and edit their posts.
Researched validations in rails and made some more UI tweaks.
Met and talked with Clements. We discussed the need to be able to generate data over time, a time model for our site if you will. If we had 1000 users, each submitting 5 links a day, each with 10 interests and voting on a selection (10) of the links in their interests, we would be in good shape.
Things to research:
- FFI - A method of calling libraries from other languages
- LDA - A possible algorithm based on vectors of interest.
- Vowpal Wabbit - An implementation of LDA
To have done by next meeting:
- A system generating the above-mentioned data, and possibly some look into the algorithms.
I also spent some time cleaning up the UI some more. Still needs a lot of work.
Wow, I really should have worked on this last week... Oh well. I didn't get much done today besides reading about how testing works in rails and learning about the Timecop gem.
Met with Clements. This week's goals: get the users voting (by basically having the model "cheat") and get Vowpal Wabbit inside a ruby gem.
rake data:model now generates users in a valid way.
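The generation target from the meeting notes (1000 users, 5 links a day, 10 interests, 10 votes each) can be sketched in plain Ruby. The model names and the interest pool here are hypothetical stand-ins for the real Rails models, and votes are sampled from all links rather than restricted to the voter's interests:

```ruby
# Hypothetical in-memory sketch of the data the rake task generates.
User = Struct.new(:id, :interests, :submissions, :votes)

INTEREST_POOL = (1..50).to_a # hypothetical pool of interest tags

def generate_users(count: 1000, links_per_day: 5, interests_per_user: 10, votes_per_user: 10)
  users = count.times.map do |i|
    User.new(i,
             INTEREST_POOL.sample(interests_per_user),
             links_per_day.times.map { |j| "link-#{i}-#{j}" },
             [])
  end
  # Simplification: votes are drawn from all submitted links, not just
  # links matching the voter's interests.
  all_links = users.flat_map(&:submissions)
  users.each { |u| u.votes = all_links.sample(votes_per_user) }
  users
end
```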
- https://github.com/ealdent/lda-ruby
- https://github.com/JohnLangford/vowpal_wabbit
- http://rb-gsl.rubyforge.org/
- http://classifier.rubyforge.org/
More work on the data generation.
More work on the data generation.
Got data generation finished.
Spent a few hours contemplating design changes.
- Maybe I should make comments their own class?
- I feel like I am over complicating things.
- What's the best way to do trees in ruby?
- and represent that in the db?
- Look at Reddit and HackerNews, how should I style the page?
- http://cl.ly/6Mih and http://cl.ly/6Lbs
- What are their problems, how can I improve on them?
- Need some color
Problems with moving comments:
- Votes: I need to make votes "smarter", aka more complicated.
I got vowpal wabbit compiled, but I can't figure out the input data it wants.
Met with Clements yesterday; basically we need a clustering algorithm and a distance metric. Here are some things I am reading today at SHDH.
- http://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm
- Winners of netflix comp: http://www2.research.att.com/~volinsky/papers/ieeecomputer.pdf
- http://en.wikipedia.org/wiki/K-means_clustering
- http://en.wikipedia.org/wiki/Cluster-weighted_modeling
I really need to stop putting such huge breaks between my work.
- asymmetric distances are probably what we will be using, aka the distance from A to B is not necessarily the distance from B to A.
- read http://www.devarticles.com/c/a/Ruby-on-Rails/Calculating-Statistics-with-Active-Record/3/, not really that useful.
- there is a really poorly documented ruby gem called hierclust, http://hierclust.rubyforge.org/
- there is also a decent looking implementation of k-means clustering https://github.com/reddavis/K-Means
- "A simple measure is Manhattan distance, equal to the sum of absolute differences for each variable."
- What this means for us:
- If we both voted for it, 0
- If you submitted it, and I voted for it, 1
- If I voted, but you did not, -1
- Hmm, may not be a good idea, because it is user centric. It would make every item have distances based on users.
- Maybe we want two clusters, one of items and one of users.
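The Manhattan idea above can be sketched in a few lines of Ruby, representing each user as a hash of item_id => score. The 2/1/0 scoring scheme here (submitted / voted / no interaction) is illustrative, not settled:

```ruby
# Manhattan distance between two users' vote vectors.
# Each user is a Hash of item_id => score; missing items count as 0.
def manhattan_distance(a, b)
  (a.keys | b.keys).sum { |item| (a.fetch(item, 0) - b.fetch(item, 0)).abs }
end

alice = { 1 => 2, 2 => 1, 3 => 1 } # submitted item 1, voted on 2 and 3
bob   = { 1 => 1, 2 => 1, 4 => 1 } # voted on items 1, 2, and 4
manhattan_distance(alice, bob) # items 1, 3, 4 each differ by 1 => 3
```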
- Was reading this http://biocomp.bioen.uiuc.edu/oscar/tools/Hierarchical_Clustering.html. It's an old copy of http://en.wikipedia.org/wiki/Cluster_analysis
- Apparently I need to store my clusters as trees
- http://en.wikipedia.org/wiki/Dendrogram - I like the idea, but I don't know if I can break up my data like this.
- Oh wait, this is only if I'm doing hierarchical clustering.
- http://en.wikipedia.org/wiki/Hierarchical_clustering
- fuzzy c-means clustering sounds useful, but requires me to know how many clusters I want.
- the same is true with k-means. c-means lets items belong to multiple clusters though.
- Other links I was reading:
- http://en.wikipedia.org/wiki/Unsupervised_learning
- http://en.wikipedia.org/wiki/Machine_learning
- http://en.wikipedia.org/wiki/Recommender_systems
- http://en.wikipedia.org/wiki/Kernel_principal_component_analysis
- http://en.wikipedia.org/wiki/Formal_concept_analysis
- http://en.wikipedia.org/wiki/Bipartite_graph
- Pearson Correlation?
- http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient
- Just a distance metric, we use manhattan instead.
- http://en.wikipedia.org/wiki/Metric_(mathematics)#Examples
- For now, I'm going to try building a system using item-based filtering. The algorithm I found is what delicious used to use to recommend links. If this doesn't provide good results, I'll do more research into clustering.
Alright, time to implement clustering. From more of my reading, it seems that collaborative filtering is just a type of clustering. The main thing is to make sure we implement an unsupervised learning algorithm.
- The distance is currently the number of voters we share.
- We need to make it so we can select items sorted by date and distance
- so add a join table that stores two item ids, their distance, and the date last computed. Cache for 15 minutes.
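A plain-Ruby sketch of that join-table idea, with the distance defined as the number of shared voters and a 15-minute expiry. The class and method names are hypothetical stand-ins for the eventual Rails models:

```ruby
# Cached pairwise item distances; entries expire after 15 minutes.
CACHE_TTL = 15 * 60 # seconds

DistanceEntry = Struct.new(:distance, :computed_at)

class DistanceCache
  def initialize
    @entries = {} # sorted [item_a, item_b] pair => DistanceEntry
  end

  # voters: Hash of item_id => array of user_ids who voted on it.
  # Distance = number of voters the two items share.
  def distance(item_a, item_b, voters)
    key = [item_a, item_b].sort
    entry = @entries[key]
    if entry.nil? || Time.now - entry.computed_at > CACHE_TTL
      shared = (voters[item_a] & voters[item_b]).size
      entry = @entries[key] = DistanceEntry.new(shared, Time.now)
    end
    entry.distance
  end
end
```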
I talked to Clements today. I'm an idiot.
Each point in our graph isn't a one-dimensional point, but rather n-dimensional. I then take the Euclidean distance (the norm of the difference between the two column vectors) and use that as the distance metric.
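As a quick sketch, the Euclidean distance between two n-dimensional vectors in Ruby:

```ruby
# Euclidean distance: square root of the summed squared
# component-wise differences of the two vectors.
def euclidean_distance(a, b)
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

euclidean_distance([1, 0, 1], [0, 0, 1]) # => 1.0
```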
Got the basics of k-means clustering working. The only issue I'm still having is figuring out how to pick the item closest to the centroid.
Met with Clements. He suggested storing the vector in the db instead of associating it with a point. I would do this by storing rows with user_id, cluster_id, and the mean for that cluster, then compute based off of that.
What I am currently doing is apparently called k-medoids.
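The "item closest to the centroid" step that k-medoids calls for can be sketched like this (helper names are mine, not from the codebase):

```ruby
def euclidean(a, b)
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

# Component-wise mean of a cluster's points.
def centroid(points)
  dims = points.first.size
  (0...dims).map { |i| points.sum { |p| p[i] } / points.size.to_f }
end

# The medoid is the cluster member nearest the centroid.
def medoid(points)
  c = centroid(points)
  points.min_by { |p| euclidean(p, c) }
end

medoid([[0, 0], [1, 1], [10, 10]]) # => [1, 1]
```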
To finish the project, I need to create a page that shows the reasonable recommendations the system is giving me. I am going to mail it to him, and then if there is anything he does not understand, we will talk.
Finishing up code based on yesterday's notes. Need to test and write the proof.