Explanation of the clustering algorithm

Task 3: Article Clustering

The algorithm I used is essentially tf-idf (term frequency–inverse document frequency). The idea behind the algorithm is that, for each term in each document, it calculates two frequencies. One is the term frequency, which is literally the number of occurrences of the term in that specific document, ‘normalized’ by the length of the document. The second is the inverse document frequency, which reflects how rare the term is in the whole document store, namely the logarithm of the total number of documents divided by the number of documents containing the term. After a certain amount of research I concluded that this algorithm suits the task well: it can be programmed in a clean and readable way and is not overly complex.
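
Both frequencies can be sketched in a few lines of PHP. This is only an illustration of the weighting described above, not the gist’s actual code; the function names and the shape of $documents (an array of word lists, one per document) are assumptions.

```php
<?php
// Term frequency: occurrences of $term in one document,
// normalized by the document's length in words.
function termFrequency(string $term, array $documentWords): float {
    $occurrences = count(array_keys($documentWords, $term));
    return $occurrences / count($documentWords);
}

// Inverse document frequency: log of the store size divided by
// the number of documents that contain the term.
function inverseDocumentFrequency(string $term, array $documents): float {
    $containing = 0;
    foreach ($documents as $words) {
        if (in_array($term, $words, true)) {
            $containing++;
        }
    }
    return log(count($documents) / max(1, $containing));
}

// The tf-idf score is simply the product of the two.
function tfIdf(string $term, array $documentWords, array $documents): float {
    return termFrequency($term, $documentWords)
         * inverseDocumentFrequency($term, $documents);
}
```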

During the research I found two other options, which I concluded were either slightly irrelevant or too complex for the task. The first was a bag-of-words solution, which could have been combined with a Bayes classifier. Based on the articles and the analysis, this solution seems ideal for classifying documents against two predefined sets of words, deciding whether a document is ‘closer’ to bag A or bag B. In our case we actually need to ‘create’ the bags ourselves, and we likely need more than two.
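
To make the “closer to bag A or B” idea concrete, here is a toy sketch; the bags and the simple overlap-count scoring rule are hypothetical, chosen only to show why the bags must exist up front.

```php
<?php
// Score a document against two predefined word bags and pick
// the one with more overlapping words. Purely illustrative.
function closerBag(array $documentWords, array $bagA, array $bagB): string {
    $scoreA = count(array_intersect($documentWords, $bagA));
    $scoreB = count(array_intersect($documentWords, $bagB));
    return $scoreA >= $scoreB ? 'A' : 'B';
}
```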

The second approach would be a complex semantic analysis. In this case the word-paragraph connections are represented in matrices, and the words – the rows – are compared as vectors. The angle between those vectors is analyzed to deduce correlation between the words, and basic clustering algorithms can then be run with the documents and these vectors as input data.
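
The angle comparison mentioned above amounts to the cosine similarity between two row vectors. A minimal sketch, assuming each vector holds hypothetical word-by-paragraph counts of the same length:

```php
<?php
// Cosine similarity between two equal-length word vectors
// (rows of the word-by-paragraph matrix). Illustrative only.
function cosineSimilarity(array $a, array $b): float {
    $dot = 0.0; $normA = 0.0; $normB = 0.0;
    foreach ($a as $i => $value) {
        $dot   += $value * $b[$i];
        $normA += $value * $value;
        $normB += $b[$i] * $b[$i];
    }
    if ($normA == 0.0 || $normB == 0.0) {
        return 0.0;
    }
    return $dot / (sqrt($normA) * sqrt($normB));
}

// Words with similar paragraph distributions score close to 1.
echo cosineSimilarity([3, 0, 1, 2], [2, 0, 1, 3]); // ≈ 0.93
```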

The implementation

I chose basic, non-OO PHP for the implementation. At first glance it looks fairly long, but I tried to ensure readability, so each step is easy to follow. The logic gathers the title and the content section of each document and creates word bags from them. After eliminating function words and unnecessary punctuation, it calculates the tf-idf score for each word in each document and ranks the words. It then takes the top 15 words for each document and checks the intersections between documents. If there is at least one matching word, the documents simply go into the same cluster. In the end the algorithm outputs the title list for each cluster, which can also be seen in the uploaded file.
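
For illustration, the clustering step described above could look like the sketch below. The $topWords input shape (title => top-15 word list) and the greedy merge into each cluster’s accumulated word set are my assumptions, not the uploaded implementation itself.

```php
<?php
// Group documents that share at least one of their top-ranked words.
// $topWords maps each document title to its top-15 word list.
function cluster(array $topWords): array {
    $clusters = [];  // each cluster: ['titles' => [...], 'words' => [...]]
    foreach ($topWords as $title => $words) {
        $placed = false;
        foreach ($clusters as &$cluster) {
            // At least one matching top word: same cluster.
            if (count(array_intersect($words, $cluster['words'])) >= 1) {
                $cluster['titles'][] = $title;
                $cluster['words'] = array_unique(
                    array_merge($cluster['words'], $words)
                );
                $placed = true;
                break;
            }
        }
        unset($cluster);  // drop the loop reference
        if (!$placed) {
            $clusters[] = ['titles' => [$title], 'words' => $words];
        }
    }
    return array_column($clusters, 'titles');  // title list per cluster
}
```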

Room for improvement

As with (likely) every algorithm, this one could be improved as well. I would consider these the main options:

  • rewriting it in a more algorithm-friendly language, e.g. C, or using more built-in PHP functions – many of the needed array operations are already covered
  • word stemming
  • running the algorithm twice and merging the results – for example, the current run could be merged with another one where the top 20-30 words are selected from each document, but where putting two documents into the same cluster requires a two-word match between the top words
  • extending the function word list, particularly with words that have many different meanings (I added ‘user’ and ‘service’ myself, because they were too common and were causing inadequate results)
  • the title words could have more weight – normally there is a good chance that a title contains an important keyword; a minimal sketch of this follows the list
  • even if the function word list is not extended, words with many different meanings could get a lower weight
  • the occurrence searching could be powered by a proper full-text engine
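
As a sketch of the title-weighting idea referenced in the list above, the term frequency could simply be scaled up for words that also appear in the title; the function name and the factor of 2.0 below are arbitrary assumptions.

```php
<?php
// Hypothetical tweak: boost the frequency of words that also
// occur in the document's title. The 2.0 factor is an assumption.
function weightedTermFrequency(string $term, array $bodyWords, array $titleWords): float {
    $tf = count(array_keys($bodyWords, $term)) / count($bodyWords);
    return in_array($term, $titleWords, true) ? 2.0 * $tf : $tf;
}
```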