Explanation of the clustering algorithm

Task 3: Article Clustering

The algorithm I used is essentially tf-idf (term frequency–inverse document frequency). The idea behind the algorithm is that, for each term in each document, it calculates two frequencies. One is the term frequency, which is literally the number of occurrences of the term in that specific document, ‘normalized’ by the length of the document. The second is the inverse document frequency, which reflects how rare the term is in the whole document store, namely the logarithm of the total number of documents divided by the number of documents containing the term. After a certain amount of research I concluded that this algorithm suits the task well: it can be programmed in a clean and readable way and is not overly complex.
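
Both frequencies can be sketched in a few lines of PHP. This is only an illustration of the weighting described above, not the gist’s actual code; the function names and the shape of $documents (an array of word lists, one per document) are assumptions.

```php
<?php
// Term frequency: occurrences of $term in one document,
// normalized by the document's length in words.
function termFrequency(string $term, array $documentWords): float {
    $occurrences = count(array_keys($documentWords, $term));
    return $occurrences / count($documentWords);
}

// Inverse document frequency: log of the store size divided by
// the number of documents that contain the term.
function inverseDocumentFrequency(string $term, array $documents): float {
    $containing = 0;
    foreach ($documents as $words) {
        if (in_array($term, $words, true)) {
            $containing++;
        }
    }
    return log(count($documents) / max(1, $containing));
}

// The tf-idf score is simply the product of the two.
function tfIdf(string $term, array $documentWords, array $documents): float {
    return termFrequency($term, $documentWords)
         * inverseDocumentFrequency($term, $documents);
}
```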

During the research I found two other options, which I concluded were either slightly irrelevant or too complex for the task. The first was a bag-of-words solution, which could have been combined with a Bayes classifier. Based on the articles and the analysis, this solution seems ideal for classifying documents against two predefined sets of words, deciding whether a document is ‘closer’ to bag A or bag B. In our case we actually need to ‘create’ the bags ourselves, and we likely need more than two.
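
To make the “closer to bag A or B” idea concrete, here is a toy sketch; the bags and the simple overlap-count scoring rule are hypothetical, chosen only to show why the bags must exist up front.

```php
<?php
// Score a document against two predefined word bags and pick
// the one with more overlapping words. Purely illustrative.
function closerBag(array $documentWords, array $bagA, array $bagB): string {
    $scoreA = count(array_intersect($documentWords, $bagA));
    $scoreB = count(array_intersect($documentWords, $bagB));
    return $scoreA >= $scoreB ? 'A' : 'B';
}
```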

The second approach would be a complex semantic analysis. In this case the word-paragraph connections are represented in matrices, and the words – the rows – are compared as vectors. The angle between those vectors is analyzed to deduce correlation between the words, and basic clustering algorithms can then be run with the documents and these vectors as input data.
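
The angle comparison mentioned above amounts to the cosine similarity between two row vectors. A minimal sketch, assuming each vector holds hypothetical word-by-paragraph counts of the same length:

```php
<?php
// Cosine similarity between two equal-length word vectors
// (rows of the word-by-paragraph matrix). Illustrative only.
function cosineSimilarity(array $a, array $b): float {
    $dot = 0.0; $normA = 0.0; $normB = 0.0;
    foreach ($a as $i => $value) {
        $dot   += $value * $b[$i];
        $normA += $value * $value;
        $normB += $b[$i] * $b[$i];
    }
    if ($normA == 0.0 || $normB == 0.0) {
        return 0.0;
    }
    return $dot / (sqrt($normA) * sqrt($normB));
}

// Words with similar paragraph distributions score close to 1.
echo cosineSimilarity([3, 0, 1, 2], [2, 0, 1, 3]); // ≈ 0.93
```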

The implementation

I chose basic, non-OO PHP for the implementation. At first glance it looks fairly long, but I tried to ensure readability, so each step is easy to follow. The logic gathers the title and the content section of each document and creates word bags from them. After eliminating function words and unnecessary punctuation, it calculates the tf-idf score for each word in each document and ranks the words. It then takes the top 15 words for each document and checks the intersections between documents. If there is at least one matching word, the documents simply go into the same cluster. In the end the algorithm outputs the title list for each cluster, which can also be seen in the uploaded file.
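
For illustration, the clustering step described above could look like the sketch below. The $topWords input shape (title => top-15 word list) and the greedy merge into each cluster’s accumulated word set are my assumptions, not the uploaded implementation itself.

```php
<?php
// Group documents that share at least one of their top-ranked words.
// $topWords maps each document title to its top-15 word list.
function cluster(array $topWords): array {
    $clusters = [];  // each cluster: ['titles' => [...], 'words' => [...]]
    foreach ($topWords as $title => $words) {
        $placed = false;
        foreach ($clusters as &$cluster) {
            // At least one matching top word: same cluster.
            if (count(array_intersect($words, $cluster['words'])) >= 1) {
                $cluster['titles'][] = $title;
                $cluster['words'] = array_unique(
                    array_merge($cluster['words'], $words)
                );
                $placed = true;
                break;
            }
        }
        unset($cluster);  // drop the loop reference
        if (!$placed) {
            $clusters[] = ['titles' => [$title], 'words' => $words];
        }
    }
    return array_column($clusters, 'titles');  // title list per cluster
}
```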

Room for improvement

As with (likely) every algorithm, this one could be improved as well. I would consider these the main options:

  • rewriting it in a more algorithm-friendly language, e.g. C, or using more built-in PHP functions – many of the needed array operations are already covered
  • word stemming
  • running the algorithm twice and merging the results – for example, the current run could be merged with another one where the top 20-30 words are selected from each document, but where putting two documents into the same cluster requires a two-word match between the top words
  • extending the function word list, particularly with words that have many different meanings (I added ‘user’ and ‘service’ myself, because they were too common and were causing inadequate results)
  • the title words could have more weight – normally there is a good chance that a title contains an important keyword; a minimal sketch of this follows the list
  • even if the function word list is not extended, words with many different meanings could get a lower weight
  • the occurrence searching could be powered by a proper full-text engine
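
As a sketch of the title-weighting idea referenced in the list above, the term frequency could simply be scaled up for words that also appear in the title; the function name and the factor of 2.0 below are arbitrary assumptions.

```php
<?php
// Hypothetical tweak: boost the frequency of words that also
// occur in the document's title. The 2.0 factor is an assumption.
function weightedTermFrequency(string $term, array $bodyWords, array $titleWords): float {
    $tf = count(array_keys($bodyWords, $term)) / count($bodyWords);
    return in_array($term, $titleWords, true) ? 2.0 * $tf : $tf;
}
```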