The algorithm I used is basically the tf-idf algorithm, which can be found here. The idea behind the algorithm is that it calculates two quantities for each term in each document. The first is the term frequency, which is simply the number of occurrences of the term in that specific document, normalized by the length of the document. The second is the inverse document frequency, which measures how rare the term is across the whole document store: the logarithm of the size of the document store divided by the number of documents in which the term occurs. The two values are multiplied to give the term's weight in the document. After some research I concluded that this algorithm is well suited to the task's purposes, can be programmed in a clean and readable way, and is not overly complex.
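To make the two frequencies concrete, here is a minimal sketch in Python. It is not the actual implementation from this project: the function name `tf_idf`, the pre-tokenized input format, and the sample documents are all assumptions made for illustration. It also assumes the common textbook variant in which the inverse document frequency is the logarithm of the number of documents divided by the number of documents containing the term.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per document.

    `documents` is a list of token lists; tokenization is assumed to
    have been done elsewhere (hypothetical input format).
    """
    n_docs = len(documents)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in documents:
        counts = Counter(tokens)
        length = len(tokens)
        doc_scores = {}
        for term, count in counts.items():
            # Term frequency, normalized by the document length.
            tf = count / length
            # Inverse document frequency (assumed variant: log(N / df)).
            idf = math.log(n_docs / doc_freq[term])
            doc_scores[term] = tf * idf
        scores.append(doc_scores)
    return scores


# Made-up sample data, purely for illustration.
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "purred"],
]
scores = tf_idf(docs)
```

In this sample, "the" appears in every document, so its inverse document frequency is log(1) = 0 and its score is zero everywhere, while a term that occurs in only one document, such as "sat", gets the highest weight in that document. This is the property that makes the measure useful for picking out terms that characterize a document.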
During the research I found two other options, which I concluded were either slightly irrelevant or too complex for the task. The first one was a Bag-of-words solu