The idea here is to have a basic full text search engine, which accepts documents and creates a searchable structure.
I'm kinda using a tf-idf based idea, and took some inspiration from MySQL's internals for their Fulltext index. I added more robustness to the ranking algorithm by giving weightings a boost when an exact match of part of the query is found. Of course, it's not an exact science, and could do with a lot of improvement, but I think it's a pretty cool start.
Additionally, I've tried to write it using backwards-compatible JavaScript for wider browser support.
There are two objects: documents, and terms. Every time a document is inserted, the terms are tokenised and given weights. We also keep track of the word after each term to strengthen our ranking system. Documents are also stored, with their metadata, so that they can be retrieved with the results.
So we have:
```
document {
  terms_with_weighting: Object,
  metadata: Object
}

term {
  ...document_id: [ ...list_of_next_terms: String ]
}
```
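To make the shapes above concrete, here is what the two stores might look like after inserting a single document. The field values and document id are illustrative, not taken from the post:

```javascript
// One document, tokenised and weighted; weights here are made up.
var documents = {
  "doc1": {
    terms_with_weighting: { "quick": 0.63, "fox": 0.37 }, // term -> weight
    metadata: { title: "Example" }
  }
};

// Each term maps a document id to the terms seen immediately after it
// in that document, which is what powers the partial-match boost later.
var terms = {
  "quick": { "doc1": ["fox"] }
};
```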
First, we tokenise the string. That removes a lot of the symbols, separates out the words, and makes them lowercase. The tokenisation I am using here is very rudimentary. In a real use case, you might want to look at the newish "String.prototype.normalize" function to get rid of accents, and then simply remove all non a-zA-Z0-9' characters.
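A rudimentary tokeniser along those lines might look like this. The function name and the exact set of kept characters are my own choices, not the post's:

```javascript
// Lowercase the text, replace every run of characters outside
// a-z, 0-9 and the apostrophe with a space, then split on whitespace.
function tokenise(text) {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9']+/g, ' ')
    .trim()
    .split(/\s+/)
    .filter(function (t) { return t.length > 0; });
}
```

This keeps apostrophes so contractions like "it's" survive as a single token.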
After this, we iterate through all of the terms and give each a weight, using the formula (log(t) + 1) / s, where t is the number of times the term appears, and s is the sum of log(t) + 1 over all terms in the document. This is known as term frequency, and you can read more about it on the Wikipedia tf-idf page.
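The weighting step can be sketched as follows, assuming it takes the token list produced by tokenisation; the function name is illustrative:

```javascript
// weight(term) = (log(t) + 1) / s, where t is the term's count and
// s is the sum of (log(t) + 1) over all distinct terms in the document.
function termWeights(tokens) {
  var counts = {};
  tokens.forEach(function (t) {
    counts[t] = (counts[t] || 0) + 1;
  });

  var s = 0;
  var term;
  for (term in counts) {
    s += Math.log(counts[term]) + 1;
  }

  var weights = {};
  for (term in counts) {
    weights[term] = (Math.log(counts[term]) + 1) / s;
  }
  return weights;
}
```

Dividing by s normalises the weights so they sum to 1, which keeps long documents from dominating purely by having more words.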
The terms and document data are then stored in the data structure.
Next is the algorithm to find and rank terms. We simply tokenise the search input and run through each term. For each one, we calculate the inverse document frequency. This is the number of documents in the collection minus the number of documents containing the term, divided by the number of documents containing the term. Whew. I add 1.01 to this number before taking the log. Why? The +1 prevents a -Infinity for log 0, and the extra 0.01 gives a little weighting to terms that appear in all documents. Generally a plain +1 is simply added.
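That inverse-document-frequency step is a one-liner; the function and parameter names here are my own:

```javascript
// idf = log((N - n) / n + 1.01), where N is the total number of
// documents and n is the number of documents containing the term.
// The 1.01 avoids log(0) and keeps a tiny positive weight for terms
// that appear in every document.
function idf(totalDocs, docsWithTerm) {
  return Math.log((totalDocs - docsWithTerm) / docsWithTerm + 1.01);
}
```

Rare terms get a large idf, while a term found in every document gets log(1.01), which is small but still positive.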
After that, I also check for partial matches across more than one of the search terms. Since we stored which terms came after each term in the document, we can easily check whether the next word in the search query matches any of those. If so, we add to our partial match coefficient. Finally, once all of the weights are summed, we multiply that sum by 1 + log(p + 1), where p is our partial coefficient, or rather how many terms we found in the same order as the search query.
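Putting the pieces together, the per-document scoring might look like the sketch below. The parameter shapes mirror the structures described earlier, but the function itself is my reconstruction, not the author's code:

```javascript
// Sum tf-idf weights for each query term found in the document, count
// consecutive query terms that also appear consecutively in the
// document (the partial coefficient p), then boost the total by
// 1 + log(p + 1).
function scoreDocument(queryTokens, docWeights, nextTerms, idfByTerm) {
  var score = 0;
  var partial = 0;
  for (var i = 0; i < queryTokens.length; i++) {
    var term = queryTokens[i];
    if (docWeights[term]) {
      score += docWeights[term] * idfByTerm[term];
      // Did the next query term also follow this term in the document?
      var next = queryTokens[i + 1];
      if (next && nextTerms[term] && nextTerms[term].indexOf(next) !== -1) {
        partial += 1;
      }
    }
  }
  return score * (1 + Math.log(partial + 1));
}
```

With p = 0 the multiplier is exactly 1, so documents with no in-order matches are ranked by plain tf-idf.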
Finally, we push this to the results along with our document metadata, and voila: a somewhat working full text search engine.
Enjoy!