@modamodamoda
Last active May 2, 2021 04:39
A very quick full text search engine

The idea here is to have a basic full text search engine, which accepts documents and creates a searchable structure.

I'm roughly following a tf-idf based idea, and took some inspiration from MySQL's internals for their Fulltext index. I made the ranking algorithm more robust by boosting the weightings when an exact match of part of the query is found. Of course, it's not an exact science, and could do with a lot of improvement, but I think it's a pretty cool start.

Additionally, I've tried to write it in backwards-compatible JavaScript for broader browser support.

The data structure

There are two objects: documents, and terms. Every time a document is inserted, its terms are tokenised and given weights. We also keep track of the word that follows each term, to strengthen our ranking system. Documents are stored along with their metadata, so they can be returned with the results.

So we have:

document: [ terms_with_weighting (Object), metadata (Object) ]

term: { ...document_id: [ ...list_of_next_terms (String) ] }
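To make the shapes concrete, here is an illustrative snapshot after inserting one document. The key "doc1", the field "title", and the weight numbers are invented for this example, not computed by the real code:

```javascript
// Hypothetical state after inserting doc1 with the text "quick brown fox"
var documents = {
  doc1: [
    { quick: 0.33, brown: 0.33, fox: 0.33 }, // terms_with_weighting
    { title: 'Quick brown fox' }             // metadata
  ]
};
var terms = {
  quick: { doc1: ['brown'] }, // in doc1, "quick" is followed by "brown"
  brown: { doc1: ['fox'] },
  fox:   { doc1: [] }         // last word, nothing follows it
};
```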

The insertion algorithm

First, we tokenise the string. This strips out most of the symbols, separates the words, and lowercases them. The tokenisation I am using here is very rudimentary. In a real use case, you might want to look at the newish "String.prototype.normalize" function to get rid of accents, and then simply remove all non a-zA-Z0-9' characters.
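A sketch of that more robust tokeniser (the function name is mine; it assumes String.prototype.normalize is available, i.e. ES6+, so it gives up the gist's backwards-compatibility goal):

```javascript
function TokeniseNormalised(str) {
  return str
    .normalize('NFD')                // split letters from their combining accents
    .replace(/[\u0300-\u036f]/g, '') // drop the combining accent marks
    .toLowerCase()
    .replace(/[^a-z0-9']+/g, ' ')    // everything else becomes a separator
    .trim()
    .split(/\s+/);
}
```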

After this, we iterate through all of the terms and give each a weight, using the formula (log(t) + 1) / s, where t is the number of times the term appears, and s is the sum of log(t) + 1 over all terms in the document. This is known as term frequency, and you can read more about it on the Wikipedia tf-idf page.
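The weighting step can be sketched in isolation (the function name is mine, not part of the gist; it takes a map of raw term counts):

```javascript
// Turn raw term counts into term-frequency weights:
// each term gets (log(t) + 1) / s, where s = sum of log(t) + 1 over all terms.
function termFrequencies(counts) {
  var s = 0;
  for (var term in counts) s += Math.log(counts[term]) + 1;
  var tf = {};
  for (var term in counts) tf[term] = (Math.log(counts[term]) + 1) / s;
  return tf;
}
```

Because each weight is divided by the shared sum s, the weights of a document always add up to 1.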

The terms and document data are then stored in the data structure.

The find algorithm

Next is the algorithm to find and rank terms. We simply tokenise the search input and run through each term. Then, we calculate the inverse document frequency. This is the number of documents in the collection minus the number of documents containing the term, divided by the number of documents containing the term. Whew. I add 1.01 to this number. Why? The +1 prevents a -Infinity for log 0, and the extra 0.01 gives a little weight to terms that appear in every document. Generally, just a +1 is added.
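As a standalone sketch (function and parameter names are mine), the idf described above is:

```javascript
// Inverse document frequency as described above:
// idf = log((N - n) / n + 1.01), where N is the collection size and
// n is the number of documents containing the term.
function idf(totalDocs, docsWithTerm) {
  return Math.log((totalDocs - docsWithTerm) / docsWithTerm + 1.01);
}
```

A term in every document gives log(1.01), small but positive; a rare term gets a much larger weight.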

After that, I also check whether there is a partial match spanning more than one of the search terms. Since we stored which terms came after each term in the document, we can easily check whether the next word in our search query matches any of those. If so, we increment our partial match coefficient. Finally, once all of the weights are summed, we multiply the total by 1 + log(p + 1), where p is our partial coefficient, or rather how many terms we found in the same order as the search query.
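The final boost can be sketched as (function name is mine):

```javascript
// Multiply a document's summed tf-idf weight by 1 + log(p + 1),
// where p counts query term pairs found in the same order in the document.
function boost(weightSum, partialMatches) {
  return weightSum * (1 + Math.log(partialMatches + 1));
}
```

With p = 0 the score is unchanged, and each additional in-order pair raises it with diminishing returns.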

Finally, we push this to the results along with our document metadata, and voilà: we've got a somewhat working full text search engine.

Enjoy!

function Tokenise(str) {
    // Very rudimentary tokeniser: replace punctuation with spaces,
    // collapse whitespace, lowercase, and split on spaces
    return str.replace(/[\?.,\/#!$%\^&\*;:{}=\-_`~()\t\n\r]/g, " ")
              .replace(/\s\s+/g, ' ')
              .trim()
              .toLowerCase()
              .split(' ');
}
function Collection(opts) {
    this.opts = (opts === undefined) ? {} : opts;
    this.terms = {};     // term -> { document_id: [list of next terms] }
    this.documents = {}; // document_id -> [term weights, metadata]
}
Collection.prototype.updateTerm = function(term, k, refs) {
    if (this.terms[term] === undefined) this.terms[term] = {};
    this.terms[term][k] = refs;
};
Collection.prototype.document = function(key, props) {
    var ts = {};   // term counts, scaled by any per-field weight in opts
    var refs = {}; // term -> list of terms seen directly after it
    for (var p in props) {
        if (this.opts.__noindex && this.opts.__noindex.indexOf(p) !== -1) continue;
        var t = Tokenise(props[p]); // Tokenise the field
        for (var i = 0; i < t.length; i++) {
            // Count each term, weighted by the field's weight if one is set
            ts[t[i]] = (ts[t[i]] || 0) + (this.opts[p] ? this.opts[p] : 1);
            if (!refs[t[i]]) refs[t[i]] = [];
            if (i + 1 < t.length && refs[t[i]].indexOf(t[i + 1]) === -1) {
                refs[t[i]].push(t[i + 1]); // log next words (maybe make a dictionary in future)
            }
        }
    }
    // Normalise counts into term-frequency weights: (log(t) + 1) / s
    var s = 0;
    for (var k in ts) {
        s += Math.log(ts[k]) + 1;
    }
    for (var k in ts) {
        ts[k] = (Math.log(ts[k]) + 1) / s;
        this.updateTerm(k, key, refs[k]);
    }
    // Store the document, keeping only the __keep fields if specified
    if (!this.opts.__keep) {
        this.documents[key] = [ts, props];
    } else {
        var kept = {};
        for (var i = 0; i < this.opts.__keep.length; i++) {
            kept[this.opts.__keep[i]] = props[this.opts.__keep[i]];
        }
        this.documents[key] = [ts, kept];
    }
};
Collection.prototype.find = function(searchText) {
    var t = Tokenise(searchText);
    var found = {}; // document_id -> [summed weight, partial match count]
    var doclength = Object.keys(this.documents).length;
    for (var i = 0; i < t.length; i++) {
        // scour through the ol' terms
        var docs = this.terms[t[i]];
        if (docs === undefined) continue; // term appears in no document
        var n = Object.keys(docs).length;
        // Inverse document frequency; the 1.01 avoids log(0) and gives
        // terms present in every document a small non-zero weight
        var fr = ((doclength - n) / n) + 1.01; // any better idea? lol
        for (var j in docs) {
            var wt = this.documents[j][0][t[i]] * Math.log(fr);
            if (found[j] === undefined) found[j] = [wt, 0];
            else found[j][0] += wt;
            // Partial match: the next search term follows this one in the document
            if (i + 1 < t.length && docs[j].indexOf(t[i + 1]) !== -1) found[j][1]++;
        }
    }
    var results = [];
    for (var d in found) {
        // Boost the summed weight by 1 + log(partial matches + 1)
        results.push([found[d][0] * (1 + Math.log(found[d][1] + 1)), this.documents[d][1]]);
    }
    results.sort(function(a, b) {
        return b[0] - a[0]; // highest score first
    }); // maybe do a binary insert instead of sorting after
    return results;
};