@olivernn
Created January 19, 2016 16:49
Better handling of English contractions in lunr.
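// Strip common English contraction suffixes from the end of a token.
lunr.contractionTrimmer = function (token) {
  return token.replace(/('ve|n't|'d|'ll|'s|'re)$/, "")
}

// Register the trimmer itself (not lunr.stopWordFilter) under its own label.
lunr.Pipeline.registerFunction(lunr.contractionTrimmer, 'contractionTrimmer')

// Plugin: run the contraction trimmer immediately after lunr.trimmer.
var englishContractions = function (idx) {
  idx.pipeline.after(lunr.trimmer, lunr.contractionTrimmer)
}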
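
For a quick check of what the trimmer does, here is a standalone sketch that applies the same regular expression to a few sample tokens. It runs without lunr: `trimContraction` is a hypothetical stand-in for `lunr.contractionTrimmer`, and it assumes tokens are plain strings, as the gist's code does.

```javascript
// Hypothetical stand-in for lunr.contractionTrimmer.
// Assumes tokens are plain strings; the duplicate 've alternative is omitted,
// which does not change behaviour.
function trimContraction(token) {
  return token.replace(/('ve|n't|'d|'ll|'s|'re)$/, "")
}

var samples = ["they've", "don't", "she'd", "we'll", "dog's", "you're", "hello"]

// Apply the trimmer to every sample token.
var trimmed = samples.map(function (token) {
  return trimContraction(token)
})

console.log(trimmed.join(", "))
```

Tokens without one of the listed suffixes, such as "hello", pass through unchanged.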

j1m1lo commented May 18, 2017

I'm considering using this in our production environment.

Questions:

  • Is there a specific reason why your trimmer replaces n't, and not just 't?
  • My trimmer, return token.replace(/('m|'ve|'t|'d|'ll|'ve|'s|'re)$/, ""), also strips the 'm from "I'm", and it seems to work all right. Is there a downside? Did you leave it out on purpose?
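
To make the difference between the two patterns concrete, the following standalone sketch runs both over a few contractions. It does not need lunr; `gistPattern`, `commentPattern`, and `compare` are hypothetical names used only for this illustration.

```javascript
// The two regular expressions from this thread, side by side.
// Duplicate 've alternatives are omitted; behaviour is unchanged.
var gistPattern = /('ve|n't|'d|'ll|'s|'re)$/        // strips n't
var commentPattern = /('m|'ve|'t|'d|'ll|'s|'re)$/   // strips 't and 'm

// Hypothetical helper: show what each pattern leaves behind for one token.
function compare(token) {
  return {
    token: token,
    gist: token.replace(gistPattern, ""),
    comment: token.replace(commentPattern, "")
  }
}

var results = ["don't", "can't", "isn't", "i'm"].map(function (token) {
  return compare(token)
})

results.forEach(function (r) {
  console.log(r.token + " -> gist: " + r.gist + ", comment: " + r.comment)
})
```

Stripping n't turns "isn't" into "is" but "can't" into "ca", while stripping 't turns "can't" into "can" but "isn't" into "isn". Which is preferable probably depends on what the later pipeline stages (stop word filter, stemmer) do with those remainders.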

@albertsemple

I took a bit of a blunderbuss approach to this:

token.replace(/[^A-Za-z é]/g, "");

I had an issue where the possessive form of the surname "Burns" had been misspelt as "Burn's" in the corpus, and I wanted to add tolerance for that kind of misspelling.
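
As a rough illustration of that approach, the sketch below wraps the replace call in a hypothetical `stripNonLetters` function and runs it on a few tokens, with no lunr required.

```javascript
// Hypothetical wrapper around the replace call quoted above:
// removes every character that is not A-Z, a-z, a space, or é.
function stripNonLetters(token) {
  return token.replace(/[^A-Za-z é]/g, "")
}

var misspelt = stripNonLetters("Burn's")
var correct = stripNonLetters("Burns")

// Both spellings collapse to the same token, so either one matches the other.
console.log(misspelt === correct)

// Side effect worth knowing about: digits are removed as well.
console.log(stripNonLetters("mp3"))
```

The trade-off is that the character class drops anything outside it, including digits and accented letters other than é, so it suits a corpus where those do not matter.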
