@connerxyz
Last active June 28, 2018 16:20
Porter stemmer: Python's NLTK vs. Java's OpenNLP: How to get *almost* identical Porter stemmers

Porter stemmer: Python's NLTK vs. Java's OpenNLP

How to get almost equivalent Porter stemmer output from Java and Python.

One remaining difference is how they handle non-dictionary words ending in "s":

Assume you lowercase the following string and then tokenize it with the regex [a-z]\w+:

"EMI's request for an appeal should be denied until a final judgment is entered. Document filed by Michael Robertson. (js) (Entered: 02/01/2012)"

NLTK stems the token js to j, whereas OpenNLP leaves it as js.
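The tokenization step described above can be sketched with Python's standard `re` module. This is a minimal sketch, not necessarily how the original text was processed; the names `TOKEN_PATTERN` and `tokenize` are made up for illustration.

```python
import re

# Token pattern from the text: a lowercase letter followed by
# one or more word characters (so every token has at least two characters).
TOKEN_PATTERN = re.compile(r"[a-z]\w+")


def tokenize(text):
    """Lowercase the text, then extract tokens with the pattern above."""
    return TOKEN_PATTERN.findall(text.lower())


text = ("EMI's request for an appeal should be denied until a final judgment "
        "is entered. Document filed by Michael Robertson. (js) (Entered: 02/01/2012)")

tokens = tokenize(text)
print(tokens)
```

Because the pattern requires at least two characters, single letters such as "a" never become tokens, while the clerk initials "js" do and are therefore passed on to the stemmer.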

Note: There may be other differences yet to be discovered.
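If the js case needs to match as well, one option is to skip stemming for very short tokens on the Python side. The sketch below assumes the gap comes from OpenNLP leaving tokens of one or two characters untouched; that is an assumption, since only the js case is documented here. The helpers `original_porter` and `stem_like_opennlp` and the `min_length` parameter are hypothetical names, not part of NLTK or OpenNLP.

```python
def original_porter():
    """Build NLTK's Porter stemmer in ORIGINAL_ALGORITHM mode (requires nltk)."""
    # Imported lazily so the helper below can be used without nltk installed.
    from nltk.stem.porter import PorterStemmer
    return PorterStemmer(mode=PorterStemmer.ORIGINAL_ALGORITHM)


def stem_like_opennlp(token, stem_fn, min_length=3):
    """Return short tokens unchanged; stem everything else with stem_fn.

    min_length=3 is an assumed threshold chosen so that two-character
    tokens such as "js" are passed through untouched.
    """
    if len(token) < min_length:
        return token
    return stem_fn(token)
```

With NLTK installed, it could be used as `porter = original_porter()` followed by `stem_like_opennlp("js", porter.stem)`, which returns the token unchanged because it is shorter than `min_length`.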

"""
Java's OpenNLP uses the original Porter stemmer algorithm.
NLTK implements Martin Porter's *revised* version of the stemmer.
You must set mode to 'ORIGINAL_ALGORITHM' to use the original.
See http://www.nltk.org/api/nltk.stem.html
"""
import nltk

# Use the original algorithm so the output lines up with OpenNLP.
porter = nltk.stem.porter.PorterStemmer(mode='ORIGINAL_ALGORITHM')
// Java: OpenNLP's PorterStemmer uses the original algorithm, so no mode switch is needed.
import opennlp.tools.stemmer.PorterStemmer;

PorterStemmer stemmer = new PorterStemmer();