@TheDhejavu
Last active December 17, 2020 09:56
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.stem import LancasterStemmer


class PlagiarismChecker:
    def prepare_content(self, content):
        """Tokenize content, drop English stop words, and return Porter stems."""
        # STOP WORDS (NLTK's English list is all lowercase)
        stop_words = set(stopwords.words('english'))
        # TOKENIZE
        word_tokens = word_tokenize(content)
        filtered_content = []
        # STEMMING
        porter = PorterStemmer()
        for w in word_tokens:
            # Lowercase before the stop-word check so "The" is filtered like "the"
            w = w.lower()
            if w not in stop_words:
                word = porter.stem(w)
                filtered_content.append(word)
        return filtered_content
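
The entries in `stopwords.words('english')` are all lowercase, so the order of lowercasing and stop-word filtering affects which tokens survive. The following is a minimal, standard-library-only sketch of that filtering step. `STOP_WORDS` is a hypothetical three-word stand-in for the NLTK list and `filter_tokens` is an illustrative helper, not part of the gist; stemming is left out so the example runs without NLTK installed.

```python
# Hypothetical stand-in for set(stopwords.words('english')); the real list is much longer.
STOP_WORDS = {"the", "is", "a"}


def filter_tokens(tokens, stop_words=STOP_WORDS):
    """Lowercase each token first, then drop it if it is a stop word."""
    kept = []
    for token in tokens:
        lowered = token.lower()
        if lowered not in stop_words:
            kept.append(lowered)
    return kept


tokens = ["The", "cat", "is", "running"]
filtered = filter_tokens(tokens)
print(filtered)  # ['cat', 'running']
```

If the membership test ran on the original casing instead, the capitalized "The" would not match the lowercase entry and would stay in the result.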