@giuse
Created July 29, 2017 12:29
Word processing for a Bayesian spam filter
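# Normalizes the word list in place so the filter sees comparable tokens.
# Assumes `words`, `stopwords` and `stemmer` are readers on the host object.
def process_words
  # Lowercase (map! rather than downcase!, so frozen strings do not raise)
  words.map!(&:downcase)
  # Remove punctuation: keep only letters, digits and whitespace
  words.map! { |word| word.gsub(/[^a-z0-9\s]/, '') }
  # Remove numbers
  words.map! { |word| word.gsub(/[0-9]/, '') }
  # Remove empty and single-letter words
  words.reject! { |word| word.length < 2 }
  # Remove stopwords. This runs after punctuation stripping, so the list
  # must hold stripped forms ("dont", not "don't"). Mutating in place keeps
  # `words` and @words pointing at the same array.
  words.reject! { |word| stopwords.include?(word) }
  # Stemming
  words.map! { |word| stemmer.stem(word) }
end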
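The gist shows only the method, so here is a minimal sketch of how it might be wired up and called. The `Message` class, its constructor, the short stopword list and `NaiveStemmer` are all assumptions made for illustration; a real filter would presumably use a proper stemmer (for example a Porter stemmer) and a fuller stopword list.

```ruby
# Hypothetical stand-in for a real stemmer: it only strips a few common
# English suffixes and is not the stemmer used by the gist's author.
class NaiveStemmer
  SUFFIXES = /(ing|ed|s)\z/

  def stem(word)
    stemmed = word.sub(SUFFIXES, '')
    # Keep the original word if stripping would leave fewer than 3 letters
    stemmed.length < 3 ? word : stemmed
  end
end

# Hypothetical host class: the readers and constructor are assumptions.
class Message
  attr_reader :words, :stopwords, :stemmer

  def initialize(text, stopwords: %w[the is and you have], stemmer: NaiveStemmer.new)
    @words = text.split # naive whitespace tokenization
    @stopwords = stopwords
    @stemmer = stemmer
  end

  def process_words
    words.map!(&:downcase)                               # lowercase
    words.map! { |word| word.gsub(/[^a-z0-9\s]/, '') }   # strip punctuation
    words.map! { |word| word.gsub(/[0-9]/, '') }         # strip digits
    words.reject! { |word| word.length < 2 }             # drop empty/1-letter
    words.reject! { |word| stopwords.include?(word) }    # drop stopwords
    words.map! { |word| stemmer.stem(word) }             # stem
    words
  end
end

message = Message.new('You have WON 1000 dollars!!! Claiming is free.')
p message.process_words
# → ["won", "dollar", "claim", "free"]
```

The order of the steps matters: "1000" is emptied by the digit removal and only then discarded by the length check, and the stopwords are matched against already lowercased, punctuation-free tokens.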