@rmarronnier
Created November 21, 2019 19:18
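require "./stemmer"

module Cadmium
  class PorterStemmer < Stemmer
    include Tokenizer::StopWords

    property stop_words : Set(String)

    def initialize
      add_stop_words_list(:en)
      # The type is already declared by `property` above; Crystal does not
      # allow declaring an instance variable's type inside a method.
      @stop_words = @@loaded_stop_words[:en]
    end

    # Returns the token unchanged; no stemming rules are applied here.
    def self.stem(token)
      token
    end

    # Lowercases and tokenizes `text`, then stems each token.
    # Stop words are dropped unless `keep_stops` is true.
    def self.tokenize_and_stem(text, keep_stops = false)
      stemmed_tokens = [] of String
      lowercase_text = text.downcase
      tokens = Cadmium::Tokenizer::Aggressive.new.tokenize(lowercase_text)

      if keep_stops
        tokens.each { |token| stemmed_tokens.push(stem(token)) }
      else
        # A class method cannot read the instance variable `@stop_words`,
        # so build an instance to load the English list. This assumes `new`
        # takes no arguments, as in the `initialize` defined above.
        stop_words = new.stop_words
        tokens.each { |token| stemmed_tokens.push(stem(token)) unless stop_words.includes?(token) }
      end

      stemmed_tokens
    end
  end
end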