@aermolaev
Created September 30, 2010 13:33
RussianStemmingAnalyzer
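require 'ferret'

# Ferret analyzer for Russian text: strips HTML, lowercases, drops stop words,
# and stems tokens with the Snowball Russian stemmer.
class RussianStemmingAnalyzer < Ferret::Analysis::Analyzer
  include Ferret::Analysis
  include ActionView::Helpers::SanitizeHelper

  def initialize(stop_words = FULL_RUSSIAN_STOP_WORDS)
    @stop_words = stop_words
    # HTML::FullSanitizer is the html-scanner class bundled with older Action Pack releases.
    @full_sanitizer = HTML::FullSanitizer.new
  end

  # Builds the filter chain for a field value; each filter wraps the previous stream.
  def token_stream(field, str)
    ts = clean_html(str)
    ts = StandardTokenizer.new(ts)
    ts = LowerCaseFilter.new(ts)
    ts = StopFilter.new(ts, @stop_words)
    ts = StemFilter.new(ts, 'rus', 'UTF_8')
    HyphenFilter.new(ts)
  end

  protected

  # Replaces lowercase named entities (&nbsp;) and numeric references (&#160;) with a space.
  def strip_entities(str)
    str.gsub(/&([a-z][a-z0-9]+|#[0-9a-f]+);/, ' ')
  end

  # Removes entities first, then strips all HTML tags.
  def clean_html(str)
    @full_sanitizer.sanitize(strip_entities(str))
  end
end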
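The entity-stripping step can be tried on its own, without Ferret or Rails. The sketch below copies the pattern from `strip_entities` into a top-level method (in the analyzer it is a protected instance method) and runs it on a few sample strings. The sample inputs are made up for illustration. Note that the pattern as written only matches lowercase entity names and numeric references without an `x` prefix, so inputs such as `&Eacute;` or `&#xA0;` pass through unchanged.

```ruby
# Standalone copy of the entity-stripping step; needs no Ferret or Rails.
# Defined at top level for illustration only.
def strip_entities(str)
  str.gsub(/&([a-z][a-z0-9]+|#[0-9a-f]+);/, ' ')
end

# Hypothetical sample inputs, not taken from the gist.
samples = [
  'Tom&nbsp;&amp;&nbsp;Jerry', # lowercase named entities: replaced
  'a&#160;b',                  # decimal numeric reference: replaced
  'a&#xA0;b',                  # x-prefixed hex reference: left as-is
  'caf&Eacute;'                # uppercase entity name: left as-is
]

samples.each do |s|
  puts "#{s.inspect} -> #{strip_entities(s).inspect}"
end
```

If the indexed text contains uppercase or hex entities, a broader pattern would be needed before the sanitizer runs; that is a design choice for the caller and is not part of the gist.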