@samsondav
Created December 4, 2014 19:26
Tokenize an arbitrary Spanish string into words
def self.tokenize_spanish(text)
  text = text.gsub(%r{https?://\S+}i, '')    # strip every URL (sub removed only the first one)
  words = text.scan(/[\w@#ñÑáÁéÉíÍóÓúÚüÜ]+/)  # tokenize into words, keeping @mentions and #hashtags
  words.map!(&:downcase)                      # lowercase everything (accented letters need Ruby 2.4+)
  words.reject! { |word| word.length < 4 }    # reject words with 3 or fewer characters
  words
end
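
A minimal usage sketch, assuming Ruby 2.4 or later so that `downcase` also handles accented letters. The `SpanishTokenizer` module, the constant names, and the sample sentence are hypothetical and exist only to make the example self-contained; the pipeline itself follows the same four steps as the method above (strip URLs, scan for words, lowercase, drop short words).

```ruby
# Hypothetical wrapper module; not part of the original gist.
module SpanishTokenizer
  URL  = %r{https?://\S+}i            # a URL runs until the next whitespace
  WORD = /[\w@#ñÑáÁéÉíÍóÓúÚüÜ]+/      # ASCII word chars plus Spanish letters, @ and #

  def self.tokenize_spanish(text)
    text.gsub(URL, '')                        # remove every URL
        .scan(WORD)                           # split into word tokens
        .map(&:downcase)                      # lowercase (Unicode-aware on Ruby 2.4+)
        .reject { |word| word.length < 4 }    # drop words of 3 or fewer characters
  end
end

# Made-up sample input with two URLs, a mention, and a hashtag.
sample = 'Mañana visitaré http://example.com/pagina y https://example.org con @Carlos #Música'
tokens = SpanishTokenizer.tokenize_spanish(sample)
p tokens
# => ["mañana", "visitaré", "@carlos", "#música"]
```

Both URLs are removed, the short words "y" and "con" are dropped, and the leading `@` and `#` count toward the four-character minimum because they are part of the token.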