@robenalt
Created August 3, 2010 16:38
Detect languages and tokenize strings via Core Foundation
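framework 'Foundation'

class String
  # Returns the best-guess language code (e.g. "fr") for the string,
  # or nil when Core Foundation cannot identify one.
  def language
    CFStringTokenizerCopyBestStringLanguage(self, CFRangeMake(0, self.length))
  end

  # Splits the string into word tokens using CFStringTokenizer.
  def tokens
    str_array = []
    # Option 0 is kCFStringTokenizerUnitWord (tokenize by word).
    stok = CFStringTokenizerCreate(nil, self, CFRangeMake(0, self.length), 0, nil)
    # AdvanceToNextToken returns 0 (kCFStringTokenizerTokenNone) once no tokens remain.
    has_next = CFStringTokenizerAdvanceToNextToken(stok)
    until has_next == 0
      range = CFStringTokenizerGetCurrentTokenRange(stok)
      str_array << self[range.location, range.length]
      has_next = CFStringTokenizerAdvanceToNextToken(stok)
    end
    str_array
  end
end
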
puts "Bonne année!".language
# => "fr"
puts "Happy new year!".language
# => "en"
puts "¡Feliz año nuevo!".language
# => "es"
puts "Felice anno nuovo!".language
# => "it"
puts "أعياد سعيدة".language
# => "ar"
puts "@umran_chaelle ne oluo ümran nedir olay hattı kırmalar falan".language
puts "明けましておめでとうございます。".language
puts "明けましておめでとうございます。".tokens
puts "".language
puts "".tokens