Unicode normalization in Ruby sucks
# Assumes an older ActiveSupport is loaded, where String#chars returns an
# ActiveSupport::Multibyte::Chars proxy (as the output confirms).
decomposed = "e\xCC\x81" # "e" + U+0301 COMBINING ACUTE ACCENT, as UTF-8 bytes
composed = decomposed.chars.normalize(:c) # NFC: expected to be the single code point U+00E9
# 1, 2: length before and after normalization
puts "1: #{decomposed.chars.inspect} // #{decomposed.chars.length}"
puts "2: #{composed.inspect} // #{composed.length}"
# 2a-2c: index the composed Chars, then round-trip its code points through unpack/pack
puts " 2a: #{composed[0]}, #{composed[1]}"
puts " 2b: #{composed.unpack('U*').pack('U')} // #{composed.unpack('U*').pack('U').length}"
puts " 2c: #{composed.unpack('U*').collect { |cp| [cp].pack('U') unless cp.to_s.blank? }} // #{composed.unpack('U*').collect { |cp| [cp].pack('U') unless cp.to_s.blank? }.length}"
# 3, 4: the same composed text as a plain String, and its code points in hex
puts "3: #{composed.to_s.inspect} // #{composed.to_s.length}"
puts "4: #{composed.to_s.unpack('U' * composed.to_s.length).collect { |x| x.to_s(16) }}"
# 5: length of a Chars built directly from the decomposed string
puts "5: #{ActiveSupport::Multibyte::Chars.new(decomposed).length}"
Output:
1: #<ActiveSupport::Multibyte::Chars:0x3e4ed0c @string="é"> // 2
2: #<ActiveSupport::Multibyte::Chars:0x3e4ea50 @string="é"> // 1
2a: 233,
2b: é // 2
2c: é // 1
3: "é" // 2
4: e9
5: 2
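For comparison, here is a minimal sketch of the same experiment without ActiveSupport. It assumes Ruby 2.2 or later, where `String#unicode_normalize` is part of the standard library; the gist itself targets the older `Chars` API, so this is not the author's setup. The `describe` helper is a hypothetical name introduced only for this illustration.

```ruby
# Assumption: Ruby >= 2.2 with the default UTF-8 source encoding.
# Same input as the gist, written with a Unicode escape:
# "e" followed by U+0301 COMBINING ACUTE ACCENT.
decomposed = "e\u0301"

# NFC composes the pair into the single code point U+00E9.
composed = decomposed.unicode_normalize(:nfc)

# Hypothetical helper: report the three different "lengths" of a string.
def describe(str)
  {
    chars: str.length,       # number of code points
    bytes: str.bytesize,     # number of UTF-8 bytes
    codepoints: str.codepoints.map { |cp| cp.to_s(16) }
  }
end

decomposed_info = describe(decomposed)
composed_info = describe(composed)

puts "decomposed: #{decomposed_info.inspect}"
puts "composed:   #{composed_info.inspect}"
```

Under those assumptions the decomposed string reports 2 code points in 3 bytes (`65`, `301`) and the composed one reports 1 code point in 2 bytes (`e9`). That separates the code point count from the byte count, which is the distinction behind the mix of 1s and 2s in the output above.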