@DavidEGrayson
Created May 3, 2012 17:37
Ruby UTF8 parser (for fun)
# -*- coding: utf-8 -*-
# TODO: throw exceptions if bytes.next returns an unexpected type of byte
def utf8_parse(string)
  return enum_for(:utf8_parse, string) unless block_given?
  # String#bytes returns an Array on Ruby 2.0+, so use each_byte to get an Enumerator.
  bytes = string.each_byte
  while true
    byte = bytes.next
    # Decode one code point from its lead byte plus any continuation bytes.
    yield case byte
      when 0x00..0x7F then byte
      when 0xC0..0xDF then (byte-0xC0 << 6) | (bytes.next-0x80)
      when 0xE0..0xEF then (byte-0xE0 << 12) | (bytes.next-0x80 << 6) | (bytes.next-0x80)
      else raise "Invalid byte #{byte}"
    end
  end
rescue StopIteration
  # Raised by bytes.next once the input is exhausted.
end

str = "abcd¢世界"
p str.codepoints.to_a
p utf8_parse(str).to_a
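
The TODO above asks for exceptions on unexpected bytes. What follows is a minimal sketch of one way to do that, not part of the original gist: `utf8_parse_strict` and `next_continuation` are hypothetical names introduced here for illustration. It checks that every continuation byte is in the range 0x80..0xBF, raises on truncated input instead of stopping silently, and also handles four-byte sequences. It does not attempt to reject overlong encodings or surrogate code points.

```ruby
# Hypothetical stricter variant of utf8_parse (illustrative, not the original author's code).
def utf8_parse_strict(string)
  return enum_for(:utf8_parse_strict, string) unless block_given?
  bytes = string.each_byte

  # Fetch the next byte, require it to be a continuation byte (10xxxxxx),
  # and return its six payload bits.
  next_continuation = lambda do
    begin
      b = bytes.next
    rescue StopIteration
      raise ArgumentError, "Truncated UTF-8 sequence"
    end
    raise ArgumentError, "Invalid continuation byte #{b}" unless (0x80..0xBF).cover?(b)
    b & 0x3F
  end

  # Kernel#loop ends cleanly when bytes.next raises StopIteration at a character boundary.
  loop do
    byte = bytes.next
    yield case byte
      when 0x00..0x7F then byte
      when 0xC0..0xDF then ((byte & 0x1F) << 6) | next_continuation.call
      when 0xE0..0xEF then ((byte & 0x0F) << 12) | (next_continuation.call << 6) | next_continuation.call
      when 0xF0..0xF7 then ((byte & 0x07) << 18) | (next_continuation.call << 12) | (next_continuation.call << 6) | next_continuation.call
      else raise ArgumentError, "Invalid lead byte #{byte}"
    end
  end
end

strict_str = "abcd¢世界"
p utf8_parse_strict(strict_str).to_a == strict_str.codepoints
```

Masking with `& 0x3F` instead of subtracting 0x80 gives the same result for valid continuation bytes; the range check is what actually catches malformed input.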