@jcheng5
Created July 28, 2011 00:00
Generate ranges for Unicode character classes
#!/usr/bin/ruby
require 'pp'
Invalid = 0
NonAlpha = -1
Alpha = 1
NumericClasses = ['Nd']
AlphaClasses = ['Lu', 'Ll', 'Lt', 'Lm', 'Lo', 'Nl']
AlphanumericClasses = NumericClasses + AlphaClasses
# http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
data = File.read('UnicodeData.txt')
data = data.split("\n")
data.reject! {|x| x !~ /;/}
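# One slot per BMP code point; slots never listed in the data file stay Invalid.
result = Array.new(0x10000, Invalid)
data.each do |line|
  # Fields: 0 = code point (hex), 1 = name, 2 = general category, 3 = the rest.
  chunks = line.split(';', 4)
  code = chunks[0].hex
  # The file is sorted by code point, so stop once past the BMP.
  break if code > 0xFFFF
  if AlphanumericClasses.include?(chunks[2])
    result[code] = Alpha
  else
    result[code] = NonAlpha
  end
end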
# i = 0
# encoded = ""
# while (i < result.size)
#   x = 0
#   for j in 0...8
#     x += 1 << j if result[i + j] == Alpha
#   end
#   encoded += [x].pack('C')
#   i += 8
# end
#
# def isalpha(encoded, code_point)
#   unencoded = encoded.unpack('C*')
#   code_point = code_point.unpack('C')[0]
#   index = code_point / 8
#   bit = code_point % 8
#   (unencoded[index] >> bit) % 2 == 1
# end
#
# print encoded
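# Collapse the table into inclusive [first, last] ranges. Invalid slots neither
# extend nor close a range, so a run of unlisted code points between two
# alphanumeric entries (such as the interior of a <..., First>/<..., Last>
# block) is absorbed into one range.
current_range = nil
ranges = []
result.each_with_index do |value, i|
  if value == Alpha
    current_range ||= [i, i]
    current_range[1] = i
  elsif value == NonAlpha
    ranges << current_range if current_range
    current_range = nil
  end
end
ranges << current_range if current_range
# Print each range as a brace-delimited pair of hex bounds, such as {0x41, 0x5A}.
ranges.each do |range|
  print "{0x#{range[0].to_s(16).upcase}, 0x#{range[1].to_s(16).upcase}}, "
end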
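
The ranges the script prints are sorted and non-overlapping, so one plausible way to consume them is a binary search over the upper bounds. The following is only a sketch under that assumption: `SAMPLE_RANGES` and `in_ranges?` are hypothetical names that do not appear in the script, and the three sample ranges are hand-written ASCII ranges, not actual script output.

```ruby
# Hypothetical sample in the same [first, last] form the script builds:
# ASCII digits, uppercase letters, lowercase letters.
SAMPLE_RANGES = [[0x30, 0x39], [0x41, 0x5A], [0x61, 0x7A]].freeze

# Returns true if code_point lies inside one of the sorted,
# non-overlapping, inclusive ranges.
def in_ranges?(ranges, code_point)
  # Find-minimum bsearch: the first range whose upper bound is >= code_point.
  candidate = ranges.bsearch { |range| range[1] >= code_point }
  # The code point is a member only if that range also starts at or before it.
  !candidate.nil? && candidate[0] <= code_point
end

puts in_ranges?(SAMPLE_RANGES, 'A'.ord)
puts in_ranges?(SAMPLE_RANGES, '@'.ord)
```

A lookup costs O(log n) comparisons in the number of ranges, which is the usual trade-off against the larger but constant-time bitmap sketched in the commented-out code.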