@luikore
Last active August 29, 2015 14:15
Regexp to Match Fullwidth Characters
# Generate a regexp that matches full-width characters.
# Data comes from the Unicode EastAsianWidth.txt file (fetched below).
#
# East Asian Width property values:
# A  ; Ambiguous
# F  ; Fullwidth
# H  ; Halfwidth
# N  ; Neutral
# Na ; Narrow
# W  ; Wide
#
# see also
# https://docs.python.org/2/library/unicodedata.html
#
# To compute the display width of a string, see urwid's solution:
# http://likang.me/blog/2012/04/13/calculate-character-width-in-python/
# https://github.com/wardi/urwid/blob/master/urwid/old_str_util.py
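# Fetch the East Asian Width table (requires the curl command-line tool).
data = `curl ftp://ftp.unicode.org/Public/UNIDATA/EastAsianWidth.txt`
# Drop trailing comments and surrounding whitespace, then skip blank lines.
lines = data.lines.map { |l| l.sub(/#.*$/, '').strip }.reject(&:empty?)
res = []
lines.each do |l|
  # Keep only code points whose width property is F (Fullwidth) or W (Wide).
  next unless l =~ /;\s*[FW]\z/
  # "1100..115F;W" becomes "\u{1100}-\u{115F}"
  res << l.sub(/\s*;\s*[FW]\z/, '').gsub('..', '-').gsub(/(\w+)/, '\\u{\1}')
end
puts "/[#{res.join}]/"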
# /[\u{1100}-\u{115F}\u{2329}\u{232A}\u{2E80}-\u{2E99}\u{2E9B}-\u{2EF3}\u{2F00}-\u{2FD5}\u{2FF0}-\u{2FFB}\u{3000}\u{3001}-\u{3003}\u{3004}\u{3005}\u{3006}\u{3007}\u{3008}\u{3009}\u{300A}\u{300B}\u{300C}\u{300D}\u{300E}\u{300F}\u{3010}\u{3011}\u{3012}-\u{3013}\u{3014}\u{3015}\u{3016}\u{3017}\u{3018}\u{3019}\u{301A}\u{301B}\u{301C}\u{301D}\u{301E}-\u{301F}\u{3020}\u{3021}-\u{3029}\u{302A}-\u{302D}\u{302E}-\u{302F}\u{3030}\u{3031}-\u{3035}\u{3036}-\u{3037}\u{3038}-\u{303A}\u{303B}\u{303C}\u{303D}\u{303E}\u{3041}-\u{3096}\u{3099}-\u{309A}\u{309B}-\u{309C}\u{309D}-\u{309E}\u{309F}\u{30A0}\u{30A1}-\u{30FA}\u{30FB}\u{30FC}-\u{30FE}\u{30FF}\u{3105}-\u{312D}\u{3131}-\u{318E}\u{3190}-\u{3191}\u{3192}-\u{3195}\u{3196}-\u{319F}\u{31A0}-\u{31BA}\u{31C0}-\u{31E3}\u{31F0}-\u{31FF}\u{3200}-\u{321E}\u{3220}-\u{3229}\u{322A}-\u{3247}\u{3250}\u{3251}-\u{325F}\u{3260}-\u{327F}\u{3280}-\u{3289}\u{328A}-\u{32B0}\u{32B1}-\u{32BF}\u{32C0}-\u{32FE}\u{3300}-\u{33FF}\u{3400}-\u{4DB5}\u{4DB6}-\u{4DBF}\u{4E00}-\u{9FCC}\u{9FCD}-\u{9FFF}\u{A000}-\u{A014}\u{A015}\u{A016}-\u{A48C}\u{A490}-\u{A4C6}\u{A960}-\u{A97C}\u{AC00}-\u{D7A3}\u{F900}-\u{FA6D}\u{FA6E}-\u{FA6F}\u{FA70}-\u{FAD9}\u{FADA}-\u{FAFF}\u{FE10}-\u{FE16}\u{FE17}\u{FE18}\u{FE19}\u{FE30}\u{FE31}-\u{FE32}\u{FE33}-\u{FE34}\u{FE35}\u{FE36}\u{FE37}\u{FE38}\u{FE39}\u{FE3A}\u{FE3B}\u{FE3C}\u{FE3D}\u{FE3E}\u{FE3F}\u{FE40}\u{FE41}\u{FE42}\u{FE43}\u{FE44}\u{FE45}-\u{FE46}\u{FE47}\u{FE48}\u{FE49}-\u{FE4C}\u{FE4D}-\u{FE4F}\u{FE50}-\u{FE52}\u{FE54}-\u{FE57}\u{FE58}\u{FE59}\u{FE5A}\u{FE5B}\u{FE5C}\u{FE5D}\u{FE5E}\u{FE5F}-\u{FE61}\u{FE62}\u{FE63}\u{FE64}-\u{FE66}\u{FE68}\u{FE69}\u{FE6A}-\u{FE6B}\u{FF01}-\u{FF03}\u{FF04}\u{FF05}-\u{FF07}\u{FF08}\u{FF09}\u{FF0A}\u{FF0B}\u{FF0C}\u{FF0D}\u{FF0E}-\u{FF0F}\u{FF10}-\u{FF19}\u{FF1A}-\u{FF1B}\u{FF1C}-\u{FF1E}\u{FF1F}-\u{FF20}\u{FF21}-\u{FF3A}\u{FF3B}\u{FF3C}\u{FF3D}\u{FF3E}\u{FF3F}\u{FF40}\u{FF41}-\u{FF5A}\u{FF5B}\u{FF5C}\u{FF5D}\u{FF5E}\u{FF5F}\u{FF60}\u{FFE0}-\u{FFE1}\u{FFE2}\u{FFE3}\u{FFE4}\u{FFE5}-\u{FFE6}\u{1B000}-\u{1B001}\u{1F200}-\u{1F202}\u{1F210}-\u{1F23A}\u{1F240}-\u{1F248}\u{1F250}-\u{1F251}\u{20000}-\u{2A6D6}\u{2A6D7}-\u{2A6FF}\u{2A700}-\u{2B734}\u{2B735}-\u{2B73F}\u{2B740}-\u{2B81D}\u{2B81E}-\u{2F7FF}\u{2F800}-\u{2FA1D}\u{2FA1E}-\u{2FFFD}\u{30000}-\u{3FFFD}]/
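A minimal sketch of how a regexp like the one printed above might be used to estimate the display width of a string. The constant `WIDE` and the helper `display_width` are hypothetical names, not part of the script. `WIDE` is an abbreviated character class built from a few of the generated ranges, with adjacent ranges merged. It is not the full table, and the sketch ignores combining marks and control characters.

```ruby
# Hypothetical, abbreviated character class: a few ranges taken from the
# generated output (Hangul Jamo, CJK punctuation, kana, CJK ideographs,
# Hangul syllables, fullwidth forms). Adjacent ranges are merged.
WIDE = /[\u{1100}-\u{115F}\u{3000}-\u{303E}\u{3041}-\u{3096}\u{30A0}-\u{30FF}\u{4E00}-\u{9FFF}\u{AC00}-\u{D7A3}\u{FF01}-\u{FF60}\u{FFE0}-\u{FFE6}]/

# Hypothetical helper: counts each wide or fullwidth character as two
# columns and every other character as one column.
def display_width(str)
  str.each_char.sum { |ch| ch.match?(WIDE) ? 2 : 1 }
end

puts display_width("abc")      # 3
puts display_width("日本語")    # 6
puts display_width("Ruby入門")  # 8
```

For real use, paste the full generated character class in place of the abbreviated one so that every F and W code point is counted as two columns.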