Last active
August 29, 2015 14:15
Regexp to Match Fullwidth Characters
# Generate a regexp that matches full-width characters.
# Data comes from the Unicode EastAsianWidth.txt file, whose width classes are:
#
# A ; Ambiguous
# F ; Fullwidth
# H ; Halfwidth
# N ; Neutral
# Na ; Narrow
# W ; Wide
#
# see also
# https://docs.python.org/2/library/unicodedata.html
#
# To compute the display width of a string, see urwid's solution:
# http://likang.me/blog/2012/04/13/calculate-character-width-in-python/
# https://github.com/wardi/urwid/blob/master/urwid/old_str_util.py
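# Fetch the data file with curl (curl must be on the PATH).
data = `curl ftp://ftp.unicode.org/Public/UNIDATA/EastAsianWidth.txt`
# Drop trailing comments and surrounding whitespace, then discard blank lines.
lines = data.lines.map { |l| l.sub(/#.*$/, '').strip }.reject(&:empty?)
res = []
lines.each do |l|
  # Keep only Fullwidth (F) and Wide (W) entries; tolerate spaces around ';'.
  if l =~ /\s*;\s*[FW]\z/
    # "3001..3003" becomes "\u{3001}-\u{3003}", a range inside a character class.
    res << l.sub(/\s*;\s*[FW]\z/, '').gsub('..', '-').gsub(/(\w+)/, '\\u{\1}')
  end
end
puts "/[#{res.join}]/"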
# /[\u{1100}-\u{115F}\u{2329}\u{232A}\u{2E80}-\u{2E99}\u{2E9B}-\u{2EF3}\u{2F00}-\u{2FD5}\u{2FF0}-\u{2FFB}\u{3000}\u{3001}-\u{3003}\u{3004}\u{3005}\u{3006}\u{3007}\u{3008}\u{3009}\u{300A}\u{300B}\u{300C}\u{300D}\u{300E}\u{300F}\u{3010}\u{3011}\u{3012}-\u{3013}\u{3014}\u{3015}\u{3016}\u{3017}\u{3018}\u{3019}\u{301A}\u{301B}\u{301C}\u{301D}\u{301E}-\u{301F}\u{3020}\u{3021}-\u{3029}\u{302A}-\u{302D}\u{302E}-\u{302F}\u{3030}\u{3031}-\u{3035}\u{3036}-\u{3037}\u{3038}-\u{303A}\u{303B}\u{303C}\u{303D}\u{303E}\u{3041}-\u{3096}\u{3099}-\u{309A}\u{309B}-\u{309C}\u{309D}-\u{309E}\u{309F}\u{30A0}\u{30A1}-\u{30FA}\u{30FB}\u{30FC}-\u{30FE}\u{30FF}\u{3105}-\u{312D}\u{3131}-\u{318E}\u{3190}-\u{3191}\u{3192}-\u{3195}\u{3196}-\u{319F}\u{31A0}-\u{31BA}\u{31C0}-\u{31E3}\u{31F0}-\u{31FF}\u{3200}-\u{321E}\u{3220}-\u{3229}\u{322A}-\u{3247}\u{3250}\u{3251}-\u{325F}\u{3260}-\u{327F}\u{3280}-\u{3289}\u{328A}-\u{32B0}\u{32B1}-\u{32BF}\u{32C0}-\u{32FE}\u{3300}-\u{33FF}\u{3400}-\u{4DB5}\u{4DB6}-\u{4DBF}\u{4E00}-\u{9FCC}\u{9FCD}-\u{9FFF}\u{A000}-\u{A014}\u{A015}\u{A016}-\u{A48C}\u{A490}-\u{A4C6}\u{A960}-\u{A97C}\u{AC00}-\u{D7A3}\u{F900}-\u{FA6D}\u{FA6E}-\u{FA6F}\u{FA70}-\u{FAD9}\u{FADA}-\u{FAFF}\u{FE10}-\u{FE16}\u{FE17}\u{FE18}\u{FE19}\u{FE30}\u{FE31}-\u{FE32}\u{FE33}-\u{FE34}\u{FE35}\u{FE36}\u{FE37}\u{FE38}\u{FE39}\u{FE3A}\u{FE3B}\u{FE3C}\u{FE3D}\u{FE3E}\u{FE3F}\u{FE40}\u{FE41}\u{FE42}\u{FE43}\u{FE44}\u{FE45}-\u{FE46}\u{FE47}\u{FE48}\u{FE49}-\u{FE4C}\u{FE4D}-\u{FE4F}\u{FE50}-\u{FE52}\u{FE54}-\u{FE57}\u{FE58}\u{FE59}\u{FE5A}\u{FE5B}\u{FE5C}\u{FE5D}\u{FE5E}\u{FE5F}-\u{FE61}\u{FE62}\u{FE63}\u{FE64}-\u{FE66}\u{FE68}\u{FE69}\u{FE6A}-\u{FE6B}\u{FF01}-\u{FF03}\u{FF04}\u{FF05}-\u{FF07}\u{FF08}\u{FF09}\u{FF0A}\u{FF0B}\u{FF0C}\u{FF0D}\u{FF0E}-\u{FF0F}\u{FF10}-\u{FF19}\u{FF1A}-\u{FF1B}\u{FF1C}-\u{FF1E}\u{FF1F}-\u{FF20}\u{FF21}-\u{FF3A}\u{FF3B}\u{FF3C}\u{FF3D}\u{FF3E}\u{FF3F}\u{FF40}\u{FF41}-\u{FF5A}\u{FF5B}\u{FF5C}\u{FF5D}\u{FF5E}\u{FF5F}\u{FF60}\u{FFE0}-\u{FFE1}\u{FFE2}\u{FFE3}\u{FFE4}\u{FFE5}-\u{FFE6}\u{1B000}-\u{1B001}\u{1F200}-\u{1F202}\u{1F210}-\u{1F23A}\u{1F240}-\u{1F248}\u{1F250}-\u{1F251}\u{20000}-\u{2A6D6}\u{2A6D7}-\u{2A6FF}\u{2A700}-\u{2B734}\u{2B735}-\u{2B73F}\u{2B740}-\u{2B81D}\u{2B81E}-\u{2F7FF}\u{2F800}-\u{2FA1D}\u{2FA1E}-\u{2FFFD}\u{30000}-\u{3FFFD}]/
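
One way to use the generated pattern is to estimate how many terminal columns a string occupies, which is the problem the urwid links in the header comments address. The following is only a minimal sketch under stated assumptions: `FULLWIDTH_SAMPLE` and `display_width` are hypothetical names, the character class is a small excerpt of the ranges listed above rather than the full set, and counting every other character as one column ignores combining marks and control characters.

```ruby
# Excerpt of the generated ranges (assumption: a hand-picked subset, not the
# full class): Hangul Jamo, CJK symbols and punctuation, Hiragana, Katakana,
# CJK unified ideographs, Hangul syllables, and fullwidth ASCII forms.
FULLWIDTH_SAMPLE = /[\u{1100}-\u{115F}\u{3000}-\u{303E}\u{3041}-\u{3096}\u{30A0}-\u{30FF}\u{4E00}-\u{9FFF}\u{AC00}-\u{D7A3}\u{FF01}-\u{FF60}]/

# Hypothetical helper: count a character matched by the pattern as two
# columns and any other character as one column.
def display_width(str)
  str.each_char.sum { |ch| ch.match?(FULLWIDTH_SAMPLE) ? 2 : 1 }
end

puts display_width("abc")       # 3
puts display_width("日本語")     # 6
puts display_width("Ruby全角")  # 8
```

To use the full set, paste the complete generated literal in place of `FULLWIDTH_SAMPLE`. Ambiguous-width (A) characters are not covered by this pattern, so terminals that render them as wide will show different widths.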