@haruo31
Last active February 14, 2016 17:24
Generate a list of ranges usable in regular expressions from JIS X or SJIS code mappings
#!/usr/bin/python
# -*- coding: utf8 -*-
# -*- eval: (setq flycheck-python-pylint-executable "/usr/bin/pylint") -*-
"""
Print the ranges of Unicode code points that appear in a character code
mapping table, such as those published at unicode.org.
"""
from itertools import groupby
from codecs import decode

try:
    from urllib2 import urlopen
except ImportError:
    # for python3
    from urllib.request import urlopen


def create(url):
    """Yield regex range strings for the Unicode code points mapped at url."""
    nums = set()
    for l in urlopen(url):
        l = decode(l, 'iso8859-1')
        # Drop the trailing comment; skip comment-only and blank lines.
        d = l.split('#')[0]
        if not d.strip():
            continue
        # The Unicode value is expected in the second tab-separated column.
        fields = d.split('\t')
        if len(fields) < 2:
            continue
        d = fields[1].strip()
        if not d:
            continue
        # A cell may list several code points joined by '+'; collect each one.
        nums.update(int(n, 16) for n in d.lower().lstrip('u+').split('+'))
    # Within a run of consecutive values, n - i stays constant, so it works as a group key.
    offsets = groupby([(n - i, n) for i, n in enumerate(sorted(nums))], lambda t: t[0])
    for _, g in offsets:
        # s is the first item of the run; e ends up as the last one.
        s = e = next(g)
        for e in g:
            pass
        # NOTE: values above U+FFFF are printed with more than four hex digits.
        if s[1] != e[1]:
            yield '\\u{:04X}-\\u{:04X}'.format(s[1], e[1])
        else:
            yield '\\u{:04X}'.format(s[1])


if __name__ == '__main__':
    # iso-8859-1
    # src = 'http://ftp.unicode.org/Public/MAPPINGS/ISO8859/8859-1.TXT'
    # Shift_JIS-2004 (JIS X 0213:2004)
    src = 'http://x0213.org/codetable/sjis-0213-2004-std.txt'
    for entry in create(src):
        print(entry)
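The grouping step in the script can be tried without any network access. The following is a minimal sketch, not the gist's own code: `to_ranges`, `to_char_class`, and the sample code points are hypothetical names and data introduced here for illustration. It assumes Python 3.3 or later, where the `re` module accepts `\uXXXX` escapes in string patterns, and it only handles code points up to U+FFFF.

```python
import re
from itertools import groupby


def to_ranges(code_points):
    """Collapse integers into (start, end) runs of consecutive values."""
    ordered = sorted(set(code_points))
    # Inside a consecutive run, value - index is constant, so it is the group key.
    keyed = groupby(enumerate(ordered), key=lambda pair: pair[1] - pair[0])
    for _, run in keyed:
        values = [value for _, value in run]
        yield values[0], values[-1]


def to_char_class(ranges):
    """Build a regex character class from (start, end) pairs (BMP only)."""
    parts = []
    for start, end in ranges:
        if start == end:
            parts.append('\\u{:04X}'.format(start))
        else:
            parts.append('\\u{:04X}-\\u{:04X}'.format(start, end))
    return '[' + ''.join(parts) + ']'


# Hypothetical sample: three adjacent hiragana (U+3042 to U+3044)
# plus one isolated katakana (U+30A2).
sample = [0x3042, 0x3043, 0x3044, 0x30A2]
ranges = list(to_ranges(sample))
pattern = to_char_class(ranges)
print(pattern)

# The generated class matches a character inside a range and rejects one outside.
matcher = re.compile(pattern)
print(bool(matcher.match(u'\u3043')), bool(matcher.match(u'A')))
```

Joining the printed entries inside square brackets, as `to_char_class` does, is one way to turn the script's line-by-line output into a single character class. Other regex engines may need a different escape syntax.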