@zhiyue
Created March 31, 2018 04:22
Count the number of CJK characters in a string
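import re
from types import StringType  # Python 2 only


def statistics_cn_words(s, encoding='utf-8'):
    """Return the number of CJK characters in s (Python 2)."""
    # First alternative: runs of ASCII word characters and Greek letters.
    # Second alternative: runs of CJK ideographs, hiragana, and Hangul syllables.
    rx = re.compile(u"[a-zA-Z0-9_\u0392-\u03c9]+|[\u4E00-\u9FFF\u3400-\u4dbf\uf900-\ufaff\u3040-\u309f\uac00-\ud7af]+",
                    re.UNICODE)
    if type(s) is StringType:  # byte string, not unicode
        s = unicode(s, encoding, 'ignore')
    splitted = rx.findall(s)
    cjk_len = 0
    for w in splitted:
        # 12352 == 0x3040, the lowest code point in the CJK alternative;
        # Latin and Greek runs always start below it.
        if ord(w[0]) >= 12352:
            cjk_len += len(w)
    return cjk_len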
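The function above relies on Python 2 names (`types.StringType`, `unicode`). The following is a minimal sketch of how the same count might look under Python 3, not the author's code. The name `count_cjk_chars` is hypothetical. It keeps only the CJK alternative of the pattern, on the assumption that this is equivalent: the two character classes are disjoint, so the Latin/Greek runs only ever contributed tokens that the `ord(w[0])` check discarded.

```python
import re

# Same CJK ranges as the second alternative of the original pattern:
# CJK ideographs, Extension A, compatibility ideographs, hiragana, Hangul syllables.
CJK_RUN = re.compile(
    "[\u4e00-\u9fff\u3400-\u4dbf\uf900-\ufaff\u3040-\u309f\uac00-\ud7af]+"
)


def count_cjk_chars(s, encoding="utf-8"):
    """Count characters in the CJK ranges above (hypothetical Python 3 port)."""
    if isinstance(s, bytes):
        # Mirrors the original's unicode(s, encoding, 'ignore').
        s = s.decode(encoding, "ignore")
    return sum(len(run) for run in CJK_RUN.findall(s))


print(count_cjk_chars("hello 世界"))                # 2
print(count_cjk_chars("世界".encode("utf-8")))      # 2
```

As in the original, only the listed ranges are counted, so characters outside them (katakana, for example) do not contribute to the total.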