@zxteloiv
Last active January 2, 2016 11:55
Get the number of Chinese characters in a string
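import sys

# Python 2 script. It relies on two byte-length properties:
# - almost every Chinese character is 3 bytes in UTF-8;
# - every Chinese character is 2 bytes in GBK.

"""
# Byte-length check (disabled): print lines whose first tab-separated
# field is at most 3 bytes (for example, a single Chinese character in UTF-8).
for line in open(sys.argv[1], "r"):
    parts = line.rstrip('\r\n').split('\t')
    if len(parts[0].strip()) <= 3:
        print line.rstrip('\r\n')
"""

# Count the Chinese characters in the keyword (the first tab-separated field).
# The input is decoded as GBK, so the byte length minus the character length
# equals the number of double-byte characters.
for line in open(sys.argv[1], "r"):
    parts = line.split('\t')
    keyword = parts[0].strip()
    try:
        ch_num = len(keyword) - len(unicode(keyword, 'gbk'))
    except UnicodeDecodeError:
        # Skip lines whose keyword is not valid GBK.
        continue
    # Print only lines whose keyword has more than 2 Chinese characters.
    if ch_num > 2:
        print line.rstrip('\r\n')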