@dajare
Last active June 21, 2017 14:01
Counts the characters that fall within a Unicode range in a mixed-text file; currently set to the Hebrew block (U+0590 to U+05FF)
#!/usr/bin/python
# coding: utf-8
import io
import re
import sys
## chmod 0755 to make executable
## run with `./name.py input_file`
## source: https://unix.stackexchange.com/a/372270/99759
# Hebrew block, U+0590 to U+05FF; a plain u'' literal works on Python 2 and Python 3.3+
find_hebrew = re.compile(u'[\u0590-\u05ff]+')
text_file = sys.argv[1]
count = 0
# io.open decodes UTF-8 on both Python 2 and 3 (the 'U' mode flag was removed in Python 3.11)
with io.open(text_file, encoding='utf-8') as f:
    for line in f:
        # each match is a run of consecutive Hebrew code points
        for n in find_hebrew.findall(line):
            count += len(n)
print(count)
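
Since the script is "currently set to Hebrew", pointing it at another block should only mean changing the range. The sketch below is one possible way to make the range a parameter. It is not part of the original gist: `count_in_range`, `HEBREW`, and `GREEK` are hypothetical names, and it compares code points with `ord()` instead of using the regular expression above. It assumes Python 3.

```python
def count_in_range(text, first, last):
    """Count the code points in text whose value lies in [first, last]."""
    return sum(1 for ch in text if first <= ord(ch) <= last)

# Block boundaries as (first, last) code points; names are illustrative only.
HEBREW = (0x0590, 0x05FF)  # Hebrew block
GREEK = (0x0370, 0x03FF)   # Greek and Coptic block

sample = "\u05e9\u05dc\u05d5\u05dd world, \u03b1\u03b2\u03b3"
print(count_in_range(sample, *HEBREW))  # 4
print(count_in_range(sample, *GREEK))   # 3
```

Like the regex version, this counts every code point in the range, so Hebrew vowel points and cantillation marks are included alongside the letters.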