@rillian
Created March 12, 2020 19:05
quick script for analysing combining characters
#!/usr/bin/env python3
import collections
import unicodedata

# Count how often each character occurs across all words in the list.
histogram = collections.Counter()
with open('cop_wordlist.combined', encoding='utf-8') as f:
    for line in f:
        # Skip dictionary header.
        if line.startswith('dictionary='):
            continue
        line = line.strip()
        p = line.split(',')
        if len(p) != 2:
            print('Bad line:', p)
            continue
        # Each entry is expected to look like 'word=<text>,<frequency field>'.
        word, frequency = p
        _, word = word.split('=')
        histogram.update(word)

# Report only characters with a non-zero canonical combining class.
for char in histogram:
    if unicodedata.combining(char):
        print(f'u+{ord(char):06x} {unicodedata.name(char)} {histogram[char]}')
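
The same counting logic can be tried without the wordlist file. The following is a minimal sketch, not part of the original script: the function name `combining_histogram` and the sample lines (including the header text) are made up for illustration, and it assumes each entry has the form `word=<text>,f=<n>`. Unlike the script above, it silently skips malformed lines and counts only combining characters.

```python
import collections
import unicodedata


def combining_histogram(lines):
    """Count combining characters in 'word=<text>,f=<n>' lines.

    Hypothetical helper: the line format is an assumption based on
    how the script above parses its input.
    """
    histogram = collections.Counter()
    for line in lines:
        line = line.strip()
        # Skip blank lines and the dictionary header.
        if not line or line.startswith('dictionary='):
            continue
        parts = line.split(',')
        if len(parts) != 2:
            continue
        # Keep everything after the first '=' as the word.
        _, _, word = parts[0].partition('=')
        # Count only characters with a non-zero combining class.
        histogram.update(ch for ch in word if unicodedata.combining(ch))
    return histogram


# Made-up sample data; the header content is illustrative only.
sample = [
    'dictionary=main:cop,locale=cop',
    'word=a\u0300b,f=10',
    'word=e\u0300\u0304,f=3',
]
counts = combining_histogram(sample)
for char, count in sorted(counts.items()):
    print(f'u+{ord(char):06x} {unicodedata.name(char)} {count}')
```

Taking an iterable of lines instead of a filename keeps the parsing easy to check against a few hand-written entries before running it on the full wordlist.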