Count unigrams and bigrams in the Wolfart-Ahenakew nêhiyawêwin corpus!
When building a keyboard for typing Cree, it is useful to know which graphemes are typed often, and which pairs of graphemes are typed one after the other. Using unigram statistics, we can place the most frequent graphemes in the most ergonomic "neutral" positions. To speed up typing, we place frequently typed pairs on opposite sides of the keyboard, so that two-handed typing alternates between hands (or thumbs) as often as possible.
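As a rough illustration of the counting idea, here is a minimal sketch in Python. It is not the code in count-unigrams or count-bigrams; it assumes that each NFC-normalized code point is one grapheme (so ê, î, ô, â count as single units) and that bigrams are counted within words only. The function name count_ngrams and the sample words are made up for this example.

```python
import unicodedata
from collections import Counter


def count_ngrams(words):
    """Count single graphemes and adjacent grapheme pairs in a list of words.

    Assumption: after NFC normalization, one code point == one grapheme.
    """
    unigrams = Counter()
    bigrams = Counter()
    for word in words:
        # Compose base letter + combining circumflex into one code point.
        graphemes = unicodedata.normalize("NFC", word.lower())
        unigrams.update(graphemes)
        # Pair each grapheme with its successor; pairs never span two words.
        bigrams.update(zip(graphemes, graphemes[1:]))
    return unigrams, bigrams


# Hypothetical sample input, not taken from the corpus file.
unigrams, bigrams = count_ngrams(["nêhiyawêwin", "tânisi"])
for grapheme, count in unigrams.most_common(3):
    print(grapheme, count, sep="\t")
```

The most frequent entries in unigrams are candidates for the neutral positions, while the most frequent entries in bigrams are candidates for placement on opposite sides of the keyboard.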
- Python 3.6+
- sponge(1) from moreutils (brew install moreutils)
- nfc(1) from unormalize (brew install eddieantonio/eddieantonio/unormalize)
.
├── Makefile
├── bigrams.pdf [output]
├── bigrams.tsv [output]
├── cleancorp.txt [input]
├── count-bigrams
├── count-unigrams
├── create-fdp
├── defuse
├── filter-out-non-sro
├── tokenize
└── unigrams.tsv [output]
Note: You must download cleancorp.txt separately!
SHA256 sum:
2cb09a54b9c3cc329eba573b4798aa2d79b19f58b4f417a97150c9133c3b3343 cleancorp.txt
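
One way to check a downloaded copy against the sum above is sketched below, using only Python's standard hashlib module. The helper names sha256_of and matches_expected are invented for this example; they are not part of this repository.

```python
import hashlib

# The SHA256 sum listed above for cleancorp.txt.
EXPECTED_SHA256 = "2cb09a54b9c3cc329eba573b4798aa2d79b19f58b4f417a97150c9133c3b3343"


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of the file at path, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as corpus_file:
        # iter() with a sentinel stops once read() returns empty bytes.
        for chunk in iter(lambda: corpus_file.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """Return True if the file's digest equals the expected hex string."""
    return sha256_of(path) == expected.lower()
```

Calling matches_expected("cleancorp.txt") from the repository directory should return True for an intact download.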