toki pona letter/syllable/word frequency statistics based on #toki-pona-taso on the ma pona pi toki pona discord server. Second file has some stats from "toki pona taso sin" on telegram.
None of the following data is used in the other file; it's just a different data source. They track each other pretty well though!
(64218 words in total)
li 4647
mi 4143
e 3597
toki 2905
ni 2811
pona 2692
a 2126
ala 1996
jan 1853
sina 1765
la 1729
lon 1594
sona 1483
mute 1268
tawa 1242
pi 1169
ike 1019
tenpo 1006
seme 973
wile 914
ona 905
o 856
kama 764
taso 757
ken 738
pali 663
nimi 663
tan 660
ma 636
pilin 592
lili 584
moku 565
lukin 445
tomo 444
ilo 433
kepeken 432
sitelen 411
musi 408
anu 348
jo 325
ali 321
sama 318
luka 318
kin 311
en 310
ante 282
pana 261
ijo 258
lape 256
telo 253
suno 252
wan 229
suli 228
pini 228
losi 224
nasa 220
nasin 220
lipu 218
nanpa 217
lawa 198
tu 196
mani 192
kalama 185
kulupu 176
wawa 172
sin 170
weka 161
ale 151
moli 148
sike 143
pakala 137
soweli 130
sewi 126
awen 113
utala 107
inli 103
pan 97
kon 95
poka 94
sonja 89
ko 89
leko 86
sijelo 86
linja 85
pimeja 84
pu 82
seli 80
kute 80
kasi 78
jaki 75
insa 73
suwi 71
lete 67
pije 58
kili 56
sonko 54
uta 54
kiwen 50
mama 50
p 49
open 48
oko 46
esun 45
meli 44
lupa 43
poki 42
wowa 39
mije 39
unpa 38
i 37
mun 36
onkon 35
monsuta 35
olin 34
len 32
nijon 31
namako 30
palisa 30
l 29
pipi 29
loje 29
anpa 28
kule 28
m 28
walo 27
noka 27
nena 27
selo 26
jelo 24
supa 21
epanja 21
pata 21
n 20
t 19
kala 19
powe 19
laso 19
epelanto 16
sinpin 15
mu 14
tosi 14
kanse 14
u 14
tajo 13
akesi 13
aaa 12
w 12
k 12
po 11
katala 11
na 11
kan 11
apeja 10
mateli 10
syllable frequency based on #toki-pona-taso (lots of filtering: I tried to remove things like usernames and words containing non-toki-pona characters, and possibly also ended up removing anything else that was styled, like emphasized text)
li 19545
na 17887
mi 14485
la 12106
a 11507
po 9428
e 9250
ni 8528
to 8182
si 7587
ki 7095
ta 6354
pi 6301
te 5773
so 5671
ma 5620
o 5439
wa 5142
ka 4776
lon 3881
le 3767
jan 3738
mu 3520
ke 3323
i 3301
wi 3268
pa 3232
ken 3118
ten 2845
mo 2693
lin 2623
lo 2330
pe 2135
ku 2078
sa 1938
su 1903
lu 1836
se 1824
jo 1685
kin 1552
tan 1512
len 1480
pu 1413
me 1395
sin 1068
ja 869
an 841
no 783
we 778
nu 698
je 611
tu 610
in 570
u 557
wen 515
ko 504
en 483
kon 389
pan 353
wan 335
nan 289
pen 227
sun 192
lan 147
kan 147
ne 139
mon 139
pin 86
ju 77
son 76
mun 75
un 69
ti 49
jon 37
win 30
ton 30
san 29
on 29
wo 27
pon 26
ji 23
man 21
jen 18
men 15
sen 14
tun 11
wu 10
non 9
tin 7
nin 7
nun 3
min 2
kun 2
jin 2
pun 1
nen 1
won 1
lun 1
jun 1
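The gist does not include the script that produced these syllable counts, so the following is only a minimal sketch of one way to get comparable numbers. It assumes the usual toki pona syllable shape (optional consonant, vowel, optional final n) and treats an n followed by a vowel as the start of the next syllable. The names `split_syllables` and `syllable_counts` are made up for illustration. Like the table above, it does not reject rare syllables such as "ti" or "wu".

```python
import re
from collections import Counter

# One syllable: optional consonant, a vowel, and an optional final "n".
# The lookahead keeps an "n" that is followed by a vowel for the next syllable
# (so "ona" splits as o + na, not on + a).
SYLLABLE = r"[jklmnpstw]?[aeiou](?:n(?![aeiou]))?"
SYLLABLE_RE = re.compile(SYLLABLE)
WORD_RE = re.compile(rf"(?:{SYLLABLE})+")


def split_syllables(word):
    """Return the syllables of a word, or [] if it is not built from such syllables."""
    if not WORD_RE.fullmatch(word):
        return []  # e.g. usernames, English words, typos like "pillin"
    return SYLLABLE_RE.findall(word)


def syllable_counts(word_counts):
    """Turn a {word: count} mapping into a Counter of syllable frequencies."""
    totals = Counter()
    for word, n in word_counts.items():
        for syllable in split_syllables(word):
            totals[syllable] += n
    return totals
```

For example, `split_syllables("kepeken")` gives `['ke', 'pe', 'ken']`, while a word like "tok" is skipped because it cannot be split into valid syllables.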
letter frequency:
a 78489
i 76698
n 55747
l 48078
o 41835
e 38048
m 28147
t 25683
p 23403
k 23231
s 20699
u 13249
w 10225
j 7173
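Raw counts are hard to compare across corpora of different sizes, so it can help to turn them into shares. This is a small sketch using the letter counts listed above; the variable names are only illustrative.

```python
# Letter counts copied from the table above.
letter_counts = {
    "a": 78489, "i": 76698, "n": 55747, "l": 48078, "o": 41835,
    "e": 38048, "m": 28147, "t": 25683, "p": 23403, "k": 23231,
    "s": 20699, "u": 13249, "w": 10225, "j": 7173,
}

total = sum(letter_counts.values())

# Share of each letter as a fraction of all counted letters.
letter_shares = {letter: n / total for letter, n in letter_counts.items()}

for letter, share in sorted(letter_shares.items(), key=lambda kv: -kv[1]):
    print(f"{letter}\t{share:.2%}")
```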
words by frequency (words with frequency >= 10)
mi 12551
li 11430
e 8785
toki 6617
pona 6479
ni 5753
a 5231
la 4715
ala 4430
sina 4012
lon 3907
jan 3736
tawa 3480
pi 2976
sona 2949
tenpo 2757
ona 2741
wile 2434
mute 2242
taso 2140
o 2063
kama 2041
ken 2001
pilin 1971
nimi 1790
ike 1703
lili 1594
tan 1476
tomo 1472
pali 1389
ma 1361
sitelen 1306
kepeken 1104
musi 975
jo 930
moku 912
lukin 835
sama 828
telo 826
lape 820
seme 805
kin 747
ilo 734
ale 733
pini 729
ante 722
suli 703
ijo 684
anu 665
nasa 660
kulupu 646
suno 635
pana 566
kalama 549
lipu 528
tu 514
nasin 501
sin 492
pakala 482
en 477
wawa 448
olin 419
lawa 416
awen 366
sewi 356
seli 355
kon 352
soweli 352
weka 341
mu 329
wan 328
inli 323
ali 319
lete 306
sike 296
nanpa 286
kasi 283
moli 281
kute 270
suwi 268
utala 260
pimeja 255
mama 252
sijelo 249
pan 223
luka 215
uta 214
open 211
ko 209
jaki 192
kala 188
pu 185
insa 185
esun 183
kili 178
poka 172
mani 168
len 158
linja 145
meli 142
kiwen 129
poki 119
supa 110
i 110
kule 109
kanse 103
mije 101
waso 100
walo 96
pipi 94
palisa 94
to 92
anpa 88
noka 84
akesi 78
loje 77
mun 75
nena 71
ten 66
unpa 66
sinpin 65
mewika 64
selo 64
aa 61
monsi 58
epanja 58
epelanto 58
jelo 57
monsuta 57
laso 54
oko 54
alasa 53
kawa 49
u 49
lo 46
s 44
in 43
p 42
elopa 40
aaa 40
sonala 39
me 36
t 36
is 36
sonko 35
aaaa 34
losi 33
noun 33
lupa 32
l 31
tok 29
sonja 27
n 26
pillin 26
it 26
k 25
leni 25
lanpan 25
pije 25
ee 24
toks 24
kanata 24
amelika 23
tosi 23
majuna 22
ne 22
like 22
aaaaa 21
kipisi 21
m 21
ka 20
nijon 20
jans 20
po 20
w 20
tempo 19
naluto 19
j 19
lile 19
iwisi 18
aaaaaa 18
ana 17
masatuse 16
nu 16
wije 16
elena 16
an 16
onkon 16
waleja 15
losupan 15
maliku 15
lasina 15
leko 15
anku 14
nikole 14
makuwe 14
ejewa 14
wajen 14
linluwi 14
oselija 13
nawi 13
kisi 13
sumi 12
pa 12
teka 12
namako 12
te 12
il 12
inkepa 11
kan 11
apeja 11
tomen 11
lu 11
ti 11
man 11
on 11
ese 11
pesije 11
powe 11
pikan 11
akon 11
kapesi 11
oo 10
lena 10
naj 10
juwese 10
juke 10
new 10
misisipi 10
no 10
kipo 10
posuka 10
kepe 10
jasi 10
na 10
@Davido101

...tonsi is not one of them... ike a :/

@Davido101

After trying to use this in my code and failing, I have noticed that the issue is that "ilo" and "ale" (lines 164 and 165) have a space instead of a tab.
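One way to sidestep the mixed separators is to split each line on any whitespace instead of on tabs specifically. This is a minimal sketch, assuming the data is available as plain text with one "word count" pair per line; `parse_counts` is a hypothetical helper, not part of the gist.

```python
def parse_counts(text):
    """Parse 'word<whitespace>count' lines into a dict, ignoring everything else."""
    counts = {}
    for line in text.splitlines():
        parts = line.split()  # splits on tabs and spaces alike
        if len(parts) != 2 or not parts[1].isdigit():
            continue  # skip headers, blank lines, and other non-data lines
        word, number = parts
        counts[word] = int(number)
    return counts


sample = "words by frequency (words with frequency>10)\nmi\t12551\nilo 734\nale 733\n"
print(parse_counts(sample))
```

Header lines such as "letter frequency:" are skipped because their second field is not a number.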