@wincent
Created August 13, 2015 08:16
$ yak-layout corpus-stats
Corpus length: 7,198,132 bytes
Unigrams by frequency (top 68 of 68):
-------------------------------------
e: 319,790 (11.37%)
t: 233,143 (8.29%)
r: 176,189 (6.26%)
a: 170,065 (6.05%)
n: 169,157 (6.01%)
o: 164,595 (5.85%)
i: 159,275 (5.66%)
s: 147,306 (5.24%)
l: 108,220 (3.85%)
c: 101,123 (3.59%)
d: 95,128 (3.38%)
u: 81,564 (2.90%)
p: 73,809 (2.62%)
h: 68,358 (2.43%)
m: 66,807 (2.37%)
f: 53,523 (1.90%)
g: 46,803 (1.66%)
.: 45,501 (1.62%)
y: 34,742 (1.23%)
(: 34,635 (1.23%)
): 34,324 (1.22%)
v: 32,855 (1.17%)
b: 32,567 (1.16%)
w: 26,613 (0.95%)
': 25,676 (0.91%)
,: 25,438 (0.90%)
;: 25,009 (0.89%)
=: 18,773 (0.67%)
k: 18,656 (0.66%)
/: 18,551 (0.66%)
:: 16,382 (0.58%)
{: 15,625 (0.56%)
}: 15,570 (0.55%)
x: 15,176 (0.54%)
_: 13,295 (0.47%)
*: 12,718 (0.45%)
q: 11,782 (0.42%)
-: 10,573 (0.38%)
": 10,125 (0.36%)
>: 7,438 (0.26%)
`: 7,140 (0.25%)
0: 7,019 (0.25%)
1: 6,805 (0.24%)
j: 5,593 (0.20%)
[: 5,138 (0.18%)
]: 5,054 (0.18%)
2: 4,654 (0.17%)
3: 3,037 (0.11%)
+: 2,720 (0.10%)
<: 2,574 (0.09%)
#: 2,473 (0.09%)
|: 2,298 (0.08%)
4: 2,293 (0.08%)
@: 2,289 (0.08%)
5: 2,123 (0.08%)
z: 1,821 (0.06%)
&: 1,799 (0.06%)
!: 1,709 (0.06%)
9: 1,703 (0.06%)
8: 1,653 (0.06%)
6: 1,604 (0.06%)
7: 1,410 (0.05%)
?: 1,298 (0.05%)
\: 930 (0.03%)
$: 623 (0.02%)
%: 389 (0.01%)
^: 165 (0.01%)
~: 105 (0.00%)
Bigrams by frequency (top 50 of 3,406):
---------------------------------------
re: 51,657 (2.17%)
in: 39,785 (1.67%)
er: 37,576 (1.58%)
th: 35,868 (1.51%)
en: 33,892 (1.42%)
on: 32,853 (1.38%)
te: 28,704 (1.21%)
nt: 28,127 (1.18%)
at: 27,442 (1.15%)
es: 24,686 (1.04%)
ti: 24,680 (1.04%)
st: 24,423 (1.03%)
or: 24,139 (1.01%)
le: 24,122 (1.01%)
he: 22,762 (0.96%)
ar: 22,603 (0.95%)
to: 22,106 (0.93%)
ct: 22,001 (0.92%)
de: 20,812 (0.87%)
se: 20,293 (0.85%)
co: 20,112 (0.84%)
an: 19,981 (0.84%)
me: 19,933 (0.84%)
is: 18,082 (0.76%)
al: 18,003 (0.76%)
ed: 17,915 (0.75%)
ec: 17,500 (0.74%)
et: 16,864 (0.71%)
ro: 16,609 (0.70%)
ng: 16,399 (0.69%)
nd: 16,211 (0.68%)
io: 15,710 (0.66%)
it: 15,391 (0.65%)
ta: 15,037 (0.63%)
ra: 14,785 (0.62%)
pe: 14,599 (0.61%)
el: 14,403 (0.61%)
);: 14,135 (0.59%)
ll: 13,701 (0.58%)
ge: 13,633 (0.57%)
ac: 13,581 (0.57%)
ve: 13,572 (0.57%)
ne: 13,540 (0.57%)
om: 13,052 (0.55%)
hi: 12,700 (0.53%)
va: 12,626 (0.53%)
ea: 12,228 (0.51%)
ch: 12,002 (0.50%)
un: 11,703 (0.49%)
ue: 11,642 (0.49%)
Trigrams by frequency (top 50 of 29,632):
-----------------------------------------
ent: 18,082 (0.90%)
the: 16,349 (0.82%)
ion: 14,787 (0.74%)
tio: 13,769 (0.69%)
ing: 12,159 (0.61%)
ect: 9,165 (0.46%)
ate: 8,875 (0.44%)
var: 8,693 (0.43%)
cti: 7,903 (0.39%)
thi: 7,878 (0.39%)
men: 7,421 (0.37%)
rea: 7,403 (0.37%)
for: 7,285 (0.36%)
all: 7,010 (0.35%)
his: 6,991 (0.35%)
com: 6,853 (0.34%)
con: 6,826 (0.34%)
pro: 6,445 (0.32%)
act: 6,443 (0.32%)
que: 6,428 (0.32%)
ode: 6,227 (0.31%)
dat: 6,009 (0.30%)
and: 5,933 (0.30%)
sta: 5,299 (0.26%)
pec: 5,266 (0.26%)
eve: 5,265 (0.26%)
ame: 5,231 (0.26%)
est: 5,183 (0.26%)
ter: 5,176 (0.26%)
ati: 5,074 (0.25%)
equ: 5,022 (0.25%)
fun: 4,909 (0.24%)
unc: 4,852 (0.24%)
ren: 4,851 (0.24%)
nct: 4,814 (0.24%)
tor: 4,746 (0.24%)
ele: 4,666 (0.23%)
nod: 4,644 (0.23%)
nde: 4,619 (0.23%)
get: 4,591 (0.23%)
exp: 4,534 (0.23%)
res: 4,530 (0.23%)
tur: 4,522 (0.23%)
nam: 4,471 (0.22%)
use: 4,447 (0.22%)
eac: 4,231 (0.21%)
der: 4,222 (0.21%)
cal: 4,185 (0.21%)
ret: 4,182 (0.21%)
end: 4,160 (0.21%)
Unigrams frequency overview:
----------------------------
etranoislcduphmfg.y()vbw',;=k/:{}x_*q-">`01j[]23+<#|4@5z&!9867?\$%^~
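
The report does not say how yak-layout tallies its n-grams. The percentages appear to be relative to the number of n-grams counted rather than to the 7,198,132-byte corpus length: 319,790 occurrences of "e" at 11.37% implies roughly 2.8 million counted characters. A minimal sketch of one way to produce a report in this shape follows. It assumes n-grams are counted within whitespace-delimited runs and leaves case untouched; both are guesses, and the names `ngram_counts`, `format_report`, and `overview` are hypothetical, not part of yak-layout.

```python
from collections import Counter


def ngram_counts(text, n):
    """Count n-grams of length n inside each whitespace-delimited run.

    Assumption: n-grams never span whitespace. The real tool may differ.
    """
    counts = Counter()
    for token in text.split():
        for i in range(len(token) - n + 1):
            counts[token[i:i + n]] += 1
    return counts


def format_report(counts, top=None):
    """Render lines shaped like 'e: 319,790 (11.37%)', most frequent first."""
    total = sum(counts.values())
    return [
        f"{gram}: {count:,} ({count / total:.2%})"
        for gram, count in counts.most_common(top)
    ]


def overview(counts):
    """Concatenate unigrams in descending frequency, like the last report line."""
    return "".join(gram for gram, _ in counts.most_common())
```

For example, `format_report(ngram_counts("var a = b;", 1), top=1)` yields `["a: 2 (28.57%)"]`, since "a" accounts for 2 of the 7 non-whitespace characters.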