@IlnarSelimcan
Created September 20, 2019 03:11
nog: commit 6f65e512b45e04ef9f177ea8e1adf6ba26cb648e
stems: 1367
bible coverage
Number of tokenised words in the corpus: 189329
Coverage: 81.88%
Top unknown words in the corpus:
343 Масих
341 а
306 Раббий
233 Кие
230 иман
194 аркалы
189 баьриси
176 Масихтинъ
148 Петер
146 Паул
143 А
128 дува
127 Раббийдинъ
123 Рух
120 солай
116 оьким
105 баьрисин
85 сокталары
85 болынъыз
83 Масихке
Translation time: 1.595435380935669 seconds
bible corpus size (tokens): 138010 ../../../data4apertium/corpora/bible/nog.txt
sah: commit 46b66f6f3e90a766d13647d00e1a6bcf03f1b25e
stems: 9505
bible coverage
Error: Malformed input stream.
Number of tokenised words in the corpus: 1107
Coverage: 91.06%
Top unknown words in the corpus:
5 Апостоллар
3 Евангелие
2 Спасскай
2 миссионердар
2 Аланд
1 Annotation
1 НОВЫЙ
1 ЗАВЕТ
1 якутском
1 языке
1 in
1 c
1 эҕэрдэлиибин
1 Тэнгри
1 чочуобунаны
1 сүрэхтэнэр
1 бэргэһэлэнэр
1 Дежнев
1 Абакайааданы
1 бэргэһэлээбит
Translation time: 0.040879249572753906 seconds
bible corpus size (tokens): 146801 ../../../data4apertium/corpora/bible/sah.txt
wikipedia coverage
Error: Malformed input stream.
Number of tokenised words in the corpus: 3672
Coverage: 90.36%
Top unknown words in the corpus:
5 Орест
4 якутского
4 языка
3 Вейсенбург
2 Ромулус
2 киириитиэр
2 саарар
2 фон
2 монастырыгар
2 рукопись
2 этилэр
2 биллибитэ
2 К
2 Пеллерин
2 Кустуктуурап
2 буоланнар
2 Максимовы
2 Маҥаачыйа
2 Поликарпов
2 сорохторун
Translation time: 0.08163261413574219 seconds
wikipedia corpus size (tokens): 5082510 wiki.txt
chv: commit 16c6cacbb54cd238566d5da2b7a807085bb9d6cd
stems: 62530
bible coverage
Number of tokenised words in the corpus: 196268
Coverage: 94.08%
Top unknown words in the corpus:
1432 Иисус
272 Эй
248 Иисуса
136 Святой
125 Иоанн
86 Моисей
86 пӗтӗмпех
75 кирек
67 тӳрре
61 шыва
61 Симон
60 Пилат
56 Давид
54 тунине
53 Павела
52 Ирод
50 самантрах
49 ҫавнашкалах
48 Аминь
48 пулнӑран
Translation time: 3.941494941711426 seconds
bible corpus size (tokens): 133632 ../../../data4apertium/corpora/bible/chv.txt
wikipedia coverage
Error: Malformed input stream.
Number of tokenised words in the corpus: 686
Coverage: 92.57%
Top unknown words in the corpus:
2 Хуть
2 тĕрлĕрен
2 идиомсем
2 тĕпченин
2 каяп
1 мĕшĕнче
1 нимпех
1 уйралса
1 талккăшпех
1 ыттисенчен
1 идиомсенчен
1 паллăраххисем
1 этнографилле
1 ыттисемшĕн
1 радиокăларăмсемпе
1 телепередачăсем
1 тулăшĕнче
1 кодлăхĕсем
1 кодлăхĕсене
1 Ăнлантаркăч
Translation time: 0.07455682754516602 seconds
wikipedia corpus size (tokens): 295582 wiki.txt
kum: commit 162d6e69a4e860d4057489f2fcd97105ce99d9d9
stems: 4949
bible coverage
Number of tokenised words in the corpus: 207468
Coverage: 93.33%
Top unknown words in the corpus:
191 Месигьни
144 оьзлени
90 ягьудилени
90 ягьудилер
79 Месигьге
79 Я
76 таби
69 Месигьден
62 каламын
62 чакъы
59 оьзлеге
58 сужда
57 Шолайлыкъда
55 инкар
55 Устаз
55 Къанунну
54 эсе
50 я
49 сюннет
45 ягьуди
Translation time: 2.8214516639709473 seconds
bible corpus size (tokens): 153845 ../../../data4apertium/corpora/bible/kum.txt
kaa: commit 85249552c43627efee4c754ccc7954ac4c8a953e
stems: 28474
bible coverage
Number of tokenised words in the corpus: 190814
Coverage: 93.79%
Top unknown words in the corpus:
520 Muxaddes
408 ytkeni
353 Masix
346 A
214 z
183 Masixtıń
172 ǵo
161 Petr
148 Pavel
142 atırǵan
139 zi
108 Háy
106 muxaddes
105 Ruwx
85 atanaq
81 Masixqa
75 haq
75 ziniń
68 bolǵanlıqtan
65 Erusalimge
Translation time: 4.4775168895721436 seconds
bible corpus size (tokens): 145429 ../../../data4apertium/corpora/bible/kaa.txt
wikipedia coverage
Error: Malformed input stream.
Number of tokenised words in the corpus: 4
Coverage: 100.00%
Top unknown words in the corpus:
Translation time: 0.016776323318481445 seconds
wikipedia corpus size (tokens): 337430 wiki.txt
tuk: commit c1493edb237396bcc0432743c9ca16b6439fc541
stems: 2986
bible coverage
Number of tokenised words in the corpus: 598585
Coverage: 70.38%
Top unknown words in the corpus:
3337 Reb
2415 Rebbiň
2359 Ol
2228 Men
2132 ol
1424 olar
1366 Olar
1309 olaryň
1070 oňa
1056 Sen
1053 Rebbe
1048 men
1041 olary
923 meniň
850 Meniň
828 seniň
782 maňa
772 siz
768 çünki
761 Eý
Translation time: 5.525491237640381 seconds
bible corpus size (tokens): 401307 ../../../data4apertium/corpora/bible/tuk.txt
wikipedia coverage
Error: Malformed input stream.
Number of tokenised words in the corpus: 129
Coverage: 82.17%
Top unknown words in the corpus:
2 inženerligi
1 Sahypa
1 Kimýä
1 ähtimal
1 şire
1 çyg
1 tagam
1 splawy
1 guýmak
1 garylmak
1 tebigat
1 iň
1 wajyp
1 olaryň
1 üýtgeýişleri
1 üýtgeýişleriň
1 tabyn
1 baradaky
1 algoritm
1 giňişleýin
Translation time: 0.010648488998413086 seconds
wikipedia corpus size (tokens): 2021374 wiki.txt
bak: commit 2cb89f7bc78526f47da6c6de1b1420475ca85b58
stems: 56463
bible coverage
Number of tokenised words in the corpus: 197315
Coverage: 94.29%
Top unknown words in the corpus:
144 Һөйөнөслө
114 китте
94 имандаштар
75 үҙҙәре
71 шундай
63 алдына
58 ҡыуып
54 арҡысаҡҡа
53 киткән
52 фарисейҙар
51 бөтөнөһө
50 ҡисса
49 дусар
49 өҫтөнән
48 Ирод
48 береһенә
48 Имандаштар
46 Йәһүҙә
46 халҡы
45 Яҡуб
Translation time: 4.478304386138916 seconds
bible corpus size (tokens): 145707 ../../../data4apertium/corpora/bible/bak.txt
wikipedia coverage
Error: Malformed input stream.
Number of tokenised words in the corpus: 474
Coverage: 96.20%
Top unknown words in the corpus:
4 власы
2 ө
2 н
1 ноябрендә
1 октябрендә
1 суверенитеты
1 февраленән
1 ТӨРКСОЙ
1 Халҡы
1 ын
1 тауҙарының
1 Ямантау
1 мәмерйәләре
Translation time: 0.0639498233795166 seconds
wikipedia corpus size (tokens): 17443832 wiki.txt
kaz: commit 04f31c3d337e1fa69420b6ffbcab7cc826490032
stems: 37801
bible coverage
Number of tokenised words in the corpus: 210008
Coverage: 98.09%
Top unknown words in the corpus:
66 яһудилер
65 парызшылдар
63 немесе
50 Пилат
49 Жохан
41 Яһудея
36 яһудилердің
32 Ғалилея
29 Қорынттықтарга
27 Лұқа
24 Марқа
23 Яһудилердің
23 Філіп
23 дұғай
20 Менмін
20 Барнаба
19 тағзым
17 Ыбырайымға
17 Тоқтының
16 Тімоте
Translation time: 9.840543031692505 seconds
bible corpus size (tokens): 151631 ../../../data4apertium/corpora/bible/kaz.txt
wikipedia coverage
Error: Malformed input stream.
Number of tokenised words in the corpus: 2889
Coverage: 96.16%
Top unknown words in the corpus:
3 оңт
2 тайпалық
2 жоспарлық
2 Беловеж
2 респ
2 жоғ
2 өкілеттігі
2 мореналық
2 тен
2 ке
2 сағ
1 Бaтысында
1 төмeнгі
1 Мұхитқа
1 жəне
1 aлуан
1 Хaлықтың
1 православты
1 номинал
1 Стан
Translation time: 0.38750314712524414 seconds
wikipedia corpus size (tokens): 33782767 wiki.txt
tur: commit 1e6e3b4d3fce24e0aa18342051dc6fe8533da679
stems: 22652
bible coverage
Number of tokenised words in the corpus: 481376
Coverage: 93.90%
Top unknown words in the corpus:
829 nın
809 ın
627 a
407 na
289 ı
268 i
267 nun
203 dan
187 Kâhin
179 ndan
174 yı
161 nı
159 Irmağı
154 Filistliler
152 Efrayim
151 nin
142 Yoav
137 Manaşşe
127 Moav
122 nde
Translation time: 11.723340511322021 seconds
bible corpus size (tokens): 309293 ../../../data4apertium/corpora/bible/tur.txt
wikipedia coverage
Error: Malformed input stream.
Number of tokenised words in the corpus: 16711
Coverage: 89.29%
Top unknown words in the corpus:
134 ın
124 Temuçin
40 nin
38 ı
38 a
36 i
34 Camuka
33 Jin
33 nın
29 Cuci
27 Harezmşah
22 Cebe
17 Mukhulai
17 Subutay
16 Yesügey
14 Börte
14 Alaeddin
13 Höelin
13 Suphi
12 Şira
Translation time: 0.6075348854064941 seconds
wikipedia corpus size (tokens): 54337641 wiki.txt
tat: commit 3c854811ec1251b005529a2da9df8a2d81b93680
stems: 59755
bible coverage
Number of tokenised words in the corpus: 196538
Coverage: 98.61%
Top unknown words in the corpus:
52 фарисейләр
30 кайберләре
29 Corinthians
23 Фарисейләр
22 Revelation
21 1st
20 Петернең
17 кайберәүләр
17 Тимуте
16 2nd
15 Антиухеягә
14 Һанани
14 Әгрип
13 Яһүдиядә
13 кинаяле
13 саддукейлар
13 Петергә
13 Леви
13 Көрнили
13 Фисте
Translation time: 7.186840295791626 seconds
bible corpus size (tokens): 144953 ../../../data4apertium/corpora/bible/tat.txt
wikipedia coverage
Error: Malformed input stream.
Number of tokenised words in the corpus: 26
Coverage: 69.23%
Top unknown words in the corpus:
1 Tatarça
1 İnternet
1 tulısınça
1 İnternetı
1 Respublikası
1 däwlät
1 tellärendä
1 tatar
Translation time: 0.12969040870666504 seconds
wikipedia corpus size (tokens): 6884329 wiki.txt
gag: commit 3c6bf03fcdcb84bf35831d6e5ebc39f28e74087c
stems: 6470
bible coverage
Number of tokenised words in the corpus: 1
Coverage: 100.00%
Top unknown words in the corpus:
Translation time: 0.027973413467407227 seconds
bible corpus size (tokens):
wikipedia coverage
Number of tokenised words in the corpus: 478
Coverage: 93.72%
Top unknown words in the corpus:
2 Aarı
2 mikrotemaları
2 abzaț
1 notoc
1 noeditsection
1 Ağrı
1 viridis
1 gruz
1 აფხაზეთი
1 kismi
1 Topraaın
1 sunnü
1 mikrotema
1 abzațın
1 ercääz
1 başlıkları
1 Rhodeus
1 sericeus
1 akarlarda
1 göllerde
Translation time: 0.03628849983215332 seconds
wikipedia corpus size (tokens): 123741 wiki.txt
uzb: commit b5f2b1242b784271ed4c30acee83048e97e109bc
stems: 36684
bible coverage
Number of tokenised words in the corpus: 198551
Coverage: 95.11%
Top unknown words in the corpus:
185 ko
71 g
61 so
48 cho
41 Kimki
36 lasizlar
31 bilasizlar
30 emasmi
28 qiladigan
28 Acts
27 go
26 lur
26 Shoul
24 emasman
23 vahiy
23 III
22 Isha
22 yozilganidek
22 Revelation
21 qilasizlar
Translation time: 2.853121280670166 seconds
bible corpus size (tokens): 131151 ../../../data4apertium/corpora/bible/uzb.txt
wikipedia coverage
Error: Malformed input stream.
Number of tokenised words in the corpus: 4163
Coverage: 90.37%
Top unknown words in the corpus:
8 sr
7 dan
6 srlarda
5 Tuproqqalʼa
4 2ming
4 Herirud
3 MDH
3 shyo
3 reytingida
3 GFP
3 Tajan
3 3ming
3 xorazmiylar
3 Kot
2 Global
2 vertolyot
2 reytingga
2 Murgʻob
2 Gekatey
2 Gerodot
Translation time: 0.07757258415222168 seconds
wikipedia corpus size (tokens): 9183827 wiki.txt
crh: commit e64d105662e9f4776368fbfdf47de37635e557bf
stems: 13631
bible coverage
Number of tokenised words in the corpus: 118941
Coverage: 37.10%
Top unknown words in the corpus:
1566 ве
1022 Иса
910 бир
880 деди
729 ичюн
677 исе
624 эди
578 да
556 деп
505 Мен
501 бу
464 де
420 адам
352 оларгъа
339 не
336 Онынъ
319 Алланынъ
315 сонъ
296 Бу
285 оны
Translation time: 0.3830137252807617 seconds
bible corpus size (tokens): 82456 ../../../data4apertium/corpora/bible/crh.txt
wikipedia coverage
Number of tokenised words in the corpus: 5006
Coverage: 92.75%
Top unknown words in the corpus:
12 Amdi
10 Abibulla
8 Odabaş
5 Giraybay
5 nemse
4 Ablây
4 افغانستان
4 Afġānistān
3 ci
3 Abeşistan
3 Aluston
2 ac
2 ae
2 af
2 ag
2 ai
2 am
2 ao
2 Antarktidanıñ
2 au
Translation time: 0.2028791904449463 seconds
wikipedia corpus size (tokens): 173704 wiki.txt
kir: commit caec6e6e4bd6e33be07b36820bd06ca349423497
stems: 15886
bible coverage
Number of tokenised words in the corpus: 201319
Coverage: 94.95%
Top unknown words in the corpus:
946 Кудай
597 Кудайдын
305 Кудайга
94 Кудайды
90 Кудайдан
58 Ысман
43 Ыбрайым
38 Жүйүт
38 расмисинен
29 Corinthians
28 Барнап
28 Шабыл
27 чөмүлдүрүү
24 Кудайыбыз
23 Ыбрайымдын
23 чөмүлүү
22 Revelation
21 аян
21 алышпады
21 1st
Translation time: 6.5446202754974365 seconds
bible corpus size (tokens): 148445 ../../../data4apertium/corpora/bible/kir.txt
wikipedia coverage
Error: Malformed input stream.
Number of tokenised words in the corpus: 9198
Coverage: 91.01%
Top unknown words in the corpus:
15 С
10 К
7 комплекстик
7 Н
6 закон
6 Кыргызжер
5 П
5 В
5 масштабдагы
5 Д
5 Памир
5 каганаты
4 Ч
4 Гумбольдт
4 Риттер
4 чарбалык
4 Тескей
4 ч
4 Арсланбаб
4 жаңгак
Translation time: 0.3299267292022705 seconds
wikipedia corpus size (tokens): 8321420 wiki.txt
tyv: commit f89afc383b03e30638da11fb296d93d7da2f7a0e
stems: 11845
bible coverage
Number of tokenised words in the corpus: 219959
Coverage: 96.11%
Top unknown words in the corpus:
82 Бижилгеде
62 бараалгакчылары
49 бараалгакчызы
45 угаадыглыг
45 шыдажып
41 израиль
41 экиртип
37 бузуттуг
34 согур
33 эккеп
32 эрлик
32 дирлип
32 хевин
29 Аминь
29 Варнава
29 Corinthians
28 шыдамык
26 доңгая
26 алгыржып
26 соп
Translation time: 3.4303221702575684 seconds
bible corpus size (tokens): 156126 ../../../data4apertium/corpora/bible/tyv.txt
wikipedia coverage
Number of tokenised words in the corpus: 413
Coverage: 94.19%
Top unknown words in the corpus:
4 quot
2 калмык
1 х
1 талакы
1 глобус
1 шиштейин
1 Бажин
1 субурган
1 топограф
1 Каррутерс
1 Михаилның
1 мрамор
1 силбип
1 үндүсүн
1 тоолдап
1 Гэсэр
1 эпостуң
1 1990чч
1 ля
1 минор
Translation time: 0.0222933292388916 seconds
wikipedia corpus size (tokens): 337589 wiki.txt
uig: commit 7ce96d726371b41ac371745ae5267fa246557144
stems: 25385
bible coverage
Number of tokenised words in the corpus: 1
Coverage: 100.00%
Top unknown words in the corpus:
Translation time: 0.062395572662353516 seconds
bible corpus size (tokens):
wikipedia coverage
Error: Malformed input stream.
Number of tokenised words in the corpus: 203
Coverage: 86.21%
Top unknown words in the corpus:
3 كومپۇتەر
1 ۋىكىپىدىيە
1 ۋىكىپېدىيەنىڭ
1 نۇسخاسىغا
1 نەۋرۇز
1 بايرامى
1 بیلگیسايار
1 مۈھەندیسلیغی
1 ھەلقی
1 نھايیتی
1 ۋەئی
1 شلیتیشیمیزگە
1 ھیتاي
1 كیشیلیریمیزئو
1 قۇۋاتقان
1 يیتیشیۋاتقان
1 مما
1 بیزنیڭئو
1 مۇمی
1 سەلیشتۇرغاندا
Translation time: 0.0732121467590332 seconds
wikipedia corpus size (tokens): 1791416 wiki.txt
aze: commit da572614b8d54f1caef8d0eabe11ca2e669edf42
stems: 11583
bible coverage
Number of tokenised words in the corpus: 753060
Coverage: 56.79%
Top unknown words in the corpus:
12538 və
3642 Rəbb
3569 də
2634 ilə
2451 Rəbbin
2397 görə
2021 idi
1971 hər
1834 Çünki
1813 oğlu
1729 Mən
1624 isə
1347 etdi
1277 qədər
1265 İsa
1207 Allah
1151 çünki
1136 Allahın
1080 Ey
940 yanına
Translation time: 6.365848779678345 seconds
bible corpus size (tokens): 534925 ../../../data4apertium/corpora/bible/aze.txt
wikipedia coverage
Number of tokenised words in the corpus: 1
Coverage: 100.00%
Top unknown words in the corpus:
Translation time: 0.014212846755981445 seconds
wikipedia corpus size (tokens): 0 wiki.txt
kjh: commit b624f2e589f1421d15c962506943abd360894742
stems: 710
bible coverage
Number of tokenised words in the corpus: 175448
Coverage: 47.63%
Top unknown words in the corpus:
1926 паза
1210 Иисус
1085 даа
979 тізең
957 тіп
851 дее
720 ниме
681 ӱчӱн
659 прай
625 Хан
607 Че
577 теен
525 че
497 хада
436 нимес
426 парған
342 киліп
334 нооза
329 Аннаңар
321 ағаа
Translation time: 0.7867178916931152 seconds
bible corpus size (tokens): 137272 ../../../data4apertium/corpora/bible/kjh.txt
krc: commit e9f9e9c1406aae02a973147cce1b1f49e21b282c
stems: 8551
bible coverage
Number of tokenised words in the corpus: 193680
Coverage: 85.41%
Top unknown words in the corpus:
802 Исса
444 Иссаны
416 Масих
313 Раббий
287 юсюнден
242 Кесини
217 кесини
168 санга
156 Раббийни
148 муну
144 Масихни
123 Иссагъа
121 жууапха
119 Муну
115 Пауул
113 жууап
110 этигиз
109 махтау
96 Нюр
96 Раббийибиз
Translation time: 1.7533786296844482 seconds
bible corpus size (tokens): 142337 ../../../data4apertium/corpora/bible/krc.txt
wikipedia coverage
Number of tokenised words in the corpus: 1
Coverage: 100.00%
Top unknown words in the corpus:
Translation time: 0.01462864875793457 seconds
wikipedia corpus size (tokens): 0 wiki.txt
ota: commit 788a52c454b3e6e815df89e7b09069da16a40a44
stems: 77
bible coverage
Number of tokenised words in the corpus: 1
Coverage: 100.00%
Top unknown words in the corpus:
Translation time: 0.0038390159606933594 seconds
bible corpus size (tokens):
alt: commit 9fd53d1efb6e6556848ccb6695d7d7a597164f73
stems: 182
bible coverage
Number of tokenised words in the corpus: 194244
Coverage: 61.89%
Top unknown words in the corpus:
354 ончо
335 1
313 12
302 13
300 9
293 14
290 8
289 6
288 11
286 2
284 10
283 3
281 4
280 5
272 15
272 Кайракан
268 7
261 ажыра
258 18
256 17
Translation time: 0.8320250511169434 seconds
bible corpus size (tokens): 133151 ../../../data4apertium/corpora/bible/alt.txt
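
To compare languages at a glance, the log above can be reduced to one row per language and corpus. The following is a minimal sketch, not part of the original run: the function name `parse_coverage_log` and the regular expressions are assumptions derived only from the line shapes visible in this log. It tolerates lines where an error message or the previous field is fused onto the next one.

```python
import re

# Patterns inferred from the log lines above (assumptions, not an official format).
LANG_RE = re.compile(r"^(\w{3}): commit ([0-9a-f]+)$")
CORPUS_RE = re.compile(r"(bible|wikipedia) coverage$")
TOKENS_RE = re.compile(r"Number of tokenised words in the corpus: (\d+)")
COVERAGE_RE = re.compile(r"^Coverage: ([\d.]+)%$")


def parse_coverage_log(text):
    """Return one dict per (language, corpus) block found in the log text."""
    rows = []
    lang = corpus = tokens = None
    for line in text.splitlines():
        line = line.strip()
        if m := LANG_RE.match(line):
            # A new language section starts; reset per-corpus state.
            lang, corpus, tokens = m.group(1), None, None
        elif m := CORPUS_RE.search(line):
            # search() also catches "...: wikipedia coverage" fused lines.
            corpus, tokens = m.group(1), None
        elif m := TOKENS_RE.search(line):
            # search() also catches "Error: ...stream.Number of ..." fused lines.
            tokens = int(m.group(1))
        elif (m := COVERAGE_RE.match(line)) and lang and corpus:
            rows.append({
                "lang": lang,
                "corpus": corpus,
                "tokens": tokens,
                "coverage": float(m.group(1)),
            })
    return rows


if __name__ == "__main__":
    sample = """nog: commit 6f65e512b45e04ef9f177ea8e1adf6ba26cb648e
stems: 1367
bible coverage
Number of tokenised words in the corpus: 189329
Coverage: 81.88%
"""
    for row in parse_coverage_log(sample):
        print(row["lang"], row["corpus"], row["tokens"], row["coverage"])
```

One possible use is to filter out rows whose tokenised word count is far below the reported corpus size (for example the runs preceded by "Malformed input stream"), since those coverage figures likely describe only a small part of the corpus.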