Kyle Gorman kylebgorman
@kylebgorman
kylebgorman / pairs.tsv
Created February 10, 2022 21:29
Katakana / English transliteration pairs, extracted from JMDict by Yuying Ren
フルーツサラダ fruits salad
クリッパーチップ clipper chip
ライフサイクル life cycle
ボイストレーニング voice training
オップアート op art
ノーズコーン nose cone
インカムタックス income tax
エグゼクティブフロア executive floor
ウェブフォーム web form
ハムサンド ham sand
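A minimal sketch of reading such pairs in Python. It assumes each line holds a katakana string and its English counterpart separated by a single tab (the preview above does not show the delimiter); `read_pairs` is a hypothetical helper name, not part of the gist.

```python
import csv
import io


def read_pairs(handle):
    """Yields (katakana, english) tuples from a tab-separated stream."""
    # Assumption: exactly two tab-separated columns per row.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 2:
            continue  # Skips blank or malformed rows.
        yield row[0], row[1]


# Two rows copied from the preview, joined with tabs for illustration.
sample = io.StringIO("ライフサイクル\tlife cycle\nノーズコーン\tnose cone\n")
pairs = list(read_pairs(sample))
print(pairs)
```

Disabling quote handling keeps any stray quotation marks in the English column from being interpreted as CSV quoting.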
kylebgorman / rubert-embedding.py
Last active June 14, 2021 20:51
Embedding with rubert.
#!/usr/bin/env python
# Documented in: https://metatext.io/models/DeepPavlov-rubert-base-cased
import transformers
model_name = "DeepPavlov/rubert-base-cased"
model = transformers.AutoModel.from_pretrained(model_name)
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
kylebgorman / asciify.pl
Created March 19, 2020 21:34
Converts from an English-like UTF-8 to ASCII
#!/usr/bin/perl
use strict;
use warnings;
use Unicode::Normalize;
use open ":encoding(utf8)";
binmode STDIN, ":encoding(utf8)";
binmode STDOUT, ":encoding(ascii)";
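The preview stops before the conversion logic, so the script's actual method is not shown. One common way to do this kind of conversion, sketched here in Python as an assumption rather than a port of the Perl, is to apply NFKD normalization and then discard whatever is still outside ASCII. The function name `asciify` simply mirrors the file name.

```python
import unicodedata


def asciify(text: str) -> str:
    """Strips diacritics and drops any remaining non-ASCII characters."""
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding with errors="ignore" drops everything outside ASCII,
    # including the combining marks produced above.
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(asciify("café naïve résumé"))
```

Note that this sketch deletes characters with no ASCII decomposition (for example curly quotes) instead of mapping them to look-alikes; a fuller tool would probably handle those with an explicit table.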
kylebgorman / 95to27.pl
Created March 19, 2020 21:33
Converts from printable ASCII to a 27-character vocabulary
#!/usr/bin/perl
use strict;
use warnings;
use open ":encoding(ascii)";
binmode STDIN, ":encoding(ascii)";
binmode STDOUT, ":encoding(ascii)";
binmode STDERR, ":encoding(ascii)";
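The preview does not say which 27 characters are kept. A plausible reading, and purely an assumption here, is the 26 lowercase letters plus the space character. Under that assumption, the mapping could be sketched in Python as follows; `to27` is a hypothetical name.

```python
import re


def to27(text: str) -> str:
    """Maps printable ASCII onto a-z plus space (assumed vocabulary)."""
    # Assumption: the 27 symbols are the 26 lowercase letters and space.
    lowered = text.lower()
    # Replaces each run of characters outside a-z with a single space,
    # then trims the spaces left at either end.
    return re.sub(r"[^a-z]+", " ", lowered).strip()


print(to27("Hello, World! 42"))
```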
kylebgorman / sgml2docs.py
Last active September 24, 2019 21:17
Splits Gigaword SGML documents into separate files
#!/usr/bin/env python
"""Extracts documents from the Gigaword SGML."""
import argparse
import logging
import os
import bs4
kylebgorman / LING78100-lecture02.ipynb
Created September 18, 2019 14:49
LING78100 Lecture 2
kylebgorman / lnre.py
Last active June 18, 2023 05:39
LNRE calculator
#!/usr/bin/env python
"""LNRE calculator.
This script computes a number of statistics characterizing LNRE data:
* N: corpus size
* V: vocabulary size
* V(1): the number of _hapax legomena_ (symbols occurring once)
* V(2): the number of _dis legomena_ (symbols occurring twice)
* V/N: vocabulary growth rate
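The statistics listed so far can be computed from a frequency table. The following is a minimal sketch, not the gist's own code: `lnre_stats` is a hypothetical helper, and it assumes the input is already a sequence of tokens.

```python
import collections


def lnre_stats(tokens):
    """Computes N, V, V(1), V(2), and V/N for a sequence of tokens."""
    freqs = collections.Counter(tokens)
    n = sum(freqs.values())  # N: corpus size.
    v = len(freqs)  # V: vocabulary size.
    # Frequency spectrum: maps a count m to the number of types seen m times.
    spectrum = collections.Counter(freqs.values())
    return {
        "N": n,
        "V": v,
        "V(1)": spectrum[1],
        "V(2)": spectrum[2],
        "V/N": v / n if n else 0.0,
    }


stats = lnre_stats("a b b c c c d".split())
print(stats)
```

In the toy input, `a` and `d` each occur once and `b` occurs twice, so V(1) is 2 and V(2) is 1.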
kylebgorman / byte.sym
Created July 10, 2019 12:43
OpenFst byte symbol table
<epsilon> 0
<SOH> 1
<STX> 2
<ETX> 3
<EOT> 4
<ENQ> 5
<ACK> 6
<BEL> 7
<BS> 8
<HT> 9
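A table of this shape can be generated programmatically. The sketch below reproduces the ten entries shown above, using `<epsilon>` for label 0 and the standard ASCII control-character abbreviations for labels 1 to 9. Everything beyond label 9 is an assumption, since the preview ends there: the names chosen for space, DEL, and bytes 128 to 255 are illustrative only. Writing each line as symbol, tab, label is likewise an assumption about the delimiter.

```python
# Standard ASCII abbreviations for control characters 0-31.
CONTROL_NAMES = [
    "NUL", "SOH", "STX", "ETX", "EOT", "ENQ", "ACK", "BEL",
    "BS", "HT", "LF", "VT", "FF", "CR", "SO", "SI",
    "DLE", "DC1", "DC2", "DC3", "DC4", "NAK", "SYN", "ETB",
    "CAN", "EM", "SUB", "ESC", "FS", "GS", "RS", "US",
]


def byte_symbols():
    """Yields (symbol, label) pairs for all 256 byte values."""
    for index in range(256):
        if index == 0:
            name = "<epsilon>"  # Label 0 is reserved for epsilon.
        elif index < 32:
            name = f"<{CONTROL_NAMES[index]}>"
        elif index == 32:
            name = "<SPACE>"  # Assumption: not shown in the preview.
        elif index < 127:
            name = chr(index)  # Assumption: printable bytes name themselves.
        elif index == 127:
            name = "<DEL>"  # Assumption: not shown in the preview.
        else:
            name = f"<0x{index:02x}>"  # Assumption: hex names for high bytes.
        yield name, index


lines = [f"{name}\t{index}" for name, index in byte_symbols()]
print("\n".join(lines[:10]))
```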
kylebgorman / casefold.py
Created July 10, 2019 12:15
Applies Unicode case folding to input data
#!/usr/bin/env python
import fileinput
import nltk
if __name__ == "__main__":
    for line in fileinput.input():
        print(line.rstrip().casefold())
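A short illustration of why case folding differs from plain lowercasing, using only built-in `str` methods. The example word is chosen for illustration and does not come from the gist.

```python
word = "Straße"

lowered = word.lower()  # Lowercasing keeps the sharp s.
folded = word.casefold()  # Case folding expands it to "ss".

# Folding both sides lets caseless comparison succeed where lower() fails.
matches_folded = "STRASSE".casefold() == folded
matches_lowered = "STRASSE".lower() == lowered

print(lowered, folded, matches_folded, matches_lowered)
```

Case folding is intended for caseless matching, so its output is not always what a reader would consider the lowercase spelling.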
kylebgorman / word_tokenize.py
Last active June 18, 2023 05:35
Applies NLTK PTB tokenizer to input text
#!/usr/bin/env python
import fileinput
import nltk
if __name__ == "__main__":
    for line in fileinput.input():
        print(" ".join(nltk.word_tokenize(line)))