@shihono
shihono / jdepp_make_install.ipynb
Created November 19, 2023 11:07
jdepp_make_install.ipynb
@shihono
shihono / ndiff_fulwidth.py
Created August 31, 2023 10:55
Make Python's difflib.ndiff handle full-width characters
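import difflib
import unicodedata


def get_char_width_list(text):
    """Return the display width of each character in text as a list.

    Half-width characters count as 1, full-width characters as 2.
    """
    result = []
    for c in text:
        if unicodedata.east_asian_width(c) in ["F", "W"]:
            # The preview is cut off here; the rest is inferred from the docstring.
            result.append(2)
        else:
            result.append(1)
    return result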
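The preview above only shows the width helper, so the following is a minimal sketch of one way per-character widths could be used to realign the `? ` hint lines that `difflib.ndiff` emits. It is not necessarily what the gist does. The names `char_width`, `widen_hint`, and `ndiff_fullwidth` are hypothetical, and the sketch assumes each hint marker sits under exactly one character of the line it annotates.

```python
import difflib
import unicodedata


def char_width(c):
    """Return 2 for full-width (F) or wide (W) characters, else 1."""
    return 2 if unicodedata.east_asian_width(c) in ("F", "W") else 1


def widen_hint(text_line, hint_line):
    """Stretch a '? ' hint line so its markers span full-width characters.

    Assumption: hint_line[2:] is aligned character-by-character with
    text_line[2:], which is how ndiff lays out its hint lines.
    """
    body = text_line[2:].rstrip("\n")
    marks = hint_line[2:].rstrip("\n")
    # Repeat each marker once per display column of the character above it.
    widened = "".join(mark * char_width(ch) for ch, mark in zip(body, marks))
    return "? " + widened + "\n"


def ndiff_fullwidth(a, b):
    """Yield ndiff lines, widening each hint line against the line before it."""
    previous = ""
    for line in difflib.ndiff(a, b):
        if line.startswith("? "):
            yield widen_hint(previous, line)
        else:
            previous = line
            yield line
```

Called as `"".join(ndiff_fullwidth(old_lines, new_lines))`, this leaves the `- `, `+ `, and unchanged lines as they are and only rewrites the hint lines, so the output stays in ndiff's usual shape.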
@shihono
shihono / japanese_lm.ipynb
Created June 12, 2022 23:38
japanese_lm.ipynb
@shihono
shihono / pytorch_ngram_lm.ipynb
Created May 1, 2022 08:19
pytorch_ngram_lm.ipynb
@shihono
shihono / kneser_ney_smoothing.ipynb
Created April 17, 2022 05:47
kneser_ney_smoothing.ipynb
@shihono
shihono / nltk_lm_examples.ipynb
Created April 5, 2022 00:15
nltk_lm_examples.ipynb
@shihono
shihono / torchtext_pentreebank_gt.ipynb
Created February 28, 2022 00:19
torchtext_pentreebank_gt.ipynb
import lzma
import sys
import glob


def load_xz_ngram_file(n=2):
    ngram_dict = {}
    ngram_freq = {}
    files = glob.glob("/path/to/nwc2010-ngrams/word/over999/{}gms/*.xz".format(n))
    for file in files:
        print(file)
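The preview stops before any file is read, so what follows is only a sketch of how one `.xz` n-gram file might be parsed with `lzma.open`. It assumes UTF-8 text with one `w1 w2 ...<TAB>frequency` record per line. Neither assumption is confirmed by the preview. The names `parse_ngram_lines` and `read_xz_ngrams` are hypothetical, and the single dictionary here is not meant to mirror the original `ngram_dict` / `ngram_freq` pair.

```python
import lzma


def parse_ngram_lines(lines):
    """Map each n-gram (as a tuple of words) to its frequency.

    Assumption: every record looks like 'w1 w2 ...<TAB>count'.
    """
    ngram_freq = {}
    for line in lines:
        line = line.rstrip("\n")
        ngram, sep, freq = line.rpartition("\t")
        if not sep:
            # Skip blank or malformed lines that contain no tab.
            continue
        ngram_freq[tuple(ngram.split(" "))] = int(freq)
    return ngram_freq


def read_xz_ngrams(path_or_file):
    """Decompress one .xz file in text mode and parse it line by line."""
    with lzma.open(path_or_file, "rt", encoding="utf-8") as f:
        return parse_ngram_lines(f)
```

Reading in text mode streams the file line by line, so a large n-gram file never has to be decompressed into memory all at once.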

from sklearn.base import BaseEstimator
import MeCab


class OwakatiTokenizer(BaseEstimator):
    def __init__(self, dicdir=None, rcfile=None):
        """
        :param dicdir: system dicdir `-d`
        :param rcfile: resource file `-r`
        """
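The constructor body is not shown, so this is a hedged sketch of how the two parameters could be turned into a MeCab argument string for wakati (space-separated) output. `build_wakati_args` and `wakati_tokenize` are hypothetical helpers, not part of the original class. The sketch assumes neither path contains spaces and that the `mecab-python3` package is installed when `wakati_tokenize` is called.

```python
def build_wakati_args(dicdir=None, rcfile=None):
    """Build a MeCab option string: wakati output plus optional -d / -r."""
    args = ["-Owakati"]
    if dicdir:
        args += ["-d", dicdir]  # system dictionary directory
    if rcfile:
        args += ["-r", rcfile]  # resource file
    return " ".join(args)


def wakati_tokenize(text, dicdir=None, rcfile=None):
    """Split text into surface forms using MeCab's wakati output."""
    import MeCab  # imported lazily so the helper above works without MeCab

    tagger = MeCab.Tagger(build_wakati_args(dicdir, rcfile))
    # With -Owakati, parse() returns tokens separated by spaces.
    return tagger.parse(text).split()
```

Keeping the argument construction separate from the tagger makes it possible to check the option string without a MeCab installation.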