Hyunjoong Kim lovit
lovit / huggingface_konlpy_usage.md
Created August 27, 2020 22:29
Huggingface tokenizers / transformers + KoNLPy.md
import huggingface_konlpy

KoNLPy as pre-tokenizer

lovit / huggingface_tokenizers_usage.md
Created August 27, 2020 22:28
Hugging Face tokenizers usage
import tokenizers
tokenizers.__version__
lovit / huggingface_konlpy.md
Last active January 8, 2024 20:43
huggingface + KoNLPy

Huggingface

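  • Hugging Face provides a range of NLP packages; three of them are particularly useful for training language models:

| package | note |
| --- | --- |
| transformers | Provides Transformer-based (masked) language model algorithms and pretrained models. |
| tokenizers | Provides functionality for training and using tokenizers that work with transformers. Distributed as a package separate from transformers. |
| nlp | Provides datasets and evaluation metrics. |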
lovit / umap_supervised_embedding.md
Created December 10, 2019 14:27
UMAP supervised embedding example
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=50000, n_features=200, n_informative=5, 
    n_redundant=0, n_clusters_per_class=10, weights=[0.80],
    flip_y=0.05, class_sep=3.5, random_state=42
)

# standard normalization: (x - mean) / std
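The preview is cut off at the normalization comment, so the gist's own implementation is not shown here. As a minimal, dependency-free sketch of the formula in that comment, assuming column-wise statistics and a population standard deviation (scikit-learn's `StandardScaler` applies the same formula), the helper `standardize` and the toy data `X_small` below are hypothetical names introduced only for illustration:

```python
from statistics import fmean, pstdev


def standardize(rows):
    """Column-wise (x - mean) / std for a list of equal-length numeric rows."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    # Population std (ddof=0). Constant columns fall back to 1.0
    # so that the division below never hits zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [
        [(x - m) / s for x, m, s in zip(row, means, stds)]
        for row in rows
    ]


# Toy data (hypothetical): the second column is constant on purpose.
X_small = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]]
Z = standardize(X_small)
```

After this transform every non-constant column has mean 0 and standard deviation 1, which keeps features on a comparable scale before a distance-based embedding such as UMAP.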
lovit / scraped_news_corpus_statistics.md
Created September 16, 2019 07:38
Scraped news corpus statistics

Corpus summary

  • begin date = 2014-01-01
  • end date = 2019-08-16
  • num docs = 51016505
  • num sents = 413471083
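A few derived figures follow directly from the summary above. The sketch below only does arithmetic on the four listed values; counting both the begin and end dates as covered days is an assumption, and the variable names are illustrative.

```python
from datetime import date

# Values copied from the corpus summary.
begin, end = date(2014, 1, 1), date(2019, 8, 16)
num_docs, num_sents = 51_016_505, 413_471_083

# Assumption: both endpoints are included in the scraped period.
num_days = (end - begin).days + 1
sents_per_doc = num_sents / num_docs
docs_per_day = num_docs / num_days

print(f"{num_days} days, {sents_per_doc:.2f} sents/doc, {docs_per_day:.0f} docs/day")
```

Under that assumption the corpus spans 2,054 days, with roughly 8 sentences per document on average.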

Yearly summary

| year | begin date | end date | num docs | num sents |
lovit / soynlp_noun_tokenizer_usage.md
Created April 30, 2019 05:32
soynlp Noun Tokenizer usage

In the current version (0.0.491) the code has not been cleaned up yet, so the argument names of the init function may change.

This tutorial is an example that uses the dataset from github.com/lovit/textmining-dataset.

import soynlp
from soynlp.utils import DoublespaceLineCorpus
from soynlp.noun import LRNounExtractor_v2
from lovit_textmining_dataset.navernews_10days import get_news_paths
lovit / bokeh_show_image.py
Last active April 9, 2019 17:24
Bokeh (1.0.4) image show example
# replace matplotlib.pyplot.imshow(img)
import numpy as np
from bokeh.plotting import figure, show, output_notebook
output_notebook()
N = 500
x = np.linspace(0, 10, N)

tmux cheatsheet

As configured in my dotfiles.

start new:

tmux

start new with session name: