Hyunjoong Kim lovit

## huggingface_konlpy_usage.md

      
Created August 27, 2020 22:29

Huggingface tokenizers / transformers + KoNLPy.md
          
    import huggingface_konlpy
KoNLPy as pre-tokenizer
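As a rough illustration of what "pre-tokenizer" means here: a morphological analyzer first splits raw text into morpheme-like units, and the subword tokenizer then works inside each unit, so no subword spans a morpheme boundary. The sketch below is dependency-free and is not taken from `huggingface_konlpy`; `toy_pre_tokenize`, the particle list, and the tiny vocabulary are all hypothetical stand-ins. A real setup would call a KoNLPy analyzer (for example `Okt().morphs(text)`) and use a trained WordPiece vocabulary.

```python
# Hypothetical particle list standing in for a real morphological analyzer.
PARTICLES = ("은", "는", "이", "가", "을", "를")


def toy_pre_tokenize(text):
    """Split on whitespace, then detach one trailing particle per word.

    Stand-in only: it marks where a KoNLPy analyzer would sit in the pipeline.
    """
    pieces = []
    for word in text.split():
        if len(word) > 1 and word.endswith(PARTICLES):
            pieces.extend([word[:-1], word[-1]])
        else:
            pieces.append(word)
    return pieces


def wordpiece(token, vocab, unk="[UNK]"):
    """Greedy longest-match-first subword split of a single pre-token."""
    pieces, start = [], 0
    while start < len(token):
        end = len(token)
        while end > start:
            # Non-initial pieces carry the conventional "##" prefix.
            piece = token[start:end] if start == 0 else "##" + token[start:end]
            if piece in vocab:
                break
            end -= 1
        if end == start:  # no vocabulary entry matched at this position
            return [unk]
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical vocabulary, just large enough for the example sentence.
vocab = {"토크", "##나이저", "를", "학습", "##한다"}

pre_tokens = toy_pre_tokenize("토크나이저를 학습한다")
subwords = [p for tok in pre_tokens for p in wordpiece(tok, vocab)]
print(pre_tokens)  # ['토크나이저', '를', '학습한다']
print(subwords)    # ['토크', '##나이저', '를', '학습', '##한다']
```

Because the particle `를` is split off before the subword step, it always becomes its own token instead of being merged into the preceding noun.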


## huggingface_tokenizers_usage.md

      
Created August 27, 2020 22:28

Hugging Face tokenizers usage
          
    import tokenizers
    tokenizers.__version__

  
## huggingface_konlpy.md

      
Last active January 8, 2024 20:43

huggingface + KoNLPy
          
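Huggingface

Hugging Face provides a variety of NLP packages; three of them are particularly useful for training language models.

| package | note |
| --- | --- |
| transformers | Provides Transformer-based (masked) language model algorithms and pretrained models |
| tokenizers | Provides functionality for training and using tokenizers that work with transformers; distributed as a package separate from transformers |
| nlp | Provides datasets and evaluation metrics |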


## umap_supervised_embedding.md

      
Created December 10, 2019 14:27

UMAP supervised embedding example
          
    from sklearn.datasets import make_classification

    X, y = make_classification(
        n_samples=50000, n_features=200, n_informative=5,
        n_redundant=0, n_clusters_per_class=10, weights=[0.80],
        flip_y=0.05, class_sep=3.5, random_state=42
    )

    # standard normalization: (x - mean) / std
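The standardization named in the last comment rescales every feature column to zero mean and unit variance. A minimal dependency-free sketch follows; `standardize_columns` is a hypothetical helper, not code from the gist, and the guard for constant columns is an added assumption. On the full `X` array, NumPy broadcasting or `sklearn.preprocessing.StandardScaler` would be the practical choice.

```python
from statistics import fmean, pstdev


def standardize_columns(rows):
    """Return a copy of rows with each column rescaled as (x - mean) / std."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    # Population standard deviation; fall back to 1.0 for constant columns
    # so that they map to 0.0 instead of dividing by zero (an assumption).
    stds = [pstdev(col) or 1.0 for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in rows
    ]


# Toy data: the second column is constant on purpose.
toy = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]]
scaled = standardize_columns(toy)
for row in scaled:
    print(row)
```

After scaling, the first column has mean 0 and standard deviation 1, and the constant second column becomes all zeros.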

  
## scraped_news_corpus_statistics.md

      
Created September 16, 2019 07:38

scraped news corpus statistics
          
Corpus summary


begin date = 2014-01-01
end date = 2019-08-16
num docs = 51016505
num sents = 413471083
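
For a sense of scale, the totals above can be turned into averages. This is a small worked calculation using only the four values listed; treating the end date as inclusive is an assumption.

```python
from datetime import date

# Values copied from the corpus summary.
begin, end = date(2014, 1, 1), date(2019, 8, 16)
num_docs, num_sents = 51_016_505, 413_471_083

num_days = (end - begin).days + 1      # assumes both endpoints are included
sents_per_doc = num_sents / num_docs   # roughly 8.1 sentences per document
docs_per_day = num_docs / num_days     # roughly 24,800 documents per day

print(num_days)                        # 2054
print(f"{sents_per_doc:.2f} sentences per document")
print(f"{docs_per_day:,.0f} documents per day")
```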

Yearly summary


| year | begin date | end date | num docs | num sents |
| --- | --- | --- | --- | --- |


## soynlp_noun_tokenizer_usage.md

      
Created April 30, 2019 05:32

soynlp Noun Tokenizer usage
          
In the current version (0.0.491) the code has not been cleaned up yet, so the argument names of the init function may change.
This tutorial is an example that uses the dataset from github.com/lovit/textmining-dataset.

    import soynlp
    from soynlp.utils import DoublespaceLineCorpus
    from soynlp.noun import LRNounExtractor_v2
    from lovit_textmining_dataset.navernews_10days import get_news_paths

  
## bokeh_show_image.py
# replace matplotlib.pyplot.imshow(img)

import numpy as np

from bokeh.plotting import figure, show, output_notebook

output_notebook()

N = 500
x = np.linspace(0, 10, N)

## tmux_cheatsheet.markdown

      
Created April 3, 2019 16:06, forked from henrik/tmux_cheatsheet.markdown

tmux cheatsheet
          
tmux cheatsheet

As configured in my dotfiles.

start new:

    tmux

start new with session name: