Yohei Tamura tamuhey

## tokenizations.md

      
tamuhey / tokenizations.md (last active January 2, 2020 16:38): tokenization alignment

I want to compute an alignment between two tokenizations.
https://github.com/explosion/spacy-transformers/issues/87
I asked about it on Stack Overflow.
https://stackoverflow.com/questions/59497734/how-can-i-get-an-alignment-for-two-different-tokenizations
The problem looks similar to diff.
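
Because the problem resembles diff, one way to sketch it is to run a character-level diff over the concatenated tokens and map every matching character back to the token that owns it. The sketch below uses the standard-library `difflib.SequenceMatcher`. The names `char_owners` and `align` are hypothetical, and this is only an illustration of the diff idea, not necessarily the approach discussed in the linked issue or question.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def char_owners(tokens: List[str]) -> List[int]:
    """For each character of the concatenated tokens, return the index of its token."""
    return [i for i, tok in enumerate(tokens) for _ in tok]


def align(a: List[str], b: List[str]) -> Tuple[List[List[int]], List[List[int]]]:
    """Return (a2b, b2a): for each token, the indices of overlapping tokens on the other side.

    Hypothetical helper: tokens are linked when they share at least one
    character that the diff reports as matching.
    """
    a_owner, b_owner = char_owners(a), char_owners(b)
    a2b = [set() for _ in a]
    b2a = [set() for _ in b]
    matcher = SequenceMatcher(None, "".join(a), "".join(b), autojunk=False)
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            i = a_owner[block.a + k]
            j = b_owner[block.b + k]
            a2b[i].add(j)
            b2a[j].add(i)
    return [sorted(s) for s in a2b], [sorted(s) for s in b2a]


a2b, b2a = align(["foo", "bar"], ["fo", "ob", "ar"])
print(a2b)  # [[0, 1], [1, 2]]
print(b2a)  # [[0], [0, 1], [1]]
```

Characters that the diff does not match (for example, a letter that one tokenizer altered) simply contribute no link, so a token with no matching characters ends up with an empty list.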

  
## mecab_setup.sh
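# Download the MeCab 0.996 source and the IPA dictionary archive from Google Drive.
# The URLs are quoted so the shell does not treat "?" as a glob character.
pip install gdown --user
gdown "https://drive.google.com/uc?id=0B4y35FiV1wh7cENtOXlicTFaRUE"
gdown "https://drive.google.com/uc?id=0B4y35FiV1wh7MWVlSDBCSXZMTXM"
# Build and install MeCab itself.
tar xzvf mecab-0.996.tar.gz
cd mecab-0.996
./configure
make
make check
sudo make install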

## pyproject.toml
[tool.poetry]
name = "foo"
version = "0.1.0"
description = ""

[tool.poetry.dependencies]
python = "^3.7"
bar = {path = "bar"}

[tool.poetry.dev-dependencies]

## install_mecab.sh
# Build MeCab 0.996 and the IPA dictionary (UTF-8) from source. Run as root.
export MECAB_URL="https://drive.google.com/uc?export=download&id=0B4y35FiV1wh7cENtOXlicTFaRUE" && \
export IPADIC_URL="https://drive.google.com/uc?export=download&id=0B4y35FiV1wh7MWVlSDBCSXZMTXM" && \
cd /tmp && \
    wget --no-check-certificate "${MECAB_URL}" -O mecab.tar.gz && \
    tar xzvf mecab.tar.gz && cd mecab-0.996 && ./configure && make && make check && make install && \
    rm -rf /tmp/* && \
cd /tmp && \
    wget --no-check-certificate "${IPADIC_URL}" -O ipadic.tar.gz && \
    tar xzvf ipadic.tar.gz && cd mecab-ipadic-2.7.0-20070801 && ./configure --with-charset=utf8 && ldconfig && make && make install && \
    rm -rf /tmp/*

## oss_license.md

      
tamuhey / oss_license.md (last active February 4, 2020 10:26): OSS license considerations

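Use Apache 2.0 rather than MIT.

It includes a trademark clause, so we can retain exclusive rights to the newly created logo and the software name (Camphr).
It includes an explicit patent clause, so there is no need to prepare a CLA (a CLA raises the barrier for contributors and adds the burden of managing personal information).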

https://opensource.guide/legal/#does-my-project-need-an-additional-contributor-agreement
https://opensource.stackexchange.com/questions/5585/benefits-of-a-cla-when-using-apache-2-0-license


References


## bccwj2jsonl.py
"""Script to convert the BCCWJ NER dataset to JSONL

Usage:

$ python bccwj2jsonl.py xml/ output/

# convert to IREX

$ python bccwj2jsonl.py xml/ output/ irex
"""

## camphr.md

      
tamuhey / camphr.md (last active May 7, 2020 05:51)

Camphr: spaCy plugin for Transformers, Udify, ELMo, etc.

Hi, I'm Yohei Tamura, a software engineer at PKSHA Technology. I recently published Camphr, a spaCy plugin that integrates a wide variety of techniques, from state-of-the-art to conventional ones, into spaCy. With it you can use Transformers, Udify, ELMo, etc. on spaCy.
This post briefly introduces how to use Camphr.
Why I chose spaCy

spaCy is an awesome NLP framework and, in my opinion, has the following advantages:

  
## .pre-commit-config.yml
default_language_version:
    python: python3.7

repos:
-   repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v2.4.0
    hooks:
    - id: check-added-large-files
      args: ['--maxkb=1000']
    - id: check-merge-conflict

## tokenizations.md

      
tamuhey / tokenizations.md (created February 18, 2020 11:03): https://github.com/tamuhey/tokenizations

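Computing the correspondence between two tokenizations

In language processing, text is often split into tokens with a tokenizer such as MeCab. This article introduces a method for computing the correspondence between the outputs (tokenizations) of different tokenizers, along with its implementation (tokenizations).
For example, we look at a general method, independent of any tokenizer's implementation, for computing the correspondence between sentencepiece and BERT tokenizations such as the following.
# Tokenizations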
(a) BERT          : ['フ', '##ヘルト', '##ゥス', '##フルク', '条約', 'を', '締結']
(b) sentencepiece : ['▁', 'フ', 'ベル', 'トゥス', 'ブルク', '条約', 'を', '締結']

