@liubin
Forked from tomowarkar/mecab_cabocha.ipynb
Created October 24, 2022 03:18
How to use MeCab and CaboCha in Google Colaboratory. See also: https://tomowarkar.github.io/blog/posts/colab_mecab/

liubin commented Oct 24, 2022

My steps:

basic packages

apt-get install mecab swig libmecab-dev mecab-ipadic-utf8
apt install python
apt install python-dev

wget https://bootstrap.pypa.io/pip/2.7/get-pip.py
python get-pip.py
pip install mecab-python==0.996

CRF++

curl -sL -o CRF++-0.58.tar.gz "https://drive.google.com/uc?export=download&id=0B4y35FiV1wh7QVR6VXJ5dWExSTQ"
tar -zxf CRF++-0.58.tar.gz
cd CRF++-0.58
./configure && make && make install && ldconfig
cd ..

cabocha

url="https://drive.google.com/uc?export=download&id=0B4y35FiV1wh7SDd1Q1dUQkZQaUU"
curl -sc /tmp/cookie ${url} >/dev/null
code="$(awk '/_warning_/ {print $NF}' /tmp/cookie)"
curl -sLb /tmp/cookie ${url}"&confirm=${code}" -o cabocha-0.69.tar.bz2
tar -jxf cabocha-0.69.tar.bz2
cd cabocha-0.69
./configure --with-charset=utf-8
make
make check
make install
ldconfig
pip install python/
cd ..
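
For readers unfamiliar with the awk line in the download step above: it scans the cookie file that curl saved and prints the last whitespace-separated field of every line containing `_warning_`, and that value is then passed back as `confirm=`. A minimal Python sketch of the same extraction follows. The function name `confirm_codes` and the sample cookie line are made up for illustration; they are not taken from a real download.

```python
def confirm_codes(cookie_text):
    """Mimic: awk '/_warning_/ {print $NF}' on the text of a cookie file.

    Returns the last whitespace-separated field of each line that
    contains the substring "_warning_".
    """
    return [
        line.split()[-1]
        for line in cookie_text.splitlines()
        if "_warning_" in line
    ]


if __name__ == "__main__":
    # Hypothetical cookie-jar content in curl's tab-separated format;
    # the cookie name and value are invented for this example.
    sample = (
        "# Netscape HTTP Cookie File\n"
        ".drive.google.com\tTRUE\t/uc\tTRUE\t0\tdownload_warning_EXAMPLE\tabcd\n"
    )
    print(confirm_codes(sample))
```

If the list comes back empty, no warning cookie was set and the second curl call would send an empty `confirm=` value.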

UniDic for Early Middle Japanese (中古和文UniDic)

wget https://clrd.ninjal.ac.jp/unidic_archive/2203/UniDic-202203_20_chuko.zip
unzip UniDic-202203_20_chuko.zip 
cd 20_chuko/
ls -tl /var/lib/mecab/dic/ipadic-utf8
ln -s /var/lib/mecab/dic/ipadic-utf8/dicrc dicrc
mecab -d ./
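
The `ln -s` step above fails if it is run a second time, because `dicrc` already exists. A small sketch of a repeatable version in Python follows, assuming Python 3 and a filesystem that supports symlinks. The helper name `link_dicrc` is hypothetical, and the demo only touches a temporary directory, not the real dictionary paths.

```python
import os
import tempfile


def link_dicrc(dic_dir, source_dicrc):
    """Symlink source_dicrc into dic_dir as 'dicrc' unless one is already there.

    Returns True if a link was created, False if an existing file or link
    was left untouched.
    """
    link_path = os.path.join(dic_dir, "dicrc")
    if os.path.lexists(link_path):
        return False
    os.symlink(os.path.abspath(source_dicrc), link_path)
    return True


if __name__ == "__main__":
    # Stand-ins for /var/lib/mecab/dic/ipadic-utf8/dicrc and 20_chuko/.
    with tempfile.TemporaryDirectory() as tmp:
        source = os.path.join(tmp, "ipadic-dicrc")
        open(source, "w").close()
        dic_dir = os.path.join(tmp, "20_chuko")
        os.mkdir(dic_dir)
        print(link_dicrc(dic_dir, source))  # first call creates the link
        print(link_dicrc(dic_dir, source))  # second call leaves it alone
```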

Python

import CaboCha
cp = CaboCha.Parser("-d 20_chuko")
print(cp.parseToString("いづれの御時にか、女御、更衣あまたさぶらひたまひけるなかに、いとやむごとなき際にはあらぬが、すぐれて時めきたまふありけり"))

Result:

          いづれの御時にか、-----D        
                        女御、---D        
    更衣あまたさぶらひたまひける-D        
                          なかに、---D    
              いとやむごとなき際には-D    
                            あらぬが、---D
                                すぐれて-D
                      時めきたまふありけり

Configure cabocha (switch cabocharc from IPA to UNIDIC)

diff /usr/local/etc/cabocharc /usr/local/etc/cabocharc.bak 
8c8
< # posset = IPA
---
> posset = IPA
11c11
< posset = UNIDIC
---
> # posset = UNIDIC
39c39
< # parser-model  = /usr/local/lib/cabocha/model/dep.ipa.model
---
> parser-model  = /usr/local/lib/cabocha/model/dep.ipa.model
44c44
< parser-model  = /usr/local/lib/cabocha/model/dep.unidic.model
---
> # parser-model  = /usr/local/lib/cabocha/model/dep.unidic.model
48c48
< # chunker-model = /usr/local/lib/cabocha/model/chunk.ipa.model
---
> chunker-model = /usr/local/lib/cabocha/model/chunk.ipa.model
51c51
< chunker-model = /usr/local/lib/cabocha/model/chunk.unidic.model
---
> # chunker-model = /usr/local/lib/cabocha/model/chunk.unidic.model
54c54
< # ne-model = /usr/local/lib/cabocha/model/ne.ipa.model
---
> ne-model = /usr/local/lib/cabocha/model/ne.ipa.model
57c57
< ne-model = /usr/local/lib/cabocha/model/ne.unidic.model
---
> # ne-model = /usr/local/lib/cabocha/model/ne.unidic.model
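
The diff above comes down to one rule: for the keys `posset`, `parser-model`, `chunker-model` and `ne-model`, comment out the IPA line and uncomment the UNIDIC line. A minimal sketch that applies this rule to the text of a cabocharc file follows. The function name `switch_posset` is made up, and the sketch assumes model file names end in `.<posset>.model` as they do in the diff. It only transforms a string; writing the result back to `/usr/local/etc/cabocharc` is left to the reader.

```python
import re

# Matches the four settings changed in the diff, commented out or not.
_SETTING = re.compile(
    r"^\s*(?:#\s*)?"
    r"(?P<key>posset|parser-model|chunker-model|ne-model)"
    r"\s*=\s*(?P<value>\S+)\s*$"
)


def _selects(key, value, target):
    """True if this setting line belongs to the target posset."""
    if key == "posset":
        return value == target
    return value.endswith("." + target.lower() + ".model")


def switch_posset(text, target="UNIDIC"):
    """Uncomment the settings for `target` and comment out the others."""
    result = []
    for line in text.splitlines():
        m = _SETTING.match(line)
        if m is None:
            result.append(line)  # unrelated line, keep as-is
            continue
        setting = line[m.start("key"):]  # "key = value" without a leading "# "
        if _selects(m.group("key"), m.group("value"), target):
            result.append(setting)
        else:
            result.append("# " + setting)
    return "\n".join(result) + ("\n" if text.endswith("\n") else "")


if __name__ == "__main__":
    before = (
        "posset = IPA\n"
        "# posset = UNIDIC\n"
        "ne-model = /usr/local/lib/cabocha/model/ne.ipa.model\n"
        "# ne-model = /usr/local/lib/cabocha/model/ne.unidic.model\n"
    )
    print(switch_posset(before), end="")
```

Calling `switch_posset(text, "IPA")` reverses the change, which gives the same result as restoring `cabocharc.bak` for these four keys.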

Python

>>> import CaboCha
>>> cp = CaboCha.Parser("-d 20_chuko -P UNIDIC")
>>> print(cp.parseToString("いづれの御時にか、女御、更衣あまたさぶらひたまひけるなかに、いとやむごとなき際にはあらぬが、すぐれて時めきたまふありけり"))
                    いづれの-D                  
                    御時にか-----------------D
                          女御-D             |
      更衣あまたさぶらひたまひける-D           |
                            なかに-------D   |
                                  いと-D   |   |
                            やむごとなき-D |   |
                                    際には-D   |
                                  あらぬが---D
                                      すぐれて-D
                            時めきたまふありけり
EOS
