Last active
March 30, 2018 07:10
Morphologically analyze Aozora Bunko's "I Am a Cat" (吾輩は猫である) and print the readings
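import io
import re
import zipfile

import requests
from janome.tokenizer import Tokenizer

# Download the ruby-annotated Aozora Bunko archive and read the text file inside it.
r = requests.get('http://www.aozora.gr.jp/cards/000148/files/789_ruby_5639.zip')
r.raise_for_status()
f = zipfile.ZipFile(io.BytesIO(r.content)).open('wagahaiwa_nekodearu.txt')
text = f.read().decode('cp932')  # the file is Shift_JIS (cp932) encoded

# Strip Aozora Bunko markup. The delimiters are full-width characters.
text = re.sub(r'《[^》]+》', '', text)  # ruby readings: 《...》
text = re.sub(r'|', '', text)  # ruby base-text marker: |
text = re.sub(r'[.+?]', '', text)  # editorial annotations: [#...]
text = re.sub(r'-----[\s\S]*-----', '', text)  # header block between the dashed rules
text = re.split('底本:', text)[0]  # drop the colophon that starts at "底本:"

# Print the reading of every line.
# Tokens for which janome has no reading are reported as '*'.
t = Tokenizer()
for line in text.splitlines():
    print(''.join(token.reading for token in t.tokenize(line)))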