@nakagami
Last active March 30, 2018 07:10
Morphologically analyze "I Am a Cat" (吾輩は猫である) from Aozora Bunko (青空文庫) and print the kana readings
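import io
import re
import zipfile

import requests
from janome.tokenizer import Tokenizer

# Download the ruby-annotated archive from Aozora Bunko and decode it (Shift_JIS / cp932).
r = requests.get('http://www.aozora.gr.jp/cards/000148/files/789_ruby_5639.zip')
f = zipfile.ZipFile(io.BytesIO(r.content)).open('wagahaiwa_nekodearu.txt')
text = f.read().decode('cp932')

# Strip Aozora Bunko markup so only the body text is tokenized.
text = re.sub(r'《[^》]+》', '', text)           # ruby readings: 《...》
text = text.replace('|', '')                   # full-width bar marking the start of a ruby base
text = re.sub(r'[#.+?]', '', text)            # editorial annotations: [#...]
text = re.sub(r'-----[\s\S]*-----', '', text)   # header block between the dashed rules
text = re.split('底本[::]', text)[0]            # drop the colophon that starts at "底本:"

# Print the concatenated reading of every token, line by line.
t = Tokenizer()
for line in text.split('\n'):
    print(''.join(token.reading for token in t.tokenize(line)))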