@buruzaemon
Last active November 25, 2015 14:48
# -*- coding: utf-8 -*-
import re
from natto import MeCab
patt = re.compile(r'''(?x)
# Ticker symbols
[0-9\uFF10-\uFF19]{2,}(\s|\.)[A-Z\uFF21-\uFF3A]{1,2} |
# short-form contractions
['\u2019](d|ll|m|s|re|ve) |
# short-form negatives (preserved in their entirety)
n['\u2019]t |
# hyphen- and forward-slash delimited words
([A-Z]+(\-|\/))+[A-Z]+ |
# split off straight double quotes and curly single/double quotes as their own tokens
[\"\u2018\u2019\u201C\u201D]
''', re.IGNORECASE| re.MULTILINE)
txt = """
Wouldn't you know? I've had it! This's a fine how-do-you-do, isn't it, Jor-El!
She said "That'll be the day, uh-huh, that I die."
1099.T は銘柄コードの例です。
注意:
 今季はやはり 1079 JP および AAPL も気になる存在である。
 P/E の観点からいうと、双方は「ばちぐぅ~!」という感じです。
GOOGおよびAAPL、または1234.JPか1099 JPも何れも買いである。
An example with fancy double quotes: \u201CThe Sun Doesn't Rise Always, Don't You know?\u201D.
An example with fancy single quotes: What an \u2018exquisite\u2019 bouquet!
"""
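# Use the pattern as a boundary constraint: each span it matches is handed
# to MeCab as a single token.
nm = MeCab()
for n in nm.parse(txt, boundary_constraints=patt, as_nodes=True):
    print(n.surface)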
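
To check which spans the pattern picks out before involving MeCab at all, the regex can be run on its own. The following is only a sketch using the standard `re` module. It assumes the constrained spans correspond to what `re.finditer` reports, and the helper name `constrained_spans` is hypothetical (not part of natto-py). The pattern is a copy of the one above with the stray `|` removed from the apostrophe character classes and with non-capturing groups.

```python
import re

# Copy of the gist's pattern (assumption: '|' inside the apostrophe
# character classes was unintended, so it is dropped here).
PATT = re.compile(r'''(?x)
    # ticker symbols such as 1099.T or 1079 JP
    [0-9\uFF10-\uFF19]{2,}(?:\s|\.)[A-Z\uFF21-\uFF3A]{1,2} |
    # short-form contractions
    ['\u2019](?:d|ll|m|s|re|ve) |
    # short-form negatives
    n['\u2019]t |
    # hyphen- and forward-slash delimited words
    (?:[A-Z]+[-/])+[A-Z]+ |
    # straight double quotes and curly quotes
    ["\u2018\u2019\u201C\u201D]
''', re.IGNORECASE)


def constrained_spans(text):
    """Return the substrings the pattern matches, left to right."""
    return [m.group(0) for m in PATT.finditer(text)]


if __name__ == "__main__":
    for span in constrained_spans("I've seen 1099.T; isn't it a how-do-you-do?"):
        print(repr(span))
```

Note that with `re.IGNORECASE` the ticker alternative also accepts lowercase letters after the digits, so a string like "10 apples" would be partly captured. Dropping the flag for that alternative may be preferable, depending on the input.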