@oyakata
Created July 9, 2011 08:05
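When defining a regular expression pattern that contains multibyte characters, write the pattern as unicode; a str (byte string) pattern produces unexpected results.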
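# -*- coding:utf-8 -*-
import re

# Both character classes hold a half-width space and a full-width space (U+3000).
# regex1 is a str (byte string) pattern, so its class matches the individual
# UTF-8 bytes of the full-width space (0xE3, 0x80) rather than the character.
regex1 = re.compile(r"[ 　]+")
# regex2 is a unicode pattern, so its class matches whole characters.
regex2 = re.compile(ur"[ 　]+")


def main(*args):
    """\
    regex1, whose pattern is defined as str, splits the string in the wrong places.
    regex2, whose pattern is defined as unicode, works correctly.

    ================
    Output
    ================
    ?名氏俊
    瀬名氏俊
    """
    name = "瀬名 氏俊"
    # The UTF-8 encoding of 瀬 contains the byte 0x80, so regex1 cuts it apart.
    print "".join(regex1.split(name))
    print "".join(regex2.split(name))


if __name__ == '__main__':
    import sys
    main(*sys.argv[1:])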
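
The code above targets Python 2 (`ur""` literals and `print` statements). For readers on Python 3, the same pitfall can be sketched by comparing a bytes pattern with a str pattern. This is an illustrative sketch and not part of the original gist: it assumes the character class holds a half-width space and a full-width space (U+3000), and the names `FULLWIDTH_SPACE`, `byte_regex`, and `text_regex` are made up for the example.

```python
import re

# Assumption: the second character in the class is U+3000 (full-width space).
FULLWIDTH_SPACE = "\u3000"

# Bytes pattern: the class holds the separate bytes 0x20, 0xE3 and 0x80,
# which mirrors what a Python 2 str pattern does.
byte_regex = re.compile(("[ " + FULLWIDTH_SPACE + "]+").encode("utf-8"))

# Text pattern: the class holds two code points, U+0020 and U+3000.
text_regex = re.compile("[ " + FULLWIDTH_SPACE + "]+")

name = "瀬名 氏俊"

# The UTF-8 encoding of 瀬 is E7 80 AC; the byte 0x80 matches the bytes
# pattern, so the character is cut apart and no longer decodes cleanly.
broken = b"".join(byte_regex.split(name.encode("utf-8")))

# The text pattern only removes the space between the two name parts.
clean = "".join(text_regex.split(name))

print(broken.decode("utf-8", errors="replace"))
print(clean)
```

In Python 3 a bytes pattern cannot be applied to a str by accident (`re` raises `TypeError`), so this failure mode mostly appears when text is processed as raw encoded bytes.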