Created
July 9, 2011 08:05
When a regular expression pattern contains multibyte characters, define the pattern as unicode; a str pattern produces unexpected results.
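# -*- coding:utf-8 -*-
import re

# The character class holds an ASCII space and a full-width space (U+3000).
# As a str (byte string) pattern, the class is built from the individual
# UTF-8 bytes of U+3000 (0xE3, 0x80, 0x80), not from the character itself.
regex1 = re.compile(r"[ 　]+")
# As a unicode pattern, the class holds exactly two characters.
regex2 = re.compile(ur"[ 　]+")


def main(*args):
    """\
    regex1, whose pattern is defined as a str, splits in the wrong places.
    regex2, whose pattern is defined as unicode, works fine.

    ================
    Output
    ================
    ?名氏俊
    瀬名氏俊
    """
    name = "瀬名 氏俊"
    # regex1 also strips the 0x80 byte inside "瀬" (0xE7 0x80 0xAC),
    # which leaves an invalid byte sequence behind.
    print "".join(regex1.split(name))
    print "".join(regex2.split(name))


if __name__ == '__main__':
    import sys
    main(sys.argv[1:])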
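The script above targets Python 2. As a rough sketch only, the same byte-level effect can be reproduced on Python 3 by compiling the pattern as bytes. This assumes the character class holds an ASCII space and U+3000; the names `PATTERN_TEXT`, `bytes_regex`, `text_regex`, `broken`, and `fixed` are made up for this illustration and are not part of the original gist.

```python
import re

# Assumption: the class contains an ASCII space and a full-width space (U+3000).
PATTERN_TEXT = "[ \u3000]+"
name = "瀬名 氏俊"

# Bytes pattern: the class is made of the single bytes 0x20, 0xE3 and 0x80,
# which mirrors what a Python 2 str pattern does.
bytes_regex = re.compile(PATTERN_TEXT.encode("utf-8"))
# Text pattern: the class is made of two characters, like a Python 2 unicode pattern.
text_regex = re.compile(PATTERN_TEXT)

# The 0x80 byte inside "瀬" (0xE7 0x80 0xAC) matches the byte class and is dropped.
broken = b"".join(bytes_regex.split(name.encode("utf-8")))
# Only the space character matches, so the name stays intact.
fixed = "".join(text_regex.split(name))

print(broken)
print(fixed)
```

Here `broken` is no longer valid UTF-8, which is presumably why the original output shows `?` in place of the first character.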