Created
July 8, 2021 02:50
split_characters.ipynb
{ | |
"nbformat": 4, | |
"nbformat_minor": 0, | |
"metadata": { | |
"colab": { | |
"name": "split_characters.ipynb", | |
"provenance": [], | |
"collapsed_sections": [], | |
"authorship_tag": "ABX9TyODU+WAK4N8tj4FIcEVQ686", | |
"include_colab_link": true | |
}, | |
"kernelspec": { | |
"name": "python3", | |
"display_name": "Python 3" | |
}, | |
"language_info": { | |
"name": "python" | |
} | |
}, | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "view-in-github", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"<a href=\"https://colab.research.google.com/gist/yoichi/6751cd9829238bcc9357e661f1523823/split_characters.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "WOY75fig31Zk" | |
},
"source": [
"Task: given an encoded string (a sequence of bytes), split it into individual characters."
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "S-WhWz7LvBM_" | |
},
"source": [
"Let's start with a simple case.\n",
"\n", | |
"a: 0x61, b: 0x62, c: 0x63" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "zMITDSHbu9ZJ" | |
}, | |
"source": [ | |
"abc = b'\\x61\\x62\\x63'\n", | |
"abc.decode('ascii')" | |
], | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "X9-IQOOZvncM" | |
},
"source": [
"for i in range(len(abc)):\n",
"    print(abc[i:i+1].decode('ascii'))"
], | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "woFt1U-6xFu6" | |
},
"source": [
"What about this one?\n",
"\n", | |
"あ: U+3042, い: U+3044, う: U+3046" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "ylpG4ACbxOPN" | |
}, | |
"source": [ | |
"あいう = b'\\x42\\x30\\x44\\x30\\x46\\x30'\n", | |
"あいう.decode('utf-16le')" | |
], | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "wmXjt-69VGZz" | |
},
"source": [
"# Slicing one byte at a time does not work (raises UnicodeDecodeError)\n",
"for i in range(len(あいう)//2):\n",
"    print(あいう[i:i+1].decode('utf-16le'))"
], | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "Ef1Q5SjpxZnC" | |
},
"source": [
"# Slicing two bytes at a time works\n",
"for i in range(len(あいう)//2):\n",
"    print(あいう[i*2:(i+1)*2].decode('utf-16le'))"
], | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "GcvtEP-Px0uW" | |
},
"source": [
"How about this?\n",
"\n", | |
"🍌 https://emojipedia.org/banana/" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "_EI-NlgUyZFU" | |
}, | |
"source": [ | |
"ばなな = '美味しい🍌をどうぞ'.encode('utf-16le') \n", | |
"ばなな.decode('utf-16le')" | |
], | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "mRJXt3mrzSI2" | |
},
"source": [
"for i in range(len(ばなな)//2):\n",
"    print(ばなな[i*2:(i+1)*2].decode('utf-16le'))"
], | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "an0T8_oxzkx-" | |
},
"source": [
"# How many bytes does the banana take?\n",
"print(len('🍌'.encode('utf-16le')))" | |
], | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "DibF55vlzsCq" | |
},
"source": [
"# Inspect the contents of the byte sequence\n",
"print(''.join('\\\\x%02x' % b for b in '🍌'.encode('utf-16le')))" | |
], | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "SOI7Jdvc0HDH" | |
},
"source": [
"🍌: U+1F34C \n",
"\n",
"This code point is not in the [BMP (Basic Multilingual Plane)](https://ja.wikipedia.org/wiki/%E5%9F%BA%E6%9C%AC%E5%A4%9A%E8%A8%80%E8%AA%9E%E9%9D%A2).\n",
"\n",
"→ UTF-16 therefore represents it as a surrogate pair.\n",
"\n",
"Subtract 0x10000 from the code point, split the remaining 20 bits into 10 bits + 10 bits, and store each half in its own 16-bit code unit.\n"
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "mdXBPL7A0yrR" | |
}, | |
"source": [ | |
"'%04x' % ((0x1f34c & 0x3ff) | 0xdc00)" | |
], | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "FTeoIwp-1kZX" | |
}, | |
"source": [ | |
"d = 0x1f34c >> 10\n", | |
"'%04x' % ((((d >> 6)-1) << 6) | (d & 0x3f) | 0xd800)" | |
], | |
"execution_count": null, | |
"outputs": [] | |
}, | |
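{
"cell_type": "markdown",
"metadata": {},
"source": [
"The two calculations above can be combined into one function. The following is a minimal sketch: `to_surrogate_pair` is a hypothetical helper introduced here for illustration. It assumes the standard UTF-16 rule (subtract 0x10000, then split the remaining 20 bits into two 10-bit halves) and cross-checks the result against Python's built-in encoder."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"def to_surrogate_pair(cp):\n",
"    # Hypothetical helper: only code points outside the BMP need a surrogate pair\n",
"    assert 0x10000 <= cp <= 0x10ffff\n",
"    v = cp - 0x10000            # 20-bit value\n",
"    high = 0xd800 | (v >> 10)   # upper 10 bits -> high surrogate\n",
"    low = 0xdc00 | (v & 0x3ff)  # lower 10 bits -> low surrogate\n",
"    return high, low\n",
"\n",
"high, low = to_surrogate_pair(0x1f34c)\n",
"print('%04x %04x' % (high, low))\n",
"# Cross-check against the built-in encoder (big-endian keeps the byte order readable)\n",
"assert '🍌'.encode('utf-16be') == high.to_bytes(2, 'big') + low.to_bytes(2, 'big')"
],
"execution_count": null,
"outputs": []
},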
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "thPs5R5N3se8" | |
},
"source": [
"# Take surrogate pairs into account\n",
"for i in range(len(ばなな)//2):\n",
"    if 0xdc <= ばなな[i*2+1] < 0xe0:\n",
"        # Low surrogate: already printed together with its high surrogate\n",
"        continue\n",
"    elif 0xd8 <= ばなな[i*2+1] < 0xdc:\n",
"        # High surrogate: decode it together with the following code unit\n",
"        print(ばなな[i*2:(i+2)*2].decode('utf-16le'))\n",
"    else:\n",
"        print(ばなな[i*2:(i+1)*2].decode('utf-16le'))"
], | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "extS30ZjOzmF" | |
},
"source": [
"# Alternatively, decode first and then process the string one character at a time\n",
"for c in ばなな.decode('utf-16le'):\n",
"    print(c)"
], | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "PnFwJRDsRXMt" | |
},
"source": [
"What about this one?\n",
"\n",
"🇯🇵 https://emojipedia.org/flag-japan/\n",
"\n",
"On Windows this is displayed as \"JP\"; other environments display the national flag.\n",
"\n", | |
"https://www.emojiall.com/ja/blog/321\n" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "1iZtn4bGQ1-z" | |
},
"source": [
"旗 = '🇯🇵'.encode('utf-16le')\n",
"for i in range(len(旗)//2):\n",
"    if 0xdc <= 旗[i*2+1] < 0xe0:\n",
"        continue\n",
"    elif 0xd8 <= 旗[i*2+1] < 0xdc:\n",
"        print(旗[i*2:(i+2)*2].decode('utf-16le'))\n",
"    else:\n",
"        print(旗[i*2:(i+1)*2].decode('utf-16le'))"
], | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "iA134EwARr6k" | |
},
"source": [
"# Inspect the contents of the byte sequence\n",
"print(''.join('\\\\x%02x' % b for b in 旗))" | |
], | |
"execution_count": null, | |
"outputs": [] | |
}, | |
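{
"cell_type": "markdown",
"metadata": {},
"source": [
"The flag is not a single code point. It is a pair of regional indicator symbols (U+1F1E6 to U+1F1FF), one per letter of the country code, and each of them needs its own surrogate pair in UTF-16. The following is a minimal sketch that builds the flag from its letters; `flag` is a hypothetical helper name introduced here for illustration."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"def flag(country_code):\n",
"    # Hypothetical helper: map 'A'..'Z' to the regional indicators U+1F1E6..U+1F1FF\n",
"    return ''.join(chr(0x1f1e6 + ord(c) - ord('A')) for c in country_code.upper())\n",
"\n",
"jp = flag('JP')\n",
"# Two code points, each outside the BMP\n",
"print(jp, ['U+%04X' % ord(c) for c in jp])\n",
"# Number of code points, and number of bytes in UTF-16\n",
"print(len(jp), len(jp.encode('utf-16le')))"
],
"execution_count": null,
"outputs": []
},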
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "HULSIy3uQjJT" | |
},
"source": [
"# Install a third-party library\n",
"!pip install uniseg" | |
], | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "bMKCoKXWQlDp" | |
},
"source": [
"import uniseg.graphemecluster\n",
"for c in uniseg.graphemecluster.grapheme_clusters(旗.decode('utf-16le')):\n",
"    print(c)"
], | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "MF-J8KrKPKdA" | |
},
"source": [
"And how about this?\n",
"\n",
"👩‍🎓 https://emojipedia.org/woman-student/"
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "ONW9vMKsPeTS" | |
},
"source": [
"# The ZWJ (U+200D) is written as an escape so that the invisible character cannot get lost\n",
"人 = '👩\\u200d🎓'.encode('utf-16le')\n",
"for i in range(len(人)//2):\n",
"    if 0xdc <= 人[i*2+1] < 0xe0:\n",
"        continue\n",
"    elif 0xd8 <= 人[i*2+1] < 0xdc:\n",
"        print(人[i*2:(i+2)*2].decode('utf-16le'))\n",
"    else:\n",
"        print(人[i*2:(i+1)*2].decode('utf-16le'))"
], | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "iSY9h6bERz5N" | |
},
"source": [
"import uniseg.graphemecluster\n",
"for c in uniseg.graphemecluster.grapheme_clusters(人.decode('utf-16le')):\n",
"    print(c)"
], | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "Bi4a4HWkQRTR" | |
},
"source": [
"# Inspect the contents of the byte sequence (ZWJ: U+200D)\n",
"print(''.join('\\\\x%02x' % b for b in 人))" | |
], | |
"execution_count": null, | |
"outputs": [] | |
}, | |
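{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see why splitting by code point breaks this emoji apart, it helps to list its code points by name. The following is a minimal sketch using the standard `unicodedata` module; `student` is a name introduced here for illustration, and the string is written with explicit escapes so that the invisible ZWJ is visible in the source."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"import unicodedata\n",
"\n",
"# WOMAN + ZERO WIDTH JOINER + GRADUATION CAP, written as escapes\n",
"student = '\\U0001f469\\u200d\\U0001f393'\n",
"for c in student:\n",
"    print('U+%04X' % ord(c), unicodedata.name(c))\n",
"# Number of code points, and number of bytes in UTF-16\n",
"print(len(student), len(student.encode('utf-16le')))"
],
"execution_count": null,
"outputs": []
},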
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "eryPegu_SAFc" | |
},
"source": [
"uniseg is based on an old version of Unicode and does not support ZWJ sequences.\n",
"\n",
"* http://ufcppfree.azurewebsites.net/Grapheme?s=%F0%9F%91%A9%E2%80%8D%F0%9F%8E%93\n",
"  * [UNICODE TEXT SEGMENTATION (latest: Unicode 13.0.0)](https://unicode.org/reports/tr29/)\n",
"* [pypi: uniseg](https://pypi.org/project/uniseg/)\n",
"  * [UNICODE TEXT SEGMENTATION (Unicode 6.2.0)](http://www.unicode.org/reports/tr29/tr29-21.html)\n",
"\n",
"[pypi: grapheme](https://pypi.org/project/grapheme/) supports the latest version of Unicode."
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "4eKs-l-E7C2J" | |
}, | |
"source": [ | |
"!pip install grapheme" | |
], | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "rz_ARPba7KO5" | |
},
"source": [
"import grapheme\n",
"for c in grapheme.graphemes(人.decode('utf-16le')):\n",
"    print(c)"
], | |
"execution_count": null, | |
"outputs": [] | |
} | |
] | |
} |