@yoichi
Created July 8, 2021 02:50
split_characters.ipynb
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "split_characters.ipynb",
"provenance": [],
"collapsed_sections": [],
"authorship_tag": "ABX9TyODU+WAK4N8tj4FIcEVQ686",
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/yoichi/6751cd9829238bcc9357e661f1523823/split_characters.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "WOY75fig31Zk"
},
"source": [
"課題:エンコードされた文字列が与えられたとする。一文字ずつに分割せよ"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "S-WhWz7LvBM_"
},
"source": [
"まずは簡単なものから\n",
"\n",
"a: 0x61, b: 0x62, c: 0x63"
]
},
{
"cell_type": "code",
"metadata": {
"id": "zMITDSHbu9ZJ"
},
"source": [
"abc = b'\\x61\\x62\\x63'\n",
"abc.decode('ascii')"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "X9-IQOOZvncM"
},
"source": [
"for i in range(len(abc)):\n",
" print(abc[i:i+1].decode('ascii'))"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "woFt1U-6xFu6"
},
"source": [
"これだとどうだろう?\n",
"\n",
"あ: U+3042, い: U+3044, う: U+3046"
]
},
{
"cell_type": "code",
"metadata": {
"id": "ylpG4ACbxOPN"
},
"source": [
"あいう = b'\\x42\\x30\\x44\\x30\\x46\\x30'\n",
"あいう.decode('utf-16le')"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "wmXjt-69VGZz"
},
"source": [
"# 1バイトずつ切り出すのはNG\n",
"for i in range(len(あいう)//2):\n",
" print(あいう[i:i+1].decode('utf-16le'))"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "Ef1Q5SjpxZnC"
},
"source": [
"# 2バイトずつ切り出せばOK\n",
"for i in range(len(あいう)//2):\n",
" print(あいう[i*2:(i+1)*2].decode('utf-16le'))"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "GcvtEP-Px0uW"
},
"source": [
"これはどう?\n",
"\n",
"🍌 https://emojipedia.org/banana/"
]
},
{
"cell_type": "code",
"metadata": {
"id": "_EI-NlgUyZFU"
},
"source": [
"ばなな = '美味しい🍌をどうぞ'.encode('utf-16le') \n",
"ばなな.decode('utf-16le')"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "mRJXt3mrzSI2"
},
"source": [
"for i in range(len(ばなな)//2):\n",
" print(ばなな[i*2:(i+1)*2].decode('utf-16le'))"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "an0T8_oxzkx-"
},
"source": [
"# バナナは何バイト?\n",
"print(len('🍌'.encode('utf-16le')))"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "DibF55vlzsCq"
},
"source": [
"# バイト列の内容を見てみる\n",
"print(''.join('\\\\x%02x' % b for b in '🍌'.encode('utf-16le')))"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "SOI7Jdvc0HDH"
},
"source": [
"🍌: U+1F34C \n",
"\n",
"[BMP (基本多言語面)](https://ja.wikipedia.org/wiki/%E5%9F%BA%E6%9C%AC%E5%A4%9A%E8%A8%80%E8%AA%9E%E9%9D%A2)\n",
"に含まれない\n",
"\n",
"→ UTF-16ではサロゲートペアで表現される\n",
"\n",
"10bit + 10bit に分けて、それぞれを 16bit で表現\n"
]
},
{
"cell_type": "code",
"metadata": {
"id": "mdXBPL7A0yrR"
},
"source": [
"'%04x' % ((0x1f34c & 0x3ff) | 0xdc00)"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "FTeoIwp-1kZX"
},
"source": [
"d = 0x1f34c >> 10\n",
"'%04x' % ((((d >> 6)-1) << 6) | (d & 0x3f) | 0xd800)"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "thPs5R5N3se8"
},
"source": [
"# サロゲートペアに配慮\n",
"for i in range(len(ばなな)//2):\n",
" if 0xdc <= ばなな[i*2+1] < 0xe0:\n",
" continue\n",
" elif 0xd8 <= ばなな[i*2+1] < 0xdc:\n",
" # サロゲートペア\n",
" print(ばなな[i*2:(i+2)*2].decode('utf-16le'))\n",
" else:\n",
" print(ばなな[i*2:(i+1)*2].decode('utf-16le'))"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "extS30ZjOzmF"
},
"source": [
"# デコードしてしまってから文字単位で処理でもいい\n",
"for c in ばなな.decode('utf-16le'):\n",
" print(c)"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "PnFwJRDsRXMt"
},
"source": [
"これだとどうだろう?\n",
"\n",
"🇯🇵 https://emojipedia.org/flag-japan/\n",
"\n",
"Windowsだと\"JP\"と表示される。他の環境だと国旗が表示される。\n",
"\n",
"https://www.emojiall.com/ja/blog/321\n"
]
},
{
"cell_type": "code",
"metadata": {
"id": "1iZtn4bGQ1-z"
},
"source": [
"旗='🇯🇵'.encode('utf-16le')\n",
"for i in range(len(旗)//2):\n",
" if 0xdc <= 旗[i*2+1] < 0xe0:\n",
" continue\n",
" elif 0xd8 <= 旗[i*2+1] < 0xdc:\n",
" print(旗[i*2:(i+2)*2].decode('utf-16le'))\n",
" else:\n",
" print(旗[i*2:(i+1)*2].decode('utf-16le'))"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "iA134EwARr6k"
},
"source": [
"# バイト列の内容を見てみる\n",
"print(''.join('\\\\x%02x' % b for b in 旗))"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "HULSIy3uQjJT"
},
"source": [
"# 3rd partyのライブラリをインストール\n",
"!pip install uniseg"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "bMKCoKXWQlDp"
},
"source": [
"import uniseg.graphemecluster\n",
"for c in uniseg.graphemecluster.grapheme_clusters(旗.decode('utf-16le')):\n",
" print(c)"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "MF-J8KrKPKdA"
},
"source": [
"これはどうか?\n",
"\n",
"👩‍🎓 https://emojipedia.org/woman-student/"
]
},
{
"cell_type": "code",
"metadata": {
"id": "ONW9vMKsPeTS"
},
"source": [
"人='👩‍🎓'.encode('utf-16le')\n",
"for i in range(len(人)//2):\n",
" if 0xdc <= 人[i*2+1] < 0xe0:\n",
" continue\n",
" elif 0xd8 <= 人[i*2+1] < 0xdc:\n",
" print(人[i*2:(i+2)*2].decode('utf-16le'))\n",
" else:\n",
" print(人[i*2:(i+1)*2].decode('utf-16le'))"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "iSY9h6bERz5N"
},
"source": [
"import uniseg.graphemecluster\n",
"for c in uniseg.graphemecluster.grapheme_clusters(人.decode('utf-16le')):\n",
" print(c)"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "Bi4a4HWkQRTR"
},
"source": [
"# バイト列の内容を見てみる (ZWJ: U+200D)\n",
"print(''.join('\\\\x%02x' % b for b in 人))"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "eryPegu_SAFc"
},
"source": [
"uniseg は古いUnicodeに基づいており、ZWJシーケンスに非対応\n",
"\n",
"* http://ufcppfree.azurewebsites.net/Grapheme?s=%F0%9F%91%A9%E2%80%8D%F0%9F%8E%93\n",
" * [UNICODE TEXT SEGMENTATION (latest: Unicode 13.0.0)](https://unicode.org/reports/tr29/)\n",
"* [pypi: uniseg](https://pypi.org/project/uniseg/)\n",
" * [UNICODE TEXT SEGMENTATION (Unicode 6.2.0)](http://www.unicode.org/reports/tr29/tr29-21.html)\n",
"\n",
"[pypi: grapheme](https://pypi.org/project/grapheme/) は最新のUnicodeに対応している。"
]
},
{
"cell_type": "code",
"metadata": {
"id": "4eKs-l-E7C2J"
},
"source": [
"!pip install grapheme"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "rz_ARPba7KO5"
},
"source": [
"import grapheme\n",
"for c in grapheme.graphemes(人.decode('utf-16le')):\n",
" print(c)"
],
"execution_count": null,
"outputs": []
}
]
}
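
The surrogate-pair arithmetic and the surrogate-aware split from the notebook can be sketched with the standard library alone. This is a minimal sketch, not the notebook author's code: the helper names `to_surrogate_pair` and `split_utf16le` are hypothetical, and the split works per code point only, so flags and ZWJ sequences are still broken apart, as in the notebook.

```python
def to_surrogate_pair(cp):
    """Return the (high, low) UTF-16 surrogates for a code point above U+FFFF.

    Hypothetical helper for illustration only.
    """
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    v = cp - 0x10000            # 20-bit value
    high = 0xD800 | (v >> 10)   # upper 10 bits
    low = 0xDC00 | (v & 0x3FF)  # lower 10 bits
    return high, low


def split_utf16le(data):
    """Split UTF-16LE bytes into one str per code point (hypothetical helper)."""
    chars = []
    i = 0
    while i < len(data):
        # Read one 16-bit code unit in little-endian order.
        unit = int.from_bytes(data[i:i + 2], "little")
        # A high surrogate means the character occupies two code units.
        size = 4 if 0xD800 <= unit < 0xDC00 else 2
        chars.append(data[i:i + size].decode("utf-16le"))
        i += size
    return chars


high, low = to_surrogate_pair(0x1F34C)
print("%04x %04x" % (high, low))  # d83c df4c
print(split_utf16le("美味しい🍌をどうぞ".encode("utf-16le")))
```

Stepping through the bytes with a `while` loop avoids the `continue` branch for low surrogates, since each iteration consumes a whole character. Splitting into grapheme clusters still needs a library that implements UAX #29.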