Unicode & UTF-8 in Python
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This topic can be confusing because Unicode, UTF-8, hexadecimal and binary are often all mixed together. To clarify it, I am going to start with this:\n",
"\n",
"1. If `English is the only language on the planet`, then we don't need the concept of Unicode and UTF-8. ASCII would be enough. \n",
"\n",
"\n",
"2. However, since that is not the case, we need to go way beyond the 128 symbols of the ASCII table (256 in its 8-bit extensions) to hold everything. This bigger table that holds almost everything is called Unicode.\n",
"\n",
"\n",
"3. As the table gets bigger, 1 byte (8 bits) is not enough to identify every entry. Unicode code points go up to U+10FFFF, which takes 21 bits, so up to 4 bytes are needed per character.\n",
"\n",
"\n",
"4. Think about this for a second: when a computer reads four bytes, how does it know whether they represent 1, 2, 3 or 4 characters?\n",
"\n",
"\n",
"5. If more than 1 byte is used to represent a character, then all the bytes need to be `packed` (think of a box) as one unit. This \"boxing\" method is called UTF-8."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### UTF-8 Format"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"| Number of bytes | Bits for code point (empty spaces) | Byte 1 | Byte 2 | Byte 3 | Byte 4 |\n",
"|------|------|---|---|---|---|\n",
"| 1 | 7 | 0xxxxxxx | | | |\n",
"| 2 | 11 | 110xxxxx | 10xxxxxx | | |\n",
"| 3 | 16 | 1110xxxx | 10xxxxxx | 10xxxxxx | |\n",
"| 4 | 21 | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
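"As you can see above, each `x` marks one bit you can use for storing a character's code point. Think of the fixed 0s and 1s as headers: the prefix of the first byte tells the reader how many bytes belong to the character, and every following byte starts with `10`."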
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Example Time"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A Chinese character: 汉"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"1. Find the Unicode value (code point) of this character in hexadecimal format"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'0x6c49'"
]
},
"execution_count": 45,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"hex(ord('汉'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"2. Humans do not think in hexadecimal, so we want to see the Unicode value in decimal"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"27721"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ord('汉') "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"3. However, computers can only store this character in binary:"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'0110110001001001'"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"f'{ord(\"汉\"):016b}'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
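"Written with a leading zero, the code point takes 16 bits (15 of them significant). That is more than the 11 bits a 2-byte sequence can hold, so according to the UTF-8 format table above, 3 bytes (16 empty spaces) are needed."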
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"16"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(f'{ord(\"汉\"):016b}')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"4. So let's pack (encode) this character using the UTF-8 format"
]
},
{
"cell_type": "code",
"execution_count": 83,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'111001101011000110001001'"
]
},
"execution_count": 83,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"''.join(f'{b:08b}' for b in \"汉\".encode(\"utf-8\"))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"|Byte 1| Byte 2| Byte 3|\n",
"|---|---|---|\n",
"|1110 <font color='blue'><b>0110</b></font>|10<font color='blue'><b>110001</b></font>|10<font color='blue'><b>001001</b></font>|"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"5. Done! We can now write these bits to a hard drive (notice how almost every text editor lets you save a document in UTF-8 format). When a program reads the bytes back, either you tell it to decode them as UTF-8 or it does so automatically (UTF-8 is often the default)."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
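
To make the "boxing" idea from the notebook concrete, here is a minimal sketch that packs a code point by hand, following the UTF-8 format table. The names `utf8_encode_manual` and `bits` are made up for this illustration and are not part of Python; in real code you would just call `str.encode("utf-8")`. The sketch also skips validation that the built-in encoder performs (for example, it does not reject lone surrogates).

```python
def utf8_encode_manual(ch: str) -> bytes:
    """Pack one character into UTF-8 bytes by hand (illustrative helper)."""
    cp = ord(ch)  # the Unicode code point, e.g. 0x6c49 for '汉'
    if cp < 0x80:       # fits in 7 bits  -> 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:      # fits in 11 bits -> 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (cp >> 6),
                      0b10000000 | (cp & 0b111111)])
    if cp < 0x10000:    # fits in 16 bits -> 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (cp >> 12),
                      0b10000000 | ((cp >> 6) & 0b111111),
                      0b10000000 | (cp & 0b111111)])
    # up to 21 bits -> 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0b11110000 | (cp >> 18),
                  0b10000000 | ((cp >> 12) & 0b111111),
                  0b10000000 | ((cp >> 6) & 0b111111),
                  0b10000000 | (cp & 0b111111)])


def bits(data: bytes) -> str:
    """Show each byte as 8 binary digits, separated by spaces."""
    return " ".join(f"{b:08b}" for b in data)


print(bits(utf8_encode_manual("汉")))  # 11100110 10110001 10001001
```

The printed bytes match the table in the example above: the header bits `1110`, `10` and `10` wrap the 16 payload bits of `0x6c49`. Comparing the result against `"汉".encode("utf-8")` is a quick way to check the hand-rolled version.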
@mrandreastoth

"UT8-8" should be "UTF-8".
