@rjweiss
Last active December 28, 2015 23:09
encoding
{
"metadata": {
"name": "encoding_lecture.ipynb"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import urllib2\n",
"f = urllib2.urlopen('http://www.stanford.edu/~rjweiss/public_html/IRiSS2013/text1/extra/esperanto.txt')\n",
"foo = f.readlines()\n",
"f.close()\n",
"\n",
"foo[0]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 1,
"text": [
"\"Nul d\\xc3\\xa0 for\\xc3\\xa9n fo\\xc3\\xbbnd. Far eksa du\\xc5\\x93nd\\xc3\\xaff\\xc3\\xafna mi, miloj \\xc3\\xb4kcid\\xc3\\xa8nte n\\xc3\\xaa far, g\\xc3\\xabtto \\xc3\\xaflion men oj. Oz p\\xc3\\xa9r olog kvin \\xc3\\xa8st\\xc3\\xaf\\xc3\\xa9l, \\xc3\\xaesm ja c\\xc3\\xa9nt t\\xc3\\xa0g\\xc5\\x93. Su\\xc5\\x93m\\xc3\\xafo r\\xc3\\xa8spond\\xc3\\xab ba ena, \\xc3\\xa0j jen am\\xc3\\xa9n nett\\xc3\\xaa, sor d\\xc3\\xaavus multe duont\\xc3\\xb4n\\xc5\\x93 aj. Ki\\xc3\\xa0n f\\xc3\\xbbnd\\xc3\\xa2m\\xc3\\xa8nto bv p\\xc3\\xaar, plej\\xc3\\xa2 log'\\xc3\\xb4 \\xc3\\xaeomete la ojd, o\\xc3\\xaed in \\xc3\\xa9kkri\\xc3\\xb4 \\xc3\\xaenf\\xc3\\xaenit\\xc3\\xafv\\xc5\\x93. Ing pli\\xc3\\xa9 franjo rilativ\\xc3\\xb4 nv.\\n\""
]
}
],
"prompt_number": 1
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# All of those \\x## escapes mean foo[0] is a byte string holding non-ASCII bytes; the console shows its repr instead of rendering glyphs.\n",
"# We didn't get an error, so nothing is WRONG here; it's just a display issue.\n",
"# Let's try printing it.\n",
"print foo[0]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Nul d\u00e0 for\u00e9n fo\u00fbnd. Far eksa du\u0153nd\u00eff\u00efna mi, miloj \u00f4kcid\u00e8nte n\u00ea far, g\u00ebtto \u00eflion men oj. Oz p\u00e9r olog kvin \u00e8st\u00ef\u00e9l, \u00eesm ja c\u00e9nt t\u00e0g\u0153. Su\u0153m\u00efo r\u00e8spond\u00eb ba ena, \u00e0j jen am\u00e9n nett\u00ea, sor d\u00eavus multe duont\u00f4n\u0153 aj. Ki\u00e0n f\u00fbnd\u00e2m\u00e8nto bv p\u00ear, plej\u00e2 log'\u00f4 \u00eeomete la ojd, o\u00eed in \u00e9kkri\u00f4 \u00eenf\u00eenit\u00efv\u0153. Ing pli\u00e9 franjo rilativ\u00f4 nv.\n",
"\n"
]
}
],
"prompt_number": 2
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Right on, look at all those accented characters.\n",
"# I happen to have made this file, so I know it's encoded as UTF-8.\n",
"foo_prime = foo[0]\n",
"print type(foo_prime)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"<type 'str'>\n"
]
}
],
"prompt_number": 3
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"ascii_foo = foo_prime.decode('ascii')\n",
"# what the heck is happening?!?!?1"
],
"language": "python",
"metadata": {},
"outputs": [
{
"ename": "UnicodeDecodeError",
"evalue": "'ascii' codec can't decode byte 0xc3 in position 5: ordinal not in range(128)",
"output_type": "pyerr",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m\n\u001b[0;31mUnicodeDecodeError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m<ipython-input-4-27a81d0ae3d5>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mascii_foo\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mfoo_prime\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdecode\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'ascii'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2\u001b[0m \u001b[0;31m# what the heck is happening?!?!?1\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;31mUnicodeDecodeError\u001b[0m: 'ascii' codec can't decode byte 0xc3 in position 5: ordinal not in range(128)"
]
}
],
"prompt_number": 4
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# OK, we can't decode UTF-8 bytes into unicode by claiming they're ASCII; we have to name the real encoding.\n",
"utf8_foo = foo_prime.decode('utf-8')"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 5
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"unicode_foo = foo[0].decode('utf-8')\n",
"print type(unicode_foo)\n",
"unicode_foo  # Ahhh! Every character is now a unicode code point; the repr shows those below U+0100 as \\x## and the rest as \\u####."
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"<type 'unicode'>\n"
]
},
{
"output_type": "pyout",
"prompt_number": 6,
"text": [
"u\"Nul d\\xe0 for\\xe9n fo\\xfbnd. Far eksa du\\u0153nd\\xeff\\xefna mi, miloj \\xf4kcid\\xe8nte n\\xea far, g\\xebtto \\xeflion men oj. Oz p\\xe9r olog kvin \\xe8st\\xef\\xe9l, \\xeesm ja c\\xe9nt t\\xe0g\\u0153. Su\\u0153m\\xefo r\\xe8spond\\xeb ba ena, \\xe0j jen am\\xe9n nett\\xea, sor d\\xeavus multe duont\\xf4n\\u0153 aj. Ki\\xe0n f\\xfbnd\\xe2m\\xe8nto bv p\\xear, plej\\xe2 log'\\xf4 \\xeeomete la ojd, o\\xeed in \\xe9kkri\\xf4 \\xeenf\\xeenit\\xefv\\u0153. Ing pli\\xe9 franjo rilativ\\xf4 nv.\\n\""
]
}
],
"prompt_number": 6
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#ok, wonderful. why do i care about this? well, let's see what happens if i try to force this unicode string into ASCII\n",
"unicode_foo.encode('ascii')"
],
"language": "python",
"metadata": {},
"outputs": [
{
"ename": "UnicodeEncodeError",
"evalue": "'ascii' codec can't encode character u'\\xe0' in position 5: ordinal not in range(128)",
"output_type": "pyerr",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m\n\u001b[0;31mUnicodeEncodeError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m<ipython-input-7-645942daeb8e>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;31m#ok, wonderful. why do i care about this? well, let's see what happens if i try to force this unicode string into ASCII\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0municode_foo\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mencode\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'ascii'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[0;31mUnicodeEncodeError\u001b[0m: 'ascii' codec can't encode character u'\\xe0' in position 5: ordinal not in range(128)"
]
}
],
"prompt_number": 7
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Burn! Exercise: explain why the ASCII codec can't encode this string."
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 8
}
],
"metadata": {}
}
]
}
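
The two tracebacks in the notebook above share one cause: ASCII only covers values 0-127, while the text contains characters outside that range. The sketch below is one way to reproduce both errors and to show a few ways around them. It is not part of the original notebook. The `raw` literal is an assumed stand-in for the start of `esperanto.txt`, so no download is needed, and all variable names are illustrative. The notebook itself is Python 2; this sketch is written for Python 3, where `bytes` plays the role of the Python 2 `str` and `str` plays the role of `unicode`.

```python
# Assumption: a short stand-in for the start of esperanto.txt, as raw UTF-8 bytes.
raw = b"Nul d\xc3\xa0 for\xc3\xa9n fo\xc3\xbbnd."

# Decoding turns bytes into text (code points); each accented letter was two bytes.
text = raw.decode("utf-8")
print(len(raw), len(text))

# ASCII only defines byte values 0-127, so the byte 0xc3 at index 5 cannot be decoded.
try:
    raw.decode("ascii")
except UnicodeDecodeError as err:
    decode_error = err  # keep a reference; Python 3 unbinds `err` after the block
print(decode_error.start, repr(raw[5:6]))

# Encoding goes the other way: the character at index 5 is U+00E0 (code point 224),
# and ASCII has no byte for anything above 127.
try:
    text.encode("ascii")
except UnicodeEncodeError as err:
    encode_error = err
print(encode_error.start, ord(text[5]))

# Lossy ways to force ASCII, if losing the accented letters is acceptable.
replaced = text.encode("ascii", "replace")  # each unencodable character becomes "?"
ignored = text.encode("ascii", "ignore")    # unencodable characters are dropped

# Lossless: encode with the same codec the bytes were written in.
round_trip = text.encode("utf-8")
print(replaced, ignored, round_trip == raw)
```

Both exceptions report position 5 because the first non-ASCII item sits there in both the byte string and the decoded text. Whether `replace`, `ignore`, or keeping UTF-8 end to end is the right choice depends on what the downstream code expects.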