@msukmanowsky
Last active August 29, 2015 14:02

Python 2.7 contains a bug when dealing with percent-encoded Unicode strings such as:

>>> import urlparse
>>> url = u"http%3A%2F%2F%C5%A1%C4%BC%C5%AB%C4%8D.org%2F"
>>> print "{!r}".format(urlparse.unquote(url))
u'http://\xc5\xa1\xc4\xbc\xc5\xab\xc4\x8d.org/'
>>> print urlparse.unquote(url)
http://Å¡Ä¼Å«Ä.org/

The result returned from urlparse.unquote is incorrect. Each percent-escaped byte has been turned into the Unicode code point with the same value (in effect, the bytes were decoded as Latin-1) instead of the byte sequence being decoded as UTF-8. We can see this when we print the unquoted result, which shows the Latin-1 reading of the bytes: http://Å¡Ä¼Å«Ä.org/ (the final byte, 8D, has no printable glyph).

ASCII / LATIN-1 http://www.ascii-code.com/

Hex  Character  Description
C5   Å          Latin capital letter A with ring above
A1   ¡          Inverted exclamation mark
C4   Ä          Latin capital letter A with diaeresis
BC   ¼          Fraction one quarter
C5   Å          Latin capital letter A with ring above
AB   «          Left double angle quotes
C4   Ä          Latin capital letter A with diaeresis
8D   (none)     No printable character (a control code in Latin-1, unassigned in Windows-1252)
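
The two readings of these bytes can be compared directly. The following is a small sketch (the variable names are illustrative and not part of the original session) that decodes the same eight bytes as Latin-1 and as UTF-8, and shows that a string already mis-decoded as Latin-1 can be repaired by reversing that step. It is written to behave the same on Python 2.7 and Python 3.

```python
# The eight bytes produced by percent-decoding %C5%A1%C4%BC%C5%AB%C4%8D.
raw = b"\xc5\xa1\xc4\xbc\xc5\xab\xc4\x8d"

# Latin-1 maps every byte to the code point with the same value,
# so eight bytes become eight characters (one per row of the table).
as_latin1 = raw.decode("latin-1")

# UTF-8 reads the same bytes as four two-byte sequences,
# which are the four characters the URL was meant to contain.
as_utf8 = raw.decode("utf-8")

# A string that was already mis-decoded can be repaired by undoing the
# Latin-1 step and decoding the recovered bytes as UTF-8.
repaired = as_latin1.encode("latin-1").decode("utf-8")

print(repaired == as_utf8)
```

The repair works only because Latin-1 is a lossless one-to-one mapping between bytes and the first 256 code points.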

The current standard for percent-encoded strings states:

The generic URI syntax mandates that new URI schemes that provide for the representation of character data in a URI must, in effect, represent characters from the unreserved set without translation, and should convert all other characters to bytes according to UTF-8, and then percent-encode those values. This requirement was introduced in January 2005 with the publication of RFC 3986. URI schemes introduced before this date are not affected.

In other words, the percent-encoded bytes should be interpreted as UTF-8, not Latin-1. The proper result is returned when the Unicode string is encoded to a byte string before it is passed to unquote:

>>> print "{!r}".format(urlparse.unquote(url.encode("utf-8")))
'http://\xc5\xa1\xc4\xbc\xc5\xab\xc4\x8d.org/'
>>> print "{!r}".format(urlparse.unquote(url.encode("utf-8")).decode("utf-8"))
u'http://\u0161\u013c\u016b\u010d.org/'
>>> print urlparse.unquote(url.encode("utf-8"))
http://šļūč.org/

So the solution is either to monkey patch the unquote function, or to make sure that Unicode strings are encoded to UTF-8 byte strings before they are passed to any function that uses unquote (such as parse_qs or parse_qsl), and then to decode the result back to Unicode.
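
The second option can be wrapped in a small helper. This is a minimal sketch, not a drop-in fix: `unquote_utf8`, `_unquote_bytes` and `text_type` are hypothetical names introduced here, and the Python 3 branch exists only so the snippet can also be run on a modern interpreter (where `urllib.parse.unquote` already assumes UTF-8).

```python
try:
    from urlparse import unquote as _unquote_bytes  # Python 2.7
    text_type = unicode  # this name only exists on Python 2
except ImportError:
    # Python 3 fallback: unquote_to_bytes returns the raw decoded bytes.
    from urllib.parse import unquote_to_bytes as _unquote_bytes
    text_type = str


def unquote_utf8(value):
    """Percent-decode value, treating the escaped bytes as UTF-8.

    Hypothetical helper: accepts a Unicode or byte string and always
    returns a Unicode string.
    """
    if isinstance(value, text_type):
        # Encode first so unquote works on bytes rather than code points.
        value = value.encode("utf-8")
    return _unquote_bytes(value).decode("utf-8")


url = u"http%3A%2F%2F%C5%A1%C4%BC%C5%AB%C4%8D.org%2F"
print(repr(unquote_utf8(url)))
```

Note that the final decode raises UnicodeDecodeError if the escaped bytes are not valid UTF-8, which may be the case for URLs produced by older schemes; callers that need to tolerate such input would have to choose their own fallback.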
