Python 2.7 contains a bug when dealing with percent-encoded Unicode strings such as:
>>> import urlparse
>>> url = u"http%3A%2F%2F%C5%A1%C4%BC%C5%AB%C4%8D.org%2F"
>>> print "{!r}".format(urlparse.unquote(url))
u'http://\xc5\xa1\xc4\xbc\xc5\xab\xc4\x8d.org/'
>>> print urlparse.unquote(url)
http://Å¡Ä¼Å«Ä.org/
The result returned from urlparse.unquote is incorrect: each percent-encoded
byte has become a separate Latin-1 character in the Unicode string, instead of
the bytes being decoded together as UTF-8. We can see this when we try to print
the unquoted result, which shows the Latin-1 interpretation of the bytes,
http://Å¡Ä¼Å«Ä.org/ (the final byte, 8D, has no printable character).
The corresponding extended ASCII / Latin-1 characters (from http://www.ascii-code.com/):
Hex | Character | Description |
---|---|---|
C5 | Å | Latin capital letter A with ring above |
A1 | ¡ | Inverted exclamation mark |
C4 | Ä | Latin capital letter A with diaeresis |
BC | ¼ | Fraction one quarter |
C5 | Å | Latin capital letter A with ring above |
AB | « | Left double angle quotes |
C4 | Ä | Latin capital letter A with diaeresis |
8D | (none) | Nothing defined; unused in extended ASCII |
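To make the table concrete, here is a small sketch (not part of the original session) that decodes the same eight bytes two ways. Decoding as Latin-1 gives one code point per byte, which is effectively what unquote produced; decoding as UTF-8 combines each pair of bytes into a single character. The variable names are made up for illustration.

```python
# -*- coding: utf-8 -*-
# The eight bytes behind %C5%A1%C4%BC%C5%AB%C4%8D.
raw = b"\xc5\xa1\xc4\xbc\xc5\xab\xc4\x8d"

# One code point per byte: the incorrect interpretation.
as_latin1 = raw.decode("latin-1")

# Two bytes per character: the intended interpretation.
as_utf8 = raw.decode("utf-8")

print(repr(as_latin1))
print(repr(as_utf8))
print("%d code points vs %d code points" % (len(as_latin1), len(as_utf8)))
```

The Latin-1 result has eight code points and the UTF-8 result has four, one for each letter of the host name.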
The current standard for percent-encoded strings states:
The generic URI syntax mandates that new URI schemes that provide for the representation of character data in a URI must, in effect, represent characters from the unreserved set without translation, and should convert all other characters to bytes according to UTF-8, and then percent-encode those values. This requirement was introduced in January 2005 with the publication of RFC 3986. URI schemes introduced before this date are not affected.
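As a rough illustration of that rule (a sketch, not text from the standard), the two steps can be written out by hand: convert the characters to bytes according to UTF-8, then percent-encode the bytes. The try/except import is only there so the snippet runs on both Python 2.7 and Python 3, and label is a made-up variable name.

```python
# -*- coding: utf-8 -*-
try:
    from urllib import quote        # Python 2.7
except ImportError:
    from urllib.parse import quote  # Python 3

# The four non-ASCII letters from the example host name.
label = u"\u0161\u013c\u016b\u010d"

# Step 1: convert the characters to bytes according to UTF-8.
utf8_bytes = label.encode("utf-8")

# Step 2: percent-encode those bytes.
encoded = quote(utf8_bytes)

print(encoded)
```

This yields the same escapes that appear in the example URL, %C5%A1%C4%BC%C5%AB%C4%8D.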
In other words, the percent-encoded bytes should be assumed to be UTF-8, not
ASCII or Latin-1. The proper result is returned if the string is encoded to a
byte string before calling unquote, and the result is decoded back to Unicode
afterwards.
>>> print "{!r}".format(urlparse.unquote(url.encode("utf-8")))
'http://\xc5\xa1\xc4\xbc\xc5\xab\xc4\x8d.org/'
>>> print "{!r}".format(urlparse.unquote(url.encode("utf-8")).decode("utf-8"))
u'http://\u0161\u013c\u016b\u010d.org/'
>>> print urlparse.unquote(url.encode("utf-8"))
http://šļūč.org/
So the solution is either to monkey patch the unquote function, or to ensure
that, prior to calling any function that uses unquote (like parse_qs or
parse_qsl), Unicode strings are encoded to a byte string, for example as UTF-8.
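The second option could be wrapped in a small helper along these lines. This is only a sketch: unquote_utf8 is a hypothetical name, it assumes the escapes really are UTF-8, and the Python 3 fallback import is there solely so the snippet runs on both interpreters.

```python
# -*- coding: utf-8 -*-
try:
    # Python 2.7: unquote returns a byte string when given a byte string.
    from urlparse import unquote as _unquote_to_bytes
except ImportError:
    # Python 3 fallback, used only to keep this sketch runnable there.
    from urllib.parse import unquote_to_bytes as _unquote_to_bytes


def unquote_utf8(value):
    """Hypothetical helper: percent-decode value, treating escapes as UTF-8.

    Accepts a Unicode or byte string and always returns a Unicode string.
    Raises UnicodeDecodeError if the decoded bytes are not valid UTF-8.
    """
    if not isinstance(value, bytes):
        # Encode first so that unquote works on bytes, not code points.
        value = value.encode("utf-8")
    return _unquote_to_bytes(value).decode("utf-8")


url = u"http%3A%2F%2F%C5%A1%C4%BC%C5%AB%C4%8D.org%2F"
print(repr(unquote_utf8(url)))
```

The same encode-first step applies before handing a Unicode query string to parse_qs or parse_qsl on Python 2.7; the values they return would then still need to be decoded from UTF-8.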