msukmanowsky/python_urlparse_unicode_woes.md

## python_urlparse_unicode_woes.md

      
Python 2.7's urlparse.unquote contains a bug when dealing with percent-encoded
Unicode strings such as:
>>> import urlparse
>>> url = u"http%3A%2F%2F%C5%A1%C4%BC%C5%AB%C4%8D.org%2F"
>>> print "{!r}".format(urlparse.unquote(url))
u'http://\xc5\xa1\xc4\xbc\xc5\xab\xc4\x8d.org/'
>>> print urlparse.unquote(url)
http://Å¡Ä¼Å«Ä�.org/

The result returned from urlparse.unquote is incorrect. Each percent-decoded
byte has been turned into the Unicode code point with the same value
(effectively a Latin-1 decode) instead of the bytes being decoded together as
UTF-8. We can see this when we print the unquoted result, which shows the
Latin-1 character for each byte: http://Å¡Ä¼Å«Ä�.org/

ASCII / Latin-1 values, from http://www.ascii-code.com/:


| Hex | Character | Description |
|-----|-----------|-------------|
| C5 | Å | Latin capital letter A with ring above |
| A1 | ¡ | Inverted exclamation mark |
| C4 | Ä | Latin capital letter A with diaeresis |
| BC | ¼ | Fraction one quarter |
| C5 | Å | Latin capital letter A with ring above |
| AB | « | Left double angle quotes |
| C4 | Ä | Latin capital letter A with diaeresis |
| 8D |  | Nothing defined, unused in ASCII |

The current standard for percent-encoded strings states:

> The generic URI syntax mandates that new URI schemes that provide for the representation of character data in a URI must, in effect, represent characters from the unreserved set without translation, and should convert all other characters to bytes according to UTF-8, and then percent-encode those values. This requirement was introduced in January 2005 with the publication of RFC 3986. URI schemes introduced before this date are not affected.
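
The quoted rule can be illustrated with a short sketch. The helper name `percent_encode_utf8` and the constant `UNRESERVED` are hypothetical, and the function is a simplification: it escapes every character outside the unreserved set, including reserved ones such as ":" and "/", which happens to match how the example URL above was encoded. It is written to run on both Python 2.7 and Python 3.

```python
import string

# Unreserved characters per RFC 3986: letters, digits, "-", ".", "_", "~".
UNRESERVED = string.ascii_letters + string.digits + "-._~"


def percent_encode_utf8(text):
    """Percent-encode every character of `text` outside the unreserved set."""
    pieces = []
    # bytearray yields integers on both Python 2.7 and Python 3.
    for byte in bytearray(text.encode("utf-8")):
        char = chr(byte)
        if byte < 128 and char in UNRESERVED:
            pieces.append(char)  # unreserved: kept without translation
        else:
            pieces.append("%{:02X}".format(byte))  # one escape per UTF-8 byte
    return "".join(pieces)


print(percent_encode_utf8(u"http://\u0161\u013c\u016b\u010d.org/"))
# http%3A%2F%2F%C5%A1%C4%BC%C5%AB%C4%8D.org%2F
```

Each of the four non-ASCII characters becomes two escapes because each occupies two bytes in UTF-8.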

In other words, the percent-encoded bytes should be assumed to be UTF-8, not
ASCII. The proper result is returned if the Unicode string is encoded to a byte
string before calling unquote, and the result is then decoded as UTF-8:
>>> print "{!r}".format(urlparse.unquote(url.encode("utf-8")))
'http://\xc5\xa1\xc4\xbc\xc5\xab\xc4\x8d.org/'
>>> print "{!r}".format(urlparse.unquote(url.encode("utf-8")).decode("utf-8"))
u'http://\u0161\u013c\u016b\u010d.org/'
>>> print urlparse.unquote(url.encode("utf-8"))
http://šļūč.org/

So the solution is either to monkey patch the unquote function, or to make sure
that Unicode strings are encoded to a byte string (for example as UTF-8) before
they are passed to any function that uses unquote (like parse_qs or parse_qsl).
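
As a less invasive alternative to monkey patching, the encode-first approach can be wrapped in a small helper. This is only a sketch under stated assumptions: `unquote_utf8` is a hypothetical name, the input is assumed to be percent-encoded UTF-8, and the Python 3 branch exists only so the same file runs on both versions (there, `urllib.parse.unquote` already decodes as UTF-8 by default).

```python
try:
    from urllib.parse import unquote as _unquote  # Python 3
    PY2 = False
except ImportError:
    from urlparse import unquote as _unquote  # Python 2.7
    PY2 = True


def unquote_utf8(value):
    """Unquote `value` and return text, treating the escapes as UTF-8."""
    if not PY2:
        # Python 3's unquote decodes percent-escapes as UTF-8 by default.
        return _unquote(value)
    # Python 2.7: hand unquote a byte string so it returns raw bytes,
    # then decode those bytes as UTF-8 ourselves.
    if isinstance(value, unicode):  # noqa: F821 (name only exists on Python 2)
        value = value.encode("utf-8")
    return _unquote(value).decode("utf-8")


url = u"http%3A%2F%2F%C5%A1%C4%BC%C5%AB%C4%8D.org%2F"
print(repr(unquote_utf8(url)))
```

The same encode-before-calling step applies to parse_qs and parse_qsl, although their results then need to be decoded value by value.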