karlcow/lxml-bug.py

Created Aug 4, 2012
Silly lxml bug in Python
>>> from lxml import etree
>>> xml = u'<?xml version="1.0" encoding="utf-8" ?><foo><bar/></foo>'
>>> etree.XML(xml)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "lxml.etree.pyx", line 2736, in lxml.etree.XML (src/lxml/lxml.etree.c:54437)
File "parser.pxi", line 1569, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:82685)
ValueError: Unicode strings with encoding declaration are not supported.
>>> etree.HTML(xml)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "lxml.etree.pyx", line 2708, in lxml.etree.HTML (src/lxml/lxml.etree.c:54160)
File "parser.pxi", line 1569, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:82685)
ValueError: Unicode strings with encoding declaration are not supported.
>>> etree.__version__
u'2.3.3'
>>> xml = u"<foo><bar/></foo>"
>>> etree.HTML(xml)
<Element html at 0x105364870>
>>> etree.XML(xml)
<Element foo at 0x105395a00>
@karlcow karlcow commented Aug 4, 2012

I posted a comment on the lxml bug report: https://bugs.launchpad.net/lxml/+bug/613302
I'm not sure why it was set to Won't Fix, even with the explanation given there.

@kernc kernc commented Feb 7, 2013

It's a bug, I agree, and @kennethreitz doesn't seem to intend to fix his part either: https://github.com/kennethreitz/requests/issues/465

Anyway, the lxml bug above is easily worked around with:

>>> from lxml import etree
>>> xml = u'<?xml version="1.0" encoding="utf-8" ?><foo><bar/></foo>'
>>> xml = xml.encode('utf-8')  # added line: encode to bytes in the declared encoding (utf-8 here)
>>> etree.XML(xml)
<Element foo at 0x5b44c90>
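
The workaround above hard-codes UTF-8. A slightly more general sketch, assuming the declaration inside the string names the encoding you actually want, reads the encoding from the declaration and falls back to UTF-8 when there is none. The helper name `to_parser_bytes` and the regular expression are illustrative assumptions, not part of lxml.

```python
import re

# Captures the encoding named in a leading XML declaration, if there is one.
_DECL_ENCODING = re.compile(
    r'^\s*<\?xml\s[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def to_parser_bytes(text, default="utf-8"):
    """Encode a unicode string to bytes using the encoding it declares.

    Hypothetical helper: falls back to `default` when no declaration is found.
    """
    match = _DECL_ENCODING.match(text)
    encoding = match.group(1) if match else default
    return text.encode(encoding)


xml = u'<?xml version="1.0" encoding="utf-8" ?><foo><bar/></foo>'
data = to_parser_bytes(xml)  # a byte string, which can be passed to etree.XML(data)
```

Because the result is a byte string, the parser reads the declaration itself, which is the case the error message says is supported.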
@sigmavirus24 sigmavirus24 commented Feb 7, 2013

@kernc we won't fix it because it isn't a bug. If you're using requests to get this string then the following should always work:

import requests
from lxml import etree

r = requests.get('http://example.com')
elem = etree.XML(r.content)

If you instead use r.text, that is when you'll run into problems, because r.text is an already-decoded unicode string while r.content is the raw bytes. On the other hand, from this gist, it seems clear this is an issue with lxml and not requests: the call with a unicode string that carries an encoding declaration fails, while the call with a unicode string without one works. And from the error and the discussion on Launchpad, it seems like this is intentional.
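
If only the decoded text is at hand (r.text, for example), one option is to drop the declaration before parsing, since the gist shows lxml accepting a unicode string that has none. This is a minimal sketch under that assumption; `strip_xml_declaration` is a hypothetical helper, not part of lxml or requests.

```python
import re

# Matches a leading XML declaration such as <?xml version="1.0" encoding="utf-8" ?>.
# The \s after "xml" keeps other processing instructions (e.g. <?xml-stylesheet ...?>) intact.
_XML_DECL = re.compile(r'^\s*<\?xml\s[^>]*\?>\s*')


def strip_xml_declaration(text):
    """Return `text` without its leading XML declaration, if it has one."""
    return _XML_DECL.sub(u'', text, count=1)


text = u'<?xml version="1.0" encoding="utf-8" ?><foo><bar/></foo>'
cleaned = strip_xml_declaration(text)  # declaration removed; can be passed to etree.XML(cleaned)
```

Passing bytes, as in the requests snippet above, is still the simpler route when the raw response is available.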
