@karlcow
Created August 4, 2012 15:21
Silly lxml bug in Python
>>> from lxml import etree
>>> xml = u'<?xml version="1.0" encoding="utf-8" ?><foo><bar/></foo>'
>>> etree.XML(xml)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "lxml.etree.pyx", line 2736, in lxml.etree.XML (src/lxml/lxml.etree.c:54437)
File "parser.pxi", line 1569, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:82685)
ValueError: Unicode strings with encoding declaration are not supported.
>>> etree.HTML(xml)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "lxml.etree.pyx", line 2708, in lxml.etree.HTML (src/lxml/lxml.etree.c:54160)
File "parser.pxi", line 1569, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:82685)
ValueError: Unicode strings with encoding declaration are not supported.
>>> etree.__version__
u'2.3.3'
>>> xml = u"<foo><bar/></foo>"
>>> etree.HTML(xml)
<Element html at 0x105364870>
>>> etree.XML(xml)
<Element foo at 0x105395a00>
@sigmavirus24

@kernc we won't fix it because it isn't a bug. If you're using requests to get this string then the following should always work:

import requests
from lxml import etree

r = requests.get('http://example.com')
elem = etree.XML(r.content)

If you use r.text instead, that is when you'll run into problems: r.text is a decoded unicode string, whereas r.content is the raw bytes. On the other hand, from this gist, it seems clear this is an lxml issue and not a requests one. A unicode string with an encoding declaration fails, while one without a declaration parses fine. And from the error and the discussion on Launchpad, it seems like this is intentional.
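
For anyone who already has the document as a unicode string rather than bytes, here is a minimal sketch of one possible workaround, assuming Python 3 and assuming the declared encoding really matches the one you encode with. The helper name parse_xml_text is hypothetical and not part of lxml or requests.

```python
from lxml import etree


def parse_xml_text(text, encoding="utf-8"):
    """Parse XML held in a str that may carry an encoding declaration.

    Hypothetical helper, not part of lxml. Assumption: `encoding`
    matches the encoding named in the XML declaration, if there is one.
    """
    # lxml accepts bytes with an encoding declaration, so hand it bytes.
    data = text.encode(encoding) if isinstance(text, str) else text
    return etree.XML(data)


xml = '<?xml version="1.0" encoding="utf-8" ?><foo><bar/></foo>'

try:
    # Same call as in the gist: str input plus an encoding declaration.
    etree.XML(xml)
except ValueError as exc:
    print("str input rejected:", exc)

root = parse_xml_text(xml)
print(root.tag)  # foo
```

Encoding back to bytes lets the parser read the declaration itself, which is the same reason r.content works where r.text does not.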
