Last active
January 26, 2016 23:53
Timing execution of BeautifulSoup's prettify and lxml.tostring()
Beautiful Soup uses Python's built-in html.parser unless another parser is installed or requested.
It is possible to use lxml, as @Justin42 said.
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#you-need-a-parser
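A minimal sketch of what that looks like, assuming bs4 is installed and lxml may or may not be. The `make_soup` helper and the sample markup are made up for illustration; the parser is picked by the second argument to `BeautifulSoup`.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical sample markup, only here to have something to parse.
HTML = "<html><body><p>Hello, <b>world</b></p></body></html>"


def make_soup(markup, parser="lxml"):
    """Parse markup with the requested parser.

    Falls back to the built-in html.parser when the requested
    parser (for example lxml) is not installed.
    """
    try:
        return BeautifulSoup(markup, parser)
    except FeatureNotFound:
        return BeautifulSoup(markup, "html.parser")


soup = make_soup(HTML)
print(soup.b.get_text())
```

Naming the parser explicitly also keeps results comparable between machines, since otherwise Beautiful Soup picks whichever parser happens to be installed.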
I believe most people writing scrapers in Python regard lxml as the fastest parser.
I used bs4 when I started out scraping and quickly switched to lxml; there is nothing better than lxml. I just don't like how hard it is to install. That's my only complaint about lxml.
Beautiful Soup doesn't add this much overhead by itself. I believe the default parser is html.parser, which is already quite a bit slower than lxml. It would be interesting to see the results of Beautiful Soup using the lxml parser. Any chance that could be added?
Also, since html.parser has had a lot of performance improvements between versions, it's probably a good idea to mention which version of Python is being used.
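For anyone who wants to try this, here is a rough sketch of such a comparison. It is not the gist's original script: the `SAMPLE` markup and the `time_prettify` helper are assumptions made up for illustration, and the numbers will vary by machine. It times parse plus `prettify()` for each parser and prints the Python version alongside.

```python
import platform
import timeit

from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical test document: a small fragment repeated to get measurable work.
SAMPLE = "<html><body>" + "<div><p>row <b>x</b></p></div>" * 200 + "</body></html>"


def time_prettify(markup, parser, number=5):
    """Return average seconds per parse-and-prettify run.

    Returns None when the requested parser is not installed.
    """
    try:
        BeautifulSoup(markup, parser)
    except FeatureNotFound:
        return None
    total = timeit.timeit(
        lambda: BeautifulSoup(markup, parser).prettify(), number=number
    )
    return total / number


if __name__ == "__main__":
    # Report the interpreter version, since html.parser speed depends on it.
    print("Python", platform.python_version())
    for parser in ("html.parser", "lxml"):
        seconds = time_prettify(SAMPLE, parser)
        if seconds is None:
            print(parser, "not installed")
        else:
            print(parser, "%.4f s per run" % seconds)
```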