@dufferzafar
Last active January 26, 2016 23:53
Timing execution of BeautifulSoup's prettify and lxml.tostring()

ghost commented Jan 26, 2016

soup = BeautifulSoup(html_contents)

what decoder is being used here?

@dufferzafar
Author

@Socialery I am not sure what you mean. :/

@Justin42

Beautiful Soup doesn't add this much overhead by itself. I believe the default parser is html.parser, which is already quite a bit slower than lxml. It would be interesting to see the results of Beautiful Soup using the lxml parser. Any chance that could be added?

soup = BeautifulSoup(html_contents, "lxml")

Also, since html.parser has seen a lot of performance improvements between versions, it's probably a good idea to mention which version of Python is being used.
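A minimal sketch of the comparison suggested above, assuming `html_contents` is an HTML string (the sample document here is made up for illustration) and that both bs4 and lxml are installed:

```python
# Hypothetical timing sketch: BeautifulSoup's default html.parser
# vs. the lxml backend, using the stdlib timeit module.
import timeit

from bs4 import BeautifulSoup

# Stand-in input; replace with the real html_contents being benchmarked.
html_contents = "<html><body>" + "<p>hello</p>" * 1000 + "</body></html>"

for parser in ("html.parser", "lxml"):
    elapsed = timeit.timeit(
        lambda: BeautifulSoup(html_contents, parser).prettify(),
        number=10,
    )
    print(f"{parser}: {elapsed:.3f}s for 10 runs")
```

Passing the parser name as the second argument is all it takes to switch backends, so the same timing loop covers both cases.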

@medecau

medecau commented Jan 26, 2016

Beautiful Soup uses html.parser by default.
It is possible to use lxml, as @Justin42 said.
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#you-need-a-parser

I believe most people writing scrapers in Python regard lxml as the fastest parser.
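For reference, this is what the pure-lxml path from the gist's title looks like; a small sketch with a made-up input string, assuming lxml is installed:

```python
# Hypothetical sketch: parse and serialize directly with lxml,
# skipping BeautifulSoup entirely.
from lxml import etree, html

# Stand-in input; replace with the real html_contents.
html_contents = "<html><body><p>hello</p></body></html>"

tree = html.fromstring(html_contents)
# pretty_print gives indented output comparable to BeautifulSoup's prettify().
output = etree.tostring(tree, pretty_print=True, encoding="unicode")
print(output)
```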

@IAlwaysBeCoding

I used bs4 when I started out scraping and quickly switched to lxml; there is nothing better than lxml. I just don't like how hard it is to install, that's my only complaint about lxml.
