Beautiful Soup doesn't add this much overhead by itself. I believe the default parser is html.parser, which is already quite a bit slower than lxml. It would be interesting to see the results of Beautiful Soup using the lxml parser. Any chance that could be added?
soup = BeautifulSoup(html_contents, "lxml")
Also, since html.parser has had a lot of performance improvements between versions, it's probably a good idea to mention which version of Python is being used.
Beautiful Soup uses html.parser by default.
It is possible to use lxml as @Justin42 said.
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#you-need-a-parser
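For context, the html.parser backend mentioned above is just the standard library's HTMLParser, which you can use directly without Beautiful Soup. This is a minimal sketch (the class name and sample markup are made up for illustration, not from the gist):

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags using the stdlib parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the start tag
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href")


parser = LinkCollector()
parser.feed('<a href="https://example.com">example</a>')
print(parser.links)
```

Beautiful Soup wraps a parser like this (or lxml's, if installed) and builds its navigable tree on top, which is where the extra overhead comes from.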
I believe lxml is regarded as the fastest parser by most people writing scrapers in Python.
I used bs4 when I started out scraping and quickly switched to lxml; there is nothing better than lxml. I just don't like how hard it is to install, that's my only complaint about lxml.
@Socialery I am not sure what you mean. :/