@MercuryRising
Created November 12, 2012 19:29
Pyquery, lxml, BeautifulSoup comparison
from bs4 import BeautifulSoup as bs
from pyquery import PyQuery as pq
from lxml.html import fromstring
import re
import requests
import time


def Timer():
    # Generator: each next() yields the seconds elapsed since the previous next().
    a = time.time()
    while True:
        c = time.time()
        yield time.time() - a
        a = c


timer = Timer()

url = "http://www.python.org/"
html = requests.get(url).text

num = 100000
print('\n==== Total trials: %s =====' % num)

next(timer)  # reset the clock before the first benchmark

# BeautifulSoup with the lxml parser backend
soup = bs(html, 'lxml')
for x in range(num):
    paragraphs = soup.find_all('p')
t = next(timer)
print('bs4 total time: %.1f' % t)

# PyQuery
d = pq(html)
for x in range(num):
    paragraphs = d('p')
t = next(timer)
print('pq total time: %.1f' % t)

# lxml with a CSS selector (requires the cssselect package)
tree = fromstring(html)
for x in range(num):
    paragraphs = tree.cssselect('p')
t = next(timer)
print('lxml (cssselect) total time: %.1f' % t)

# lxml with an XPath expression
tree = fromstring(html)
for x in range(num):
    paragraphs = tree.xpath('.//p')
t = next(timer)
print('lxml (xpath) total time: %.1f' % t)

# Regex: the opener must be exactly "<p>" or "< >" and the match cannot
# span lines, so paragraphs such as <p class="..."> are missed.
for x in range(num):
    paragraphs = re.findall('<[p ]>.*?</p>', html)
t = next(timer)
print('regex total time: %.1f (doesn\'t find all p)\n' % t)
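The regex case is labelled "doesn't find all p" because the character class `[p ]` matches a single `p` or a single space, so only a bare `<p>` opener is accepted, and `.` does not cross newlines by default. The sketch below shows the difference on a small sample. The sample markup and the broader pattern are assumptions for illustration, not part of the original benchmark, and a regex remains fragile for real HTML compared with a parser.

```python
import re

# Hypothetical sample markup, not taken from python.org.
SAMPLE = """<div>
<p>plain</p>
<p class="intro">with attribute</p>
<p>spans
two lines</p>
</div>"""

# Pattern from the benchmark: the opener must be exactly "<p>" or "< >".
ORIGINAL = re.compile(r'<[p ]>.*?</p>')

# Broader pattern (an assumption about what "all p" should mean):
# allow attributes after the tag name and let "." cross newlines.
BROADER = re.compile(r'<p\b[^>]*>.*?</p>', re.DOTALL | re.IGNORECASE)

original_hits = ORIGINAL.findall(SAMPLE)
broader_hits = BROADER.findall(SAMPLE)

print(len(original_hits), len(broader_hits))  # prints "1 3"
```

Swapping the broader pattern into the benchmark would change what is being timed, so the regex figures reported in the comments would no longer be comparable.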
@guptarohit

Results using Python 3.7.3

==== Total trials: 100000 =====
bs4 total time: 94.1
pq total time: 9.5
lxml (cssselect) total time: 8.6
lxml (xpath) total time: 5.9
regex total time: 12.9 (doesn't find all p)

@andriyor

I tried selectolax. In this benchmark it is about twice as fast as lxml with XPath (3.4 vs 6.4) and about three times as fast as lxml with cssselect (3.4 vs 10.0).
https://rushter.com/blog/python-fast-html-parser/

from selectolax.parser import HTMLParser

tree = HTMLParser(html)
for x in range(num):
    paragraphs = tree.css('p')
t = next(timer)
print('selectolax total time: %.1f' % t)

==== Total trials: 100000 =====
bs4 total time: 95.4
pq total time: 10.9
lxml (cssselect) total time: 10.0
lxml (xpath) total time: 6.4
regex total time: 14.4 (doesn't find all p)
selectolax total time: 3.4
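
To make the comparison with selectolax concrete, the totals from this run can be divided by the selectolax total. This is only arithmetic on the figures reported above (seconds, since the script takes differences of `time.time()`); the dictionary and variable names are illustrative.

```python
# Totals for 100000 trials, copied from the run above.
totals = {
    "bs4": 95.4,
    "pq": 10.9,
    "lxml (cssselect)": 10.0,
    "lxml (xpath)": 6.4,
    "regex": 14.4,
    "selectolax": 3.4,
}

baseline = totals["selectolax"]

# Ratio of each library's total to the selectolax total, rounded to one decimal.
speedups = {name: round(t / baseline, 1) for name, t in totals.items()}

for name, ratio in speedups.items():
    print("%-18s %4.1fx the selectolax time" % (name, ratio))
```

On these numbers lxml with XPath takes about 1.9 times as long as selectolax, and lxml with cssselect about 2.9 times as long.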

@deedy5

deedy5 commented Apr 24, 2021

Python 3.9.2

==== Total trials: 100000 =====
bs4 total time: 31.9
pq total time: 4.9
lxml (cssselect) total time: 4.4
lxml (xpath) total time: 3.1
regex total time: 8.5 (doesn't find all p)

@hokwanhung

Python 3.10.4

==== Total trials: 100000 =====
bs4 total time: 30.1
pq total time: 2.8
lxml (cssselect) total time: 2.6
lxml (xpath) total time: 2.0
regex total time: 6.3 (doesn't find all p)

@xavierskip

Python 3.10.1

==== Total trials: 100000 =====
bs4 total time: 45.9
pq total time: 4.6
lxml (cssselect) total time: 4.3
lxml (xpath) total time: 3.3
regex total time: 8.4 (doesn't find all p)

@p3nj

p3nj commented Apr 30, 2023

Python 3.11.2

==== Total trials: 100000 =====
bs4 total time: 18.1
pq total time: 2.2
lxml (cssselect) total time: 2.2
lxml (xpath) total time: 1.7
regex total time: 5.2 (doesn't find all p)
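
In the script, each parser is constructed (`bs(html, 'lxml')`, `pq(html)`, `fromstring(html)`) after the timer has been read, so every total above includes one parse alongside the 100000 queries. A minimal sketch of timing only the repeated call, and of reporting a per-call figure, is shown below. `time_calls` is a hypothetical helper, and the workload is a standard-library stand-in so the example runs without the third-party parsers.

```python
import re
import time


def time_calls(fn, trials):
    """Return (total_seconds, seconds_per_call) for calling fn() `trials` times."""
    start = time.perf_counter()  # monotonic, high-resolution clock
    for _ in range(trials):
        fn()
    total = time.perf_counter() - start
    return total, total / trials


# Stand-in workload (hypothetical sample markup, not python.org).
html = "<p>one</p><p>two</p>" * 50
pattern = re.compile(r"<p>.*?</p>")  # setup happens outside the timed region

total, per_call = time_calls(lambda: pattern.findall(html), trials=1000)
print("total: %.4f s, per call: %.2f us" % (total, per_call * 1e6))
```

To time one of the parsers this way, build the tree first and pass something like `lambda: tree.xpath('.//p')` as `fn`.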
