@miketahani
Created November 5, 2015 22:38
python html parsing inconsistencies
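import re
from sys import argv

from bs4 import BeautifulSoup as bs
from pyquery import PyQuery as pq
from lxml import etree  # unused below; bs4's 'lxml' parser only needs lxml installed

# Path of the HTML file to inspect (last command-line argument).
filename = argv[-1]

# Naive anchor matcher. Note that `<a.+?>` also matches other tags whose
# names start with "a" (e.g. <abbr>, <article>), not only <a>.
anchors = re.compile(r'<a.+?>.+?<\/a>', re.DOTALL | re.I)

with open(filename, 'r') as infile:
    raw = infile.read()

# Parse the same markup with two different libraries.
doc_bs4 = bs(raw, 'lxml')
doc_pq = pq(raw)

# Count anchors three ways; the totals disagree.
print('bs4: %d' % len(doc_bs4.find_all('a')))
print('pq: %d' % len(doc_pq('a')))
print('re: %d' % len(anchors.findall(raw)))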
'''
produces:
bs4: 244
pq: 301
re: 8313
'''
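
One source of disagreement can be reproduced on a small scale. The sketch below is an illustration only: it reuses the regex from the script above, but the sample markup is made up, and the standard library's `html.parser` stands in for bs4 and pyquery (an assumption, chosen so the example has no third-party dependencies). It does not explain the specific 244 / 301 / 8313 figures, which depend on the input file.

```python
import re
from html.parser import HTMLParser

# Same pattern as in the script above.
ANCHORS = re.compile(r'<a.+?>.+?<\/a>', re.DOTALL | re.I)


class AnchorCounter(HTMLParser):
    """Counts <a> start tags as seen by an HTML tokenizer."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.count += 1


# Hypothetical sample markup, not taken from the original input file.
SAMPLE = (
    '<a href="/one">one</a>'
    '<!-- <a href="/old">old</a> -->'    # anchor inside a comment
    '<abbr title="x">HTML</abbr>'        # tag whose name starts with "a"
    '<a href="/two"></a>'                # empty anchor
    '<a href="/three">three</a>'
)

regex_matches = ANCHORS.findall(SAMPLE)
regex_count = len(regex_matches)

counter = AnchorCounter()
counter.feed(SAMPLE)
parser_count = counter.count

print('re: %d' % regex_count)
print('html.parser: %d' % parser_count)
for match in regex_matches:
    print(repr(match))
```

On this sample the regex reports 4 matches and the tokenizer reports 3 anchors. The regex counts the anchor inside the comment, and its third match starts at `<abbr` and runs through the empty `<a href="/two"></a>`, so the two counts differ even where they appear to overlap.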