Skip to content

Instantly share code, notes, and snippets.

@andresriancho
Created June 22, 2014 15:40
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 1 You must be signed in to fork a gist
  • Save andresriancho/62ab7578fa8cbba64735 to your computer and use it in GitHub Desktop.
Save andresriancho/62ab7578fa8cbba64735 to your computer and use it in GitHub Desktop.
Phishtank XML parsing (as HTML because the XML is broken)
import lxml.etree as etree
class CollectorTarget(object):
def __init__(self):
self.events = []
self.urls = 0
def start(self, tag, attrib):
#self.events.append("start %s %r" % (tag, dict(attrib)))
if tag == 'url':
self.urls += 1
def end(self, tag):
return
self.events.append("end %s" % tag)
def data(self, data):
return
self.events.append("data %r" % data)
def comment(self, text):
return
self.events.append("comment %s" % text)
def close(self):
return "closed!"
f = file('phishtank.xml')
target = CollectorTarget()
parser = etree.HTMLParser(recover=True, target=target)
tree = etree.parse(f, parser)
print target.urls
@andresriancho
Copy link
Author

The trick is in etree.HTMLParser(recover=True, target=target), don't use XMLParser since the xml file is invalid/broken and things won't work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment