Created
January 24, 2016 01:40
-
-
Save FirefighterBlu3/db3b8962c44291cd19e0 to your computer and use it in GitHub Desktop.
Prevent BeautifulSoup from translating HTML charrefs and entities
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
import bs4.dammit | |
import bs4.builder._htmlparser | |
from bs4 import BeautifulSoup | |
# when analyzing spam, we always run into situations where the sender tries | |
# hard to obfuscate their input in order to sneak by spam detecting engines. | |
# instead of "ABC", they'll use HTML Entity references; ABC in | |
# order to extract the viewer readable segments, we need a parser. most | |
# parsers try to be smart and clean things up. BeautifulSoup and lxml are | |
# two common parsers. lxml is largely compiled C so we can't tweak it very | |
# easily. | |
# step #1 | |
# BeautifulSoup insists on always doing entity substitution and there's no | |
# way to politely tell it to fuck off. override the hex->int and | |
# word->symbol conversions, simply append our data to the growing stack | |
_handle_data = bs4.builder._htmlparser.BeautifulSoupHTMLParser.handle_data | |
bs4.builder._htmlparser.BeautifulSoupHTMLParser.handle_charref = lambda cls,s: _handle_data(cls, '&#'+s+';') | |
bs4.builder._htmlparser.BeautifulSoupHTMLParser.handle_entityref = lambda cls,s: _handle_data(cls, '&'+s+';') | |
# step #2 | |
# BeautifulSoup insists on further ensuring printed data is always tidy and | |
# semantically correct, thus it ALWAYS does entity substitution even after | |
# we refused to do it above. the below ensures the __str__ methods don't | |
# attempt to mangle the serialized data. this simply returns the original | |
# matched input when the substitution methods are called | |
bs4.dammit.EntitySubstitution._substitute_html_entity = lambda o: o.group(0) | |
bs4.dammit.EntitySubstitution._substitute_xml_entity = lambda o: o.group(0) | |
# now let's merrily do what the fuck we want | |
body = '<html><body><p>Ӓ’</p></body></html>' | |
soup = BeautifulSoup(body, 'html.parser') | |
assert str(soup.find('p')) == '<p>Ӓ’</p>' |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
same here. Thanks for saving my day😂