Created
November 11, 2010 23:12
A way to remove all HTML attributes with BeautifulSoup
from BeautifulSoup import BeautifulSoup


def _remove_attrs(soup):
    for tag in soup.findAll(True):
        tag.attrs = None
    return soup


def example():
    doc = '<html><head><title>test</title></head><body id="foo" onload="whatever"><p class="whatever">junk</p><div style="background: yellow;" id="foo" class="blah">blah</div></body></html>'
    print 'Before:\n%s' % doc
    soup = BeautifulSoup(doc)
    clean_soup = _remove_attrs(soup)
    print '\nAfter:\n%s' % clean_soup
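
The snippet above targets BeautifulSoup 3 on Python 2. For readers on Python 3 with the `bs4` package, a roughly equivalent sketch might look like the following. It assumes `bs4` is installed and uses the built-in `html.parser` backend; the function name `remove_all_attrs` is made up for this illustration. In `bs4`, `tag.attrs` is a plain dict, so it can be emptied with `clear()`.

```python
from bs4 import BeautifulSoup


def remove_all_attrs(soup):
    """Strip every attribute from every tag in the tree, in place."""
    for tag in soup.find_all(True):  # True matches every tag
        tag.attrs.clear()  # attrs is a plain dict in bs4
    return soup


doc = (
    '<html><head><title>test</title></head>'
    '<body id="foo" onload="whatever"><p class="whatever">junk</p>'
    '<div style="background: yellow;" id="foo" class="blah">blah</div>'
    '</body></html>'
)
soup = remove_all_attrs(BeautifulSoup(doc, "html.parser"))
print(soup)
```

The tree is modified in place, so the returned object is the same `soup` that was passed in.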
I ended up using the following to efficiently "blacklist" attributes from a tag in place (I needed to continue using the `Tag` afterwards), which is all I needed to do in my case. The `clear()` method that @edif used seems to be the best way to remove all of the attributes, though I only needed to remove a subset.

It's the inverse of what @WNiels provided. I stumbled across this gist, curious about how BS `Tag` objects would handle having `del` used on them. It turns out they seem to handle it fine.

This is what I did. It's relatively efficient, which was important for me because it runs across a corpus of hundreds of thousands of HTML files. I included PEP 3107 type hints/annotations for clarity.
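
A minimal sketch of this kind of in-place blacklist filtering, not necessarily the exact code used. It assumes `bs4` on Python 3 with the built-in `html.parser` backend; the name `strip_attrs` and the sample markup are hypothetical.

```python
from typing import Iterable

from bs4 import BeautifulSoup
from bs4.element import Tag


def strip_attrs(tag: Tag, blacklist: Iterable[str]) -> Tag:
    """Delete the blacklisted attributes from `tag` in place and return it."""
    for name in blacklist:
        if name in tag.attrs:
            del tag.attrs[name]  # tag.attrs is a plain dict
    return tag


# Hypothetical sample markup, only for demonstration.
html = '<form action="/login" method="post" onsubmit="track()" style="color: red"></form>'
form = BeautifulSoup(html, "html.parser").form

print(form.attrs.items())  # before
strip_attrs(form, ("onsubmit", "style"))
print(form.attrs.items())  # after
```

Because the attribute dict is edited directly, the same `Tag` object stays usable afterwards and no copy of the tree is made.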
Example input tag:

As `dict_items` (before):

As `dict_items` (after):

EDIT/NOTE: In this particular example (`<FORM>`), only the `<FORM>` tag's attributes will be cleaned up. You will need to use `find_all` explicitly on the list of form Tags if you want to perform the same filtering on, for example, the `<INPUT>` tags within the `<FORM>`.
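
One way the note above could be applied is sketched here, assuming `bs4` with the built-in `html.parser` backend. The `BLACKLIST` set and the sample markup are hypothetical; the idea is to call `find_all(True)` on each form so that nested tags such as `<input>` get the same filtering.

```python
from bs4 import BeautifulSoup

# Hypothetical set of attribute names to drop.
BLACKLIST = {"style", "onclick", "onsubmit"}

# Hypothetical sample markup, only for demonstration.
html = (
    '<form action="/login" onsubmit="track()">'
    '<input name="user" style="color: red">'
    '<input name="pw" type="password" onclick="peek()">'
    '</form>'
)
soup = BeautifulSoup(html, "html.parser")

for form in soup.find_all("form"):
    # Filter the form itself plus every tag nested inside it.
    for tag in [form, *form.find_all(True)]:
        # intersection() builds a new set, so deleting while looping is safe.
        for name in BLACKLIST.intersection(tag.attrs):
            del tag.attrs[name]

print(soup)
```

Passing `"input"` instead of `True` to the inner `find_all` would restrict the filtering to `<input>` tags only.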