@bradmontgomery
Created November 11, 2010 23:12
A way to remove all HTML attributes with BeautifulSoup
from BeautifulSoup import BeautifulSoup


def _remove_attrs(soup):
    for tag in soup.findAll(True):
        tag.attrs = None
    return soup


def example():
    doc = '<html><head><title>test</title></head><body id="foo" onload="whatever"><p class="whatever">junk</p><div style="background: yellow;" id="foo" class="blah">blah</div></body></html>'
    print 'Before:\n%s' % doc
    soup = BeautifulSoup(doc)
    clean_soup = _remove_attrs(soup)
    print '\nAfter:\n%s' % clean_soup
@WNiels

WNiels commented Nov 11, 2019

How can I remove all tags except those in a whitelist?
If the whitelist contains the 'a' and 'img' tags, how can I remove all other tags (<script>, ...) but keep links and images?

for k in list(tag.attrs.keys()):
    print(k)
    if k not in attr_whitelist:
        tag.attrs.pop(k, None)
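
A minimal sketch of one way to whitelist both tags and attributes with bs4, assuming Python 3 and the built-in html.parser. The names TAG_WHITELIST, ATTR_WHITELIST, DROP_WITH_CONTENT and keep_whitelisted are made up for illustration. Non-whitelisted tags are unwrapped (their text is kept), while script and style are removed together with their contents:

```python
from bs4 import BeautifulSoup

# Hypothetical whitelists, adjust to your needs.
TAG_WHITELIST = {"a", "img"}
ATTR_WHITELIST = {"href", "src", "alt"}
# Tags whose contents should disappear as well (assumption).
DROP_WITH_CONTENT = ["script", "style"]


def keep_whitelisted(soup):
    """Strip every tag and attribute that is not whitelisted, in place."""
    # First pass: remove script/style elements entirely.
    while True:
        unwanted = soup.find(DROP_WITH_CONTENT)
        if unwanted is None:
            break
        unwanted.decompose()

    # Second pass: unwrap other tags, filter attributes on kept tags.
    for tag in soup.find_all(True):
        if tag.name not in TAG_WHITELIST:
            tag.unwrap()  # drop the tag, keep its children
        else:
            tag.attrs = {k: v for k, v in tag.attrs.items()
                         if k in ATTR_WHITELIST}
    return soup


doc = ('<html><head><script>alert(1)</script></head><body>'
       '<p class="x">See <a href="/doc" onclick="track()">the docs</a> '
       '<img src="a.png" style="border:0"></p></body></html>')
soup = keep_whitelisted(BeautifulSoup(doc, features="html.parser"))
print(soup)
```

Unwrapping rather than decomposing keeps the visible text of removed tags; swap unwrap() for decompose() if that text should go too.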

@edif

edif commented Dec 26, 2019

This gist is a bit old, but why not use the clear() method of the attrs dict?

import requests
from bs4 import BeautifulSoup

result = requests.get('https://en.wikibooks.org/wiki/HyperText_Markup_Language/Introduction')
src = result.content
doc = BeautifulSoup(src, features="html.parser")
head_title = doc.find('h1', {'id': 'firstHeading'})

print(head_title)
head_title.attrs.clear()
print(head_title)
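
The same call can be applied to every tag, which gives a rough bs4 equivalent of the original gist. This is a sketch assuming Python 3 and the built-in html.parser; the function name remove_all_attrs is made up here, and the sample document is the one from the gist so no network access is needed:

```python
from bs4 import BeautifulSoup


def remove_all_attrs(soup):
    """Clear the attribute dict of every tag in the tree, in place."""
    for tag in soup.find_all(True):  # True matches every tag
        tag.attrs.clear()
    return soup


doc = ('<html><head><title>test</title></head>'
       '<body id="foo" onload="whatever"><p class="whatever">junk</p>'
       '<div style="background: yellow;" id="foo" class="blah">blah</div>'
       '</body></html>')
soup = BeautifulSoup(doc, features="html.parser")
clean = remove_all_attrs(soup)
print(clean)
```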

@sripriyesha

The clear() function works perfectly! Thank you

@mzpqnxow

mzpqnxow commented Nov 24, 2020

I ended up using the following to efficiently "blacklist" attributes on a tag in place (I needed to keep using the Tag afterwards). The clear() method that @edif used seems to be the best way to remove all of the attributes, but I only needed to remove a subset.

This is the inverse of what @WNiels provided. I stumbled across this gist while curious about how BS Tag objects would handle having del used on their attributes. It turns out they handle it fine.

This is what I did. It's relatively efficient, which mattered to me because it runs across a corpus of hundreds of thousands of HTML files. I included PEP 3107 annotations (type hints) for clarity.

import bs4
from bs4 import BeautifulSoup


def _filter_input_attr(tag: bs4.element.Tag) -> None:
    """Remove a subset of attributes from a bs4 Tag"""
    filter_attr_name_set = {'autocorrect', 'autofocus', 'border', 'disabled',
                            'height', 'incremental', 'list', 'max', 'maxsize',
                            'min', 'multiple', 'pattern', 'required', 'size',
                            'step', 'tabindex', 'width'}
    drop_key_set = set(tag.attrs) & filter_attr_name_set
    for key in drop_key_set:
        del tag.attrs[key]

...
soup = BeautifulSoup(open('sample.html'), features='lxml')
for form in soup.find_all('form'):
    _filter_input_attr(form)
    print(form)
...

Example input tag:

<input autofocus="autofocus" class="std_textbox" id="user" name="user" placeholder="Enter your username." required="" tabindex="1" type="text" value=""/>

As dict_items (before):

dict_items([('name', 'user'), ('id', 'user'), ('autofocus', 'autofocus'), ('value', ''), ('placeholder', 'Enter your username.'), ('class', ['std_textbox']), ('type', 'text'), ('tabindex', '1'), ('required', '')])

As dict_items (after):

dict_items([('name', 'user'), ('id', 'user'), ('value', ''), ('placeholder', 'Enter your username.'), ('class', ['std_textbox']), ('type', 'text')])

EDIT/NOTE: In this particular example, only the attributes of the <form> tag itself are cleaned up. To apply the same filtering to tags nested inside it, such as <input>, you need to call find_all explicitly on each form Tag and filter the tags it returns.
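
A rough sketch of that nested filtering, assuming Python 3 and the built-in html.parser. The helper name filter_attrs, the shortened attribute set, and the sample form markup are illustrative and not from the original comment:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened, illustrative subset of the attribute names to drop.
FILTER_ATTR_NAMES = {"autofocus", "required", "tabindex", "width", "height"}


def filter_attrs(tag: Tag) -> None:
    """Delete blacklisted attributes from a single Tag, in place."""
    for key in set(tag.attrs) & FILTER_ATTR_NAMES:
        del tag.attrs[key]


html = ('<form action="/login" tabindex="2">'
        '<input autofocus="autofocus" class="std_textbox" id="user" '
        'name="user" placeholder="Enter your username." required="" '
        'tabindex="1" type="text" value=""/></form>')
soup = BeautifulSoup(html, features="html.parser")

for form in soup.find_all("form"):
    filter_attrs(form)                     # the <form> tag itself
    for field in form.find_all("input"):   # each nested <input>
        filter_attrs(field)

print(soup)
```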
