@bradmontgomery
Created November 11, 2010 23:12
A way to remove all HTML attributes with BeautifulSoup
from BeautifulSoup import BeautifulSoup

def _remove_attrs(soup):
    for tag in soup.findAll(True):
        tag.attrs = None
    return soup

def example():
    doc = '<html><head><title>test</title></head><body id="foo" onload="whatever"><p class="whatever">junk</p><div style="background: yellow;" id="foo" class="blah">blah</div></body></html>'
    print 'Before:\n%s' % doc
    soup = BeautifulSoup(doc)
    clean_soup = _remove_attrs(soup)
    print '\nAfter:\n%s' % clean_soup
@jimbaldwin123

If you have more than one attribute in a tag, this won't work, because del t[attr] truncates the list and ends the loop prematurely.

change
for attr, val in t.attrs:

to
for attr, val in reversed(t.attrs):

and that will fix it.
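
The pitfall described here is not specific to BeautifulSoup: removing items from a list while iterating forward over it makes the loop skip elements. The sketch below uses a plain Python list as a stand-in for the old list-of-tuples `attrs`; the sample data is an assumption made up for illustration.

```python
# Hypothetical stand-in for an attribute list of (name, value) tuples.
attrs = [("id", "foo"), ("class", "blah"), ("style", "background: yellow;")]

# Forward iteration: each removal shifts the remaining items left,
# so the loop skips whatever slides into the current position.
forward = list(attrs)
for pair in forward:
    forward.remove(pair)

# Reverse iteration: removals only affect positions already visited.
backward = list(attrs)
for pair in reversed(backward):
    backward.remove(pair)

print(forward)   # one attribute survives the forward loop
print(backward)  # the reversed loop removes everything
```

Iterating over a copy (`for pair in list(attrs)`) is another common way to avoid the problem.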

@bradmontgomery
Author

Just updated the gist. I discovered that you can use soup to find all Tags, and that setting a Tag's attrs property to None will effectively remove its attributes. I think this works, but YMMV.

@kavinyao

kavinyao commented Sep 2, 2012

Setting tag.attrs to None is too blunt: it results in errors if you use find() or select() on the tree later.

The better way is: tag.attrs = {}
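
A minimal sketch of this suggestion, assuming the `bs4` package on Python 3 and the built-in `html.parser`; the sample markup is adapted from the gist. Because every tag keeps a dictionary, later searches still work.

```python
from bs4 import BeautifulSoup

doc = ('<body id="foo" onload="whatever"><p class="whatever">junk</p>'
       '<div style="background: yellow;" id="bar" class="blah">blah</div></body>')
soup = BeautifulSoup(doc, "html.parser")

# Replace each tag's attribute dictionary with an empty one.
for tag in soup.find_all(True):
    tag.attrs = {}

# Searching the tree afterwards still works because attrs is a dict, not None.
remaining = [tag.attrs for tag in soup.find_all(True)]
first_p = soup.find("p")
cleaned = str(soup)
print(cleaned)
```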

@f126ck

f126ck commented Feb 2, 2017

How can I remove all tags except those in a whitelist?
If the whitelist contains the 'a' and 'img' tags, how can I remove all other tags (<script> ...) while keeping links and images?
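
One possible approach to this question, offered as a sketch rather than a definitive answer: drop `<script>` and `<style>` together with their contents using `decompose()`, then `unwrap()` every other tag that is not whitelisted so its text and whitelisted children survive. The names `ALLOWED_TAGS`, `DROP_WITH_CONTENT`, and `keep_only_allowed` are hypothetical, and treating script/style differently from other tags is an assumption about what the asker wants.

```python
from bs4 import BeautifulSoup

ALLOWED_TAGS = {"a", "img"}              # hypothetical whitelist from the question
DROP_WITH_CONTENT = ["script", "style"]  # assumption: their text should vanish too

def keep_only_allowed(html):
    soup = BeautifulSoup(html, "html.parser")
    # First pass: remove script/style tags along with everything inside them.
    for tag in soup.find_all(DROP_WITH_CONTENT):
        tag.decompose()
    # Second pass: replace every other non-whitelisted tag with its children.
    for tag in soup.find_all(True):
        if tag.name not in ALLOWED_TAGS:
            tag.unwrap()
    return soup

html = ('<div><p>See <a href="/x">this link</a></p>'
        '<script>alert(1)</script><img src="i.png"></div>')
result = keep_only_allowed(html)
print(result)
```

For untrusted input, a dedicated HTML sanitizer is usually a safer choice than hand-rolled filtering.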

@revotu

revotu commented Jul 14, 2017

@stefkes

stefkes commented Aug 15, 2018

This gives errors with Python 3: too many values to unpack
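
A likely cause, assuming the error comes from a loop of the form `for attr, val in t.attrs:` run against `bs4`: there, `attrs` is a dictionary, so iterating over it yields only the keys, and Python then tries to unpack each key string into two names. The sketch below reproduces that and shows `.items()` as the fix.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="whatever" style="color: red">junk</p>', "html.parser")
tag = soup.p

# Iterating a dict yields its keys; unpacking the string "class"
# into two names fails with "too many values to unpack".
try:
    for attr, val in tag.attrs:
        pass
    unpack_error = None
except ValueError as exc:
    unpack_error = exc

# .items() yields (name, value) pairs, which unpack as intended.
pairs = list(tag.attrs.items())
print(unpack_error)
print(pairs)
```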

@JohnDotOwl

Is there any way to remove specific attributes?

@WNiels

WNiels commented Nov 11, 2019

How can I remove all tags except those in a whitelist?
If the whitelist contains the 'a' and 'img' tags, how can I remove all other tags (<script> ...) while keeping links and images?

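# attr_whitelist holds the attribute names to keep; tag is a bs4 Tag.
# list(...) takes a snapshot of the keys so the dict can be modified in the loop.
for k in list(tag.attrs.keys()):
    print(k)
    if k not in attr_whitelist:
        tag.attrs.pop(k, None)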
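
A self-contained version of this loop, as a sketch: it filters attributes (not tags) against a whitelist. The contents of `attr_whitelist` and the sample markup are assumptions chosen for illustration.

```python
from bs4 import BeautifulSoup

attr_whitelist = {"href", "src"}  # hypothetical: attribute names to keep

html = ('<a href="/x" class="btn" onclick="track()">link</a>'
        '<img src="i.png" width="10" style="border: 0">')
soup = BeautifulSoup(html, "html.parser")

for tag in soup.find_all(True):
    # Snapshot the keys so entries can be removed while looping.
    for k in list(tag.attrs.keys()):
        if k not in attr_whitelist:
            tag.attrs.pop(k, None)

link_attrs = soup.a.attrs
img_attrs = soup.img.attrs
print(soup)
```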

@edif

edif commented Dec 26, 2019

This gist is a bit old but why not use the clear() function!?

import requests
from bs4 import BeautifulSoup

result = requests.get('https://en.wikibooks.org/wiki/HyperText_Markup_Language/Introduction')
src = result.content
doc = BeautifulSoup(src, features="html.parser")
head_title = doc.find('h1', {'id': 'firstHeading'})

print(head_title)
head_title.attrs.clear()
print(head_title)

@sripriyesha

The clear() function works perfectly! Thank you

@mzpqnxow

mzpqnxow commented Nov 24, 2020

I ended up using the following to efficiently "blacklist" attributes from a tag in place (I needed to continue using the Tag afterwards), which is all I needed to do in my case. The clear() method that @edif used seems to be the best way to remove all of the attributes, but I only needed to remove a subset.

It's the inverse of what @WNiels provided. I stumbled across this gist, curious about how BS Tag objects would handle having del used on them. It turns out they handle it fine.

This is what I did. It's relatively efficient, which was important for me because it runs across a corpus of hundreds of thousands of HTML files. I included PEP 3107 annotations (type hints) for clarity.

import bs4
from bs4 import BeautifulSoup

def _filter_input_attr(tag: bs4.element.Tag) -> None:
    """Remove a subset of attributes from a bs4 Tag"""
    filter_attr_name_set = {'autocorrect', 'autofocus', 'border', 'disabled',
                            'height', 'incremental', 'list', 'max', 'maxsize',
                            'min', 'multiple', 'pattern', 'required', 'size',
                            'step', 'tabindex', 'width'}
    # Only attributes that are both present and blacklisted need deleting
    drop_key_set = set(tag.attrs) & filter_attr_name_set
    for key in drop_key_set:
        del tag.attrs[key]

...
soup = BeautifulSoup(open('sample.html'), features='lxml')
for form in soup.find_all('form'):
    _filter_input_attr(form)
    print(form)
...

Example input tag:

<input autofocus="autofocus" class="std_textbox" id="user" name="user" placeholder="Enter your username." required="" tabindex="1" type="text" value=""/>

As dict_items (before):

dict_items([('name', 'user'), ('id', 'user'), ('autofocus', 'autofocus'), ('value', ''), ('placeholder', 'Enter your username.'), ('class', ['std_textbox']), ('type', 'text'), ('tabindex', '1'), ('required', '')])

As dict_items (after):

dict_items([('name', 'user'), ('id', 'user'), ('value', ''), ('placeholder', 'Enter your username.'), ('class', ['std_textbox']), ('type', 'text')])

EDIT/NOTE: In this particular example (<FORM>), only the attributes of the <FORM> tag itself will be cleaned up. You will need to call find_all explicitly on each form Tag if you want to perform the same filtering on, for example, the <INPUT> tags within the <FORM>.
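
The note above can be sketched as follows. This is a shortened variant, not the original function: `FILTER_ATTRS` is a hypothetical subset of the blacklist, `filter_attrs` is an illustrative name, the markup is made up, and `html.parser` is used instead of `lxml` to avoid the extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical subset of the blacklist, shortened for the example.
FILTER_ATTRS = {"autofocus", "required", "tabindex"}

def filter_attrs(tag):
    """Delete blacklisted attributes from a single tag in place."""
    for key in set(tag.attrs) & FILTER_ATTRS:
        del tag.attrs[key]

html = ('<form tabindex="0" action="/login">'
        '<input autofocus="autofocus" id="user" name="user" '
        'required="" tabindex="1" type="text"/>'
        '</form>')
soup = BeautifulSoup(html, "html.parser")

for form in soup.find_all("form"):
    filter_attrs(form)                 # the <form> tag itself
    for child in form.find_all(True):  # every tag nested inside it
        filter_attrs(child)

form_attrs = soup.form.attrs
input_attrs = soup.input.attrs
print(soup)
```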
