from BeautifulSoup import BeautifulSoup


def _remove_attrs(soup):
    for tag in soup.findAll(True):
        tag.attrs = None
    return soup


def example():
    doc = '<html><head><title>test</title></head><body id="foo" onload="whatever"><p class="whatever">junk</p><div style="background: yellow;" id="foo" class="blah">blah</div></body></html>'
    print 'Before:\n%s' % doc
    soup = BeautifulSoup(doc)
    clean_soup = _remove_attrs(soup)
    print '\nAfter:\n%s' % clean_soup
Just updated the gist. I discovered that you can use soup to find all Tags, and that setting a Tag's attrs property to None will effectively remove its attributes. I think this works, but YMMV.
Setting tag.attrs to None is too blunt: it results in an error if you use find() or select() on the tree later.
The better way is: tag.attrs = {}
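A minimal sketch of the tag.attrs = {} approach, assuming the newer bs4 package and its built-in "html.parser" rather than the BeautifulSoup 3 import used in the gist. The helper name remove_all_attrs is made up for illustration.

```python
from bs4 import BeautifulSoup


def remove_all_attrs(soup):
    """Replace every tag's attribute dict with an empty one."""
    for tag in soup.find_all(True):  # True matches every tag
        tag.attrs = {}
    return soup


html = '<body id="foo" onload="whatever"><p class="whatever">junk</p></body>'
soup = remove_all_attrs(BeautifulSoup(html, "html.parser"))
cleaned = str(soup)

# Searching still works because attrs is an (empty) dict, not None
paragraph = soup.find("p")
print(cleaned)
print(paragraph)
```

Because attrs stays a dict, later calls such as find() keep working on the cleaned tree.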
Some useful issues: remove all HTML attributes with BeautifulSoup except some tags( ...)
This gives errors with Python 3: too many values to unpack
Is there any way to remove specific attributes?
How can I remove all tags except those in a whitelist?
If the whitelist contains the 'a' and 'img' tags, how can I remove all other tags (<script>...) while keeping links and images?
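One possible way to approach the tag whitelist question, as a rough sketch and not something taken from the gist: unwrap() non-whitelisted tags so their text survives, and decompose() tags such as script whose content should go too. It assumes bs4 with "html.parser"; TAG_WHITELIST, DROP_WITH_CONTENT and keep_only_whitelisted_tags are hypothetical names.

```python
from bs4 import BeautifulSoup

TAG_WHITELIST = {"a", "img"}             # tags to keep as-is (assumption)
DROP_WITH_CONTENT = {"script", "style"}  # tags removed together with their text


def keep_only_whitelisted_tags(html):
    soup = BeautifulSoup(html, "html.parser")
    # find_all returns a list up front, so modifying the tree in the loop is safe here
    for tag in soup.find_all(True):
        if tag.name in TAG_WHITELIST:
            continue
        if tag.name in DROP_WITH_CONTENT:
            tag.decompose()  # drop the tag and everything inside it
        else:
            tag.unwrap()     # drop the tag but keep its children
    return str(soup)


sample = '<div><script>alert(1)</script><p>Hi <a href="/x">link</a><img src="a.png"/></p></div>'
result = keep_only_whitelisted_tags(sample)
print(result)
```

Whether stray text from unwrapped tags should be kept is a judgment call; swap unwrap() for decompose() if it should not.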
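# Assumes `tag` is a bs4 Tag and `attr_whitelist` is a collection of attribute names to keep
for k in list(tag.attrs.keys()):  # copy the keys so the dict can be modified while looping
    print(k)
    if k not in attr_whitelist:
        tag.attrs.pop(k, None)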
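For reference, the same loop wrapped into a self-contained sketch that runs over a whole document. It assumes bs4 with "html.parser"; the whitelist contents and the function name keep_whitelisted_attrs are illustrative only.

```python
from bs4 import BeautifulSoup

ATTR_WHITELIST = {"href", "src"}  # attributes to keep (illustrative choice)


def keep_whitelisted_attrs(soup, attr_whitelist=ATTR_WHITELIST):
    """Remove every attribute whose name is not in attr_whitelist."""
    for tag in soup.find_all(True):
        # list(...) takes a snapshot of the keys before any are popped
        for k in list(tag.attrs.keys()):
            if k not in attr_whitelist:
                tag.attrs.pop(k, None)
    return soup


soup = BeautifulSoup('<a href="/x" class="c" onclick="f()">x</a>', "html.parser")
filtered = str(keep_whitelisted_attrs(soup))
print(filtered)
```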
This gist is a bit old but why not use the clear() function!?
import requests
from bs4 import BeautifulSoup
result = requests.get('https://en.wikibooks.org/wiki/HyperText_Markup_Language/Introduction')
src = result.content
doc = BeautifulSoup(src, features="html.parser")
head_title = doc.find('h1', {'id': 'firstHeading'})
print(head_title)
head_title.attrs.clear()
print(head_title)
The clear() function works perfectly! Thank you.
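An offline variant of the clear() idea, in case fetching a live page is not convenient. This is only a sketch: the HTML string is made up, and "html.parser" is assumed as the parser.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the heading fetched from the live page
html = '<h1 id="firstHeading" class="title" lang="en">Intro</h1>'
doc = BeautifulSoup(html, "html.parser")

head_title = doc.find("h1", {"id": "firstHeading"})
head_title.attrs.clear()  # attrs is a plain dict, so dict.clear() empties it in place

cleared = str(head_title)
print(cleared)
```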
I ended up using the following to efficiently "blacklist" attributes from a tag in place (I needed to continue using the Tag afterwards), which is all I needed to do in my case. The clear() method that @edif used seems to be the best way to remove all of the attributes, though I only needed to remove a subset.
It's for the inverse of what @WNiels provided. I stumbled across this gist, curious about how BS Tag objects would handle having del used on them. Turns out it seems to handle them fine.
This is what I did. It's relatively efficient, which was important for me because it runs across a corpus of hundreds of thousands of HTML files. I included PEP 3107 type hints/annotations for clarity.
import bs4
from bs4 import BeautifulSoup


def _filter_input_attr(tag: bs4.element.Tag) -> None:
    """Remove a subset of attributes from a bs4 Tag"""
    filter_attr_name_set = {'autocorrect', 'autofocus', 'border', 'disabled',
                            'height', 'incremental', 'list', 'max', 'maxsize',
                            'min', 'multiple', 'pattern', 'required', 'size',
                            'step', 'tabindex', 'width'}
    # Intersect first, so only keys actually present on the tag are deleted
    drop_key_set = set(tag.attrs) & filter_attr_name_set
    for key in drop_key_set:
        del tag.attrs[key]

...
soup = BeautifulSoup(open('sample.html'), features='lxml')
for form in soup.find_all('form'):
    _filter_input_attr(form)
    print(form)
...
Example input tag:
<input autofocus="autofocus" class="std_textbox" id="user" name="user" placeholder="Enter your username." required="" tabindex="1" type="text" value=""/>
As dict_items (before):
dict_items([('name', 'user'), ('id', 'user'), ('autofocus', 'autofocus'), ('value', ''), ('placeholder', 'Enter your username.'), ('class', ['std_textbox']), ('type', 'text'), ('tabindex', '1'), ('required', '')])
As dict_items (after):
dict_items([('name', 'user'), ('id', 'user'), ('value', ''), ('placeholder', 'Enter your username.'), ('class', ['std_textbox']), ('type', 'text')])
EDIT/NOTE: In this particular example (<FORM>), only the <FORM> tag's own attributes will be cleaned up. You will need to call find_all explicitly on each form Tag if you want to perform the same filtering on, for example, the <INPUT> tags within the <FORM>.
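A short sketch of what that note describes: filtering the form and then the input tags nested inside it. It assumes bs4 with "html.parser"; the shortened FILTER_ATTRS set, the filter_attrs helper and the sample markup are illustrative stand-ins for the ones in the comment above.

```python
from bs4 import BeautifulSoup

# Shortened, illustrative version of the filter set from the comment above
FILTER_ATTRS = {"autofocus", "required", "tabindex"}


def filter_attrs(tag):
    """Delete the blacklisted attributes that are present on this tag."""
    for key in set(tag.attrs) & FILTER_ATTRS:
        del tag.attrs[key]


html = ('<form tabindex="0"><input autofocus="autofocus" id="user" name="user" '
        'required="" tabindex="1" type="text"/></form>')
soup = BeautifulSoup(html, "html.parser")

for form in soup.find_all("form"):
    filter_attrs(form)                    # the form tag itself
    for field in form.find_all("input"):  # then each input nested inside it
        filter_attrs(field)

print(soup)
```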
If you have more than one attribute in a tag, this won't work, because del t[attr] truncates the list and ends the loop prematurely.
Change
    for attr, val in t.attrs:
to
    for attr, val in reversed(t.attrs):
and that will fix it.
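The reversed() suggestion appears to target the older list-of-pairs attrs. In bs4, attrs is a dict, so a comparable sketch is to loop over a snapshot of the keys and delete from the real dict. This assumes bs4 with "html.parser"; strip_attrs is a hypothetical helper name.

```python
from bs4 import BeautifulSoup


def strip_attrs(t):
    """Delete every attribute of a bs4 Tag, one key at a time."""
    # list(t.attrs) copies the keys, so deleting does not disturb the loop
    for attr in list(t.attrs):
        del t[attr]  # del on a Tag removes that attribute


soup = BeautifulSoup('<div style="background: yellow;" id="foo" class="blah">blah</div>',
                     "html.parser")
t = soup.find("div")
strip_attrs(t)
stripped = str(t)
print(stripped)
```

Looping over t.attrs directly while deleting from it would raise a RuntimeError in Python 3, which is why the snapshot is taken first.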