@revotu
Last active January 27, 2024 06:48
Remove all HTML attributes with BeautifulSoup, except on some tags (<a>, <img>, ...)
from bs4 import BeautifulSoup


# Remove all attributes from every tag.
def _remove_all_attrs(soup):
    for tag in soup.find_all(True):
        tag.attrs = {}
    return soup


# Remove all attributes, except on whitelisted tags.
def _remove_all_attrs_except(soup):
    whitelist = ['a', 'img']
    for tag in soup.find_all(True):
        if tag.name not in whitelist:
            tag.attrs = {}
    return soup


# Remove all attributes, except on whitelisted tags,
# which keep only their 'href' and 'src' attributes.
def _remove_all_attrs_except_saving(soup):
    whitelist = ['a', 'img']
    for tag in soup.find_all(True):
        if tag.name not in whitelist:
            tag.attrs = {}
        else:
            # Iterate over a copy so entries can be deleted safely.
            attrs = dict(tag.attrs)
            for attr in attrs:
                if attr not in ['src', 'href']:
                    del tag.attrs[attr]
    return soup
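
A minimal usage sketch of the same idea, for illustration only. The helper `strip_attrs`, its parameters `keep_tags` and `keep_attrs`, and the sample HTML are hypothetical and not part of the gist; the built-in `html.parser` is assumed so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup


def strip_attrs(soup, keep_tags=("a", "img"), keep_attrs=("href", "src")):
    """Hypothetical helper: clear attributes on every tag, but let
    tags named in keep_tags retain the attributes in keep_attrs."""
    for tag in soup.find_all(True):
        if tag.name in keep_tags:
            tag.attrs = {k: v for k, v in tag.attrs.items() if k in keep_attrs}
        else:
            tag.attrs = {}
    return soup


# Sample input (made up for this sketch).
html = (
    '<div class="c" id="m">'
    '<a href="/home" class="nav" target="_blank">Home</a>'
    '<img src="logo.png" alt="Logo" width="40"/>'
    '</div>'
)
soup = strip_attrs(BeautifulSoup(html, "html.parser"))
print(soup)
```

After the call, the `<div>` should carry no attributes, the `<a>` only `href`, and the `<img>` only `src`.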
@revotu
Author

revotu commented Jul 14, 2017

Some useful references on related issues: "BeautifulSoup删除html文档中所有属性" (Removing all attributes from an HTML document with BeautifulSoup)

@mike667
mike667 commented Jan 7, 2018

Does this work with nested tags? In my test, soup.find_all(True) returns only the parent elements.
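
A small check, offered as a sketch rather than a diagnosis of the test described above: `find_all` searches all descendants by default (`recursive=True`), so nested tags should be visited too. Only direct children are returned when `recursive=False` is passed. The sample HTML is made up, and `html.parser` is assumed.

```python
from bs4 import BeautifulSoup

# Sample input (made up for this sketch): four levels of nesting.
html = (
    '<div class="outer"><p id="intro">Hi '
    '<span style="color:red"><b class="x">there</b></span></p></div>'
)
soup = BeautifulSoup(html, "html.parser")

# Default behaviour: every tag at every depth, in document order.
names = [tag.name for tag in soup.find_all(True)]

# With recursive=False only direct children of the soup are matched.
top_level = [tag.name for tag in soup.find_all(True, recursive=False)]

# Clearing attributes therefore reaches the nested tags as well.
for tag in soup.find_all(True):
    tag.attrs = {}
cleaned = str(soup)

print(names)
print(top_level)
print(cleaned)
```

Note that printing a parent tag also prints its children, which can make a result list look as if it held only parents.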

@SeaDude
SeaDude commented Jan 2, 2022

Instead of removing HTML tags for the whole soup, how is this done for a specific BS4 tag?

Example:

  • I have the soup: feed = BeautifulSoup(response, 'xml')
  • I parse a specific tag using item = feed.select_one('item')
  • If the resulting item contains <a>, <p>, and <img> tags that I want to remove, how is this accomplished?
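
One possible approach, sketched under assumptions: `find_all` can be called on a single tag, and it accepts a list of names, so the matching descendants of `item` can be removed without touching the rest of the soup. The helper `strip_tags` and the sample markup are hypothetical. `html.parser` and `find` are used here to keep the sketch dependency-free; the same calls should apply to an `item` obtained via `select_one` from an `'xml'` soup.

```python
from bs4 import BeautifulSoup


def strip_tags(node, names, keep_contents=True):
    """Hypothetical helper: remove descendants of `node` named in `names`.

    keep_contents=True  -> unwrap(): drop the tag, keep its children.
    keep_contents=False -> extract(): drop the tag and everything in it.
    """
    for tag in node.find_all(names):
        if keep_contents:
            tag.unwrap()
        else:
            tag.extract()
    return node


# Sample input (made up for this sketch).
html = (
    '<item><p>Hello <a href="x">link</a><img src="i.png"/> world</p></item>'
    '<p id="keep">outside</p>'
)
soup = BeautifulSoup(html, "html.parser")

item = soup.find("item")
strip_tags(item, ["a", "p", "img"])

print(item)                    # tags inside <item> are gone, text remains
print(soup.find(id="keep"))    # the <p> outside <item> is untouched
```

`unwrap()` keeps the text inside the removed tags; passing `keep_contents=False` discards that text along with the tags.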
