@thisbit
Created June 10, 2023 11:04
Regular expression, written with the help of ChatGPT, to clean up HTML
<style\b[^<]*>[\s\S]*?<\/style\b[^<]*>|<((?!(?:a|h[1-6]|p|strong|em|img|span)\b)\w+[^>]*)>|<\/?(?:div|section|link|span)\b[^>]*>|class\s*=\s*"[^"]*"|id\s*=\s*"[^"]*"|style\s*=\s*"[^"]*"|data-[^=]+="[^"]*"|\s+$|\t|&nbsp;|<\w+\b[^>]*><\/\w+\b[^>]*>|<\w+\b[^>]*><\/\w+\b[^>]*>\s*|<\/span>
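A minimal sketch of how the pattern above might be applied, assuming Python's `re` module (the gist does not say which tool or regex flavor it was written for). The pattern is copied verbatim and only split across lines; the names `PATTERN` and `clean_html` are made up for this example.

```python
import re

# The gist's pattern, unchanged, split into its alternatives for readability.
PATTERN = re.compile(
    r'<style\b[^<]*>[\s\S]*?<\/style\b[^<]*>'                 # whole <style> blocks
    r'|<((?!(?:a|h[1-6]|p|strong|em|img|span)\b)\w+[^>]*)>'   # opening tags not on the keep list
    r'|<\/?(?:div|section|link|span)\b[^>]*>'                 # div/section/link/span, open or close
    r'|class\s*=\s*"[^"]*"|id\s*=\s*"[^"]*"|style\s*=\s*"[^"]*"'
    r'|data-[^=]+="[^"]*"'
    r'|\s+$|\t|&nbsp;'
    r'|<\w+\b[^>]*><\/\w+\b[^>]*>'
    r'|<\w+\b[^>]*><\/\w+\b[^>]*>\s*'
    r'|<\/span>'
)


def clean_html(html: str) -> str:
    """Replace every match of the pattern with an empty string (hypothetical helper)."""
    return PATTERN.sub("", html)


sample = '<div class="box"><p class="intro" id="x">Hello&nbsp;<strong>world</strong></p></div>'
print(clean_html(sample))
```

With this sample the wrapper `div` and the `class`/`id` attributes are removed, but the spaces that separated the attributes stay behind inside the `<p  >` tag, because the attribute alternatives do not consume the whitespace in front of them.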
thisbit commented Jun 10, 2023

This is the prompt I crafted during a convo with ChatGPT:

write a regular expression that would do all the following steps in one expression, in the order I list them:

  1. remove any html element that IS NOT a, h1, h2, h3, h4, h5, h6, p, strong, em, img
  2. specifically take care to remove all <div>, <section> and <link> elements (opening and closing tag)
  3. remove any instance of class="any string here" from anywhere
  4. remove any instance of id="any string here" from anywhere
  5. remove any instance of style="any string here" from anywhere
  6. remove any instance of data- continued with anything, then ="any string here", from anywhere
  7. specifically take care to remove any style html element (opening and closing tag) and any character inside style elements (content inside this element is typically multiline)
  8. remove any trailing space that remained (this did not work)
  9. remove any tab space
  10. remove any &nbsp;
  11. remove any empty html elements other than img elements
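The steps above can also be sketched as a sequence of separate substitutions instead of one alternation. This is not the expression from the gist, just an illustration in Python under a few assumptions: the input is a plain HTML string, attribute values are double-quoted, and the names `KEEP`, `STEPS` and `clean_html` are invented here. Two steps are reordered on purpose: style blocks go first (so their contents disappear before the tags do), and trailing whitespace goes last. One possible reason step 8 did not work is that in many regex flavors `$` only matches at the end of the whole input unless a multiline flag is set; the gist does not say which tool was used, so that is a guess.

```python
import re

# Tag names the prompt wants to keep (step 1).
KEEP = r"(?:a|h[1-6]|p|strong|em|img)\b"

# (pattern, flags) pairs, applied one after another.
STEPS = [
    # step 7, moved first: <style> elements including their multiline content
    (r"<style\b[^>]*>.*?</style\s*>", re.DOTALL | re.IGNORECASE),
    # steps 1-2: any opening or closing tag whose name is not in KEEP
    (r"</?(?!" + KEEP + r")[a-zA-Z][^>]*>", re.IGNORECASE),
    # steps 3-6: class, id, style and data-* attributes, plus the space before them
    (r'\s+(?:class|id|style|data-[\w-]+)\s*=\s*"[^"]*"', 0),
    # steps 9-10: tab characters and &nbsp; entities
    (r"\t|&nbsp;", 0),
    # step 11: empty elements; <img> has no closing tag, so it is never matched
    (r"<(\w+)\b[^>]*>\s*</\1\s*>", 0),
    # step 8, moved last: re.MULTILINE makes $ match at the end of every line
    (r"[ \t]+$", re.MULTILINE),
]


def clean_html(html: str) -> str:
    """Apply each substitution in order (hypothetical helper)."""
    for pattern, flags in STEPS:
        html = re.sub(pattern, "", html, flags=flags)
    return html


sample = (
    '<style>\np { color: red; }\n</style>\n'
    '<div class="wrap" data-x="1">\n'
    '\t<p id="a" style="margin:0">Text&nbsp;</p>   \n'
    '<p></p>\n'
    '</div>\n'
)
print(clean_html(sample).strip())
```

Splitting the work into passes makes the order explicit and lets each pass carry its own flags, at the cost of scanning the input several times. Regular expressions remain a rough tool for HTML; an HTML parser would be the more robust choice for anything beyond a one-off cleanup.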
