@HON95
Created September 23, 2015 21:58
Regular expressions for cleaning trashy HTML produced by e.g. Word and Excel
# Regular expressions for cleaning trashy Office HTML. Meant for later extraction of content. Replace every match with an empty string.
# Note: This only removes trash I encountered.
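# Remove no-break spaces and the span, b, u, a and o:p tags (o:p is an Office-namespace element that Word emits). Only the tags are removed, their content is kept.
# The b/u/a alternative requires whitespace or ">" right after the tag name, so it does not also eat <body>, <br>, <ul> and the like.
(?:&nbsp;)|(?:\xA0)|(?:</?span[^>]*>)|(?:</?[bua](?:\s[^>]*)?>)|(?:</?o:p>)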
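# Remove attributes from html, head, div, p, table, tr and td elements
# The \s after the lookbehind keeps longer tag names such as <pre>, <header> or <track> from being truncated.
(?:(?<=<html)|(?<=<head)|(?<=<div)|(?<=<p)|(?<=<table)|(?<=<tr)|(?<=<td))\s[^>]*(?=>)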
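# Remove everything inside head (run after the attribute pattern above, since this expects a bare <head> tag)
# [\s\S] is used instead of . so the match spans line breaks without needing a dot-all flag.
(?<=<head>)(?:(?!</head>)[\s\S])*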
# Remove empty p elements
<p>\s*</p>
# Remove empty div elements
<div>\s*</div>
# Remove empty lines (assumes CRLF line endings; lines containing only spaces count as empty)
\n[ ]*\r
# Remove p tags directly inside td elements, preserving their content
(?:(?<=<td>)\s*<p>)|(?:</p>\s*(?=</td>))
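# A minimal sketch of applying the patterns in the order listed, assuming Python's re module.
# The function name clean_office_html and the sample input are hypothetical, not part of the original list.
# The patterns are written out in full so the sketch is self-contained; the whitespace guards after tag
# names and the use of [\s\S] are choices of this sketch. Order matters: attributes have to be stripped
# before the empty-element and td/p patterns can see bare <p>, <div> and <td> tags.

```python
import re

# Patterns in the order listed above; every match is replaced with "".
PATTERNS = [
    # no-break spaces and span, b, u, a, o:p tags
    r"(?:&nbsp;)|(?:\xA0)|(?:</?span[^>]*>)|(?:</?[bua](?:\s[^>]*)?>)|(?:</?o:p>)",
    # attributes of html, head, div, p, table, tr and td
    r"(?:(?<=<html)|(?<=<head)|(?<=<div)|(?<=<p)|(?<=<table)|(?<=<tr)|(?<=<td))\s[^>]*(?=>)",
    # everything inside head
    r"(?<=<head>)(?:(?!</head>)[\s\S])*",
    # empty p elements
    r"<p>\s*</p>",
    # empty div elements
    r"<div>\s*</div>",
    # empty lines (CRLF input)
    r"\n[ ]*\r",
    # p tags directly inside td
    r"(?:(?<=<td>)\s*<p>)|(?:</p>\s*(?=</td>))",
]


def clean_office_html(html: str) -> str:
    """Apply each pattern once, in order, deleting whatever it matches."""
    for pattern in PATTERNS:
        html = re.sub(pattern, "", html)
    return html


# Hypothetical Word-style input with CRLF line endings.
sample = "\r\n".join([
    '<html xmlns:o="urn:x">',
    '<head><style>p{}</style></head>',
    '<body>',
    '<p class=MsoNormal><span lang=EN>Hi <b>there</b>&nbsp;<o:p></o:p></span></p>',
    '<p class=MsoNormal><o:p>&nbsp;</o:p></p>',
    '<table><tr><td width=10><p class=X>cell</p></td></tr></table>',
    '</body>',
    '</html>',
])

if __name__ == "__main__":
    print(clean_office_html(sample))
```

# Each pattern is applied a single pass; nested leftovers (for example a div that only becomes empty
# after its empty p is removed) would need the list to be run again.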