@Shaz3e
Created July 4, 2015 22:17
Remove Microsoft Word HTML tags
<?php
function cleanHTML($html) {
    /// <summary>
    /// Removes all FONT, SPAN, DEL and INS tags, and all class, lang, style,
    /// size and face attributes.
    /// Designed to get rid of non-standard Microsoft Word HTML tags.
    /// </summary>
    // preg_replace() is used because ereg_replace() was removed in PHP 7.
    // Start by completely removing all unwanted tags.
    $html = preg_replace('#<(/)?(font|span|del|ins)[^>]*>#', '', $html);
    // Then run another pass over the HTML (twice), removing unwanted attributes.
    // Note: everything in the tag after the matched attribute is dropped as well.
    $pattern = '#<([^>]*)(class|lang|style|size|face)=("[^"]*"|\'[^\']*\'|[^>]+)([^>]*)>#';
    $html = preg_replace($pattern, '<$1>', $html);
    $html = preg_replace($pattern, '<$1>', $html);
    return $html;
}
?>
Microsoft Word generates a lot of extra markup: font and span tags, plus style and class attributes. This markup is useful inside Word itself, but when you paste text from Word into a web page, you end up with lots of useless tags and attributes. Here's a very handy function to remove them.
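To see what this kind of cleanup does to pasted markup, here is a minimal, self-contained sketch. It is not the gist's function. `cleanWordHtml` and the sample string are hypothetical names made up for illustration. The sketch assumes PHP 7 or later and uses `preg_replace()`. It also strips attributes one at a time in a loop, so attributes that are not on the list are kept.

```php
<?php
// Hypothetical variant for illustration; not the gist author's code.
function cleanWordHtml(string $html): string
{
    // Drop the unwanted tags themselves, keeping the text inside them.
    $html = preg_replace('#</?(font|span|del|ins)\b[^>]*>#i', '', $html);

    // Strip one unwanted attribute per tag per pass, until none remain.
    $attr = '#(<[^>]*?)\s+(?:class|lang|style|size|face)\s*=\s*("[^"]*"|\'[^\']*\'|[^\s>]+)#i';
    do {
        $html = preg_replace($attr, '$1', $html, -1, $count);
    } while ($count > 0);

    return $html;
}

// Sample input resembling markup pasted from Word (made up for this example).
$word = '<p class="MsoNormal" style="margin:0"><span lang="EN-US">Hello <font face="Arial">world</font></span></p>';
echo cleanWordHtml($word), "\n"; // prints <p>Hello world</p>
```

Regular expressions are a rough tool for HTML. For untrusted input, a real HTML parser or sanitizer is likely the safer choice.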