@lutzissler
Created July 31, 2012 07:32
Tidy HTML inserted by copy/pasting from Microsoft office (PHP)
// Regexps courtesy of 1st class media
// http://www.1stclassmedia.co.uk/developers/clean-ms-word-formatting.php
function tidy_office_html($str) {
    // Rules are applied in order; each one works on the output of the previous one.
    $replacements = array(
        // HTML comments and Office paragraph markers
        '/<!--.*?-->/s' => '',
        '/<o:p>\s*<\/o:p>/s' => '',
        '/<o:p>.*?<\/o:p>/s' => '&nbsp;',
        // Word-specific inline styles. Patterns that consume the closing quote
        // of the style attribute put it back so the attribute stays well-formed.
        '/\s*mso-[^:]+:[^;"]+;?/i' => '',
        '/\s*MARGIN: 0cm 0cm 0pt\s*;/i' => '',
        '/\s*MARGIN: 0cm 0cm 0pt\s*"/i' => '"',
        '/\s*TEXT-INDENT: 0cm\s*;/i' => '',
        '/\s*TEXT-INDENT: 0cm\s*"/i' => '"',
        '/\s*TEXT-ALIGN: [^\s;]+;?"/i' => '"',
        '/\s*PAGE-BREAK-BEFORE: [^\s;]+;?"/i' => '"',
        '/\s*FONT-VARIANT: [^\s;]+;?"/i' => '"',
        '/\s*tab-stops:[^;"]*;?/i' => '',
        '/\s*tab-stops:[^"]*/i' => '',
        // Font faces and families
        '/\s*face="[^"]*"/i' => '',
        '/\s*face=[^ >]*/i' => '',
        '/\s*FONT-FAMILY:[^;"]*;?/i' => '',
        // class, style and lang attributes
        '/<(\w[^>]*) class=([^ |>]*)([^>]*)/i' => '<$1$3',
        '/<(\w[^>]*) style="([^\"]*)"([^>]*)/i' => '<$1$3',
        '/\s*style="\s*"/i' => '',
        // Empty or attribute-less SPAN and FONT wrappers
        '/<SPAN\s*[^>]*>\s*&nbsp;\s*<\/SPAN>/i' => '&nbsp;',
        '/<SPAN\s*[^>]*><\/SPAN>/i' => '',
        '/<(\w[^>]*) lang=([^ |>]*)([^>]*)/i' => '<$1$3',
        '/<SPAN\s*>(.*?)<\/SPAN>/i' => '$1',
        '/<FONT\s*>(.*?)<\/FONT>/i' => '$1',
        ':<p>&nbsp;</p>:i' => '',
        // XML declarations (with or without a backslash before the "?"),
        // namespaced tags, and any remaining empty elements
        '/<\\\\?\?xml[^>]*>/i' => '',
        '/<\/?\w+:[^>]*>/i' => '',
        '/<([^\s>]+)[^>]*>\s*<\/\1>/s' => '',
    );
    foreach ($replacements as $pattern => $replacement) {
        $str = preg_replace($pattern, $replacement, $str);
    }
    return $str;
}
@lutzissler
Author

The foreach loop is there to avoid the two extra internal passes over the array that array_keys() and array_values() would make, as in

return preg_replace(array_keys($replacements), array_values($replacements), $str);

I did not benchmark this, though. For further optimization (at the cost of readability), the patterns and replacements could be defined as two separate arrays, which removes the need to split one array into keys and values.
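
A minimal sketch of that two-array variant follows. The function name `tidy_office_html_parallel` and the sample input are hypothetical, and only three of the original rules are included to keep it short. It relies on `preg_replace()` accepting arrays for both the pattern and the replacement and applying the pairs in array order.

```php
<?php
// Hypothetical variant of tidy_office_html(): patterns and replacements
// live in two parallel arrays, so no array_keys()/array_values() split is needed.
// Only a small subset of the original rules is shown.
function tidy_office_html_parallel($str) {
    $patterns = array(
        '/<!--.*?-->/s',          // HTML comments
        '/<o:p>\s*<\/o:p>/s',     // empty Office paragraph markers
        '/\s*face="[^"]*"/i',     // quoted face attributes
    );
    $replacements = array(
        '',
        '',
        '',
    );
    // Each pattern is applied in turn to the result of the previous one.
    return preg_replace($patterns, $replacements, $str);
}

// Sample input is made up for illustration.
$dirty = '<p><!-- x --><font face="Arial">Hello</font><o:p></o:p></p>';
$clean = tidy_office_html_parallel($dirty);
```

The trade-off is the one mentioned above: each pattern is no longer visually next to its replacement, so the two arrays have to be kept aligned by hand.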
