I have been struggling with a weird problem on one of my sites that prevented the site from functioning. One of the XML files used by this site is supposed to arrive in UTF-8, but unfortunately it had some extra characters that were not encoded properly. After looking at this site [http://perl-xml.sourceforge.net/faq/#encoding_conversion], I came up with a short regular expression of my own that converts any bytes outside the ASCII range to XML/HTML numeric entities:
s/([\x80-\xFF])/'&#' . ord($1) . ';'/gse;
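For illustration, here is a minimal sketch of how that substitution behaves on a byte string, assuming the character class is meant to match bytes 0x80 through 0xFF and the replacement is a decimal numeric entity. The helper name escape_high_bytes and the sample input are made up for this example.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): replaces every byte
# in the range 0x80-0xFF with a decimal numeric entity such as &#233;.
sub escape_high_bytes {
    my ($bytes) = @_;
    # /e evaluates the replacement as Perl code, /g repeats for every match
    $bytes =~ s/([\x80-\xFF])/'&#' . ord($1) . ';'/ge;
    return $bytes;
}

# \xE9 is a lone Latin-1 "e acute" byte, which is not valid UTF-8 by itself
my $input = "caf\xE9 <b>ok</b>";
print escape_high_bytes($input), "\n";   # prints: caf&#233; <b>ok</b>
```

Keep in mind that this works byte by byte, so it seems best suited to input where the stray bytes are single Latin-1 characters. A valid multi-byte UTF-8 sequence would be rewritten one byte at a time as well.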
On a related note, another issue that came up a while back was the use of ampersands that were not encoded as "&amp;". Here is another regex to solve that issue (I don't remember which site I got it from):
s/&(?!#?[xX]?(?:[0-9a-fA-F]+|\w{1,8});)/&amp;/g;
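As a quick check of the ampersand fix, here is a small sketch that runs the substitution over a string mixing bare ampersands with entities that are already encoded. It assumes the intended replacement is "&amp;" and that the lookahead uses \w for entity names. The helper name escape_bare_ampersands and the sample text are only for this example.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): escapes only those
# ampersands that do not already start something shaped like an entity.
sub escape_bare_ampersands {
    my ($text) = @_;
    # The negative lookahead skips "&" followed by an optional "#",
    # an optional "x"/"X", then hex digits or a 1-8 character name, then ";"
    $text =~ s/&(?!#?[xX]?(?:[0-9a-fA-F]+|\w{1,8});)/&amp;/g;
    return $text;
}

print escape_bare_ampersands('Tom & Jerry &amp; friends &#169; &#xA9; AT&T'), "\n";
# prints: Tom &amp; Jerry &amp; friends &#169; &#xA9; AT&amp;T
```

The pattern is a heuristic rather than a parser: it only checks the shape of what follows the ampersand, so an entity name longer than eight characters would be treated as a bare ampersand and escaped again.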