@yakovsh
Last active January 17, 2016 15:57
Fixing Malformed UTF-8 via Regex

I have been struggling with a weird problem on one of my sites that prevented the site from functioning. One of the XML files used by the site is supposed to be in UTF-8, but unfortunately it contained some extra characters that were not encoded properly. After looking at this site [http://perl-xml.sourceforge.net/faq/#encoding_conversion], I came up with a short regular expression of my own that converts any malformed UTF-8 characters to XML/HTML numeric entities:

s/([\x80-\xFF])/'&#' . ord($1) . ';'/gse;
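As a rough illustration, the substitution can be wrapped in a small helper and run against a byte string. The helper names `escape_high_bytes` and `escape_stray_bytes` are hypothetical and exist only for this sketch. Note that the pattern treats every byte from 0x80 to 0xFF as a single character, so a well-formed multi-byte UTF-8 sequence is escaped byte by byte as well. The second helper is an assumed variation, not something from the FAQ linked above: it leaves byte runs that look like UTF-8 sequences alone and escapes only the stray bytes, using a simplified shape check.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the substitution: every byte in the
# range 0x80-0xFF becomes a decimal numeric entity.
sub escape_high_bytes {
    my ($bytes) = @_;
    $bytes =~ s/([\x80-\xFF])/'&#' . ord($1) . ';'/gse;
    return $bytes;
}

# Assumed variation (not from the original post): keep byte runs shaped
# like multi-byte UTF-8 sequences and escape only the stray high bytes.
# The shape check is simplified; it does not reject overlong forms or
# surrogate code points.
sub escape_stray_bytes {
    my ($bytes) = @_;
    $bytes =~ s/((?:[\xC2-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF]{2}|[\xF0-\xF4][\x80-\xBF]{3})+)|([\x80-\xFF])/defined $1 ? $1 : '&#' . ord($2) . ';'/gse;
    return $bytes;
}

print escape_high_bytes("caf\xE9"), "\n";      # prints: caf&#233;
print escape_high_bytes("caf\xC3\xA9"), "\n";  # prints: caf&#195;&#169;

# Keeps the two-byte sequence \xC3\xA9 as is and escapes the lone \xE9.
print escape_stray_bytes("caf\xC3\xA9 \xE9"), "\n";
```

Which helper fits depends on the input: if the file is mostly valid UTF-8 with a few stray Latin-1 bytes, the second form is probably closer to what is wanted.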

On a related note, another issue that came up a while back is the use of an ampersand without encoding it as "&amp;". Here is another regex to solve that issue (I don't remember which site I got it from):

s/&(?!#?[xX]?(?:[0-9a-fA-F]+|\w{1,8});)/&amp;/g;
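A minimal sketch of how this behaves on mixed input, with the substitution wrapped in a hypothetical helper named `escape_bare_ampersands`. The negative lookahead skips any ampersand that already starts something shaped like an entity (named, decimal, or hexadecimal), so existing entities are not double-escaped.

```perl
use strict;
use warnings;

# Hypothetical helper: escape bare ampersands while leaving anything
# that already looks like an entity untouched.
sub escape_bare_ampersands {
    my ($text) = @_;
    $text =~ s/&(?!#?[xX]?(?:[0-9a-fA-F]+|\w{1,8});)/&amp;/g;
    return $text;
}

print escape_bare_ampersands('AT&T &amp; R&D &#169; &#xA9; &copy;'), "\n";
# prints: AT&amp;T &amp; R&amp;D &#169; &#xA9; &copy;
```

The check is deliberately loose: it does not verify that a named entity actually exists, so text such as `Q&A;` is left alone because it has the shape of an entity.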