Fixing Malformed UTF-8 via Regex

I have been struggling with a weird problem on one of my sites that prevented it from functioning. One of the XML files used by the site is supposed to arrive in UTF-8, but unfortunately it contained some extra characters that were not encoded properly. After looking at a site on the subject, I came up with a short regular expression of my own that converts the improperly encoded characters to XML/HTML numeric entities:

s/([\x80-\xFF])/'&#' . ord($1) . ';'/gse;
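A substitution over the whole 0x80-0xFF range also rewrites bytes that belong to correctly encoded UTF-8 characters. If only the stray bytes should be touched, one possible refinement is sketched below. It assumes the input is a raw byte string (not yet decoded), and the helper name `escape_stray_bytes` is made up for illustration. The byte ranges follow the Unicode standard's table of well-formed UTF-8 sequences.

```perl
use strict;
use warnings;

# One well-formed UTF-8 sequence; overlong forms and surrogates are excluded.
my $valid_utf8 = qr/
      [\x00-\x7F]
    | [\xC2-\xDF][\x80-\xBF]
    | \xE0[\xA0-\xBF][\x80-\xBF]
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
    | \xED[\x80-\x9F][\x80-\xBF]
    | \xF0[\x90-\xBF][\x80-\xBF]{2}
    | [\xF1-\xF3][\x80-\xBF]{3}
    | \xF4[\x80-\x8F][\x80-\xBF]{2}
/x;

# Hypothetical helper: takes a byte string, returns a byte string.
# Runs of valid UTF-8 are copied unchanged; any other byte becomes a
# numeric entity built from its byte value (i.e. it is read as Latin-1).
sub escape_stray_bytes {
    my ($bytes) = @_;
    $bytes =~ s/($valid_utf8+)|(.)/defined $1 ? $1 : '&#' . ord($2) . ';'/gse;
    return $bytes;
}

print escape_stray_bytes("caf\xE9 ok"), "\n";    # prints: caf&#233; ok
```

Reading a stray byte as Latin-1 is a guess about where the bad data came from; if the source was actually Windows-1252 or something else, the resulting entities would need a different mapping.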

On a related note, another issue that came up a while back is ampersands that appear in the text without being encoded as "&amp;". Here is another regex to solve that issue (I don't remember which site I got it from):
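The regex referred to here is not shown, so what follows is only a sketch of one common approach and not the pattern from that site. It assumes that an ampersand already followed by a named or numeric reference (such as `&amp;`, `&#38;` or `&#x26;`) should be left alone, and that every other ampersand should become `&amp;`. The helper name `escape_bare_ampersands` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: escape ampersands that do not start an entity.
sub escape_bare_ampersands {
    my ($text) = @_;
    # The negative lookahead skips "&name;", "&#123;" and "&#x1F;" forms.
    $text =~ s/&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)/&amp;/g;
    return $text;
}

print escape_bare_ampersands('AT&T &amp; R&D'), "\n";   # prints: AT&amp;T &amp; R&amp;D
```

The lookahead only checks the shape of a reference, not whether the name is defined, so something like `&foo;` would be left untouched. It also does not know about CDATA sections, where ampersands are meant to stay literal.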
