@chansen
Created December 26, 2011 22:22
Decodes UTF-8, interpreting ill-formed UTF-8 sequences as CP1252. Posted in reply to <http://blog.endpoint.com/2011/12/sanitizing-supposed-utf-8-data.html>
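#!/usr/bin/perl
use strict;
use warnings;

use Encode        qw[find_encoding];
use Unicode::UTF8 qw[decode_utf8];

{
    my $encoding = find_encoding('Windows-1252')
      or die q/Couldn't find Windows-1252 encoding/;

    # Fallback passed to decode_utf8() for ill-formed sequences:
    # substitute U+FFFD when $is_usv is true, otherwise reinterpret
    # the offending octets as Windows-1252.
    my $fallback = sub {
        my ($octets, $is_usv) = @_;
        return $is_usv ? "\x{FFFD}" : $encoding->decode($octets);
    };

    sub fix_latin {
        @_ == 1 || die q/Usage: fix_latin($octets)/;
        no warnings 'utf8';
        return decode_utf8($_[0], $fallback);
    }
}

# CP1252 curly quotes (0x91, 0x92) around a well-formed UTF-8 U+263A
my $octets = "\x91 Foo \xE2\x98\xBA \x92";

# Print the decoded string as a list of code points
printf "<%s>\n",
  join ' ', map { sprintf 'U+%.4X', ord $_ } split //, fix_latin($octets);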
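
For readers without Unicode::UTF8 installed, the same idea can be roughly approximated with core Encode and a regular expression. This is only a sketch under stated assumptions, not the gist author's method: `fix_latin_regex` and `$utf8_re` are hypothetical names, the pattern follows the well-formed byte sequences listed in the Unicode Standard, and the `$is_usv` branch of the original fallback is not reproduced (every byte outside a well-formed sequence is simply decoded as CP1252).

```perl
use strict;
use warnings;
use Encode qw[decode];

# Well-formed UTF-8 byte sequences (assumed to follow the Unicode
# Standard's table of well-formed UTF-8 byte sequences).
my $utf8_re = qr/
    [\x00-\x7F]
  | [\xC2-\xDF][\x80-\xBF]
  | \xE0[\xA0-\xBF][\x80-\xBF]
  | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
  | \xED[\x80-\x9F][\x80-\xBF]
  | \xF0[\x90-\xBF][\x80-\xBF]{2}
  | [\xF1-\xF3][\x80-\xBF]{3}
  | \xF4[\x80-\x8F][\x80-\xBF]{2}
/x;

# Hypothetical helper: expects a byte string, returns a character string.
sub fix_latin_regex {
    my ($octets) = @_;
    my $out = '';
    # Consume either a run of well-formed UTF-8 or one stray byte.
    while ($octets =~ /\G(?:((?:$utf8_re)+)|(.))/gs) {
        my ($good, $stray) = ($1, $2);
        $out .= defined $good
            ? decode('UTF-8',  $good)    # well-formed run
            : decode('cp1252', $stray);  # stray byte read as CP1252
    }
    return $out;
}

printf "<%s>\n",
  join ' ', map { sprintf 'U+%.4X', ord $_ }
  split //, fix_latin_regex("\x91 Foo \xE2\x98\xBA \x92");
```

With the sample input from the script above, the two stray bytes 0x91 and 0x92 come out as U+2018 and U+2019, and the three-byte sequence is decoded as U+263A. The regex version walks the input one match at a time in Perl, so it is likely slower than the fallback-based `decode_utf8` call on large inputs.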