@phluid61
Last active November 7, 2017 20:53
EPrints 'reencode' subroutine to heuristically detect certain non-UTF-8 encodings
#
# A percent-encoded URI does not necessarily have to encode UTF-8
# sequences. For example:
#
# <https://zhidao.baidu.com/question/210190177.html?qbl=relate_question_0&word=%C4%DA%B2%BF%C9%F3%BC%C6%CD%E2%CE%C4%B2%CE%BF%BC%
#
# The percent-encoded sequence is actually EUC-CN (a Chinese character
# encoding). Perl has no problem when the 'word' parameter is naively
# converted to octets; however, the database will complain when we
# attempt to insert those octets into a field that expects valid UTF-8
# character sequences.
#
# This function attempts to detect EUC-CN or Big5 character sequences
# and transcode them to UTF-8-encoded Unicode.
#
# As a final fallback, any octets that aren't successfully converted
# to UTF-8 are transcoded from Windows-1252.
#
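use Encode qw( decode );

# Single ASCII octet
my $ascii = qr/[\x00-\x7F]/;
# Two-octet EUC-CN and Big5 characters
my $euc = qr/[\xA1-\xF7][\xA1-\xFE]/;
my $big5 = qr/[\x81-\xFE]([\x40-\x7E]|[\xA1-\xFE])/;
# Whole strings: ASCII mixed with at least one two-octet character
my $euc_string = qr/^($ascii)*(($euc)+($ascii)*)+$/;
my $big5_string = qr/^($ascii)*(($big5)+($ascii)*)+$/;
# UTF-8 lead octets (2-, 3- and 4-octet sequences) and continuation octets
my $utf8_2 = qr/[\xC0-\xDF]/;
my $utf8_3 = qr/[\xE0-\xEF]/;
my $utf8_4 = qr/[\xF0-\xF7]/;
my $utf8_n = qr/[\x80-\xBF]/;
# Loose structural check only: overlong forms and surrogates are not rejected
my $utf8_string = qr/^(($ascii)|($utf8_2)($utf8_n)|($utf8_3)($utf8_n){2}|($utf8_4)($utf8_n){3})+$/;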
sub _reencode {
	use bytes;
	STRING: for ( @_ )
	{
		next STRING unless $_;
		eval {
			# Detect EUC-CN (GB2312 byte ranges; GBK extensions are not matched)
			if( $_ =~ $euc_string )
			{
				$_ = decode('euc-cn', $_, Encode::FB_CROAK);
			}
			# Detect Big5
			elsif( $_ =~ $big5_string )
			{
				$_ = decode('big5-eten', $_, Encode::FB_CROAK);
			}
			1;
		};
		eval {
			# Fall back to Windows-1252 (~= ISO-8859-1) for non-UTF-8
			if( $_ !~ $utf8_string )
			{
				$_ = decode('windows-1252', $_, Encode::FB_CROAK);
			}
			1;
		};
	}
	return @_;
}
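
The detection step can be tried in isolation. The following is a minimal sketch, not part of the gist: `percent_decode`, `$e_acute` and `$ambiguous` are hypothetical names introduced for illustration, the patterns repeat the byte ranges above using non-capturing groups, and the sample input is a short EUC-CN sequence rather than the truncated URL from the header comment. It also shows one limit of the heuristic: some valid UTF-8 octet pairs fall inside the EUC-CN ranges as well.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Hypothetical helper (not in the gist): turn %XX escapes into raw octets.
sub percent_decode
{
	my( $s ) = @_;
	$s =~ s/%([0-9A-Fa-f]{2})/chr( hex( $1 ) )/ge;
	return $s;
}

# Same byte ranges as the patterns above, with non-capturing groups.
my $ascii  = qr/[\x00-\x7F]/;
my $euc    = qr/[\xA1-\xF7][\xA1-\xFE]/;
my $utf8_n = qr/[\x80-\xBF]/;
my $euc_string  = qr/^(?:$ascii)*(?:(?:$euc)+(?:$ascii)*)+$/;
my $utf8_string = qr/^(?:$ascii|[\xC0-\xDF]$utf8_n|[\xE0-\xEF](?:$utf8_n){2}|[\xF0-\xF7](?:$utf8_n){3})+$/;

# %D6%D0%CE%C4 is the EUC-CN encoding of U+4E2D U+6587.
my $octets = percent_decode( '%D6%D0%CE%C4' );
my $text = $octets;
if( $octets =~ $euc_string )
{
	# LEAVE_SRC keeps $octets intact; FB_CROAK dies on malformed input.
	$text = decode( 'euc-cn', $octets, Encode::FB_CROAK | Encode::LEAVE_SRC );
}
printf "decoded %d octets into %d characters\n", length( $octets ), length( $text );

# Ambiguity: the UTF-8 octets for U+00E9 (C3 A9) satisfy both patterns.
my $e_acute = "\xC3\xA9";
my $ambiguous = ( $e_acute =~ $utf8_string && $e_acute =~ $euc_string ) ? 1 : 0;
print "C3 A9 matches both patterns\n" if $ambiguous;
```

Because `_reencode` tests the EUC-CN pattern first, input such as C3 A9 would take the EUC-CN branch. The source does not show whether callers rule out valid UTF-8 before calling it. Testing for UTF-8 first is one possible variant, but it trades one misdetection for another: EUC-CN pairs with a lead octet in C0-DF and a trail octet in A1-BF also pass the loose UTF-8 check.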