@nicwolff
Created December 16, 2011 01:59
Perl script to translate Windows CP1252 characters hidden in UTF-8 text
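#!/usr/bin/perl
use strict;
use warnings;
# This program translates stray Windows CP1252 bytes embedded in UTF-8 text into UTF-8.
# Well-formed 2- and 3-byte UTF-8 sequences pass through unchanged; any other byte
# at or above 0x80 is treated as CP1252 and re-encoded.
# This works pretty well, except where a file has a CP1252 character in the range
# 0xC0-0xDF followed by one in 0x80-0xBF, or one in 0xE0-0xEF followed by two in 0x80-0xBF.
# Those (hopefully rare) cases look exactly like valid UTF-8, so they come out as the
# wrong Unicode characters.
# 4-byte UTF-8 sequences (lead bytes 0xF0 and above) are not recognized; each of their
# bytes is re-encoded as if it were CP1252.
use Encode;
use bytes;    # make length() and substr() work on bytes, not characters
sub recode { encode( "utf-8", decode( "cp1252", $_[0] ) ) }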
while ( my $line = <> ) {
    my $pos    = 0;
    my $length = length $line;
    while ( $pos <= $length - 1 ) {
        my $c = substr( $line, $pos, 1 );
        if ( ord($c) < 0x80 ) {    # ASCII char
            print $c;
        } elsif ( ord($c) >> 5 == 6 and $length - 1 >= $pos + 1 ) {    # 2-byte char?
            # ">=" rather than ">" so that a sequence ending on the last byte of the
            # line (e.g. a final line with no trailing newline) is still recognized
            my $c2 = substr( $line, ++$pos, 1 );
            if ( ord($c2) >> 6 == 2 ) {    # if the next byte starts with 0b10
                print $c . $c2;            # print 2-byte UTF-8 char
            } else {                       # else
                print recode($c);          # map CP1252 char
                $pos--;                    # and back up 1 byte
            }
        } elsif ( ord($c) >> 4 == 14 and $length - 1 >= $pos + 2 ) {    # 3-byte char?
            my $c2 = substr( $line, ++$pos, 1 );
            my $c3 = substr( $line, ++$pos, 1 );
            if ( ord($c2) >> 6 == 2 and ord($c3) >> 6 == 2 ) {
                # the next 2 bytes both start with 0b10
                print $c . $c2 . $c3;      # print 3-byte UTF-8 char
            } else {                       # else
                print recode($c);          # map CP1252 char
                $pos -= 2;                 # and back up 2 bytes
            }
        } else {                           # it's a Windows CP1252 char
            print recode($c);              # map CP1252 char
        }
        $pos++;
    }
}
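
The byte rules in the loop above can also be written as a single substitution, which makes them easier to try on individual strings. This is only a sketch, not part of the gist: the names fix_bytes and recode_byte are hypothetical, and it assumes the input is a raw byte string (no :utf8 layer on the filehandle). Like the script, it passes well-formed 2- and 3-byte sequences through and treats every other high byte as CP1252.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: re-encode a single CP1252 byte as UTF-8 bytes.
sub recode_byte {
    my ($byte) = @_;
    return encode( "UTF-8", decode( "cp1252", $byte ) );
}

# Hypothetical helper: apply the script's rules to a whole byte string.
# The first alternative keeps ASCII and well-formed 2-/3-byte sequences;
# the second catches any remaining high byte and maps it from CP1252.
sub fix_bytes {
    my ($bytes) = @_;
    $bytes =~ s{
        ( [\x00-\x7F]+                 # run of ASCII
        | [\xC0-\xDF][\x80-\xBF]       # well-formed 2-byte sequence
        | [\xE0-\xEF][\x80-\xBF]{2}    # well-formed 3-byte sequence
        )
      | ( [\x80-\xFF] )                # anything else, treated as CP1252
    }{ defined $1 ? $1 : recode_byte("$2") }gex;
    return $bytes;
}

# A lone CP1252 e-acute (0xE9) becomes the UTF-8 pair C3 A9.
printf "%s\n", unpack( "H*", fix_bytes("caf\xE9") );    # 636166c3a9

# The ambiguous case from the header comment: C3 A9 could be CP1252 for
# A-tilde followed by a copyright sign, but it is also valid UTF-8 for
# e-acute, so it is left untouched.
printf "%s\n", unpack( "H*", fix_bytes("\xC3\xA9") );   # c3a9
```

Trying the longer sequences first is what keeps already-correct UTF-8 from being double-encoded. It is also the source of the ambiguity: a byte pair that happens to be well-formed UTF-8 is never reinterpreted as CP1252.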
@lucianovilela

Well done, it works fine. Thank you very much.
