@frodwith
Created June 8, 2010 18:40

Encoding and utf8 in perl

A Perl string is a logical sequence of characters.

Encoding a Perl string produces a sequence of octets. Decoding a sequence of octets produces a Perl string.

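utf8::is_utf8 reports whether Perl's internal UTF8 flag is set on a string. In the examples below that lines up with "sequence of logical characters" (true) versus "sequence of octets" (false), but the flag describes internal storage, so it is not a general test of whether data has been decoded. You do not need to say "use utf8" to call this function; that pragma only tells Perl that the source file itself is written in UTF-8.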

Raw filehandles (like the default STDOUT) expect sequences of octets, and will warn ("Wide character in print") if you print a string that contains characters above 0xFF.

use Encode;

my $string = "\x{0CA0}_\x{0CA0}"; # the look of disapproval
# utf8::is_utf8($string) is true

my $octets = encode('utf8', $string);
# utf8::is_utf8($octets) is false

my $utf8   = decode('utf8', $octets);
# utf8::is_utf8($utf8) is true
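
One way to see the difference between the character string and its octets is to compare lengths. The following is a small sketch rather than part of the original notes; the variable names $chars, $bytes and $round_trip_ok are made up for illustration, and it uses the strict 'UTF-8' encoding name.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Three logical characters: U+0CA0, an underscore, U+0CA0.
my $string = "\x{0CA0}_\x{0CA0}";

# U+0CA0 takes three octets in UTF-8 and the underscore takes one.
my $octets = encode('UTF-8', $string);

my $chars = length $string;   # counts characters
my $bytes = length $octets;   # counts octets

# Decoding the octets should give back the original character string.
my $round_trip_ok = decode('UTF-8', $octets) eq $string ? 1 : 0;

# Only ASCII is printed here, so a raw STDOUT is fine.
printf "%d characters, %d octets, round trip ok: %d\n",
    $chars, $bytes, $round_trip_ok;
```

The same data therefore has length 3 as a string and length 7 as octets, which is why it matters to know which of the two a given variable holds.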

If we have raw octets in one encoding, we can turn them into a Perl string with decode, then turn that string into octets in a different encoding with encode:

use Encode;

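# Stand-in for real input: alpha, beta, gamma as ISO-8859-7 octets.
sub raw_greek_octets { return "\xE1\xE2\xE3" }

my $greek   = raw_greek_octets();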
my $perly   = decode('ISO-8859-7', $greek);
my $printme = encode('utf8', $perly);

print $printme, "\n";
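
As an alternative to calling encode by hand before every print, a filehandle can be opened with an encoding layer so that it accepts character strings and does the encoding itself. The sketch below assumes an in-memory filehandle (the $buffer variable is hypothetical) so that the resulting octets can be inspected; the sample input is three ISO-8859-7 octets chosen for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Alpha, beta, gamma as ISO-8859-7 octets, decoded into a Perl string.
my $perly = decode('ISO-8859-7', "\xE1\xE2\xE3");

# Open an in-memory filehandle with a UTF-8 encoding layer.
# Character strings printed to it are encoded on the way out.
my $buffer = '';
open(my $out, '>:encoding(UTF-8)', \$buffer)
    or die "cannot open in-memory handle: $!";
print {$out} $perly;
close($out) or die "close failed: $!";

# Each of these Greek letters takes two octets in UTF-8.
my $octet_count = length $buffer;
```

The same layer can be applied to an already open handle, for example binmode(STDOUT, ':encoding(UTF-8)'), after which printing $perly directly should no longer trigger the wide character warning.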