@benui-dev
Created September 3, 2010 06:34
MongoDB and Perl utf8 fun
> db.bar.find()
{ "_id" : ObjectId("4c80707ce7ed288eb37deabf"),
"downgraded" : { "进口" : "进口" },
"encoded" : { "��" : "��" },
"upgraded" : { "进口" : "è¿�å�£" },
"id" : "foo"
}
use strict;
use warnings;
use utf8;
use MongoDB;
my $conn = MongoDB::Connection->new(host => 'unixdeva11', port => 21337);
my $db = $conn->get_database('foo');
my $coll = $db->get_collection('bar');
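# Each string starts as six characters below 0x100, held as native bytes
# (UTF-8 flag off). The byte values happen to be the UTF-8 encoding of "进口".
my $upgrade = "\x{e8}\x{bf}\x{9b}\x{e5}\x{8f}\x{a3}";
my $downgrade = "\x{e8}\x{bf}\x{9b}\x{e5}\x{8f}\x{a3}";
my $encode = "\x{e8}\x{bf}\x{9b}\x{e5}\x{8f}\x{a3}";
my $decode = "\x{e8}\x{bf}\x{9b}\x{e5}\x{8f}\x{a3}";
utf8::upgrade( $upgrade );     # same 6 characters, internal storage becomes UTF-8 (flag on)
utf8::downgrade( $downgrade ); # already stored as bytes, so nothing changes
utf8::encode( $encode );       # characters become UTF-8 octets: 12 bytes, flag off
utf8::decode( $decode );       # octets read as UTF-8: the 2 characters "进口", flag on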
my $result = $coll->update(
    { id => 'foo' },
    {
        id         => 'foo',
        upgraded   => { $upgrade   => $upgrade },
        downgraded => { $downgrade => $downgrade },
        encoded    => { $encode    => $encode },
        #decoded   => { $decode    => $decode }, # This causes MongoDB to crash silently
    },
    { upsert => 1 } # create if non-exist, update if exist
);
print $result;
@benui-dev
Author

I'm not 100% sure but I think this shows that keys in MongoDB are treated differently to values.

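When a string is "upgraded", Perl stores it internally as UTF-8 and sets the UTF-8 flag; the characters themselves don't change. In this case the upgraded string seems to get encoded as UTF-8 a second time when it's stored as a string value, while the same string used as a key comes through intact.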
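
To see what each of the four utf8:: calls does to the string before MongoDB is involved at all, here is a minimal sketch using only core Perl. The `describe` helper is a made-up name for illustration and is not part of the script above; it reports the character length and whether the internal UTF-8 flag is set.

```perl
use strict;
use warnings;

# Hypothetical helper: character length plus state of the internal UTF-8 flag.
sub describe {
    my ($str) = @_;
    return sprintf 'length=%d flag=%s',
        length($str), utf8::is_utf8($str) ? 'on' : 'off';
}

# Same starting string as in the script: six characters below 0x100.
my $raw = "\x{e8}\x{bf}\x{9b}\x{e5}\x{8f}\x{a3}";
my ( $upgrade, $downgrade, $encode, $decode ) = ($raw) x 4;

utf8::upgrade($upgrade);       # change internal storage only
utf8::downgrade($downgrade);   # no-op here, already stored as bytes
utf8::encode($encode);         # each character >= 0x80 becomes two octets
utf8::decode($decode);         # octets interpreted as UTF-8 characters

print "raw:        ", describe($raw),       "\n";   # length=6 flag=off
print "upgraded:   ", describe($upgrade),   "\n";   # length=6 flag=on
print "downgraded: ", describe($downgrade), "\n";   # length=6 flag=off
print "encoded:    ", describe($encode),    "\n";   # length=12 flag=off
print "decoded:    ", describe($decode),    "\n";   # length=2 flag=on
```

Only the decoded variant is a real two-character string; the other three still hold the six (or twelve) byte-valued characters, which is probably why they show up so differently in the find() output.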

Perl Documentation on utf8::upgrade:

$num_octets = utf8::upgrade($string)
Converts in-place the internal representation of the string from an octet sequence in the native encoding (Latin-1 or EBCDIC) to UTF-X. The logical character sequence itself is unchanged. If $string is already stored as UTF-X, then this is a no-op. Returns the number of octets necessary to represent the string as UTF-X. Can be used to make sure that the UTF-8 flag is on, so that \w or lc() work as Unicode on strings containing characters in the range 0x80-0xFF (on ASCII and derivatives).
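
The last sentence of that documentation can be checked with a few lines of core Perl. This is only a sketch, and it assumes a default setup: no `use locale`, no `use feature 'unicode_strings'` (which `use v5.12` or later turns on), and no `/u` modifier on the regex.

```perl
use strict;
use warnings;

# One character in the 0x80-0xFF range: LATIN SMALL LETTER E WITH GRAVE,
# stored as a single native byte with the UTF-8 flag off.
my $char = "\x{e8}";

# With byte storage, \w only recognises ASCII word characters.
my $before = ( $char =~ /\w/ ) ? 1 : 0;

# Same logical character, but now stored as UTF-8 with the flag on.
utf8::upgrade($char);
my $after = ( $char =~ /\w/ ) ? 1 : 0;

print "before=$before after=$after\n";   # before=0 after=1
```

So upgrading changes how Perl itself treats the string even though the characters are the same, which fits with the driver treating upgraded and downgraded strings differently.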
