@crzdeveloper
Last active August 17, 2017 09:01
Partial ICU database update (libicu)

The ICU library provides facilities for working with Unicode and globalization. The problem is that it is installed system-wide and is not trivial to update. This guide updates only some parts of the ICU database, but the rest can be updated in the same way; you just need to know exactly which files to change.

Abstract

This manual is not bound to PHP; it should work for other languages as well. PHP uses the php-intl extension, which in turn uses libicu, and we are going to update some parts of the libicu database.

This manual may be less useful for Java users, because Java uses ICU4J and the approach can differ.

We'll try to solve two problems: an outdated timezone database and IDN conversion of domain names (conversion between IDN and Punycode, the idn_to_ascii() and idn_to_utf8() functions in PHP).

To figure out the version of libicu installed on your system, run:

apt-cache search '^libicu'
# or
apt search '^libicu'

In my case it outputs:

libicu52 - International Components for Unicode

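This means ICU version 52 is installed on my system. (Strictly speaking, apt-cache search lists the packages available in the repositories; dpkg -l 'libicu*' shows what is actually installed.)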
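The major version number is needed later to build the data directory name, so it can be handy to pull it out of the package listing. A minimal sketch, assuming Debian-style package names such as libicu52; the function name icu_major_from_pkg is made up for illustration:

```shell
# Read package listing lines on stdin and print the ICU major version
# embedded in names like "libicu52". Lines such as "libicu-dev" are skipped.
icu_major_from_pkg() {
    sed -n 's/^libicu\([0-9][0-9]*\).*/\1/p'
}

# Sample line taken from the listing above; prints 52
printf '%s\n' 'libicu52 - International Components for Unicode' | icu_major_from_pkg
```

In practice the input would come from the apt-cache command above, piped through sort -u if several matching packages are listed. From PHP itself, the INTL_ICU_VERSION constant reports the ICU version the intl extension is linked against.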

Problem #1: Outdated Timezone Problem

Here is a code snippet that reproduces the problem (reproducible with libicu version 52 and probably 55):

<?php

// tztest.php

$date = new DateTime();
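// Lowercase 'y' is the calendar year; 'Y' is the week-based year (differs around New Year).
$x = \IntlDateFormatter::create('en', \IntlDateFormatter::MEDIUM, \IntlDateFormatter::MEDIUM,
    'Europe/Moscow', \IntlDateFormatter::GREGORIAN, 'yyyy-MM-dd HH:mm:ss Z');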
echo $x->format($date) . PHP_EOL;

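Expected output: the current time in the Europe/Moscow timezone (+0300)

Actual output: the time is printed with a +0400 offset, because Moscow moved from UTC+4 to UTC+3 in October 2014 and the bundled timezone data predates that change.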

Problem #2: Emoji/Unicode Domain Name Conversion

idn_to_ascii() converts IDN (Unicode) domain names to their IDNA ASCII (Punycode) form, and idn_to_utf8() converts them back. The problem is reproducible with libicu up to version 57 (maybe even 58):

<?php

// idntest.php

$xn = 'xn--4s9haa.ws';

// Decoding to Unicode, so the matching flag is IDNA_NONTRANSITIONAL_TO_UNICODE
$result = idn_to_utf8($xn, IDNA_NONTRANSITIONAL_TO_UNICODE, INTL_IDNA_VARIANT_UTS46);

if ($result === false) {
    throw new \InvalidArgumentException("Could not convert Punycode '$xn' to IDN.");
}

echo "SUCCESS: $result\n";

Expected output: SUCCESS followed by the decoded Unicode domain name

Actual output: an InvalidArgumentException with the message "Could not convert Punycode 'xn--4s9haa.ws' to IDN."

Problem Solution Approach

According to the ICU Data documentation, if the ICU_DATA environment variable is set, ICU data is loaded from that path. For example, if libicu52 is installed on your system and you run your PHP script like this:

ICU_DATA=/opt/icu php tztest.php

then ICU looks for its data files in the /opt/icu/icudt52l/ directory. The pattern is /opt/icu/icudt<version><byte ordering>/, where:

  • <version> is the major version of libicu installed on your system
  • <byte ordering> can be l (little-endian, ASCII), b (big-endian, ASCII) or e (big-endian, EBCDIC). See Sharing ICU Data Between Platforms. For a regular x86_64 platform it is l.

So, the algorithm for finding the ICU data is the following:

  • If ICU_DATA is not set, load the data directly from libicudata.so.52
  • If ICU_DATA is set, try to load the data from $ICU_DATA/icudt52l
  • If the requested file is not found or the directory doesn't exist, fall back to libicudata.so.52
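The three rules above can be modelled in a few lines of shell. This is only a sketch of the lookup order as described here, not ICU's actual code; the function name icu_resolve is hypothetical and the l suffix assumes a little-endian ASCII platform:

```shell
# Usage: icu_resolve <ICU_DATA value or empty> <major version> <file name>
# Prints the override path if that file exists, otherwise reports the
# built-in fallback (the shared data library).
icu_resolve() {
    data_root=$1
    version=$2
    name=$3
    # Assumption: "l" byte ordering, as on x86_64
    candidate="$data_root/icudt${version}l/$name"
    if [ -n "$data_root" ] && [ -f "$candidate" ]; then
        printf '%s\n' "$candidate"
    else
        printf 'builtin (libicudata.so.%s)\n' "$version"
    fi
}

# Example call (not executed here):
#   icu_resolve "${ICU_DATA:-}" 52 zoneinfo64.res
```

The point of the model is that each file is resolved independently, which is why replacing only a handful of files works while everything else keeps coming from the system library.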

Solving Timezone Problem

There is a great article (in Russian) about this. In short, using strace we can find which files libicu requests:

export ICU_DATA=/opt/icu 
strace php -f tztest.php

(Note that on Debian Jessie, when strace -o 'output.txt' php -f tztest.php is used instead of the command above, libicu ignores ICU_DATA. It would be interesting to figure out why.)

In the strace log you'll see the following:

stat("/opt/icu/icudt52l/zoneinfo64.res", 0x7ffe20d1a7c0) = -1 ENOENT (No such file or directory)
stat("/opt/icu/icudt52l/timezoneTypes.res", 0x7ffe20d1a120) = -1 ENOENT (No such file or directory)
stat("/opt/icu/icudt52l/metaZones.res", 0x7ffe20d1a450) = -1 ENOENT (No such file or directory)
... and many more
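Since the strace log is long, it helps to reduce it to just the ICU files that were looked up and not found. A rough sketch, assuming log lines shaped like the ones above; the function name icu_missing_files is invented for this example:

```shell
# Read strace output on stdin and print, once each, the paths under the
# given ICU data root whose lookup failed with ENOENT.
icu_missing_files() {
    root=$1
    grep -F "\"$root/" | grep -F 'ENOENT' |
        sed 's/^[^"]*"\([^"]*\)".*/\1/' | sort -u
}

# Example (not executed here). strace writes to stderr, hence 2>&1:
#   strace php -f tztest.php 2>&1 | icu_missing_files /opt/icu
```

The resulting list is the set of candidate files that could be overridden in the ICU_DATA directory.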

According to the official documentation Updating the Time Zone Data, we need just 4 files:

  • zoneinfo64.res
  • windowsZones.res
  • timezoneTypes.res
  • metaZones.res

They can be downloaded from the repository: click on the latest year, then 44, then le (for little-endian systems), and place those 4 files into the /opt/icu/icudt52l/ directory. The complete download link.
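Once the four files are downloaded, copying them into place can be scripted. A minimal sketch that assumes the files already sit in a local directory (nothing is downloaded here); install_tz_files is a hypothetical helper name:

```shell
# Usage: install_tz_files <download dir> <ICU override dir>
# Copies the four time zone resource files, refusing to install a partial set.
install_tz_files() {
    src=$1
    dest=$2
    # Check everything first so an incomplete set is never installed
    for f in zoneinfo64.res windowsZones.res timezoneTypes.res metaZones.res; do
        [ -f "$src/$f" ] || { echo "missing: $src/$f" >&2; return 1; }
    done
    mkdir -p "$dest" || return 1
    for f in zoneinfo64.res windowsZones.res timezoneTypes.res metaZones.res; do
        cp "$src/$f" "$dest/$f" || return 1
    done
}

# Example (not executed here; the download directory is an assumption):
#   install_tz_files "$HOME/Downloads/tz-le" /opt/icu/icudt52l
```

Checking for all four files before copying keeps the override directory from ending up with a mix of old and new time zone data.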

Run tztest.php again; the output should now contain "+0300" instead of "+0400". You can also check the strace output to make sure the files are loaded:

stat("/opt/icu/icudt52l/metaZones.res", {st_mode=S_IFREG|0644, st_size=40960, ...}) = 0
open("/opt/icu/icudt52l/metaZones.res", O_RDONLY) = 4
stat("/opt/icu/icudt52l/zoneinfo64.res", {st_mode=S_IFREG|0644, st_size=151872, ...}) = 0
open("/opt/icu/icudt52l/zoneinfo64.res", O_RDONLY) = 4
stat("/opt/icu/icudt52l/timezoneTypes.res", {st_mode=S_IFREG|0644, st_size=20032, ...}) = 0
open("/opt/icu/icudt52l/timezoneTypes.res", O_RDONLY) = 4

Solving IDN - ASCII Conversion Problem

First, run strace to see which files libicu requests:

export ICU_DATA=/opt/icu
strace php -f idntest.php

Grep for /opt/icu (strace writes to stderr, so redirect it with 2>&1 before piping) and here it is:

stat("/opt/icu/icudt52l/uts46.nrm", 0x7fffc812b2c0) = -1 ENOENT (No such file or directory)

This file is "Unicode Character Data (Normalization since ICU 4.4)"; see ICU Data File Formats for details.

Where do you get this file? On the Download page, click on the latest ICU version under the ICU4C column (as of mid-2017, version 59 is the latest) and find the link to the repository.

The uts46.nrm file is in icu4c/source/data/in. Just download it from there (or clone the whole repo, of course) and place it into /opt/icu/icudt52l/. That repository directory contains some pre-compiled Unicode data, so you won't need to compile the whole ICU database.

Now if you run the script

export ICU_DATA=/opt/icu
php -f idntest.php

you'll see the SUCCESS output. Let's check strace:

stat("/opt/icu/icudt52l/uts46.nrm", {st_mode=S_IFREG|0644, st_size=60668, ...}) = 0
open("/opt/icu/icudt52l/uts46.nrm", O_RDONLY) = 4

Updating Other ICU DB Parts

If you need other up-to-date parts of the ICU database, it seems you'll need to compile the whole database yourself. If you don't want to do that, you can use the ICU Data Library Customizer tool. However, as of mid-2017, the latest ICU version available in that tool is 57.

Select the ICU version you need (the latest, of course!), click the "Get Data Library" button, download the zip archive and extract it. You'll get a .dat file, say icudt57l.dat. This file is an archive containing everything.

To see what is inside, make sure the icu-devtools package is installed and run:

icupkg -l icudt57l.dat

In order to extract any file:

icupkg -x uts46.nrm icudt57l.dat

So you can extract some files and place them in your server's /opt/icu/icudt52l directory.
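Extracting several items straight into the override directory can be wrapped in a small helper. This is a sketch under assumptions: it relies on icupkg's -x (extract) and -d (destination directory) options, and the function name icu_extract is made up here:

```shell
# Usage: icu_extract <archive.dat> <destination dir> <item> [<item> ...]
# Extracts each named item from the .dat archive into the destination dir.
icu_extract() {
    dat=$1
    dest=$2
    shift 2
    command -v icupkg >/dev/null 2>&1 || { echo "icupkg not found" >&2; return 127; }
    mkdir -p "$dest" || return 1
    for item in "$@"; do
        # -x names the item to extract, -d sets where it is written
        icupkg -x "$item" -d "$dest" "$dat" || return 1
    done
}

# Example (not executed here):
#   icu_extract icudt57l.dat /opt/icu/icudt52l uts46.nrm
```

Whether a given item from a newer ICU release is usable by an older libicu still has to be checked case by case, for example by re-running the test scripts above.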

Tracing further with strace, it seems that libicu sometimes tries to open the /opt/icu/icudt52l.dat file. I tried supplying this file (by renaming icudt57l.dat to icudt52l.dat), but it didn't work as expected, although the file was loaded according to strace. The reason could be that .dat files have different formats depending on the ICU version.
