@xeoncross
Created October 26, 2011 17:59
PHP UTF-8, Locale, and I18N support
For those who are interested, it seems full support for [locales](http://php.net/manual/en/function.setlocale.php) and i18n in PHP is finally starting to take shape.

    // Set the current locale to the one the user agent wants
    $locale = Locale::acceptFromHttp(getenv('HTTP_ACCEPT_LANGUAGE'));

    // Default locale
    Locale::setDefault($locale);
    setlocale(LC_ALL, $locale . '.UTF-8');

    // Default timezone of server
    date_default_timezone_set('UTC');

    // iconv encoding (iconv_set_encoding() is deprecated as of PHP 5.6;
    // newer code sets the default_charset ini option instead)
    iconv_set_encoding("internal_encoding", "UTF-8");

    // multibyte encoding
    mb_internal_encoding('UTF-8');

There are several things that need to be considered. Detecting the timezone and locale, and then using them to correctly parse input and display output, is important. There is a [PHP I18N library](https://github.com/dotroll/I18N) that was just released which contains lookup tables for much of this information.
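
The `Accept-Language` header is not always present (CLI scripts, some bots), so locale detection usually needs a fallback. A minimal sketch, assuming a hypothetical `detect_locale()` helper and an `en_US` default, neither of which comes from the original setup code:

```php
<?php
/**
 * Pick a locale from an Accept-Language header value, with a fallback.
 * detect_locale() and the 'en_US' default are hypothetical names
 * chosen for this sketch.
 */
function detect_locale($header, $fallback = 'en_US')
{
    // getenv() returns false when the header is absent
    if (!is_string($header) || trim($header) === '') {
        return $fallback;
    }

    $locale = Locale::acceptFromHttp($header);

    // acceptFromHttp() may return false when it cannot pick a locale
    return $locale ?: $fallback;
}

$locale = detect_locale(getenv('HTTP_ACCEPT_LANGUAGE'));
Locale::setDefault($locale);
```

Checking the header before calling `Locale::acceptFromHttp()` keeps a missing header from ever reaching `setlocale()` as an empty locale name.
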

Processing user input is important to make sure your application has clean, well-formed UTF-8 strings from whatever input the user enters. [iconv](http://us3.php.net/manual/en/function.iconv.php) is great for this.

    /**
     * Convert a string from one encoding to another encoding
     * and remove invalid bytes sequences.
     *
     * @param string $string to convert
     * @param string $to encoding you want the string in
     * @param string $from encoding that string is in
     * @return string
     */
    function encode($string, $to = 'UTF-8', $from = 'UTF-8')
    {
        // ASCII is already valid UTF-8
        if ($to == 'UTF-8' && is_ascii($string))
        {
            return $string;
        }

        // Convert the string
        return @iconv($from, $to . '//TRANSLIT//IGNORE', $string);
    }

    /**
     * Tests whether a string contains only 7bit ASCII characters.
     *
     * @param string $string to check
     * @return bool
     */
    function is_ascii($string)
    {
        return ! preg_match('/[^\x00-\x7F]/S', $string);
    }

Then just run the input through these functions.

    $utf8_string = normalizer_normalize(encode($_POST['text']), Normalizer::FORM_C);

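To see why the `Normalizer::FORM_C` step matters, here is a small sketch comparing two byte sequences that both render as "å". The variable names are illustrative only:

```php
<?php
// "å" written two ways: precomposed (U+00E5) and decomposed (a + U+030A)
$composed   = "\xC3\xA5";
$decomposed = "a\xCC\x8A";

var_dump($composed === $decomposed); // bool(false): the bytes differ

// After normalizing both to Form C they are byte-identical
$a = normalizer_normalize($composed, Normalizer::FORM_C);
$b = normalizer_normalize($decomposed, Normalizer::FORM_C);

var_dump($a === $b); // bool(true)
```

Without this step, comparisons, unique indexes, and searches can treat visually identical input as different strings.
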
### Translations
As Andre said, it seems [gettext](http://php.net/gettext) is the smart default choice for writing applications that can be translated.

1. Gettext reads translations from compiled binary catalogs (`.mo` files), which makes lookups quite quick.
2. The gettext implementation is usually simpler, as it only requires `_('Text to translate')`.
3. There are existing tools for translators to use, and they are proven to work well.

When you reach Facebook size, you can work on implementing RAM-cached alternative methods like the one I mentioned in the question. However, nothing beats "simple, fast, and works" for most projects.
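
For reference, a typical gettext bootstrap might look like the sketch below. The `init_gettext()` helper, the `messages` domain, and the `./locale` directory are hypothetical names picked for illustration; only the gettext functions themselves are part of PHP:

```php
<?php
/**
 * Point gettext at a catalog directory and select a text domain.
 * init_gettext(), the "messages" domain and the ./locale path are
 * hypothetical names chosen for this sketch.
 */
function init_gettext($locale, $domain = 'messages', $directory = null)
{
    $directory = $directory ?: __DIR__ . '/locale';

    // gettext looks up $directory/<locale>/LC_MESSAGES/<domain>.mo
    setlocale(LC_ALL, $locale . '.UTF-8');
    bindtextdomain($domain, $directory);
    bind_textdomain_codeset($domain, 'UTF-8');
    textdomain($domain);
}

init_gettext('en_US');

// Strings without a translation are returned unchanged
print _('Text to translate') . "\n";

// ngettext() picks the singular or plural form for the count
$count = 4;
printf(ngettext('%d error', '%d errors', $count) . "\n", $count);
```

Because untranslated strings fall through unchanged, the application keeps working even before any catalog has been compiled.
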

There are also additional things that gettext cannot handle, such as displaying dates, money, and numbers. For those you need the [intl extension](http://us3.php.net/manual/en/book.intl.php).

    /**
     * Return an IntlDateFormatter object using the current system locale
     *
     * @param string $locale string
     * @param integer $datetype IntlDateFormatter constant
     * @param integer $timetype IntlDateFormatter constant
     * @param string $timezone Time zone ID, default is system default
     * @return IntlDateFormatter
     */
    function __date($locale = NULL, $datetype = IntlDateFormatter::MEDIUM, $timetype = IntlDateFormatter::SHORT, $timezone = NULL)
    {
        return new IntlDateFormatter($locale ?: setlocale(LC_ALL, 0), $datetype, $timetype, $timezone);
    }

    $now = new DateTime();
    print __date()->format($now);
    $time = __date()->parse($string);

In addition, you can use [strftime](http://www.php.net/manual/en/function.strftime.php) to format dates taking the current locale into consideration (note that `strftime()` is deprecated as of PHP 8.1).
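
Numbers and money, mentioned above alongside dates, go through the intl extension's `NumberFormatter` class. A short sketch, with the locales and amounts chosen purely as examples:

```php
<?php
// Plain numbers: grouping and decimal separators follow the locale
$decimal = new NumberFormatter('en_US', NumberFormatter::DECIMAL);
print $decimal->format(1234567.891) . "\n"; // 1,234,567.891

// Money: the currency is passed as an ISO 4217 code
$money = new NumberFormatter('en_US', NumberFormatter::CURRENCY);
print $money->formatCurrency(1234.5, 'USD') . "\n"; // $1,234.50

// The same number in another locale uses different separators
$german = new NumberFormatter('de_DE', NumberFormatter::DECIMAL);
print $german->format(1234567.891) . "\n"; // 1.234.567,891
```
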

Sometimes you need the values for numbers and dates inserted correctly into localized messages:

    /**
     * Format the given string using the current system locale
     * Basically, it's sprintf on i18n steroids.
     *
     * @param string $string to parse
     * @param array $params to insert
     * @return string
     */
    function __($string, array $params = array())
    {
        return msgfmt_format_message(setlocale(LC_ALL, 0), $string, $params);
    }

    // Multiple choices (can also just use ngettext)
    // Placeholders are zero-indexed, so the first array element is {0}
    print __(_("{0,choice,0#no errors|1#single error|1<{0,number} errors}"), array(4));

    // Show time in the correct way
    print __(_("It is now {0,time,medium}"), array(time()));

See the [ICU format details](http://icu-project.org/apiref/icu4c/classMessageFormat.html#details) for more information.
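
ICU message patterns also offer a `plural` argument type, which follows each locale's plural rules rather than fixed numeric ranges. A minimal sketch, assuming an `en_US` locale and an example pattern that is not from the original:

```php
<?php
// ICU "plural" argument: "=0" matches exactly zero, "one"/"other" are
// plural categories, and "#" is replaced with the formatted number
$pattern = '{0, plural, =0{no errors} one{# error} other{# errors}}';

foreach (array(0, 1, 4) as $count) {
    print MessageFormatter::formatMessage('en_US', $pattern, array($count)) . "\n";
}
// no errors
// 1 error
// 4 errors
```
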
### Database
Make sure your connection to the database is using the correct charset so that nothing gets corrupted on storage.
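
The original does not name a database, so the following is only a sketch assuming MySQL accessed through PDO. The `utf8_mysql_dsn()` and `connect_utf8()` helpers and the example host and database names are hypothetical:

```php
<?php
/**
 * Build a PDO DSN for MySQL that requests the utf8mb4 character set.
 * MySQL, utf8_mysql_dsn() and the example names are assumptions.
 */
function utf8_mysql_dsn($host, $dbname)
{
    return sprintf('mysql:host=%s;dbname=%s;charset=utf8mb4', $host, $dbname);
}

/**
 * Open a connection whose client charset is utf8mb4.
 */
function connect_utf8($host, $dbname, $user, $password)
{
    return new PDO(utf8_mysql_dsn($host, $dbname), $user, $password, array(
        PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
    ));
}

print utf8_mysql_dsn('localhost', 'app') . "\n";
// mysql:host=localhost;dbname=app;charset=utf8mb4
```

Setting the charset in the DSN applies it at connection time, so every query on that connection exchanges UTF-8 with the server. The tables and columns still need a matching character set.
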
### String Functions
You need to understand the difference between the byte-based [string](http://us.php.net/strlen) functions, the code-point-based [mbstring](http://us3.php.net/mb_string) functions, and the [grapheme functions](http://us2.php.net/manual/en/ref.intl.grapheme.php), which work on user-perceived characters.

    // 'LATIN SMALL LETTER A WITH RING ABOVE' (U+00E5) normalization form "D"
    $char_a_ring_nfd = "a\xCC\x8A";
    var_dump(grapheme_strlen($char_a_ring_nfd)); // int(1): one user-perceived character
    var_dump(mb_strlen($char_a_ring_nfd));       // int(2): two code points
    var_dump(strlen($char_a_ring_nfd));          // int(3): three bytes

    // 'LATIN CAPITAL LETTER A WITH RING ABOVE' (U+00C5)
    $char_A_ring = "\xC3\x85";
    var_dump(grapheme_strlen($char_A_ring)); // int(1): one user-perceived character
    var_dump(mb_strlen($char_A_ring));       // int(1): one code point
    var_dump(strlen($char_A_ring));          // int(2): two bytes

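The same distinction applies when cutting strings, not only when measuring them. A small sketch using an example string of my own choosing:

```php
<?php
// "å" in decomposed form (a + combining ring) followed by "bc"
$text = "a\xCC\x8Abc";

// Byte-based: cuts through the middle of the combining mark
var_dump(substr($text, 0, 2) === "a\xCC");              // bool(true)

// Code-point-based: keeps "a" but drops its combining ring
var_dump(mb_substr($text, 0, 1, 'UTF-8') === "a");      // bool(true)

// Grapheme-based: returns the whole user-perceived character
var_dump(grapheme_substr($text, 0, 1) === "a\xCC\x8A"); // bool(true)
```

Only the grapheme version returns something a user would recognize as "the first character"; the byte-based cut even produces invalid UTF-8.
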
### Domain Name TLDs
The [IDN functions](http://us2.php.net/manual/en/function.idn-to-ascii.php) from the intl extension are a big help when processing non-ASCII domain names.
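
As a quick illustration, the sketch below round-trips an example domain (chosen here, not taken from the original) between its Unicode and ASCII-compatible forms, passing the UTS #46 variant explicitly:

```php
<?php
// "bücher.de" spelled with explicit UTF-8 bytes
$domain = "b\xC3\xBCcher.de";

// Unicode to ASCII-compatible (Punycode) form
$ascii = idn_to_ascii($domain, IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46);
print $ascii . "\n"; // xn--bcher-kva.de

// And back again
$unicode = idn_to_utf8($ascii, IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46);
var_dump($unicode === $domain); // bool(true)
```

The ASCII form is what DNS lookups and most validation routines expect, while the Unicode form is what users should see.
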