rraval/postgres-collation.md

## postgres-collation.md

      
    Raw
  

              postgres-collation.md
            
          
    This is a slightly stripped down version from our internal bug tracker. The point of posting this publicly is part FYI, part peer review. I'm hoping someone can look at this, disagree, and tell me all the downsides of using the C locale or point out things I've misunderstood. The Recommendations section in particular is contextualized by our database serving a SaaS product for users from many different locales, thus making locale a render level concern. YMMV, caveat emptor, etc.

Collation defines the character ordering for textual data. For Postgres, https://www.postgresql.org/docs/current/static/locale.html:

The locale settings influence the following SQL features:

Sort order in queries using ORDER BY or the standard comparison operators on textual data
The upper, lower, and initcap functions
Pattern matching operators (LIKE, SIMILAR TO, and POSIX-style regular expressions); locales affect both case insensitive matching and the classification of characters by character-class regular expressions
The to_char family of functions
The ability to use indexes with LIKE clauses


Those specific docs fail to mention that collation also affects indices. From https://www.postgresql.org/docs/current/static/sql-createindex.html:

By default, the index uses the collation declared for the column to be indexed or the result collation of the expression to be indexed. Indexes with non-default collations can be useful for queries that involve expressions using non-default collations.

Python and Postgres don't agree on string ordering

In unicode, a has codepoint 97 (U+0061) and { has codepoint 123 (U+007B).
Python has a well defined string ordering (https://docs.python.org/3/tutorial/datastructures.html#comparing-sequences-and-other-types):

The comparison uses lexicographical ordering [...] Lexicographical ordering for strings uses the Unicode code point number to order individual characters.

This means that Python has the same sort order regardless of locale settings:
$ python
Python 3.6.5 (default, May 11 2018, 04:00:52) 
[GCC 8.1.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> 'a' < '{'
True

In production, we use the en_CA.UTF-8 collation. As far as I can tell, this collation ignores specific punctuation characters but compares lexicographically otherwise:
$ psql
db=> SELECT 'a' < '{';
 ?column? 
----------
 f
(1 row)

This means that code that relies on database ordering but Python side comparisons is subtly broken.
This surprising collation behaviour isn't actually Postgres's fault, it just delegates to strcoll, provided by glibc in this case. strcoll is like strcmp in that:

strcoll(a, b) < 0 if a sorted before b in the current locale
strcoll(a, b) == 0 if a has the same sort position as b in the current locale
strcoll(a, b) > 0 if a sorts after b in the current locale

The following C code reproduces the issue:
#include <stdio.h>
#include <string.h>
#include <locale.h>

int main(int argc, const char **argv) {
    if (argc < 4) {
        printf("Usage: coll LANG STR1 STR2\n");
        return 1;
    }

    setlocale(LC_ALL, argv[1]);
    const char *s1 = argv[2];
    const char *s2 = argv[3];

    printf("%s < %s: %d\n", s1, s2, strcoll(s1, s2));
    return 0;
}

$ gcc -o coll coll.c

# C collation is equivalent to strcmp, which is equivalent to
# codepoint comparison, see recommendations below
$ ./coll C 'a' '{'
a < {: -26

# en_CA.UTF8 is what our prod database uses
$ ./coll en_CA.UTF8 'a' '{'
a < {: 1

# en_US.UTF8 is similarly affected
$ ./coll en_US.UTF8 'a' '{'
a < {: 1

Internationalized collation rules change over time

The rules for en_CA.UTF8 (and en_US.UTF8) inherit from iso14651_t1_common. As the name implies, it's based on ISO 14651 (wiki, standard).
However, those rules are occasionally updated (https://fedoraproject.org/wiki/Changes/Glibc_collation_update_and_sync_with_cldr):

The collation data in glibc is extremely out of date, most locales base their collation rules on an iso14651_t1_common file which has not been updated for probably more than 15 years. Therefore, all characters added in later Unicode versions are missing and not sorted at all which causes bugs like Bug 1336308 - Infinite (∞) and empty set (∅) are treated as if they were the same character by sort and uniq

While Unicode is not ISO 14651, they have a similar stance. From http://unicode.org/reports/tr10/:

Collation order is not fixed.
Over time, collation order will vary: there may be fixes needed as more information becomes available about languages; there may be new government or industry standards for the language that require changes; and finally, new characters added to the Unicode Standard will interleave with the previously-defined ones. This means that collations must be carefully versioned.

Nondeterminism in index creation is disaster

Unlike Unicode, which at least recognizes that "careful versioning" is required, glibc (or maybe glibc packagers) seem to arbitrarily update the rules.
TripAdvisor has a cautionary tale (https://www.postgresql.org/message-id/BA6132ED-1F6B-4A0B-AC22-81278F5AB81E@tripadvisor.com) where a streaming replica was running a different version of glibc than the master:

SELECT 'Ｍ' > 'ஐ';
Depending on your charset the first character might look like an ascii 'M'. It is not. The two characters in question are the utf8 representations of http://www.fileformat.info/info/unicode/char/ff2d/index.htm and http://www.fileformat.info/info/unicode/char/0b90/index.htm respectively. Across different machines, running the same version of postgres, and in databases with identical character encodings and collations ('en_US.UTF-8') that select will return different results if the version of glibc is different. This holds whether one pg instance is a pg_basebackup of the other or if you run an initdb and then immediately start postgres.

Granted that story is from 2014 and using PG 9.1.9, and things may have changed since then. Given the yes-this-is-bad-but-hard-to-fix responses from the developers and the current state of https://wiki.postgresql.org/wiki/Todo:ICU, I think this might still be an issue.
For bonus hilarity, see https://wiki.postgresql.org/wiki/Abbreviated_keys_glibc_issue, which features index corruption caused by some buggy implementations of glibc where strxfrm + strcmp != strcoll. That one's caused by bugs in glibc and this is a far more subtle backcompat issue, but serves to illustrate the sheer surface of bugs that can arise.
Recommendations

Convert all Postgres databases to use the UTF8 encoding under the C locale. The C locale does bytewise comparisons. Since UTF8 is encodes codepoints with the higher bits first, this is effectively equivalent to a code point comparison. This means that Postgres and Python now agree on string order AND it's faster to boot (https://www.postgresql.org/docs/current/static/locale.html):

The drawback of using locales other than C or POSIX in PostgreSQL is its performance impact. It slows character handling and prevents ordinary indexes from being used by LIKE. For this reason use locales only if you actually need them.

Putting everything in the C locale is similar to the recommendations for datetimes being stored in UTC in the database. Treat sorting as a render side concern that the client deals with.
Rewrite all server side sorts (for things like pagination) in terms of ints and datetimes. It should be possible to avoid ORDER BY on string columns entirely, and if it isn't, at least the C locale will give the same sort forever.