al45tair/musl-setlocale.md Secret

## musl-setlocale.md

      
Musl and setlocale

The problem

Musl has some design features that motivate some rather unusual behaviour from its setlocale() implementation.  In particular:

Musl wishes to support UTF-8 "out of the box".
Musl also wishes to be 8-bit-safe by default, so the default locale, C, is NOT UTF-8.
In order to use UTF-8, a program must call setlocale(), typically using setlocale(LC_ALL, "").
Unfortunately, the POSIX and C standards specify that setlocale() should fail if the locale name isn't valid, but don't say precisely what "valid" means.  As a result, if the locale specified in the environment isn't valid (or Musl doesn't have data for it), the program might find itself in the C locale (non UTF-8) when it really wanted UTF-8 support.

The upshot of this is that setlocale() presently accepts any locale name as valid, and if it doesn't have a definition for that locale, it simply copies the C.UTF-8 locale, giving it the name passed in and returning that.
While this avoids the problem of programs ending up unexpectedly in a non UTF-8 locale and allows gettext() to work for any language without installing locale data for Musl (e.g. if the program has its own localized strings for fr_FR), it also means that there is no way for a program (notably test suites and configure scripts) to determine the presence of data for a locale, because setlocale() will always succeed even if Musl doesn't actually have the data for that locale.
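
The detection problem can be illustrated with the kind of probe a configure script or test suite might use. This is only a sketch: `locale_seems_available` is a hypothetical helper name, not part of any library, and for simplicity it resets the program to the C locale rather than restoring whatever was in effect before.

```c
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper: reports whether setlocale() accepts `name`
 * for LC_ALL. It switches back to the C locale afterwards so the
 * probe does not leak into the rest of the program. */
int locale_seems_available(const char *name)
{
    int accepted = setlocale(LC_ALL, name) != NULL;
    setlocale(LC_ALL, "C");
    return accepted;
}
```

With Glibc, `locale_seems_available("fr_FR.UTF-8")` returns 0 when that locale is not installed. With Musl's present behaviour as described above, the call succeeds whether or not the data exists, so the probe tells the caller nothing.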
Back in 2017, Rich proposed changing things so that setlocale(cat, "") always succeeds, but treats an unknown locale specified in the environment as C.UTF-8, while setlocale(cat, explicit_name) fails unless a valid definition file is installed for that locale name.
This also avoids the unexpected non UTF-8 problem, although it does mean that gettext() will not work unless a valid locale definition for the language in question is installed for the C library (this is exactly the same situation as with Glibc, so it wouldn't be unexpected behaviour for users reliant on localization support).  The present behaviour where gettext() works anyway is actually somewhat unpleasant because (for instance) if you have a program that itself has a French localization, it will have French text, but it will use . instead of , as a decimal separator (and maybe worse, , as a thousands separator…), "" instead of «» and so on.

Current behaviour

setlocale(cat, "explicit_Locale_Name") always succeeds; if there is no definition file for the given locale, the library copies the C.UTF-8 locale, naming the copy using the string passed to setlocale(), and returns that string.
setlocale(cat, "") looks first in LC_ALL, then in the environment variable corresponding to the category, then in LANG, and finally if all of those are empty, defaults the name to C.UTF-8.  It then proceeds as the previous case.

Proposed behaviour

setlocale(cat, "explicit_Locale_Name") should succeed if the locale actually has a definition file, and fail with a NULL pointer return otherwise.
setlocale(cat, "") should always succeed, treating unknown locales specified in environment variables as C.UTF-8.
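
Under the proposed behaviour, a program that prefers a specific locale but can tolerate a fallback might be written along these lines. This is a sketch rather than a recommended idiom, and `select_locale` is a hypothetical helper name.

```c
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper: try an explicit locale first, then whatever
 * the environment requests, then the C locale. Returns the name
 * reported by setlocale(). */
const char *select_locale(const char *preferred)
{
    const char *name = setlocale(LC_ALL, preferred);
    if (!name)
        name = setlocale(LC_ALL, "");   /* proposed Musl behaviour: always succeeds */
    if (!name)
        name = setlocale(LC_ALL, "C");  /* other C libraries may still fail on "" */
    return name;
}
```

The first call is the one whose result changes under the proposal: a NULL return would then mean that the C library has no data for the preferred locale.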

The gettext() issue

gettext() uses the current LC_MESSAGES locale name to decide which set of translations to use.  This presupposes that it's possible to set LC_MESSAGES to a locale name that corresponds to a set of translations that the application using gettext() has supplied.  Existing musl behaviour means that this is always allowed, but the proposed change here will alter that so that LC_MESSAGES can only be set to a locale for which musl has data.
At first glance, this seems undesirable.  However, as noted above, running the C library and the application in (effectively) different locales is arguably incorrect behaviour and can give rise to confusing results, particularly in locales like fr_FR where the decimal and thousands separators are the reverse of what they would be in C.UTF-8 or en_US.  To give a concrete example, a French speaker would read 1,024 as being slightly larger than one, while 1.024 is a bit greater than a thousand.
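
The separator the C library reports can be inspected with localeconv(), which is one way to observe the mismatch between application text and library formatting. A minimal sketch follows; `decimal_point_for` is a hypothetical helper, and which locales are actually available depends on the system.

```c
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper: returns the decimal separator the C library
 * reports once LC_NUMERIC is set to `name`, or NULL if setlocale()
 * rejects the name. The returned string belongs to the C library
 * and may be overwritten by later setlocale()/localeconv() calls. */
const char *decimal_point_for(const char *name)
{
    if (!setlocale(LC_NUMERIC, name))
        return NULL;
    return localeconv()->decimal_point;
}
```

On a C library with real fr_FR data installed (Glibc, for example), `decimal_point_for("fr_FR.UTF-8")` would be expected to return ",". When the locale is only a renamed copy of C.UTF-8, the library keeps the C conventions even though the application's messages are in French.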
Additionally, this is already the way things work with Glibc.  If Glibc doesn't have locale data for a locale, the setlocale() call will fail, regardless of whether the application has its own translations for that locale.
One might retort that a user is free to set LC_MESSAGES and LC_NUMERIC (for instance) separately, which is certainly true, although in that case, the user has made a conscious decision to do that and is unlikely to be confused at the conflicting presentation of messages and numeric values.
An option we could consider is allowing LC_MESSAGES specifically to retain the current behaviour.  In the C library, LC_MESSAGES is expected to control the messages generated by perror and strerror, and in principle one might take the view that not having translations of system errors was not really a sufficiently large problem to merit preventing the user from seeing translations from gettext().  We could stipulate that this will only happen for an explicit specification using the LC_MESSAGES environment variable and not for LC_ALL or LANG, to avoid the previously mentioned confusion.

Making the fix

Musl uses a function, __get_locale(), to look up locale data.  Presently, this function starts with the following stanza that deals with the "" locale case:
	if (!*val) {
		(val = getenv("LC_ALL")) && *val ||
		(val = getenv(envvars[cat])) && *val ||
		(val = getenv("LANG")) && *val ||
		(val = "C.UTF-8");
	}
This code overwrites the incoming parameter, such that it is no longer possible to tell whether the caller asked for "" or not.  Later on in the function is the code that copies the C.UTF-8 locale as a fallback:
	/* If no locale definition was found, make a locale map
	 * object anyway to store the name, which is kept for the
	 * sake of being able to do message translations at the
	 * application level. */
	if (!new && (new = malloc(sizeof *new))) {
		new->map = __c_dot_utf8.map;
		new->map_size = __c_dot_utf8.map_size;
		memcpy(new->name, val, n);
		new->name[n] = 0;
		new->next = loc_head;
		loc_head = new;
	}
The fix is fairly straightforward; instead of overwriting the incoming argument, we need to preserve it, and if we fail to find the locale definition and the incoming argument was "", we should simply return &__c_dot_utf8.  The upshot will be that any invalid environment variable settings will result in the C.UTF-8 locale as long as "" was requested from setlocale(), which is the behaviour we want.
If we wanted to permit LC_MESSAGES specifically to retain the current behaviour, we could just test for cat != LC_MESSAGES as part of the new test.
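
The shape of the main fix can be sketched as a self-contained model. This is not the Musl patch: `struct locale_sketch`, `find_definition`, `category_envvar`, `nonempty_env` and `get_locale_sketch` are all hypothetical stand-ins, the model pretends that no definition files are installed, and it uses NULL to mean "no such locale" without claiming that the real function reports failure that way.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for the internal locale map; only the name
 * is modelled here. */
struct locale_sketch {
    const char *name;
};

static const struct locale_sketch c_dot_utf8_sketch = { "C.UTF-8" };

/* Stand-in for the search for an installed definition file. In this
 * model nothing is installed, so only the built-in C.UTF-8 is known. */
static const struct locale_sketch *find_definition(const char *name)
{
    return strcmp(name, "C.UTF-8") == 0 ? &c_dot_utf8_sketch : NULL;
}

/* Environment variable that corresponds to a locale category.
 * Unrecognised categories fall through to LANG in this sketch. */
static const char *category_envvar(int cat)
{
    switch (cat) {
    case LC_CTYPE:    return "LC_CTYPE";
    case LC_NUMERIC:  return "LC_NUMERIC";
    case LC_TIME:     return "LC_TIME";
    case LC_COLLATE:  return "LC_COLLATE";
    case LC_MONETARY: return "LC_MONETARY";
    case LC_MESSAGES: return "LC_MESSAGES";
    default:          return "LANG";
    }
}

/* Returns the value of `var`, or NULL if it is unset or empty. */
static const char *nonempty_env(const char *var)
{
    const char *v = getenv(var);
    return (v && *v) ? v : NULL;
}

/* Model of the fixed lookup: the incoming argument is preserved, so
 * the fallback can depend on whether the caller asked for "". */
const struct locale_sketch *get_locale_sketch(int cat, const char *val)
{
    const char *name = val;          /* `val` itself is left untouched */
    int from_env = (*val == '\0');   /* did the caller ask for ""? */

    if (from_env) {
        name = nonempty_env("LC_ALL");
        if (!name) name = nonempty_env(category_envvar(cat));
        if (!name) name = nonempty_env("LANG");
        if (!name) name = "C.UTF-8";
    }

    const struct locale_sketch *found = find_definition(name);
    if (found)
        return found;

    /* Unknown name: fall back to C.UTF-8 only when the name came from
     * the environment; an explicit unknown name is an error. */
    return from_env ? &c_dot_utf8_sketch : NULL;
}
```

With `LC_ALL=xx_YY` in the environment, `get_locale_sketch(LC_MESSAGES, "")` yields the C.UTF-8 entry, while `get_locale_sketch(LC_MESSAGES, "xx_YY")` yields NULL. The LC_MESSAGES exception discussed above is deliberately left out of the sketch.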