nmlgc/Locales and Unicode on Windows.md

## Locales and Unicode on Windows.md

### About Locale Emulator and the concept of locale emulation itself

I came across this program a while ago myself, and was just as amazed by its premise. I looked at the source code, and saw lots of involved low-level NT kernel voodoo I don't understand, instead of the code I tend to use to solve this kind of problem. Truly, this must be an amazing piece of software, I thought... until I tested it with th135 and noticed that it rendered the Music Room text as exactly the same garbage you see when running the game without any locale emulation. Yeah, it did display 東方心綺楼 in the title bar and correctly referenced the 御首頂戴帳 folder, but overall I'd consider this to be a worse alternative to AppLocale.
So I did some research, and it turns out that this thing merely implements the bare minimum of functionality necessary (namely, changing the system ANSI code page in memory) to purportedly keep games from crashing. Anything just a bit more involved (and the text rendering in recent Tasofro games certainly is) appears to be outside of its scope. I wouldn't trust it. (But hey, at least it produces nice crash dumps.)
Oh, and of course th14 would still crash if it is in a path containing characters outside its emulated locale. It saddens me that we have to continue to play this locale game, with any inferior AppLocale knock-off being heralded as The Perfect Solution. On the other hand, thcrap is the proof that you can solve most issues with this locale crud, and then some, by writing wrapper functions that expect and return UTF-8 instead, and defining a fallback code page (not "locale") for compatibility. Yes, it's a mostly stupid and mindless affair (there are some functions you need to wrap more creatively, but they are few and far between), but given thcrap's premise of being a multilingual patcher, I see no other way. And since it's a rather high-level implementation, it also works on every Windows version since 98, as well as in Wine.
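To make the wrapper idea a bit more concrete, here is a minimal model of the decoding step. It is written in Python only because its codecs reproduce the Windows code pages on any platform; this is not thcrap's actual code, and the names `to_wide` and `FALLBACK_CODEPAGE` are made up for illustration. The sketch assumes that incoming bytes are tried as UTF-8 first, and are only reinterpreted in the fallback code page if they turn out not to be valid UTF-8.

```python
# Hypothetical setting: Shift-JIS (code page 932), as used by the original games.
FALLBACK_CODEPAGE = "cp932"


def to_wide(raw: bytes, fallback: str = FALLBACK_CODEPAGE) -> str:
    """Decode the bytes handed to a wrapped function into a Unicode string."""
    try:
        # Strict decoding: any invalid UTF-8 sequence raises instead of guessing.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not UTF-8, so assume the bytes are in the game's original code page.
        return raw.decode(fallback, errors="replace")


title = "東方心綺楼"
print(to_wide(title.encode("utf-8")))  # a patched string, already UTF-8
print(to_wide(title.encode("cp932")))  # an original game string in Shift-JIS
```

Both calls yield the same Unicode string, which is the whole point: the game keeps passing 8-bit strings around, and the wrapper decides what they mean.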
So, for those who are on XP, Vista or Linux and need a solution other than AppLocale, this is how you would use thcrap's UTF-8 wrapper part for any application:


1. Download thcrap and extract it.
2. Open Notepad, paste {}, and save this as null.js.
3. Right-click thcrap_loader.exe and create a new shortcut.
4. Right-click the new shortcut and click Properties.
5. Add null.js, followed by a space, followed by the full path to your game's executable, to the end of the Target field. For example, if you wanted to play Hopeless Masquerade, you'd have something like

       "C:\Full\path\to\thcrap_loader.exe" null.js "C:\Full\path\to\Touhou 13.5 - Hopeless Masquerade\th135.exe"

   in the Target field.
6. Rename the shortcut, and put it anywhere you like.
7. Run the game by clicking the shortcut.

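If you'd rather not assemble the Target line by hand, the steps above can be sketched in a few lines of Python. The helper name `shortcut_target` is hypothetical, the paths are the placeholder ones from the example, and I'm assuming here that null.js is nothing more than an empty JSON object.

```python
import json
import tempfile
from pathlib import Path


def shortcut_target(loader: str, config: str, game: str) -> str:
    """Build the Target line: quoted loader, config name, quoted game path."""
    return f'"{loader}" {config} "{game}"'


# Step 2: write the empty configuration (into a temporary directory here).
config_path = Path(tempfile.mkdtemp()) / "null.js"
config_path.write_text("{}", encoding="utf-8")

# Step 5: the text that goes into the shortcut's Target field.
target = shortcut_target(
    r"C:\Full\path\to\thcrap_loader.exe",
    "null.js",
    r"C:\Full\path\to\Touhou 13.5 - Hopeless Masquerade\th135.exe",
)
print(target)
```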

There will be bugs. I'd like to keep this thing small and simple, and hence, I'll only add functionality once I notice that something needs it; I also wouldn't have a test case otherwise. As a result, it might even break things that would work if you ran the game without any locale emulation at all. Please tell me; I'd like to get these fixed, too.


### About Unicode and locales on Windows

Firstly, there is no such thing as a "Unicode locale" (or, to use correct terminology, a system ANSI code page for the UTF-8 encoding). Yes, if Windows had one, we wouldn't have these problems to begin with. However, Microsoft shows no signs of ever moving in this direction. They even have valid reasons; this would break with the entire history of Unicode in Windows, but more on that later.
Instead, every Windows system is set to a specific system code page, which can be one of the entries in Microsoft's list of code page identifiers. And yes, even though UTF-8 appears in this list, it is impossible to use it as an ANSI code page (which is exactly what we would need).
I agree that changing your system locale is an overly drastic measure that should be disapproved of. If you look at the definition of the term "locale", the code page is only one aspect among many others. It saddens me that this term is used almost synonymously with "code page" in the otaku scene, only because locales on Windows are fixed entities (you cannot, for example, run your system in an American English locale using the Shift-JIS encoding, although that would make perfect sense).
You do not want to change your locale. You do not want to change your date format, and you do not want a ¥ as your directory separator. You want to change the system code page (preferably temporarily) and nothing else. Although you really are lucky that on Windows, locales don't actually include the language of the system UI...

(Note that everything below only applies to writing low-level, native code in e.g. C or C++ without using any libraries to wrap the OS functionality for you - as is done in both ZUN's and Twilight Frontier's games.)
Secondly, Unicode is used in every localization of Windows, even in Japanese systems. It's up to every developer to choose to use it in preference to the local code page of their language.
"Using Unicode on Windows" means manually converting every string from its original encoding to UTF-16, then passing that directly to the Windows API functions. The decision on UTF-16 dates back to the development of Windows NT 3.1, where Unicode support in Windows was first introduced. Back then, Unicode only covered 65536 characters (enough to fit in 16 bits), so using a fixed-width encoding that could cover every possible character might even have been a good decision at the time. Well, too bad that Unicode was expanded beyond that just three years later, in 1996, thereby nullifying that reason completely. Given that Windows had always been using these code pages, the choice of a 16-bit encoding that would purposely be incompatible with any existing 8-bit code page encoding is all the more baffling.
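What that expansion means in practice is easy to see by counting 16-bit code units. A small Python sketch (the helper name `utf16_units` is mine, not part of any API):

```python
def utf16_units(text: str) -> int:
    """Number of 16-bit code units needed to store text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


bmp_char = "\u6771"         # 東, inside the original 65536-character range
astral_char = "\U00020BB7"  # 𠮷, one of the characters added beyond that range

print(utf16_units(bmp_char))             # 1
print(utf16_units(astral_char))          # 2, a surrogate pair
print(len(astral_char.encode("utf-8")))  # 4 bytes in UTF-8
```

So UTF-16 ends up being a variable-width encoding just like UTF-8, only without the ASCII compatibility.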
However, everything in Windows has of course to be backwards-compatible, and Windows 3.0, 2 and 1 didn't have Unicode support. This meant that we ended up with two copies of every core Windows API function: one ending in a capital A that uses the system ANSI code page (with char* input and output parameters), and one ending in W that uses Unicode (with wchar_t* input and output parameters). The A functions are generally nothing else than a Unicode<->code page wrapper around the corresponding W functions. Take a look at the Wine source code if you need proof.
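As a rough model of that A/W relationship, consider the following Python sketch. The two functions are stand-ins I made up to mirror the naming scheme, not real API bindings, and the `acp` parameter plays the role of the system ANSI code page, which a real A function reads from the system rather than taking as an argument.

```python
def set_window_text_w(text: str) -> str:
    """Stand-in for a W function: receives text that is already Unicode."""
    return text


def set_window_text_a(raw: bytes, acp: str) -> str:
    """Stand-in for the matching A function: convert, then call the W version."""
    return set_window_text_w(raw.decode(acp, errors="replace"))


title = "東方心綺楼"
raw = title.encode("cp932")  # the bytes a Japanese game passes in

print(set_window_text_a(raw, "cp932"))   # Japanese system code page: intact
print(set_window_text_a(raw, "cp1252"))  # Western European code page: garbage
```

The bytes are identical in both calls; only the system setting used to interpret them differs, which is exactly why the same game works on one machine and not on another.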
But it gets worse.
In what I can only interpret as an act of capitalist malice, this Unicode support was exclusive to the (more expensive) NT family, and was ripped out of the Windows 95, 98 and Me kernel. Trying to use any W function on these systems would simply fail entirely. Logically, since everything had to run on these systems too, this led to millions of applications being written using the A functions, thus being implicitly tied to a certain system locale.
"Well, but times have changed," you might say. "These systems are long out of use, so why is this still a problem?"
Firstly, legacy code. For instance, the core of ZUN's STG engine, written for TH06 and largely unchanged since (oh, how I love patching the same buffer overflows all over again), had to run on Windows 9x, which were still widely used in 2002. He simply had no other choice but to use the A functions. Since, again, everything on Windows has to be backwards-compatible, it is still possible to use A functions in new software, thus reinforcing this vicious cycle.
Secondly, convenience. If you're writing code in C or C++, you're accustomed to using 8-bit char* strings. It's the default way to express and pass around a sequence of characters in these languages, and the only way a lot of coders know, especially if they didn't gain their coding experience on Windows. Thus, every program, and every library you use to reduce your own programming effort, makes use of it.
Converting every string to and from UTF-16 just to make use of the W functions does seem like an unnecessary nuisance for a lot of coders (and I would know - three years ago, I thought the same). Clearly, why should I bother with Unicode if everything Works on My Machine™ (and with My Set of Test Files)?
But then, one day, you broaden your horizons and obtain some files with names outside of your system's ANSI code page. You can happily store and view them in Explorer (of course, it's a system component, it better be Unicode)... but your char*-using program internally converts every character it doesn't know to a ?, which in turn gets converted to a UTF-16 ?, which of course is not part of your file name! And all that because you didn't play Microsoft's game of using the wchar_t type and the functions that accept such parameters instead. Doing this is the only true, correct way to write Windows software that works anywhere.
This combination of short-sightedness on Microsoft's part, short-sightedness on the part of application developers, and the lack of knowledge about all this is what brought us this misery - not the fact that a developer happens to live in a particular country. And this applies to everyone writing native Windows software.
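The lossy round trip can be simulated in a few lines of Python. This is only a model under stated assumptions: `open_via_ansi` and `files_on_disk` are hypothetical names, the file names are invented for the example, and Python's `errors="replace"` stands in for the default ? substitution described above.

```python
def open_via_ansi(name: str, acp: str, files_on_disk: set) -> bool:
    """Model a char*-based program looking up a Unicode file name."""
    # Wide -> ANSI: characters missing from the code page become "?".
    narrowed = name.encode(acp, errors="replace")
    # ANSI -> wide again, as happens before the name reaches the file system.
    widened = narrowed.decode(acp)
    return widened in files_on_disk


files_on_disk = {"御首頂戴帳.txt", "readme.txt"}

print(open_via_ansi("readme.txt", "cp1252", files_on_disk))      # True
print(open_via_ansi("御首頂戴帳.txt", "cp1252", files_on_disk))  # False
print(open_via_ansi("御首頂戴帳.txt", "cp932", files_on_disk))   # True
```

With a Western European code page, the program ends up asking for ?????.txt, which does not exist.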
(On a related note, this also puts quite a hilarious spin on the "Windows is not backwards-compatible enough" debate. As a weeaboo fan of Japanese media who is directly affected by this, you should rather complain that Windows is backwards-compatible to this ridiculous and harmful extent. And about the fact that this is even an issue in 2014, when the rest of the computing world has long deprecated anything that is not UTF-8.)