Reinmar/i18n.md

## i18n.md

      
    Process


We code the features using the t() function, in which we define the English value of a string (and, optionally, a context). E.g. t( 'Bold' ) or t( 'Button [context: clothing]' ) (the [context: clothing] suffix will be automatically removed during the build or, in dev mode, at runtime). Each context must be defined in lang/contexts.json of the package which uses it or in ckeditor5-core/lang/contexts.json (for common strings).
From time to time, we run a tool which scans the code for all t() usages, extracts the strings from them, builds a contexts map (based on all lang/contexts.json files) and checks whether all used strings have defined contexts. It then builds a PO file with the English strings and contexts and uploads it to Transifex.
Then we run a tool which downloads the PO files for all defined languages and puts them back in the lang/ directories.
When building an editor, a specific bundler reads the PO files, creates translation map(s) for the chosen language(s) based on them and bundles those as defined in the next section.
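The scanning step described above could be sketched roughly as follows. This is only an illustration: the helper names (collectMessages, findMissingContexts) and the sample context description are made up, and a real tool would rather walk an AST than rely on a regex that only understands simple single-quoted literals.

```javascript
// Hypothetical helper: collects string literals passed as the first argument of t().
// Assumption: only simple single-quoted literals are used.
function collectMessages( source ) {
	const pattern = /\bt\(\s*'((?:[^'\\]|\\.)*)'/g;
	const messages = new Set();
	let match;

	while ( ( match = pattern.exec( source ) ) !== null ) {
		messages.add( match[ 1 ] );
	}

	return [ ...messages ];
}

// Hypothetical helper: returns the messages which have no entry in the merged contexts map.
function findMissingContexts( messages, contexts ) {
	return messages.filter( message => !Object.prototype.hasOwnProperty.call( contexts, message ) );
}

const source = `
	t( 'Bold' );
	t( 'Button [context: clothing]' );
	t( 'Bold' );
`;

// Illustrative contexts map; the description is invented for this example.
const contexts = { 'Bold': 'Toolbar button tooltip for the Bold feature.' };
const messages = collectMessages( source );

console.log( messages ); // [ 'Bold', 'Button [context: clothing]' ]
console.log( findMissingContexts( messages, contexts ) ); // [ 'Button [context: clothing]' ]
```

A non-empty result of the second call would be the signal to stop and ask for the missing context before uploading anything.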

Implementation

ckeditor5-core/src/editor~Editor#constructor

I propose to merge the ctx name into the string in order to keep the same t() params in every possible environment t( str, values ). If it was a separate param, then in the build version (in which we replace strings with ids) it would have to be left unused, or, we'd need to change the implementation of t() on the fly.
The CKE_LANG will be defined by the bundler. It will be the editor class that's going to set it, which means that it will only need to be set when bundling for the browser environment. Or we could go a bit further than that and define a utils/global object which would retrieve the global scope depending on the env. In the browser that would be window, in Node.js that would be the global object. That would allow this code to work without the bundler needing to preset this value.
PS. we already have utils/dom/global, but I think that it makes sense to keep them separated.
this.config.define( 'lang', global.CKE_LANG || 'en' );

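this.locale = new Locale( this.config.get( 'lang' ) );

// Bind t() so it keeps access to the locale when called as a standalone function.
const t = this.locale.t.bind( this.locale );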

// Usage:
t( 'OK' );
t( 'button' );
t( 'button [context: clothing]' );
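The utils/global idea mentioned above might look more or less like this. It's a minimal sketch: getGlobal() is a hypothetical name, and relying on globalThis first is my assumption about what the target environments support.

```javascript
// Hypothetical utils/global: picks the global object of the current environment.
function getGlobal() {
	if ( typeof globalThis !== 'undefined' ) {
		return globalThis;
	}

	if ( typeof window !== 'undefined' ) {
		return window; // Browsers.
	}

	return global; // Node.js.
}

const globalScope = getGlobal();

// Without a bundler nothing presets CKE_LANG, so the default is used.
const defaultLang = globalScope.CKE_LANG || 'en';

// A bundler (or a test) could preset the value like this:
globalScope.CKE_LANG = 'pl';

const presetLang = globalScope.CKE_LANG || 'en';

console.log( defaultLang, presetLang ); // en pl
```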
ckeditor5-*/lang/contexts.json

Contexts for used messages.
Examples: https://github.com/ckeditor/ckeditor-dev/tree/master/dev/langtool/meta
ckeditor5-core/lang/contexts.json:
{
	"OK": "Label for OK button in a dialog."
}
ckeditor5-form/lang/contexts.json:
{
	"button": "Name of a clickable form control, e.g. 'OK button'."
}
ckeditor5-tailor/lang/contexts.json:
{
	"button [context: clothing]": "Button as used in clothes."
}
The button is first defined in the ckeditor5-form package without a context because, historically, we could've used it there without one (in our case, button is a UI button most of the time). Then, while working on the CKEditor 5 Tailor plugin, we realised that button is already used, but in a different context, so we can't use t( 'button' ) as it would point to the wrong context definition (contexts are global for all packages). Instead, we'll use t( 'button [context: clothing]' ) and add its own definition in ckeditor5-tailor/lang/contexts.json.
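Since contexts are global for all packages, the tool that builds the contexts map could detect collisions while merging the files. A rough sketch, assuming that a duplicated key should be treated as an error (that policy and the buildContextsMap() name are my assumptions, not something already decided):

```javascript
// Merges per-package contexts into one global map.
// `packages` maps a package name to its parsed lang/contexts.json (illustrative input shape).
function buildContextsMap( packages ) {
	const contexts = new Map();

	for ( const [ packageName, packageContexts ] of Object.entries( packages ) ) {
		for ( const [ message, description ] of Object.entries( packageContexts ) ) {
			if ( contexts.has( message ) ) {
				const owner = contexts.get( message ).packageName;

				throw new Error( `"${ message }" is defined in both ${ owner } and ${ packageName }.` );
			}

			contexts.set( message, { packageName, description } );
		}
	}

	return contexts;
}

const contexts = buildContextsMap( {
	'ckeditor5-core': { 'OK': 'Label for OK button in a dialog.' },
	'ckeditor5-form': { 'button': 'Name of a clickable form control, e.g. \'OK button\'.' },
	'ckeditor5-tailor': { 'button [context: clothing]': 'Button as used in clothes.' }
} );

console.log( contexts.size ); // 3
console.log( contexts.get( 'button [context: clothing]' ).packageName ); // ckeditor5-tailor
```

If ckeditor5-tailor defined plain "button" again, the merge would throw instead of silently pointing translators at the wrong description.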
ckeditor5-utils/src/locale

No magic here – it uses the translate() function of the utils/translation-service module to get the translated string and replaces placeholder values ($1, $2, etc.) with the passed args.
import { translate } from './translation-service';

export default class Locale {
	constructor( lang ) {
		this.lang = lang;
	}

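	t( str, ...values ) {
		const translatedString = translate( this.lang, str );

		// Replace the $1, $2, ... placeholders with the passed values.
		// Placeholders without a matching value are left untouched.
		return translatedString.replace( /\$(\d+)/g, ( match, index ) => {
			const value = values[ index - 1 ];

			return value === undefined ? match : value;
		} );
	}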
}
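To show how the pieces fit together, here's a self-contained sketch combining the class with the development-mode translate() described in the next section. Binding t() in the constructor and leaving unmatched placeholders untouched are my assumptions, and the sample messages are made up.

```javascript
// Development-mode translate(): only strips the ` [context: ...]` suffix.
function translate( lang, str ) {
	return str.replace( / \[context: [^\]]+\]$/, '' );
}

class Locale {
	constructor( lang ) {
		this.lang = lang;

		// Assumption: bind t() so that it can be passed around as a standalone function.
		this.t = this.t.bind( this );
	}

	t( str, ...values ) {
		const translatedString = translate( this.lang, str );

		// Replace $1, $2, ... with the passed values; leave unmatched placeholders as they are.
		return translatedString.replace( /\$(\d+)/g, ( match, index ) => {
			const value = values[ index - 1 ];

			return value === undefined ? match : value;
		} );
	}
}

const t = new Locale( 'en' ).t;

console.log( t( 'button [context: clothing]' ) ); // button
console.log( t( 'Insert $1 rows and $2 columns', 3, 4 ) ); // Insert 3 rows and 4 columns
console.log( t( 'Page $1 of $2', 1 ) ); // Page 1 of $2
```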
ckeditor5-utils/src/translation-service

The goal of this module is to encapsulate the whole "translations repository" logic from the rest of the code.
It may be dirty with some preprocessor rules or may need to be generated on the fly depending on the build type (e.g. with or without code splitting – i.e. with a separate or built-in language file). However, it would be good if it was defined in the repository so that for dev purposes we wouldn't have to do anything. It'd just return the passed string with the context suffix removed.
Development mode implementation:
export function translate( lang, str ) {
	// Remove the ` [context: ...]` suffix.
	return str.replace( / \[context: [^\]]+\]$/, '' );
}
For the bundles, we have two cases:

where only one language is needed,
where there are multiple languages.

In case of just one language, it'll be best to simply replace the str param of t() calls with the proper value (without the ctx now). This will allow for code splitting and hot-loading plugins without any tricks. It may happen, though, that some strings will then repeat multiple times – e.g. the ones from the core package. While this is going to make the un-gzipped package bigger, there shouldn't be much of a difference for gzipped code. Besides, this will be just a few strings anyway and we'll save some space by having such a simple implementation too.
export function translate( lang, str ) {
	return str;
}
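The replacement of the str param could be done by a build step along these lines. This is a sketch only: inlineTranslations() is a hypothetical name, the regex handles just simple single-quoted literals, and the Polish strings are there for illustration.

```javascript
// Illustrative build step: inlines translations into t() calls for a single language.
// `translations` maps the original message (including the context suffix) to the translated string.
function inlineTranslations( source, translations ) {
	return source.replace( /\bt\(\s*'((?:[^'\\]|\\.)*)'/g, ( match, message ) => {
		// Assumption: untranslated messages fall back to English without the context suffix.
		const fallback = message.replace( / \[context: [^\]]+\]$/, '' );
		const translated = Object.prototype.hasOwnProperty.call( translations, message ) ?
			translations[ message ] : fallback;

		// JSON.stringify() takes care of escaping quotes and special characters.
		return `t( ${ JSON.stringify( translated ) }`;
	} );
}

const output = inlineTranslations(
	't( \'Bold\' ); t( \'button [context: clothing]\' ); t( \'Italic\' );',
	{ 'Bold': 'Pogrubienie', 'button [context: clothing]': 'guzik' }
);

console.log( output );
// t( "Pogrubienie" ); t( "guzik" ); t( "Italic" );
```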
In case of multiple languages we need to have some registry. The translate() implementation will be again simple:
export function translate( lang, str ) {
	// Let these be objects, not maps, because we control the values of lang and str
	// and it will be a bit easier to generate the source of an object programmatically.
	// Objects may also be better in terms of code size.
	return translations[ lang ][ str ];
}
What's the str in this case? In order to ensure that we don't have name collisions (important for bundle splitting) I'd say that this should be either:


A totally unique string – like a typical uid, but preferably shorter, using a wider range of Unicode characters. A 5-character string using the whole range of Unicode chars will give us complexity comparable to utils/uid:
> Math.pow( Math.pow( 2, 16 ), 5 )
1.2089258196146292e+24
> Math.pow( 35, 16 )
5.070942774902497e+24

The thing which worries me is whether there will be any issues with encoding – we've seen people sourcing CKEditor in some weird encodings and it was always blowing up. But one could use a minifier which encodes special characters and it would work again.
Another thing is the ability to debug such code. With unreadable ids it may be tricky.
Finally – creating objects in which these uids are keys may be tricky. I can't find now whether there are any characters which need to be escaped (other than the closing quote).
Anyway, this solution may create unpredictable and stupid issues.


Therefore, we may just use sequential ids from a short range (e.g. [a-z]). What about code splitting? There are two cases:

Code splitting was done when building the whole setup – then there won't be any problems in using non-conflicting ids, because it's done by one bundler at one time.
Someone first built e.g. ckeditor5-preset-article-editor and then, separately (it might be a different person), built ckeditor5-image. In this case, the preset bundle will use normal, short ids (because it's the main bundle) and the image feature bundle will use prefixed ids (e.g. ckeditor5-image/<id>). The idea is that a developer releasing their package will use a special bundler setup which configures the CKEditor plugin for Webpack to use prefixed ids.
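Generating sequential ids from the [a-z] range, with an optional package prefix for the second case, could be sketched like this. Both function names are hypothetical and the messages are just samples.

```javascript
// Converts a zero-based counter into a short id: a, b, ..., z, aa, ab, ...
function toShortId( index ) {
	let id = '';

	do {
		id = String.fromCharCode( 97 + ( index % 26 ) ) + id;
		index = Math.floor( index / 26 ) - 1;
	} while ( index >= 0 );

	return id;
}

// Creates an id generator; `prefix` is meant for separately built package bundles.
function createIdGenerator( prefix = '' ) {
	const ids = new Map();

	return message => {
		if ( !ids.has( message ) ) {
			ids.set( message, prefix + toShortId( ids.size ) );
		}

		return ids.get( message );
	};
}

const mainId = createIdGenerator();
const imageId = createIdGenerator( 'ckeditor5-image/' );

console.log( mainId( 'OK' ), mainId( 'Cancel' ), mainId( 'OK' ) ); // a b a
console.log( imageId( 'Insert image' ) ); // ckeditor5-image/a
console.log( toShortId( 25 ), toShortId( 26 ), toShortId( 27 ) ); // z aa ab
```

The same message always maps to the same id within one generator, so one bundler run can share a generator across all files it splits.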


Anyway, this is nothing we have to worry about today, because, most likely, we'll work on releasing standalone package bundles after 1.0.0.
Another thing to notice is that support for multiple languages and code splitting is an optional feature, so we can implement it for just one bundler, i.e. Webpack.
Let's say that we want to split a package into some preset X and packages Y and Z.
This will make for 3 files – x.js, y.js, z.js.
The idea is that each file will have accompanying language file(s) defining the translations needed for this file. So there will be:

x.js, x-pl.js, x-en.js, ...
y.js, y-pl.js, y-en.js, ...
z.js, z-pl.js, z-en.js, ...

In order to run the X preset with plugins with Polish translations one will need to load:
x.js, y.js, z.js, x-pl.js, y-pl.js, z-pl.js.
The language files will be built using entry points like this:
import { define } from '@ckeditor/ckeditor5-utils/src/translation-service';

define( 'pl', {
	'a': 'OK',
	'b': 'Anuluj',
	// ...
} );
And the define() function will merge these translations into the ones that it already has.
Tadam!
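For completeness, a minimal sketch of how define() and the multi-language translate() could share one registry. Falling back to the passed id when nothing is registered is my assumption, and the second language file with a prefixed id is only an illustration.

```javascript
// Registry of translations: { [ lang ]: { [ id ]: translatedString } }.
const translations = {};

// Merges the given translations into the ones already registered for the language.
function define( lang, languageTranslations ) {
	translations[ lang ] = Object.assign( translations[ lang ] || {}, languageTranslations );
}

// Assumption: falls back to the passed id when no translation is registered.
function translate( lang, str ) {
	const languageTranslations = translations[ lang ];

	if ( languageTranslations && Object.prototype.hasOwnProperty.call( languageTranslations, str ) ) {
		return languageTranslations[ str ];
	}

	return str;
}

// x-pl.js
define( 'pl', { 'a': 'OK', 'b': 'Anuluj' } );

// y-pl.js, built separately, hence the prefixed id (illustrative).
define( 'pl', { 'ckeditor5-image/a': 'Wstaw obraz' } );

console.log( translate( 'pl', 'b' ) ); // Anuluj
console.log( translate( 'pl', 'ckeditor5-image/a' ) ); // Wstaw obraz
console.log( translate( 'de', 'a' ) ); // a
```

Because define() merges instead of overwriting, the language files can be loaded in any order.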
PS. PO file generation

We've already changed our minds twice about which part of a t() call is msgctxt, msgid and msgstr, so let's clarify this:
For the following t() calls and contexts.json:
t( 'OK' );
t( 'button' );
t( 'button [context: clothing]' );
ckeditor5-core/lang/contexts.json:
{
	"OK": "Label for OK button in a dialog."
}
ckeditor5-form/lang/contexts.json:
{
	"button": "Name of a clickable form control, e.g. 'OK button'."
}
ckeditor5-tailor/lang/contexts.json:
{
	"button [context: clothing]": "Button as used in clothes."
}
The en.po file would look like this:
msgctxt "Label for OK button in a dialog."
msgid "OK"
msgstr "OK"

msgctxt "Name of a clickable form control, e.g. 'OK button'."
msgid "button"
msgstr "button"

msgctxt "Button as used in clothes."
msgid "button"
msgstr "button"

In this case the msgstr can be empty because then gettext will use msgid. In fact, in the samples I found in ckeditor/ckeditor5#387 (comment) it's empty.
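Generating such a file from the contexts map could look roughly like this. It's a sketch under a few assumptions: buildPoFile() and escapePo() are hypothetical names, PO headers are skipped, and each entry is emitted with msgctxt before msgid, which is the order gettext expects.

```javascript
// Escapes a string for use inside a double-quoted PO string.
function escapePo( str ) {
	return str.replace( /\\/g, '\\\\' ).replace( /"/g, '\\"' ).replace( /\n/g, '\\n' );
}

// Builds the content of en.po from a map of messages (as passed to t()) to their contexts.
function buildPoFile( contexts ) {
	return Object.entries( contexts ).map( ( [ message, context ] ) => {
		// The ` [context: ...]` suffix is not a part of the English string.
		const msgid = message.replace( / \[context: [^\]]+\]$/, '' );

		return [
			`msgctxt "${ escapePo( context ) }"`,
			`msgid "${ escapePo( msgid ) }"`,
			`msgstr "${ escapePo( msgid ) }"`
		].join( '\n' );
	} ).join( '\n\n' ) + '\n';
}

const po = buildPoFile( {
	'OK': 'Label for OK button in a dialog.',
	'button [context: clothing]': 'Button as used in clothes.'
} );

console.log( po );
```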
Regarding ckeditor/ckeditor5#387 (comment):

As msgid we use the id of that string, not the "Bold" string. Why? The reasons were explained in the post I quoted – the msgid has limited length and, more importantly, you lose all the translations if you change the id. So, if we used the real string "Bold" as msgid and then decided that for some reason it should be "Bold text", translators would need to repeat their work and for some time we wouldn't have this string translated at all. This was pointed out by @wwalc and it happened (the English text was changed) in the past in CKE4 multiple times.

We ignore this issue. It's a very rare situation.