If there's anything that Santa and his elves ought to know, it's how to make a list. After all, they're reading lists that children send in, and Santa maintains his very famous list. Another thing we know is that Santa and his elves are quite multilingual.
So one day, rather than hand-typing lists of gifts based on the data they received (which would require elves who spoke all the world's languages), one of the elves decided to take advantage of the power of Unicode's CLDR (Common Locale Data Repository).
This is one of Unicode's lesser-known projects.
As luck would have it, Raku has a module providing access to the data, called Intl::CLDR.
The elf figured he could probably use some of the data in it to automate their list formatting.
He began by installing Intl::CLDR and decided to play around with it in the terminal.
The module was designed to allow some degree of exploration in a REPL, so the elf did the following after reading the provided README:
# Repl response
use Intl::CLDR; # Nil
my $english = cldr<en> # [CLDR::Language: characters,context-transforms,
# dates,delimiters,grammar,layout,list-patterns,
# locale-display-names,numbers,posix,units]
The module loaded up the data for English and the object returned has a neat gist that
provides information about the elements it contains.
For a variety of reasons, Intl::CLDR objects can be referenced either as attributes or as keys.
Most of the time, the attribute reference is faster, but the key reference is more flexible (because let's be honest, $english{$foo} looks nicer than $english."$foo"(), and it also enables listy assignment via e.g. $english<grammar numbers>).
In any case, the elf saw that one of the data points is list-patterns, so he explored further:
# Repl response
$english.list-patterns; # [CLDR::ListPatterns: and,or,unit]
$english.list-patterns.and; # [CLDR::ListPattern: narrow,short,standard]
$english.list-patterns.and.standard; # [CLDR::ListPatternWidth: end,middle,start,two]
$english.list-patterns.and.standard.start; # {0}, {1}
$english.list-patterns.and.standard.middle; # {0}, {1}
$english.list-patterns.and.standard.end; # {0}, and {1}
$english.list-patterns.and.standard.two; # {0} and {1}
Aha! He found the data he needed.
List patterns are catalogued by their function (and-ing them, or-ing them, and a unit one designed for formatting conjoined units such as 2ft 1in or similar).
Each pattern has three different lengths.
Standard is what one would use most of the time, but if space is a concern, some languages might allow for even slimmer
formatting.
Lastly, each of those widths has four forms.
The two form combines, well, two elements.
The other three are used to collectively join three or more: start combines the first and second element, end
combines the penultimate and final element, and middle combines all second to penultimate elements.
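One way to read how the four forms fit together is as nested substitution, working from the right. The sketch below is only an illustration: the English patterns are copied by hand from the REPL output above rather than looked up through the module, and `fill` and `join-with-patterns` are hypothetical helper names.

```raku
# English 'and'/'standard' patterns, copied by hand (no CLDR lookup here)
my %en =
    start  => '{0}, {1}',
    middle => '{0}, {1}',
    end    => '{0}, and {1}',
    two    => '{0} and {1}';

# Hypothetical helper: fill {0} and {1} of one pattern in a single pass
sub fill(Str $pattern, $a, $b) {
    $pattern.subst(/ '{' (<[01]>) '}' /, -> $/ { ($a, $b)[+$0] }, :g)
}

# Hypothetical helper: combine any number of items using the four forms
sub join-with-patterns(%p, @items) {
    return ''                                   if @items == 0;
    return ~@items[0]                           if @items == 1;
    return fill(%p<two>, @items[0], @items[1])  if @items == 2;

    # 'end' joins the last two items, 'middle' wraps each inner item
    # (right to left), and 'start' finally attaches the first item
    my $result = fill(%p<end>, @items[*-2], @items[*-1]);
    $result = fill(%p<middle>, $_, $result) for @items[1 .. *-3].reverse;
    fill(%p<start>, @items[0], $result)
}

say join-with-patterns(%en, <a b>);       # a and b
say join-with-patterns(%en, <a b c d>);   # a, b, c, and d
```

This is slower than plain concatenation, but it makes no assumption about where the placeholders sit inside a pattern.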
He then wondered what this might look like for other languages. Thankfully, testing this out in the repl was easy enough:
my &and-pattern = { cldr{$^language}.list-patterns.and.standard<start middle end two>.join: "\t" }
# Repl response (RTL corrected, s/\t/' '+/)
and-pattern 'es' # {0}, {1} {0}, {1} {0} y {1} {0} y {1}
and-pattern 'ar' # {0} و{1} {0} و{1} {0} و{1} {0} و{1}
and-pattern 'ko' # {0}, {1} {0}, {1} {0} 및 {1} {0} 및 {1}
and-pattern 'my' # {0} - {1} {0} - {1} {0}နှင့် {1} {0}နှင့် {1}
and-pattern 'th' # {0} {1} {0} {1} {0} และ{1} {0}และ{1}
He quickly saw that there was quite a bit of variation!
Thank goodness someone else had already catalogued all of this for him.
So he went about trying to create a simple formatting routine.
To begin, he created a very detailed signature and then imported the modules he'd need.
#| Lengths for list format. Valid values are 'standard', 'short', and 'narrow'.
subset ListFormatLength of Str where any(<standard short narrow>);
#| Types for list format. Valid values are 'and', 'or', and 'unit'.
subset ListFormatType of Str where any(<and or unit>);
use User::Language; # obtains default languages for a system
use Intl::LanguageTag; # use standardized language tags
use Intl::CLDR; # accesses international data
#| Formats a list of items in an internationally-aware manner
sub format-list(
+@items, #= The items to be formatted into a list
LanguageTag() :$language = user-language, #= The language to use for formatting
ListFormatLength :$length = 'standard', #= The formatting width
ListFormatType :$type = 'and' #= The type of list to create
) {
...
...
...
}
That's a bit of a big bite, but it's worth taking a look at.
First, the elf decided to use declarator POD wherever possible.
This can really help out people who might want to use his eventual module in an IDE, for autogenerating documentation, or for curious users in the REPL.
(If you type in ListFormatLength.WHY, the text “Lengths for list format … and 'narrow'” will be returned.)
For those unaware of declarator POD, you can use either #| to apply a comment to the following symbol declaration (in the example, for the subsets and the sub itself), or #= to apply it to the preceding symbol declaration (most common with attributes).
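As a small illustration of the two comment forms, here is a sketch with a made-up `wrap-gift` sub (the name and its parameters are hypothetical and not part of the elf's module):

```raku
#| Wraps a gift for delivery
sub wrap-gift(
    Str $gift,            #= The gift to wrap
    Str $paper = 'red'    #= The colour of the wrapping paper
) {
    "$gift in $paper paper"
}

# The leading #| comment is attached to the sub itself
say &wrap-gift.WHY.leading;   # Wraps a gift for delivery

# The trailing #= comments are attached to the individual parameters
say .name, ': ', .WHY for &wrap-gift.signature.params;
```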
Next, he imports two modules that will be useful.
User::Language detects the system language, and he uses this to provide sane defaults.
Intl::LanguageTag is one of the most fundamental modules in the international ecosystem.
While he wouldn't strictly need it (we'll see he'll ultimately only use them in string-like form), it helps to ensure at least a plausible language tag is passed.
If you're wondering what the +@items means, it applies a DWIM logic to the positional arguments.
If one does format-list @foo, presumably the list is @foo, and so @items will be set to @foo.
On the other hand, if someone does format-list $foo, $bar, $xyz, presumably the list isn't $foo, but all three items.
Since the first item isn't a Positional, Raku assumes that $foo is just the first item and the remaining positional arguments are the rest of the items.
The extra () in LanguageTag() means that it will take either a LanguageTag or anything that can be coerced into one (like a string).
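Both signature features can be tried in isolation. The subs below are hypothetical stand-ins (they are not part of the elf's code), and Int() is used instead of LanguageTag() only so that the sketch runs without any modules installed:

```raku
# +@items: a single list argument becomes the items;
# several arguments are each treated as one item
sub count-items(+@items) { @items.elems }

my @gifts = <doll train kite>;
say count-items @gifts;             # 3
say count-items 'doll', 'train';    # 2
say count-items 'doll';             # 1

# Int(): accepts an Int, or anything that can be coerced into one
sub next-number(Int() $n) { $n + 1 }

say next-number 41;      # 42
say next-number '41';    # 42
```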
Okay, so with that housekeeping stuff out of the way, he gets to coding the actual formatting, which is devilishly simple:
my $format = cldr{$language}.list-patterns{$type}{$length};
my ($start, $middle, $end, $two) = $format<start middle end two>;
if @items > 2 { ... }
elsif @items == 2 { @items[0] ~ $two ~ @items[1] }
elsif @items == 1 { @items.head }
else { '' }
He paused here to check and see if stuff would work. So he ran his script and added in the following tests:
# output
format-list (), :language<en>; # ''
format-list <a>, :language<en>; # 'a'
format-list <a b>, :language<en>; # 'a{0} and {1}b'
While the simplest two cases were easy, the first one to use CLDR data didn't work quite as expected.
The elf realized he'd need to actually replace the {0} and {1} with the items.
While technically he should use subst or similar, after going through the CLDR, he realized that all of the patterns begin with {0} and end with {1}.
So he cheated and changed the initial assignment lines to
my $format = cldr{$language}.list-patterns{$type}{$length};
my ($start, $middle, $end, $two) = $format<start middle end two>.map: *.substr(3, *-3);
Now his two-item function worked well.
For the three-or-more condition though, he had to think a bit harder about how to combine things.
There are actually quite a few different ways to do it!
The simplest way for him was to take the first item, then the $start combining text, then join the second through penultimate items, and then finish off with the $end and final item:
if @items > 2 {
    ~ @items[0]
    ~ $start
    ~ @items[1..*-2].join($middle)
    ~ $end
    ~ @items[*-1]
}
elsif @items == 2 { @items[0] ~ $two ~ @items[1] }
elsif @items == 1 { @items.head }
else { '' }
Et voilà! His formatting function was ready for prime-time!
# output
format-list (), :language<en>; # ''
format-list <a>, :language<en>; # 'a'
format-list <a b>, :language<en>; # 'a and b'
format-list <a b c>, :language<en>; # 'a, b, and c'
format-list <a b c d>, :language<en>; # 'a, b, c, and d'
Perfect! Except for one small problem. When they actually started using this, the computer systems overheated and melted some of the snow away. Every single time they called the function, the CLDR database needed to be queried and the strings needed to be clipped. The elf had to come up with something more efficient.
He searched high and wide for a solution, and eventually found himself in the dangerous lands of here be dragons, otherwise known in Raku as EVAL.
He knew that EVAL could potentially be dangerous, but that for his purposes, he could avoid those pitfalls.
What he would do is query CLDR just once, and then produce a code block that would do the simple logic based on the number of items in the list.
The string values could probably be hard coded, sparing some variable lookups too.
EVAL should be used with great caution.
All it takes is one errant unescaped string being accepted from an unknown source and your system could be taken over.
This is why it requires you to affirmatively type use MONKEY-SEE-NO-EVAL in a scope that requires it.
However, in situations like this, where we control all inputs going in, things are much safer.
In tomorrow's article, we'll discuss ways to do this in an even safer manner, although it adds a small degree of complexity.
To begin, the elf imagined his formatting function as if it had hardcoded values. He just used the English ones for now:
sub format-list(+@items) {
    if    @items  > 2 { @items[0] ~ ', ' ~ @items[1..*-2].join(', ') ~ ', and ' ~ @items[*-1] }
    elsif @items == 2 { @items[0] ~ ' and ' ~ @items[1] }
    elsif @items == 1 { @items[0] }
    else              { '' }
}
That was ... really simple! But he needed this in a string format.
One way to do that would be to just use straight string interpolation, but he decided to use Raku's equivalent of a heredoc, q:to, together with the :s adverb so that only $-sigiled variables are interpolated.
For those unfamiliar, in Raku, quotation marks are actually just a form of syntactic sugar to enter into the Q (for quoting) sublanguage.
Using quotation marks, you only get a few options: ' ' means no escaping except for \\, and using " " means interpolating blocks and $-sigiled variables.
If we manually enter the Q-language (using q or Q), we get a LOT more options.
If you're more interested in those, you can check out Elizabeth Mattijsen's 2014 Advent Calendar post on the topic.
Our little elf decided to use the q:s:to option to enable him to keep his code as is, with the @items accesses and the blocks left untouched.
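To make the difference between a few of these quoting forms concrete, here is a short sketch with throwaway variables (`$name` and `@elves` are made up for the example). The :s adverb shown in the middle turns on interpolation for $-sigiled variables only:

```raku
my $name  = 'Santa';
my @elves = <Alabaster Bushy>;

# Single quotes: nothing is interpolated
say 'Hello $name, @elves[0]';        # Hello $name, @elves[0]

# q with :s: only $-sigiled variables are interpolated
say q:s/Hello $name, @elves[0]/;     # Hello Santa, @elves[0]

# qq (the same as double quotes): array elements are interpolated too
say qq/Hello $name, @elves[0]/;      # Hello Santa, Alabaster
```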
use MONKEY-SEE-NO-EVAL;
my $format = cldr{$language}.list-patterns{$type}{$length};
my ($start, $middle, $end, $two) = $format<start middle end two>.map: *.substr(3, *-3);
my $code = q:s:to/FORMATCODE/;
    sub format-list(+@items) {
        if    @items  > 2 { @items[0] ~ $start ~ @items[1..*-2].join($middle) ~ $end ~ @items[*-1] }
        elsif @items == 2 { @items[0] ~ $two ~ @items[1] }
        elsif @items == 1 { @items[0] }
        else              { '' }
    }
    FORMATCODE
EVAL $code;
The only small catch is that he'd need to get a slightly different version of the text from CLDR.
If the text and were placed verbatim where $two is, that block would end up being @items[0] ~ and ~ @items[1], which will cause a compile error.
Luckily, Raku has a method here to help out!
By using the .raku method, we get a Raku code form for most any object.
For instance:
# REPL output
'abc'.raku # "abc"
"abc".raku # "abc"
<a b c>.raku # ("a", "b", "c")
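The effect on generated code can be seen in a tiny sketch. The variable names here are made up for the example, and q:s is used so that only the scalar variables are interpolated into the code string:

```raku
use MONKEY-SEE-NO-EVAL;

my $two    = ' and ';
my $quoted = $two.raku;    # the same text, but as Raku source code

# Without .raku, the raw text lands in the code and will not compile
my $broken = q:s/'a' ~ $two ~ 'b'/;
say $broken;               # 'a' ~  and  ~ 'b'

# With .raku, a properly quoted string literal lands in the code
my $fixed = q:s/'a' ~ $quoted ~ 'b'/;
say $fixed;                # 'a' ~ " and " ~ 'b'
say EVAL $fixed;           # a and b
```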
So he just changed his initial assignment line to chain one more method (.raku):
my ($start, $middle, $end, $two) = $format<start middle end two>.map: *.substr(3,*-3).raku;
Now his code worked. His last step was to find a way to reuse the generated code so as to benefit from this initial extra work. He made a very rudimentary caching setup (rudimentary because it's not theoretically threadsafe, but even in this case, since values are only added, and will be identically produced, there's not a huge problem). This is what he came up with (declarator POD and type information removed):
sub format-list (+@items, :$language = 'en', :$type = 'and', :$length = 'standard') {
    state %cache;
    my $code = "$language/$type/$length";
    # Get a formatter, generating it if it's not been requested before
    my &formatter = %cache{$code}
        //= generate-list-formatter($language, $type, $length);
    formatter @items;
}
sub generate-list-formatter($language, $type, $length --> Sub ) {
    # Get CLDR information
    my $format = cldr{$language}.list-patterns{$type}{$length};
    my ($start, $middle, $end, $two) = $format<start middle end two>.map: *.substr(3,*-3).raku;
    # Generate code (:s interpolates only the $-sigiled variables)
    my $code = q:s:to/FORMATCODE/;
        sub format-list(+@items) {
            if    @items  > 2 { @items[0] ~ $start ~ @items[1..*-2].join($middle) ~ $end ~ @items[*-1] }
            elsif @items == 2 { @items[0] ~ $two ~ @items[1] }
            elsif @items == 1 { @items[0] }
            else              { '' }
        }
        FORMATCODE
    # compile and return
    use MONKEY-SEE-NO-EVAL;
    EVAL $code;
}
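For readers who want to try the generate-and-cache idea without installing any modules, here is a self-contained sketch. It makes several assumptions: the `%patterns` hash is a hand-copied stand-in for the CLDR lookup (only the 'and'/'standard' patterns for two languages), the type and length parameters are dropped, and the generated sub is anonymous.

```raku
use MONKEY-SEE-NO-EVAL;

# Hypothetical stand-in for the CLDR data
my %patterns =
    en => { start => '{0}, {1}', middle => '{0}, {1}', end => '{0}, and {1}', two => '{0} and {1}' },
    es => { start => '{0}, {1}', middle => '{0}, {1}', end => '{0} y {1}',    two => '{0} y {1}'   };

sub generate-list-formatter($language --> Sub) {
    # Clip the leading {0} and trailing {1}, then turn each piece into Raku source
    my ($start, $middle, $end, $two) =
        %patterns{$language}<start middle end two>.map: *.substr(3, *-3).raku;

    # :s interpolates the four scalars; @items and the blocks are left alone
    my $code = q:s:to/FORMATCODE/;
        sub (+@items) {
            if    @items  > 2 { @items[0] ~ $start ~ @items[1..*-2].join($middle) ~ $end ~ @items[*-1] }
            elsif @items == 2 { @items[0] ~ $two ~ @items[1] }
            elsif @items == 1 { @items[0] }
            else              { '' }
        }
        FORMATCODE

    EVAL $code;
}

sub format-list(+@items, :$language = 'en') {
    state %cache;
    # Generate the formatter only the first time a language is requested
    my &formatter = %cache{$language} //= generate-list-formatter($language);
    formatter @items;
}

say format-list <a b c>;                               # a, b, and c
say format-list <manzanas plátanos>, :language<es>;    # manzanas y plátanos
```

Because every interpolated value passes through .raku and comes from a fixed table, no caller-supplied text ever reaches EVAL in this sketch.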
And there he was! His function was all finished. He wrapped it up into a module and sent it off to the other elves for testing:
format-list <apples bananas kiwis>, :language<en>; # apples, bananas, and kiwis
format-list <apples bananas>, :language<en>, :type<or>; # apples or bananas
format-list <manzanas plátanos>, :language<es>; # manzanas y plátanos
format-list <انارها زردآلو تاریخ>, :language<fa>; # انارها، زردآلو، و تاریخ
Hooray!
The next day, though, another elf took up his work and decided to go even crazier! Stay tuned for more of the antics from Santa's elves.