@raiph
Last active January 22, 2021 22:33
trans DWEM

What this is

My documentation of P6's .trans routine for anyone who reads it.

It may be a step toward updating the official doc and/or cleaning up the relevant spec tests and/or functionality.

I have tried to be clear enough that a future me will be able to make sense of it, and hopefully anyone else who reads this.

It's supposedly exhaustive (definitely exhausting!).

Hmm Bugs and known failures appear in Hmm notes at the ends of some sections.

Why I wrote it

On Sept 4th, 2018, after considering .trans as an answer to an SO question "Simultaneous substitutions with s///?", I tried to ensure I understood .trans.

I began by reading the P6 doc page. I found it very hard to understand. Next, I tried experimenting with the function as documented in My early .trans experiment near the end of this gist. I looked at source tests (see Compiler source code / tests at the end of this gist). I went down a series of rabbit holes. When I came up for air I thought I'd document what I'd found.

Arguments

All trans arguments must be pairs:

  • Positional argument pairs are matchers/replacers expressions, explained in the next section. You may pass as many matchers/replacers expression pairs as you like. Passing none means .trans doesn't do anything.

  • The "adverbs" :complement, :delete, and :squash may be passed as named arguments. See the section Positional pairs vs named arguments below to make sure you don't accidentally pass them as positional pairs. (I can't see how this is really an issue right now. I obviously thought so when I first wrote this. I recall falling into this trap. How did I? Why don't I now? Is this really a valid concern?)

Positional pairs: matchers expression => replacers expression

A matchers expression (the key or LHS of a positional pair passed to .trans) generates one or more matchers. A replacers expression (the value or RHS of a pair), coupled with automatic replacer extension (see Replacers extension below), generates one or more replacers that correspond to the matchers generated by the LHS of the pair.

This gist assumes that, at least semantically, all the expressions in all the pairs are converted into a single overall list of matchers and a single corresponding list of replacers, one for each matcher, because that's what .trans seems to do.

Using a single string as a matchers expression or replacers expression

A matchers expression or replacers expression (LHS or RHS of a positional pair argument passed to .trans) that is just a single string of one or more characters generates one or more matchers or replacers, one per character the expression specifies.

Most characters are literal but there are notable powers and quirks.

For example, consider the pair 'abc' => 'de', which has a single string on both sides. The 'abc' turns into three matchers: one for 'a', one for 'b', and one for 'c'. The 'de' turns into three corresponding replacers: a 'd' replacer corresponding to the 'a' matcher, an 'e' replacer for replacing a 'b', and a 'd' replacer for replacing a 'c'. (See Replacers extension for where the final 'd' replacer comes from. In other scenarios the third replacer would be 'e' or a null string instead.)

A single string matchers/replacers expression supports character ranges that expand to their range. For example, 'a..dm..q' expands to 'abcdmnopq' which in turn becomes 9 matchers or 9 replacers depending on which side of the => it's specified.
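As a minimal sketch of range expansion inside single-string specifiers (the input string is made up for illustration):

```raku
# 'a..e' expands to the five matchers a, b, c, d, e;
# 'A..E' expands to the five corresponding replacers A, B, C, D, E.
say 'abcdefghij'.trans('a..e' => 'A..E');   # ABCDEfghij
```

Characters outside the range (f through j here) have no matcher and are kept as they are.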

Hmm There's crazier stuff too that's been added to the test files in the roast directory for transliteration. Other than this link I'm not going to mention or document them in this gist for now.

Using regexes in a matchers expression and/or closures in a replacers expression

You can specify a regex as a matchers expression.

You can specify a closure as a replacers expression.

A single closure can be paired with a single regex to make sane use of $/. But if you also have non-regex matchers or multiple regexes, things won't turn out well. Here be bugs or at least insane semantics.
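A small sketch of the simplest case, a single regex matcher paired with a plain string replacer (the input is made up for illustration; closures are left out here because their interaction with $/ is exactly the area flagged as shaky above):

```raku
# The regex matches the whole run 'll' as one unit,
# and that unit is replaced by the single string 'L'.
say 'hello'.trans(/l+/ => 'L');   # heLo
```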

Using an array (or list) as a matchers or replacers expression

You can pass an array or list of matchers/replacers as a matchers/replacers expression.

This is how you can specify a string as a single unit, as against a string that specifies each of its characters as its own unit as explained in the previous section: use a matchers or replacers expression that's a list or array or turns into one. For example:

  • <foo bar> is a list specifying two strings, 'foo' and 'bar'. These do not turn into six single character matchers or replacers! Instead they are just two matchers that match three character sub-strings (or two replacers each of which replaces a match with a three character string).

  • ['baz'] would match, or replace a match with, the single (sub-)string 'baz'.

  • The string range 'aa'..'bb' turns into ('aa','ab','ba','bb'), i.e. four two-character string matchers/replacers.

A matchers/replacers expression list/array can include any mix of elements, each of which is a:

  • string (which then always means itself as a unit, not its constituent characters);

  • string range;

  • nested list/array (which flattens, recursively, into the outer list/array);

  • regex (in a matchers expression) or closure (in a replacers expression).

If a list is passed as a matchers or replacers expression, ranges are expanded (and the results coerced to strings if they're numeric), regexes (on LHS of a pair) and closures (on RHS of a pair) are left as is, and if there's anything else it is either coerced into a string or an error is generated.
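A minimal sketch of list elements acting as whole units (the strings are made up for illustration):

```raku
# Two matchers, 'foo' and 'bar', each matching a three-character sub-string.
# Two replacers, 'X' and 'YY', each replacing a whole match.
say 'foo of bar'.trans(<foo bar> => <X YY>);   # X of YY
```

Note that the lone 'o' and 'f' in the middle are untouched: the matchers are the strings 'foo' and 'bar', not their individual characters.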

<foo> and ('foo') are single strings, not lists

In Perl 6 both <foo> and ('foo') are the single string 'foo', not a one element list containing a string. So if it's the only element you pass as a matchers or replacers expression it gets turned into individual characters rather than meaning itself as a string.

The idiomatic way to specify a single string as an indivisible unit is ['foo']. (This is an Array value. Arrays are a type of List. Array literals ([...] where a value is expected) are always arrays unlike (...) which is only a list if it contains one or more commas or semi-colons just as <...> is only a list if it contains multiple "words" separated by one or more spaces.)
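The difference can be seen side by side in this short sketch (the input string is made up for illustration):

```raku
# A bare string: three single-character matchers f, o, o.
say 'foofoo'.trans( 'foo'  => 'X');   # XXXXXX

# <foo> is just the string 'foo', so it behaves the same way.
say 'foofoo'.trans(<foo>   => 'X');   # XXXXXX

# A one-element array: a single matcher for the sub-string 'foo'.
say 'foofoo'.trans(['foo'] => 'X');   # XX
```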

Mixing string and array/list specifiers

You can use a single string on the LHS of a transliteration pair to specify a list of individual character matchers to be replaced by their corresponding entry in a list of strings/closures on the RHS.

Or you can use a list to specify a list of sub-string matches/regexes on the LHS to be replaced by the corresponding character according to a single string replacer specification on the RHS.

Hmm If one of the matchers on the left hand side is a null string or regex, and no other matchers match at a given position in the input string, then .trans goes into an infinite loop.

Replacers extension

If the list of matchers resulting from the matcher specification of a transliteration pair passed to .trans is longer than the list of replacers initially generated by the replacer specification of the pair, then the list of replacers is extended to make up the difference.

If :delete has been specified then a null string is repeated to make the two lists the same length.

Otherwise, if a pair consists of just a single string specifier on both the left and right, then the list of replacers (each one an individual character) is extended by repeating the initial list.

Otherwise, the list is extended by repeating the last replacer in the list.
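The three extension rules can be tried out with a sketch like the following. The inputs are made up for illustration. Going by the rules as described in this section, the calls would be expected to print ded, dee, and de respectively; treat those as this gist's claims to check against your Rakudo rather than as guaranteed output.

```raku
# Single strings on both sides: the replacer list d, e is repeated, giving d, e, d.
say 'abc'.trans('abc' => 'de');

# Lists on both sides: the last replacer is repeated, giving d, e, e.
say 'abc'.trans(<a b c> => <d e>);

# With :delete, null strings pad the replacer list, giving d, e, ''.
say 'abc'.trans('abc' => 'de', :delete);
```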

The transliteration process

.trans starts at the start of the input string. It attempts to match at that position. If multiple matchers match it picks a single winner. Depending on what matches and the setting of the :complement, :delete, and :squash adverbs it either keeps, replaces, or deletes the matched character or sub-string or, if nothing matched, then keeps/replaces/deletes the character in the input string at the current matching position. Then it moves the matching position forward to skip over the kept/replaced/deleted character(s).

At each iteration of matching, there are various decisions about what to do including deciding which matcher/replacer pair wins, if any.

If no matchers match at a position, action depends on use (or not) of :complement and :delete

If no matchers match at a character position, then the character at the current matching position is kept if both :complement and :delete are False.

If :complement is True then, provided that there is at least one match somewhere in the string during the overall transliteration processing of the entire string being transformed, the character is replaced by the first replacer in the first pair passed to .trans.

If :complement is False but :delete is True then the character is deleted. (Note that specifying :complement renders :delete irrelevant.)
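A minimal sketch of the :complement case (the input string is made up for illustration). Going by the rule above, every character that is not matched by 'a..z' is replaced by the first replacer, '-':

```raku
# Lowercase letters are matched and therefore kept;
# the comma, space, and exclamation mark are unmatched and become '-'.
say 'hello, world!'.trans('a..z' => '-', :complement);   # hello--world-
```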

If several matchers tie, choosing which wins depends on use (or not) of regexes/closures

If several matchers tie for equal longest match, then one of them is chosen:

  • If any matcher is a regex or any replacer is a closure (regardless of whether it matches/replaces any of the input string in this or any other iteration), then the leftmost matcher wins.

  • Otherwise, the rightmost matcher wins.

Hmm This section just isn't right, or isn't the whole story. Investigation continues.

If some matcher wins, action depends on use (or not) of :squash

If one matcher matches more of the input string than any other, then it "wins" outright.

If some matcher wins then its corresponding replacer is used to replace the matching character or sub-string except that if :squash has been specified, and the winning matcher won the previous iteration too, then the matched sub-string is removed and the corresponding replacer is ignored (except that if it's a closure the closure is still called even though its result is ignored).

Positional pairs vs named arguments

The following is intended to remind me and readers of the distinction between these two forms of pairs:

.say for

# Some idiomatic ways to positionally pass pairs with a *single value* for the LHS of a pair:

  # with string 'foo' for key:
  'foo'          => 'baz',
  <foo>          => 'baz',

  # with a regex for key:
  /foo/          => 'baz',

# Some idiomatic ways to positionally pass pairs with a *value list* for the LHS of a pair:
 
  # Array with one element, the string 'foo':
  ['foo']        => 'baz',

  # List with multiple elements, the strings 'foo' and 'bar':
  <foo bar>      => 'baz',

  # List with multiple elements, the string 'foo' and a regex:
  ('foo', /bar/) => 'baz',

  "\n",     

# Some erroneous vs correct ways to pass pairs as named arguments (adverbs):

  # Adverb `foo` is not known to `.trans`. This would get silently ignored.
  foo            => 'baz',
 
  # `.trans` supports these adverbs:
  delete         => True,
  :complement
  :squash

displays:

foo => baz
foo => baz
/foo/ => baz
[foo] => baz
(foo bar) => baz
(foo /bar/) => baz

foo => baz 
delete => True
complement => True

Note that the two initial foo => baz arguments work the same way in the .say for ... context as the one near the end because for is a keyword and it treats all pairs the same way. Only routines make the positional/named argument distinction that pairs have to straddle.

My early .trans experiment

my $a = 'a b c d e f g h i j k l m n o p q r s t u v w x y z';
my $b = 'abcdefghijklmnopqrstuvwxyz';

say .trans: abc         => '1Aa',
            <de>        => '2Dd',
            <fg h>      => '3Ff',
            <ij k>      => <4Ii>,
            <l n>       => <5Ll 6Nn>,
            /o/         => '7Oo',
            /p/         => <8Pp 9Qq>,
            (/r/, /s/)  => <0Rr 1Ss>,
            'tuv'       => '2Tt',
            'wxy'       => <3Ww 4Xx>

for $a, $b;

# a b c 2 D f g F i j I 5Ll m 6Nn 7Oo 8Pp 9Qq q 0Rr 1Ss 2 T t 3Ww 4Xx 4Xx z
# abc2D3F4I5Llm6Nn7Oo8Pp 9Qqq0Rr1Ss2Tt3Ww4Xx4Xxz

Compiler source code / tests

On my journey I checked out the method trans implementation in the relevant Rakudo source code. It was too complex for me to figure out what was intended.

(As a result of this SO I also checked out the tr "nibbler" in its relevant Rakudo source code. It was also too complex for me to figure out what was intended.)

I also checked the test files in the roast directory for transliteration. (This showed me how much appeared to technically be part of 6.c but undocumented, especially the powers and quirks when specifying a single string as a matchers or replacers argument.)

@jubilatious1

jubilatious1 commented Sep 9, 2020

Hi @raiph, I'm not seeing proper return values when I reduce the alphabet to 'a..z' (the subject of NB#1, in general) :

~$ raku #enter the REPL
To exit type 'exit' or '^D'
> 
> my @a = "wall".comb;
[w a l l]
> @a.trans('abcdefghijklmnopqrstuvwxyz' => ords('abcdefghijklmnopqrstuvwxyz') ).put;
119 97 108 108
> 
> my @a = "wall".comb;
[w a l l]
> @a.trans('a..z' => ords('a..z') ).put;
122 97 122 122
>
> $*VM
moar (2020.06)
>

So the first "wall" example returns 119, 97, 108, 108 but the second "wall" example returns 122, 97, 122, 122. So I'm unable to simplify the concern noted in NB#1 for the moment.

https://stackoverflow.com/a/63803344/7270649

@jubilatious1

jubilatious1 commented Sep 9, 2020

Hi @raiph, this is in regards to NB#2, not being able to delete unmatched letters (i.e. anything out of the alphabetic ascii range). I can take @b2gills solution and adapt it to use trans, but for now I've given up on trying to get :delete to work:

> my @a = "wall".comb;
[w a l l]
> @a.grep('a'..'z'.comb.any).trans('abcdefghijklmnopqrstuvwxyz' => ords('abcdefghijklmnopqrstuvwxyz') ).put
119 97 108 108
>
> my @a = "wallé".comb;
[w a l l é]
> @a.grep('a'..'z'.comb.any).trans('abcdefghijklmnopqrstuvwxyz' => ords('abcdefghijklmnopqrstuvwxyz') ).put
119 97 108 108
>
> $*VM
moar (2020.06)
>

https://stackoverflow.com/a/63803344/7270649

@raiph
Author

raiph commented Nov 28, 2020

Hi @jubilatious1,

Only just spotted your comments.

The 'a..z' feature that I discussed is only for the LHS or RHS of a pair being passed to trans, not an argument to ords.

So it works fine for the LHS. And it would work for the RHS too, but not as an argument passed to some other routine as you're doing with ords. (What ords('a..z') does is return the ordinal values for the literal four characters a, ., ., z!)

That all said, given that you can't usefully pass 'a..z' to ords, and thus must instead use 'a'..'z', I think it's best to use 'a'..'z' on both sides. And you can't just write ords('a'..'z') either, because ords will coerce the list to a string, which will result in 'a b c d ... x y z', i.e. with spaces injected.

Putting all the foregoing together, this works:

put "wallé" .comb .trans: :delete, ('a'..'z', 'é') .flat => ('a'..'z') .join .ords; # 119 97 108 108

@jubilatious1

Thank you @raiph, but I don't see the point of the final code. If you have to write out ('a'..'z', 'é') .flat in order to identify the "é" character that needs to be "delete-d", doesn't that make the code non-generalizable?

Are you admitting in your code that when using the :delete adverb basically you have to know ALL non-alphabetic characters in order to just retain the (ascii) alphabetic ones?

I've adapted some of your code (below). I'm filtering using grep() as previously presented and frankly I don't see any way to avoid this step. Please write back if you come up with anything simpler than the code below. Thanks!

> put "wallé".comb.grep('a'..'z'.comb.any).trans: ('a'..'z') => ('a'..'z').join.ords;
119 97 108 108

@raiph
Author

raiph commented Dec 2, 2020

> Thank you @raiph, but I don't see the point of the final code.

I think I lost touch with your evolving explanation of what you were after.

:delete lets you explicitly specify on the LHS of a pair which characters/strings you wish to translate or delete. You had mentioned :delete, so I had presumed you wanted to use it. And you provided an example in which you explicitly listed a character. So I thought it would help to show how to use it with that example.

> Are you admitting in your code that when using the :delete adverb basically you have to know ALL non-alphabetic characters in order to just retain the (ascii) alphabetic ones?

I wouldn't put it that way. I'd say the :delete adverb follows the normal trans semantics in which you explicitly list which characters/strings you wish to affect.

> If you have to ... identify the "é" character that needs to be "delete-d", doesn't that make the code non-generalizable?

I presume by "generalizable" you mean without specifying which characters/strings you wish to delete.

To delete all but the characters/strings you specify, you can't use :delete.

One option is to instead use :complement. This is the adverb specifically designed to do the complement of the usual trans semantics, so that you specify which characters/strings you do not want to affect. The translation effect is then applied to all characters/strings that are not specified. To delete them, you specify a null string as the result:

put "wallé" .comb .trans(:complement, 'a..z' => ''); # wall

(Note that you can't simultaneously :complement and :delete. The doc pretends you sort of can, but my gist is righter than the doc. The simple truth is that if you specify :complement then the :delete adverb is ignored.)

> Please write back if you come up with anything simpler than the code below. Thanks!

I think I'd probably go with a variant of one of Brad's solutions in the SO.

eg put "wallé".comb(/<[a..z]>/)».ord

@jubilatious1

Thank you @raiph, for your kind reply. I never would have guessed that the :delete adverb was used that way.

Actually, let me re-phrase that: I never would have guessed the proper syntax to use the :delete adverb. I tried :delete() and :delete => 'é' but couldn't get the syntax right.

Is there some guide to adverb usage? I did the typical .trans() call but I keep getting erroneous answers (below):

> put "wallé" .comb .trans: :delete, ('a'..'z', 'é') .flat => ('a'..'z') .join .ords; # 119 97 108 108
119 97 108 108
> put "wallé" .comb .trans(:delete, ('a'..'z', 'é') .flat => ('a'..'z') .join).ords
119 32 97 32 108 32 108 32
> put "wallé" .comb .trans(:delete, ('a'..'z', 'é') .flat => ('a'..'z')).join .ords;
119 32 97 32 108 32 108 32
>

Meanwhile, I'll look at your usage of the :complement adverb. Thank you!

@raiph
Author

raiph commented Dec 2, 2020

put "wallé" .comb .trans(:delete, ('a'..'z', 'é') .flat => ('a'..'z') .join) .ords
  • .comb always produces a list. In this case w, a, l, l, é.

  • .trans always coerces its input to a string. The default stringification of a list is its elements separated by a space. So it coerces w, a, l, l, é to w a l l é. The sole pair argument in the trans call has no matcher for a space, so the four spaces remain in the result, which is w a l l followed by a trailing space.

  • .ords always produces a Seq (list). In this case, a list of 8 ordinals.

  • put coerces its list of arguments to strings. Again, the default stringification of a list is its elements separated by a space. So you get the result you see.

Dropping the .comb fixes all of that:

put "wallé" .trans(:delete, ('a'..'z', 'é') .flat => ('a'..'z') .join) .ords

put "wallé" .comb .trans(:delete, ('a'..'z', 'é') .flat => ('a'..'z')).join .ords;
  • The .join in ('a'..'z').join on the RHS of a trans pair argument makes no difference if it isn't immediately followed by some other transformation acting on its result. That's because trans treats lists of single characters exactly the same as a string of those characters. So dropping it has no effect.

  • Putting .join outside the trans also has no effect, for a different reason. The result of .trans is a single string. .join is for joining a list of values; it has no effect if its invocant is a single value.
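The stringification behaviour behind both explanations can be seen in isolation with a short sketch (the variable name @chars is just for illustration):

```raku
my @chars = 'wall'.comb;               # four one-character strings

# A list stringifies with a space between its elements.
say @chars.Str;                        # w a l l

# .trans on a list first coerces it to that spaced-out string.
say @chars.trans('a..z' => 'A..Z');    # W A L L

# Called directly on the string, there are no spaces to carry through.
say 'wall'.trans('a..z' => 'A..Z');    # WALL

# The injected spaces show up as ordinal 32.
say @chars.Str.ords;                   # (119 32 97 32 108 32 108)
```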
