chrispsn/we-need-to-talk-about-group.md

## we-need-to-talk-about-group.md

      
    Raw
  

              we-need-to-talk-about-group.md
            
          
    We need to talk about group.

I'm sorry to tell you, but group is gone.
If you check out the latest k.d on shakti.com as at 28 March 2023, you'll see that 'unary' = is now 'freq' (frequency).
Group had a long life: it's been around since at least k2, or 1996.
So why did group go? And what should we use instead?
Code examples will be in ngn/k unless otherwise stated.
What does group do?

Group tells you where each item occurs in a list. It generates lists of indices. Here are some examples in the ngn/k web editor.
Here, we group numbers by whether they are even or odd:
 x:1 4 2 8 5 7
 =2!x
1 0!(0 4 5;1 2 3)  / indices of elements in input list
 x@=2!x
1 0!(1 5 7;4 2 8)  / index back into original list to group the elements
 +/'x@=2!x         / do something with them, eg...
13 14              / ... subtotals of odd and even elements

You could also take that dict of indices resulting from group and use it to index into another data structure. For example:
 player:`alice`bob`carol`alice`bob`carol
 score:1 4 2 8 5 7
 score@=player
`alice`bob`carol!(1 8;4 5;2 7)  / scores of each player
 |/'score@=player               
`alice`bob`carol!8 5 7          / max score of each player

There are differences in the way different k dialects implement group (list vs dict; entry order):

k2 returns indices lists corresponding to the unique values in the input list (in the order in which they appeared)
k4/q to k6 (including ngn/k) returns the same, but as a dict keyed by those unique values
k7 and k9 2021 does the same, but sorted by the keys. In k9 2021 you can get it back to key occurrence order with <.

 / k2
 = 2 1 2 2 1 1
(0 2 3   / indices of 2
 1 4 5)  / indices of 1

 / k4/q
 group "mississippi"
m| ,0
i| 1 4 7 10
s| 2 3 5 6
p| 8 9

 / k7
2019-09-25 10:44:44 α 4core 512mb wsm32 (c) shakti
 ="mississippi"
i|1 4 7 10
m|,0
p|8 9
s|2 3 5 6

 / k9 2021
w2021.03.27 1gb chrome (c)shakti 2.0
 ="mississippi"
i|1 4 7 10
m|,0      
p|8 9     
s|2 3 5 6

 <="mississippi"  / sorts dict by values
m|,0      
i|1 4 7 10
s|2 3 5 6 
p|8 9 

Why change = from group to freq?

Only Arthur knows for sure, but here are some guesses:
1) Group is too slow

We know one of Arthur's k design principles is that "the slowest primitive to the fastest primitive have to live within one order of magnitude in speed... you can't have operations that are powerful but impractical from a CPU point of view". [1]
So you'd expect the slowest operation to take at most 10x the fastest operation.
On this test, group is on rocky ground. I won't use performance numbers from Arthur's k implementations, but data from other array langs suggest the performance difference is closer to two orders of magnitude, ie 100x:

In ngn/k's web REPL, with d:10000?100, I was seeing =d take ~100x more than +/d.
We can also look to Marshall Lochbaum's BQN bencharray. Depending on the data type, BQN's group takes (roughly!) around 10ns per element, which is ~20-200x more than the fastest operation family (arithmetic without overflow). [2]

Outside of searching and sorting, not many primitives hit the upper bound of that 20-200x speed difference. For example, again according to bencharray, frequency (replicate-undo or /⁼ in BQN) hits up to 2ns per element for large inputs, so 1/5 of of the time to do a group.
(Keep in mind BQN's performance won't always be comparable to k's since BQN's group and replicate-inverse take list indices as input, while k's group and freq (a) can take more general input, and (b) output in a different format.)
It makes sense that group is slow: it does a lot of work!
Implementing group in k

A simple model of group is:
{&'k!(k:?x)=\:x}

We can group faster by cutting up a grade, potentially with some extra work depending on whether we want the dict entries (a) in order of first occurrence, or (b) sorted by the keys.
The trick is finding where to cut. We could sort the input and see where the elements change:
{g:(&~x~':x i)_i:<x  / grade; cut; where; match-each-prior
 x[*'g]!g@:<g}       / grade again (skip this step if you want the result sorted by key); index

(The above is ngn/k's k-string implementation of group.)
Or we could find where to cut by sorting a list of bools marking the first occurrence of each element:
/ k9 2021 syntax
{u:a=!#a:x?x         / self-find to mark first occurrences (boolean list)
 x[&u]!(u@)^<a}      / grade; cut; index; where; dict. (no empty case handling)

Compare this to a frequency implementation: data stays nice and flat, with no sorts or cuts required. Just create a histogram, filter it to exclude empty bins, and slap some labels on top.
{a:@[&#x;x?x;+;1]    / no sort or cut;
 !/(x;a)[;&0<a]}     / just self-find, 0<, where, indexing.

2) Group's generality is wasted

Notwithstanding the k design principle violation, maybe group is so useful that it's worth a slot.
Thankfully we have data to test this: ngn/k's test corpus! I classified 400 lines of k code containing the = character [3]:
Category  Description    Freq #'=  Group values x@=f x  Other/dunno  # group uses  # tests  % that use group
========  ============   ========  ===================  ===========  ============  =======  ================
aoc/15    AdventOfCode          2                    0            2             0       21               10%
aoc/16    AdventOfCode          2                    0            0             0       15               13%
aoc/17    AdventOfCode          0                    0            0             0       14                0%
aoc/18    AdventOfCode          0                    0            0             0       11                0%
aoc/19    AdventOfCode          0                    0            2             2       22                9% 
aoc/20    AdventOfCode          2                    0            1             3       25               12%
aoc/21    AdventOfCode          3                    0            0             0        3               13%
aoc/22    AdventOfCode          0                    0            0             0       24                0%
dy        TryAPL                5                    1            5            11      250                4%
e         ProjectEuler          1                    4            0             5      100                5%
g         CodeGolf             10                    3            4            17      263                6%
l         ngn/k libs            0                    0            1             1        4               25%
          TOTAL                23                    8           14            46      724                6%

So:

roughly 5-10% of problems involve group
of those uses, half are freq (#'=)
the rest are a mix of uses of indexing back into the original list (then doing more work, eg +/')
at least one of the 'other' uses is a #= which is probably better done as #? - suggestive...

Does having the user do freq as #'= impact speed? Maybe!

While #'= could be recognised as an idiom and be computed by fast code that does less than a 'full' group, the need for idiom recognition is running thin:

*| is better as :/ since the latter is a verb that doesn't ever require brackets, and (especially in Arthur's latest ks) there's little to no need for a space before the colon
*< and *> (min- and max-index) might qualify, as could #? (unique-count) and *'= (first-of-each-group)
but maybe the interpreter is simpler if those cases are just keywords and idiom recognition is left out.


Having freq as a primitive encourages its use, so 'normal' user code will tend to use that function more.
Shorter code is easier to simplify and optimise. Every character counts.

'Undo' to the rescue?

Let's say you're an implementer and you want a fast way to freq, without losing = or relying on idiom recognition of #'=.
k uses noun? as function inverse. So `j? decodes a JSON string into k data.
Given where (&:) is run-length-decode, we can get something like freq with (&:)?. Now, there could be duplicate keys...
 (&:)?"abbaaacccc"
"abac"!1 2 3 4 

...so we could either:

sort the input in advance
exchange the domain and range of the dict, group that, then +/'.

It's similar to how BQN implements freq /⁼ as the inverse (⁼) of replicate (/) (docs), but it's even better because k has dicts, so it can validly express cases where elements appear out of sorted order.
And run-length-encode could be handy in other situations, such as for simple data compression.
Having said all that, from some brief testing in ngn/k, the k-string of freq above is way faster than one based on run-length-encoding. So this path might be fruitless.
3) Group sucks, actually

Sometimes group is just not the right tool for the job.

You might accidentally assume the group's entries are in a particular order.
You can't group dict entries.
The result of outdexing a group may not be what you expected.
Group inherently nests data - flatter solutions using other primitives may do the job faster and with less complication.

Let's go through examples:
Group entry order

kcc - Kelas's k tutorial - has a nice quicksort implementation (translated to ngn/k below; try it here):
 f:{$[2>#?x;x;,/o'x@&'~:\x<*1?x]}
 f "quicksort"
"cikoqrstu"

A naive person (me) might say, "that not-converge ~:\ looks wasteful - why can't we just group the bools that result from x<1?x? In other words, replace x@&'~:\ with =x!x?":
 f:{$[2>#?x;x;,/o'=x!x<*1?x]}
 f "quicksort"
"qrsutoick"                     / uh oh

What happened? If we put a print \ before the recursive call o', we see that group's keys are not always sorted:
 f:{$[2>#?x;x;,/o' \ =x!x<*1?x]}
 f "quicksort"

0 1!("qusort";"ick")
0 1!("qusrt";,"o")
1 0!("qsr";"ut")
1 0!("qr";,"s")
1 0!(,"q";,"r")
0 1!(,"u";,"t")
1 0!("ic";,"k")
(,0)!,"ic"
0 1!(,"i";,"c")

Note that implementations that sort group keys, like k9 2021, can avoid this issue:
 w2021.03.27 1gb firefox (c)shakti 2.0
 f:{$[2>#?x;x;,/f'|=x!x<*1?x]}
 f"quicksort"
"cikoqrstu"

Grouping dict entries

We can certainly group a dict:
 =`a`b`c`a`b`c!1 4 2 8 5 7
1 4 2 8 5 7!(,`a;,`b;,`c;,`a;,`b;,`c)

But what if we want to group dict entries by group of something else? dict@=groups won't work, because that does indexing.
While k doesn't let you pick arbitrary dict entries based on order, k9 2021's other bread-and-butter ops for tables (sorts and filters) can process dict entries:
w2021.03.27 1gb chrome (c)shakti 2.0
 d:`a`b`c`a`b`c!1 4 2 8 5 7
 ^d  / sort by key
a|1
a|8
b|4
b|5
c|2
c|7
 <d  / sort by value
a|1
c|2
b|4
b|5
c|7
a|8
 `a`b#d  / it doesn't keep all matching entries, but at least it doesn't just give values
a|1
b|4

Group is therefore unusual because it can't group a dict's entries, only its values. (This comes up later.)
Missing entries and prototype issues

Group returns a dict of key lists. If I index into that dict using a key that isn't in its domain, I'd expect to receive an empty list.
But as at 28 March 2023, that is not what happens!
/ k7
2019-09-25 10:44:44 α 4core 512mb wsm32 (c) shakti
 (="ab")"c"
,Ø

/ k9 2021
w2021.03.27 1gb chrome (c)shakti 2.0
 (="ab")"c"
,0

/ oK
Welcome to oK v0.1
(inspired by K5; \h for help)
  (="ab")"c"
0N

/ ngn/k (28 Mar 2023)
ngn/k, (c) 2019-2023 ngn, GNU AGPLv3. type \ for more info
 (="ab")"c"
,0N

In all of these cases, we receive a list of one null (or a null atom in oK). This can be explained as the result of outdexing on a list: such cases return the prototype of the values list, which is a value with the same structure as the first element of the values list, but with nulls replacing any atoms.
Why do prototypes matter? Well, remember Wordle? Let's write a Wordle guess scorer. The problem is stated here.
We could:

find letters that match in value and position (greens)
of the remaining letters, take as many from the guess as there are in the answer (yellows)
assign code 1 for greens, 2 for yellows, 0 for greys.

A first cut:
f:{[answer;guess]
   green:answer=guess    / green if value and position match
   d:='(answer;guess)    / indices lists for each letter in each word
   d:d^\:\:&green        / strip green letters from index lists
   yi:#'/(#:';::)@'d     / yellow if letter in different position
   @[green;yi;:;2]}      / green 1, yellow 2, grey 0

If you run this, you'll find a problem at this line:
yi:#'/(#:';::)@'d

It's not truncating the indices lists in the guess dict as expected.
The problem is the assumption about how dict #' dict (take-each) behaves:

I had thought that if letter key were in the guess (RHS) but not in the answer (LHS), it would take zero of that letter's indices from those remaining in the grouped answer dict.
But if a key is missing from a dict, indexing in returns a value with the structure of the first item in the dict's values list, but filled with nulls.
In this case, the first item in the dict's values list is a number atom (since we ran #' over it to get the counts), so the value used by the take-each is 0N. And in ngn/k, 0N# returns the whole list. [4]

(In this case k9 2021 wouldn't have this problem since null is 0 rather than 0N.)
Arguably, using dict #' dict is the wrong move here because we don't get an opportunity to fill the nulls in the left hand side with zero (0^) before doing the take.
We hit a similar problem if a letter was in the answer (LHS), but not in the guess (RHS). Take-each will return a list of nulls equal to the number of remaining times that letter appears in the answer. (This is more of an issue with using take-each: unlike drop-each, if you overtake from a list, it will expand the list by repeating the chars (or nulls if the list was empty).)
Unlike the case where a letter was in the guess but not in the answer, we can solve this problem by prepending !/,'(" ";!0) to the dict. That way, empty list will get used as the prototype fill for missing letters.
We can avoid these outdexing issues by only keeping letters common to the answer and guess (once removing the greens):
f:{[answer;guess]
   green:answer=guess    / green if value and position match
   d:='(answer;guess)    / indices lists for each letter in each word
   d:d^\:\:&green        / strip green letter indices
   d:(~#:')_'d           / strip letters with no indices left   <- new step
   d:(^/^\!'d)#/:d       / keep only entries of shared letters  <- new step
   yi:#'/(#:';::)@'d     / indices of yellow letters in guess
   @[green;yi;:;2]}      / green 1, yellow 2, grey 0

Try this revised solution online here.
It was a lot of work to prepare these dicts for the take-each. It suggests that groups and dict dyad' dict ops are more useful when you know that all the keys are present in both dicts.
Other tools are more appropriate

How else could we score a Wordle guess? If we look at the other array language solutions:

ovs's ngn/k solution does use group, but does a simpler @' instead of the #'. It uses find to effectively ignore nulls that result from outdexing.
Bubbler's ngn/k solution eschews groups completely. It instead deletes (list_idx) letters if common to both words, and keeps a record of the word prior to each deletion decision: if the word changed, the letter's yellow, else grey.
Both solutions reorder the words to put greens at the front so that any 'excess' values resulting from deletion or indexing are regarded as yellow. They also sort words based on category, rather than grouping (prioritising greens). In particular, Bubbler processes under sort (permute list, transform list, return list to original order) and the data is kept flatter.
Both solutions are also doing a progressive search, where a successfully found element in the target is disqualified from being a match for the next searched-for element. BQN has built-in progressive find, and ovs's BQN solution uses it, doing a "find under permute" (which funnily enough uses the equivalent of sorted-keys group). Here's a rough k translation.

Here's another problem, this time from the APL Farm chatroom: Miguel Raz's 'pigdog' challenge.

Given a string comprising ASCII words separated by 1+ whitespace characters, output the Pigdog Latin version of the string. Preserve whitespace and case. For each word:

if it starts with a vowel, it becomes 'pig'+word (eg igloo => pigigloo)
else, it becomes word+'dog' (eg cup => cupdog)


A non-array-style solution could be to split the words, classify each one based on whether they start with a consonant, vowel or whitespace; then raze them back together:
{x:(&~=':^x)_x
 c:+/(2*~:;^:)@\:" aeiou"?_*'x           / 0=vowel 1=consonant 2=wspace
 ,/(("pig",;{x,"dog"};::)c)@'x}

But it might be more array-style to group the words by class and transform each group in one hit (at least, as much as we can when working with strings); then put the words back in the order they started, and raze.
This time we know the footguns! We'll override the groups' prototypes so that outdexing generates an empty list, and we'll force key order and presence so that everything lines up for the final amend...
{x:(&~=':^x)_x
 x@:i:=+/(2*~:;^:)@\:" aeiou"?_*'x      / 0=vowel 1=consonant 2=wspace
 (x;i):((,3)!/:,'(();!0)),'(x;i)        / group prototype override
 x:(0 1 2!("pig",/:;{x,"dog"}';::))@'x  / force order, entry presence in x
 ,/@[,/x;0 1 2 3#i;:;x]}                / force order, entry presence in i

...and we will get absolutely smoked by Marshall's version [5] that shoves a bunch of "pig"s and "dog"s onto the end of the input and permutes it:
{d:~1=':s:^x             / s‿e←1⊸»⊸(>⋈<)' '=𝕩
 c:^"aeiou"?_x@&d&~s     / c←¬(s/𝕩)∊"aeiou"
 x,:,/$`pig`dog c        / ins←⥊c⊏["pig","dog"]
 x@<(+\d),{3}#c+2*!#c}   / ((+`s+e)∾3/c+2×↕≠c) ⍋⊸⊏ 𝕩∾ins

Marshall's is shorter and about 6x faster. In fact, even the 'listy' non-array-style version is faster than our group version.
So just because a bunch of elements should be processed in the same way, doesn't mean it always makes sense to group them and process them together.
None of this is intended to advocate for a particular behaviour of group (well, maybe prototypes need another look). It's just to suggest that details matter. Group may be hard for an inexperienced user to use correctly, and sometimes, another route may be more direct and lead to more robust and potentially faster code.
When is group the right tool?

In pigdog, we didn't have to consider the context of each word or which other words fell into the same group.
But sometimes we need to process items in a group in totality rather than separately (such as a sum-over). Maybe the order of the items in each group matters too (such as a sum-scan).
And maybe those cases come up frequently for Shakti's clients - such as time series where the current state depends on the order, values and categories of data in a stream.
To illustrate this, we naturally turn to traded volumes for each ticker Brainfuck.
Problem: Implement a simplified Brainfuck interpreter.


Op
Meaning


>
Increment the data pointer (to point to the next cell to the right).


<
Decrement the data pointer (to point to the next cell to the left).


+
Increment (increase by one) the byte at the data pointer.


-
Decrement (decrease by one) the byte at the data pointer.


.
Output the byte at the data pointer.


Marshall Lochbaum's BQN code:
BF ← {
  S←-˝=⌜⟜𝕩
  ('.'=𝕩)/+`¨⌾((⊐+`S"><")⊸⊔)S"+-"
}

The BQN code does a "sum-scan incs/decs under 'group by instruction pointer'", so the sum values get put back in their original positions automatically.

We can't quite do that in k by giving a group result to an amend, because (at least in ngn/k) amend executes elements sequentially (rather than as a group) so that it can handle repeated indices (simple example).
We also can't group a dict's entries, to keep info about the original keys/positions of the values.
But we can add the order info back in by re-keying (i!' in the code below). We can then merge the groups and index in.

My tweak of Marshall's k and BQN code, try it here:
BF:{s:-/x=/:
    i:=+\s"><"     / at which steps is the data pointer index at each address?
    e:i!'s["+-"]i  / these are the incs/decs while pointer was at this addr; 
                   / group value lists are each keyed by the step of exec
    e:+\'e         / running value for each pointer
    e:{x,y}/e      / byte value at active pointer at each step of exec
    e@&x="."}      / print value at active pointer where instruction is dot

BF ">-->+.-<..>-"  / 1 -2 -2

At the second-last step above, we have a dict of byte values indexed by the locations in the original list (thanks to i!'). This lets us get back the original order by just indexing in the last step. But another approach (similar to the Wordle golfs) is to skip the i!' step, and just raze the dict then sort by the grade of the original grouped indices: ({x,y}/e)@<,/i. Try here.
All of these approaches are a bit annoying, and they won't help if group is going away.
But there is a solution. q's update ... by can do sum-scan-under-group, as Coltim shows (mild edits below):
{select from (update sums op by sums move from t:([]x
                                                  op:(-/)x=/:"+-"
                                                  move:(-/)x=/:"><")) where x="."}

It looks like KSQL will stick around, so even if group-as-primitive goes, we'll still have a way to execute Brainfuck.
Summary

Freq has replaced group. Should we be happy?
Look: I think group is hard to use for more than simple things like freq or self-group (x@=x). It might be more useful if prototypes worked differently, we could address dict entries, and amend did operations 'under' nesting (like update ... by).
If we had run-length-encode (&:)? it might provide the speed benefits of a dedicated freq op without losing group.
In group's absence, we might see more:

inline DIY implementations of group that preserve only the properties the user needs
inline progressive ops that keep data flat, like occurrence count and progressive searches (position or existence)
hand-crafted operations for special cases, such as Coltim's flat partitioned max-scan.

To sum up: if Arthur's objective is speed and simplicity, I can understand why he'd remove group from his latest k.

[Arthur] worked on a generalization of the Axis operator. Which is a big deal in APL. And he figured out how you can define like 20 different primitives ... in terms of the axis operator, right? So, you think that he'd be invested in whatever. He'll drop an idea if it doesn't work. I mean, it's interesting how he's not attached, necessarily, to any particular thing, because later on, you know, when he went and he is doing K, he dropped the idea altogether.

— Joel Kaplan, Ep 27 The Array Cast
And who knows: maybe Arthur will find another place for group-y ops in native k.

If you liked this, follow @kcode on Bluesky.
[1] Joel Kaplan, interviewed on The Array Cast. I'm pretty sure there is a direct quote from Arthur to the same effect on the internet somewhere - please comment if you find it!
[2] As at 28 March 2023.
[3] Fun fact: leaving aside the obvious 'equals', there was also a surprising number of identity matrices (=n).
[4] In ngn/k, 0N# takes all values by design so that we can use {(x?y)#x} to return the whole word if the letter is missing.
[5] Based on this generic BQN string amendment function. Translated to k.

Appendix: how each ngn/k test case uses group

freq
----
aoc/16/01.k
aoc/16/04.k
aoc/20/20.k
aoc/20/24.k
aoc/21/08.k
aoc/21/09.k
aoc/21/21.k
dy/a.k:p146
dy/a.k:p163
dy/a.k:p165
dy/a.k:p183
dy/a.k:p193
e/39.k
g/append-and-erase.k
g/bedevil-your-hangman-opponent.k
g/code-the-huffman.k
g/marquee-sign-letters.k
g/naive-compression.k
g/quadratic-residues-are-so-much-fun.k
g/rotation-invariant-fingerprinting.k
g/score-a-game-of-boggle.k
g/square-chunk-my-matrix.k
g/sort-the-distinct-elements-of-a-list-in-descending-order-by-frequency.k

group-index
-----------
dy/a.k:p216
e/49.k
e/60.k
e/62.k
e/98.k
g/shortest-program-to-split-a-string-at-non-digits-without-regexps.k

group-sums
----------
g/gravitational-force-between-numbers.k
g/most-contributing-rows.k

other / dunno  (i got lazy)
-------------
aoc/15/05.k
aoc/15/06.k
aoc/19/10.k
aoc/19/20.k
aoc/20/11.k
dy/a.k:p18X
dy/a.k:/skipping p135, as =: is a primitive k function
dy/a.k:p153
dy/a.k:p168
dy/a.k:p217  / unique-count (weird as why not #?)
g/classify-a-region-by-its-slope.k
g/golf-a-parentheses-matching-algorithm.k
g/implement-swap-encoding.k
g/split-a-list-at-the-second-occurrence-of-the-first-element.k
l/fmt.k:m
Op	Meaning
`>`	Increment the data pointer (to point to the next cell to the right).
`<`	Decrement the data pointer (to point to the next cell to the left).
`+`	Increment (increase by one) the byte at the data pointer.
`-`	Decrement (decrease by one) the byte at the data pointer.
`.`	Output the byte at the data pointer.