joewiz/group-by.xq

## group-by.xq
xquery version "3.1";

(:

## How variables in XQuery FLWOR expressions change when using the `group by` clause

Sometimes, when working with a `group by` clause, an XQuery FLWOR expression
might seem to act strangely, or at least unintuitively. In particular,
variables defined before the `group by` clause can appear to go haywire.

The key to understanding what happens with variables when you use the `group by`
clause is the distinction between "pre-grouping tuples" and "post-grouping tuples".
Let's see the original discussion of these terms in the spec, from
https://www.w3.org/TR/xquery-31/#id-group-by, and then unpack these with a concrete
example.

> The `group by` clause assigns each pre-grouping tuple to a group, and
generates one post-grouping tuple for each group. In the post-grouping tuple
for a group, each grouping key is represented by a variable that was specified
in a GroupingSpec, and every variable that appears in the pre-grouping tuples
that were assigned to that group is represented by a variable of the same name,
bound to a sequence of all values bound to the variable in any of these
pre-grouping tuples. Subsequent clauses in the FLWOR expression see only the
variable bindings in the post-grouping tuples; they no longer have access to
the variable bindings in the pre-grouping tuples. The number of post-grouping
tuples is less than or equal to the number of pre-grouping tuples.

In our example query below, we will take a list of terms (apple, banana, and
blackberry) and group them together based on their first letter (a => apple,
b => banana, blackberry). (Imagine a back-of-book index, in which the terms
are grouped by first letter.)

Here's the code:

:)

let $terms := ("apple", "banana", "blackberry")
for $term in $terms
group by $first-letter := substring($term, 1, 1)
let $new-terms := ("apple", "banana", "blackberry")
return
    map {
        "first-letter": $first-letter,
        "term": $term,
        "terms": $terms,
        "new-terms": $new-terms
    }

(:

Since our FLWOR expression's `for` clause iterates over a sequence of 3 terms,
there are 3 "pre-grouping tuples":

    1.
        - "term": "apple"
        - "first-letter": "a"
        - "terms": ("apple", "banana", "blackberry")
    2.
        - "term": "banana"
        - "first-letter": "b"
        - "terms": ("apple", "banana", "blackberry")
    3.
        - "term": "blackberry"
        - "first-letter": "b"
        - "terms": ("apple", "banana", "blackberry")
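To inspect these pre-grouping tuples directly, here is a minimal sketch (not
part of the original query) that swaps the `group by` clause for a plain `let`
clause, so that nothing is grouped and each tuple is returned as its own map:

```xquery
let $terms := ("apple", "banana", "blackberry")
for $term in $terms
(: a plain let instead of group by, so the tuples stay ungrouped :)
let $first-letter := substring($term, 1, 1)
return
    map {
        "term": $term,
        "first-letter": $first-letter,
        "terms": $terms
    }
```

This variant should return 3 maps, one per term, each carrying the full
3-term "terms" sequence.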

As soon as we apply the `group by` clause and group the terms by the
"first-letter" grouping key, the post-grouping tuples are generated. There are
only 2 values for the grouping key, "a" and "b", so our 3 pre-grouping
tuples are grouped into 2 post-grouping tuples (I've starred the grouping
key):

    1.
        "first-letter"*: "a"
        "term": "apple"
    2.
        "first-letter"*: "b"
        "term": ("banana", "blackberry")

But there was another variable in our pre-grouping tuples—namely, "terms". And
we defined a "new-terms" variable after the `group by` clause. What happened to
these variables in the process of grouping? Quoting again from the spec:

> ... every variable that appears in the pre-grouping tuples that were
assigned to that group is represented by a variable of the same name, bound
to a sequence of all values bound to the variable in any of these
pre-grouping tuples.

Applying this to our example, here is the full set of 2 tuples:

    1.
        "first-letter"*: "a"
        "term": "apple"
        "terms": ("apple", "banana", "blackberry")
        "new-terms": ("apple", "banana", "blackberry")
    2.
        "first-letter"*: "b"
        "term": ("banana", "blackberry")
        "terms": ("apple", "banana", "blackberry", "apple", "banana", "blackberry")
        "new-terms": ("apple", "banana", "blackberry")

Note the following changes to our variables:

1. The "term" variable (defined before the `group by` clause) is no longer a
single term in the post-grouping tuples, but a sequence of terms that match
the grouping key: the "a" term in the first post-grouping tuple, and two "b"
terms in the second post-grouping tuple.

2. In the 2nd tuple, the "terms" variable (defined before the `group by`
clause) is no longer just the sequence of 3 terms; there are now 6 terms.
This is because the "terms" variable now contains the "sequence of all
values bound to the variable" in any of the pre-grouping tuples assigned to
that group. Since 2 terms start with the letter "b", and each of their
pre-grouping tuples carried the full 3-term sequence, the "terms" variable
now contains 2 copies of that sequence.

3. The "new-terms" variable (defined after the `group by` clause) remains
unchanged.
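
The three changes listed above can be checked by counting the items bound to
each variable in each post-grouping tuple. The sketch below is a variation on
the original query; the map keys ("term-count" and so on) are made up for
illustration, and the `order by` clause is added only because the spec leaves
the order of post-grouping tuples implementation-dependent:

```xquery
let $terms := ("apple", "banana", "blackberry")
for $term in $terms
group by $first-letter := substring($term, 1, 1)
let $new-terms := ("apple", "banana", "blackberry")
(: sort explicitly, since the order of groups is implementation-dependent :)
order by $first-letter
return
    map {
        "first-letter": $first-letter,
        "term-count": count($term),
        "terms-count": count($terms),
        "new-terms-count": count($new-terms)
    }
```

Based on the tuples shown above, the "a" group should report counts of 1, 3,
and 3, and the "b" group counts of 2, 6, and 3.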

These changes occurred because, as the spec says:

> Subsequent clauses in the FLWOR expression see only the variable bindings in
the post-grouping tuples; they no longer have access to the variable bindings
in the pre-grouping tuples.

This loss of "access" is important, because an expression like `count($terms)`
will now return different results depending on which post-grouping tuple it is
called in—3 for "a", 6 for "b".

In contrast, the "new-terms" variable didn't change, because it was bound
after the `group by` clause—i.e., in the post-grouping tuple itself. So an
expression like `count($new-terms)` will always return 3.
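
If the goal is to keep referring to the original 3-term list after grouping,
one option (an addition here, not part of the original example) is to bind
"terms" in an outer FLWOR expression. The inner FLWOR's tuples then contain
only "term" and "first-letter", so the `group by` clause has no "terms"
binding to regroup. A minimal sketch, with a hypothetical "terms-count" key:

```xquery
let $terms := ("apple", "banana", "blackberry")
return
    (: $terms is bound by the outer FLWOR, not by the inner tuple stream :)
    for $term in $terms
    group by $first-letter := substring($term, 1, 1)
    order by $first-letter
    return
        map {
            "first-letter": $first-letter,
            "term": $term,
            "terms-count": count($terms)
        }
```

Here `count($terms)` should be 3 in both groups, because the inner `group by`
only rebinds variables that belong to its own FLWOR expression.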

:)


## group-by_results.xq
map {
    "first-letter": "a",
    "term": "apple",
    "terms": ("apple", "banana", "blackberry"),
    "new-terms": ("apple", "banana", "blackberry")
},
map {
    "first-letter": "b",
    "term": ("banana", "blackberry"),
    "terms": ("apple", "banana", "blackberry", "apple", "banana", "blackberry"),
    "new-terms": ("apple", "banana", "blackberry")
}