How variables in XQuery FLWOR expressions change when using the "group by" clause
xquery version "3.1";

(:

## How variables in XQuery FLWOR expressions change when using the `group by` clause

Sometimes, when working with a `group by` clause, an XQuery FLWOR expression
might suddenly seem to act strangely, or at least unintuitively. In particular,
variables defined before the `group by` clause might suddenly seem to go haywire.

The key to understanding what happens with variables when you use the `group by`
clause is the distinction between "pre-grouping tuples" and "post-grouping tuples".
Let's see the original discussion of these terms in the spec, from
https://www.w3.org/TR/xquery-31/#id-group-by, and then unpack these with a concrete
example.

> The `group by` clause assigns each pre-grouping tuple to a group, and
  generates one post-grouping tuple for each group. In the post-grouping tuple
  for a group, each grouping key is represented by a variable that was specified
  in a GroupingSpec, and every variable that appears in the pre-grouping tuples
  that were assigned to that group is represented by a variable of the same name,
  bound to a sequence of all values bound to the variable in any of these
  pre-grouping tuples. Subsequent clauses in the FLWOR expression see only the
  variable bindings in the post-grouping tuples; they no longer have access to
  the variable bindings in the pre-grouping tuples. The number of post-grouping
  tuples is less than or equal to the number of pre-grouping tuples.

In our example query below, we will take a list of terms (apple, banana, and
blackberry) and group them together based on their first letter (a => apple,
b => banana, blackberry). (Imagine a back-of-book index, in which the terms
are grouped by first letter.)

Here's the code:

:)

let $terms := ("apple", "banana", "blackberry")
for $term in $terms
group by $first-letter := substring($term, 1, 1)
let $new-terms := ("apple", "banana", "blackberry")
return
    map {
        "first-letter": $first-letter,
        "term": $term,
        "terms": $terms,
        "new-terms": $new-terms
    }

(:

Since our FLWOR expression's `for` clause iterates over a sequence of 3 terms,
there are 3 "pre-grouping tuples":

1.
    - "term": "apple"
    - "first-letter": "a"
    - "terms": ("apple", "banana", "blackberry")
2.
    - "term": "banana"
    - "first-letter": "b"
    - "terms": ("apple", "banana", "blackberry")
3.
    - "term": "blackberry"
    - "first-letter": "b"
    - "terms": ("apple", "banana", "blackberry")

As soon as we apply the `group by` clause and group the terms by the
"first-letter" grouping key, the post-grouping tuples are generated. There are
only 2 values for the grouping key, "a" and "b", so our 3 pre-grouping
tuples are grouped into 2 post-grouping tuples (I've starred the grouping
key):

1.
    "first-letter"*: "a"
    "term": "apple"
2.
    "first-letter"*: "b"
    "term": ("banana", "blackberry")
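
To check these two tuples, the query can be cut down to the grouping key and
`$term` alone, with the `$terms` binding left out for the moment. This is only
a sketch: the `order by` clause is an addition that is not in the query above,
included because the order of post-grouping tuples is otherwise
implementation-dependent.

```xquery
xquery version "3.1";

for $term in ("apple", "banana", "blackberry")
group by $first-letter := substring($term, 1, 1)
(: order by is an addition for this sketch; it fixes the order of the groups :)
order by $first-letter
return
    map {
        "first-letter": $first-letter,
        (: after grouping, $term holds every term in the group :)
        "term": $term
    }
```

This should return one map per group: "a" with "apple", then "b" with
("banana", "blackberry").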

But there was another variable in our pre-grouping tuples, namely "terms". And
we defined a "new-terms" variable after the `group by` clause. What happened to
these variables in the process of grouping? Quoting again from the spec:

> ... every variable that appears in the pre-grouping tuples that were
  assigned to that group is represented by a variable of the same name, bound
  to a sequence of all values bound to the variable in any of these
  pre-grouping tuples.

Applying this to our example, here is the full set of 2 tuples:

1.
    "first-letter"*: "a"
    "term": "apple"
    "terms": ("apple", "banana", "blackberry")
    "new-terms": ("apple", "banana", "blackberry")
2.
    "first-letter"*: "b"
    "term": ("banana", "blackberry")
    "terms": ("apple", "banana", "blackberry", "apple", "banana", "blackberry")
    "new-terms": ("apple", "banana", "blackberry")

Note the following changes to our variables:

1. The "term" variable (defined before the `group by` clause) is no longer a
   single term in the post-grouping tuples, but a sequence of terms that match
   the grouping key: the "a" term in the first post-grouping tuple, and two "b"
   terms in the second post-grouping tuple.
2. In the 2nd tuple, the "terms" variable (defined before the `group by`
   clause) is no longer just the sequence of 3 terms; there are now 6 terms.
   This is because the "terms" variable now contains the "sequence of all
   values bound to the variable" in the pre-grouping tuples assigned to that
   group. Effectively, since there are 2 terms that start with the letter "b",
   the "terms" variable now contains 2 sets of terms. In the 1st tuple it
   still holds 3 terms, because only one pre-grouping tuple fell into the "a"
   group.
3. The "new-terms" variable (defined after the `group by` clause) remains
   unchanged.

These changes occurred because, as the spec says:

> Subsequent clauses in the FLWOR expression see only the variable bindings in
  the post-grouping tuples; they no longer have access to the variable bindings
  in the pre-grouping tuples.

This loss of "access" is important, because an expression like `count($terms)`
will now return different results depending on which post-grouping tuple it is
called in: 3 for "a", 6 for "b".
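
A quick way to observe this is to return the counts themselves. The following
is a sketch that reuses the example query; the `order by` clause and the map
keys `count-term` and `count-terms` are additions for illustration.

```xquery
xquery version "3.1";

let $terms := ("apple", "banana", "blackberry")
for $term in $terms
group by $first-letter := substring($term, 1, 1)
(: order by is an addition; group order is otherwise implementation-dependent :)
order by $first-letter
return
    map {
        "first-letter": $first-letter,
        (: number of terms that fell into this group :)
        "count-term": count($term),
        (: $terms is concatenated once per pre-grouping tuple in the group :)
        "count-terms": count($terms)
    }
```

For "a" the two counts should be 1 and 3; for "b" they should be 2 and 6.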

In contrast, the "new-terms" variable didn't change, because it was bound
after the `group by` clause, i.e., in the post-grouping tuple itself. So an
expression like `count($new-terms)` will always return 3.
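
By the same logic, a value that should stay the same in every group can be
bound outside the tuple stream. One possible approach, sketched here as an
assumption rather than as part of the original query, is to move `$terms` into
a prolog variable declaration so that `group by` never sees it as a tuple
variable.

```xquery
xquery version "3.1";

(: assumption: $terms is declared in the prolog, not bound in the FLWOR :)
declare variable $terms := ("apple", "banana", "blackberry");

for $term in $terms
group by $first-letter := substring($term, 1, 1)
(: order by is an addition; it fixes the order of the groups :)
order by $first-letter
return
    map {
        "first-letter": $first-letter,
        "term": $term,
        (: still the original 3 terms, since $terms is not in the tuples :)
        "count-terms": count($terms)
    }
```

With this arrangement `count($terms)` should be 3 for both "a" and "b".
Wrapping the FLWOR in an outer `let $terms := ... return` would have a similar
effect, because the outer binding is not part of the inner tuple stream.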

:)

map {
    "first-letter": "a",
    "term": "apple",
    "terms": ("apple", "banana", "blackberry"),
    "new-terms": ("apple", "banana", "blackberry")
},
map {
    "first-letter": "b",
    "term": ("banana", "blackberry"),
    "terms": ("apple", "banana", "blackberry", "apple", "banana", "blackberry"),
    "new-terms": ("apple", "banana", "blackberry")
}