Instantly share code, notes, and snippets.

joewiz/an-introduction-to-recursion-in-xquery.md

Last active January 3, 2024 15:30
Show Gist options
• Save joewiz/6762f1d8826fc291c3884cce3634eb77 to your computer and use it in GitHub Desktop.
An introduction to recursion in XQuery

An introduction to recursion in XQuery

• Created: Nov 28, 2017
• Updated: Nov 29, 2017: Now covers transformation of XML documents

Recursion is a powerful programming technique, but the idea is simple: instead of performing a single operation, a function calls itself repeatedly to whittle through a larger task. In XQuery, recursion can be used to accomplish complex tasks on data that a plain FLWOR expression (which iterates through a sequence) cannot, such as transforming an entire XML document from one format into another, like TEI or DocBook into HTML, EPUB, LaTeX, or XSL-FO. Transforming a document is well-suited to recursion because each of the document's nodes may need to be examined and manipulated based on the node's type, name, and location in the document; and once a node has been processed, the transformation must continue processing the nodes' children and descendants until the deepest leaf node has been processed. But learning the technique of recursion is often hard for a beginning programmer, perhaps because tutorials jump directly to complex operations such as document transformation. So let's start with a simple illustration of the technique of recursion in XQuery and build up to document transformation.

To illustrate basic recursion in XQuery, here is a simple function, `local:countdown()`, which takes a number, returns that number, and then calls itself with a new input (the original number minus one), until the input is 1, at which point it returns the words "lift off!" and ends:

```xquery version "1.0";

declare function local:countdown(\$n as xs:integer) {
\$n,
if (\$n eq 1) then
"lift off!"
else
local:countdown(\$n - 1)
};

local:countdown(5)```

The query returns the sequence, `(5, 4, 3, 2, 1, "lift off!")`.

We could also have accomplished the same result with a FLWOR expression:

```xquery version "1.0";

let \$start := 5
for \$n in reverse(1 to \$start)
return
(
\$n,
if (\$n eq 1) then
"lift off"
else
()
)```

This query returns the identical sequence as the recursive query above, and it is a perfectly valid, even preferable, approach to the challenge of a simple countdown. Understanding how these different approaches arrived at the identical result is a critical first step to understanding recursion. To guide your thinking, consider this: The FLWOR expression is the XQuery language's powerful and built-in technique for iterating through sequences. In contrast, recursion, which is a technique rather than a built-in expression, opens the door to processing more complex structures than flat sequences. The FLWOR expression may meet all your needs. But if you ever find yourself struggling with the limitations of the sequence-oriented design of the FLWOR expression, you may find that recursion is what you need to accomplish your goal.

Before demonstrating how to apply recursion to document transformation challenges, let's reinforce the concepts above by exploring new facilities in XQuery 3.0 that improve the expressiveness and convenience of using recursion in your code.

With XQuery 3.0, XQuery programmers are no longer limited to "named functions", like the one above, but can now also use inline function expressions, which are "anonymous" functions defined directly in an expression rather than a traditional, standalone, named function that can be defined only in a query prolog. Inline functions are useful when a function only has local relevance and doesn't need to be invoked from other contexts. They are also useful for higher-order functions, functions that themselves accept function items as a parameter.

Let's apply these ideas to our simple recursive countdown query from earlier. Adapting a named function to be an inline function can require some minor but important changes to a function's signature when the function calls itself recursively. In the case of our function, you might think you could use the same function signature, with the single parameter `\$n as xs:integer`. For example:

```xquery version "3.0";

let \$countdown :=
function(\$n as xs:integer) {
if (\$n eq 1) then
"lift off!"
else
local:countdown(\$n - 1)
}
return
\$countdown(5)```

However, this query fails with the following error:

err:XPDY0002 variable '\$countdown' is not set. [at line 7, column 9]

To avoid this error, we need to add an additional parameter to the inline function, which itself is a function. Then we can adapt the function call in the function body to use this parameter instead of the "not set" `\$countdown` variable. Here is the result of these changes:

```xquery version "3.0";

let \$countdown :=
function(\$n as xs:integer, \$f as function(*)) {
\$n,
if (\$n eq 1) then
"lift off!"
else
\$f(\$n - 1, \$f)
}
return
\$countdown(5, \$countdown)```

At first glance this workaround makes the use of recursion with inline function expressions a bit more cumbersome, but at least it works! This workaround is a pedagogical excuse to illustrate of the "higher-order function" feature of XQuery 3.0 mentioned above, as well as a particular kind of use of higher-order functions, called "callback functions."

Callback functions are functions that are passed as a parameter to a function, which the receiving function "calls back" to with its results. For example, you could design function A to accept a callback function as a parameter, to which it will send its results. Now, when you call function A, you will pass function B to it as this parameter. Function A will then send its results to function B.

Adapting this idea to our examples above, we can rework our "countdown" function to be a little more generic. Instead of hard-coding the "lift off!" announcement in this function, we can adapt the function to accept a third parameter, `\$announce`, which will take an annonymous function that we will use as a callback function, i.e., calling it when the counting job is done:

```xquery version "3.0";

let \$countdown :=
function(\$n as xs:integer, \$do as function(*), \$announce as function(*)) {
\$n,
if (\$n eq 1) then
\$announce()
else
\$do(\$n - 1, \$do, \$announce)
}
let \$announce := function() { "lift off at " || current-time() || "!" }
return
\$countdown(5, \$countdown, \$announce)```

The query returns the sequence, `(5, 4, 3, 2, 1, "lift off at 13:00:25.779-05:00!")`. With this change, our "countdown" function now focuses on its core task of counting down, and the task of "announcing" the lift off is delegated to a separate function.

We have one more variation to introduce. What if, instead of defining a new custom "announce" function, we wanted our "announce" function to call a built-in function, or a named function defined in the same module or other library module? To do this we can refer to the built-in or already-defined function using a named function reference. For example, we can refer to the `fn:current-time()` function using the syntax `current-time#0`. Here, the `#0` indicates that we are referring to the version of this function that takes zero parameters. (Every function takes a certain number of arguments; this number is known as the function's arity.) Changing the value of `\$announce` to this named function reference will mean that our query's final item will just be the current time. Between inline functions and named function references, you now know everything you need to be able to pass functions around as arguments in your XQuery code.

Let's leverage all of the techniques here to transform an XML document from one format to another (HTML to TEI), not only renaming elements, but also enriching the content in the process. We will perform the transformation with a function, `local:transform()` that takes two parameters: node(s) from the source document and a custom function for enriching text nodes. The heart of this function will use the XQuery FLWOR and typeswitch expressions to recusively step through the document structure, evaluate each input node by type and name, and apply the following rules:

1. Rename certain HTML elements: Turn `<html:h1>` (and `<html:h2>`, etc.) elements into `<tei:head>` elements, stripping off attributes and recursively passing all child nodes back to the `local:transform()` function.
2. Reuse certain HTML element names: Turn other "known" HTML elements into a TEI element of the same name, stripping off attributes and recursively passing all child nodes back to the `local:transform()` function.
3. For any other "unknown" elements from the source document, preserve the original element names and attributes, recursively passing all child nodes back to the `local:transform()` function.
4. Enrich text nodes by applying a custom function.
5. Let all other types of nodes through unchanged.

We will define our function for enriching text nodes as an inline function, using the `fn:analyze-string()` function to find substrings that match a simple regular expression date pattern and wrapping the matches with a `<tei:date>` element. Lastly, we will deploy a named function reference by referring to a verbosely named function (`local:make-tei-element-name()`) using a shorthand variable (`\$tei`).

Here is the full query that accomplishes what we have described:

```xquery version "3.1";

declare namespace html = "http://www.w3.org/1999/xhtml";
declare namespace tei = "http://www.tei-c.org/ns/1.0";

(:~ A convenience function for creating element names in the TEI namespace :)
declare function local:make-tei-element-name(\$local-name as xs:string) {
QName("http://www.tei-c.org/ns/1.0", \$local-name)
};

(:~ A skeleton function for recursively transforming HTML into TEI :)
declare function local:transform(\$nodes as node()*, \$enrich-text as function(*)) {
let \$tei := local:make-tei-element-name#1
for \$node in \$nodes
return
typeswitch (\$node)
case element(html:h1) | element(html:h2) | element(html:h3) return
element
{ local:transform(\$node/node(), \$enrich-text) }
(: html:div => tei:div, html:p => tei:p :)
case element(html:div) | element(html:p) return
element
{ \$tei(\$node/name()) }
{ local:transform(\$node/node(), \$enrich-text) }
(: leave unknown elements intact, including attributes :)
case element() return
element
{ node-name(\$node) }
{ \$node/@*, local:transform(\$node/node(), \$enrich-text) }
case text() return
\$enrich-text(\$node)
default return
\$node
};

let \$input :=
<div xmlns="http://www.w3.org/1999/xhtml">
<h1>Introduction</h1>
<p class="fact">Pearl Harbor was attacked on December 7, 1941.</p>
<video width="320" height="240">
<source src="movie.mp4" type="video/mp4"/>
</video>
<!-- to be continued -->
</div>
let \$enrich-text :=
(: An inline function for enriching text :)
function(\$text as text()) {
let \$analysis := analyze-string(\$text, "\w+ \d{1,2}, \d{4}")
for \$result in \$analysis/*
return
if (\$result instance of element(fn:match)) then
element
{ local:make-tei-element-name("date") }
{ \$result/string() }
else
\$result/string()
}
return
local:transform(\$input, \$enrich-text)```

This query returns the HTML document as a TEI document, with the date string wrapped in a TEI `<date>` element and the comment node and "unexpected" HTML elements intact:

```<div xmlns="http://www.tei-c.org/ns/1.0">
<p>Pearl Harbor was attacked on <date>December 7, 1941</date>.</p>
<video xmlns="http://www.w3.org/1999/xhtml" width="320" height="240">
<source src="movie.mp4" type="video/mp4"/>
</video>
<!-- to be continued -->
</div>```

Take some time to unpack the logic and flow of the transformation here, making note of how we combined recursion with FLWOR and typeswitch expressions, as well as named, inline, and higher-order functions, to perform targeted operations on different parts of the source document. You may well be able to use a subset of these techniques for more straightforward transformations (for example, transforming elements but omitting text enrichment, or only performing text enrichment while keeping the vocabulary intact), but you can also build on these patterns to do more than is shown here.

Learning to use recursion and these other techniques takes some practice, so be on the look out for opportunities to use them. Before long, you'll find yourself reaching for them to help you manipulate and transform data and get work done.