@joewiz
Last active December 21, 2021 22:36
Converting an eXist application from old-style fields to new, Lucene-based facets and fields

This article walks through the process of migrating an eXist application from using old-style fields to using the new, Lucene-based facets and fields. For more information, see the eXist documentation's Lucene article.

Old-style approach

In the old-style approach to fields, fields were constructed and maintained manually via the ft:index() function. To add or update fields for a document, a <doc> element containing <field> elements was passed to this function, along with the URI of the resource to be indexed.

For example, in one application, fields were constructed in the hsa/modules/index.xq library module, whose index:index-one-document() function built the <field> elements and passed them to the ft:index() function:

declare function index:index-one-document($doc) {
    let $titleStmt := (
        $doc//tei:sourceDesc/tei:biblFull/tei:titleStmt,
        $doc//tei:fileDesc/tei:titleStmt
    )
    let $index :=
        <doc>
            <field name="study-id" store="yes">
                { $doc/@xml:id/string() }
            </field>
            <field name="last-modified" store="yes">
                { xmldb:last-modified($config:data-root, util:document-name($doc)) }
            </field>
            {
                for $title in $titleStmt/tei:title
                return
                    <field name="title" store="yes">
                        { string-join($title/string(), " ") }
                    </field>
            }
            {
                for $author in $titleStmt/tei:author
                let $normalized := replace($author/string(), "^([^,]*,[^,]*),?.*$", "$1")
                return
                    <field name="author" store="yes">
                        { $normalized }
                    </field>
            }
            {
                for $tag in (tokenize($doc//tei:catRef[@scheme eq "hsg-taxonomy"]/@target, "\s+") ! substring-after(., "#"))
                return
                    <field name="tag" store="yes">
                        { replace($tag, "-", " ") }
                    </field>
            }
            <field name="year" store="yes">
                { $doc/tei:teiHeader/tei:fileDesc/tei:publicationStmt/tei:date/@when/string() ! substring(., 1, 4) }
            </field>
            <field name="evan-id" store="yes">
                { substring-after($doc/@xml:id, "hsa") }
            </field>
            {
                let $ps-number := $doc//tei:bibl[@type="legacy-policy-studies-number"]/string()
                return
                    <field name="ps-number" store="yes">
                        { $ps-number }
                    </field>
                ,
                let $availability := $doc/tei:teiHeader/tei:fileDesc/tei:publicationStmt/tei:availability/@status
                let $normalized := replace($availability/string(), "^([^,]*,[^,]*),?.*$", "$1")
                return
                    <field name="availability" store="yes">
                        { $normalized }
                    </field>
            }
        </doc>
    return
        ft:index(document-uri(root($doc)), $index)
};

In other words, this function created the following fields for the document:

  1. study-id: $doc/@xml:id/string()
  2. last-modified: xmldb:last-modified($config:data-root, util:document-name($doc))
  3. title: based on $titleStmt/tei:title
  4. author: based on $titleStmt/tei:author
  5. tag: based on $doc//tei:catRef/@target
  6. year: based on $doc/tei:teiHeader/tei:fileDesc/tei:publicationStmt/tei:date/@when
  7. evan-id: substring-after($doc/@xml:id, "hsa")
  8. ps-number: $doc//tei:bibl[@type="legacy-policy-studies-number"]/string()
  9. availability: based on $doc/tei:teiHeader/tei:fileDesc/tei:publicationStmt/tei:availability/@status
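
Several of these fields are "based on" a node rather than copied from it, because the value is normalized first. As a small illustration, the sketch below applies the regular expression from index:index-one-document() to three sample strings. The first two names appear in this article's query results; the third is a hypothetical input, made up here to show the case where the pattern actually changes something.

```xquery
xquery version "3.1";

(: The pattern used for the author field in index:index-one-document().
   It keeps everything up to (but not including) a second comma. :)
let $pattern := "^([^,]*,[^,]*),?.*$"
for $name in (
    "David S. Painter",          (: no comma: the pattern does not match :)
    "William F. Sanford, Jr.",   (: one comma: the whole string is captured :)
    "Sanford, William F., Jr."   (: hypothetical input with two commas :)
)
return
    <author original="{$name}">{ replace($name, $pattern, "$1") }</author>
```

The first two values come back unchanged, since fn:replace() returns its input when the pattern does not match and the one-comma name is captured in full. Only the two-comma input is shortened, to "Sanford, William F.".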

Having populated a document's fields with this information, the indexed documents could be searched by field, using the ft:search() function, as found in this excerpt from the app:search() function in the hsa/modules/app.xql library module:

for $item in ft:search($rootCol, $browse || ":" || search:sanitize-query($filter), ("study-id", "title", "ps-number", "author", "tag", "year", "evan-id", "availability"))/search
let $author := $item/field[@name = "author"]
order by $author[1], $author[2], $author[3]
return
    $item

In addition, when an application needed to combine a field-based query with a Lucene-based full text query of documents (which uses the ft:query() function), the implementation was quite complex. Together with the need to manage the contents of the field index by hand and the lack of any purpose-built faceting facility, this meant the old approach had inherent limitations and left developers with a lot of complexity to manage.

New, Lucene-based approach to facets and fields

In the new Lucene-based approach, the contents of the field index are defined in the same collection.xconf file as the Lucene-based full text index. This way, when a document is added or updated, the fields are updated automatically, along with all of the other indexes, in a single pass. The new approach also adds a native facets facility, and facets are easy to define and query. Indexed collections are queried with the ft:query() function, paired with two new functions for accessing field contents and facet counts: ft:field() and ft:facets().

Updating an application to use this new approach requires the following steps:

  1. Remove the old index.xq file for managing index contents.
  2. Remove any triggers responsible for calling the index.xq functions upon storing documents.
  3. Add field definitions to collection.xconf.
  4. Adapt search functions to use the new, consolidated ft:query() function.
  5. Add any required facet definitions to collection.xconf, and extend the search functions to use these.

Let's take one such field as an example: author. We will associate the fields for a document with the full text index on the <tei:body> node, in the collection.xconf file for the indexed collection, /db/apps/hsa-data/data:

<collection xmlns="http://exist-db.org/collection-config/1.0">
    <index xmlns:tei="http://www.tei-c.org/ns/1.0" xmlns:xs="http://www.w3.org/2001/XMLSchema">
        <lucene>
            <text qname="tei:body">
                <field name="author" expression="./ancestor::tei:TEI//tei:author"/>
                <!-- other fields and facets -->
            </text>
        </lucene>
    </index>
</collection>
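
A changed index definition only affects documents that are stored or reindexed after the change, so existing data usually needs a reindex. The following is a minimal sketch of that step, not part of the original application. It assumes the collection.xconf above has already been stored at /db/system/config/db/apps/hsa-data/data/collection.xconf and that the query is run by a user with DBA rights.

```xquery
xquery version "3.1";

(: Assumption: collection.xconf is already in place under
   /db/system/config/db/apps/hsa-data/data, and the current
   user is allowed to reindex (normally a member of the dba group). :)
xmldb:reindex("/db/apps/hsa-data/data")
```

On a large collection this can take a while, since every index defined for the collection is rebuilt, not just the new fields.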

Having applied this index, we can now perform a combined query of this field with a full text query of the document's contents, as follows:

xquery version "3.1";

declare namespace tei="http://www.tei-c.org/ns/1.0";

let $author-query := "david"
let $keyword-query := "germany"
let $q := $keyword-query || " author:" || $author-query
return
    collection("/db/apps/hsa-data/data")//tei:body[
        ft:query(
            ., 
            $q
        )
    ]

This query quickly returns all 18 studies that mention "Germany" by authors whose name contains "David".

If we want to retrieve the contents of the author field from the resulting studies, we can extend our query, as follows:

xquery version "3.1";

declare namespace tei="http://www.tei-c.org/ns/1.0";

let $author-query := "david"
let $keyword-query := "germany"
let $q := $keyword-query || " author:" || $author-query
let $hits :=
    collection("/db/apps/hsa-data/data")//tei:body[
        ft:query(
            ., 
            $q,
            map {
                "fields": "author"
            }
        )
    ]
return
    <hits>{
        for $hit in $hits
        return    
            <hit>
                <study-id>{$hit/ancestor::tei:TEI/@xml:id/string()}</study-id>
                { ft:field($hit, "author") ! <author>{.}</author> }
            </hit>
    }</hits>

This query will return the same 18 studies, in the form below (which I've trimmed to illustrate how many different authors have the name David):

<hits>
    <hit>
        <study-id>hsa1645</study-id>
        <author>David S. Painter</author>
    </hit>
    <hit>
        <study-id>hsa1427</study-id>
        <author>David Lawrence</author>
        <author>N. Stephen Kane</author>
    </hit>
    <hit>
        <study-id>hsa1829</study-id>
        <author>David S. Patterson</author>
        <author>William F. Sanford, Jr.</author>
    </hit>
    <hit>
        <study-id>hsa2074</study-id>
        <author>David F. Trask</author>
    </hit>
    <hit>
        <study-id>hsa1386</study-id>
        <author>David M. Baehler</author>
    </hit>
    <hit>
        <study-id>hsa1599</study-id>
        <author>David W. Mabon</author>
        <author>Charles S. Sampson</author>
    </hit>
</hits>
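
The old app:search() excerpt ordered its results by author. A similar ordering can be sketched with ft:field(), as below. This is an illustration and not code from the application. It reuses the query above and assumes that ft:field() returns one string per author in document order.

```xquery
xquery version "3.1";

declare namespace tei="http://www.tei-c.org/ns/1.0";

let $q := "germany author:david"
let $hits :=
    collection("/db/apps/hsa-data/data")//tei:body[
        ft:query(., $q, map { "fields": "author" })
    ]
return
    <hits>{
        for $hit in $hits
        (: All author values stored for this hit, as a sequence of strings :)
        let $author := ft:field($hit, "author")
        (: Same sort keys as the old code: first, second, then third author :)
        order by $author[1], $author[2], $author[3]
        return
            <hit>
                <study-id>{$hit/ancestor::tei:TEI/@xml:id/string()}</study-id>
                { $author ! <author>{.}</author> }
            </hit>
    }</hits>
```

Because the field as defined here stores names such as "David S. Painter", this sorts by forename. Sorting by surname would need a separate field whose expression produces a normalized, surname-first value.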

If, besides the studies themselves, we want to return a facet count of the authors of these studies, we can add a facet to our index definition:

<collection xmlns="http://exist-db.org/collection-config/1.0">
    <index xmlns:tei="http://www.tei-c.org/ns/1.0" xmlns:xs="http://www.w3.org/2001/XMLSchema">
        <lucene>
            <text qname="tei:body">
                <field name="author" expression="./ancestor::tei:TEI//tei:author"/>
                <facet dimension="author" expression="./ancestor::tei:TEI//tei:author"/>
                <!-- other fields and facets -->
            </text>
        </lucene>
    </index>
</collection>

Here is a query returning information on the author facet:

xquery version "3.1";

declare namespace tei="http://www.tei-c.org/ns/1.0";

let $author-query := "david"
let $keyword-query := "germany"
let $q := $keyword-query || " author:" || $author-query
let $hits :=
    collection("/db/apps/hsa-data/data")//tei:body[
        ft:query(
            ., 
            $q,
            map {
                "fields": "author"
            }
        )
    ]
return
    ft:facets($hits, "author")

The facet values and counts are returned as a map:

map {
    "Elizabeth B. Ballard": 1,
    "Stanley Shaloff": 1,
    "James E. Miller": 2,
    "William Z. Slany": 2,
    "David M. Baehler": 3,
    "David Lawrence": 1,
    "Kay Herring": 1,
    "Mary E.P. Grant": 1,
    "N. Stephen Kane": 1,
    "Peter L. Tester": 1,
    "Sherill B. Wells": 1,
    "Louis J. Smith": 1,
    "Neal H. Petersen": 1,
    "David F. Trask": 5,
    "David W. Mabon": 1,
    "Aaron D. Miller": 1,
    "Evan M. Duncan": 1,
    "William F. Sanford, Jr.": 2,
    "David S. Painter": 5,
    "Arthur G. Kogan": 1,
    "Gabrielle S. Mallon": 2,
    "Karen A. Collias": 2,
    "Edward C. Keefer": 2,
    "Charles S. Sampson": 2,
    "Nancy Golden": 1,
    "Ronald D. Landa": 2,
    "Nina J. Noring": 4,
    "David S. Patterson": 3,
    "Robert J. McMahon": 2,
    "S.Q. Johnson": 1,
    "Carl N. Raether": 1
}
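
Since the result is an ordinary XQuery 3.1 map, its entries come back in no particular order. For display, the counts usually need to be sorted. The sketch below uses only standard map functions to do so. To stay self-contained it works on a literal map holding a few of the entries shown above, standing in for the result of ft:facets($hits, "author").

```xquery
xquery version "3.1";

(: A literal stand-in for ft:facets($hits, "author"),
   using a handful of the entries listed above. :)
let $facets := map {
    "David F. Trask": 5,
    "David S. Painter": 5,
    "Nina J. Noring": 4,
    "David M. Baehler": 3,
    "Kay Herring": 1
}
return
    <facets>{
        for $author in map:keys($facets)
        let $count := $facets($author)
        (: Highest count first; ties broken alphabetically by name :)
        order by $count descending, $author
        return
            <facet name="{$author}" count="{$count}"/>
    }</facets>
```

In an application, the literal map would be replaced by the ft:facets() call, and each <facet> element could become a link that re-runs the search with that author selected.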

If, instead, we know the exact author at the time of querying, we can drill down on the author facet by passing that value in the facets option of the query:

xquery version "3.1";

declare namespace tei="http://www.tei-c.org/ns/1.0";

let $author-facet := "David S. Patterson"
let $keyword-query := "germany"
let $hits :=
    collection("/db/apps/hsa-data/data")//tei:body[
        ft:query(
            ., 
            $keyword-query,
            map {
                "facets": map { "author": $author-facet }
            }
        )
    ]
return
    ft:facets($hits, "author")

This returns updated facet counts, limited to the studies mentioning Germany in which one of the authors was David S. Patterson:

map {
    "Neal H. Petersen": 1,
    "William F. Sanford, Jr.": 2,
    "Edward C. Keefer": 1,
    "Ronald D. Landa": 1,
    "Nina J. Noring": 1,
    "David S. Patterson": 3,
    "Robert J. McMahon": 1
}

To query by facet alone, without performing a keyword query, just supply the empty sequence in place of the full text query. The facet counts then cover every indexed study by the selected author, whatever its text:

ft:query(
    ., 
    (),
    map {
        "facets": map { "author": $author-facet }
    }
)

Thus, the new facets and fields facility allows full text, field, and facet queries to be combined flexibly.

Please see the eXist documentation's Lucene article for more information.
