@joewiz
Last active January 9, 2018 17:26
Full text proximity search of Cyrillic text encoded in TEI XML using eXist-db

This gist was prompted by a question on the eXist-open mailing list. See http://markmail.org/message/zudf7qp4pqjx6xhi.

To run these files:

  • Download this gist as a .zip file
  • Uncompress the .zip file
  • Create collection /db/test in eXist
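  • Upload the contents of the zip file into the /db/test collection; for the index configuration to take effect, eXist also needs it stored as /db/system/config/db/test/collection.xconf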
  • Reindex /db/test with xmldb:reindex("/db/test")
  • Run /db/test/test.xq
  • The result should match that of the test_results.xml file included here.
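
The collection setup and reindexing steps above can also be scripted. The following is only a sketch, not part of the original gist: it assumes it is run by a dba user, that /db/system/config/db already exists (as in a default install), and that the TEI file is uploaded into /db/test separately. The index configuration is inlined from this gist so the sketch is self-contained.

```xquery
xquery version "3.1";

(: The index configuration from this gist, inlined for self-containment. :)
let $xconf :=
    <collection xmlns="http://exist-db.org/collection-config/1.0">
        <index xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:tei="http://www.tei-c.org/ns/1.0">
            <fulltext default="none" attributes="false"/>
            <lucene>
                <text qname="tei:p"/>
                <text qname="tei:w"/>
            </lucene>
        </index>
    </collection>
(: Create the data collection /db/test. :)
let $data := xmldb:create-collection("/db", "test")
(: Assumption: /db/system/config/db already exists. :)
let $config := xmldb:create-collection("/db/system/config/db", "test")
(: Store the configuration where eXist looks for it. :)
let $stored := xmldb:store($config, "collection.xconf", $xconf)
return
    (: Rebuild the indexes of /db/test; returns true() on success. :)
    xmldb:reindex($data)
```

If the TEI file is uploaded after this script has run, it is indexed on upload; if it was uploaded earlier, the final xmldb:reindex call brings the index up to date.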
<collection xmlns="http://exist-db.org/collection-config/1.0">
<index xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:tei="http://www.tei-c.org/ns/1.0">
<fulltext default="none" attributes="false"/>
<lucene>
<text qname="tei:p"/>
<text qname="tei:w"/>
</lucene>
</index>
</collection>
<?xml version="1.0" encoding="UTF-8"?>
<teiCorpus xmlns="http://www.tei-c.org/ns/1.0">
<TEI>
<p xml:id="p17">
<s>
<w xml:id="w59">је</w>
<w xml:id="w61">виши</w>
<w xml:id="w63">земаљски</w>
<w xml:id="w65">судски</w>
<w xml:id="w67">саветник</w> (
<w xml:id="w69">апелациони</w>
<w xml:id="w71">саветник</w>)
<w xml:id="w73">у</w>
<w xml:id="w77">веома</w>
<w xml:id="w79">познат</w>
<w xml:id="w81">на</w>
<w xml:id="w83">пољу</w>
<w xml:id="w85">српске</w>
<w xml:id="w87">књижевности</w>,
<w xml:id="w89">како</w>
<w xml:id="w91">због</w>
<w xml:id="w93">вишегодишњег</w>
<w xml:id="w95">уређивања</w> „
<w xml:id="w97">Далматинског</w>
<w xml:id="w99">алманаха</w>“,
<w xml:id="w101">за</w>
</s>
</p>
</TEI>
</teiCorpus>
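xquery version "3.1";
declare namespace tei="http://www.tei-c.org/ns/1.0";
(: Lucene proximity query: the two words must occur within 1 word of each other. :)
let $query :=
'"српске књижевности"~1'
(: An alternative form of the query, in eXist's XML query syntax:
<query><near><term>српске</term><term>књижевности</term></near></query>
:)
return
(: Select matching paragraphs; util:expand() marks each hit with exist:match. :)
collection("/db/test")//tei:p[ft:query(., $query)] => util:expand()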
<p xmlns="http://www.tei-c.org/ns/1.0" xml:id="p17">
<s>
<w xml:id="w59">је</w>
<w xml:id="w61">виши</w>
<w xml:id="w63">земаљски</w>
<w xml:id="w65">судски</w>
<w xml:id="w67">саветник</w> (
<w xml:id="w69">апелациони</w>
<w xml:id="w71">саветник</w>)
<w xml:id="w73">у</w>
<w xml:id="w77">веома</w>
<w xml:id="w79">познат</w>
<w xml:id="w81">на</w>
<w xml:id="w83">пољу</w>
<w xml:id="w85"><exist:match xmlns:exist="http://exist.sourceforge.net/NS/exist">српске</exist:match></w>
<w xml:id="w87"><exist:match xmlns:exist="http://exist.sourceforge.net/NS/exist">књижевности</exist:match></w>,
<w xml:id="w89">како</w>
<w xml:id="w91">због</w>
<w xml:id="w93">вишегодишњег</w>
<w xml:id="w95">уређивања</w> „
<w xml:id="w97">Далматинског</w>
<w xml:id="w99">алманаха</w>“,
<w xml:id="w101">за</w>
</s>
</p>
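
The exist:match elements in the result above can be used to pull out just the matched words. The following is a minimal sketch, not part of the original gist; it assumes the same /db/test collection and index configuration, and the hit element name is made up for illustration.

```xquery
xquery version "3.1";
declare namespace tei="http://www.tei-c.org/ns/1.0";
declare namespace exist="http://exist.sourceforge.net/NS/exist";

(: Same proximity query as in test.xq. :)
let $query := '"српске књижевности"~1'
let $hits := collection("/db/test")//tei:p[ft:query(., $query)]
(: Expand the hits, then keep only the tei:w elements that contain a match. :)
for $w in util:expand($hits)//tei:w[exist:match]
return
    (: "hit" is a hypothetical wrapper element, used here only for display. :)
    <hit id="{$w/@xml:id}">{string($w)}</hit>
```

With the sample data in this gist, this should return one hit element for each of the two words marked in the result above (w85 and w87).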