An XSLT 2.0 stylesheet that tests text nodes for characters from a given set of Unicode blocks (in this case, Cyrillic) and emits a message naming the file that contains them. Useful for cleaning up OCR output and correcting homographs.
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:tei="http://www.tei-c.org/ns/1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0"
    exclude-result-prefixes="tei">
  <xsl:output method="xml" indent="yes" omit-xml-declaration="no"/>
  <!-- 16croala-testforunicodeblocks: test text//text() nodes for characters from certain Unicode blocks -->
  <xsl:template match="//*:text//text()">
    <!-- Report the document URI if this text node contains any character from the Cyrillic blocks -->
    <xsl:if test="matches(., '[\p{IsCyrillic}\p{IsCyrillicSupplement}\p{IsCyrillicExtended-A}\p{IsCyrillicExtended-B}]')">
      <xsl:message>Characters from Cyrillic Unicode blocks in <xsl:value-of select="base-uri(.)"/></xsl:message>
    </xsl:if>
  </xsl:template>
</xsl:stylesheet>
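When correcting homographs it can help to see which characters triggered the message, not only which file they are in. The following is a minimal sketch of one way to do that, not part of the original stylesheet: it assumes an XSLT 2.0 processor (both matches() and xsl:analyze-string are 2.0 features), checks only the basic Cyrillic block for brevity, and uses a variable name ($uri) chosen purely for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
  <!-- Suppress the built-in rule that would copy all other text nodes to the output -->
  <xsl:template match="text()"/>
  <!-- Text nodes under any element named "text" that contain a character from the Cyrillic block -->
  <xsl:template match="*:text//text()[matches(., '\p{IsCyrillic}')]">
    <!-- Capture the URI first: inside xsl:analyze-string the context item is a string, not a node -->
    <xsl:variable name="uri" select="base-uri(.)"/>
    <!-- The regex attribute is an attribute value template, so literal braces are doubled -->
    <xsl:analyze-string select="." regex="\p{{IsCyrillic}}+">
      <xsl:matching-substring>
        <xsl:message>Cyrillic run "<xsl:value-of select="."/>" in <xsl:value-of select="$uri"/></xsl:message>
      </xsl:matching-substring>
    </xsl:analyze-string>
  </xsl:template>
</xsl:stylesheet>
```

Each message then shows one contiguous run of Cyrillic characters, which makes a single stray homograph inside an otherwise Latin word easier to locate. The empty text() template is a design choice rather than a requirement: without it, the processor's built-in rules copy the remaining text of the document to the result.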
nevenjovanovic commented May 1, 2016
