Create a gist now

Instantly share code, notes, and snippets.

What would you like to do?
An XSL stylesheet testing for presence of characters from a certain Unicode block (in this case, Cyrillic) and reporting a message with filename of file containing such characters. Useful for cleaning up OCR, correcting homographs.
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:tei="http://www.tei-c.org/ns/1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0"
exclude-result-prefixes="tei">
<xsl:output method = "xml" indent="yes" omit-xml-declaration="no" />
<!-- 16croala-testforunicodeblocks: test text//text() nodes for characters from certain Unicode blocks -->
<xsl:template match="//*:text//text()">
<xsl:if test="matches(., '[\p{IsCyrillic}\p{IsCyrillicSupplement}\p{IsCyrillicExtended-A}\p{IsCyrillicExtended-B}]')">
<xsl:message>Characters from Cyrillic Unicode blocks in <xsl:value-of select="base-uri(.)"/></xsl:message>
</xsl:if>
</xsl:template>
</xsl:stylesheet>
@nevenjovanovic

This comment has been minimized.

Show comment Hide comment
Owner

nevenjovanovic commented May 1, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment