@nevenjovanovic
Last active May 1, 2016 18:11
An XSLT 2.0 stylesheet that tests for characters from a given Unicode block (here, the Cyrillic blocks) and emits a message naming the file that contains them. Useful for cleaning up OCR output and correcting homographs (look-alike characters from another script).
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:tei="http://www.tei-c.org/ns/1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0"
    exclude-result-prefixes="tei">
  <xsl:output method="xml" indent="yes" omit-xml-declaration="no"/>
  <!-- 16croala-testforunicodeblocks: test text//text() nodes for characters from certain Unicode blocks -->
  <!-- *:text matches an element named "text" in any namespace, so TEI documents are covered. -->
  <!-- Text nodes outside such an element fall through to the built-in rules and are copied to the output. -->
  <xsl:template match="//*:text//text()">
    <!-- One message is emitted for every text node that contains at least one Cyrillic character. -->
    <xsl:if test="matches(., '[\p{IsCyrillic}\p{IsCyrillicSupplement}\p{IsCyrillicExtended-A}\p{IsCyrillicExtended-B}]')">
      <xsl:message>Characters from Cyrillic Unicode blocks in <xsl:value-of select="base-uri(.)"/></xsl:message>
    </xsl:if>
  </xsl:template>
</xsl:stylesheet>
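The stylesheet above reports once per matching text node and does not say which characters were found. A possible variant, offered as a sketch rather than as the author's method, reports once per document and lists the distinct offending characters, which may help when hunting homographs. It assumes the same *:text scope and replaces the block-name escapes with explicit code point ranges (U+0400–U+052F for Cyrillic and Cyrillic Supplement, U+2DE0–U+2DFF for Cyrillic Extended-A, U+A640–U+A69F for Cyrillic Extended-B), written in decimal because string-to-codepoints() returns integers. The variable names are illustrative.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
  <xsl:output method="text"/>
  <!-- Process the document once from the root instead of once per text node. -->
  <xsl:template match="/">
    <!-- Code points of every character inside any element named "text". -->
    <xsl:variable name="codepoints"
        select="string-to-codepoints(string-join(//*:text//text(), ''))"/>
    <!-- Distinct code points that fall inside the four Cyrillic blocks (decimal ranges). -->
    <xsl:variable name="cyrillic" select="distinct-values($codepoints[
        (. ge 1024 and . le 1327) or
        (. ge 11744 and . le 11775) or
        (. ge 42560 and . le 42655)])"/>
    <xsl:if test="exists($cyrillic)">
      <xsl:message>
        <xsl:text>Cyrillic characters in </xsl:text>
        <xsl:value-of select="base-uri(.)"/>
        <xsl:text>: </xsl:text>
        <!-- Convert each code point back to a character, separated by spaces. -->
        <xsl:value-of select="for $c in $cyrillic return codepoints-to-string($c)"
            separator=" "/>
      </xsl:message>
    </xsl:if>
  </xsl:template>
</xsl:stylesheet>
```

Because the output method is text and the root template produces no result content, the only visible effect is the message, which XSLT processors typically write to standard error.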