@ubermichael
Created March 11, 2021 17:51
Extract text between page breaks in TEI with PHP
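// Assumes $xp is a DOMXPath with the tei prefix registered, $dir is an
// output directory, and $id is a file name prefix; none are defined here.
$pageCount = 0;
$text = '';
// Select every text node and <pb/> element under <text>, in document order.
$nodes = $xp->query('/tei:TEI/tei:text//node()[self::text() or self::tei:pb]');
foreach ($nodes as $node) {
    if ($node instanceof DOMElement) {
        // A <pb/> ends the current page: collapse whitespace runs and write it out.
        $pageCount++;
        $content = preg_replace('/[[:space:]]{2,}/u', ' ', $text) . "\n";
        $fn = sprintf('%s/%s_%04d', $dir, $id, $pageCount);
        file_put_contents($fn, $content);
        $text = '';
    }
    if ($node instanceof DOMText) {
        $text .= $node->textContent . ' ';
    }
}
// Text after the last <pb/> is still in $text here and is not written.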
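
The snippet relies on a DOMXPath object in `$xp` that it does not create. A minimal self-contained sketch of the same walk follows, assuming the document uses the standard TEI namespace `http://www.tei-c.org/ns/1.0`. The function name `splitTeiAtPageBreaks` and the sample document are hypothetical, and the sketch returns the pages as an array instead of writing files, so it is a variation on the gist and not the author's exact method.

```php
<?php
// Sketch only: the function name and the sample document below are
// hypothetical and are not part of the original gist.

/**
 * Split the text of a TEI document at each <pb/> element.
 *
 * @return string[] one entry per <pb/>, plus the text after the last one
 */
function splitTeiAtPageBreaks(string $xml): array
{
    $dom = new DOMDocument();
    $dom->loadXML($xml);

    $xp = new DOMXPath($dom);
    // The tei: prefix must be bound to the TEI namespace before querying.
    $xp->registerNamespace('tei', 'http://www.tei-c.org/ns/1.0');

    $pages = [];
    $text = '';
    $nodes = $xp->query('/tei:TEI/tei:text//node()[self::text() or self::tei:pb]');
    foreach ($nodes as $node) {
        if ($node instanceof DOMElement) {
            // A page break ends the current chunk of text.
            $pages[] = trim(preg_replace('/[[:space:]]{2,}/u', ' ', $text));
            $text = '';
        } elseif ($node instanceof DOMText) {
            $text .= $node->textContent . ' ';
        }
    }
    // Keep whatever follows the final <pb/>.
    $pages[] = trim(preg_replace('/[[:space:]]{2,}/u', ' ', $text));

    return $pages;
}

$xml = <<<XML
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt><title>Ignored</title></titleStmt></fileDesc></teiHeader>
  <text><body>
    <pb n="1"/><p>First page.</p>
    <pb n="2"/><p>Second <hi>page</hi> text.</p>
  </body></text>
</TEI>
XML;

print_r(splitTeiAtPageBreaks($xml));
```

For the sample document the result has three entries: an empty string for the whitespace before the first `<pb/>`, then the text of each of the two pages. Content in `teiHeader` is skipped because the XPath expression starts at `tei:text`.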