@hubgit
Created June 22, 2012 11:57
XPath query on all HTML files in a (nested) directory
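<?php
// path = directory containing HTML files
// query = XPath query to run on each file, e.g. "//object/@data"
$path = $argv[1] ?? null;
$query = $argv[2] ?? null;
if (!$path || !$query) exit("Usage: $argv[0] path query\n");
if (!is_dir($path)) exit("$path is not a directory\n");

// Walk the directory tree, keeping only paths that end in .html
$files = new RecursiveIteratorIterator(new RecursiveDirectoryIterator($path));
$matches = new RegexIterator($files, '/\.html$/i', RegexIterator::GET_MATCH);

// Collect libxml parse warnings internally instead of printing them
libxml_use_internal_errors(true);

$dom = new DOMDocument;
$items = array();

foreach ($matches as $file => $match) {
    // The iterator key is the file's path; the value is the regex match
    $loaded = $dom->loadHTMLFile($file);
    libxml_clear_errors(); // discard buffered warnings so they do not pile up
    if (!$loaded) continue;

    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query($query);
    // query() returns false when the XPath expression is malformed
    if ($nodes === false) exit("Invalid XPath query: $query\n");
    if (!$nodes->length) continue;

    foreach ($nodes as $node) {
        $items[] = $node->textContent;
    }
}

// sort() reindexes the results, so they print in order with sequential keys
sort($items);
print_r($items);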
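
To see what the per-file step does without setting up a directory of files, here is a minimal sketch that runs the example query "//object/@data" against an inline HTML string. The sample markup and the file names in it are hypothetical, made up for illustration; `loadHTML` stands in for `loadHTMLFile` so the snippet is self-contained.

```php
<?php
// Hypothetical sample document, not taken from the gist.
$html = '<html><body>'
      . '<object data="movie.swf"></object>'
      . '<object data="chart.svg"></object>'
      . '</body></html>';

// Collect parse warnings internally, as the script does.
libxml_use_internal_errors(true);

$dom = new DOMDocument;
$dom->loadHTML($html); // in-memory counterpart of loadHTMLFile()

$xpath = new DOMXPath($dom);
$items = array();
foreach ($xpath->query('//object/@data') as $node) {
    // For an attribute node, textContent is the attribute's value.
    $items[] = $node->textContent;
}

sort($items);
print_r($items);
```

Because the query selects attribute nodes, each collected item is the value of a `data` attribute rather than the text inside an element.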