@mikemorris
Created May 10, 2012 15:55
PHP script to scrape links from HTML files
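<?php
// Recursively grep every .html file below the current directory for anchor tags.
// With -r and -o, each output line has the form "path/to/file.html:<a ...>text</a>".
$links = shell_exec('egrep -o -r "<a[^<]+?(\w*)</a>" --include="*.html" *');
$output = '';
// shell_exec() returns null when grep prints nothing, so cast it and skip empty lines.
foreach (preg_split("/\r?\n/", (string) $links, -1, PREG_SPLIT_NO_EMPTY) as $line) {
    // grep prefixes each match with the file name and a colon.
    $file = explode(':', $line, 2)[0];
    // Fall back to an empty string when the anchor has no double-quoted href or no text.
    $url = preg_match('/(?<=href=")[^"]+?(?=")/', $line, $url_matches) ? $url_matches[0] : '';
    $text = preg_match('/(?<=>)[^<]+?(?=<)/', $line, $text_matches) ? trim($text_matches[0]) : '';
    // Does the link open a new window?
    $new = preg_match('/target="_blank"/', $line) ? 'Y' : 'N';
    // Quote the link text so that commas inside it do not shift the CSV columns.
    $output .= $file . ',"' . str_replace('"', '""', $text) . '",' . $url . ',' . $new . "\n";
}
echo $output;
file_put_contents('links.csv', $output);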
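To make it easier to see what the loop pulls out of each grep line, here is a minimal sketch that applies the script's href, link-text, and target patterns to one hard-coded line in the `path:match` format that `grep -r -o` prints. The helper name `parse_link_line`, the sample path, and the sample URL are invented for illustration and are not part of the original script. The type declarations assume PHP 7 or later.

```php
<?php
// Hypothetical helper (not in the original script): turn one
// "file:<a ...>text</a>" line into [file, text, url, new-window flag].
function parse_link_line(string $line): array
{
    // Everything before the first colon is the file name added by grep.
    $file = explode(':', $line, 2)[0];
    // Value of the first double-quoted href attribute, or '' if there is none.
    $url = preg_match('/(?<=href=")[^"]+?(?=")/', $line, $m) ? $m[0] : '';
    // Text between the opening tag's ">" and the next "<", or '' if it is empty.
    $text = preg_match('/(?<=>)[^<]+?(?=<)/', $line, $m) ? trim($m[0]) : '';
    // 'Y' when the link opens a new window.
    $new = preg_match('/target="_blank"/', $line) ? 'Y' : 'N';
    return [$file, $text, $url, $new];
}

// Made-up sample line, shaped like the egrep output the script consumes.
$sample = 'docs/index.html:<a href="http://example.com/" target="_blank"> Example </a>';
echo implode(',', parse_link_line($sample)), "\n";
// prints: docs/index.html,Example,http://example.com/,Y
```

Keeping the parsing in a function like this would let the patterns be tried against sample lines without running grep over a whole directory tree.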