Check a list of URLs for content using PHP & CURL
<?php
// usage: cat urls.txt | php checklinks.php --use-curl
// initialize results arrays
$valid_links = array();
$invalid_links = array();
// check command line option whether to use curl (preferred method)
$use_curl = false;
if (isset($argv[1]))
{
    if ($argv[1] == "--use-curl")
    {
        $use_curl = true;
    }
}
// read a list of URLs from stdin
$stdin = fopen("php://stdin", "r");
// check each one
while (($url = fgets($stdin)) !== false)
{
    // strip the trailing newline (and "\r" on Windows) before checking
    $url = trim($url);
    if ($url === '')
    {
        continue; // skip blank lines
    }
    if ($use_curl)
    {
        $is_valid_url = check_link_curl($url);
    }
    else
    {
        $is_valid_url = check_link_fopen($url);
    }
    if ($is_valid_url)
    {
        print " valid $url\n";
        $valid_links[] = $url;
    }
    else
    {
        print " invalid $url\n";
        $invalid_links[] = $url;
    }
}
// report totals
print "valid links: " . count($valid_links) . "\n";
print "invalid links: " . count($invalid_links) . "\n";
// NOTE: this will not work unless php.ini has allow_url_fopen enabled
// You may also have trouble if you are behind a proxy
function check_link_fopen($url)
{
    $file_handle = fopen($url, 'r');
    if ($file_handle)
    {
        fclose($file_handle);
        return true;
    }
    return false;
}
// NOTE: this requires the cURL extension (libcurl) to be enabled in PHP
function check_link_curl($url)
{
    $curl = curl_init();
    $curl_options = array();
    $curl_options[CURLOPT_RETURNTRANSFER] = true; // return the response instead of printing it
    $curl_options[CURLOPT_URL] = $url; // set URL
    $curl_options[CURLOPT_NOBODY] = true; // do a HEAD request only
    $curl_options[CURLOPT_TIMEOUT] = 60; // 1 minute
    curl_setopt_array($curl, $curl_options);
    curl_exec($curl);
    $status = curl_getinfo($curl, CURLINFO_HTTP_CODE);
    curl_close($curl);
    if ($status == 200) // success; redirects (3xx) are not followed and count as invalid
    {
        return true;
    }
    return false;
}
?>
http://one-shore.com/
http://google.com/
http://yahoo.com/
Hi. Using this on Windows 10 with XAMPP and PHP 7.4, I had no end of trouble with fopen($url).
I swapped check_link_fopen($url) for:
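function get_http_response_code($url) {
    $headers = get_headers($url);
    if ($headers === false) {
        return 0; // request failed (bad host, timeout, ...)
    }
    // the status line looks like "HTTP/1.1 200 OK"; the code starts at offset 9
    return substr($headers[0], 9, 3);
}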
then compared the result:
$is_valid_url = get_http_response_code($url);
if ( 200 == $is_valid_url ){
    print " valid $url\r\n";
    $valid_links[] = $url;
} else {
    print " invalid $url [$is_valid_url]\r\n";
    $invalid_links[] = $url;
}
That got rid of the problems with PHP and fopen() against the URL.
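The substr() call above assumes a status line shaped exactly like "HTTP/1.1 200 OK". Here is a rough sketch of a more defensive way to pull the code out. The helper names `parse_status_code` and `last_status_code` are made up for illustration and are not part of the gist. As far as I know, get_headers() follows redirects by default and returns every response's headers in one array, so the second helper picks the last status line rather than the first.

```php
<?php
// Hypothetical helper (not in the original gist): extract the numeric
// status code from a status line such as "HTTP/1.1 200 OK".
// Returns 0 when the input is not a recognizable status line.
function parse_status_code($status_line)
{
    if (!is_string($status_line)) {
        return 0; // e.g. get_headers() returned false
    }
    if (preg_match('#^HTTP/\d+(?:\.\d+)?\s+(\d{3})\b#i', $status_line, $m)) {
        return (int) $m[1];
    }
    return 0;
}

// Hypothetical helper: given the array returned by get_headers($url),
// return the status code of the last response in the chain.
function last_status_code($headers)
{
    if (!is_array($headers)) {
        return 0;
    }
    $code = 0;
    foreach ($headers as $key => $header) {
        // status lines sit under integer keys and start with "HTTP/"
        if (is_int($key) && is_string($header) && stripos($header, 'HTTP/') === 0) {
            $code = parse_status_code($header);
        }
    }
    return $code;
}
```

With these, the comparison could become `if (200 === last_status_code(get_headers($url)))`, which would count a URL that redirects to a working page as valid. Whether that is what you want depends on what "valid" should mean for your list.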
Once I got past this, I still had URLs reported as invalid. The cause was Windows line endings.
So I stripped newlines and carriage returns out of the URL before running the check:
$url = str_replace(array("\n", "\r"), '', $url);
That's why the print statements above have \r\n at the end (to put the line ending back in the printout).
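For anyone who wants the cleanup in one place, here is a small sketch that uses trim() instead of str_replace(). It also drops blank lines, which would otherwise be checked as empty URLs. The function name `clean_url_lines` is hypothetical and not part of the gist.

```php
<?php
// Hypothetical helper (not in the original gist): turn raw input lines
// into a list of URLs with line endings and surrounding whitespace removed.
function clean_url_lines(array $lines)
{
    $urls = array();
    foreach ($lines as $line) {
        // trim() removes whitespace from both ends, including "\r" and "\n"
        $url = trim($line);
        if ($url === '') {
            continue; // skip blank lines
        }
        $urls[] = $url;
    }
    return $urls;
}

// Possible usage, assuming url.txt sits next to the script:
// $urls = clean_url_lines(file('url.txt'));
```

Because trim() only touches the ends of the string, it will not repair a URL with a stray newline in the middle, but for a one-URL-per-line file that should not come up.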
HTH
The final code is below:
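`<?php
// usage: php checklinks.php (reads URLs from url.txt in the current directory)
if ( ini_get( 'allow_url_fopen' ) ) {
    echo "allow_url_fopen is ENABLED.\n";
} else {
    echo "allow_url_fopen is DISABLED.\n";
    exit (1);
}
// initialize results arrays
$valid_links = array();
$invalid_links = array();
// read a list of URLs from url.txt
$url_file = fopen('url.txt', "r");
// check each one
while (($url = fgets($url_file)) !== false)
{
    // strip Windows and Unix line endings before checking
    $url = str_replace(array("\n", "\r"), '', $url);
    $is_valid_url = get_http_response_code($url);
    if ( 200 == $is_valid_url ){
        print " valid $url\r\n";
        $valid_links[] = $url;
    } else {
        print " invalid $url [$is_valid_url]\r\n";
        $invalid_links[] = $url;
    }
}
fclose($url_file);
// report totals
print "valid links: " . count($valid_links) . "\n";
print "invalid links: " . count($invalid_links) . "\n";
function get_http_response_code($url) {
    $headers = get_headers($url);
    if ($headers === false) {
        return 0; // request failed (bad host, timeout, ...)
    }
    // the status line looks like "HTTP/1.1 200 OK"; the code starts at offset 9
    return substr($headers[0], 9, 3);
}
?>`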