April 4 2018 - Scrape HTML Into an RSS Feed with PHP
<?php
$html = @file_get_contents('http://hedislimane.com/diary/');
if (!$html) {
    // The source page could not be fetched, so report a server error
    http_response_code(500);
    exit(1);
}
$dom = new DOMDocument;
$dom->preserveWhiteSpace = false; // set before loading so it can take effect
@$dom->loadHTML($html); // `@` suppresses the warnings libxml raises on sloppy HTML
$images = $dom->getElementsByTagName('img');
$rss_feed = '<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">';
$rss_feed .= '<channel>';
$rss_feed .= '<title>Hedi Slimane\'s Diary</title>';
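// Escape the self URL so a query string containing `&` cannot break the XML
$rss_feed .= '<atom:link href="'.htmlspecialchars('http://'.$_SERVER['HTTP_HOST'].$_SERVER['REQUEST_URI'], ENT_XML1 | ENT_QUOTES, 'UTF-8').'" rel="self" type="application/rss+xml" />';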
$rss_feed .= '<link>http://hedislimane.com/diary</link>';
$rss_feed .= '<description>hedislimane.com diary</description>';
$i = 0;
foreach ($images as $image) {
    // Keep only the diary photos, capped at 20 items
    if (strpos($image->getAttribute('src'),'diary/admin/images') !== false && $i < 20) {
        $rss_feed .= '<item>';
        $rss_feed .= '<title>Photo by Hedi Slimane</title>';
        // $rss_feed .= '<pubDate>'.date("r", $user_media->created_time).'</pubDate>';
        $rss_feed .= '<dc:creator><![CDATA[ Hedi Slimane ]]></dc:creator>';
        $rss_feed .= '<description><![CDATA[<img src="'.str_replace(' http', 'http', $image->getAttribute('src')).'" />]]></description>';
        // Strip stray spaces, then escape so characters such as `&` stay valid XML
        $rss_feed .= '<guid>'.htmlspecialchars(str_replace(' ', '', $image->getAttribute('src')), ENT_XML1 | ENT_QUOTES, 'UTF-8').'</guid>';
        $rss_feed .= '</item>';
        $i++;
    }
}
$rss_feed .= '</channel>';
$rss_feed .= '</rss>';
header('Content-Type: text/xml; charset=utf-8');
echo $rss_feed;

Scrape HTML Content Into an RSS Feed

If you still like RSS (as I do), you might want an RSS feed for a website that no longer offers one. You can point your RSS reader at a PHP file on your server, and the reader will update whenever new content is posted. Yes folks, we are screen scraping with PHP today!
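
To make the "PHP file on your server" idea a bit more concrete, here is a minimal sketch of the feed-building half, assuming PHP's bundled XMLWriter extension is enabled. The `build_image_feed()` helper and the example.com URLs are made up for illustration and are not part of the script above. XMLWriter escapes element text itself, so an `&` in an image URL should not break the feed.

```php
<?php
// Hypothetical helper (not part of the original script): builds a small
// RSS 2.0 document with one <item> per image URL.
function build_image_feed(string $title, string $link, array $imageUrls): string
{
    $xml = new XMLWriter();
    $xml->openMemory();
    $xml->startDocument('1.0', 'UTF-8');

    $xml->startElement('rss');
    $xml->writeAttribute('version', '2.0');
    $xml->startElement('channel');
    $xml->writeElement('title', $title);       // text is XML-escaped by XMLWriter
    $xml->writeElement('link', $link);
    $xml->writeElement('description', $title);

    foreach ($imageUrls as $url) {
        $xml->startElement('item');
        $xml->writeElement('title', 'Photo');
        $xml->startElement('description');
        // The reader renders this HTML, so escape the URL for the attribute
        $xml->writeCdata('<img src="' . htmlspecialchars($url, ENT_QUOTES) . '" />');
        $xml->endElement(); // description
        $xml->writeElement('guid', $url);
        $xml->endElement(); // item
    }

    $xml->endElement(); // channel
    $xml->endElement(); // rss
    $xml->endDocument();

    return $xml->outputMemory();
}

// Example usage with placeholder URLs (assumptions, not real image paths)
$feed = build_image_feed('Example diary', 'http://example.com/diary', [
    'http://example.com/images/a.jpg',
    'http://example.com/images/b.jpg?w=800&h=600',
]);
echo $feed;
```

When serving this from a web server, you would send a `Content-Type` header before echoing the feed, as the script above does.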

One of my favourite photographers is Hedi Slimane. I really like his style and his exclusive use of black and white. However, his blog doesn't support RSS. Since I really want to know when new content is posted, I wrote a little PHP script that checks his site and returns an RSS feed from my server to my RSS reader.
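
The "check his site" step comes down to pulling matching `<img>` tags out of messy HTML. Below is a rough sketch of that step on its own, run against a made-up HTML string so it works offline. The `extract_image_urls()` helper and the sample markup are assumptions for illustration. It uses `libxml_use_internal_errors()` as an alternative to the `@` operator for silencing parser warnings.

```php
<?php
// Hypothetical helper (not part of the original script): returns the src of
// every <img> whose URL contains $needle, up to $limit results.
function extract_image_urls(string $html, string $needle, int $limit = 20): array
{
    // Collect libxml parse warnings internally instead of emitting them
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous); // restore the earlier setting

    $urls = [];
    foreach ($dom->getElementsByTagName('img') as $img) {
        if (count($urls) >= $limit) {
            break;
        }
        // trim() handles src values that carry stray leading whitespace
        $src = trim($img->getAttribute('src'));
        if ($src !== '' && strpos($src, $needle) !== false) {
            $urls[] = $src;
        }
    }

    return $urls;
}

// Made-up sample markup: a padded src, an unrelated image, an unclosed <p>
$sample = '<html><body>'
    . '<img src=" http://example.com/diary/admin/images/a.jpg">'
    . '<img src="/logo.png">'
    . '<p>unclosed<img src="http://example.com/diary/admin/images/b.jpg">'
    . '</body></html>';

$urls = extract_image_urls($sample, 'diary/admin/images');
print_r($urls);
```

Keeping the extraction in its own function makes it easy to try against a saved copy of a page before pointing it at the live site.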

This is just one example of what is possible in the vast and scary world of screen scraping, so I thought I would post the script here.

Cheers!
