@vjeux
Created February 2, 2012 17:23
Scrap = require '../src/scrap.js'
match = require 'match'

# Create a new Scrap with the base for all the following requests
wikipedia = new Scrap
  path: 'http://en.wikipedia.org/'

# Download the front page as a string
wikipedia.get '/', (page) ->
  # Get all the wiki page links using a regex
  urls = match.all(page, '<a href="(/wiki/[^:"]+)"')
  # Download every matched page and print its title and first paragraph
  wikipedia.get urls, (page, url) ->
    console.log
      url: url
      title: match(page, '<h1[^>]+>(.*?)<\/h1>')
      excerpt: match(page, '<p>(.*?)<\/p>').replace(/<[^>]+>/g, '')
<?php
require '../../phpFileDownload/filedownload.php';
require '../../phpMatch/match.php';

$fd = new FileDownload();

// Download the front page as a string
$index = $fd->get('http://en.wikipedia.org/');

// Get all the wiki page links using a regex, then download each page
// and print its URL, title, and first paragraph with tags stripped
foreach (match_all($index, '<a href="(/wiki/[^:"]+)"') as $url) {
    $page = $fd->get('http://en.wikipedia.org' . $url);
    echo "<strong>" . $url . "</strong><br />\n";
    echo match($page, '<h1[^>]+>(.*?)</h1>') . "<br />\n";
    echo preg_replace('/<[^>]+>/', '', match($page, '<p>(.*?)</p>')) . "<br />\n";
}
?>
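Both snippets lean on the same two regex helpers: a `match` that returns the first capture group and a `match_all` that returns every occurrence of it. A minimal sketch of those helpers in plain Node.js, run against a tiny static page instead of a live download — the helper semantics here are an assumption inferred from how the snippets use them, not the actual `match`/`phpMatch` implementations:

```javascript
// match: return the first capture group of the first match, or null.
function match(text, pattern) {
  const m = text.match(new RegExp(pattern));
  return m ? m[1] : null;
}

// matchAll: return the first capture group of every match.
function matchAll(text, pattern) {
  const results = [];
  const re = new RegExp(pattern, 'g');
  let m;
  while ((m = re.exec(text)) !== null) results.push(m[1]);
  return results;
}

// A tiny stand-in for a downloaded Wikipedia page.
const page =
  '<a href="/wiki/Cat">Cat</a> <a href="/wiki/Help:FAQ">skip</a>' +
  '<h1 class="firstHeading">Cat</h1>' +
  '<p>The <b>cat</b> is a small mammal.</p>';

// Same extraction steps as the snippets above.
const urls = matchAll(page, '<a href="(/wiki/[^:"]+)"');
const title = match(page, '<h1[^>]+>(.*?)</h1>');
const excerpt = match(page, '<p>(.*?)</p>').replace(/<[^>]+>/g, '');

console.log(urls);    // ["/wiki/Cat"] — "/wiki/Help:FAQ" is rejected by [^:"]
console.log(title);   // "Cat"
console.log(excerpt); // "The cat is a small mammal."
```

Note how the `[^:"]` character class in the link pattern does double duty: it both terminates the capture at the closing quote and skips namespaced pages like `Help:FAQ` or `File:` links, since a `:` in the path makes the match fail.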