@nemo
Last active March 13, 2017 22:21
Techcrunch Article Title Scraper
const lib = require('lib');
const async = require('async');
const _ = require('lodash');

// Front page plus /page/2 through /page/99 (_.range excludes its end value).
const pages = [
  'https://techcrunch.com/',
  ...(_.range(2, 100).map((i) => "https://techcrunch.com/page/" + i))
];

// Scrape the pages with at most 10 requests in flight at a time.
async.mapLimit(pages, 10, (pageUrl, callback) => {
  lib.nemo.scrape({
    url: pageUrl,
    query: ".post-title a",
    userAgent: "nemo/scrape v0.1"
  }, callback);
}, (err, results) => {
  if (err) return console.error("failure", err);
  // Pull query_value out of each page's result and merge into one flat list.
  const names = _.flatten(_.map(results, 'query_value'));
  console.log(names);
});

nemo commented Dec 6, 2016

A very easy way to scrape multiple pages of TechCrunch in seconds:

[Screenshot, 2016-12-06, 2:07:58 AM]
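
For anyone curious what the concurrency limit in the gist buys you, here is a rough, dependency-free sketch of the same pattern. It is not how `async.mapLimit` or `lib.nemo.scrape` are implemented. `fakeScrape`, `buildPages`, and this local `mapLimit` are hypothetical stand-ins written for illustration, and the stub returns made-up titles instead of hitting the network. It assumes each scrape result looks like `{ query_value: [...] }`, which is what the flattening step in the gist suggests.

```javascript
// Hypothetical stand-in for lib.nemo.scrape: resolves with an object shaped
// like the gist's results ({ query_value: [...] }) without any network call.
function fakeScrape(pageUrl) {
  return new Promise((resolve) => {
    setTimeout(() => resolve({ query_value: ['title from ' + pageUrl] }), 5);
  });
}

// Build the same URL list as the gist: the front page, then /page/2 up to
// /page/<lastPage>.
function buildPages(lastPage) {
  const pages = ['https://techcrunch.com/'];
  for (let i = 2; i <= lastPage; i++) {
    pages.push('https://techcrunch.com/page/' + i);
  }
  return pages;
}

// Sketch of a concurrency-limited map: run `worker` over `items` with at most
// `limit` calls in flight, and return the results in input order.
async function mapLimit(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0; // index of the next item nobody has claimed yet

  // Each runner repeatedly claims the next unclaimed item until none remain.
  async function runner() {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index]);
    }
  }

  const runners = [];
  for (let i = 0; i < Math.min(limit, items.length); i++) {
    runners.push(runner());
  }
  await Promise.all(runners); // rejects if any worker call rejects
  return results;
}

async function main() {
  const results = await mapLimit(buildPages(99), 10, fakeScrape);
  // Same idea as _.flatten(_.map(results, 'query_value')) in the gist.
  const names = results.flatMap((r) => r.query_value);
  console.log(names.length + ' titles collected');
}

main().catch((err) => console.error('failure', err));
```

With a limit of 10, the 99 pages are fetched in roughly ten waves rather than 99 sequential requests or 99 simultaneous ones, which is why the whole run can finish in seconds without flooding the site.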