
@milo2012
Created July 28, 2016 11:09
Simple Node.js script that crawls a website and prints every URL it finds on the same parent domain.
var Crawler = require("js-crawler");
var url = require("url");

// Require a start URL as the first command-line argument.
if (process.argv.length <= 2) {
  console.log("Usage: " + __filename + " http://www.yahoo.com");
  process.exit(-1);
}

// Configure the crawler: throttle requests and follow links deeply.
var crawler = new Crawler().configure({
  maxRequestsPerSecond: 10,
  maxConcurrentRequests: 10,
  depth: 99
});

var url1 = process.argv[2];
var hostname = url.parse(url1).hostname;

// Drop the leftmost label so that, e.g., "www.yahoo.com" becomes "yahoo.com";
// crawled URLs are then matched against this parent domain.
var parts = hostname.split(".");
parts.shift();
var upperleveldomain = parts.join(".");

crawler.crawl({
  url: url1,
  success: function(page) {
    // Print only URLs whose hostname belongs to the same parent domain.
    if (url.parse(page.url).hostname.indexOf(upperleveldomain) > -1) {
      console.log(page.url);
    }
  }
});
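
To try it out (a minimal sketch; the filename crawl.js is only an assumed name for where you saved the script), install the js-crawler dependency and pass the start URL as the single argument:

npm install js-crawler
node crawl.js http://www.yahoo.com

With depth set to 99 the crawler will keep following links until it runs out of same-domain URLs, so the rate limits (maxRequestsPerSecond and maxConcurrentRequests) are what keep the target site from being hammered.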