@aresnick
Last active April 21, 2024 20:06

Scraping a dynamic web page with CasperJS

This is a small sketch for loading and saving a dynamic web page that you'd like to scrape. It is useful when tools like wget fall short: they grab only the HTML and JS the server sends, not the additional parts of the page that this JS then loads or synthesizes in the browser.


To run it, install the development version of CasperJS (via brew install casperjs --devel), save the script as test.js, and run it via casperjs test.js --ssl-protocol=any. If the dynamic page needs cookies to load properly (e.g. if you're scraping content that relies on being logged in), invoke it with casperjs test.js --ssl-protocol=any --cookies-file=cookies.txt instead.

var target = "https://thenounproject.com/term/programming/176241/"; // Our target URL
var selectorToWaitFor = ".hero-icon.imgLoaded";
var casper = require('casper').create({
    verbose: true,
    logLevel: "info",
    pageSettings: {
        webSecurityEnabled: false, // (http://casperjs.readthedocs.org/en/latest/faq.html#i-m-having-hard-times-downloading-files-using-download)
        userAgent: "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.97 Safari/537.11" // Spoof being Chrome on a Mac (https://msdn.microsoft.com/en-us/library/ms537503(v=vs.85).aspx)
    }
});
casper.start(target);
var scrape = function() {
    console.log("Saving…");
    var html = String(casper.getHTML()); // grab our HTML (http://casperjs.readthedocs.org/en/latest/modules/casper.html#gethtml)
    // create a sanitized filename by removing everything except ASCII letters; note that the range A-z would also let [ \ ] ^ _ ` through (https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions)
    var filename = target.replace(/[^A-Za-z]/g, '') + ".html";
    require('fs').write(filename, html, 'w'); // and save it to a file (https://docs.nodejitsu.com/articles/file-system/how-to-write-files-in-nodejs)
    console.log("…wrote HTML to", filename); // log the full name, including the .html extension
};
casper.waitForSelector(selectorToWaitFor, scrape);
// casper.wait(1000, scrape); // You can also just wait a second before scraping if that's easier than looking for a given selector
casper.run(); // and start casper
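
The filename step in the script strips everything but letters from the target URL. Below is a minimal standalone sketch of that step in plain JavaScript, runnable without CasperJS. The helper names sanitizeFilename and looseSanitize are hypothetical and not part of the original script. The sketch also illustrates one detail worth knowing when writing such a pattern: the character range A-z is wider than A-Za-z, because the ASCII characters [ \ ] ^ _ and ` sit between Z and a.

```javascript
// Hypothetical helper (not in the original script): reduce a URL to
// ASCII letters only and append the .html extension.
function sanitizeFilename(url) {
    return url.replace(/[^A-Za-z]/g, '') + ".html";
}

// Hypothetical helper for comparison: the range A-z spans ASCII 65-122,
// so it also keeps [ \ ] ^ _ and ` in addition to letters.
function looseSanitize(url) {
    return url.replace(/[^A-z]/g, '');
}

var target = "https://thenounproject.com/term/programming/176241/";
console.log(sanitizeFilename(target));      // letters only, plus ".html"
console.log(looseSanitize("a_b\\c"));       // underscore and backslash survive
console.log(sanitizeFilename("a_b\\c"));    // underscore and backslash removed
```

Because digits are dropped as well, two URLs that differ only in a numeric ID map to the same filename; if that matters, widening the class to A-Za-z0-9 is one possible adjustment.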