
@hijonathan
Last active September 16, 2019 11:43
Fun in-browser web scraper.

What is this?

This little script lets you define the data you want from a web page (using CSS selectors), then crawl the site until you've collected it all.

It's based on the way Kimono works, but it's much simpler and has no limit on the number of results you can get. It also runs with the auth tokens already in your browser, so it's just as secure as your browser session (which you should still be suspicious of).

How do I use it?

Paste the script below into your browser console and run it. If you use Chrome, I highly recommend saving it as a snippet for easy reuse. To start scraping a site, create a Scraper instance with your desired options:

var scraper = new Scraper({
  container: 'li.person',  // The highest common sibling you want to grab.
  targets: {  // The items you want to grab within each container.
    first_name: {  // A name for the data you're trying to scrape.
      selector: '.profile span:first-child',  // Query selector to the element you want to get data from.
      parser: function(el) { return el.innerText }  // A function you want to run on the found element.
    }
  },
  next: '.pagination.next-page'  // Query selector to the pagination link, if applicable.
})
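Since each parser is a plain function of the matched element, you can write them defensively so a missing match doesn't throw. A sketch of two such parsers (the field names here are hypothetical, not part of the gist):

```javascript
// Hypothetical parser functions: each receives the element matched by
// its selector (or null if nothing matched) and returns the value to store.
var parsers = {
  text: function(el) { return el ? el.innerText.trim() : null; },
  href: function(el) { return el ? el.getAttribute('href') : null; }
};
```

Guarding against `null` matters because `el.querySelector` returns null when a container is missing the target, and an unguarded parser would abort the whole crawl.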

Once that's set up, just start the scraping.

scraper.start();

At any point, you can request the current data set at the results property, e.g. scraper.results. Hint: to copy that to your clipboard in Chrome, use copy(scraper.results).
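Since scraper.results is just an array of flat objects, it's easy to flatten into CSV before copying it out. A minimal sketch, assuming every result has the same keys (resultsToCsv is a hypothetical helper, not part of the gist):

```javascript
// Hypothetical helper: flatten an array of flat result objects
// into a CSV string, using the first object's keys as the header.
function resultsToCsv(results) {
  if (!results.length) return '';
  var keys = Object.keys(results[0]);
  var escape = function(v) {
    // Quote every value and double embedded quotes, per CSV convention.
    return '"' + String(v).replace(/"/g, '""') + '"';
  };
  var rows = results.map(function(row) {
    return keys.map(function(k) { return escape(row[k]); }).join(',');
  });
  return keys.map(escape).join(',') + '\n' + rows.join('\n');
}
```

In the Chrome console you could then run copy(resultsToCsv(scraper.results)) to get a paste-ready spreadsheet.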

That's it! You can create multiple scraper instances and run them all simultaneously. If you have interesting uses for this, I'd love to hear about them :D

var Scraper = function(options) {
  this.results = [];
  this.options = options;
};

Scraper.prototype.start = function() {
  this.scrape(document, this.options);
};

Scraper.prototype.scrape = function(doc, options) {
  var self = this, next;
  // Scrape the current page.
  this.results = this.results.concat(this.parse(doc.querySelectorAll(options.container), options.targets));
  if (options.prev) {
    // Remove the iframe holding the previous page.
    options.prev.remove();
  }
  // Load the next page, if there is one.
  if (options.next && (next = doc.body.querySelector(options.next))) {
    this.load(next.href, function(d) {
      options.prev = this;
      self.scrape(d, options);
    });
  } else {
    console.log('Scraping complete.');
  }
};

Scraper.prototype.load = function(url, callback) {
  var f = document.createElement('iframe');
  f.src = url;
  f.onload = function() {
    // Give the page a moment to finish rendering before scraping it.
    window.setTimeout(callback.bind(this, f.contentDocument), 100);
  };
  document.body.appendChild(f);
  return f;
};

Scraper.prototype.parse = function(containers, targets) {
  // Build one result object per container element.
  return Array.prototype.slice.call(containers).map(function(el) {
    var res = {};
    // Run each target's parser on its matched element.
    Object.keys(targets).forEach(function(key) {
      var target = targets[key];
      res[key] = target.parser(el.querySelector(target.selector));
    });
    return res;
  });
};