@justinabrahms
Last active August 29, 2015 13:56
anti-scraper

So I was writing an article on screen scraping, and one of the things that came up was "How do you defend against screen scraping?" I think this is actually an interesting question, and it gave me an idea for a side project that maybe someone else has time for.

The idea is that to prevent screen scraping, the page being scraped must be mutated so that it breaks the scraper. To do that, you could rewrite the selectors shared by your CSS and HTML (for instance, changing every id of "signupButton" to "sarah-goldfarb"), or go further and change the structure of the DOM itself.
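To make the renaming idea concrete, here is a minimal sketch in TypeScript. The names `opaqueName` and `mutateIds` are made up for illustration, and the regex matching is an assumption made for brevity: a real tool would want a proper HTML and CSS parser. It only handles double-quoted `id` attributes and `#id` selectors, and it could mangle a hex color that happens to share a name with an id (an id of "fade" and a color of `#fade`, say).

```typescript
// Hypothetical helper: derive an opaque replacement name from the original id
// and a seed, so each deploy (or each request) can use different names.
function opaqueName(original: string, seed: number): string {
  let h = seed >>> 0;
  for (let i = 0; i < original.length; i++) {
    h = (h * 31 + original.charCodeAt(i)) >>> 0; // simple rolling hash, not cryptographic
  }
  return "x" + h.toString(36); // leading letter keeps the id usable in a plain #selector
}

interface MutatedPage {
  html: string;
  css: string;
  mapping: Record<string, string>; // original id -> replacement id
}

// Rename every id="..." in the markup, then rewrite the matching #id selectors.
function mutateIds(html: string, css: string, seed: number): MutatedPage {
  const mapping: Record<string, string> = {};
  const has = (id: string): boolean =>
    Object.prototype.hasOwnProperty.call(mapping, id);

  // Require whitespace before "id" so attributes like data-id are left alone.
  const newHtml = html.replace(
    /(\s)id="([^"]+)"/g,
    (_m: string, ws: string, id: string) => {
      if (!has(id)) mapping[id] = opaqueName(id, seed);
      return `${ws}id="${mapping[id]}"`;
    }
  );

  // Only rewrite #names that were actually seen as ids in the HTML,
  // which leaves most hex colors such as #fff untouched.
  const newCss = css.replace(/#([A-Za-z_][\w-]*)/g, (m: string, id: string) =>
    has(id) ? `#${mapping[id]}` : m
  );

  return { html: newHtml, css: newCss, mapping };
}
```

Calling `mutateIds(pageHtml, pageCss, someSeed)` with a different seed yields a different set of ids, which is what would trip up a scraper that hard-codes `#signupButton`. Any JavaScript that looks elements up by id would need the same mapping applied.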

A small Node proxy that does this on streams, rewriting the response as it passes through, would be particularly cool.
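The fiddly part of doing this on a stream is that a name like "signupButton" can be split across two chunks. Below is a rough sketch of one way to cope with that: hold back the tail of each chunk until enough data has arrived to know whether a name starts there. `StreamingReplacer` is a hypothetical class, not part of Node, and it assumes no name is a prefix of another. It is kept free of Node APIs so it runs anywhere, but `push` and `flush` line up with the `transform` and `flush` hooks of a Node `Transform` stream if you wanted to wrap it.

```typescript
// Hypothetical chunk-safe replacer: feed it chunks, get rewritten text back.
class StreamingReplacer {
  private carry = ""; // unprocessed tail from the previous chunk
  private readonly maxLen: number;

  constructor(private readonly replacements: Record<string, string>) {
    let max = 0;
    for (const key in replacements) max = Math.max(max, key.length);
    this.maxLen = max;
  }

  // Process one chunk; returns the text that is safe to emit so far.
  push(chunk: string): string {
    return this.process(this.carry + chunk, false);
  }

  // Call once at end of stream to emit whatever was held back.
  flush(): string {
    return this.process(this.carry, true);
  }

  private process(text: string, final: boolean): string {
    let out = "";
    let i = 0;
    while (i < text.length) {
      const key = this.matchAt(text, i);
      if (key !== null) {
        out += this.replacements[key];
        i += key.length;
        continue;
      }
      // Fewer than maxLen characters left: a name might start here and
      // finish in the next chunk, so stop and carry the rest over.
      if (!final && text.length - i < this.maxLen) break;
      out += text.charAt(i);
      i++;
    }
    this.carry = text.slice(i);
    return out;
  }

  // Returns the name that starts at position i, or null if none does.
  private matchAt(text: string, i: number): string | null {
    for (const key in this.replacements) {
      if (text.slice(i, i + key.length) === key) return key;
    }
    return null;
  }
}
```

For example, pushing `<a id="signup` and then `Button">Join</a>` through a replacer built with `{ signupButton: "sarah-goldfarb" }`, followed by `flush()`, produces `<a id="sarah-goldfarb">Join</a>` once the emitted pieces are concatenated. This replaces the name wherever it appears, including in visible text, so a real proxy would want to be pickier about context.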

Things you're likely to learn:

  1. More about CSS selector precedence. Which selectors can you easily mutate? Which are harder? Are there examples of selectors which you can't mutate?
  2. What can XPath do? XPath is a mechanism for querying tree structures (notably XML). If you were going to alter the page such that a scraper's XPath expressions no longer matched, what would you need to do?
  3. Can you substitute entire HTML tags for another set while keeping the same visual representation? (e.g. subbing out <div> for <span style="display:block"> or similar)
  4. You'll probably learn about the browser's static resource caching, as you'll not want stale copies of your auto-generated css.
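As a starting point for the tag-substitution question, here is a small sketch that swaps `<div>` for `<span style="display:block">`. The function name `swapDivForSpan` is hypothetical and the regex approach is an assumption for brevity; it expects double-quoted attributes and would be fooled by a literal `<div` inside a script or comment. It also only preserves the look if your stylesheets don't target `div` by tag name, and a span is not formally allowed to contain block elements like `<p>`, even though browsers generally render it anyway.

```typescript
// Hypothetical sketch: replace every <div> with a block-level <span>.
function swapDivForSpan(html: string): string {
  const styleRe = /\sstyle="([^"]*)"/i;
  return html
    // Opening tags: keep the existing attributes, add display:block.
    .replace(/<div(\s[^>]*)?>/gi, (_m: string, attrs: string | undefined) => {
      const rest = attrs || "";
      if (styleRe.test(rest)) {
        // Put display:block first so a display value already in the
        // inline style (e.g. display:flex) still wins.
        return (
          "<span" +
          rest.replace(styleRe, (_s: string, css: string) => ` style="display:block;${css}"`) +
          ">"
        );
      }
      return `<span style="display:block"${rest}>`;
    })
    // Closing tags.
    .replace(/<\/div\s*>/gi, "</span>");
}
```

A scraper using an XPath such as `//div[@class="price"]` would stop matching after this pass, while a rule like `.price { ... }` keeps working. That difference between tag-based and class-based matching is a good way into questions 1 and 2 above.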

I'm not sure if this is actually a useful thing to have around. As the article states, I think screen scraping is A-OK. That said, you're bound to learn something fun and interesting if you decide to make this. And, really, isn't that the point? :)