So I was writing an article on screen scraping, and one of the questions that came up was "How do you mitigate screen scraping?" I think this is actually an interesting question, and it brought up the idea of a side project that maybe someone else has time for.
The idea is that to prevent screen scraping, the page being scraped must be mutated so as to break the scraper. To do that, you could alter the selectors in the CSS resources and HTML (for instance, changing all ids of "signupButton" to "sarah-goldfarb") or change the structure of the page's DOM.
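To make the id-renaming idea concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the helper names (`mutateName`, `mutateHtmlIds`, `mutateCssIds`) are hypothetical, the salt is something you'd rotate per deploy, and the regexes are naive stand-ins for a real HTML/CSS parser. The one property that matters is that the HTML and the CSS get rewritten with the same mapping.

```javascript
const { createHash } = require('node:crypto');

// Hypothetical helper: derive a replacement name from the original id and a
// salt, so the same id always maps to the same mutated name for a given salt.
function mutateName(name, salt) {
  const digest = createHash('sha256').update(salt + ':' + name).digest('hex');
  // Prefix with a letter so the result is always a valid CSS identifier.
  return 'x' + digest.slice(0, 8);
}

// Naive assumption: ids are written as id="..." with double quotes.
function mutateHtmlIds(html, salt) {
  return html.replace(/(\s)id="([^"]+)"/g, (match, space, id) =>
    `${space}id="${mutateName(id, salt)}"`);
}

// Only rewrite "#name" in the selector part of each rule (the text before a
// "{...}" block), so hex colors such as #fff inside declarations are untouched.
function mutateCssIds(css, salt) {
  return css.replace(/([^{}]*)(\{[^{}]*\})/g, (match, selector, body) =>
    selector.replace(/#([A-Za-z_][\w-]*)/g, (m, id) => '#' + mutateName(id, salt)) + body);
}
```

Any JavaScript that looks elements up by id would need the same mapping applied, which this sketch doesn't attempt.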
A small Node proxy that does this around streams would be particularly cool.
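Here's a rough sketch of what the streaming part might look like, using Node's built-in `Transform` stream. The names `rewriteIds` and `createMutatingStream` are made up for this example, and the mapping is assumed to be a `Map` from original id to replacement id. The only subtle bit is that a tag can be split across two chunks, so the sketch holds back everything after the last `>` until more data arrives.

```javascript
const { Transform } = require('node:stream');
const { StringDecoder } = require('node:string_decoder');

// Hypothetical rewrite step: swap known ids, leave unknown ones alone.
function rewriteIds(html, mapping) {
  return html.replace(/(\s)id="([^"]+)"/g, (match, space, id) =>
    mapping.has(id) ? `${space}id="${mapping.get(id)}"` : match);
}

function createMutatingStream(mapping) {
  // StringDecoder keeps multi-byte UTF-8 characters intact across chunks.
  const decoder = new StringDecoder('utf8');
  let pending = '';
  return new Transform({
    transform(chunk, encoding, callback) {
      pending += typeof chunk === 'string' ? chunk : decoder.write(chunk);
      // Rewrite only up to the last complete tag; keep the tail for later.
      const cut = pending.lastIndexOf('>') + 1;
      const ready = pending.slice(0, cut);
      pending = pending.slice(cut);
      if (ready) this.push(rewriteIds(ready, mapping));
      callback();
    },
    flush(callback) {
      // End of input: emit whatever is still buffered.
      pending += decoder.end();
      if (pending) this.push(rewriteIds(pending, mapping));
      callback();
    },
  });
}
```

In a proxy you'd pipe the upstream response through this stream on its way to the client, and drop the upstream `Content-Length` header since the body length changes. A `>` inside an attribute value can still fool the chunk splitting, so treat this as a starting point rather than a finished thing.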
Things you're likely to learn:
- More about CSS selector precedence. Which selectors can you easily mutate? Which are harder? Are there examples of selectors which you can't mutate?
- What can XPath do? XPath is a mechanism for querying tree structures (notably XML). If you were going to alter the page so that a scraper's existing XPath queries no longer match, what would you need to do?
- Can you substitute entire HTML tags for another set while keeping the same visual representation? (e.g. subbing out <div> for <span style="display:block"> or similar)
- You'll probably learn about the browser's static resource caching, as you won't want stale copies of your auto-generated CSS.
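For the tag substitution question in the list above, a first experiment could be as small as the sketch below. `swapDivForSpan` is a hypothetical name, and the regex approach assumes well-formed markup with double-quoted attributes; a real version would want a proper HTML parser.

```javascript
// Swap every <div> for a <span> that is forced to render as a block.
function swapDivForSpan(html) {
  const opened = html.replace(/<div(\s[^>]*)?>/gi, (match, attrs = '') => {
    if (/\sstyle="/i.test(attrs)) {
      // Merge into an existing inline style. The element's own declarations
      // come after ours, so an explicit display value there still wins.
      return '<span' + attrs.replace(/(\sstyle=")/i, '$1display:block;') + '>';
    }
    return '<span style="display:block"' + attrs + '>';
  });
  // Closing tags carry no attributes, so a plain swap is enough.
  return opened.replace(/<\/div\s*>/gi, '</span>');
}
```

This also breaks any of your own CSS that targets `div` by tag name, and nesting block-level elements inside a span isn't valid HTML even though browsers will render it. Both are good examples of the things you'd run into while building this.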
I'm not sure this is actually a useful thing to have around. As the article states, I think screen scraping is A-OK. That said, you're bound to learn something fun and interesting if you decide to make this. And really, isn't that the point? :)