@necenzurat
Last active April 19, 2016 12:56
eMAG web crawler in PhantomJS + pjscrape.js
// TODO: save the scraped items to a database
pjs.addSuite({
    title: 'emag fucker',
    // start page of the crawl
    url: 'http://www.emag.ro',
    // selector for the links to follow from each scraped page
    moreUrls: 'a',
    // depth limit for links found via moreUrls
    maxDepth: 0,
    // runs inside each loaded page and returns one item of product data
    scraper: function() {
        return {
            name: $('h2').text(),
            mpn: $('span[itemprop="identifier"]').text(),
            price: $('.price-over').attr("content"),
            brand: $('span[itemprop="brand"]').attr("content"),
            category: $('span[itemprop="category"]').attr("content"),
            img: $('img[itemprop="image"]').attr("src"),
            desc: $('#box-specificatii-produs').html()
        };
    }
});
pjs.config({
    timeoutInterval: 5000,
    // options: 'stdout', 'file' (set in config.logFile) or 'none'
    log: 'stdout',
    // options: 'json' or 'csv'
    format: 'json',
    // options: 'stdout' or 'file' (set in config.outFile)
    writer: 'file',
    outFile: 'scrape_output.json'
});
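
The "save to db" TODO will probably need a cleanup pass first. With `moreUrls: 'a'` the crawler can reach pages that are not product pages, where the selectors match nothing: `.text()` then returns an empty string and `.attr()` returns `undefined`. The sketch below is one possible way to tidy the scraped items before storing them. It is not part of pjscrape. `normalizeProduct` and `keepProducts` are hypothetical names, the code is assumed to run under Node.js rather than inside PhantomJS, and the `price` attribute is assumed to hold a plain decimal string such as "1299.99".

```javascript
// Hypothetical post-processing helpers; not part of pjscrape or the original gist.

// Collapse runs of whitespace and trim; non-string values become null.
function clean(value) {
    return typeof value === 'string' ? value.replace(/\s+/g, ' ').trim() : null;
}

// Turn one raw scraped item into a record with predictable field types.
function normalizeProduct(raw) {
    var priceText = clean(raw.price);
    // Assumption: the price is a plain decimal string, e.g. "1299.99".
    var price = priceText === null ? NaN : parseFloat(priceText);
    return {
        name: clean(raw.name),
        mpn: clean(raw.mpn),
        price: Number.isFinite(price) ? price : null,
        brand: clean(raw.brand),
        category: clean(raw.category),
        img: clean(raw.img),
        desc: typeof raw.desc === 'string' ? raw.desc : null
    };
}

// Keep only items that look like real products: a non-empty name and a numeric price.
function keepProducts(records) {
    return records.map(normalizeProduct).filter(function (p) {
        return Boolean(p.name) && p.price !== null;
    });
}
```

Assuming the output file contains a JSON array of items, it could be loaded with `JSON.parse(require('fs').readFileSync('scrape_output.json', 'utf8'))` and passed to `keepProducts` before any database insert.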