@mtrunkat · Created April 5, 2018 13:21
Hacker News crawler using Apify SDK (PuppeteerCrawler and RequestQueue classes)
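The crawler below wires a RequestQueue into a PuppeteerCrawler, which pulls requests from the queue, opens each one in Puppeteer, and retries failures automatically. For context, this is roughly the request lifecycle the crawler automates; a minimal sketch using only the RequestQueue API (the URL is just an example):

const Apify = require('apify');

Apify.main(async () => {
    const requestQueue = await Apify.openRequestQueue();
    await requestQueue.addRequest(new Apify.Request({ url: 'https://example.com' }));

    // Take the next pending request, process it, then mark it handled
    // so it is not served again. PuppeteerCrawler runs this loop internally.
    const request = await requestQueue.fetchNextRequest();
    console.log(request.url);
    await requestQueue.markRequestHandled(request);
});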
const Apify = require('apify');

Apify.main(async () => {
    // Open the default request queue and enqueue the first URL.
    const requestQueue = await Apify.openRequestQueue();
    const enqueue = async url => requestQueue.addRequest(new Apify.Request({ url }));
    await enqueue('https://news.ycombinator.com/');

    // Create the crawler.
    const crawler = new Apify.PuppeteerCrawler({
        requestQueue,
        disableProxy: true,

        // This function is executed for each request.
        // If a request fails, it is retried up to 3 times.
        // The page parameter is Puppeteer's Page object with the page already loaded.
        handlePageFunction: async ({ page, request }) => {
            console.log(`Request ${request.url} succeeded!`);

            // Extract all posts.
            const data = await page.$$eval('.athing', els => els.map(el => el.innerText));

            // Save the data to the default dataset.
            await Apify.pushData({
                url: request.url,
                data,
            });

            // Enqueue the next page.
            const nextHref = await page.$eval('.morelink', el => el.href);
            await enqueue(nextHref);
        },

        // This function is executed once a request has failed 4 times
        // (the initial attempt plus 3 retries).
        handleFailedRequestFunction: async ({ request }) => {
            console.log(`Request ${request.url} failed 4 times`);
            await Apify.pushData({
                url: request.url,
                errors: request.errorMessages,
            });
        },
    });

    // Run the crawler.
    await crawler.run();
});
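Records saved with Apify.pushData() end up in the default dataset; when the script runs locally they are also written to the SDK's local storage directory on disk. A minimal sketch of reading the results back, assuming a later SDK version that exposes Apify.openDataset() (the 2018-era version this gist targets may differ):

const Apify = require('apify');

Apify.main(async () => {
    // Open the same default dataset the crawler pushed into.
    const dataset = await Apify.openDataset();
    const { items } = await dataset.getData();
    items.forEach(({ url, data, errors }) => {
        console.log(url, errors ? `failed: ${errors}` : `${data.length} posts`);
    });
});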