@veb
Created August 26, 2017 09:02
Scrapes the front page of Hacker News using Puppeteer and Cheerio and returns an array of objects
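const puppeteer = require('puppeteer');
const cheerio = require('cheerio');

async function run() {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto('https://news.ycombinator.com');
    const content = await page.content();
    const $ = cheerio.load(content);
    const results = [];

    // Each story title link is followed by a span.comhead holding the source site.
    $('span.comhead').each(function (i, element) {
      const a = $(this).prev();
      // The story row's text starts with the rank, e.g. "1."
      const row = a.closest('tr');
      // The row after the story row holds points, username, age and comments.
      const subtext = row.next().find('.subtext');
      // Pick fields by class or wording rather than position: the third child
      // of .subtext is the story's age, not the comment count.
      // (The .score and .hnuser class names assume Hacker News's current markup.)
      const comments = subtext.find('a')
        .filter((j, el) => /comment|discuss/.test($(el).text()))
        .text();

      results.push({
        rank: parseInt(row.text(), 10),
        title: a.text(),
        url: a.attr('href'),
        points: parseInt(subtext.find('.score').text(), 10),
        username: subtext.find('.hnuser').text(),
        // "discuss" (no comments yet) has no number, so fall back to 0.
        comments: parseInt(comments, 10) || 0
      });
    });

    return results;
  } finally {
    // Close the browser even if navigation or parsing throws.
    await browser.close();
  }
}

run().then(console.log).catch(console.error);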

mrm8488 commented Aug 26, 2017

Cool! I think doing it with the request module (or request-promise) and, of course, cheerio would be faster.
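A minimal sketch of the browserless approach, assuming Node.js 18 or later, where `fetch` is a built-in global. It is used here in place of the request module the comment mentions. `fetchHtml` and its `fetchImpl` parameter are hypothetical names chosen for illustration, not part of the gist.

```javascript
// Hypothetical helper: download a page's HTML without launching a browser.
// Assumes Node.js 18+ (global fetch). fetchImpl can be swapped out for testing.
async function fetchHtml(url, fetchImpl = fetch) {
  const response = await fetchImpl(url);
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  // Resolves to the raw HTML string returned by the server.
  return response.text();
}
```

The returned string could be passed to `cheerio.load()` exactly as the gist does with `page.content()`. Since no JavaScript is executed, this only works for pages that are rendered on the server.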

dotaxis commented Jan 20, 2018

Very nice! The tradeoff for speed is obviously the browser-like behaviour which won't trigger captcha. :)

ychong commented Aug 31, 2018

Amazing! Thanks for writing and sharing!

ikainar commented Jan 8, 2019

The comments field is being confused with the time the story was posted.
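One possible way around this, as a sketch: classify each piece of subtext by its wording instead of its position. `parseSubtext` is a hypothetical helper, and the sample strings are made up to resemble a Hacker News subtext row. The exact wording on the live site is an assumption.

```javascript
// Hypothetical helper: classify the text of each subtext element by its
// wording instead of trusting its position in the row.
function parseSubtext(parts) {
  const result = { points: null, age: null, comments: null };
  for (const raw of parts) {
    // The comments link may contain a non-breaking space; normalise it.
    const text = raw.replace(/\u00a0/g, ' ').trim();
    let match;
    if ((match = /^(\d+) points?$/.exec(text))) {
      result.points = Number(match[1]);
    } else if (/ ago$/.test(text)) {
      result.age = text;
    } else if ((match = /^(\d+) comments?$/.exec(text))) {
      result.comments = Number(match[1]);
    } else if (text === 'discuss') {
      // Stories without comments show "discuss" instead of a count.
      result.comments = 0;
    }
  }
  return result;
}

// Made-up sample row: points, username, age, a "hide" link, comments.
const example = parseSubtext([
  '120 points', 'someuser', '2 hours ago', 'hide', '45\u00a0comments'
]);
console.log(example);
```

With this approach the age ("2 hours ago") lands in its own field, so it can no longer be parsed as the comment count.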
