@toshvelaga
Created October 15, 2023 05:11
Use Puppeteer to get the contents of a website
import puppeteer from 'puppeteer'
import 'dotenv/config'
// import screenshot from '../screenshot.js'
const PROD_CONFIG = {
  headless: true,
  ignoreHTTPSErrors: true,
  args: ['--no-sandbox', '--disable-setuid-sandbox'],
  ignoreDefaultArgs: ['--disable-extensions'],
}
// Edit executablePath to point at your local Chrome install
// (the path below is the default location on macOS).
const DEV_CONFIG = {
  executablePath:
    '/Applications/Google Chrome.app/Contents/MacOS/Google Chrome',
  headless: false,
  ignoreHTTPSErrors: true,
}
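// Load `url` in Chrome and return the text of the whole page.
const runWebScraper = async (url) => {
  const browser = await puppeteer.launch(
    process.env.NODE_ENV === 'production' ? PROD_CONFIG : DEV_CONFIG
  )
  console.time('puppeteer')
  try {
    const page = await browser.newPage()
    // 'domcontentloaded' resolves as soon as the initial HTML is parsed.
    // 'networkidle0' instead waits until there have been no network
    // connections for at least 500 ms.
    await page.goto(url, { waitUntil: 'domcontentloaded' })
    // await page.goto(url, { waitUntil: 'networkidle0' })
    // '*' matches the first element in the document (<html>), so selecting
    // that node and reading the selection yields the text of the whole page.
    const content = await page.$eval('*', (el) => {
      const selection = window.getSelection()
      const range = document.createRange()
      range.selectNode(el)
      selection.removeAllRanges()
      selection.addRange(range)
      return selection.toString()
    })
    // console.log(content)
    // console.log('content length: ', content.length)
    await page.close()
    return content
  } finally {
    // Close the browser even if navigation or extraction throws.
    await browser.close()
    console.timeEnd('puppeteer')
  }
}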
// FOR TESTING
// const URL = 'https://www.npmjs.com/package/html-to-text'
// runWebScraper(URL)
export default runWebScraper
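
The comment on DEV_CONFIG says the executablePath has to be edited by hand. One possible way to avoid editing the source is to read the path from the environment, which 'dotenv/config' already loads from a .env file. The sketch below is only an illustration under assumptions: getLaunchConfig, DEFAULT_MAC_CHROME, and the CHROME_PATH variable are hypothetical names that do not appear in the gist, and the option values are copied from PROD_CONFIG and DEV_CONFIG above.

```javascript
// Hypothetical helper (not part of the gist): build the launch options
// from an env-like object such as process.env.
// CHROME_PATH is an assumed variable name; set it in .env if you use this.
const DEFAULT_MAC_CHROME =
  '/Applications/Google Chrome.app/Contents/MacOS/Google Chrome'

const getLaunchConfig = (env) => {
  if (env.NODE_ENV === 'production') {
    // Same values as PROD_CONFIG in the gist.
    return {
      headless: true,
      ignoreHTTPSErrors: true,
      args: ['--no-sandbox', '--disable-setuid-sandbox'],
      ignoreDefaultArgs: ['--disable-extensions'],
    }
  }
  // Same values as DEV_CONFIG, except the path can be overridden.
  return {
    executablePath: env.CHROME_PATH || DEFAULT_MAC_CHROME,
    headless: false,
    ignoreHTTPSErrors: true,
  }
}

// Example with a made-up Linux path:
const exampleConfig = getLaunchConfig({
  NODE_ENV: 'development',
  CHROME_PATH: '/usr/bin/google-chrome',
})
console.log(exampleConfig.executablePath) // prints "/usr/bin/google-chrome"
```

With a helper like this, the launch call in runWebScraper could become puppeteer.launch(getLaunchConfig(process.env)), so each machine sets its own path in .env instead of changing the file.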