GitHub topics (https://github.com/topics/python) scraper
import { assert, log } from 'console'
import fs from 'fs'
import playwright from 'playwright'

const DOWNLOAD_PATH = 'downloads'

const main = async () => {
    console.clear()

    // init download dir
    const downloadDirExists = fs.existsSync(DOWNLOAD_PATH)
    if (!downloadDirExists) {
        fs.mkdirSync(DOWNLOAD_PATH)
    }

    // get user arg: the github topics url to scrape
    const args = process.argv.slice(2)
    assert(args.length === 1, 'only one argument is allowed')
    const URL = args[0]
    log(`URL: ${URL}`)

    // open browser
    const browser = await playwright.chromium.launch({
        // headless: false,
        // slowMo: 1000,
    })
    const context = await browser.newContext()
    const page = await context.newPage()
    await page.goto(URL)

    // collect repo links: take the href of the second anchor in each article's
    // first div and prefix it with the github origin
    const links = await page.$$eval('article div:nth-child(1) a:nth-child(2)', (anchors) => {
        const hrefs = anchors.map((anchor) => anchor.getAttribute('href'))
        const urls = hrefs.map((href) => `https://github.com${href}`)
        return urls
    })
    log(links)

    await browser.close()
}
main()
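The script creates the downloads directory but never writes anything into it. As a minimal sketch of the likely intent (an assumption, not part of the gist), the scraped links could be persisted there as JSON:

// sketch (assumption): persist the scraped links as JSON into DOWNLOAD_PATH,
// e.g. inserted right after the `log(links)` call inside main() above
fs.writeFileSync(`${DOWNLOAD_PATH}/links.json`, JSON.stringify(links, null, 2))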
const { join } = require("path");

module.exports = {
    cacheDirectory: join(__dirname, ".cache", "puppeteer"),
};
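The snippet above reads like a Puppeteer cache configuration (typically a .puppeteerrc.cjs at the project root): it relocates Puppeteer's browser download cache into the project directory and is not used by the Playwright script. A rough sketch of the analogous idea for Playwright (an assumption, not part of the gist) is pointing the PLAYWRIGHT_BROWSERS_PATH environment variable at a project-local directory; the same variable also needs to be set when running "npx playwright install":

// sketch (assumption): project-local browser cache for playwright, mirroring the
// puppeteer cacheDirectory above; set this before installing and launching browsers
process.env.PLAYWRIGHT_BROWSERS_PATH = join(__dirname, ".cache", "playwright");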