@ryan-williams
Created March 14, 2020 04:10
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Parse submissions from Papercall\n",
"This notebook opens papercall in a Chromium window (that will stay open and respond to your commands!).\n",
"\n",
"It uses the excellent [tslab](https://github.com/yunabe/tslab) Jupyter kernel (which provides top-level `async`/`await`, among other things 🎉)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import assert from 'assert'\n",
"const fs = require('fs')\n",
"const puppeteer = require('puppeteer');"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"// Initialize puppet browser and default page\n",
"async function puppet(url, width, height) {\n",
" width = width || 1200;\n",
" height = height || 800;\n",
" const browser = await puppeteer.launch({\n",
" headless: false,\n",
" args: [ '--no-sandbox', `--window-size=${width},${height}`],\n",
" userDataDir: process.env[\"USER_DATA_DIR\"] || '.chrome',\n",
" });\n",
" const page = await browser.newPage();\n",
" if (url)\n",
" await page.goto(url);\n",
"\n",
" return { browser, page };\n",
"}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Open https://papercall.io"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"let { browser, page } = await puppet('https://papercall.io')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Login\n",
"Open the hamburger menu:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"await page.click('.icon.icon--hamburger')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### (Helper for finding+clicking links containing text)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"const escapeXpathString = str => {\n",
" const splitedQuotes = str.replace(/'/g, `', \"'\", '`);\n",
" return `concat('${splitedQuotes}', '')`;\n",
"};\n",
"\n",
"const findByText = async ({\n",
" page,\n",
" text,\n",
" tag = 'a',\n",
" requireSingleton = true,\n",
" click = false,\n",
" }) => {\n",
" console.log(`Checking ${page.url()} for <${tag}>'s containing \"${text}\"`);\n",
" const escapedText = escapeXpathString(text);\n",
" const links = await page.$x(`//${tag}[contains(., ${escapedText})]`);\n",
"\n",
" if (links.length === 0)\n",
" throw new Error(`Link not found: ${text}`);\n",
"\n",
" if (links.length > 1 && requireSingleton)\n",
" throw new Error(`Required 1 <${tag}> containing \"${text}\", found ${links.length}`);\n",
"\n",
" if (click)\n",
" await links[0].click();\n",
" else\n",
" return links[0];\n",
"};"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Click \"Login\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"await findByText({ page, text: \" Login \", click: true })"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Finish logging in\n",
"You're kind of on your own here; log in however you like from the various OAuth providers on offer.\n",
"\n",
"## Parse Submissions\n",
"First, navigate to your event page's \"submissions\" list; something like:\n",
"\n",
"![](https://p199.p4.n0.cdn.getcloudapp.com/items/DOu8n154/Screen+Shot+2020-02-12+at+10.04.02+PM.png?v=baaac5e2f2dcf3905d33dee67c909d02)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### \"Submission list\" page helpers:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Grabb a simplified view of a node's children (including `#text` nodes):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"async function children(n) {\n",
" return n.evaluate(\n",
" n => \n",
" Array.prototype.slice.call(n.childNodes)\n",
" .map(n => {\n",
" if (n.nodeType == 3) {\n",
" let txt = n.nodeValue.trim();\n",
" if (txt) return { node: '#text', txt }\n",
" } else {\n",
" let node = n.nodeName;\n",
" let txt = (n.text || n.innerText);\n",
" if (node != 'BR') return { node, txt }\n",
" }\n",
" }).filter(n => n)\n",
" )\n",
"}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Pull {talk title, (optional) tags, author name, and (optional) author location} from the first column (\"Title\") of the \"Submissions\" table:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"function parseFirstColumnChildren(pieces) {\n",
" assert(pieces[0]['node'] == 'A');\n",
" let title = pieces[0]['txt'];\n",
" let idx = 1;\n",
" let piece = pieces[idx];\n",
" let tags = [];\n",
" while (piece['node'] == 'A') {\n",
" tags.push(piece['txt']);\n",
" idx++; piece = pieces[idx];\n",
" if (piece['node'] != '#text' || piece['txt'] != ',') break;\n",
" idx++; piece = pieces[idx];\n",
" }\n",
" assert(piece['node'] == '#text');\n",
" let author = piece['txt'];\n",
" idx++;\n",
" let obj = { title, tags, author };\n",
" if (idx < pieces.length) {\n",
" let piece = pieces[idx];\n",
" assert(piece['node'] == 'SMALL', piece['node']);\n",
" let location = piece['txt'];\n",
" obj['location'] = location;\n",
" idx++;\n",
" }\n",
" assert(idx == pieces.length, idx.toString());\n",
" return obj;\n",
"}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Parse a full submission row:\n",
"- \"Title\" column\n",
"- Submission time\n",
"- Modification time"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"async function parseSubmissionRow(tr) {\n",
" const td0 = await tr.$('td:first-child');\n",
" const pieces = await children(td0);\n",
" let metadata = parseFirstColumnChildren(pieces);\n",
"\n",
" const titleTag = await tr.$('td:first-child > a:first-child');\n",
" const href = await titleTag.evaluate(a => a.href);\n",
"\n",
" const submittedHandle = await tr.$('td:nth-child(5) > time');\n",
" const submitted = await submittedHandle.evaluate(time => time.attributes['datetime'].value);\n",
"\n",
" const modifiedHandle = await tr.$('td:nth-child(6) > time');\n",
" const modified = await modifiedHandle.evaluate(time => time.attributes['datetime'].value);\n",
"\n",
" const title_tag = await tr.$('td:first-child > a:first-child');\n",
" \n",
" metadata = { ...metadata, ...{ href, submitted, modified, handle: tr, title_tag } };\n",
" \n",
" return metadata;\n",
"}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Example\n",
"Grab the submissions on this page:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"let rows = await page.$$('tbody > tr'); rows.length"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Parse them:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"let parsed_rows = await Promise.all(rows.map(async tr => await parseSubmissionRow(tr)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Print the parsed rows (omitting two `ElementHandle` fields that would otherwise clutter the output:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"parsed_rows.map(\n",
" function ({handle,title_tag,...rest}) {\n",
" return rest\n",
" }\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Try clicking through to the first submission:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"await parsed_rows[0]['title_tag'].click()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## \"Submission details\" page helpers"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Parse the main / left-hand boxes (omitting the final one, which contains the author's bio, and has a slightly different structure):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"async function parseFixedBoxes(boxes) {\n",
" return await Promise.all(\n",
" boxes.slice(0, boxes.length - 1).map(async box => {\n",
" let h3 = await box.$('h3')\n",
" let title = await h3.evaluate(h3 => h3.innerText.trim())\n",
" let mdNode = await box.$('.markdown')\n",
" let md = await mdNode.evaluate(md => md.innerHTML)\n",
" let o = {};\n",
" o[title] = md;\n",
" return o;\n",
" })\n",
" )\n",
"}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Generic helper for merging a list of objects:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"function flattenObjs(objs) {\n",
" let o = {};\n",
" objs.forEach(O => o = {...o, ...O});\n",
" return o;\n",
"}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Parse the author's \"bio\" box (which looks like the boxes above, but includes some links in the header):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"async function parseAuthorMetaLink(a) {\n",
" const href = await a.evaluate(a => a.href);\n",
" const icon = await a.$('i');\n",
" const className = await icon.evaluate(i => i.className);\n",
" const txtSpan = await a.$('span');\n",
" const txt = await txtSpan.evaluate(s => s.innerText);\n",
" return { txt, className, href };\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"async function parseAuthorBox(boxes) {\n",
" let authorBox = boxes[boxes.length - 1];\n",
" let header = await authorBox.$('header');\n",
" let nameBox = await header.$('div.justifize__box:first-child > h3')\n",
" let name = await nameBox.evaluate(h3 => h3.innerText)\n",
" let links = await header.$$('div.justifize__box:nth-child(2) > a')\n",
" let parsedLinks = \n",
" await Promise.all(\n",
" links.map(\n",
" async a => await parseAuthorMetaLink(a)\n",
" )\n",
" )\n",
" ;\n",
" let mdNode = await authorBox.$('.markdown')\n",
" let md = await mdNode.evaluate(md => md.innerHTML)\n",
" return { name, links: parsedLinks, blurb: md };\n",
"}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Parse the right-hand \"aside\" that includes metadata like \"Talk Format\" and \"Audience Level\":"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"async function parseAside(row) {\n",
" let aside = await row.$('aside'); \n",
" const asideBox = (await row.$$('aside > .box.spaced--md'))[0];\n",
" let items = await asideBox.$$('.pack__item');\n",
" async function parseItem(item) {\n",
" const h4 = await item.$('h4');\n",
" const key = await h4.evaluate(h4 => h4.innerText);\n",
" const span = await item.$('span');\n",
" const value = await span.evaluate(s => s.innerText);\n",
" let o = {};\n",
" o[key.trim()] = value.trim();\n",
" return o;\n",
" }\n",
" return flattenObjs(await Promise.all(items.map(async item => await parseItem(item))));\n",
"}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Parse all info from a \"submission details\" page:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"async function parseSubmission() {\n",
" let row = await page.$('.container > .row');\n",
" let boxes = await row.$$('div.box.box--md')\n",
"\n",
" let o = flattenObjs(await parseFixedBoxes(boxes));\n",
" const authorMeta = await parseAuthorBox(boxes);\n",
" o['author'] = authorMeta;\n",
" \n",
" const items = await parseAside(row);\n",
" const format = items['Talk Format']\n",
" const level = items['Audience Level']\n",
" \n",
" o = {...o, ...{format, level}};\n",
" return o\n",
"}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Navigation helpers"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"const sleep = 500;"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"async function clickSubmission(submission) {\n",
" await submission['title_tag'].click();\n",
" await page.waitForSelector('aside');\n",
" await page.waitFor(sleep);\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"async function back() {\n",
" await page.goBack();\n",
" await page.waitForSelector('table');\n",
" await page.waitFor(sleep);\n",
"}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Parse submissions\n",
"Iterate through submissions, index by index, on a \"submission list\" page:\n",
"- parse metadata from submissions list\n",
"- click through to details page\n",
"- parse details\n",
"- go back to submissions list\n",
"\n",
"Repeat until we've parsed all submissions on the page:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"async function parseSubmissions(start = 0, end = null) {\n",
" start = start || 0\n",
" let subIdx = start\n",
" const submissions = [];\n",
" while (true) {\n",
" let submissionRows = await page.$$('tbody > tr')\n",
" end = end || submissionRows.length;\n",
" if (subIdx >= end) break;\n",
" let submissionRow = submissionRows[subIdx];\n",
" console.log(`Parsing submission ${subIdx}`);\n",
" let meta = await parseSubmissionRow(submissionRow);\n",
" await clickSubmission(meta);\n",
" let details = await parseSubmission();\n",
" await back();\n",
" const submission = { meta, details };\n",
" if (submissions.length == subIdx) {\n",
" console.log(`appending submission ${subIdx}`)\n",
" submissions.push(submission)\n",
" } else {\n",
" console.log(`replacing submission ${subIdx}`)\n",
" submissions[subIdx] = submission;\n",
" }\n",
" subIdx++;\n",
" }\n",
" return submissions\n",
"}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Run it!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"const page1 = await parseSubmissions()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"page1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Mild cleanup before persisting: remove `ElementHandle`s:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"function dropHandles(sub) {\n",
" const { meta, details } = sub;\n",
" const { handle, title_tag, ...rest } = meta;\n",
" return { meta: rest, details };\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"page1.map(dropHandles)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Checkpoint here if you like…"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fs.writeFileSync('page-1.json', JSON.stringify(page1.map(dropHandles), null, 2))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Click to page 2 of submissions…\n",
"You'll have to do that manually; scripting it is left as an exercise for the reader 😛"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"const page2 = await parseSubmissions()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"page2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Combine pages 1 and 2:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"const all = page1.concat(page2)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"all.length"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Persist!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fs.writeFileSync('submissions.json', JSON.stringify(all.map(dropHandles), null, 2))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "JavaScript",
"language": "javascript",
"name": "jslab"
},
"language_info": {
"file_extension": ".js",
"mimetype": "text/javascript",
"name": "javascript",
"version": ""
}
},
"nbformat": 4,
"nbformat_minor": 2
}