@chrishham
Last active April 19, 2021 16:45
Super Web Scraper - Sample node-html-parser code
// https://www.npmjs.com/package/node-html-parser
// htmlRoot gets injected and is available to your script
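// Injected helper functions (signatures as used below):
//   select(el, selector, what): reads from the first match; what = 'text' gives
//     removeWhitespace().text, any other value is read as an attribute (e.g. el.attributes.content)
//   selectAll(el, selector, index, what): same as above, for the match at position index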
const result = []
const list = Array.from(htmlRoot.querySelectorAll(':is(.AdvItemBox,.FreeListingItemBox)'))
for (const el of list) {
  const isAd = !!el.querySelector('.AdvDetailsArea')
  const details = el.querySelector('[name="extraInfo"]')
  result.push({
    name: select(el, 'h2.CompanyName', 'text'),
    description: isAd
      ? select(el, 'div.CompanyDescr > label > span', 'text')
      : select(el, '.CompanyDescr span', 'text'),
    address: isAd
      ? select(el, 'div > div.AdvAddress', 'text')
      : select(el, '.FreeListingAddress', 'text'),
    region: select(el, 'meta[itemprop="addressLocality"]', 'content'),
    landline: selectAll(details, 'div[itemprop="telephone"]', 0, 'text'),
    mobile: selectAll(details, 'div[itemprop="telephone"]', 1, 'text'),
    email: isAd
      ? select(el, 'a[itemprop="email"]', 'text')
      : select(el, 'meta[itemprop="email"]', 'content'),
    site: select(el, 'a.siteLink[target="_blank"]', 'href'),
    latitude: select(el, 'meta[itemprop="latitude"]', 'content'),
    longitude: select(el, 'meta[itemprop="longitude"]', 'content'),
    isAd
  })
}
return result
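
The `select` and `selectAll` helpers are injected by the scraper and their source is not shown here. The sketch below is a hypothetical re-implementation, inferred only from how they are called above and from the comment at the top of the script. The real helpers may behave differently, in particular when an element or match is missing. The `read` function is an illustrative name that does not appear in the original.

```javascript
// Hypothetical stand-ins for the injected helpers; not the scraper's actual code.
// They assume node-html-parser elements, which expose querySelector,
// querySelectorAll, removeWhitespace(), text and attributes.

// Read either the whitespace-stripped text or a named attribute from a node.
// Returns null when the node or the attribute is missing (an assumption).
function read (node, what) {
  if (!node) return null
  if (what === 'text') return node.removeWhitespace().text
  const value = node.attributes[what]
  return value === undefined ? null : value
}

// First element matching `selector` under `el`.
function select (el, selector, what) {
  if (!el) return null
  return read(el.querySelector(selector), what)
}

// Element at position `index` among all matches of `selector` under `el`.
function selectAll (el, selector, index, what) {
  if (!el) return null
  return read(el.querySelectorAll(selector)[index], what)
}
```

With guards like these, a listing that has no `[name="extraInfo"]` block would yield `null` for `landline` and `mobile` instead of throwing. Whether the injected helpers do the same is not confirmed by the source.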