Web Scraping a job listing website

Using web scraping for a real use case application

Data Extraction

In this article I'll show you how I built a simple web scraping tool to assist my wife's job search.

My wife's routine is straightforward: she knows a couple of job posting sites and checks them daily, filtering out irrelevant jobs that show up too frequently and clutter the page, and sending her application by email for the ones that are relevant.

The goal of the application I intended to build was simple: automate the demanding, repetitive process of opening the browser, entering the URL, ticking specific checkboxes, typing the search query, and then filtering relevant postings by their titles. As I thought it through (launch a browser, go to a URL, and so on), I remembered just the right tool for the job: Puppeteer!

(You can find the full source code at github.com/borispov/jobscrape)

The tools I used in this project are minimal, although Puppeteer is capable of draining your machine all by itself if you misuse it. Run this command to install the modules:

npm install puppeteer nodemailer easy-pdf-merge

We'll save each job posting as a separate PDF, merge all the PDFs, and then mail them using nodemailer.

First lines of code. Name the file however you like; I named it scraper.js:

const puppeteer = require('puppeteer');

// the URL we'll be scraping off of.
const URL = `https://www.shatil.org.il/modaot/joboffer_lastweek?field_activity_zones_tid%5B%5D=87&field_main_roles_job_tid%5B%5D=98&combine=${getQuery()}`;

(async () => {
   const browser = await puppeteer.launch();
   const page = await browser.newPage();
   await page.goto(URL);

   // All Your Logic Here

   await browser.close();
})();

(Disclaimer: the website in this example is in Hebrew; however, the language does not affect our code in any major way.)

First, we import the puppeteer module. Then we hard-code the URL we'll gather our data from. Notice that we have query parameters in the URL: field_activity_zones_tid, field_main_roles_job_tid and combine=. They are responsible for selecting a specific job role (Social Work), a region (South) and a custom free-text search parameter for our city. The only thing I might change is the custom query parameter, so I won't hard-code it in the URL; it lives in a function that returns the query or an empty string.

Next, we declare an async IIFE, and inside it we launch a browser with the await keyword, which you'll see throughout our code because Puppeteer works asynchronously. Right after that we open a new page (a tab, if you will) and use page.goto(URL) to navigate to our URL. Lastly, when we're done, we close the browser with await browser.close(). Closing the browser at the end is crucial: if your script doesn't do it, you can end up with dozens of browser processes slowing down your machine.
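Since a missed browser.close() is the main way to leak Chromium processes, one way to guarantee it runs even when the scraping logic throws is a try/finally block. This is just a minimal sketch of the pattern, not how the final script in this article is structured:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(URL); // URL is the constant defined above
    // ... scraping logic here ...
  } finally {
    // Runs even if the scraping logic throws, so no stray browser processes are left behind.
    await browser.close();
  }
})();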

Since I'd like this app to run as a script from the shell rather than as a web app, I'll use command line arguments for any input needed. For example, the custom query parameter is derived from the arguments as well, and this is the function I wrote for that job:

const getQuery = () => {
   const args = process.argv.slice(2)
   if (!args.length) return ''

   // If any argument looks like query=..., take the part after '=';
   // otherwise fall back to searchField (a default search term defined near the top of the script).
   return args.some(el => el.includes('query='))
      ? args
         .filter(el => el.includes('query='))[0]
         .split('=')[1]
      : searchField
}

Next, we open the URL manually in a browser and figure out how to get hold of the selectors, and eventually the data. In this case the selectors aren't structured in a very scraper-friendly way (^_^), which means I have to grab a selector and then use tricks like walking to its first/last child or a sibling. In this specific example, this is how I got it:

const jobs = await page.$$eval(
   '.job',
   nodes => nodes.map(node => ({
      // job title
      txt: node.firstElementChild.innerText,
      // link to the full job posting
      href: node.firstElementChild.href,
      // the posting date lives two siblings back from the .job element
      date: node.previousSibling.previousSibling.innerText
   }))
)

Basically, we use Puppeteer's selector methods (I recommend reading the docs to get a better understanding of them). We grab all the nodes with the 'job' class name, and the callback maps them to an array of objects with the keys txt, href and date. The txt field holds the title of the job, href holds the URL of the posting, and date holds, well... the date.

This is a wealth of information, and technically all we need is the href to instruct our script to take a snapshot and save it as a PDF. So why do I need the date and the title text? Because I haven't accomplished my goal yet: I still have to filter out irrelevant jobs. The way to do that is to keep a separate text file of keywords, check each title against it, and rule out any job whose title contains one of them. Say you're looking for a job and you're certain you don't want anything to do with PHP. See?

Now, what about the date? I wanted an option to retrieve only the last day's jobs, or the last week's, and I figured the fastest way was to integrate a database. Yeah, no... I simply wanted to state explicitly how many days back should be saved to PDF, and since I'm not building this as a web app, command line arguments are a nice fit.
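The keyword dictionary itself is just a plain text file with one keyword per line. Loading it is a one-liner; the full scraper.js shown later does exactly this (dict.txt is assumed to sit next to the script):

const fs = require('fs')

// dict.txt: one keyword per line; slice(0, -1) drops the trailing empty line
const dict = fs.readFileSync('./dict.txt', 'utf-8').split('\n').slice(0, -1)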

I wrote two helper filter functions for Title and Date.


const filter = f => arr => arr.filter(f)


// Reduces over the keyword dictionary; returns true if the string contains
// any of the keywords. Later passed as a predicate to .filter.
const containAny = (string, arrayOfStr) => arrayOfStr.reduce((answer, cur) => {
   if (string.indexOf(cur) > -1) return true
   return answer
}, false);

// Turns the first CLI argument into a time window in milliseconds:
// 'last' means one day, a number means that many days, anything else means no limit.
const getDateArg = () => {
   const oneday = 1000 * 60 * 60 * 24;
   const firstArg = process.argv[2];
   return firstArg === 'last'
      ? oneday
      : !isNaN(firstArg) ? oneday * firstArg : 0
}

// Passed to .filter as a predicate. The site shows dates as DD/MM, hence the hard-coded year.
const filterByDate = (node) => {
   const arg = getDateArg();
   if (!arg) return true
   const nodeDate = `2019/${node.date.slice(3,5)}/${node.date.slice(0,2)}`
   const nodeTimestamp = new Date(nodeDate).getTime();
   return (new Date().getTime() - nodeTimestamp <= arg)
}

Let's start with containAny. It takes two arguments: a string (the title of the job in question) and an array of keywords. We reduce over the keyword dictionary with false as the initial value. string.indexOf(cur) > -1 checks whether the title contains the current keyword; if it does, we set the accumulator to true. At that point we could in principle stop iterating, but reduce has no built-in way to short-circuit and I was too lazy to hack one in.
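If you do want the short-circuit behaviour, a minimal alternative (not what the final script uses) is Array.prototype.some, which stops iterating as soon as a keyword matches:

// Equivalent result, but stops at the first matching keyword.
const containAny = (string, arrayOfStr) => arrayOfStr.some(cur => string.includes(cur))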

I decided the date parameter should live as the first argument after the script name, e.g. node scraper.js 5. We accept numbers and the string 'last' only, so we check against that.

In filterByDate, if the user didn't pass a date argument or passed an invalid one, we treat it as if no date was given and return true, which doesn't filter anything out. Otherwise we convert the number of days to a timestamp, take the difference between the current date and the job's date, and return true if it's less than or equal to the given window. In other words, we filter out everything that is older than the time provided.

The next bit of code puts the functions we wrote earlier to use:

const rJobs = jobs
   .filter(({ txt }) => !containAny(txt, dict))
   .filter(filterByDate)

We're left with only the data we actually need; everything else is disposed of. What next? Next we take care of the technical part of actually saving the documents. It's supposed to be straightforward; however, this was the part I spent most of my time on, and it's the reason I didn't bother refactoring the code. I was just happy to have it working!

We have to visit each job's URL and take a snapshot, right? Since we do it in async fashion we can't just map over the array (map wouldn't await each navigation in turn), so we use a good ol' for loop. Like so:

for (var i = 0; i < rJobs.length; i++){

   const theNode = rJobs[i].href;
   // just wait till page loads.
   await page.waitFor(200);
   // navigate to the url
   await page.goto(theNode);
   // take a picture ;) save to pdf. 
   await page.pdf({path: 'hardCoded.pdf', format: 'A4'});
}

await browser.close();

Easy, right?! Well, not quite. There are always unplanned surprises along the way. On the job page there is an email address for applying, but to reveal it you have to click a span that has no direct selector. Fun! So we have to find the fastest way to reach it manually. After failing to retrieve this element via the common methods, I decided to query all the spans and filter by this field's unique innerText, which in Hebrew says "show email". That leaves us with the element we need, and we click it.

Add this after page.goto(theNode);

await page.evaluate(() => {
  // The actual innerText on the site is the Hebrew phrase for "show email"
  const target = [...document.querySelectorAll('span')].find(el => el.innerText.includes('show email'))
  target && target.click();
})

We use the spread operator to create an array out of all the spans we find, then use the built-in find, which returns the first match; that span occurs only once on the page. If it's found, we click it via target.click().

That's almost it; we've built the core of this program. The overall loop now looks like this:

for (var i = 0; i < rJobs.length; i++){
  // Notice the use of pdfName instead of a hard-coded one.
  const pdfName = rJobs[i].href.split('/').slice(-1)[0] + '.pdf';
  const theNode = rJobs[i].href;
  await page.waitFor(200);
  await page.goto(theNode);

  await page.evaluate(() => {
    const target = [...document.querySelectorAll('span')].find(el => el.innerText.includes('show email'))
    target && target.click();
  })

  await page.pdf({path: pdfName, format: 'A4'});
}

Generally, in a large-scale or even moderate use case you'd have pagination, which would involve logic similar to how we clicked on the span, except you'd click a button that takes you to the next page; a rough sketch of that follows below. In this project I didn't have to, as all the results for our city fit on a single page. Still, after saving a couple of PDFs over a few days I'd be left with a mess, so I had to merge and clean them up. I also wanted to be able to send the result as an email attachment to my wife, remember? So you'll need two packages if you haven't installed them yet: easy-pdf-merge and nodemailer.
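For reference, handling pagination could look roughly like this. It's only a hedged outline: the '.next-page' selector is a placeholder, the real site will have its own markup, and this assumes each click triggers a full navigation rather than an in-page update:

// Hypothetical pagination loop; '.next-page' is a placeholder selector.
let hasNext = true;
while (hasNext) {
  // ... extract the jobs on the current page here (e.g. with page.$$eval as above) ...
  hasNext = (await page.$('.next-page')) !== null;
  if (hasNext) {
    // Click "next" and wait for the new page of results to load.
    await Promise.all([
      page.waitForNavigation(),
      page.click('.next-page'),
    ]);
  }
}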

Create two files: merge.js and send.js. merge.js will export a function with one parameter, toMail, which decides whether or not to send the merged file over email.

const merge = require('easy-pdf-merge');
const fs = require('fs');
const mailer = require('./send');


module.exports = async (toMail) => {
  console.log(`toMail option: ${toMail}`)
  // All .pdf files in the current folder, except previously merged ones.
  const dirPdfs = () => fs.readdirSync('./').filter(s => s.split('-')[0] !== 'merged' && s.match(/.+\.pdf$/))
  // Returns the date as DD/MM (getMonth() is zero-based, hence the + 1).
  const getDayAndMonth = () => `${new Date().getDate()}/${new Date().getMonth() + 1}`;
  const allPdfs = dirPdfs();

  const d = new Date()
  const ddmm = 'merged-' + d.getDate() + '.' + (d.getMonth() + 1) + '.pdf';
  merge(allPdfs, ddmm, err => {
    if (err) return console.error(err)
    console.log(`OK: Merged files into: ${ddmm}`)
    console.log('cleaning up...')
    // Delete the individual PDFs, keeping any previously merged files.
    allPdfs.forEach(f => f.split('-')[0] !== 'merged' && fs.unlinkSync(f))
    console.log(`
    \t ||||||||||||||||||||||||||||||
    \t ||                          ||
    \t || ---   SENDING MAIL  ---  ||
    \t ||                          ||
    \t ||||||||||||||||||||||||||||||
    `)
    toMail && mailer(ddmm)
  });
};

dirPdfs is a function that returns an array of all the PDF files in the folder, except those whose names start with 'merged', so we don't delete already-merged files. (I once accidentally deleted all the files, including the script, but managed to restore it.) getDayAndMonth returns the date as DD/MM, e.g. 29/08 for August 29th, and is used for naming the merged PDF. allPdfs is self-explanatory. ddmm is the path the merged PDF will be saved to.

merge(allPdfs, ddmm, err => ...) is the merge function from the easy-pdf-merge library. If it succeeds, we follow up by cleaning up: we iterate over allPdfs and delete each file with fs.unlinkSync. toMail && mailer(ddmm) checks whether the mail argument was passed, and if it was, executes the mailer function. Simple. How do we get toMail? From scraper.js, like so:

// check if any arguments include the word 'mail'. That's it.
const toMail = () => process.argv.slice(2).includes('mail')

Technically, at this point the script works and merges everything into one file; only the nodemailer part is left.
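To recap how the arguments fit together, a typical invocation might look like this (the query value is just a placeholder; use whatever search term the site expects):

node scraper.js 5 query=social-worker mail

Here 5 limits the results to the last five days, query=... fills the combine parameter in the URL, and mail tells merge.js to send the merged PDF.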

(Reminder: I recommend finding another website and modifying the code to suit the site you're interested in.)

First, run this command from your project folder:
touch config.js && touch send.js
And add the following to config.js:

const sender = {
  auth: {
    user: "yourmail@provider.com",
    pass: "yourpassword123"
  },
  // Who do you want the email to appear to be from?
  from: 'Boris Povolotsky'
}

// Who do you want to send the mail to? Use your own address for testing.
const to = "yourmail@provider.com"

// export it.
module.exports = {
  sender,
  to
}

Now for nodemailer. Since this article isn't about nodemailer I won't go into much detail; it's pretty simple as well, and I took most of it from their docs.

const nodemailer = require('nodemailer');
const fs = require('fs');
const { sender, to } = require('./config');

// You can run `node send.js fileToSend.pdf` to send the email independently of scraper.js.
// Grab the first argument, which should be the filename.
const fi = process.argv[2] || ''

module.exports = async (fileToSend = fi) => {

  if (!fileToSend) {
    return console.error('Please provide a file to send')
  }

  const content = fs.readFileSync(fileToSend);
  const smtpTransporter = nodemailer.createTransport({
    service: "Gmail",
    host: 'smtp.gmail.com',
    auth: sender.auth
  });

  const mailOptions = {
    from: sender.from,
    to: to,
    subject: 'This is your requested list of jobs',
    text: 'Optional, body of the mail',
    // nodemailer expects an array of attachment objects
    attachments: [{
      filename: fileToSend,
      content
    }]
  }

  smtpTransporter.sendMail(mailOptions, function(error, info){
    if (error) {
      console.log(error);
    } else {
      console.log('Email sent: ' + info.response);
    }
  });
}

// When invoked directly from the shell (not required as a module), send right away.
if (require.main === module) module.exports();

So far, our scraper.js file should look like this:

const puppeteer = require('puppeteer');
const fs = require('fs');
const mergePdf = require('./merge');

// Keyword blacklist: one keyword per line in dict.txt
const dict = fs.readFileSync('./dict.txt', 'utf-8').split('\n').slice(0, -1);

// Default search term used when no query= argument is passed
const searchField = '';

const getQuery = () => {
  const args = process.argv.slice(2)
  if (!args.length) return ''
  return args.some(el => el.includes('query='))
    ? args
        .filter(el => el.includes('query='))[0]
        .split('=')[1]
    : searchField
}

const URL = `https://www.shatil.org.il/modaot/joboffer_lastweek?field_activity_zones_tid%5B%5D=87&field_main_roles_job_tid%5B%5D=98&combine=${getQuery()}`;

const getDateArg = () => {
  const oneday = 1000 * 60 * 60 * 24;
  const firstArg = process.argv[2];
  return firstArg === 'last'
    ? oneday
    : !isNaN(firstArg) ? oneday * firstArg : 0
}

const filterByDate = (node) => {
  const arg = getDateArg();
  if (!arg) return true
  const nodeDate = `2019/${node.date.slice(3,5)}/${node.date.slice(0,2)}`
  const nodeTimestamp = new Date(nodeDate).getTime();
  return (new Date().getTime() - nodeTimestamp <= arg)
}

const containAny = (string, arrayOfStr) => arrayOfStr.reduce((answer, cur) => {
  return string.indexOf(cur) > -1 ? true : answer
}, false);

const toMail = () => process.argv.slice(2).includes('mail');

(async () => {
   const browser = await puppeteer.launch({headless: true});
   const page = await browser.newPage();
   await page.goto(URL);

   // Gather title, link and date for every job posting on the page
   const jobs = await page.$$eval(
      '.job',
      nodes => nodes.map(node => ({
         txt: node.firstElementChild.innerText,
         href: node.firstElementChild.href,
         date: node.previousSibling.previousSibling.innerText
      }))
   )

   // Drop blacklisted titles and anything outside the requested date window
   const rJobs = jobs
      .filter(({ txt }) => !containAny(txt, dict))
      .filter(filterByDate)

   for (var i = 0; i < rJobs.length; i++){
     // Notice the use of pdfName instead of a hard-coded one.
     const pdfName = rJobs[i].href.split('/').slice(-1)[0] + '.pdf';
     const theNode = rJobs[i].href;
     await page.waitFor(200);
     await page.goto(theNode);

     await page.evaluate(() => {
       const target = [...document.querySelectorAll('span')].find(el => el.innerText.includes('show email'))
       target && target.click();
     })
     await page.pdf({path: pdfName, format: 'A4'});
  }

  await browser.close();
  // When everything is done, run the merge function with toMail() as the argument.
  // toMail() is true if the script was run with the 'mail' option, false otherwise.
  await mergePdf(toMail())
})();

NOTE: I believe this setup requires Java to be installed on your machine because of easy-pdf-merge; as far as I can tell its merging is done by a Java tool under the hood.

That's all. I hope you enjoyed it and that you'll find it useful for your next web scraping project ;) There are definitely tools and techniques you can take away from this article.

Farewell
