@jpnadas
Created April 13, 2020 19:29

Website Scraping Automation

I have recently moved to the Netherlands and, due to the social distancing measures, I have been having some issues with the immigration department, the IND, which is closed. Basically, you cannot schedule an appointment with them before 28 April, and even after that they cannot promise that the system will be back up and running, as no one can predict what the social distancing policy will be by then.

Their advice is to check the website every now and then to see whether anything has changed. Since this is a tedious (though important) task, I decided to automate it. This is a somewhat common problem that also provides many different learning opportunities, so I decided to share my solution with you. The solution uses Python, bs4, libnotify and systemd timers, and the code can be found here.

Summary

The idea is very simple: create a scraper to fetch data from the page and see if the message has changed. There are multiple ways of doing that, but I decided to go with BeautifulSoup from Python's bs4 package.

With that script in hand, I set up a systemd timer to run it every hour. This way I will not be flooded with notifications, and I will still find out fairly quickly when the message has changed, so I can act and schedule my appointment.

This is their page and the message in question is:

The IND desks can be visited for urgent matters by appointment only up to and including 28 April. You can only make an appointment at an IND desk to collect your first regular residence document. The condition for this is that you have travelled to the Netherlands with a Provisional Residence Permit (mvv). And that you need a residence document for example to apply for health insurance, or to register in the Personal Records Database (BRP) at your town hall (gemeente). Does this apply to you? Make an appointment to visit an IND desk by calling the IND's information line at 088 043 0430 (standard charges apply). From abroad, please call +31 88 043 0430. You cannot make this appointment online.

The scraping

Ok, enough talking, let's see some code.

  1. We are gonna need to import a few modules:
import requests
from bs4 import BeautifulSoup
import subprocess

requests is used to get the data from the page, BeautifulSoup is used to parse the data and subprocess is used to launch the notify-send command from within Python.
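As a minimal sketch, the notify-send call could be wrapped in a small helper. The names build_notify_command and notify are hypothetical and not part of the original script. The sketch also uses subprocess.run instead of subprocess.Popen, on the assumption that it is preferable for the script to wait until the notification has been sent before it exits:

```python
import shutil
import subprocess


def build_notify_command(message, urgency="critical", binary="/usr/bin/notify-send"):
    """Build the argument list for notify-send (urgency: low, normal or critical)."""
    return [binary, "-u", urgency, message]


def notify(message, urgency="critical", binary="/usr/bin/notify-send"):
    """Send a desktop notification; return False if notify-send is not available."""
    command = build_notify_command(message, urgency, binary)
    if shutil.which(command[0]) is None:
        # The binary is missing or not executable, so there is nothing to run.
        return False
    # run() blocks until notify-send has finished, unlike Popen.
    subprocess.run(command, check=False)
    return True
```

Passing the command as a list, as the original code does, avoids any shell quoting issues when the message contains quotes or other special characters.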

  2. Next, we fetch the website data.
url = 'https://ind.nl/en/Pages/Making-an-appointment-online.aspx'
response = requests.get(url)

if response.status_code == 200:
    process_website(response)
else:
    subprocess.Popen(["/usr/bin/notify-send", "-u", "critical", "IND site offline"])

Here we request the page and check that the server returned OK (200). If so, we scrape the page in search of our message; otherwise, we send a notification that the site is offline.
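The status check above covers the case where the server answers with an error, but a request can also fail before any answer arrives (no network, DNS failure, timeout). Here is a hedged sketch of handling both cases. Note that it swaps requests for the standard library's urllib.request, purely to stay dependency-free. The name fetch_page, its opener parameter and the 10-second timeout are assumptions made for illustration:

```python
import urllib.request


def fetch_page(url, timeout=10, opener=urllib.request.urlopen):
    """Return the page body as text, or None if the page could not be fetched."""
    try:
        with opener(url, timeout=timeout) as response:
            if response.status != 200:
                return None
            return response.read().decode("utf-8", errors="replace")
    except OSError:
        # URLError, HTTPError and socket timeouts are all subclasses of OSError.
        return None
```

With something like this, the "IND site offline" notification could be sent whenever fetch_page returns None, regardless of why the request failed.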

  3. At this point, I had to make a choice: should I look for the message anywhere on the page, or only at the location where it currently resides? I decided to go for the latter because, if the message moved, something important might have been added to the page.

Here is a sample of the HTML from the page, which contains the message we want:

<strong class="ms-rteBackColor-4">
    The IND desks can be visited for urgent matters by appointment only up to and including&#160;28 April. 
    You&#160;can only make an appointment at an IND desk to collect&#160;your first regular residence document.
    The condition for this is that you have travelled to the Netherlands with a Provisional Residence Permit (mvv).
    And that you need a residence document for example to apply for health insurance, or to register in the Personal Records Database (BRP) at your town hall (gemeente). 
    Does this apply to you? Make an appointment to visit an IND desk by calling the IND's information line at 088 043 0430 (standard charges apply).
    From abroad, please call +31 88 043 0430. You cannot make this appointment
        <strong class="ms-rteBackColor-4"> online. </strong>
 </strong>

Let's have a look at the implementation of process_website.

First we create an instance of BeautifulSoup and find all the strong tags.

    soup = BeautifulSoup(response.text, "html.parser")

    st = soup.find_all("strong")

Next, we store the contents of the tag that holds the message in a variable, so we can reuse it in the notification in case the message changes.

    notify_str = st[1].contents[0]

Finally, we compare it to our message and notify the result via subprocess.Popen(["/usr/bin/notify-send", "-u", "critical", "text"]):

    if "The IND desks can be visited for urgent matters by appointment only up to and including\xa028 April. You\xa0can only make an appointment at an IND desk to collect\xa0your first regular residence document. The condition for this is that you have travelled to the Netherlands with a Provisional Residence Permit (mvv). And that you need a residence document for example to apply for health insurance, or to register in the Personal Records Database (BRP) at your town hall (gemeente). Does this apply to you? Make an appointment to visit an IND desk by calling the IND's information line at 088 043 0430 (standard charges apply). From abroad, please call +31 88 043 0430. You cannot make this appointment" not in notify_str:
        subprocess.Popen(["/usr/bin/notify-send", "-u", "critical", notify_str])
    else:
        subprocess.Popen(["/usr/bin/notify-send", "-u", "critical", "No change to the IND message."])

The complete code of our solution is:

from bs4 import BeautifulSoup
import requests
import subprocess


def process_website(response):
    soup = BeautifulSoup(response.text, "html.parser")

    st = soup.find_all("strong")

    notify_str = st[1].contents[0]

    if "The IND desks can be visited for urgent matters by appointment only up to and including\xa028 April. You\xa0can only make an appointment at an IND desk to collect\xa0your first regular residence document. The condition for this is that you have travelled to the Netherlands with a Provisional Residence Permit (mvv). And that you need a residence document for example to apply for health insurance, or to register in the Personal Records Database (BRP) at your town hall (gemeente). Does this apply to you? Make an appointment to visit an IND desk by calling the IND's information line at 088 043 0430 (standard charges apply). From abroad, please call +31 88 043 0430. You cannot make this appointment" not in notify_str:
        subprocess.Popen(["/usr/bin/notify-send", "-u", "critical", notify_str])
    else:
        subprocess.Popen(["/usr/bin/notify-send", "-u", "critical", "No change to the IND message."])


url = 'https://ind.nl/en/Pages/Making-an-appointment-online.aspx'

response = requests.get(url)

if response.status_code == 200:
    process_website(response)
else:
    subprocess.Popen(["/usr/bin/notify-send", "-u", "critical", "IND site offline"])
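The script as written sends a notification on every run, even when nothing has changed. A possible refinement, sketched here under the assumption that the script is allowed to write a small state file, is to remember a hash of the last message seen and only report a change when it differs. The function name message_changed and the state file location are hypothetical:

```python
import hashlib
from pathlib import Path


def message_changed(message, state_file):
    """Return True when message differs from the one recorded on the previous run."""
    digest = hashlib.sha256(message.encode("utf-8")).hexdigest()
    state_path = Path(state_file)
    previous = state_path.read_text() if state_path.exists() else None
    if previous == digest:
        return False
    # First run or a new message: record it so the next run can compare against it.
    state_path.write_text(digest)
    return True
```

Note that the very first run counts as a change, because there is nothing to compare against yet.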

Automation

Now, that code is nice, but what we really want is to have it running as often as we desire, so we don't miss any new messages.

For that, I decided to go with systemd timers, because I had never used them and this was a good opportunity to learn them. They also have the Persistent option, which catches up on a run that was missed while the system was turned off. Without it, if I had configured the timer to run once a day and my computer happened to be off at that time, I would miss the check for that day. Perhaps this application doesn't require this level of reliability, but several others, such as automating a backup script, do.

Since I am using notifications, and I want them to be issued only once my graphical interface has loaded, I first created a target that lets me know when xsession is running, and placed it at ~/.config/systemd/user/xsession.target, with the following contents:

[Unit]
Description=Xsession running
BindsTo=graphical-session.target

After that, I added a couple of lines to my bspwmrc to start the target after my window manager had launched:

systemctl --user import-environment PATH DBUS_SESSION_BUS_ADDRESS
systemctl --no-block --user start xsession.target

Next, I created the service which will run the scraper script, named check_ind.service and placed in the same directory as xsession.target:

[Unit]
Description=Scrape the IND website to see if the message has changed.
PartOf=graphical-session.target
 
[Service]
ExecStart=/usr/bin/python /path/to/script/main.py

Next, I created check_ind.timer to configure how often I want the script to run, and placed it in the same directory. The file must have the same base name as the service, because by default a timer activates the service unit that shares its name. I made it persistent and made it depend on the target that we had created earlier.

[Unit]
Description=Schedule IND scraper

[Timer]
OnCalendar=hourly
Persistent=true

[Install]
WantedBy=xsession.target

Finally, I just need to tell systemd to enable it, and we are good to go!

systemctl --user enable --now check_ind.timer

I am sure there are many ways to improve the solution, so please, let me know what you think and how you would tackle the problem!

Sources

To get this working, I relied mainly on two articles: this by Julia Kho had simple steps to get the scraping working, and this by On Air Netlines had all that I needed to get systemd timers running.
