Web Scraping on Linux with xdotool and Python

Original Creation: 2017

Read Time: 20 minutes

Introduction

In this post, I am going to demonstrate how to do some simple web scraping with xdotool and the pyperclip Python module.

Web scraping is an important skill to have if you want to work with public data on the internet. Sometimes you may need information from a website on a regular basis, but there is no neat API to send requests to. In these cases, web scraping may be necessary. Web scraping static HTML pages is easy enough: simply send some requests and parse the results that are returned. Things get a little trickier, though, when you are dealing with a site that has authentication layers or dynamically loaded content. What if you could get your data the same way a user with a keyboard and mouse would?

Disclaimer

Importantly, before I continue, I feel the need to say: please don't be a jerk when web scraping. Be mindful of the rate at which you are making requests to individual sites, especially small ones that may not be on well-equipped servers. Also, be ethical when deciding what data to collect and how. If it feels wrong, it probably is, so listen to yourself at least, if you don't listen to me. Speech over.

So what is xdotool?

xdotool is used to simulate keyboard and mouse activity on a Linux machine. The tool is similar to AutoHotkey on Windows. The project is open source and maintained by Jordan Sissel. You can view the project on GitHub here: https://github.com/jordansissel/xdotool

Why xdotool

I found xdotool after a botched Windows update forced me to re-image my laptop. I was using AutoHotkey for a number of tasks on a daily basis, and finding a suitable alternative was the last thing stopping me from finally removing Windows from all of my machines. You can view the Stack Exchange answer that opened my eyes here: https://unix.stackexchange.com/questions/165124/autohotkey-equivalent

Setting up xdotool

Getting started with xdotool is simple enough: it can be installed via the major package managers or compiled from source. In my case, I used the following to set it up on my machine running Manjaro GNOME.

sudo pacman -S xdotool
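If you are on a Debian- or Fedora-based distribution instead, the package is typically available under the same name (a quick sketch; verify the package name for your distribution):

sudo apt install xdotool   # Debian / Ubuntu
sudo dnf install xdotool   # Fedora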

After the package finished installing, I verified my version information and ran the examples provided here (https://www.linux.org/threads/xdotool-examples.10705/) to test that the tool was working correctly.
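As a quick sanity check, assuming the install succeeded, the following prints the installed version and the title of the currently focused window:

xdotool version
xdotool getactivewindow getwindowname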

To illustrate some simple web scraping with xdotool, we're going to create a short Bash script that opens a new tab in Firefox, navigates to a page, and then downloads the site assets to the default download location.

For this small script to work, a few conditions must be met:

- Firefox must be open and its window must be active when the scrape is performed
- The URL we want to scrape must be in the clipboard so that we can paste it into the address bar
- Firefox should be configured to ask where to save newly downloaded files

xdo_scraper.sh

#!/bin/bash

# function for simple keypress with short wait time and output message
k (){
    echo "$1"
    xdotool key "$2"
    sleep 1
}

# firefox must be open and active, url must be in the clipboard, and firefox should be configured to ask where to download new files
downloadContentAtClipboardURL (){
    k "Creating a new tab in firefox" "ctrl+t"
    k "Pasting url to scrape" "ctrl+v"
    k "Navigating  to url" "KP_Enter"
    k "Saving html at url to default download location" "ctrl+s"
    k "Choosing file system location for download" "KP_Enter"
    k "Replacing and existing files with same name" "KP_Enter"
    k "Closing the tab use to navigate to url" "ctrl+w"
}

echo "You have 3 seconds to make firefox the active window before we scrape with xdotool"
sleep 3
downloadContentAtClipboardURL
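To try the script, save it as xdo_scraper.sh, make it executable, and run it with Firefox open and a URL already on the clipboard:

chmod +x xdo_scraper.sh
./xdo_scraper.sh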

We could send individual keystrokes to type out the full URL instead of using the clipboard, but the clipboard technique works nicely if we want to kick off this script from a larger Python application where we can manipulate the clipboard at will.
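For reference, a minimal sketch of that keystroke-typing alternative could look like the following, assuming the URL is passed as the script's first argument (xdotool type sends the string one character at a time):

#!/bin/bash
# Sketch: type the url directly instead of pasting from the clipboard
url="$1"
xdotool key ctrl+t               # open a new tab
sleep 1
xdotool key ctrl+l               # focus the address bar
sleep 1
xdotool type --delay 100 "$url"  # type the url character by character
xdotool key KP_Enter             # navigate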

Kicking off our scraper with pyperclip and subprocess

In order to manipulate the clipboard and kick off our Bash script from Python, we are going to use the pyperclip and subprocess modules.

scrape.py

from pyperclip import copy
from subprocess import call

url_to_scrape = "https://en.wikipedia.org/wiki/Super_Mario_World"

# place the url on the clipboard, then run the scraper script
copy(url_to_scrape)
return_code = call(["bash", "xdo_scraper.sh"])

This Python script calls the xdo_scraper.sh Bash script. There is a short delay allowing the user to make the Firefox window active. After the script finishes running, the HTML of the Wikipedia page for Super Mario World will be in the default download location.

Next Steps

Now that the HTML of the desired site is downloaded, we can write some code to parse the important information from the page. But before we move on to parsing the HTML, there are some improvements to be made to the scraping script.

For one, the script should automatically open and use Firefox if it is not already open. Running keyboard and mouse automation on a system with the wrong window active can be catastrophic, or at least very annoying. Also, we should have some mechanism for checking that the file was completely downloaded before exiting the Python script.
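As a rough starting point for both improvements, here is a sketch in Bash; it assumes the Firefox window title contains "Mozilla Firefox" and that Firefox marks in-progress downloads with a .part file in ~/Downloads, both of which are worth verifying on your own setup:

#!/bin/bash
# Sketch: focus an existing Firefox window, or start one if none is found
win=$(xdotool search --onlyvisible --name "Mozilla Firefox" | head -n 1)
if [ -z "$win" ]; then
    firefox &
    sleep 5
    win=$(xdotool search --onlyvisible --name "Mozilla Firefox" | head -n 1)
fi
xdotool windowactivate "$win"

# Sketch: wait until no partial downloads (*.part) remain in the download folder
download_dir="$HOME/Downloads"
while ls "$download_dir"/*.part >/dev/null 2>&1; do
    sleep 1
done
echo "Download appears complete"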
