
Web Scraping & Data Analysis Cheat Sheet

Case Study

Definitions

  • Web scraping: Collecting data from the internet in an automated way
  • API: Application Programming Interface. In the context of web scraping, it is a system used by website owners to monitor and control how data exits their platform.
  • HTTP Methods: (e.g. POST, GET, PUT, DELETE) For web scraping we're only interested in what are called "GET requests": requests made to the website's server for information. With that request, you include the type of information you need and usually an authorization token.
  • Rate limiting: The speed limit placed on programmers that prevents them from making too many requests at once and overworking a site's servers. This varies from site to site.
  • JSON: JavaScript Object Notation. A lightweight data format used throughout the web. If you receive a response through an API, it will almost certainly be in this format (see the sketch after this list).
  • HTML: HyperText Markup Language. This is the most fundamental web file. It is essentially containers and text. It is made of "tags" (e.g. <p>text</p> as a container for paragraphs, <img src="apple.png" alt="Apple"> for images) that are nested within each other. Its head contains metadata and imports for styling and code.
  • CSS: Cascading Style Sheets. Using special selectors, these file types apply style to a web page. For web scraping, learning CSS isn't too necessary; however, the way that CSS selects parts of HTML is often used by scraping programs.
  • JavaScript: The programming language run by your web browser.
  • Wrapper: In web scraping a wrapper is a library of code that translates the API into a language that you're comfortable programming in.
  • Python: The language most often used for writing quick scripts for web scraping and data analysis. If you are using an API, there's a good chance that there's a wrapper in Python. Python has also been embraced by the data science community and has many libraries to support data cleaning and visualization.
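
To make the JSON definition concrete, here's a minimal sketch of parsing a small, made-up JSON string with Python's built-in json module:

import json

# A hypothetical JSON response as it might arrive from an API
raw = '{"name": "Apple", "tags": ["fruit", "red"], "in_stock": true}'

data = json.loads(raw)  # parse the JSON string into a Python dict
print(data["tags"][0])  # -> fruit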

Popular APIs

  • Wikipedia
  • Twitter
  • Reddit
  • Spotify
  • more...
    • Note: Facebook has an API, but it explicitly disallows scraping of its platform. Tools exist that scrape Facebook in an automated fashion, but using them is legally dubious, so proceed with caution.

Typical Workflow

In general, you don't want to change your collected data directly. Copy it, and pare that copy down to what you really need. The pandas and json libraries should be useful in parsing and curating your dataset.
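
As a minimal sketch (the file name and field names here are made up), paring a raw JSON dump down to a working copy might look like:

import json
import pandas as pd

# Load the raw data you collected; the original file stays untouched
with open('raw-data.json') as f:
    raw = json.load(f)

df = pd.DataFrame(raw)
# Work on a pared-down copy with only the fields you need
subset = df[['id', 'date', 'text']].copy()
subset.to_csv('working-copy.csv', index=False)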

To answer more interesting research questions, you can enrich your dataset with fields that you create and sort manually. However, understand that this process always (always) takes longer than you think. Do a quick back-of-the-napkin calculation to make sure the time you're prepared to sink into curating the dataset is worth the research question your work would answer.

After you create your dataset, validate it. Ensure that the assumptions you have about your dataset are accurate. For example, are fields like "ID" actually unique? Do categorical fields, like days of the week, contain only those seven values? Especially once your dataset grows beyond 10k items, you're likely to miss these errors unless you check for them automatically.
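
A minimal sketch of automated checks like these (the column names are hypothetical):

import pandas as pd

df = pd.read_csv('working-copy.csv')

# IDs should be unique
assert df['id'].is_unique, "Duplicate IDs found"

# Categorical fields should contain only the expected values
weekdays = {'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'}
assert set(df['day_of_week']).issubset(weekdays), "Unexpected day-of-week values"

# Required fields shouldn't have missing values
assert df['date'].notna().all(), "Missing dates"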

After you finish curating your dataset, use a Jupyter notebook (or Google Colab) along with matplotlib and/or pandas to explore your research question.
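
For example, a quick exploration cell might look like this (again, the column names are hypothetical):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('working-copy.csv', parse_dates=['date'])

# Count items per month and plot the trend over time
counts = df.set_index('date').resample('M').size()
counts.plot()
plt.ylabel('items per month')
plt.show()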

What if there isn't a wrapper?

If there isn't a wrapper in the language you're comfortable programming in, you'll need to make the requests manually. In Python, you can install the requests library for this.

Your code would look something like this:

import requests
import json

# Request data from the API endpoint (replace with the real URL)
r = requests.get('API-URL')
# Parse the JSON response into a Python dictionary
new_dict = json.loads(r.text)

# Then perform analysis with new_dict...

This structure is nearly universal: request data from a URL, parse it as JSON, and then manipulate the data in your code. And it's fewer than 5 lines! Things get more complicated when you need to add parameters to the request; these are called URL query strings, and they're worth learning about. Also, site owners often require an authorization token so they can track what data you collect, so you'll need to include that with your request as well.
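
For example, a sketch of a request with query parameters and an authorization header (the URL, parameter names, and token are placeholders):

import requests

params = {'q': 'search term', 'limit': 50}        # becomes ?q=search+term&limit=50
headers = {'Authorization': 'Bearer YOUR-TOKEN'}  # many APIs expect a token here

r = requests.get('https://api.example.com/items', params=params, headers=headers)
r.raise_for_status()  # fail loudly on a bad response
data = r.json()       # shortcut for json.loads(r.text)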

What if there isn't an API?

If there isn't an API, you'll probably need to scrape the website manually, which puts you in murky waters legally. But assuming you've gotten permission from the site owner, you can write a piece of code that looks something like...

import requests
import json
import pandas as pd

r = requests.get('URL')
# Have pandas collect all tables from the webpage
dfs = pd.read_html(r.text)
# Select the first dataframe
df = dfs[0]

# Then perform analysis with df...
df.to_csv('my-data.csv')

Here I'm using the pandas read_html function to look for tables in the webpage and then export them as a CSV. This is useful if you're looking specifically for tabular data on a webpage. Otherwise, you'll probably need to use BeautifulSoup, a library that parses a webpage and makes it easy to select information within it.
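
A minimal BeautifulSoup sketch, assuming the page has elements you can target with a CSS-style selector (the URL and selector are placeholders):

import requests
from bs4 import BeautifulSoup

r = requests.get('https://example.com/articles')
soup = BeautifulSoup(r.text, 'html.parser')

# CSS-style selection: every <a> inside elements with class "headline"
for link in soup.select('.headline a'):
    print(link.get_text(strip=True), link['href'])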

Sometimes data only appears once the page has been rendered in a browser (for example, when JavaScript loads it). To automate control of your browser, you can use the Selenium library. This process will be slower than simply using the requests library.
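
A minimal Selenium sketch (it assumes you have a browser and its driver installed; the URL is a placeholder):

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()  # opens a real browser window
driver.get('https://example.com/dynamic-page')
html = driver.page_source    # the HTML after JavaScript has run
driver.quit()

soup = BeautifulSoup(html, 'html.parser')
# ...then parse soup just as you would with requests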

Popular sources of inspiration

  • Subreddits
    • r/datasets: Search here for your datasets before scraping! Sometimes someone else has already done the work.
    • r/dataisbeautiful: Where data visualization fans show off their work. You'll often see the cutting edge here.
    • r/dataisugly: Negative inspiration.
  • Data journalists

Easy mistakes to avoid

  • Correlation is not causation.
  • Avoid fishing for trends where there aren't any, especially if you're exploring your dataset without a hypothesis. Slicing a dataset too many ways will inevitably inflate your family-wise error rate.
  • Look for context. If you're scraping data, chances are you're not performing a controlled scientific study, meaning that you will inevitably have to narrativize your data. Enriching your research with as much context as possible will prevent sensational results.

(Python-centric) Technologies Worth Learning

  • Pandas: This is the go-to library for manipulating tabular data. It can also be used to create plots.
  • Matplotlib: A simple library for creating plots of your data.
  • Jupyter Notebooks: A coding notebook that allows users to combine markdown, code, and plots. You can publish your code as a report, or just use the platform as a simple way to work with your data interactively. If you're not interested in downloading software, Google Colab is another option.
  • Selenium: A library to help you control your browser with Python.
  • BeautifulSoup: An HTML parsing library.

Resources to start learning

Don't get stuck in "tutorial hell", a place where programmers watch tutorial after tutorial without trying something on their own. And don't worry too much about starting with the "right" tutorial; they're essentially all the same, and the most important thing is getting started!
