# Load the packages we’re going to be using:
# Alongside the usual stuff like tidyverse and magrittr, we’ll be using rvest for some web-scraping, jsonlite to parse some JSON, and extrafont to load some nice custom fonts
needs(tidyverse, magrittr, rvest, jsonlite, extrafont)
# Before we go on, two things to note:
# First, on web scraping:
# You should always check the terms of the site you are extracting data from, to make sure scraping (often referred to as `crawling`) is not prohibited. One way to do this is to visit the website’s `robots.txt` page and ensure that a) there is nothing explicitly stating that crawlers are not permitted, and b) ideally, the site simply states that all user agents are permitted (indicated by a line saying `User-Agent: *`). Both of those are the case for our use-case today (see https://www.ultimatetennisstatistics.com/robots.txt).
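# As a quick sanity check, that robots.txt can be read straight from R. A minimal, deliberately naive sketch (for anything serious, the robotstxt package and its paths_allowed() helper are a more robust fit):
robots <- readLines("https://www.ultimatetennisstatistics.com/robots.txt")
any(grepl("User-agent: *", robots, fixed = TRUE))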
# And second, about those custom fonts:
# The packages we'll be using
packages <- c("rvest","dplyr","tidyr","pipeR","ggplot2","stringr","data.table")
# From those packages, which ones are not yet installed?
newPackages <- packages[!(packages %in% as.character(installed.packages()[,"Package"]))]
# If any weren't already installed, install them now
if(length(newPackages)) install.packages(newPackages)
# Now make sure all necessary packages are loaded
invisible(lapply(packages, library, character.only = TRUE))
@dannguyen
dannguyen / faa-333-pdf-gathering.md
Last active June 19, 2021 13:18
Using wget + grep to explore inconveniently organized federal data (FAA Section 333 Exemptions)

if !database: wget + grep

The Federal Aviation Administration is posting PDFs of the Section 333 exemptions that it grants, i.e. the exemptions for operators who want to fly drones commercially before the FAA finishes its rulemaking. A journalist wanted to look for exemptions granted to operators in a given U.S. state. But the FAA doesn't appear to have an easy-to-read data file to use and doesn't otherwise list exemptions by location of operator.

However, since their exemptions page is just one giant HTML table listing the PDFs, we can use wget to fetch all the PDFs, run pdftotext on each file, and then [grep](https://medium.com/@rualthanzauva/grep-was-a-private-command-of-m
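To keep with the R used elsewhere on this page, the same idea can be sketched without wget: rvest pulls the PDF links out of that one big table, and pdftools::pdf_text() stands in for pdftotext. This is a rough sketch only; the page URL, the CSS selector, and the example state are assumptions, not verified:

library(rvest)
library(pdftools)

faa_url <- "https://www.faa.gov/uas/section_333/333_authorizations/"  # illustrative URL, not verified
page <- read_html(faa_url)
pdf_urls <- page %>% html_nodes("table a") %>% html_attr("href")
pdf_urls <- xml2::url_absolute(pdf_urls[grepl("\\.pdf$", pdf_urls)], faa_url)

# Fetch each PDF and keep the ones whose text mentions the state of interest
hits <- Filter(function(u) {
  tmp <- tempfile(fileext = ".pdf")
  download.file(u, tmp, mode = "wb", quiet = TRUE)
  any(grepl("Texas", pdf_text(tmp)))  # "Texas" is just an example
}, pdf_urls)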

@primaryobjects
primaryobjects / saveChart.R
Last active October 9, 2021 02:21
Add text outside the chart area of a ggplot2 graph in R and save the resulting chart to a png file.
require(ggplot2)
require(gridExtra)
require(grid)

saveChart <- function(chart, fileName) {
  # Draw attribution.
  chart <- chart + geom_text(aes(label = 'sentimentview.com', x = 2.5, y = 0), hjust = -2, vjust = 6, color = "#a0a0a0", size = 3.5)
  # Disable the clip-area so the attribution can render outside the panel.
  gt <- ggplot_gtable(ggplot_build(chart))
  gt$layout$clip[gt$layout$name == "panel"] <- "off"
  # Save the resulting chart, out-of-panel text included, to a png file.
  png(fileName)
  grid.draw(gt)
  dev.off()
}
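A quick usage sketch (mtcars ships with R; the output file name is arbitrary):
p <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()
saveChart(p, "mtcars-chart.png")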
@briatte
briatte / charlie-data.r
Last active August 29, 2015 14:16
scrape front covers from Charlie Hebdo – source: http://stripsjournal.canalblog.com/
#
# download 338 Charlie Hebdo covers with keywords
#
library(dplyr)
library(XML)
library(lubridate)
library(stringr)
library(ggplot2)
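# A rough sketch of the download step that would follow (the archive URL and the XPath are guesses about the blog's layout, not verified against the site):
archive <- htmlParse("http://stripsjournal.canalblog.com/archives/index.html")
img_urls <- xpathSApply(archive, "//img/@src")
covers <- img_urls[str_detect(img_urls, "charlie")]  # keyword filter, illustrative only
for (u in covers) download.file(u, destfile = basename(u), mode = "wb")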
import networkx as nx
from lxml import etree
import re
import itertools

def getNamesInAction(action, textNames, nameDict):
    # Go through the names in order of decreasing length: find each one in the
    # action, record it, then remove it before continuing, so shorter names
    # nested inside longer ones are not matched twice.
    act = action
    sortNames = sorted(textNames, key=len, reverse=True)
    returnNames = []
    for name in sortNames:
        if name in act:
            # Assumes nameDict maps a surface name to its canonical form
            returnNames.append(nameDict.get(name, name))
            act = act.replace(name, '')
    return returnNames
@milesgrimshaw
milesgrimshaw / Kickstarter_Geocoding.R
Last active August 29, 2015 13:57
Data prep for geocoding
# Load desired packages
library(lubridate)
library(stringr)
library(ggplot2)
library(scales)
# Set the working directory
getwd()
setwd("~/Desktop/Patreon/")