@jhofman
jhofman / icwsm.py
Created July 14, 2011 16:13
script to scrape pdfs and paper info for icwsm2011
#!/usr/bin/env python
from lxml import etree
from urllib import urlopen
if __name__=='__main__':
    url = 'http://www.aaai.org/ocs/index.php/ICWSM/ICWSM11/schedConf/presentations'
    tree = etree.parse(urlopen(url), etree.HTMLParser())
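The preview cuts off before the scraping itself, and the gist is Python 2 (`urllib.urlopen`) with lxml. As a rough sketch of the link-extraction step it would need, here is a Python 3, stdlib-only version; the sample HTML below is a stand-in, since the real ICWSM schedule page's markup isn't shown in the preview.

```python
# Collect hrefs of <a> tags pointing at PDFs, using only the standard
# library (the gist itself uses lxml on the live page instead).
from html.parser import HTMLParser

class PdfLinkParser(HTMLParser):
    """Record href attributes of <a> tags that end in .pdf."""
    def __init__(self):
        super().__init__()
        self.pdf_links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value and value.endswith('.pdf'):
                    self.pdf_links.append(value)

# stand-in markup; the actual schedule page's structure may differ
html = ('<html><body>'
        '<a href="/ocs/paper/view/2780/3229.pdf">Paper PDF</a>'
        '<a href="/ocs/schedConf/presentations">Schedule</a>'
        '</body></html>')

parser = PdfLinkParser()
parser.feed(html)
print(parser.pdf_links)  # → ['/ocs/paper/view/2780/3229.pdf']
```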
jhofman / scrapple.py
Last active December 24, 2015 08:49
checks apple.com for iphone 5s in-store pickup availability
#!/usr/bin/env python
#
# file: scrapple.py
#
# description: checks apple.com for iphone 5s in-store pickup availability
#
# usage: ./scrapple.py [zip] [att|verizon|sprint] [16|32|64] [grey|silver|gold]
#
# or in a crontab:
# */5 * * * * /path/to/scrapple.py 10012 verizon 32 grey && mailx -s 5s 2125551212@vtext.com
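The crontab line works because `&&` only runs `mailx` when the script exits with status 0, i.e. when a store has stock. A minimal sketch of that exit-status pattern (`check_availability` is a hypothetical stub here; the real script decides by scraping apple.com):

```python
# Exit-status pattern behind `./scrapple.py ... && mailx ...`:
# return 0 only when the phone is available, so cron sends the text
# message only on success.

def check_availability(zipcode, carrier, size, color):
    # hypothetical stand-in for the real apple.com scrape;
    # pretend nothing is in stock
    return False

def main(argv):
    zipcode, carrier, size, color = argv[1:5]
    return 0 if check_availability(zipcode, carrier, size, color) else 1

rc = main(['scrapple.py', '10012', 'verizon', '32', 'grey'])
print(rc)  # → 1: not available, so the `&& mailx` step never fires
```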
jhofman / bookmark_starred.py
Created January 16, 2014 17:05
mirrors starred github repositories to delicious
#!/usr/bin/env python
#
# file: bookmark_starred.py
#
# description: mirrors starred github repos to delicious
#
# usage: bookmark_starred.py GITHUB_USER DELICIOUS_USER
#
# requirements: requests
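Mirroring all starred repos means paging through the GitHub API, which advertises the next page in an RFC 5988 `Link` response header. A stdlib sketch of parsing that header (the sample value mimics GitHub's format; the gist itself uses `requests` for the actual calls):

```python
# Parse an RFC 5988 Link header into a {rel: url} map, as needed to
# follow GitHub API pagination when listing starred repos.
import re

def parse_link_header(value):
    """Map rel names ('next', 'last', ...) to their URLs."""
    links = {}
    for url, rel in re.findall(r'<([^>]+)>;\s*rel="([^"]+)"', value):
        links[rel] = url
    return links

header = ('<https://api.github.com/user/starred?page=2>; rel="next", '
          '<https://api.github.com/user/starred?page=5>; rel="last"')
links = parse_link_header(header)
print(links['next'])  # → https://api.github.com/user/starred?page=2
```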
jhofman / track_changes.sh
Last active January 28, 2016 20:23
word-style "track changes" with latexdiff and git
#!/bin/bash
#
# script to show word-style "track changes" from a previous git revision
#
if [ $# -lt 2 ]
then
    echo "usage: $0 <rev> <maintex>"
    echo " rev is the prefix of a git revision hash"
    echo " see 'git log' for revision hashes"
jhofman / scrape_income_dist.sh
Last active August 29, 2015 14:22
You Draw It
#!/bin/bash
#
# Scrape income distribution data from whatsmypercent.com
#
# Output is in incomes.csv (percentile,income)
#
# start at $100 / year
income=100
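The preview stops right after the starting value. What follows is one plausible shape for the rest, as a Python sketch: sweep income values upward and write `percentile,income` rows. Both the doubling step and `percentile_for` are assumptions; the real script queries whatsmypercent.com for each income.

```python
# Hypothetical sweep: double the income each step from $100 to $1M,
# writing one percentile,income row per step (the real lookup hits
# whatsmypercent.com; here it's a stub).
import csv, io

def percentile_for(income):
    # stand-in for the whatsmypercent.com lookup
    return min(99, income // 10000)

buf = io.StringIO()
writer = csv.writer(buf)
income = 100
while income <= 1_000_000:
    writer.writerow([percentile_for(income), income])
    income *= 2

print(buf.getvalue().splitlines()[0])  # → 0,100
```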
jhofman / dplyr_filter_ungroup.R
Created January 20, 2016 16:45
careful when filtering with many groups in dplyr
library(dplyr)
# create a dummy dataframe with 100,000 groups and 1,000,000 rows
# and partition by group_id
df <- data.frame(group_id=sample(1:1e5, 1e6, replace=T),
                 val=sample(1:100, 1e6, replace=T)) %>%
  group_by(group_id)
# filter rows with a value of 1 naively
system.time(df %>% filter(val == 1))
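The catch the gist is pointing at: `filter` on a grouped data frame evaluates per group, yet a row-wise condition like `val == 1` doesn't depend on grouping at all. A stdlib Python sketch showing the two routes select exactly the same rows:

```python
# Same filter, two routes: per-group (mimicking filter on a grouped
# data frame) vs. a single ungrouped pass.
rows = [{'group_id': g, 'val': v}
        for g, v in [(1, 1), (1, 5), (2, 1), (3, 2)]]

# per-group route: partition by group_id, then filter inside each group
groups = {}
for row in rows:
    groups.setdefault(row['group_id'], []).append(row)
per_group = [r for rs in groups.values() for r in rs if r['val'] == 1]

# direct route: one pass over all rows, no grouping overhead
direct = [r for r in rows if r['val'] == 1]

print(per_group == direct)  # → True
```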
jhofman / Makefile
Last active July 31, 2016 00:54
scrape nyc neighborhood populations from pediacities
all: pediacities_nyc_neighborhood_populations.csv

pediacities_nyc_neighborhood_populations.csv: pediacities_nyc_neighborhoods.json extract_neighborhood_populations.sh
	extract_neighborhood_populations.sh

pediacities_nyc_neighborhoods.json: download_neighborhood_pages.sh
	download_neighborhood_pages.sh
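The Makefile's logic in miniature: a target is rebuilt when it is missing or older than any of its prerequisites, which is why re-running `make` after downloading the JSON only reruns the extraction step. A sketch of that rule over modification times (the numbers are arbitrary stand-ins for mtimes):

```python
# make's core decision rule: rebuild a target iff it doesn't exist
# or any prerequisite is newer than it.

def needs_rebuild(target_mtime, dep_mtimes):
    """target_mtime is None when the target file doesn't exist yet."""
    if target_mtime is None:
        return True
    return any(d > target_mtime for d in dep_mtimes)

# e.g. the csv depends on the json and the extraction script
print(needs_rebuild(None, [10, 20]))  # → True: no csv yet, rebuild
print(needs_rebuild(30, [10, 20]))    # → False: csv newer than deps
```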
jhofman / changing_ggplot_legends.R
Created June 29, 2017 19:07
Which method do you prefer?
library(tidyverse)
library(forcats)
# The original plot
## This has an ugly legend title, maybe we should remove it and modify the labels
ggplot(mtcars, aes(x = mpg, y = disp, col = as.factor(cyl))) +
  geom_point()
# Approach 1: Modify the plot
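The preview cuts off before either approach's code, but the choice is where the relabeling happens: at plot time, or in the data before plotting. A language-neutral sketch of the second approach in Python, recoding values once so any later plot or legend picks up the new labels (the label text is illustrative):

```python
# Recode the grouping values in the data itself, analogous to
# relabeling a factor before handing it to ggplot.
cyl = [6, 6, 4, 8]
labels = {4: '4 cylinders', 6: '6 cylinders', 8: '8 cylinders'}

# one recoding step in the data; presentation code stays untouched
cyl_labeled = [labels[c] for c in cyl]
print(cyl_labeled[0])  # → 6 cylinders
```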
jhofman / scrapple_x.py
Last active November 12, 2017 17:55
checks apple.com for iphone x in-store pickup availability
#!/usr/bin/env python
#
# file: scrapple_x.py
#
# description: checks apple.com for iphone x in-store pickup availability
#
# usage: ./scrapple.py [zip] [att|verizon|sprint|tmobile] [64|256] [grey|silver]
#
# or in a crontab:
# */5 * * * * /path/to/scrapple.py 12345 tmobile 64 grey && mailx -s "iphone x" 2125551212@vtext.com
jhofman / filter_by_group_id.R
Last active July 31, 2018 13:37
a more efficient way to filter a grouped data frame?
library(tidyverse)
library(digest)
# create a dummy dataframe with 10,000 groups and 1,000,000 rows
# where group ids are md5 hashes of the integers from 1 to 10,000
set.seed(42)
md5 <- Vectorize(function(x) digest(x, algo="md5"))
df <- data.frame(group_id=sample(md5(1:1e4), 1e6, replace=T),
                 val=sample(1:100, 1e6, replace=T))
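The preview ends before the filtering step. One common efficient pattern for this problem (an assumption here, not necessarily the gist's answer) is to compute the qualifying group ids in a single ungrouped pass and then keep rows by set membership, instead of filtering group by group:

```python
# Two-pass filter: find groups containing val == 1, then keep all
# rows of those groups via a set-membership test.
rows = [('a', 1), ('a', 5), ('b', 2), ('c', 1)]

# pass 1: which groups contain a row with val == 1?
keep = {g for g, v in rows if v == 1}

# pass 2: one membership test per row; no per-group work at all
filtered = [(g, v) for g, v in rows if g in keep]

print(sorted(keep))  # → ['a', 'c']
```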