François Briatte briatte

@briatte
briatte / porngram.r
Last active August 29, 2015 13:55
ggplot2 wrapper for http://porngram.sexualitics.org/ (uses elements from ngramr)
porngram <- function(x = c("hardcore", "softcore"), ..., adjust = "xxx") {
  library(ggplot2)
  library(XML)
  library(reshape)
  library(rPython)
  x = c(x, ...)
  if (length(x) > 10) {
    x <- x[1:10]
    warning("Porngram API limit: only using first 10 phrases.")
briatte / README.md
Last active August 29, 2015 13:56
aggregation functions, test #1: base, plyr, dplyr

This test collapses a 4-column data frame of real data from 500,000 rows to 91,000 by pasting and counting row values. Run on a 1.8GHz Intel Core i5, dplyr is 1.5 times faster than base R.

See this Gist for a simpler test over more than twice as many rows and roughly as many groups. In both tests, dplyr is as concise as plyr, as fast as data.table, and clearly more readable than base R.
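As a rough illustration of what "pasting and counting row values" can look like, here is a minimal sketch on toy data. The data frame `df`, its column names `a` to `d`, and the sizes are assumptions made for the example, not the data used in the test. The dplyr step only runs if the package is installed.

```r
# Toy stand-in for the real data (hypothetical values): 4 columns, many repeated rows.
set.seed(1)
n_rows <- 5000
df <- data.frame(
  a = sample(letters[1:5], n_rows, replace = TRUE),
  b = sample(letters[1:5], n_rows, replace = TRUE),
  c = sample(1:3, n_rows, replace = TRUE),
  d = sample(1:3, n_rows, replace = TRUE),
  stringsAsFactors = FALSE
)

# Base R, variant 1: paste the four columns into one key, then count each key.
key <- paste(df$a, df$b, df$c, df$d, sep = "_")
base_counts <- as.data.frame(table(key), stringsAsFactors = FALSE)
names(base_counts) <- c("key", "n")

# Base R, variant 2: aggregate() summing a constant column of ones per group.
agg_counts <- aggregate(n ~ a + b + c + d, data = cbind(df, n = 1), FUN = sum)

# dplyr, only if it is installed: count() groups by the listed columns.
if (requireNamespace("dplyr", quietly = TRUE)) {
  dplyr_counts <- dplyr::count(df, a, b, c, d)
  stopifnot(nrow(dplyr_counts) == nrow(agg_counts))
}

# Both base variants should find the same groups and account for every row.
stopifnot(nrow(base_counts) == nrow(agg_counts), sum(agg_counts$n) == n_rows)
```

The final check confirms that the approaches agree on the number of groups, which is a sensible thing to verify before comparing their speed.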

briatte / README.md
Last active August 29, 2015 13:57
aggregation functions, test #2: base, dplyr, data.table

Here's a simple timing test of aggregation functions in R, using 1.3 million rows and 80,000 groups of real data on a 1.8GHz Intel Core i5. Thanks to Arun Srinivasan for helpful comments.

The fastest function to run through the data.frame benchmark is data.table, which runs twice as fast as dplyr, which in turn runs ten times faster than base R.

For a benchmark that includes plyr, see this earlier Gist for a computationally more intensive test on half a million rows, where dplyr still runs 1.5 times faster than aggregate in base R.
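A timing comparison of this kind can be sketched with `system.time()`. The example below uses small simulated data rather than the 1.3 million rows of the test; the `elapsed()` helper and the column names `g` and `v` are hypothetical, and the data.table step only runs if the package is installed. Timings on data this small are noisy and should not be read as a benchmark result.

```r
# Simulated data (hypothetical sizes): one grouping column, one numeric column.
set.seed(2)
n_rows <- 20000
n_groups <- 500
df <- data.frame(g = sample(n_groups, n_rows, replace = TRUE), v = runif(n_rows))

# Hypothetical helper: elapsed seconds for a single call of f().
elapsed <- function(f) unname(system.time(f())["elapsed"])

# Time two base R ways of summing v within each group.
timings <- c(
  aggregate = elapsed(function() aggregate(v ~ g, data = df, FUN = sum)),
  tapply    = elapsed(function() tapply(df$v, df$g, sum))
)

# data.table, only if it is installed.
if (requireNamespace("data.table", quietly = TRUE)) {
  dt <- data.table::as.data.table(df)
  timings["data.table"] <- elapsed(function() dt[, sum(v), by = "g"])
}

# Elapsed seconds per approach.
print(timings)
```

A single run per approach keeps the sketch short; repeating each call several times and comparing medians would give steadier numbers.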

Both tests confirm what W. Andrew Barr blogged on dplyr:

the 2 most important improvements in dplyr are >

Improving access to panel series data for social scientists: the psData package

GitHub repository: https://github.com/rOpenGov/psData

Social scientists have access to many electronically available panel series datasets. However, downloading, cleaning, and merging them together is time-consuming and error-prone: for example, using Reinhart and Rogoff's data on the fiscal costs of the financial crisis involves downloading, cleaning, and merging 4 Excel files with over 70 individual sheets, one for each country’s data. Furthermore, because such datasets are not bundled in a format that is easy to manipulate, many of them are not updated on a regular basis.

In this talk, we introduce the psData package for the R statistical software. This package is being developed under the rOpenGov framework to solve two problems:

  1. Time wasted by social scientists downloading, cleaning, and transforming commo
briatte / declarations.r
Created July 24, 2014 12:29
download all asset declarations from French MPs, July 2014
# parse XPath syntax from well-formed HTML
library(XML)
# complete archive will take ~ 1.4 GB on disk
dir.create("declarations", showWarnings = FALSE)
# finds 941 MPs on 2014-07-24 at website launch
h = htmlParse("http://www.hatvp.fr/consulter-les-declarations-rechercher.html")
h = paste0("http://www.hatvp.fr/", xpathSApply(h, "//div[@id='annuaire']/*/*/*/a/@href"))
briatte / fix.r
Created October 2, 2014 04:49
fix R locale
system("defaults write org.R-project.R force.LANG en_US.UTF-8")
| End of fieldwork / election date | CON | LAB | LIB DEM | OTHER | CON LEAD OVER LABOUR | Sample | Fieldwork dates |
|---|---|---|---|---|---|---|---|
| 15-06-1984 | 37% | 38% | 23% | 2% | -1% | n/a | June, 1984 |
| 15-07-1984 | 34% | 39% | 26% | 1% | -5% | n/a | July, 1984 |
| 15-08-1984 | 36% | 39% | 24% | 1% | -3% | n/a | Aug, 1984 |
| 15-09-1984 | 39% | 38% | 21% | 2% | 1% | n/a | Sep 1984 |
| 15-10-1984 | 38% | 36% | 24% | 2% | 2% | n/a | Oct, 1984 |
| 15-11-1984 | 42% | 33% | 24% | 1% | 9% | n/a | Nov, 1984 |
| 15-12-1984 | 41% | 32% | 26% | 1% | 9% | n/a | Dec, 1984 |
| 15-01-1985 | 41% | 33% | 25% | 1% | 8% | n/a | Jan, 1985 |
| 15-02-1985 | 38% | 36% | 25% | 1% | 2% | n/a | Feb, 1985 |
briatte / seance1.r
Last active August 29, 2015 14:07
A demo of R + ggplot2, using Guardian/ICM polling data.
# A demo of R + ggplot2, using Guardian/ICM polling data.
# 2014-10-02
# Load packages.
pkgs = c("httr", "ggplot2", "lubridate", "RColorBrewer", "reshape2")
pkgs = lapply(pkgs, FUN = function(x) {
  if(!require(x, character.only = TRUE)) {
    install.packages(x, quiet = TRUE)
|  | Country | Year | growth | ratio |
|---|---|---|---|---|
| 147 | Australia | 1946 | -3.55795148247978 | 190.419080068143 |
| 148 | Australia | 1947 | 2.45947456679709 | 177.321371355335 |
| 149 | Australia | 1948 | 6.43753409710857 | 148.929810515079 |
| 150 | Australia | 1949 | 6.61199384930806 | 125.828698553949 |
| 151 | Australia | 1950 | 6.92020124184758 | 109.809397999623 |
| 152 | Australia | 1951 | 4.27261154812808 | 87.0944792259533 |
| 153 | Australia | 1952 | 0.904651599574291 | 86.0664381755866 |
| 154 | Australia | 1953 | 3.11928028540407 | 79.8650221084478 |
| 155 | Australia | 1954 | 6.21681382650683 | 76.8467028869713 |
briatte / seance2.r
Last active August 29, 2015 14:07
An overview of the Reinhart & Rogoff data, from an exercise by Cosma Shalizi.
# An overview of the Reinhart & Rogoff data, from an exercise by Cosma Shalizi.
# 2014-10-09
#
# package
#
library(ggplot2)
#