François Briatte (briatte)

## porngram.r
porngram <- function(x = c("hardcore", "softcore"), ..., adjust = "xxx") {
  library(ggplot2)
  library(XML)
  library(reshape)
  library(rPython)

  x = c(x, ...)
  if (length(x) > 10) {
    x <- x[1:10]
    warning("Porngram API limit: only using first 10 phrases.")

## README.md

      
aggregation functions, test #1: base, plyr, dplyr

Last active: August 29, 2015 13:56

Collapsing a 4-column data frame of real data from 500,000 rows to 91,000 by pasting and counting row values. Execution on a 1.8GHz Intel Core i5 shows that dplyr is 1.5 times faster than base R.
See this Gist for a simpler test on more than twice as many rows and roughly as many groups. In both tests, dplyr is as concise as plyr, as fast as data.table, and clearly more readable than base R.
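
As a rough illustration of what "pasting and counting row values" can look like, here is a minimal sketch on simulated data. The data frame `df`, its column names `v1` to `v4`, and the helper column `rows` are assumptions made for this example; they are not the real data or the code used in the test, and the dplyr call only runs if that package is installed.

```r
# Toy stand-in for the real data: 4 columns with many repeated rows.
set.seed(1)
n <- 1000
df <- data.frame(
  v1 = sample(letters[1:3], n, replace = TRUE),
  v2 = sample(letters[1:3], n, replace = TRUE),
  v3 = sample(1:2, n, replace = TRUE),
  v4 = sample(1:2, n, replace = TRUE),
  stringsAsFactors = FALSE
)

# Base R, version 1: paste each row into a single key, then count the keys.
key <- do.call(paste, c(df, sep = "_"))
base_counts <- as.data.frame(table(key), stringsAsFactors = FALSE)
names(base_counts) <- c("key", "n")

# Base R, version 2: count rows per combination of columns with aggregate().
agg_counts <- aggregate(rows ~ v1 + v2 + v3 + v4,
                        data = cbind(df, rows = 1L), FUN = sum)

# Optional dplyr equivalent, skipped when the package is not available.
if (requireNamespace("dplyr", quietly = TRUE)) {
  dplyr_counts <- dplyr::count(df, v1, v2, v3, v4)
}
```

Both base R versions return one row per distinct combination of values, which is the collapsed form described above.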

  
## README.md

      
aggregation functions, test #2: base, dplyr, data.table

Last active: August 29, 2015 13:57

Here's a simple timing test of aggregation functions in R, using 1.3 million rows and 80,000 groups of real data on a 1.8GHz Intel Core i5. Thanks to Arun Srinivasan for helpful comments.
The fastest function in the data.frame benchmark is data.table, which runs twice as fast as dplyr, which in turn runs ten times faster than base R.
For a benchmark that includes plyr, see this earlier Gist for a computationally more intensive test on half a million rows, where dplyr still runs 1.5 times faster than aggregate in base R.
Both tests confirm what W. Andrew Barr blogged on dplyr:

> the 2 most important improvements in dplyr are
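
A timing test of this kind can be sketched with `system.time()`. The example below uses simulated data that is much smaller than the real data set, and the names `df`, `group`, `value` and `timings` are assumptions made for illustration. It is not the benchmark code from this Gist, and the dplyr and data.table timings are only collected when those packages are installed.

```r
# Simulated data: far smaller than the 1.3 million rows used in the real test.
set.seed(42)
n_rows <- 1e5
n_groups <- 1e3
df <- data.frame(
  group = sample(n_groups, n_rows, replace = TRUE),
  value = runif(n_rows)
)

# Base R: sum of values per group with aggregate().
t_base <- system.time(
  res_base <- aggregate(value ~ group, data = df, FUN = sum)
)
timings <- c(base = t_base[["elapsed"]])

# dplyr, if available: group_by() followed by summarise().
if (requireNamespace("dplyr", quietly = TRUE)) {
  t_dplyr <- system.time(
    res_dplyr <- dplyr::summarise(dplyr::group_by(df, group), value = sum(value))
  )
  timings <- c(timings, dplyr = t_dplyr[["elapsed"]])
}

# data.table, if available: grouped sum with the by argument.
if (requireNamespace("data.table", quietly = TRUE)) {
  dt <- data.table::as.data.table(df)
  t_dt <- system.time(
    res_dt <- dt[, list(value = sum(value)), by = group]
  )
  timings <- c(timings, data.table = t_dt[["elapsed"]])
}

# Elapsed seconds per method; values depend on the machine and package versions.
print(timings)
```

Single runs on small data are noisy, so timings from this sketch should not be expected to reproduce the ratios reported above.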


## abstract.md

      
Last active: August 29, 2015 14:01

Improving access to panel series data for social scientists: the psData package


GitHub repository: https://github.com/rOpenGov/psData

Social scientists have access to many electronically available panel series datasets. However, downloading, cleaning, and merging them together is time-consuming and error-prone: for example, using Reinhart and Rogoff's data on the fiscal costs of the financial crisis involves downloading, cleaning, and merging 4 Excel files with over 70 individual sheets, one for each country’s data. Furthermore, because such datasets are not bundled in a format that is easy to manipulate, many of them are not updated on a regular basis.
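
To make the cleaning and merging step concrete, here is a minimal base R sketch that combines two small country-year panels. The data frames `growth` and `debt`, their column names, and all values are made up for illustration; the sketch does not use or describe the psData API.

```r
# Hypothetical panel 1: growth by country and year (made-up values).
growth <- data.frame(
  country = c("A", "A", "B", "B"),
  year    = c(2000, 2001, 2000, 2001),
  growth  = c(1.5, 2.0, 0.5, -1.0)
)

# Hypothetical panel 2: debt ratio, with different identifier names
# and one missing country-year (made-up values).
debt <- data.frame(
  iso   = c("A", "A", "B"),
  yr    = c(2000, 2001, 2001),
  ratio = c(60, 62, 95)
)

# Cleaning: harmonise the identifier names before merging.
names(debt)[names(debt) == "iso"] <- "country"
names(debt)[names(debt) == "yr"]  <- "year"

# Merging: keep every country-year found in either source (full outer join).
panel <- merge(growth, debt, by = c("country", "year"), all = TRUE)
panel <- panel[order(panel$country, panel$year), ]
```

With `all = TRUE`, a country-year present in only one source is kept and its missing variable is set to `NA`, which makes gaps in coverage visible instead of silently dropping rows.
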
In this talk, we introduce the psData package for the R statistical software. This package is being developed under the rOpenGov framework to solve two problems:

Time wasted by social scientists downloading, cleaning, and transforming commo


## declarations.r
# parse XPath syntax from well-formed HTML
library(XML)

# complete archive will take ~ 1.4 GB on disk
dir.create("declarations", showWarnings = FALSE)

# finds 941 MPs on 2014-07-24 at website launch
h = htmlParse("http://www.hatvp.fr/consulter-les-declarations-rechercher.html")
h = paste0("http://www.hatvp.fr/", xpathSApply(h, "//div[@id='annuaire']/*/*/*/a/@href"))

## fix.r
system("defaults write org.R-project.R force.LANG en_US.UTF-8")

## icm.polls.8413.csv

          
| End of fieldwork/ election date | CON | LAB | LIB DEM | OTHER | CON LEAD OVER LABOUR | Sample | Fieldwork dates |
|---|---|---|---|---|---|---|---|
| 15-06-1984 | 37% | 38% | 23% | 2% | -1% | n/a | June, 1984 |
| 15-07-1984 | 34% | 39% | 26% | 1% | -5% | n/a | July, 1984 |
| 15-08-1984 | 36% | 39% | 24% | 1% | -3% | n/a | Aug, 1984 |
| 15-09-1984 | 39% | 38% | 21% | 2% | 1% | n/a | Sep 1984 |
| 15-10-1984 | 38% | 36% | 24% | 2% | 2% | n/a | Oct, 1984 |
| 15-11-1984 | 42% | 33% | 24% | 1% | 9% | n/a | Nov, 1984 |
| 15-12-1984 | 41% | 32% | 26% | 1% | 9% | n/a | Dec, 1984 |
| 15-01-1985 | 41% | 33% | 25% | 1% | 8% | n/a | Jan, 1985 |
| 15-02-1985 | 38% | 36% | 25% | 1% | 2% | n/a | Feb, 1985 |

## seance1.r
# A demo of R + ggplot2, using Guardian/ICM polling data.
# 2014-10-02

# Load packages.
pkgs = c("httr", "ggplot2", "lubridate", "RColorBrewer", "reshape2")
pkgs = lapply(pkgs, FUN = function(x) {

  if(!require(x, character.only = TRUE)) {

    install.packages(x, quiet = TRUE)

## debt.csv

          
|  | Country | Year | growth | ratio |
|---|---|---|---|---|
| 147 | Australia | 1946 | -3.55795148247978 | 190.419080068143 |
| 148 | Australia | 1947 | 2.45947456679709 | 177.321371355335 |
| 149 | Australia | 1948 | 6.43753409710857 | 148.929810515079 |
| 150 | Australia | 1949 | 6.61199384930806 | 125.828698553949 |
| 151 | Australia | 1950 | 6.92020124184758 | 109.809397999623 |
| 152 | Australia | 1951 | 4.27261154812808 | 87.0944792259533 |
| 153 | Australia | 1952 | 0.904651599574291 | 86.0664381755866 |
| 154 | Australia | 1953 | 3.11928028540407 | 79.8650221084478 |
| 155 | Australia | 1954 | 6.21681382650683 | 76.8467028869713 |

## seance2.r
# An overview of the Reinhart & Rogoff data, from an exercise by Cosma Shalizi.
# 2014-10-09

#
# package
#

library(ggplot2)

#