@noamross
Last active December 25, 2020 08:26
Some examples of using the plyr package for data manipulation
## Some examples using the package plyr
library(plyr)
## Example dataset from ggplot
library(ggplot2)
data(mpg)
str(mpg)
## Simplify the dataset
data <- mpg[, c("manufacturer", "drv", "cty", "hwy")]  # same columns as mpg[, c(1, 7:9)]
str(data)
## Summarising/ Aggregating Data
ddply(data, .(manufacturer), summarize, avgcty = mean(cty))
## you can perform multiple functions in a single call
ddply(data, .(manufacturer), summarize, avgcty = mean(cty), sdcty = sd(cty), maxhwy = max(hwy))
## you can summarize data by a combination of variables/factors
ddply(data, .(manufacturer, drv), summarize, avgcty = mean(cty), sdcty = sd(cty), maxhwy = max(hwy))
## note the package reshape/reshape2 is an elegant alternative for aggregating many variables at one time
## note the differences between the commands "summarize" and "transform"
ddply(data, .(drv), summarize, avgcty = mean(cty))
ddply(data, .(drv), transform, avgcty = mean(cty))
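## A quick way to see the difference (a sketch, not part of the original script):
## summarize collapses each group to a single row, while transform keeps every
## original row and appends the group mean as a new column.
nrow(ddply(data, .(drv), summarize, avgcty = mean(cty)))  # one row per drv level
nrow(ddply(data, .(drv), transform, avgcty = mean(cty)))  # same number of rows as data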
## transform is very useful for standardizing/normalizing within groups
ddply(data, .(drv), transform, delta = mean(cty)-cty)
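## One possible variant (a sketch; the column name zcty is made up for illustration):
## a within-group z-score, i.e. each car's deviation from its drivetrain mean
## expressed in units of that drivetrain's standard deviation.
ddply(data, .(drv), transform, zcty = (cty - mean(cty)) / sd(cty))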
## Now let's use plyr to run a simple loop
## We'll ask the question: does city mpg differ between car manufacturers within each drivetrain class (drv: 4 = four-wheel drive, f = front-wheel drive, r = rear-wheel drive)? Let's automate these ANOVAs and extract the F-statistic and P-value from each one.
## Step 1: Create a function to run the ANOVA
model <- function(data) { aov(cty~manufacturer, data=data) }
## Step 2: Use plyr to fit the model for each drivetrain and store the fits in a list (called anova.output). In the name dlply, the first letter gives the input type (d = data frame) and the second gives the output type (l = list).
anova.output <- dlply(data, .(drv), model)
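## To inspect the result (a sketch): the list holds one fitted aov object per
## drv level, named by that level, so a single model can be pulled out by name.
names(anova.output)
summary(anova.output[["f"]])  # ANOVA table for the front-wheel-drive cars only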
## Step 3: Create a function that tells R where to find the F-statistic and P-value in each model's output. They are somewhat buried in the summary object, so don't worry about the messy indexing here; what matters is that it points R to where the F-statistics and P-values are stored.
juicy <- function(x) { c(summary(x)[[1]][["F value"]][[1]],
summary(x)[[1]][["Pr(>F)"]][[1]]) }
## Step 4: Extract the components of the model output from the list created in the previous step. In the name ldply, the input is a list and the output is a data frame. Note that since the input is a list, we don't need the 2nd argument (the variable(s) to split by), because the function is applied to every element of the list.
ldply(anova.output, juicy)
## The data frame shows F-statistics (V1) and P-values (V2) for the ANOVAs by drivetrain.
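## A variant that returns named values, so the columns come out labelled rather
## than as V1 and V2 (a sketch; juicy.named, Fstat and Pval are hypothetical names,
## not part of the original script):
juicy.named <- function(x) {
  anova.table <- summary(x)[[1]]  # ANOVA table of the fitted model
  c(Fstat = anova.table[["F value"]][[1]],  # first row is the manufacturer term
    Pval  = anova.table[["Pr(>F)"]][[1]])
}
ldply(anova.output, juicy.named)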
## We could always condense some of the above steps as well:
anova.output <- dlply(data, .(drv), function(data) aov(cty~manufacturer, data=data))
ldply(anova.output, function(x) { c(summary(x)[[1]][["F value"]][[1]], summary(x)[[1]][["Pr(>F)"]][[1]]) })
## Note that plyr provides many shortcuts, such as the functions colwise(), each() and splat(). You can always refer to the original article: http://www.jstatsoft.org/v40/i01/ for more on this.
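## For instance (a sketch of two of these helpers, using the data frame from above):
## each() bundles several functions into one call, and colwise() turns a function
## into one that is applied to every selected column of a data frame.
each(min, max, mean)(data$cty)
ddply(data, .(drv), colwise(mean, .(cty, hwy)))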
famingw commented Oct 24, 2017

Thanks, very helpful
