Some examples of using the plyr package for data manipulation
## Some examples using the plyr package
library(plyr)
## Example dataset from ggplot2
library(ggplot2)
data(mpg)
str(mpg)
## Simplify the dataset: keep manufacturer, drv, cty and hwy (columns 1 and 7 to 9)
data <- mpg[, c(1, 7:9)]
str(data)
## Summarising/aggregating data
ddply(data, .(manufacturer), summarize, avgcty = mean(cty))
## You can compute several summaries in a single call
ddply(data, .(manufacturer), summarize, avgcty = mean(cty), sdcty = sd(cty), maxhwy = max(hwy))
## You can summarize data by a combination of variables/factors
ddply(data, .(manufacturer, drv), summarize, avgcty = mean(cty), sdcty = sd(cty), maxhwy = max(hwy))
## Note: the reshape/reshape2 packages are an elegant alternative for aggregating many variables at one time
## Note the difference between "summarize" and "transform": summarize returns one row per group,
## while transform keeps every original row and adds the new column to it
ddply(data, .(drv), summarize, avgcty = mean(cty))
ddply(data, .(drv), transform, avgcty = mean(cty))
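## A quick way to check this, as a sketch (n.groups, by.summarize and by.transform are
## names made up for this illustration): compare the row counts of the two results.
n.groups <- length(unique(data$drv))
by.summarize <- ddply(data, .(drv), summarize, avgcty = mean(cty))
by.transform <- ddply(data, .(drv), transform, avgcty = mean(cty))
nrow(by.summarize) == n.groups    # summarize collapses each group to a single row
nrow(by.transform) == nrow(data)  # transform keeps all rows and appends avgcty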
## transform is very useful for standardizing/normalizing within groups
## (here: each car's deviation from the mean city mpg of its drivetrain group)
ddply(data, .(drv), transform, delta = mean(cty) - cty)
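## Along the same lines, a within-group z-score could look like this (a sketch; data.z,
## zcty, meanz and sdz are names chosen here, not part of the original example):
data.z <- ddply(data, .(drv), transform, zcty = (cty - mean(cty)) / sd(cty))
## Each drivetrain group should now have a zcty mean of approximately 0 and a standard deviation of 1
z.check <- ddply(data.z, .(drv), summarize, meanz = mean(zcty), sdz = sd(zcty))
z.check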
## Now let's use plyr to run a simple loop
## We'll ask the question: Does city mpg differ between car manufacturers within each
## drivetrain class (4 = four-wheel, f = front-wheel, r = rear-wheel drive)? Let's automate
## these ANOVAs and extract the F-statistic and P-value from each one.
## Step 1: Create a function to run the ANOVA
model <- function(data) { aov(cty ~ manufacturer, data = data) }
## Step 2: Use plyr to run the model for each drivetrain and store the fitted models in a
## list (called anova.output). In dlply, d means the input is a data frame and l means the
## output is a list.
anova.output <- dlply(data, .(drv), model)
## Step 3: Create a function that tells R where to find the F-statistic and P-value in each
## element of the list. The output is somewhat hidden in this example. Don't worry about the
## messy indexing here; what's important is that it just tells R where the F-stats and
## P-values are stored.
juicy <- function(x) {
  c(summary(x)[[1]][["F value"]][[1]],
    summary(x)[[1]][["Pr(>F)"]][[1]])
}
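## To see what that indexing points at, one can inspect a single fitted model (a sketch;
## "f" is assumed to be the name dlply gives the front-wheel-drive element, and aov.table
## is a name made up here).
aov.table <- summary(anova.output[["f"]])[[1]]  # the ANOVA table for that one model
aov.table                     # first row is the manufacturer term, second row the residuals
aov.table[["F value"]][[1]]   # F-statistic for manufacturer
aov.table[["Pr(>F)"]][[1]]    # the corresponding P-value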
## Step 4: Extract the components of the model output from the list created in Step 2. In
## ldply, l means the input is a list and d means the output is a data frame. Since the input
## is a list, we don't have to supply the 2nd argument (which variables to split by): the
## function is applied to every element of the list.
ldply(anova.output, juicy)
## The data frame shows the F-statistic (V1) and P-value (V2) of the ANOVA for each drivetrain.
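## If V1 and V2 are too cryptic, returning a named vector gives labelled columns instead
## (a sketch; juicy.named, Fvalue and Pvalue are names chosen for this illustration):
juicy.named <- function(x) {
  aov.table <- summary(x)[[1]]
  c(Fvalue = aov.table[["F value"]][[1]],
    Pvalue = aov.table[["Pr(>F)"]][[1]])
}
anova.stats <- ldply(anova.output, juicy.named)
anova.stats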
## We could always condense some of the above steps as well:
anova.output <- dlply(data, .(drv), function(data) aov(cty ~ manufacturer, data = data))
ldply(anova.output, function(x) { c(summary(x)[[1]][["F value"]][[1]], summary(x)[[1]][["Pr(>F)"]][[1]]) })
## Note that plyr provides many helper functions, such as colwise(), each() and splat(). You can always refer to the original article, http://www.jstatsoft.org/v40/i01/, for more on these.
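## A brief sketch of two of these helpers on the same data (the particular calls and the
## names cty.stats and col.means are illustrations chosen here, not taken from the article):
## each() combines several functions into one that returns a named vector
cty.stats <- each(mean, sd)(data$cty)
cty.stats
## colwise() turns a function into one that works column by column;
## here it gives the mean of cty and hwy for each drivetrain
col.means <- ddply(data, .(drv), colwise(mean, .(cty, hwy)))
col.means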