Skip to content

Instantly share code, notes, and snippets.

@aammd
Last active March 30, 2020 06:03
Show Gist options
  • Save aammd/11386424 to your computer and use it in GitHub Desktop.
Save aammd/11386424 to your computer and use it in GitHub Desktop.

Simple data manipulations in R

Many years ago, I was introduced to R by Cam Webb . At the time, his website contained a list of common data manipulations (original here). This list dated from Cam's early experience with R, and contained the R-help mailing list responses to a series of data manipulations. For a long time, I kept this file as a handy reference. I printed it out. I recommended it to friends.

Now I have been using R for years, and the state of the art has advanced considerably. Particulary, Hadley Wickham's reshape2 and dplyr packages have transformed the way most useRs manipulate their data. I decided that it would be interesting to revisit my favourite resource and try my hand at solving these problems with tools from these two packages.

library(reshape2)
library(dplyr)

GROUP

Turn this table (A):

c1 c2 c3
A a 1
A a 3
A a 1
A b 1
A b 2
B c 2
B d 1

into this (B):

Group1 Group2 Nrows SumOfCol3
A a 3 5
A b 2 3
B c 1 2
B d 1 1
A <- data.frame(
       c1 = c('A', 'A', 'A', 'A', 'A', 'B', 'B'),
       c2 = c('a', 'a', 'a', 'b', 'b', 'c', 'd'),
       c3 = c(1, 3, 1, 1, 2, 2, 1))

B <- A %.%
  group_by(c1,c2) %.%
  summarize(Nrows=n(),
            SumOfCol3=sum(c3))
kable(B)
c1 c2 Nrows SumOfCol3
A a 3 5
A b 2 3
B c 1 2
B d 1 1

SPLIT

Turn col3 of the second table (B) into this (C):

row a b c d
A 3 2 . .
B . . 1 1
C <- dcast(B,c1~c2,value.var="Nrows")
kable(C)
c1 a b c d
A 3 2 NA NA
B NA NA 1 1

Many original responders suggested the use of table, referring to the original dataset:

C_alt<-with(A,table(c1,c2))
kable(C_alt)
id a b c d
A 3 2 0 0
B 0 0 1 1

although that solution is not "tidy" in the Hadlian sense -- i.e., it does not return a data.frame, but rather a table object. You can obtain a data.frame with dcast directly:

C_alt2<-dcast(A,c1~c2,value.var="c3",fun.aggregate=length)
kable(C_alt2)
c1 a b c d
A 3 2 0 0
B 0 0 1 1

STACK

Turn the above table (C) into this (D):

c1 V1 V2
A a 3
A b 2
A c .
A d .
B a .
B b .
B c 1
B d 1
D <- melt(C,id="c1") %.%
  arrange(c1)
kable(D)
c1 variable value
A a 3
A b 2
A c NA
A d NA
B a NA
B b NA
B c 1
B d 1

JOIN

Join these tables (E, F):

c1 c2
A 1
B 2
C 3
c1 c3
A a
B a
B a
B b
C c
A b

to give (G):

c1 c3 c2
A a 1
B a 2
B a 2
B b 2
C c 3
A b 1
E<-data.frame(c1=c("A","B","C"), c2=1:3)
FF <- data.frame(c1=c("A","B","B","B","C","A"), c3=c("a","a","a","b","c","b"))
G <- left_join(FF, E)
## Joining by: "c1"
kable(G)
c1 c3 c2
A a 1
B a 2
B a 2
B b 2
C c 3
A b 1

the dplyr package supplies left_join(), which preserves the sequence of rows in its left argument. Alternative, as was originally suggested, one could use merge() :

G_merge <- merge(FF,E)
kable(G_merge)
c1 c3 c2
A a 1
A b 1
B a 2
B a 2
B b 2
C c 3

Although columns now come out sorted.

SUBSET

subset Table G to give H:

c1 c3 c2
A a 1
A b 1
H <- filter(G,c1=="A")
kable(H)
c1 c3 c2
A a 1
A b 1

TRANSPOSE

transpose H to give:

V1 V2
A A
a b
1 1
H_transpose <- data.frame(t(H))
kable(H_transpose)
id X1 X2
c1 A A
c3 a b
c2 1 1

SORT

In the original, the question suggested "up to three keys".

A_arranged <- arrange(A,c1,c2,c3)
kable(A_arranged)
c1 c2 c3
A a 1
A a 1
A a 3
A b 1
A b 2
B c 2
B d 1

Conclusion

To my surprise, each of these was actually a single line. The only exception was the first (GROUP), and that was because there are really two separate steps here -- the first to actually group the data, the second to apply summary functions to the data. dplyr automates both tasks, and supplies great readability.

@adletaw
Copy link

adletaw commented May 15, 2014

@thiagotoyoyo, try labelling your sequences and then using ddply. e.g. with your "dat" frame:

dat$group <- rep(LETTERS[1:10], each = 10)
ddply(dat, .(fac, group), summarize, mean_var1 = mean(var1))
fac group mean_var1
a A 10.573626
a B 9.412206
a C 12.275305
a D 9.985959
a E 10.054419
a F 10.006143
a G 9.696368
a H 9.191245
a I 10.256897
a J 9.242131
b A 10.804564
b B 8.714214
b C 10.790161
b D 11.390734
b E 10.186804
b F 8.906547
b G 10.446726
b H 11.299901
b I 9.752915
b J 10.746946

@aammd
Copy link
Author

aammd commented May 16, 2014

WOW I did not realize that there were so many comments here! oops, sorry you guys.

@thiagotoyoyo In your case I would use cut from base R to convert the numeric vector into a factor:

dat %.%
  mutate(grp=cut(seq,10,labels=LETTERS[1:10])) %.%
  group_by(grp,fac) %.%
  summarize(mean.var1=mean(var1),
            mean.var2=mean(var1),
            )

@aammd
Copy link
Author

aammd commented May 16, 2014

@adletaw here is my take on your problem. It takes advantage of ldply's handy ability to turn list names into a column:

fooo <- replicate(3,A1,simplify=FALSE)
names(fooo) <- c("x","y","z")
ldply(fooo,.id="b1")

Or an even simpler way:

B1 %>%
 group_by(b1) %>%
 do(A1)

@daattali
Copy link

This is pretty sweet
Do you mind sharing the R/Rmd source code so that I can reproduce this and maybe add tidyr to it? I'll keep your name on it

@aammd
Copy link
Author

aammd commented Oct 31, 2014

@daattali to my dismay, I don't think I actually saved the source :( I might have even written this directly in markdown (in my naiveté). A version that includes tidyr is VERY NECESSARY, however! that's an awesome idea. In fact this thing is so old I think it actually predates tidyr, or at least my use of it!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment