@alienzj
Forked from lyndametref/Apply.md
Created September 24, 2021 02:57
R Cheat Sheets

R Cheat Sheet : Applying functions

apply(x,index,function)

Apply a function to the rows (index=1) or columns (index=2) of a matrix.

   > mat<-matrix(1:9,3,3)
   > mat
        [,1] [,2] [,3]
   [1,]    1    4    7
   [2,]    2    5    8
   [3,]    3    6    9
   > apply(mat,1,sum)
   [1] 12 15 18
   > apply(mat,2,sum)
   [1]  6 15 24
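
The function passed to apply does not have to be a built-in such as sum. As a minimal sketch (the anonymous functions and the names row_ranges and col_summary are illustrative, not part of the original sheet), a function that returns one value per row gives a vector, while one that returns several values gives a matrix:

```r
mat <- matrix(1:9, 3, 3)

# One value per row: the result is a plain vector
row_ranges <- apply(mat, 1, function(r) max(r) - min(r))
print(row_ranges)

# Two values per column: the result is a matrix with one
# column of results for each column of the input
col_summary <- apply(mat, 2, function(col) c(min = min(col), max = max(col)))
print(col_summary)
```

For plain sums and means, rowSums, colSums, rowMeans and colMeans do the same job and are usually faster.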

lapply(x,function)

Apply a function to each element of the list x. The result is always a list.

    > x<-list(1:10)
    > lapply(x,sqrt)
    [[1]]
     [1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751 2.828427 3.000000 3.162278
    > class(lapply(x,sqrt))
    [1] "list"
    > x
    [[1]]
     [1]  1  2  3  4  5  6  7  8  9 10
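
The list above has a single element, so the per-element behaviour is easier to see with several. A small sketch, assuming a made-up named list called scores:

```r
# Hypothetical data: three elements of different lengths
scores <- list(a = 1:3, b = c(2.5, 7.5), c = 10)

# mean() is called once per element; names are carried over
means <- lapply(scores, mean)
str(means)
```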

sapply(x,function)

Apply a function to each element of the list x and simplify the result to a vector or matrix where possible.

    > x<-list(1:10)
    > sapply(x,sqrt)
              [,1]
     [1,] 1.000000
     [2,] 1.414214
     [3,] 1.732051
     [4,] 2.000000
     [5,] 2.236068
     [6,] 2.449490
     [7,] 2.645751
     [8,] 2.828427
     [9,] 3.000000
    [10,] 3.162278
    > class(sapply(x,sqrt))   # R >= 4.0.0; older versions print only "matrix"
    [1] "matrix" "array"
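
What sapply returns depends on the shape of the individual results. The sketch below uses a made-up list (scores) to show the three cases, and adds vapply as a stricter alternative that fails when a result does not match the declared template:

```r
# Hypothetical data
scores <- list(a = 1:3, b = c(2.5, 7.5), c = 10)

# Every result has length 1: simplified to a named vector
as_vector <- sapply(scores, mean)

# Every result has length 2: simplified to a 2 x 3 matrix
as_matrix <- sapply(scores, range)

# Results differ in length: no simplification, a list is returned
as_list <- sapply(scores, function(v) v[v > 2])

# vapply checks each result against the template numeric(1)
checked <- vapply(scores, mean, numeric(1))
```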

tapply(x,y,function)

Apply a function to subsets of a vector x, where the subsets are defined by the grouping vector y.

    > x<-1:10
    > y <-rep(c(T,F),5)
    > tapply(x, y, sum)
    FALSE  TRUE
       30    25
    > tapply(x, y, list)
    $`FALSE`
    [1]  2  4  6  8 10

    $`TRUE`
    [1] 1 3 5 7 9

    > tapply(x, y, max)
    FALSE  TRUE
       10     9
    > class(tapply(x, y, list))
    [1] "list"
    > class(tapply(x, y, max))
    [1] "array"
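
The grouping vector is more commonly a factor than a logical. A minimal sketch with invented measurements (weights and group are hypothetical names and values):

```r
# Hypothetical data: six measurements in two groups
weights <- c(4.1, 5.3, 3.9, 6.2, 5.8, 4.4)
group <- factor(c("ctrl", "trt", "ctrl", "trt", "trt", "ctrl"))

# One mean per factor level, named by level
group_means <- tapply(weights, group, mean)
print(group_means)
```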

mapply(function,x,y,...)

Apply a function to multiple objects element by element. mapply("*",x,y) returns c(x1*y1,x2*y2,x3*y3,...); the operator must be quoted or wrapped in backticks. By default the result is simplified.
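
The element-wise behaviour described above can be sketched as follows. The vectors x and y are illustrative; SIMPLIFY = FALSE is the documented argument for keeping the result as a list.

```r
x <- 1:3
y <- 4:6

# Element-wise product: x[1]*y[1], x[2]*y[2], x[3]*y[3]
products <- mapply(`*`, x, y)

# Results of different lengths cannot be simplified, so a list comes back
repeats <- mapply(rep, 1:3, 3:1)

# SIMPLIFY = FALSE always returns a list
sums <- mapply(function(a, b) a + b, x, y, SIMPLIFY = FALSE)
```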

R Cheat Sheet: Basics

Functions, conditions and loops

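    anExampleFunction <- function(x, ...) {
        aLocalVariable <- x                          # visible only inside the function
        if (!is.null(x)) return(x) else message("x is null")
        while (is.null(x)) x <- 1                    # repeats until the condition is FALSE
        for (i in 0:3) x <- seq(1, i)                # x ends as 1 2 3
        ifelse(x %% 2 == 0, TRUE, FALSE)             # vectorised if/else; last value is returned
    }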

Other notes:

  • break and next do not return a value as they transfer control within the loop.
  • do.call(funname, args) executes a function call from the name of the function and a list of arguments to be passed to it
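
A short sketch of the two notes above. The loop and the argument lists are invented for illustration:

```r
# next skips to the following iteration, break leaves the loop
odd_values <- c()
for (i in 1:10) {
    if (i %% 2 == 0) next   # skip even numbers
    if (i > 7) break        # stop once i exceeds 7
    odd_values <- c(odd_values, i)
}

# do.call accepts the function name as a string or the function itself;
# named list elements are passed as named arguments
joined <- do.call("paste", list("a", "b", sep = "-"))
total <- do.call(sum, list(1, 2, 3))
```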

Datatypes

  • vectors: x <- 1:10 (integer), x <- c('aaa','bbbb') (character); all elements share one type
  • list: Lists have elements, each of which can contain any type of R object
    > mylist<-list(x='a',y=2,z=1:10,n='Hello world')
    > mylist[1]
    $x
    [1] "a"
    > mylist[[1]]
    [1] "a"
    > mylist["z"]
    $z
    [1]  1  2  3  4  5  6  7  8  9 10
    > mylist$n
    [1] "Hello world"
  • matrix
   > matrix(seq(1,8),2,4)
     [,1] [,2] [,3] [,4]
   [1,]    1    3    5    7
   [2,]    2    4    6    8
  • dataframe
   > x<-data.frame(x = 1, y = 1:4, fac = LETTERS[1:4],stringsAsFactors = TRUE) # TRUE was the default before R 4.0.0
   > x
     x y fac
   1 1 1   A
   2 1 2   B
   3 1 3   C
   4 1 4   D
   > class(x$fac)
   [1] "factor"
   > x<-data.frame(x = 1, y = 1:4, fac = LETTERS[1:4],stringsAsFactors = FALSE)
   > class(x$fac)
   [1] "character"
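
Two points from the list above are easy to check interactively: a vector silently coerces mixed input to one type, and `[` on a list returns a list while `[[` returns the element itself. A minimal sketch (the variable names are illustrative):

```r
# Mixed input is coerced to the most general type, here character
mixed <- c(1, "a", TRUE)
mixed_class <- class(mixed)

mylist <- list(x = "a", z = 1:10)

# Single brackets keep the list wrapper
sub_list <- mylist["z"]

# Double brackets extract the element, so it can be used directly
z_total <- sum(mylist[["z"]])
```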

Create Data

  • seq(from,to) generates a sequence
	> seq(1,10,by=2)
	[1] 1 3 5 7 9
	> seq(1,10,length.out=2)
	[1]  1 10
	> seq(1,10,along.with=1:4)
	[1]  1  4  7 10
  • rep(x,n) replicate x n times
	> rep(1:3,2)
	[1] 1 2 3 1 2 3
	> rep(1:3,each=2)
	[1] 1 1 2 2 3 3
  • runif(n) draws n random numbers from a uniform distribution, by default between 0 and 1
	> runif(5)
	[1] 0.4490484 0.5588949 0.2798801 0.8900940 0.7158493
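
Random draws such as the runif output above change on every call. A sketch of making them repeatable with set.seed, plus two related base helpers; the seed value 42 and the bounds are arbitrary choices:

```r
# seq_len and seq_along build index sequences
first_four <- seq_len(4)
positions <- seq_along(c("a", "b", "c"))

# Setting the same seed before each call reproduces the same draws
set.seed(42)
draws <- runif(3, min = 5, max = 10)
set.seed(42)
same_draws <- runif(3, min = 5, max = 10)
```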

Is it...?

is.na(x), is.null(x), is.array(x), is.data.frame(x), is.numeric(x), is.complex(x), is.character(x)
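
These predicates are mostly used to guard code against unexpected input. A minimal sketch; describe is a hypothetical helper written for this example, not a base R function:

```r
values <- c(1, NA, 3)
missing_flags <- is.na(values)          # one logical per element
clean_mean <- mean(values, na.rm = TRUE)

# describe() is a made-up helper that branches on the type checks
describe <- function(x) {
    if (is.null(x)) {
        "null"
    } else if (is.data.frame(x)) {
        "data frame"
    } else if (is.character(x)) {
        "character"
    } else if (is.numeric(x)) {
        "numeric"
    } else {
        "other"
    }
}

kinds <- c(describe(NULL), describe(iris), describe("a"), describe(1:3))
```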

Strings

  • paste(...,sep=" ") concatenate vectors after converting to character
  • substr(x,start,stop) extract the substring of x from position start to stop
	> substr("Hello World", 7,10)
	[1] "Worl"
  • strsplit(x,split) split x according to split
	> strsplit("Hello World",split = " ")
	[[1]]
	[1] "Hello" "World"
  • grep(pattern,x) searches for matches to pattern within x and returns their indices
	> grep("[a-e]", letters)
	[1] 1 2 3 4 5
  • gsub(pattern,replacement,x) replaces all matches to pattern
  • sub() same as gsub but only replaces the first occurrence
  • tolower(x) convert to lowercase
  • toupper(x) convert to uppercase
  • match(x,table) returns a vector of the positions of first matches for the elements of x in table; x %in% table returns a logical vector telling whether each element of x has a match
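
Several of the functions above have no example in the list. A small sketch with an illustrative string s:

```r
s <- "Hello World"

# paste recycles the shorter argument
labels <- paste("a", 1:3, sep = "_")

# sub replaces the first match only, gsub replaces every match
first_only <- sub("o", "0", s)
all_matches <- gsub("o", "0", s)

upper <- toupper(s)

# match gives positions (NA when absent), %in% gives TRUE/FALSE
pos <- match(c("b", "z"), letters)
found <- c("b", "z") %in% letters
```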

Input and output

  • load() load the datasets written with save
  • read.table(file) reads a file in table format and creates a data frame from it
    • default separator sep="" is any whitespace
    • header=TRUE read the first line as a header of column names
    • as.is=TRUE prevent character vectors from being converted to factors
    • skip=n to skip n lines before reading data
  • read.csv("filename",header=TRUE)
  • read.fwf(file,widths) read a table of fixed width formatted data into a data.frame
    • widths is an integer vector, giving the widths of the fixed-width fields
  • save(...,file=) saves the specified objects (...) to file in the XDR platform-independent binary format
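
A round trip through these functions can be sketched with temporary files. The data frame df is made up, and write.table (the base R counterpart of read.table) is used here only to produce a file to read back:

```r
# Hypothetical data
df <- data.frame(id = 1:3, name = c("a", "b", "c"))

# Text round trip: write a whitespace-separated table, then read it back
txt <- tempfile(fileext = ".txt")
write.table(df, txt, row.names = FALSE)
back <- read.table(txt, header = TRUE, as.is = TRUE)

# Binary round trip: save() stores the object under its own name,
# load() restores it into the chosen environment
rda <- tempfile(fileext = ".RData")
save(df, file = rda)
restored <- new.env()
loaded_names <- load(rda, envir = restored)
```

load returns the names of the restored objects, which helps when the content of a file is unknown.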

R Cheat Sheet: Data wrangling

Packages

library(dplyr)
library(tidyr)

tbl_df(myDataframe) converts data to the tbl class. A tbl is easier to examine than a data frame because printing shows only the data that fits onscreen. In dplyr 1.0.0 and later, tbl_df() is deprecated in favour of as_tibble().

> tbl_df(iris)
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl>  <fctr>
1           5.1         3.5          1.4         0.2  setosa
2           4.9         3.0          1.4         0.2  setosa
3           4.7         3.2          1.3         0.2  setosa
4           4.6         3.1          1.5         0.2  setosa
5           5.0         3.6          1.4         0.2  setosa
6           5.4         3.9          1.7         0.4  setosa
7           4.6         3.4          1.4         0.3  setosa
8           5.0         3.4          1.5         0.2  setosa
9           4.4         2.9          1.4         0.2  setosa
10          4.9         3.1          1.5         0.1  setosa
... with 140 more rows

Subsetting

Variables (columns)

select (dplyr) Select columns by name or helper function

> select(iris, Sepal.Width, Petal.Length, Species)
    Sepal.Width Petal.Length    Species
1           3.5          1.4     setosa
2           3.0          1.4     setosa
3           3.2          1.3     setosa
4           3.1          1.5     setosa
5           3.6          1.4     setosa

helper:

  • select(iris, ends_with("Length"))
  • select(iris, starts_with("Sepal"))
  • select(iris, contains(".")) column name contains the literal string
  • select(iris, matches(".t.")) column name matches the regular expression
  • select(iris, num_range("x", 1:5)) columns named x1 to x5
  • select(iris, Sepal.Length:Petal.Width) range between 2 columns
  • select(iris, -Species) all except specified

Observations (rows)

Slicing

slice (dplyr) selects rows by position.

> slice(iris,1:5)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa

Filtering

filter (dplyr) extracts rows that meet logical criteria on given columns

> filter(iris,Sepal.Length>7.6)
  Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
1          7.7         3.8          6.7         2.2 virginica
2          7.7         2.6          6.9         2.3 virginica
3          7.7         2.8          6.7         2.0 virginica
4          7.9         3.8          6.4         2.0 virginica
5          7.7         3.0          6.1         2.3 virginica

Deduplicate

distinct (dplyr) remove duplicate rows.

> nrow(iris)
[1] 150
> nrow(distinct(iris))
[1] 149

Sampling

  • sample_frac(iris, 0.5, replace = TRUE) Randomly select fraction of rows.
  • sample_n(iris, 10, replace = TRUE) Randomly select n rows.

replace = TRUE samples with replacement: a row that has been drawn stays available for subsequent draws, so it can appear more than once.
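
The effect of replace can be illustrated without dplyr. The sketch below is a base R counterpart using sample on row indices, not the dplyr implementation; the seed and the sample sizes are arbitrary:

```r
set.seed(1)
n <- nrow(iris)

# Without replacement every row index appears at most once
idx_without <- sample(n, 10)

# With replacement the same index can be drawn again; asking for
# more draws than there are rows makes repeats unavoidable
idx_with <- sample(n, 200, replace = TRUE)

resampled <- iris[idx_with, ]
```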

Reshaping Data

Gather & spread: columns into rows and back

  • gather (tidyr) Gather columns into rows.
    • convert If TRUE will automatically run type.convert on the key column. This is useful if the column names are actually numeric, integer, or logical.
    • factor_key If FALSE, the default, the key values will be stored as a character vector. If TRUE, will be stored as a factor, which preserves the original ordering of the columns.
  • spread (tidyr) Spread rows into columns.
	> test <- data.frame(Name=c("A","B","C"),M1=c(2.5,3,6),M2=c(5,6,7))
	> test
	  Name  M1 M2
	1    A 2.5  5
	2    B 3.0  6
	3    C 6.0  7
	> gather(test, Param, val, M1, M2)
	  Name Param val
	1    A    M1 2.5
	2    B    M1 3.0
	3    C    M1 6.0
	4    A    M2 5.0
	5    B    M2 6.0
	6    C    M2 7.0
	> spread(gather(test,Param, val, M1,M2), Param,val)
	  Name  M1 M2
	1    A 2.5  5
	2    B 3.0  6
	3    C 6.0  7

Split & unite columns

  • separate (tidyr) Separate one column into several.
  • unite (tidyr) Concatenate the strings of several columns with a separator.
	> test <- data.frame(
	+ id = sprintf("x%01d.%02d", c(rep(1,2),rep(2,2),rep(3,2)),rep(1:2,3)),
	+ val= runif(6))
	> test
	     id       val
	1 x1.01 0.4516309
	2 x1.02 0.1182174
	3 x2.01 0.2386353
	4 x2.02 0.4705228
	5 x3.01 0.3523231
	6 x3.02 0.3385752
	> sep <- separate(test,id, into = c("sample","replicate"))
	> sep
	  sample replicate       val
	1     x1        01 0.4516309
	2     x1        02 0.1182174
	3     x2        01 0.2386353
	4     x2        02 0.4705228
	5     x3        01 0.3523231
	6     x3        02 0.3385752
	> unite(sep,id,sample,replicate,sep = "-")
	     id       val
	1 x1-01 0.4516309
	2 x1-02 0.1182174
	3 x2-01 0.2386353
	4 x2-02 0.4705228
	5 x3-01 0.3523231
	6 x3-02 0.3385752

Grouping, summarise and mutate

  • mutate create a new column from others
  • transmute like mutate but drop old columns
  • summarise summarise a column with a function
  • summarise_each summarise all columns with a function (the function must be wrapped in funs()); newer dplyr versions deprecate both in favour of across()
  • group_by specify by which column data should be grouped
> mutate(iris, Petal.Surf=Petal.Length*Petal.Width)
    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species Petal.Surf
1            5.1         3.5          1.4         0.2     setosa       0.28
2            4.9         3.0          1.4         0.2     setosa       0.28
3            4.7         3.2          1.3         0.2     setosa       0.26
4            4.6         3.1          1.5         0.2     setosa       0.30
...

> transmute(iris, Petal.Surf=Petal.Length*Petal.Width)
    Petal.Surf
1         0.28
2         0.28
3         0.26
4         0.30
5         0.28
6         0.68
...

> summarise(iris,avg=mean(Sepal.Length))
       avg
1 5.843333
> summarise_each(iris,funs(mean))
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1     5.843333    3.057333        3.758    1.199333      NA
Warning message:
In mean.default(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,  :
  argument is not numeric or logical: returning NA
> iris %>% group_by(Species) %>% summarise(avg=mean(Sepal.Length))
     Species   avg
1     setosa 5.006
2 versicolor 5.936
3  virginica 6.588

Merging

  • merge(x,y,all=,by=) merges two data frames by common columns or row names. If all=TRUE, extra rows are added to the output, one for each row in x that has no matching row in y and vice versa.
	> authors <- data.frame(
	     surname = I(c("Tukey", "Venables", "Tierney")),
	     deceased = c(T, rep(F, 2)))
	> books <- data.frame(
	     name = I(c("Tukey", "Venables", "Tierney",
	                "Ripley",  "R Core")),
	     title = c("Exploratory Data Analysis",
	               "Modern Applied Statistics ...",
	               "LISP-STAT",
	               "Spatial Statistics",
	               "An Introduction to R"))
	> merge(authors, books, by.x = "surname", by.y = "name", all = TRUE)
	   surname deceased                         title
	1   R Core       NA          An Introduction to R
	2   Ripley       NA            Spatial Statistics
	3  Tierney    FALSE                     LISP-STAT
	4    Tukey     TRUE     Exploratory Data Analysis
	5 Venables    FALSE Modern Applied Statistics ...
	> merge(authors, books, by.x = "surname", by.y = "name", all = FALSE)
	   surname deceased                         title
	1  Tierney    FALSE                     LISP-STAT
	2    Tukey     TRUE     Exploratory Data Analysis
	3 Venables    FALSE Modern Applied Statistics ...
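
Besides all=TRUE and all=FALSE, merge accepts all.x and all.y to keep unmatched rows from one side only. A sketch with a shortened, illustrative version of the tables above:

```r
authors <- data.frame(
    surname = c("Tukey", "Venables", "Tierney"),
    deceased = c(TRUE, FALSE, FALSE))
books <- data.frame(
    name = c("Tukey", "Ripley"),
    title = c("Exploratory Data Analysis", "Spatial Statistics"))

# all.x = TRUE keeps every author; authors without a book get title NA
left <- merge(authors, books, by.x = "surname", by.y = "name", all.x = TRUE)

# all.y = TRUE keeps every book; books without an author row get deceased NA
right <- merge(authors, books, by.x = "surname", by.y = "name", all.y = TRUE)
```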

Piping

> x %>% f(y) # f(x, y)
> y %>% f(x, ., z) # f(x, y, z)
> iris %>%
   group_by(Species) %>%
   summarise(avg = mean(Sepal.Width)) %>%
   arrange(avg)
