@mitchwongho
Last active August 29, 2015 14:12
R Programming

Google's R Style Guide https://google-styleguide.googlecode.com/svn/trunk/Rguide.xml

# This is a comment

# Get current working directory
> getwd()

# Set working directory
> setwd("/User/Foo/example")

# Directory doesn't exist?
> dir.create("/User/Foo/example")

# List directory
> dir()

# Read a .csv file (relative paths are resolved against your working directory)
> read.csv("example.csv")

# List objects
> ls()  # alternatively
> objects()

# Import Source .R file
> source("example.R")

Programming Syntax

Below, the numeric value 1 is assigned to the variable x using the <- operator. Printing x with the print() function yields [1] 1: a vector with one element, the number 1. Note that the literal 1 is stored as a numeric (double); write 1L to get an integer.

> x <- 1
> print(x)
[1] 1
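
As a quick check of the numeric/integer distinction, the sketch below compares the literals 1 and 1L. The variable names are only illustrative.

```r
x <- 1    # plain numeric literal, stored as a double
y <- 1L   # the L suffix asks for an integer

class(x)          # "numeric"
class(y)          # "integer"
x == y            # TRUE: the values compare equal
identical(x, y)   # FALSE: the underlying types differ
```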

Objects

There are five basic (atomic) object types:

  • Character
  • Numeric (real numbers), e.g. 1, Inf and NaN
  • Integer, e.g. 1L
  • Complex
  • Logical (TRUE, FALSE; T and F are shorthand)

The most basic object is a vector. A vector can be created using the vector() function, e.g. v <- vector("numeric", 10) creates a numeric vector of length 10, initialised to zeros. A list is a type of vector that, unlike an atomic vector, may contain elements of different types.
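
A minimal sketch of what vector() returns for a few modes. The default fill values shown in the comments are those of base R; the variable names are illustrative.

```r
v <- vector("numeric", 10)     # ten zeros
chars <- vector("character", 2)  # two empty strings
flags <- vector("logical", 3)    # three FALSE values
l <- vector("list", 3)           # a list of three NULL elements

length(v)    # 10
class(l)     # "list"
```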

Objects have attributes, for example:

  • names, dimnames (dimension names)
  • dimensions (arrays, matrices)
  • class (use the class() function to determine an object's class)
  • length (use the length() function on an object to determine its length)
  • other user-defined attributes

Use the attributes() function on an object to list its attributes.
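
The attribute functions above can be tried on small objects as follows. The "units" attribute is a made-up, user-defined attribute used only for illustration.

```r
m <- matrix(1:6, nrow = 2, ncol = 3)
attributes(m)        # a list holding $dim: 2 3
length(m)            # 6 (the number of elements)

v <- c(a = 1, b = 2) # a named numeric vector
names(v)             # "a" "b"

attr(v, "units") <- "cm"   # attach a user-defined attribute
attributes(v)              # now lists $names and $units
```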

Vectors and Lists

Vectors

The function c() can be used to create a vector by concatenating objects e.g.

> x <- c(1,2,3)  ## numeric
> x <- c(TRUE, FALSE) ## logical
> x <- c(T, F) ## logical
> x <- c("a", "b", "c") ## characters
> x <- c(9:12) ## integer sequence built with the ':' operator
> x <- c(1+0i, 2+4i)  ## complex 

NB: Creating a vector from elements of mixed types results in the elements being (implicitly) coerced to a common type (remember: a vector can only be of a single type), e.g.

> c(1.7, "a") ## becomes c("1.7", "a")
> c(TRUE, 2)  ## becomes c(1, 2)
> c("a", TRUE)  ## becomes c("a", "TRUE")

To explicitly coerce an object, use the as.*() functions, e.g. as.logical(1) yields TRUE.
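
The coercion rules can be verified directly. This is a small sketch; the last example shows that a coercion which makes no sense produces NA (normally with a warning, suppressed here).

```r
x <- c(1.7, "a")   # numeric + character -> character
y <- c(TRUE, 2)    # logical + numeric   -> numeric
class(x)           # "character"
class(y)           # "numeric"

as.logical(1)      # TRUE
as.numeric("3.5")  # 3.5
as.character(0:2)  # "0" "1" "2"

# "b" cannot be read as a number, so it becomes NA
z <- suppressWarnings(as.numeric(c("1", "b")))
```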

Lists

Lists are created using the list(...) function, e.g. x <- list(2, "f", TRUE, 9L, 1+2i), where x[[1]] is 2, x[[2]] is "f", x[[3]] is TRUE, and so on.
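
A short sketch showing that each list element keeps its own type. The named list at the end uses hypothetical names purely for illustration.

```r
x <- list(2, "f", TRUE, 9L, 1+2i)

length(x)                 # 5
x[[2]]                    # "f"
types <- sapply(x, class) # one class per element, no coercion

# Elements can also be named and then accessed by name
named <- list(id = 1L, label = "a")
named$label               # "a"
```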

Matrices

Matrices are vectors with a dimension attribute. The dimension attribute is an integer vector of length 2: (nrow, ncol).

> m <- matrix( nrow = 2, ncol = 3 )  ## create an empty (NA-filled) 2x3 matrix

## Matrices are filled _column-wise_ (populating from [1,1])
> m <- matrix( 1:6, nrow=2, ncol=3)
> m
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

## Creating a matrix from a vector by setting its dim attribute
> m <- 1:10
> dim(m) <- c(2,5)
> m
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    3    5    7    9
[2,]    2    4    6    8   10
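
The column-wise fill order can be confirmed by indexing individual cells. The byrow = TRUE variant is added here only as a contrast; it is a standard argument of matrix() but is not covered in the notes above.

```r
m <- matrix(1:6, nrow = 2, ncol = 3)
dim(m)     # 2 3
m[2, 1]    # 2: the first column is filled first
m[1, 2]    # 3

# For comparison: fill row by row instead
r <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
r[1, 2]    # 2

# Setting dim on a plain vector turns it into a matrix
v <- 1:10
dim(v) <- c(2, 5)
is.matrix(v)   # TRUE
```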

Subsetting

Operators

The following operators are used to extract subsets of objects:

  • [ returns an object of the same class as the original. It may be used to select more than one element.
  • [[ extracts an element of a list or data frame. It can only be used to extract a single element.
  • $ extracts an element of a list or data frame by name. Its semantics are similar to those of [[, e.g. data$Ozone.
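
A minimal sketch of the three operators applied to a small list. The list and its element names are hypothetical.

```r
x <- list(foo = 1:4, bar = 0.6)

x["foo"]             # still a list (with one element)
x[["foo"]]           # the integer vector 1 2 3 4 itself
x$bar                # 0.6
x[c("foo", "bar")]   # [ can select several elements at once
```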

Data Frame

  • Use the names() function to list the column names.
  • Use the complete.cases() function to get a logical vector marking the rows that contain no NA or NaN values; index the data frame with it to keep only those rows.
  • Data frame subsets can be extracted by index using [row,col]. An index can be omitted, e.g. data[1,] returns the 1st row of the data frame/matrix and data[,5] returns the 5th column. data[3,"Temp"] returns the value of the Temp column in the 3rd row.
  • Partial matching of names is available for the [[ and $ operators, e.g. dataframe$a or dataframe[["a", exact = FALSE]] returns the element whose name starts with a (the prefix must match exactly one name).

> df <- read.csv("hw1_data.csv") ## read csv file

> names(df)  ## returns the names of the data frame

> df[1:3,]  ## subset first 3 rows

> tail(df, 4) ## returns the last 4 rows of the data frame

> nrow(df) ## get number of rows

> df[47,] ## returns row 47

> df[is.na(df$Ozone),]  ## returns the rows where 'Ozone' is NA (note the comma: rows are selected, all columns kept)

> mean(df[complete.cases(df$Ozone),]$Ozone)  ## returns the Mean of Ozone (exclude NA elements)

> ss <- df[complete.cases(df),]  ## keep only the rows without NA values
> mean(ss[ss$Ozone > 31 & ss$Temp > 90,]$Solar.R) ## returns the mean of Solar.R where Ozone > 31 and Temp > 90

## Max Ozone for May
> may <- df[df$Month == 5,]  ## rows for May (note the comma)
> cleanMay <- complete.cases(may$Ozone)  ## TRUE for rows where Ozone is not NA
> may <- may[cleanMay,]  ## drop the rows with NA Ozone
> max(may$Ozone)  ## returns the max Ozone for May
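
The same steps can be tried without the CSV file. The sketch below uses a tiny made-up data frame that only borrows the column names Ozone, Temp and Month; the values are invented for illustration.

```r
df <- data.frame(
  Ozone = c(41, NA, 12, 115, NA, 28),
  Temp  = c(67, 72, 74, 79, 56, 66),
  Month = c(5, 5, 5, 5, 6, 6)
)

na_rows <- df[is.na(df$Ozone), ]             # rows 2 and 5
mean_ozone <- mean(df$Ozone, na.rm = TRUE)   # 49, NA values ignored

may <- df[df$Month == 5, ]                   # rows for May
may <- may[complete.cases(may$Ozone), ]      # drop rows with NA Ozone
max_may <- max(may$Ozone)                    # 115

# Partial name matching: "Oz" uniquely identifies the Ozone column
partial <- df[["Oz", exact = FALSE]]
```

Forgetting the comma, as in df[is.na(df$Ozone)], asks for columns rather than rows and fails because the logical vector is longer than the number of columns.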

Manipulating Data with dplyr

  • load dplyr package -> library(dplyr)
  • check package version -> packageVersion("dplyr")
  • wrap a data frame as a data frame tbl (tibble) -> cran <- tbl_df(mydf) (newer dplyr releases deprecate tbl_df() in favour of as_tibble(mydf))
  • dplyr fundamental tasks: select(), filter(), arrange(), mutate() and summarize()

select()

select() keeps only the columns (vectors) listed, in the order given, e.g. select(cran, ip_id, package, country).

  • select a contiguous range of columns with the ':' operator: select(cran, r_arch:country)
  • exclude columns with the '-' symbol: select(cran, -time), or exclude a range: select(cran, -(r_arch:country))
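
A small sketch of the select() forms above. It assumes dplyr is installed; the two-row data frame is invented and only reuses the column names mentioned in these notes. as_tibble() is used in place of tbl_df(), which newer dplyr releases deprecate.

```r
library(dplyr)

# Hypothetical stand-in for the CRAN download log
mydf <- data.frame(
  time    = c("00:30:13", "00:30:15"),
  r_arch  = c("x86_64", "x86_64"),
  r_os    = c("mingw32", "linux-gnu"),
  package = c("htmltools", "tseries"),
  country = c("US", "US"),
  ip_id   = c(1L, 2L),
  stringsAsFactors = FALSE
)
cran <- as_tibble(mydf)

picked   <- select(cran, ip_id, package, country)  # columns in the order listed
ranged   <- select(cran, r_arch:country)           # r_arch through country
no_time  <- select(cran, -time)                    # everything except time
no_range <- select(cran, -(r_arch:country))        # leaves time and ip_id
```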

Reading files

# Downloading a file from the web
> download.file([fileURL], destfile=[local.path], method="curl")

# Reading local flat files
> read.csv() # comma-separated; use read.csv2() for semicolon-separated files with ',' as the decimal mark
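
To try both readers without downloading anything, the sketch below writes two small temporary files and reads them back. The file contents and variable names are made up for illustration.

```r
# Comma-separated file, read with read.csv()
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3, score = c(2.5, 3, 4.5)),
          csv_path, row.names = FALSE)
scores <- read.csv(csv_path)

# Semicolon-separated file with ',' as decimal mark, read with read.csv2()
csv2_path <- tempfile(fileext = ".csv")
writeLines(c("id;score", "1;2,5", "2;3,0"), csv2_path)
scores2 <- read.csv2(csv2_path)

names(scores)    # "id" "score"
scores2$score    # 2.5 3.0
```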

Connecting to MySQL

# install and load the package
> install.packages("RMySQL")
> library(RMySQL)

# Reading from MySQL
> hg19 <- dbConnect(MySQL(), user="genome", db="hg19", host="genome-mysql.cse.ucsc.edu")
> allTables <- dbListTables(hg19)  # inspect the first five with allTables[1:5]
> length(allTables)  # number of tables, e.g. [1] 10949
> result <- dbGetQuery(hg19, "show databases;")  # list databases
> dbListFields(hg19, "affyU133Plus2")  # list table fields
> dbGetQuery([db], [sql-query])  # e.g. dbGetQuery(hg19, "select count(*) from affyU133Plus2")
> [dataframe] <- dbReadTable([db], [table])
> [query] <- dbSendQuery([db], [sql-SELECT-query])
> [dataframe] <- fetch([query])  # fetch the rows of the result
> [sub dataframe] <- fetch([query], n=[numrows])  # or: fetch only the next [numrows] rows
> dbClearResult([query])  # clear the result set - mandatory
> dbDisconnect(hg19)  # close the connection when done

Reading from HDF5

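# install the rhdf5 package (from Bioconductor)
> source("http://bioconductor.org/biocLite.R")
> biocLite("rhdf5")  # newer Bioconductor releases use BiocManager::install("rhdf5") instead
> library(rhdf5)
> created = h5createFile("example.h5")  # TRUE if the file was created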

# create groups
> created = h5createGroup("example.h5","foo")
> created = h5createGroup("example.h5","bar")
> created = h5createGroup("example.h5","foo/bar")
> h5ls("example.h5")

# write to groups
> A = matrix(1:10, nrow=5, ncol=2)  # create a 5x2 matrix
> h5write([object], [file], [name])  # e.g. h5write(A, "example.h5", "foo/A")

ghost commented Mar 9, 2015

Are you sure that "df[is.na(df.$Ozone)]" returns subset where 'Ozone' is NA? Doesn't work for me

@mitchwongho
Author

Thanks. I'll double check and fix asap