@cbergman
Created March 8, 2019 13:09
R_workshop_3_8_19
# download R: https://cran.rstudio.com/
# download Rstudio: https://www.rstudio.com/products/rstudio/download/
### Brief intro to R language ###
# R is a FOSS statistical programming language
# R is an implementation of S (Bell labs, 1976) by Ross Ihaka and Robert Gentleman (1995->2000)
# R name is play on S language and author names
# R is an interpreted language, R command line interpreter (console) is written in R, C and Fortran
# R interpreter takes plain text input, interprets input, and generates numerical, text or graphical output
# R is a scripting language, human readable code, no need to compile programs (more like BASH or Mathematica than C or JAVA)
# R code can be run interactively or at command line
# Core R language and packages maintained by R Development Core Team
# Core R extended by user community to develop domain-specific packages (CRAN, bioconductor, github)
# Core R provides basic R.app CLI and GUI
# Rstudio (also FOSS) is a more advanced GUI developed by the Rstudio team (2010), who also develop many useful data science packages (ggplot2, tidyverse)
### Quick intro to Rstudio ###
# Rstudio is an Integrated Development Environment (IDE) that makes writing and running R code faster and easier
# 4 Panes: Source, Console/Terminal, Environment/History/Git, Files/Plots/Packages/Help: https://bookdown.org/ndphillips/YaRrr/the-four-rstudio-windows.html
# Only 3 visible until you open/create a source file
# Can resize and minimize/maximize panes
# Source: create, edit & save R code in plain text files. Can also select lines to be run in Console (Command + Return on macOS, Ctrl + Enter on Windows/Linux)
# Console/Terminal: CLI R interpreter (with arrowkey + mouse navigation). Can type directly at prompt or run commands selected in Source pane. Terminal allows access to system through shell
# Environment/History/Git: shows which packages and variables exist in your current environment; history of all commands run, and provides access to Git version control system
# Files/Plots/Packages/Help: shows what files are in project folder, interactive plot viewer, package manager, and help window/search
### Basics of R input and output ###
# boot up R.app & Rstudio
# > is prompt, blinking cursor
# any text after # symbol is treated as comment and ignored by interpreter
# use R for basic numerical calculations
```
# this is a comment
1 + 100 # this is a comment
```
# calculation is interpreted as the sum of two single-element vectors, resulting in a single-element vector indexed by [1]
# index is not a part of the output, just a helpful guidepost for when output has many elements
```
rep(1:100)
```
# input can be split over multiple lines
# if input does not yield a completely interpretable statement, prompt will change from > to +
# + symbol does not indicate addition in this context
```
1 +
100
```
# if you get a + prompt that you don't want, hit Esc to kill it in R.app/Rstudio (use ctrl+c in R CLI)
# R uses standard mathematical symbols and order of operations
```
3 + 5 * 2
(3 + 5) * 2
```
# R uses scientific E-notation format for very large and small numbers
```
2/10000
```
# 2e-04 = 2E-04 = 2e-4 = 2*10^(-4)
# 2e-04 != 2*exp(-04)
# often see P values reported as "< 2.2e-16" (R's machine epsilon), a give-away that someone is using R in their analysis
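# A minimal sketch of how E-notation relates to ordinary notation (the variable name "small" is only for illustration):
```r
small <- 2/10000
small                                # printed in E-notation as 2e-04
format(small, scientific = FALSE)    # "0.0002", the same value in fixed notation
all.equal(small, 2 * 10^(-4))        # TRUE: e-04 means "times ten to the power -4"
2 * exp(-4)                          # about 0.0366: exp() raises e to a power, a different quantity
```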
# non-integer numbers are represented as double-precision floating point numbers with 53 binary digits of accuracy (which corresponds to ~16 decimal digits)
# important because most non-integer numbers are not represented exactly & you can experience rounding errors that can compound
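# A small sketch of a rounding error in practice (the values 0.1, 0.2 and 0.3 are chosen only for illustration):
```r
total <- 0.1 + 0.2
total == 0.3                       # FALSE: neither side is stored exactly
sprintf("%.17f", total)            # shows the digits hidden by default printing
isTRUE(all.equal(total, 0.3))      # TRUE: all.equal() compares within a small tolerance
.Machine$double.digits             # 53 binary digits in a double
```
# all.equal() returns a description of the difference rather than FALSE when values differ, so wrap it in isTRUE() when a single TRUE/FALSE is needed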
# R has in-built mathematical functions
```
log(1) # what is the default base of log function in R?
log(10) # default base isn't 10
log(exp(1)) # default is natural log (base e)
log10(10) # log with base 10
log2(2) # log with base 2
log(3, base=3) # log with arbitrary base
```
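# The base argument is equivalent to dividing natural logs (change of base); a quick check, using an arbitrary illustrative value:
```r
value <- 8
log(value, base = 2)                                  # 3, since 2^3 = 8
all.equal(log(value, base = 2), log(value) / log(2))  # TRUE
all.equal(log10(1000), log(1000) / log(10))           # TRUE
```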
# R has in-built help functions that explain functions and provide examples
# help menu pane and tab autocompletion/help suggestions are better in Rstudio than R.app
```
?log
```
# R has in-built comparison and logical operators
```
1 == 1 # equality (note two equals signs, read as "is equal to", only use for integers & strings)
1 != 2 # inequality (read as "is not equal to")
1 < 2 # less than
1 <= 1 # less than or equal to
1 > 0 # greater than
1 >= -9 # greater than or equal to
1 == 1 & 1 == 2 # AND
1 == 1 | 1 == 2 # OR
!(1 == 1) # NOT
```
### Variable naming, assignment & management ###
# Variables are named containers that store information
# Variables can be manipulated & referenced by the variable name
# Do not need to declare variables in R or assign a type (integer, string, dataframe) to them prior to use (dynamically typed)
# Variable names cannot start with a number or underscore, or contain spaces
# R variable naming conventions are:
```
periods.between.words
camelCaseToSeparateWords
underscores_between_words
```
# Google style guide for R code says period >> camelCase >> underscores: https://google.github.io/styleguide/Rguide.xml#identifiers (I disagree with this, and consider this merely a matter of style)
# Key is to be consistent in your variable naming style
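# One way to check the naming rules above is make.names(), which returns a name unchanged only if it is syntactically valid (the candidate names here are made up for illustration):
```r
candidates <- c("air.flow", "airFlow", "air_flow", "2nd.sample", "_tmp", "air flow")
is.valid <- make.names(candidates) == candidates
data.frame(candidates, is.valid)                  # last three names are not valid
make.names(c("2nd.sample", "_tmp", "air flow"))   # what R would turn the invalid names into
```
# make.names() also alters reserved words such as "if", so this is a rough check rather than a style checker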
# Assignment of values to variables is typically done using the leftward "<-" composite operator
# "<-" does not mean "less than negative"
```
x <- 1/40 # assign value to x (x added to environment)
x # print value currently assigned to x
x <- 1/30 # assign new value to x (new x *not* added to environment)
log(x) # can use x in place of number in any calculation (value of log(x) reported to interpreter but not stored as variable in environment)
```
# the right hand side of the expression is evaluated before being assigned to the variable on the left hand side
# R also allows rightward assignment "1/30 -> x"
# R also allows "=" to be used for assignment, but this is not recommended by google style guide: https://google.github.io/styleguide/Rguide.xml#assignment (I agree with this since "=" is used to set parameters in many functions)
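# A sketch of why the two operators are not interchangeable inside a function call (round() and its digits argument are used only as a convenient example):
```r
rounded <- round(3.14159, digits = 2)                    # "=" sets the digits parameter of round()
created.by.equals <- exists("digits", inherits = FALSE)  # FALSE: no variable called digits was created
rounded.again <- round(3.14159, digits <- 2)             # "<-" assigns digits first, then passes the value on
created.by.arrow <- exists("digits", inherits = FALSE)   # TRUE: digits now exists in the environment
c(created.by.equals, created.by.arrow)
```
# This assumes no variable called digits existed beforehand; both calls return 3.14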
# Managing variables in your environment
# Variables that exist in current environment can be listed using:
```
ls()
```
# Variables (and their values) that exist in current environment can also be inspected in the Environment pane
# Variables starting with a "." are hidden from ls() and Environment pane
```
.x <- 0
ls()
```
# You can view all variables in your environment, including hidden ones, as follows:
```
ls(all.names=TRUE)
```
# Variables can be overwritten/modified by assignment
# Variables can be deleted as follows:
```
rm(.x)
```
# To remove all (non-hidden) variables from your environment:
```
rm(list = ls())
```
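# Because ls() skips hidden variables, rm(list = ls()) leaves them behind. A sketch using a scratch environment (created with new.env() so nothing in your own session is deleted; all names are illustrative):
```r
scratch <- new.env()
assign(".hidden", 0, envir = scratch)
assign("visible", 1, envir = scratch)
rm(list = ls(scratch), envir = scratch)                    # removes only "visible"
left.over <- ls(scratch, all.names = TRUE)                 # ".hidden" is still there
rm(list = ls(scratch, all.names = TRUE), envir = scratch)  # removes hidden variables too
now.empty <- length(ls(scratch, all.names = TRUE)) == 0
```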
#################
### Exercises ###
#################
# 1) Will the following expression evaluate to TRUE or FALSE?
(1.25 * (1 * 0.8) - 1) == (1.25 * (3 * 0.8) - 3)
# 2) what are the values of the following expressions
1 * 0.8
1.25 * (1 * 0.8)
1.25 * (1 * 0.8) - 1
# 3) what are the values of the following expressions
3 * 0.8
1.25 * (3 * 0.8)
1.25 * (3 * 0.8) - 3
# 4) Will the following expression evaluate to TRUE or FALSE?
all.equal((1.25 * (1 * 0.8) - 1),(1.25 * (3 * 0.8) - 3))
########################
# Managing Projects in R
########################
### Creating a project in Rstudio ###
# Click the “File” menu button, then “New Project”.
# Click “New Directory”.
# Click “New Project”.
# Type in the name of the directory to store your project, e.g. “my_project”.
# If available, select the checkbox for “Create a git repository.”
# Click the “Create Project” button.
### Organizing files in a project ###
# Put each project in its own directory, which is named after the project
# Put text documents associated with the project in the /doc directory
# Put (small) raw data and metadata in the /data directory
# Put scripts in the /src directory
# Directories can be created in Files Pane
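# Directories can also be created from the Console. A minimal sketch, assuming a project called "my_project" placed in a temporary folder for demonstration (use your real project path instead):
```r
project.dir <- file.path(tempdir(), "my_project")   # hypothetical project location
for (sub.dir in c("data", "src", "doc")) {
  # recursive = TRUE also creates project.dir itself if it is missing
  dir.create(file.path(project.dir, sub.dir), recursive = TRUE, showWarnings = FALSE)
}
list.files(project.dir)                              # lists the three sub-directories
```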
### Formatting your input data ###
# All variables should have a separate column (don’t mix meaning in a column, add a new column if necessary)
# All data from the same variable go in same column
# Label your columns with terms that are meaningful to other people
# Do not leave white spaces in your variable names (use underscore or period, e.g. Air.Flow or air_flow) or data cells (use “n.a.”)
# Make sure headers are the first row in file
# Don’t leave any blank row or columns in a file
# Don’t color code cells (add a new column with an indicator variable, e.g. “gfp_tagged” with values “y/n”)
# Save your data as plain text files in tab delimited or comma separated value (CSV) format
# After input data is cleaned, make read only (and/or keep under version control)
# import CSV data into R session as follows:
```
gapminder_data <- read.csv("gapminder_data.csv", header = TRUE) # spell out TRUE; T is an ordinary variable that can be overwritten
head(gapminder_data)
```
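# If empty cells are coded as "n.a." as suggested above, read.csv() needs to be told so via na.strings, otherwise the column is read as text. A sketch with made-up data (column names and values are hypothetical):
```r
csv.text <- "strain,gfp_tagged,air_flow
wt,n,1.5
mut1,y,n.a.
mut2,y,2.0"
example.data <- read.csv(text = csv.text, header = TRUE, na.strings = c("NA", "n.a."))
str(example.data)                  # air_flow is numeric, with one missing value
is.na(example.data$air_flow)       # FALSE TRUE FALSE
```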
### Managing results files ###
# Make separate results directories for each analysis (use date in dir_name)
# Treat generated output as disposable
# If results files are large, keep them outside of the git repository (or use git lfs)
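# One possible way to build a dated results directory from within R (the analysis name and the use of a temporary folder are assumptions for this sketch):
```r
analysis.name <- "gapminder_summary"                       # hypothetical analysis label
date.stamp <- format(Sys.Date(), "%Y-%m-%d")               # e.g. year-month-day sorts chronologically
results.dir <- file.path(tempdir(), "results", paste0(date.stamp, "_", analysis.name))
dir.create(results.dir, recursive = TRUE, showWarnings = FALSE)
basename(results.dir)
```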
### installing/loading packages ###
# You can see what packages are installed by typing
```
installed.packages()
```
# You can install packages (e.g. phangorn package) by typing
```
install.packages("phangorn")
```
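# You can update all installed packages, or only a named package, by typing
```
update.packages()
update.packages(oldPkgs = "packagename") # the first argument of update.packages() is a library path, not a package name
```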
# You can remove a package with
```
remove.packages("packagename")
```
# You need to make a package available before using in your R session (needs to be installed first)
```
library(phangorn)
```
# Package management is easiest with Packages pane (reports commands, avoids syntax errors)
# Packages often have complex dependencies that require installation of other packages
# Packages can be installed from source or binaries: http://r-pkgs.had.co.nz/package.html
# Some packages have many versions, important to record which version you are using
# As project nears completion it is a good idea to archive your R packages, since it may not be possible to reconstruct full environment in the future (alternatively use conda/bioconda)
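# A lightweight way to record versions is to save the session details alongside your results. A sketch, writing to a temporary file whose name is only an example:
```r
R.version.string                                   # version of R itself
packageVersion("stats")                            # version of a single package
version.file <- file.path(tempdir(), "session_info.txt")
writeLines(capture.output(sessionInfo()), version.file)   # R version, OS and loaded packages
```
# This records versions but does not archive the packages themselves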
### version control with Git ###
#################
### Exercises ###
#################
# 1) Create an R project in Rstudio & create data, src, and doc directories.
# 2) Download gapminder data: https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/gh-pages/_episodes_rmd/data/gapminder_data.csv
# 3) Move gapminder_data.csv into data folder.
# 4) Import gapminder_data.csv into R session, assign dataframe to variable, and inspect that your dataframe has been imported properly using head(). Note: you may need to modify the path to the input datafile.