
Created September 27, 2013 20:24
hadley/6734639
My first stab at a basic R programming curriculum. I think teaching just these topics without overall motivating examples would be extremely boring, but if you're a self-taught R user, this might be useful to help spot your gaps.

Notes:

• I've tried to break it up into separate pieces, but it's not always possible: e.g. knowledge of data structures and subsetting are tightly intertwined.

• The level of Bloom's taxonomy is listed in square brackets (see http://bit.ly/15gqPEx). Few categories currently assess components higher in the taxonomy.

# Programming R curriculum

## Data structures

• basic data structures (vector, matrix, list and data frame):

• list and describe their differences (dimensionality, homogeneous vs. heterogeneous) [knowledge]

• pick the best data structure for a given problem [application]

• recall functions to coerce data structures between different forms [knowledge], and recognise which coercions are lossy [comprehension]

• match data types and the functions that identify them, and remember common gotchas (is.vector, is.numeric etc.) [comprehension]

• `str`:

• interpret the output of `str` [comprehension]

• use `str` and subsetting to extract desired pieces from an arbitrary object (for example, extract the r squared value from a linear model) [application]

• vectors:

• recognise which types of data correspond to the four common atomic vectors (character, double, integer, logical) [knowledge]

• recognise the use of `L` to create integer vectors [knowledge]

• create new vectors with `c()`, and correctly predict vector type when multiple types are mixed (e.g. what is the type of `c(1, 1L, F)`) [application]

• create named vectors with `c()`, recognise how named vectors are printed and how to extract values with character subsetting [application]

• employ implicit logical to numerical coercion to compute number and proportion of TRUEs in a vector (e.g. what proportion of values are missing?) [application]

• predict how missing values propagate [application], and discuss why `is.na()` is necessary [synthesis]

• data frames:

• use `data.frame()` to create a data frame from multiple vectors, and control the names of the generated columns [application]

• describe the situations under which strings are coerced to factors, and recall how to use `I()`, `as.is = TRUE` or `stringsAsFactors = FALSE` to prevent conversion [knowledge]

• combine two or more data frames with `cbind()` and `rbind()`, and describe what conditions must be true for the combination to work [knowledge]

• use `head()`, `tail()`, `summary()` and `str()` to get an overview of a data frame [application]

• describe how 1d and 2d subsetting of a data frame differ, and enumerate the circumstances under which subsetting a data frame will return a column instead of a data frame [comprehension]

• matrices

• contrast 1d vector operations and 2d matrix operations (e.g. `names()` vs. `colnames()` & `rownames()`, `length()` vs `nrow()` and `ncol()`). [analysis]

• predict the output when a matrix is coerced into a vector (i.e. remember that R matrices are stored column-wise)

• lists

• create a new list with `list()`, and selectively name components [application]

• convert a list into a vector with `unlist()`, and apply implicit coercion rules to predict the type of the output [application]

• NULL

• strings vs. factors vs. ordered factors

• recall the key differences (cardinality, ordering) between strings, factors and ordered factors [knowledge]

• select the most appropriate type for a given variable [analysis]

• describe the operation of `drop = TRUE`, when it is needed, and remedies if you are using it frequently [application]

• match data types with conversion and testing functions, and list common gotchas (e.g. converting an ordered factor to a factor) [knowledge]

• know enough about floating point math to predict the output of `sqrt(2)^2 - 2 == 0` and spot potentially hazardous use of equality comparisons [application]
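
A few of the data structure objectives above can be checked at the console. The following is a minimal sketch rather than a full exercise set; the names `scores`, `alice` and `bob` are made up for illustration, and `mtcars` is the built-in data set that ships with R.

```r
# Mixing types in c() coerces to the most flexible type present
# (logical < integer < double < character).
x <- c(1, 1L, FALSE)
typeof(x)                          # "double"

# Named vectors (hypothetical names) and character subsetting.
scores <- c(alice = 90, bob = 85)
scores[["bob"]]                    # 85

# Logicals coerce to 0/1, so sum() counts TRUEs and mean() gives a proportion.
y <- c(2.5, NA, 4, NA)
sum(is.na(y))                      # 2
mean(is.na(y))                     # 0.5

# Missing values propagate: NA == NA is NA, which is why is.na() exists.
NA == NA                           # NA

# Matrices are stored column-wise, so coercing to a vector reads down columns.
m <- matrix(1:6, nrow = 2, byrow = TRUE)
as.vector(m)                       # 1 4 2 5 3 6

# Floating point: sqrt(2)^2 is not exactly 2, so compare with a tolerance.
sqrt(2)^2 - 2 == 0                 # FALSE
isTRUE(all.equal(sqrt(2)^2, 2))    # TRUE

# Inspect a fitted model with str(), then subset out the piece you need.
fit <- lm(mpg ~ wt, data = mtcars)
# str(summary(fit))                # shows a component called r.squared
r2 <- summary(fit)$r.squared
```

Running `str(summary(fit))` interactively is the step that reveals where `r.squared` lives; the extraction itself is ordinary list subsetting.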

## Subsetting

• types of subsetting

• match the six types of subsetting objects with their results [knowledge]

• compare and contrast the use of subsetting, `match` and `%in%` when looking for matching values across two vectors [application]

• use integer subsetting to order multidimensional structures [application]

• apply De Morgan's rule to simplify a complicated double negation [application]

• identify uses of `which()` that are redundant (i.e. you only need `which()` when you want the position of the nth TRUE) [analysis]

• use repeated values in numeric indexing to create a "subset" that is larger than the original set [application]

• use character subsetting to create a lookup table [application]

• understand how 1d subsetting generalises to 2d subsetting [comprehension]

• describe the difference between simplifying and preserving subsetting (`[` vs `[[`, when `drop = FALSE` is necessary) [analysis]

• understand the difference between `x$y` and `x[["y"]]` and know when to use each form [application]

• use subsetting with assignment to change multiple values in a data structure at once [application]

• use subsetting with assignment and NULL to remove elements from a list/data frame [application]

• identify when subsetting + assignment will fail because the number of values to assign does not match the number of values in the subset [analysis]

• use R's boolean operators to recreate English expressions (e.g. x is less than 50 and more than 25). Recall the difference between R's inclusive or (`|`) and "or" in regular English. [application]

• compare and contrast `&` and `|` with `&&` and `||` [analysis]
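
Several of the subsetting objectives above fit in one short session. This is an illustrative sketch only; the lookup codes and the small data frame are invented for the example.

```r
# Character subsetting as a lookup table (hypothetical codes).
lookup <- c(m = "Male", f = "Female", u = NA)
codes <- c("m", "f", "u", "f", "m")
decoded <- unname(lookup[codes])   # "Male" "Female" NA "Female" "Male"

# Repeated indices give a "subset" that is larger than the original.
letters[c(1, 1, 2, 2)]             # "a" "a" "b" "b"

# Integer subsetting with order() sorts the rows of a data frame.
df <- data.frame(x = c(3, 1, 2), y = c("c", "a", "b"))
sorted <- df[order(df$x), ]

# Simplifying vs. preserving: drop = FALSE keeps the data frame.
simplified <- df[, "x"]                  # a numeric vector
preserved  <- df[, "x", drop = FALSE]    # a one-column data frame

# De Morgan's rule: !(a & b) is the same as !a | !b.
a <- c(TRUE, TRUE, FALSE, FALSE)
b <- c(TRUE, FALSE, TRUE, FALSE)
identical(!(a & b), !a | !b)       # TRUE

# "x is less than 50 and more than 25": & is vectorised.
v <- c(10, 30, 60)
in_range <- v < 50 & v > 25        # FALSE TRUE FALSE

# A redundant which(): logical subsetting already does the job here.
identical(v[which(v > 25)], v[v > 25])   # TRUE (no NAs in v)

# && works on single values and short-circuits, which suits if() conditions.
length(v) > 0 && v[1] > 5          # TRUE

# Subsetting with assignment: change several values, or remove with NULL.
v[v > 25] <- 0                     # v is now 10 0 0
df$y <- NULL                       # df now has only column x
```

The `which()` comparison only holds when there are no missing values: logical subsetting keeps `NA` positions, while `which()` drops them.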

## Input and output

• identify the correct function to read/write a data frame to/from disk (csv, tab delimited or fixed width file) [application]

• use common arguments (`na.strings`, `sep`, `header`) to deal with files that have unusual structure [analysis]

• recognise the lack of symmetry between `read.csv()` and `write.csv()`, and describe which options should be used by default [knowledge]

• use `subset()` and `transform()` to reduce the amount of typing for common data manipulation operations [knowledge]

• use `readRDS()`/`saveRDS()` to cache binary R objects that were expensive to compute [application]

• understand what `save()` and `load()` do and how they differ from `readRDS()` and `saveRDS()` [knowledge], and when to use them instead of the single object variants [evaluation]
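
The asymmetry between `write.csv()` and `read.csv()`, and the difference between the RDS and `save()`/`load()` pairs, are easiest to see with a round trip. The sketch below writes only to temporary files; the data frame and object names are invented for illustration.

```r
df <- data.frame(id = 1:3, name = c("a", "b", NA), stringsAsFactors = FALSE)

# By default write.csv() also writes row names, which read.csv() then
# reads back as an extra column (named X).
path_default <- tempfile(fileext = ".csv")
write.csv(df, path_default)
with_rownames <- read.csv(path_default, stringsAsFactors = FALSE)
ncol(with_rownames)                # 3, not 2

# row.names = FALSE makes the round trip symmetric.
path_clean <- tempfile(fileext = ".csv")
write.csv(df, path_clean, row.names = FALSE)
round_trip <- read.csv(path_clean, stringsAsFactors = FALSE)

# saveRDS()/readRDS() store one object; you choose the name when reading.
rds_path <- tempfile(fileext = ".rds")
saveRDS(df, rds_path)
cached <- readRDS(rds_path)

# save()/load() store one or more objects together with their names;
# load() recreates them under those names and returns the names.
a <- 1
b <- "two"
rdata_path <- tempfile(fileext = ".RData")
save(a, b, file = rdata_path)
rm(a, b)
restored <- load(rdata_path)       # "a" "b"
```

Because `load()` silently overwrites any existing objects with the same names, the single-object `readRDS()` form is usually the safer choice for caching.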

## Functions & control flow

• convert a simple script into parameterised functions [synthesis]

• describe a simple R function in words [synthesis]

• describe R's argument matching semantics (position, partial, exact) [knowledge], predict how they apply in a specific situation [application], and evaluate good and less-good use of the three different types [evaluation]

• describe the parts of a function using correct terminology: body, formal arguments, return value [comprehension]

• use scoping rules to predict how names are mapped to values [application]

• describe short-circuiting and its impact on expressions like `is.null(x) || all(is.na(x))` or `TRUE || stop("!")`

• execute a script of R code with `source()`
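
As a rough illustration of the function objectives above, here is one way a repeated script step might become a parameterised function. `rescale()` and `is_empty()` are hypothetical helpers written for this sketch, not functions from base R.

```r
# A hypothetical helper: linearly rescale x to the interval [lower, upper].
rescale <- function(x, lower = 0, upper = 1, na.rm = TRUE) {
  rng <- range(x, na.rm = na.rm)
  lower + (x - rng[1]) / (rng[2] - rng[1]) * (upper - lower)
}

# Argument matching: by position, by exact name, and by partial name.
rescale(c(1, 2, 3))                # 0.0 0.5 1.0
rescale(c(1, 2, 3), upper = 10)    # 0 5 10
rescale(c(1, 2, 3), up = 10)       # 0 5 10; partial matching works but is fragile

# The parts of a function: formal arguments and body.
names(formals(rescale))            # "x" "lower" "upper" "na.rm"
# body(rescale)                    # prints the code between the braces

# Scoping: g() looks up y where g was defined (inside f), not in the caller.
y <- 10
f <- function() {
  y <- 1
  g <- function() y + 1
  g()
}
f()                                # 2

# Short-circuiting: || stops as soon as the result is known.
is_empty <- function(x) is.null(x) || all(is.na(x))
is_empty(NULL)                     # TRUE
is_empty(c(1, NA))                 # FALSE
TRUE || stop("!")                  # TRUE; stop() is never evaluated
```

Partial matching is shown only so that it can be recognised; spelling argument names out in full is generally easier to read and maintain.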

## Control flow

• describe the structure of an if statement [comprehension]

• use a for loop to repeat the same operation on different elements of a data structure [application]

• convert a for loop to a while loop [analysis]

• illustrate why `1:length(x)` is dangerous and suggest a safer way [application]

• correct the indenting and spacing of a piece of poorly formatted source code [application]
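
The `1:length(x)` hazard and the for-to-while conversion can be sketched as follows. `count_iterations()` is a throwaway helper defined only for this example.

```r
# With an empty vector, 1:length(x) counts DOWN from 1 to 0.
x <- numeric(0)
1:length(x)                        # 1 0
seq_along(x)                       # integer(0)

# A throwaway helper that counts how many times a loop body would run.
count_iterations <- function(index) {
  n <- 0
  for (i in index) n <- n + 1
  n
}
count_iterations(1:length(x))      # 2: the loop body runs on nothing
count_iterations(seq_along(x))     # 0: the loop is skipped, as intended

# The same sum written as a for loop and as a while loop.
vals <- c(2, 4, 6)

total_for <- 0
for (v in vals) {
  total_for <- total_for + v
}

total_while <- 0
i <- 1
while (i <= length(vals)) {
  total_while <- total_while + vals[i]
  i <- i + 1
}
```

The while version has to manage the index by hand, which is why a for loop over `seq_along()` is usually preferred when the number of iterations is known in advance.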

## Vectorisation/recycling

• describe what vectorisation means, distinguish internal and external vectorisation, and describe the performance consequences of each [knowledge]

• use vectorised operations instead of for loops to perform simple mathematical operations (log, addition, subtraction etc.) [application]

• use `lapply()`, `sapply()` and `apply()` to vectorise operations that are not already vectorised. [analysis]

• convert an `lapply()` call to a for loop [application]

• recognise a for-loop that can be rewritten to use `lapply` [knowledge]

• match common non-vectorised functions to their vectorised equivalents (e.g. `min()` and `pmin()`, `sum()` and `cumsum()` or `colSums()`) [knowledge]

• describe basic recycling rules, and know how to avoid them when necessary [knowledge]
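
A small sketch of the vectorisation objectives above, using only base R. The vectors are arbitrary example values.

```r
x <- c(1, 10, 100)

# Built-in maths is already vectorised: no loop needed.
log10(x)                           # 0 1 2

# The same computation as a for loop and as lapply().
squares_loop <- vector("list", length(x))
for (i in seq_along(x)) {
  squares_loop[[i]] <- x[i]^2
}
squares_lapply <- lapply(x, function(v) v^2)
identical(squares_loop, squares_lapply)   # TRUE

# sapply() simplifies the result to a vector where it can.
sapply(x, function(v) v^2)         # 1 100 10000

# Summary functions vs. their vectorised relatives.
min(c(1, 5), c(3, 2))              # 1   (one value overall)
pmin(c(1, 5), c(3, 2))             # 1 2 (element-wise minimum)
sum(1:4)                           # 10
cumsum(1:4)                        # 1 3 6 10
colSums(matrix(1:4, nrow = 2))     # 3 7

# Recycling: the shorter vector is repeated to match the longer one.
1:6 + 1:2                          # 2 4 4 6 6 8
# When the lengths are not multiples (e.g. 1:6 + 1:4), R recycles anyway
# and only warns, so checking lengths first is the safer habit.
```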

## Recovering from errors

• recognise and remedy simple syntax errors (missing quotes, missing parentheses etc.) [comprehension]

• use `try()` to recover from an error [application]

• interpret the output of `traceback()` to identify where an error occurred [application]

• initiate an interactive debugger with `browser()` or `options(error = recover)` [application]

• list the commands used to control `browser()`/`recover()` [knowledge]

• use `options(warn = 2)` to convert warnings into errors for debugging

• create a minimal reproducible example to get help from others [synthesis]

• find help for a function, data set, and package [knowledge]

• read and interpret the documentation of a function [analysis]

• use Google to identify the name of a function that performs a given task
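
The non-interactive parts of the error-handling objectives can be sketched like this. `safe_log()` is a hypothetical wrapper written for the example; the interactive tools are left as comments because they only make sense at the console.

```r
# try() lets the script continue after an error; the failure comes back
# as an object of class "try-error".
res <- try(stop("boom"), silent = TRUE)
inherits(res, "try-error")         # TRUE

# A hypothetical wrapper that falls back to NA when log() fails.
safe_log <- function(x) {
  out <- try(log(x), silent = TRUE)
  if (inherits(out, "try-error")) NA else out
}
safe_log(1)                        # 0
safe_log("a")                      # NA

# options(warn = 2) turns warnings into errors, so they can be caught
# and traced like any other error. Restore the old setting afterwards.
old <- options(warn = 2)
res_warn <- try(as.integer("x"), silent = TRUE)
options(old)
inherits(res_warn, "try-error")    # TRUE

# Interactive tools (run these at the console, not in a script):
# traceback()                 # call stack of the most recent error
# options(error = recover)    # note: recover, not recover()
# browser()                   # pause inside a function at this point
# ?median; help(package = "stats")   # help for a function / a package
```

Passing `recover` without parentheses matters: the option expects the function itself, to be called later when an error occurs.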

## Package management

• install a package with `install.packages()` [comprehension]

• load a package with `library()` or `require()` [comprehension]

• determine which packages are out of date [application]

• understand the lifetime of `install.packages()`/`library()` effects [comprehension]

• use `::` to refer to a function in a specific package
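
A brief sketch of the package objectives above. The install and update calls are commented out because they need network access, and both `somePackage` and `aPackageThatDoesNotExist` are placeholder names, not real packages.

```r
# Installing writes to disk and persists across sessions (run once):
# install.packages("somePackage")   # placeholder name
# old.packages()                    # installed packages with newer versions

# Attaching with library() lasts only for the current session.
library(tools)                      # tools ships with R
"package:tools" %in% search()       # TRUE

# :: calls a function from a specific package without attaching it.
utils::head(letters, 3)             # "a" "b" "c"

# library() stops with an error if the package is missing;
# require() returns FALSE (with a warning) instead.
ok <- suppressWarnings(
  require("aPackageThatDoesNotExist", character.only = TRUE, quietly = TRUE)
)
ok                                  # FALSE
```

Because `require()` fails quietly, `library()` is usually the better choice at the top of a script: a missing package then stops execution immediately rather than causing a confusing error later.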

### rvprasad commented Sep 28, 2013

Nice!! If I were to suggest one change, then you might want to consider folding vectorized operations (including subsetting, selection, projection) into the exposition about data structures.

In an introductory class on R, I introduced vectorized operations along with data structures before introducing control flow, and this was well received by the students. We found that while every student had some programming and database experience, they related vectorized operations to SQL operations and stayed clear of control flow while thinking about data transformations.

If it might help, the slide deck from this class is available at http://www.slideshare.net/venkateshprasadranganath/the-r-language-an-introduction.

### jennybc commented Sep 28, 2013

I've learned to hammer on the meta-issues of using an IDE (probably RStudio, as @jhollist said) and being deliberate about where you work (probably by using an RStudio project). Attention to this early eliminates much aggravation about where things are to be read from and written to. This seems to make people much more willing to save scripts (instead of workspaces) and to break their work into pieces.

What about the `...` argument?

+1 knitr.

### hadley commented Sep 30, 2013

@jhollist @jennybc good point - some basic editor familiarity is really important

@ychen41 @prabhasp I agree that knitr is important, but it's beyond the scope of this curriculum - knitr is for data analysis (generally), not programming

### hadley commented Sep 30, 2013

@joshuaulrich

• Why do you think `NROW` / `NCOL` are important? I can count the number of times I've used them on the fingers of one hand
• I think `isTRUE` is a relatively advanced concept, and similarly, I think recycling vectors with matrices is generally a bad idea - you should be explicit about it.

### hadley commented Sep 30, 2013

@vsbuffalo good point - I'll add a bullet

### wabarr commented Oct 6, 2013

What about file system commands like copying, listing files in directory etc? And string manipulation (base and/or stringr)?