My first stab at a basic R programming curriculum. I think teaching just these topics without overall motivating examples would be extremely boring, but if you're a self-taught R user, this might be useful to help spot your gaps.

Notes:

  • I've tried to break it up into separate pieces, but it's not always possible: e.g. knowledge of data structures and subsetting are tightly intertwined.

  • The level of Bloom's taxonomy is listed in square brackets (see http://bit.ly/15gqPEx). Few categories currently assess components higher in the taxonomy.

Programming R curriculum

Data structures

  • basic data structures (vector, matrix, list and data frame):

    • list and describe their differences (dimensionality, homogeneous vs. heterogeneous) [knowledge]

    • pick the best data structure for a given problem [application]

    • recall functions to coerce data structures between different forms [knowledge], and recognise which coercions are lossy [comprehension]

    • match data types and the functions that identify them, and remember common gotchas (is.vector, is.numeric etc.) [comprehension]

  • str:

    • interpret the output of str [comprehension]

    • use str and subsetting to extract desired pieces from an arbitrary object (for example, extract the r squared value from a linear model) [application]

  • vectors:

    • recognise which types of data correspond to the four common atomic vectors (character, double, integer, logical) [knowledge]

    • recognise the use of L to create integer vectors [knowledge]

    • create new vectors with c(), and correctly predict vector type when multiple types are mixed (e.g. what is the type of c(1, 1L, F)) [application]

    • create named vectors with c(), recognise how named vectors are printed and how to extract values with character subsetting [application]

    • employ implicit logical to numerical coercion to compute number and proportion of TRUEs in a vector (e.g. what proportion of values are missing?) [application]

    • predict how missing values propagate [application], and discuss why is.na() is necessary [synthesis]

  • data frames:

    • use data.frame() to create a data frame from multiple vectors, and control the names of the generated columns [application]

    • describe the situations under which strings are coerced to factors, and recall how to use I(), as.is = TRUE or stringsAsFactors = FALSE to prevent conversion [knowledge]

    • combine two or more data frames with cbind() and rbind(), and describe what conditions must be true for the combination to work [knowledge]

    • use head(), tail(), summary() and str() to get an overview of a data frame [application]

    • describe how 1d and 2d subsetting of data frame differ, and enumerate the circumstances under which subsetting a data frame will return a column instead of a data frame [comprehension]

  • matrices

    • contrast 1d vector operations and 2d matrix operations (e.g. names() vs. colnames() & rownames(), length() vs nrow() and ncol()). [analysis]

    • predict the output when a matrix is coerced into a vector (i.e. remember that R matrices are stored col-wise)

  • lists

    • create a new list with list(), and selectively name components [application]

    • convert a list into a vector with unlist, and apply implicit coercion rules to predict type of output [application]

  • NULL

  • strings vs. factors vs. ordered factors

    • recall the key differences (cardinality, ordering) between strings, factors and ordered factors [knowledge]

    • select the most appropriate type for a given variable [analysis]

    • describe the operation of drop = TRUE, when it is needed, and remedies if you are using it frequently [application]

    • match data types with conversion and testing functions, and list common gotchas (e.g. converting an ordered factor to a factor) [knowledge]

  • know enough about floating point math to predict the output of sqrt(2)^ 2 - 2 == 0 and spot potentially hazardous use of equality comparisons [application]
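
A minimal sketch of a few of the items above, using only base R. The variable names and toy data are made up for illustration, and the linear model uses the built-in mtcars data set as an arbitrary example.

```r
# Mixing types in c() coerces to the most flexible type
# (logical -> integer -> double -> character).
x <- c(1, 1L, FALSE)
typeof(x)                          # "double"
typeof(1L)                         # "integer": the L suffix asks for an integer

# Logicals coerce to 0/1, so sum() counts TRUEs and mean() gives a proportion.
y <- c(2.5, NA, 4, NA)
n_missing <- sum(is.na(y))         # 2
prop_missing <- mean(is.na(y))     # 0.5

# Missing values propagate: comparing with NA yields NA, hence is.na().
na_cmp <- NA == NA                 # NA, not TRUE

# Named vectors can be subset by name.
ages <- c(ann = 31, bob = 27)      # hypothetical data
ages[["bob"]]                      # 27

# Matrices are stored column-wise, which shows when they are flattened.
m <- matrix(1:6, nrow = 2, byrow = TRUE)
flat <- as.vector(m)               # 1 4 2 5 3 6

# unlist() applies the same coercion rules as c().
u <- unlist(list(1, "a", TRUE))    # "1" "a" "TRUE"

# A data frame is a list of equal-length vectors.
df <- data.frame(id = 1:3, label = c("a", "b", "c"), stringsAsFactors = FALSE)
is.list(df)                        # TRUE
class(df["label"])                 # "data.frame": 1d subsetting keeps the data frame
class(df[, "label"])               # "character": 2d subsetting simplifies to a column

# A common factor gotcha: as.numeric() returns the integer codes, not the labels.
f <- factor(c("10", "20", "10"))
codes <- as.numeric(f)                      # 1 2 1
values <- as.numeric(as.character(f))       # 10 20 10

# str() shows where a value lives; here, r squared in a model summary.
fit <- lm(mpg ~ wt, data = mtcars)
str(summary(fit), max.level = 1)
r2 <- summary(fit)$r.squared

# Floating point: sqrt(2)^2 is not exactly 2, so compare with a tolerance.
exact <- sqrt(2)^2 - 2 == 0                     # FALSE
near_equal <- isTRUE(all.equal(sqrt(2)^2, 2))   # TRUE
```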

Subsetting

  • types of subsetting

    • match the six types of subsetting objects with their results [knowledge]

    • compare and contrast the use of subsetting, match and %in% when looking for matching values across two vectors [application]

    • use integer subsetting to order multidimensional structures [application]

    • apply De Morgan's rule to simplify a complicated double negation [application]

    • identify uses of which() that are redundant (i.e. you only need which() when you want the position of the nth TRUE) [analysis]

    • use repeated values in numeric indexing to create a "subset" that is larger than the original set [application]

    • use character subsetting to create a lookup table [application]

  • understand how 1d subsetting generalises to 2d subsetting [comprehension]

  • describe the difference between simplifying and preserving subsetting ([ vs. [[, when drop = FALSE is necessary) [analysis]

  • understand the difference between x$y and x[["y"]] and know when to use each form [application]

  • use subsetting with assignment to change multiple values in a data structure at once [application]

  • use subsetting with assignment and NULL to remove elements from a list/data frame [application]

  • identify when subsetting + assignment will fail because the number of values to assign does not match the number of values in the subset [analysis]

  • use R's boolean operators to recreate English expressions (e.g. x is less than 50 and more than 25). Recall the difference between R's "or" and "or" in regular English. [application]

  • compare and contrast & and | with && and || [analysis]
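
A rough sketch of the subsetting items above. It assumes the six types of subsetting are positive integers, negative integers, logical vectors, character vectors, nothing, and zero; the list does not spell them out. All data and names are hypothetical.

```r
x <- c(a = 10, b = 20, c = 30, d = 40)   # hypothetical data

x[c(1, 3)]        # positive integers: elements 1 and 3
x[-1]             # negative integers: everything except element 1
x[x > 15]         # logical: elements where the condition is TRUE
x[c("a", "d")]    # character: elements by name
x[]               # nothing: the whole vector
x[0]              # zero: a zero-length vector

# Repeated indices give a "subset" longer than the original.
bigger <- x[c(1, 1, 2, 2, 3)]

# Character subsetting as a lookup table.
lookup <- c(m = "Male", f = "Female")
sex <- unname(lookup[c("m", "f", "f")])       # "Male" "Female" "Female"

# match() returns positions, %in% returns a logical vector.
pos <- match(c("b", "z"), names(x))           # 2 NA
found <- c("b", "z") %in% names(x)            # TRUE FALSE

# Integer subsetting with order() sorts the rows of a data frame.
df <- data.frame(g = c("b", "a", "c"), v = c(2, 3, 1))
sorted <- df[order(df$v), ]

# [ preserves the container, [[ simplifies; drop = FALSE preserves dimensions.
lst <- list(a = 1:3, b = "z")
class(lst["a"])                    # "list"
class(lst[["a"]])                  # "integer"
m <- matrix(1:6, nrow = 2)
dim(m[1, ])                        # NULL: simplified to a vector
dim(m[1, , drop = FALSE])          # 1 3

# x$y is shorthand for x[["y"]]; [[ also works when the name is in a variable.
col_name <- "v"
same_column <- identical(df$v, df[[col_name]])   # TRUE

# Subsetting plus assignment changes several values at once; NULL removes.
x[x > 25] <- 0                     # x is now 10 20 0 0
lst$b <- NULL                      # lst now has only component "a"

# "Less than 50 and more than 25", and De Morgan's rule.
z <- c(10, 30, 60)
between <- z < 50 & z > 25                               # FALSE TRUE FALSE
de_morgan <- identical(!(z < 50 & z > 25), z >= 50 | z <= 25)   # TRUE

# & and | are vectorised; && and || take single values and short-circuit.
c(TRUE, FALSE) & c(TRUE, TRUE)     # TRUE FALSE
TRUE && FALSE                      # FALSE
```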

Input and output

  • identify the correct function to read/write a data frame to/from disk (csv, tab delimited or fixed width file) [application]

  • use common arguments (na.strings, sep, header) to deal with files that have unusual structure [analysis]

  • recognise the lack of symmetry between read.csv() and write.csv(), and describe which options should be used by default [knowledge]

  • use subset & transform to reduce the amount of typing for common data manipulation operations [knowledge]

  • use readRDS/saveRDS to cache binary R objects that were expensive to compute [application]

  • understand what save() and load() do, how they differ from readRDS() and saveRDS() [knowledge] and when to use them instead of the single object variants [evaluation]
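
A small sketch of the round trips described above, written against temporary files so it is safe to run. The data and file contents are invented; row.names = FALSE is one reasonable default for write.csv(), not necessarily the one the list has in mind.

```r
df <- data.frame(id = 1:3, score = c(9.5, NA, 7), stringsAsFactors = FALSE)

# write.csv() writes row names by default, which read.csv() would then read
# back as an extra column; row.names = FALSE keeps the round trip symmetric.
csv_path <- tempfile(fileext = ".csv")
write.csv(df, csv_path, row.names = FALSE)
back <- read.csv(csv_path, stringsAsFactors = FALSE)

# An unusual file: tab delimited, with "-" standing for a missing value.
tsv_path <- tempfile(fileext = ".tsv")
writeLines(c("id\tscore", "1\t9.5", "2\t-", "3\t7"), tsv_path)
odd <- read.table(tsv_path, header = TRUE, sep = "\t", na.strings = "-")

# saveRDS()/readRDS() store a single object; the name is chosen on reading.
rds_path <- tempfile(fileext = ".rds")
saveRDS(df, rds_path)
cached <- readRDS(rds_path)

# save()/load() store one or more objects together with their names;
# load() recreates them under those names.
rda_path <- tempfile(fileext = ".RData")
a <- 1
b <- "two"
save(a, b, file = rda_path)
rm(a, b)
restored <- load(rda_path)         # returns the names it restored: "a" "b"
```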

Functions & control flow

  • convert a simple script into parameterised functions [synthesis]

  • describe a simple R function in words [synthesis]

  • describe R's argument matching semantics (position, partial, exact) [knowledge], predict how they apply in a specific situation [application], and evaluate good and less-good use of the three different types [evaluation]

  • describe the parts of a function using correct terminology: body, formal arguments, return value [comprehension]

  • use scoping rules to predict how names are mapped to values [application]

  • describe short-circuiting and its impact on expressions like is.null(x) || all(is.na(x)) or TRUE || stop("!")

  • execute a script of R code with source()
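
A sketch of several of the items above. rescale() is a hypothetical function invented for this example, as are the other names.

```r
# A script fragment turned into a parameterised function.
rescale <- function(x, lower = 0, upper = 1, na.rm = TRUE) {
  rng <- range(x, na.rm = na.rm)
  (x - rng[1]) / (rng[2] - rng[1]) * (upper - lower) + lower
}

# The parts of a function: formal arguments, body, and the return value
# (the value of the last evaluated expression).
arg_names <- names(formals(rescale))    # "x" "lower" "upper" "na.rm"
body(rescale)

# Argument matching: exact name, then partial name, then position.
by_exact    <- rescale(c(1, 2, 3), upper = 10)   # 0 5 10
by_partial  <- rescale(c(1, 2, 3), up = 10)      # same result, but fragile
by_position <- rescale(c(1, 2, 3), 0, 10)        # same result, harder to read

# Scoping: g() finds y in the environment where it was defined (inside f),
# not in the global environment.
y <- 1
f <- function() {
  y <- 2
  g <- function() y + 1
  g()
}
scoped <- f()                            # 3

# Short-circuiting: the right-hand side is only evaluated if it is needed.
v <- NULL
empty_or_missing <- is.null(v) || all(is.na(v))  # TRUE
never_stops <- TRUE || stop("!")                 # TRUE; stop() is never called

# source() runs a script file; here a one-line temporary script.
script <- tempfile(fileext = ".R")
writeLines("sourced_value <- 42", script)
source(script, local = TRUE)
```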

Control flow

  • describe the structure of an if statement [comprehension]

  • use a for loop to repeat the same operation on different elements of a data structure [application]

  • convert a for loop to a while loop [analysis]

  • illustrate why 1:length(x) is dangerous and suggest a safer way [application]

  • correct the indenting and spacing of a piece of poorly formatted source code [application]
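
A short sketch of the loop items above. seq_along() is offered as one safer alternative to 1:length(x); the helper names are made up.

```r
# 1:length(x) counts down when x is empty, so the loop body runs twice.
x <- numeric(0)
1:length(x)                        # 1 0
seq_along(x)                       # integer(0)

count_iterations <- function(idx) {
  n <- 0
  for (i in idx) n <- n + 1
  n
}
unsafe <- count_iterations(1:length(x))   # 2
safe <- count_iterations(seq_along(x))    # 0

# The same sum written as a for loop and as a while loop.
values <- c(3, 1, 4)

total_for <- 0
for (v in values) {
  total_for <- total_for + v
}

total_while <- 0
i <- 1
while (i <= length(values)) {
  total_while <- total_while + values[i]
  i <- i + 1
}

# An if statement: condition, branch, and optional else branches.
describe <- function(n) {
  if (n < 0) {
    "negative"
  } else if (n == 0) {
    "zero"
  } else {
    "positive"
  }
}
```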

Vectorisation/recycling

  • describe what vectorisation means, distinguish internal and external vectorisation, and explain the performance consequences of each [knowledge]

  • use vectorised operations instead of for loops to perform simple mathematical operations (log, addition, subtraction etc.) [application]

  • use lapply(), sapply() and apply() to vectorise operations that are not already vectorised. [analysis]

  • convert an lapply() call to a for loop [application]

  • recognise a for-loop that can be rewritten to use lapply [knowledge]

  • match common non-vectorised functions to their vectorised equivalents (e.g. min() and pmin(), sum() to cumsum() and colSums()) [knowledge]

  • describe basic recycling rules, and know how to avoid them when necessary [knowledge]
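
A sketch of the items above. It assumes "internal" vectorisation means the loop runs in compiled code (log10(), colSums()) and "external" means an R-level loop hidden behind lapply(), sapply() or apply(); the list does not define the terms. Data and names are hypothetical.

```r
x <- c(1, 10, 100)

# Vectorised: one call operates on the whole vector.
vec <- log10(x)                    # 0 1 2

# The equivalent explicit loop.
looped <- numeric(length(x))
for (i in seq_along(x)) {
  looped[i] <- log10(x[i])
}

# lapply() and the for loop it replaces.
words <- list("a", "bb", "ccc")
lens_lapply <- lapply(words, nchar)
lens_loop <- vector("list", length(words))
for (i in seq_along(words)) {
  lens_loop[[i]] <- nchar(words[[i]])
}
lens_simple <- sapply(words, nchar)     # 1 2 3, simplified to a vector

# apply() over columns, and its internally vectorised counterpart.
m <- matrix(1:6, nrow = 2)
col_apply <- apply(m, 2, sum)      # 3 7 11
col_fast <- colSums(m)             # 3 7 11

# Summary functions next to their element-wise or cumulative relatives.
min(c(1, 5), c(3, 2))              # 1: a single minimum over everything
pmin(c(1, 5), c(3, 2))             # 1 2: the minimum at each position
sum(1:4)                           # 10
cumsum(1:4)                        # 1 3 6 10

# Recycling: the shorter vector is repeated to the length of the longer one.
recycled <- 1:6 + 1:2              # 2 4 4 6 6 8

# One way to avoid accidental recycling is to check lengths first.
add_strict <- function(a, b) {
  stopifnot(length(a) == length(b))
  a + b
}
```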

Recovering from errors

  • recognise and remedy simple syntax errors (missing quotes, missing parentheses etc.) [comprehension]

  • use try() to recover from an error [application]

  • interpret the output of traceback() to identify where an error occurred [application]

  • initiate an interactive debugger with browser() or options(error = recover) [application]

  • list the commands used to control browser()/recover() [knowledge]

  • use options(warn = 2) to convert warnings into errors for debugging

  • create a minimal reproducible example to get help from others [synthesis]

  • find help for a function, data set, and package [knowledge]

  • read and interpret the documentation of a function [analysis]

  • use google to identify the name of a function that performs a given task
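
A sketch of the non-interactive items above. safe_log() is a hypothetical helper; the interactive tools are left as comments because they only make sense at the console.

```r
# try() lets execution continue after an error and returns a "try-error" object.
res <- try(log("a"), silent = TRUE)
failed <- inherits(res, "try-error")     # TRUE

# Wrapping try() to fall back to a default value.
safe_log <- function(x) {
  out <- try(log(x), silent = TRUE)
  if (inherits(out, "try-error")) NA_real_ else out
}
safe_log("a")                      # NA
safe_log(1)                        # 0

# options(warn = 2) turns warnings into errors, so they can be caught or traced.
old <- options(warn = 2)
res2 <- try(as.numeric("abc"), silent = TRUE)    # normally only a warning
options(old)                                     # restore the previous setting
warning_became_error <- inherits(res2, "try-error")   # TRUE

# Interactive tools, to be run at the console:
# traceback()                 # call stack of the most recent error
# options(error = recover)    # pass the function itself, not recover()
# browser()                   # placed inside a function, pauses execution there

# Finding help:
# ?mean                       # a function
# ?mtcars                     # a data set
# help(package = "stats")     # a package
```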

Package management

  • install a package with install.packages() [comprehension]

  • load a package with library() or require() [comprehension]

  • determine which packages are out of date [application]

  • understand lifetime of install.packages/library effects [comprehension]

  • use :: to refer to a function in a specific package
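
A sketch of the items above, using packages that ship with R so nothing has to be downloaded. The calls that need a network connection are commented out, and the package names in them are arbitrary or deliberately fake.

```r
# install.packages("stringr")   # needs a network connection; the effect persists
#                               # across sessions until the package is removed
# old.packages()                # lists installed packages with newer versions

# library() attaches a package for the current session only.
library(tools)
attached <- "package:tools" %in% search()        # TRUE

# require() returns TRUE or FALSE instead of stopping with an error.
has_stats <- require(stats)                      # TRUE
has_fake <- suppressWarnings(
  require("notARealPackage123", character.only = TRUE, quietly = TRUE)
)                                                # FALSE: no such package

# :: calls a function from a specific package without attaching the package.
med <- stats::median(c(1, 3, 10))                # 3
```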

Perhaps add some discussion about text editors/IDEs for managing scripts, etc.: RStudio (obviously), TINN-R, Notepad++, gedit, and whatever the Mac ones are (I don't use Macs, so I don't know). A brief overview of how to use these with R would be a nice addition to an intro curriculum.

When I have done basic R workshops, that is sometimes a problem point. More so than I would have expected, anyway.

Suggestions:
Under matrices, first bullet: mention NROW/NCOL.
Somewhere, talk about implicit conversion during logical comparison: 1 < "2".
Under control flow: use if(isTRUE(condition)) for robustness.
Under recycling: matrices are column-major, which affects how values are recycled in matrices.

ychen41 commented Sep 27, 2013

Introduce the knitr package to your students, and free them from the copy/paste in MS Word. A decently formatted, automatically generated report gives great sense of achievement!

Awesome work! Maybe mention that data frames have lists as their foundation (even showing is.list(data.frame()) == TRUE), as I think this clears up why we can't use matrices for much of the data we work with (because there is type heterogeneity across column vectors).

Nice!! If I were to suggest one change, then you might want to consider folding vectorized operations (including subsetting, selection, projection) into the exposition about data structures.

In an introductory class on R, I introduced vectorized operations along with data structures before introducing control flow, and this was well received by the students. We found that while every student had some programming and database experience, they related vectorized operations to SQL operations and steered clear of control flow while thinking about data transformations.

If it might help, the slide deck from this class is available at http://www.slideshare.net/venkateshprasadranganath/the-r-language-an-introduction.

jennybc commented Sep 28, 2013

I've learned to hammer on the meta-issues of using an IDE (probably RStudio, as @jhollist said) and being deliberate about where you work (probably by using an RStudio project). Attention to this early eliminates much aggravation about where things are to be read from and written to. This seems to make people much more willing to save scripts (instead of workspaces) and to break their work into pieces.

What about the ... argument?

+1 knitr.

hadley commented Sep 30, 2013

@jholish @jennydb good point - some basic editor familiarity is really important

@ychen41 @prabhasp I agree that knitr is important, but it's beyond the scope of this curriculum - knitr is for data analysis (generally), not programming

hadley commented Sep 30, 2013

@joshuaulrich

  • Why do you think NROW / NCOL are important? I can count the number of times I've used them on the fingers of one hand
  • I think isTRUE is a relatively advanced concept, and similarly, I think recycling vectors with matrices is generally a bad idea - you should be explicit about it.

hadley commented Sep 30, 2013

@vsbuffalo good point - I'll add a bullet

wabarr commented Oct 6, 2013

What about file system commands like copying, listing files in directory etc? And string manipulation (base and/or stringr)?
