@ejmg
Last active March 11, 2018 01:48
My wiki submission for CS441: Programming Languages

Table of Contents

  1. the R Programming Language
    1. Note:
    2. Summary
    3. History
      1. Origins
      2. Contemporary Uses
    4. Features
      1. Interpreted (REPL)
      2. Data and Data Types
      3. Lists
      4. Functions
      5. Iteration
      6. Matrices
      7. Lexical Scoping
    5. Examples
      1. Statistical Analysis
      2. Insertion sort, vectorized
    6. Evaluation

the R Programming Language

Note:

Blackboard really messed up the nice formatting of the HTML that Emacs exported from my .org file. I've made a GitHub Gist with the exact same content as below, but with the proper syntax highlighting and the generally nice formatting that Blackboard ruined. Link.

Summary

The R programming language is a free, open-source, declarative, and interpreted functional programming language focused on statistical data analysis.1 Created by senior statisticians Ross Ihaka and Robert Gentleman, both of the University of Auckland, R was derived from and inspired by both the S and Scheme programming languages and was first officially released in 1995.2 R has since become one of the most popular programming languages for users working on projects, on platforms, or in businesses that involve major statistical analysis and modeling.3, 4

History

Origins

The R programming language is a direct outgrowth of the S programming language, which was created by John Chambers while at the statistics research department within Bell Laboratories.5 It was the mid-1970s, and computing was still done in batch mode; furthermore, outside statistical software at the time was extremely unfriendly to modification. These two conditions led to the creation of S, as Chambers, and the statistics department at Bell generally, needed a language that allowed flexibility in both approach and problem topic. To quote Richard Becker, a co-worker of Chambers and a developer of S, on the creation of S at Bell Labs, "It was the realization that routine data analysis should not require writing Fortran programs…."5 Graphics was another key aspect of S that R would inherit, and it was at Bell Labs that a device-independent graphics library, GR-Z, was created to allow S to output high quality statistical plots.

This dual nature of statistical computation and visualization is nodded to in the title of Ihaka and Gentleman's initial paper on R, R: A Language for Data Analysis and Graphics.2 Ihaka and Gentleman developed R in response to problems that had plagued S's uptake in both academic and commercial settings. Until Unix itself was actively licensed out to academic institutions, there were not many machines that S could be installed on, and, even then, S itself was never licensed out prior to the opening of Unix.5 R, from early on, has been part of the GNU Project, which allows a global, open, and collaborative environment around its source code and extensions (distributed as R "packages").1 Further, while R inherited the syntax of S, its evaluation philosophy was taken directly from Guy Steele and Gerald Sussman's Scheme, particularly its lexical scoping.

Like other free software, R was largely a "spontaneous" creation of the programming community. Ihaka and Gentleman worked privately on the language and made a small announcement on a mailing list for the S language upon its completion, but the interest in R from the greater programming and statistics community made it clear that they were on to something.6 By email correspondence, Martin Mächler, a user of the initial R binaries released by Ihaka and Gentleman, argued that they should release their code under the GNU General Public License. As Ihaka writes, "We had some initial doubts about doing this, but Martin’s arguments were persuasive, and we agreed to make the source code available by ftp under the terms of the Free Software Foundation’s GNU general license. This happened in June of 1995."6 That release in 1995 would mark the first "official" release of R, unleashing with it a slow cascade of community development. By 1998, a "core" team of R developers had already formed beyond Ihaka and Gentleman to focus on developing the mechanics of the language, including later developments such as byte-compilation for the interpreter.6 The core team itself would lead to the creation of the R Foundation in 2003, which now oversees the development of R-core with a board made up of stakeholders and developers (such as John Chambers, the creator of S).7

Contemporary Uses

Since its initial release, R has seen massive uptake across industry. That said, there are a few particularly noteworthy uses of R, specifically in the hard and social sciences of biology, physics, and economics. While the origins of uptake can vary, they all tend to include R's status as a free and open technology, its large and committed developer base, and the ease of documenting research methodology made possible by R.8, 9

  1. Biology

    By far one of the largest communities of R users is found within the life sciences. It is no coincidence that the rise in statistical computing can almost be mapped along with the rise of DNA sequencing as a profession.10 While R joined the scene after languages like Perl had long established themselves, it has since been widely adopted, with users listing its "flexibility, a substantial collection of good statistical algorithms and high-quality numerical routines, the ability to easily model and handle data, numerous documentation, cross-platform compatibility, a well designed extension system and excellent visualisation capabilities" as reasons for its uptake in biology and bioinformatics.11 A major development was the founding of Bioconductor, an open-source bioinformatics software project meant to promote open-source software, common standards and methods, and high quality documentation of research.12

  2. Physics

    Adjacent to biology, the physics community has become a large segment of the R user base. The vast majority of legacy code within the physics community is written in FORTRAN and C++ because of the high computation needs of physicists, such as the particle physics done at CERN. ROOT is a data analysis framework and an object-oriented database system that has become critical to physicists.13 While its code remains based in C++ and FORTRAN, it now has an extremely popular bindings library for R, allowing researchers to have all the benefits of the fast, compiled computation that ROOT is based on together with the niceties provided by R's syntax and development environment.14

  3. Economics

    A recent development in the R community has been the economics profession picking up R. For decades, economics has been dominated by proprietary software solutions such as Stata and MATLAB. However, a recent analysis of the job market shows that there has been a large jump in the demand for economists with experience in open-source solutions, with R taking the lead by a wide margin.15

Features

Interpreted (REPL)

In R, the traditional development environment revolves around a read-eval-print loop (REPL), allowing data to be processed and computed interactively by a user.16 It also allows for faster feedback on the development of statistical models and their visualizations, in addition to the immediate error feedback that is standard with an interpreted language.

> 3 + 2
[1] 5
> x <- "wutang"
> x
[1] "wutang"
> y <- "forever"
> y
[1] "forever"
> paste(x, y, sep=" ")
[1] "wutang forever"

However, R, much like Python, is also capable of running standard scripts.16 The code in an R script is identical to that used while in a REPL session.
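
As a minimal sketch of that point, the snippet below writes the same statements used in the REPL session to a temporary file and then runs the file as a script with source(). The temporary path and the variable name greeting are illustrative assumptions, not part of the original session.

```r
# Write a small script to a temporary file (tempfile() picks the path).
script_path <- tempfile(fileext = ".R")
writeLines(c(
  'x <- "wutang"',
  'y <- "forever"',
  'greeting <- paste(x, y, sep = " ")'
), script_path)

# Run the script. local = TRUE evaluates it in the environment that
# source() is called from, so its assignments become visible here.
source(script_path, local = TRUE)
print(greeting)
```

From a shell, a saved script can also be run non-interactively with the Rscript command followed by the file name.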

Data and Data Types

It should be noted, first and foremost, that R provides no means of directly accessing data stored in memory. Instead, R provides "specialized data structures" known as "objects" by the R community. Such objects are referred to via symbols and variables, a characteristic shared with Scheme.16 The most common "objects" in R are vectors of its primitive data types, as discussed below.

R's data types are very different from those of most other languages, with the simplest data type being the atomic vector.17 An atomic vector is a vector as you conceive of one mathematically, is not unlike the sequence types found in other languages like Python, and can be declared and accessed like so:

> x <- c(1, 2, 3)
> x[1]
[1] 1

Here the function c() is the constructor for a vector (the c stands for "combine") and [i] is the index operator, which returns the value at the given index, i. Note that indices in R start at 1. As you can see above, one type of atomic vector is double. However, that is not the only type. All atomic types in R are as follows:17

| type       | example values        | assignment                 |
|------------|-----------------------|----------------------------|
| doubles    | 1, 3.0, 4.1           | `x <- c(1, 3.0, 4.1)`      |
| integers   | -1, 0, 1              | `x <- c(-1L, 0L, 1L)`      |
| characters | "w", "ODB", "WuTang"  | `x <- c("WuTang")`         |
| logicals   | TRUE, FALSE           | `x <- c(TRUE, FALSE)`      |
| complex    | 0+3i, 2+3i, 1+4i      | `x <- c(3i, 2+3i, 1+4i)`   |
| raw        | `57 75 54 61 6e 67`   | `x <- charToRaw("WuTang")` |

It should be noted that in R, the default numeric type is the double precision float, double. Given the nature of statistics, integer data is dealt with less often; however, when the need arises, one can create integer data by writing each value with an L suffix as shown above.

Continuing with R's quirks, there is no difference between the string and character data types. Indeed, there is no separate string type: the term is synonymous with character in R, going against the norm for most programming languages.17

Finally, while the above examples all make assignments with atomic vectors with a length greater than one, it is worth noting that an assignment such as `x <- 1.0` is still a vector. It is merely the trivial case: a vector of size one.

Due to their commonality in mathematics, sequences can be readily declared in R as a vector with the : operator as follows:

> mySeq <- 1:5
> mySeq
[1] 1 2 3 4 5

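This creates a vector of type integer holding the values 1 to 5. (The : operator produces integers when its endpoints are whole numbers; c(1, 2, 3, 4, 5) would produce doubles.)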
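
The types described in this section can be checked directly with typeof(). The sketch below is only an illustration using the same sample values as the table; it also shows the length-one case and what c() does when given mixed types.

```r
# typeof() reports the underlying storage type of an atomic vector.
typeof(c(1, 3.0, 4.1))        # "double"
typeof(c(-1L, 0L, 1L))        # "integer"
typeof("WuTang")              # "character"
typeof(c(TRUE, FALSE))        # "logical"
typeof(c(3i, 2+3i))           # "complex"
typeof(charToRaw("WuTang"))   # "raw"

# A sequence built with : from whole-number endpoints is stored as integer.
typeof(1:5)                   # "integer"

# A "scalar" is just a vector of length one.
length(1.0)                   # 1
is.vector(1.0)                # TRUE

# Mixing types in c() coerces every element to the most general type.
typeof(c(1L, 2.5))            # "double"
typeof(c(1, "a"))             # "character"
```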

Lists

Lists are another important data type in R. They are very close in nature to vectors as described above, but allow heterogeneous data. That is to say, we can have a single list holding a vector of doubles, a character vector, and a vector of logicals, as shown below:

> myDouble <- c(1, 2, 3)
> myChar <- "WuTang, SUUU"
> myLogical <- c(TRUE, TRUE, FALSE)
> myList <- list(myDouble, myChar, myLogical)
> myList
[[1]]
[1] 1 2 3

[[2]]
[1] "WuTang, SUUU"

[[3]]
[1]  TRUE  TRUE FALSE
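
Getting elements back out of a list is worth a short illustration, since [[ ]] and [ ] behave differently. The sketch below rebuilds the same list with element names; the names nums, crew, and flags are assumptions added for the example.

```r
myList <- list(
  nums  = c(1, 2, 3),
  crew  = "WuTang, SUUU",
  flags = c(TRUE, TRUE, FALSE)
)

myList[[1]]        # the element itself: the double vector 1 2 3
myList[1]          # a list of length one that contains that vector
myList$crew        # access by name: "WuTang, SUUU"
myList[["flags"]]  # access by name using [[ ]]

# str() prints a compact summary of each element and its type.
str(myList)
```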

Functions

As alluded to earlier, everything in R is an object. Functions are no exception to this rule, and they deserve some special attention because of it. While R accommodates imperative and object-oriented patterns, it is ultimately a functional programming language because it treats functions as "first class" citizens.18 They are treated no differently from other objects and can be passed to other functions as arguments or handed back as return values. This is largely an inspiration from the Lisp family of languages, with Scheme specifically influencing R's authors.2 It is a natural choice, as statistics revolves around continuously transforming data in various ways without changing the original data itself, a characteristic functional languages traditionally embody.

To declare a function in R, you use the function keyword. An example declaration is given below. The function takes two values as arguments and sums them.

> myFunc <- function(x, y){
    x + y
}
> myFunc(5,5)
[1] 10

An easy demonstration of R's functional nature is its built-in function lapply(), which operates very similarly to a map function in Lisp: provide it a list (or vector) of values and a function, and lapply() will apply the function to each value and return a new list holding the results. An example follows:

> addOne <- function(x){
    x + 1
}
> addOne(1)
[1] 2
> myVec <- 1:5 # an integer vector from 1 to 5; lapply() accepts vectors as well as lists
> lapply(myVec, addOne)
[[1]]
[1] 2

[[2]]
[1] 3

[[3]]
[1] 4

[[4]]
[1] 5

[[5]]
[1] 6
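
To make the "first class" claim concrete, here is a small sketch of functions being passed in as arguments and handed back as return values. The helper names applyTwice and makeAdder are made up for this illustration.

```r
# A function that takes another function as an argument.
applyTwice <- function(f, x) {
  f(f(x))
}

# A function that builds and returns a new function.
makeAdder <- function(n) {
  function(x) x + n
}

addOne <- makeAdder(1)
applyTwice(addOne, 1)            # 3

# Anonymous functions can be passed inline. sapply() simplifies the
# result to a vector where lapply() would return a list.
sapply(1:5, function(x) x + 1)   # 2 3 4 5 6
```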

Iteration

While R is a functional language, it is an impure one. As previously mentioned, R implements various OOP features, such as classes, and its code is often iterative in nature. Structures such as loops are also common, even though they are not as succinct or as pretty as R's equivalent functional solutions.

An example loop is given below:

> for(i in 1:10)
{
  print("WuTang Clan Represent!")
}
[1] "WuTang Clan Represent!"
[1] "WuTang Clan Represent!"
[1] "WuTang Clan Represent!"
[1] "WuTang Clan Represent!"
[1] "WuTang Clan Represent!"
[1] "WuTang Clan Represent!"
[1] "WuTang Clan Represent!"
[1] "WuTang Clan Represent!"
[1] "WuTang Clan Represent!"
[1] "WuTang Clan Represent!"
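
For comparison, the sketch below computes the same result three ways: with an explicit loop, with a functional helper, and with vectorized arithmetic. The variable names are illustrative only.

```r
values <- 1:10

# 1. Explicit loop: preallocate the result, then fill it in.
squares_loop <- numeric(length(values))
for (i in seq_along(values)) {
  squares_loop[i] <- values[i]^2
}

# 2. Functional style: vapply() also checks that each result is one number.
squares_apply <- vapply(values, function(v) v^2, numeric(1))

# 3. Vectorized arithmetic: the operator acts on the whole vector at once.
squares_vec <- values^2

all.equal(squares_loop, squares_apply)  # TRUE
all.equal(squares_loop, squares_vec)    # TRUE
```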

Matrices

As mentioned above, the core computations within R revolve around vectors. Matrices are another major data structure used in statistics; consequently, they have first class status as an object in R.

A sample construction is given below. We construct a matrix with the function matrix(), and must describe its data, the number of rows, the number of columns, and its fill order (by row or by column). Below we have built a matrix with 5 rows and 5 columns, filled by row, so that each row holds the vector x.

> x <- c(1, 2, 3, 4, 5)
> x
[1] 1 2 3 4 5
> M <- matrix(x,
              nrow=5,
              ncol=5,
              byrow=TRUE)
> M
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    2    3    4    5
[2,]    1    2    3    4    5
[3,]    1    2    3    4    5
[4,]    1    2    3    4    5
[5,]    1    2    3    4    5
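
Once a matrix exists, the usual operations are available on it. The following is a small sketch on a 2 x 3 matrix (chosen here only so the results are easy to check by hand) showing indexing, transposition, and matrix multiplication.

```r
M <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
# M is:
#      [,1] [,2] [,3]
# [1,]    1    2    3
# [2,]    4    5    6

dim(M)        # 2 3  (rows, columns)
M[2, 3]       # 6    (row 2, column 3)
M[1, ]        # 1 2 3, the first row as a vector
t(M)          # the transpose, a 3 x 2 matrix

# %*% is matrix multiplication; * would multiply element by element.
P <- M %*% t(M)   # a 2 x 2 result
P[1, 1]           # 1*1 + 2*2 + 3*3 = 14
```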

Lexical Scoping

Recall that R takes its scoping from Scheme rather than from its predecessor, S.2, 19 Scoping is lexical (static): a reference to a name inside a function is resolved by searching "up" from the frame in which the function was defined, through each enclosing frame in turn, until a binding for that name is found. Take the following snippet as an example:20

> a <- 1
> b <- 2
> f <- function(x)
{
  a*x + b
}
> g <- function(x)
{
  a <- 2
  b <- 1
  f(x)
}
> g(2)
[1] 4

The above example shows that g(2) returns 4 because f() was defined in the global frame and thus sees the global values of 1 and 2 for a and b, respectively; the local a and b inside g() are invisible to it. Under dynamic scoping, f() would instead see the a and b of its caller, g(), and the return value would be 2*2 + 1 = 5.
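
Lexical scoping is also what makes closures possible: a function created inside another function keeps access to the variables of the frame it was defined in. The sketch below is a standard illustration of this; make_counter and the counter names are hypothetical and do not come from the cited sources.

```r
make_counter <- function() {
  count <- 0
  function() {
    count <<- count + 1   # <<- assigns to count in the enclosing frame
    count
  }
}

counter_a <- make_counter()
counter_b <- make_counter()

counter_a()   # 1
counter_a()   # 2
counter_b()   # 1, because counter_b has its own count
```

Each call to make_counter() creates a fresh frame, so the two counters do not share state.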

Examples

Statistical Analysis

The example below demonstrates the very intuitive and powerful built-in capabilities of R for statistical analysis21:

> x <- c(1, 2, 3, 4, 5, 6)   # Create ordered collection (vector)
> y <- x^2              # Square the elements of x
> print(y)              # print (vector) y
[1]  1  4  9 16 25 36
> mean(y)               # Calculate average (arithmetic mean) of (vector) y; result is scalar
[1] 15.16667
> var(y)                # Calculate sample variance
[1] 178.9667
> lm_1 <- lm(y ~ x)     # Fit a linear regression model "y = B0 + (B1 * x)"
# store the results as lm_1
> print(lm_1)           # Print the model from the (linear model object) lm_1

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x
-9.333        7.000

> summary(lm_1)          # Compute and print statistics for the fit
                         # of the (linear model object) lm_1

Call:
lm(formula = y ~ x)

Residuals:
1       2       3       4       5       6
3.3333 -0.6667 -2.6667 -2.6667 -0.6667  3.3333

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -9.3333     2.8441  -3.282 0.030453 *
x             7.0000     0.7303   9.585 0.000662 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.055 on 4 degrees of freedom
Multiple R-squared: 0.9583, Adjusted R-squared: 0.9478
F-statistic: 91.88 on 1 and 4 DF,  p-value: 0.000662

> par(mfrow = c(2, 2))     # Request 2x2 plot layout

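> plot(lm_1)               # Diagnostic plots of the regression model

[Figure: the diagnostic plots for lm_1, arranged in a 2x2 layout]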

Insertion sort, vectorized

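Rather than use a purely iterative solution, we can utilize R's vector "nature" to solve many iterative problems more efficiently with vectorized operations.22 This includes the insertion sort example below, which still loops over the elements but uses vectorized comparison and indexing to find and perform each insertion:23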

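insertion_sort <- function(x) {
  # seq_along(x)[-1] is 2, 3, ..., length(x), and is empty for short inputs
  for (j in seq_along(x)[-1]) {
    key <- x[j]
    # 'bp' stands for breakpoint: the first position in x[1:j] whose
    # value is greater than key (which.max returns 1 if there is none)
    bp <- which.max(x[1:j] > key)
    if (bp == 1) {
      # either key belongs at the front, or it is already in place
      if (key < x[1]) {
        x <- c(key, x[-j])
      }
    } else {
      x <- x[-j]                                   # remove key from position j
      x <- c(x[1:(bp - 1)], key, x[bp:length(x)])  # reinsert it at the breakpoint
    }
  }
  return(x)
}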
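
The breakpoint step in the insertion sort above relies on how which.max() treats a logical vector, which may not be obvious at first glance. The following sketch, using made-up sample values, shows the behavior in isolation.

```r
x <- c(2, 5, 7, 4)
key <- 4

# Comparing a vector with a single value gives one logical per element.
x[1:3] > key                 # FALSE TRUE TRUE

# which.max() returns the position of the first maximum. On a logical
# vector that is the first TRUE, i.e. the first element larger than key.
which.max(x[1:3] > key)      # 2

# If nothing is larger, every entry is FALSE and which.max() returns 1,
# so a breakpoint of 1 is ambiguous and needs a second check.
which.max(c(1, 2, 3) > key)  # 1
```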

Evaluation

Over the course of researching R, I've come to a few conclusions. First of all, it is a fantastic language for those who are doing statistical work. My reasoning for this has largely been documented above, but I would list its open source license, the breadth and depth of its community, its interpreted nature, and its friendly syntax as the primary reasons.

If you are a statistician, whether academic or applied, or a scientist concerned with modeling some sort of phenomenon, R is an incredibly powerful tool.

Its licensing means that an entire academic department, let alone a university, can adopt a research workflow around R at zero financial cost in terms of licensing and deployment. This clears a huge impediment to much of academic and industrial research from the get-go, which gives R a major advantage over its rivals.

Its community further reinforces the utility of R as a language. In the absence of a corporation supporting and extending the language, as proprietary competitors have, a free language needs a dynamic community to keep it alive. R has been taken up across the hard sciences, is spreading into the social sciences, and is creeping into other parts of industry. Consequently, the package ecosystem of R has become absolutely huge and has something for practically anyone.

Finally, the language itself is a smart choice for someone working in numerical analysis. Its interpreted nature allows a much more natural workflow for users. It lets you focus on the work you are doing while keeping an active eye on its development. Compare this to a compiled language, where you have the constant problem of compilation: the time for code to compile and the resulting gap in feedback (was my code good? was my analysis good? did my image come out nicely?). Continuing, R's syntax does not get in the way of your work. It lends itself naturally to mathematical notation, and the very formatting of its data and structures similarly follows mathematical notation. Operations revolve around vectors, lists, and matrices, which means you can work at the level of the math rather than focusing on the details of the language. Altogether, this makes R a strong choice for someone who needs a programming language that takes care of the underlying details and gets out of the way of the user.

The only real downside of R comes from one of its main strengths, its interpreted nature: speed. As an interpreted language, R simply cannot compete with compiled languages when it comes to crunching extremely large datasets. This is a near unavoidable trade-off. Nearly. R luckily has bindings into other languages that give its packages large speed boosts by letting R call into much quicker compiled languages like C++ and FORTRAN, much as Python does with its NumPy library. All in all, R is a great language for numerical computation and modeling and is generally a strong language.

Footnotes

1 Chambers, J. Facets of R Special invited paper on "The Future of R". The R Journal. 1.

2 Ihaka, R. and Gentleman, R. R: A Language for Data Analysis and Graphics. Journal of Computational and Graphical Statistics 6 (1). 299-314

3 Vance, A. January, 2009. Data Analysts Captivated by R’s Power. New York Times.

4 David Robinson. 2017. The Impressive Growth of R. (October 2017). https://stackoverflow.blog/2017/10/10/impressive-growth-r/

5 Becker, R., A Brief History of S, Murray Hill, New Jersey: AT&T Bell Laboratories

6 Ihaka, Ross. R: Past and future history. Computing Science and Statistics. 30. 392–396.

7 https://stat.ethz.ch/pipermail/r-announce/2003/000385.html

8 https://ropensci.org/about/

9 Tina Amirtha. 2014. How The Rise Of The “R” Computer Language Is Bringing Open Source To Science. (March 2014). https://www.fastcompany.com/3028381/how-the-rise-of-the-r-computer-language-is-bringing-open-source-to-science

10 Robert Gentleman. 2008. R Programming for Bioinformatics (1 ed.). Chapman & Hall/CRC.

11 Gatto, L. and Christoforou, A. Using R and Bioconductor for proteomics data analysis. Biochimica et Biophysica Acta (BBA) - Proteins and Proteomics 1844, 1. 42–51. http://dx.doi.org/10.1016/j.bbapap.2013.04.032

12 Gentleman, R C. et al. Bioconductor: open software development for computational biology and bioinformatics. Genome biology 5, no. 10 (2004): R80.

13 Brun, R., and Rademakers, F. ROOT—an object oriented data analysis framework. Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment, 389(1-2), 81-86.

14 https://root.cern.ch/

15 Timo. 2017. Statistical software: its use and popularity in Economics. (August 2017). https://www.edawax.de/2017/08/statistical-software-its-use-and-popularity-in-economics/

16 R Core Team. 2017. "R language definition." Vienna, Austria: R Foundation for Statistical Computing.

17 Garrett Grolemund. 2015. Hands-on programming with R, Sebastopol, CA: O'Reilly.

18 Chambers, J. Object-Oriented Programming, Functional Programming and R. Statistical Science. 29 (2). 167-180.

19 GNU. Static Scoping. https://www.gnu.org/software/mit-scheme/documentation/mit-scheme-ref/Static-Scoping.html

20 Darren Wilkinson. 2011. Lexical scope and function closures in R. (November 2011). https://darrenjw.wordpress.com/2011/11/23/lexical-scope-and-function-closures-in-r/

21 https://en.wikipedia.org/wiki/R_(programming_language)#Basicsyntax

22 Colin Gillespie and Robin Lovelace. 2016. Efficient R programming: a practical guide to smarter programming, Sebastopol, CA: O'Reilly.

23 https://www.rosettacode.org/wiki/Sorting_algorithms/Insertion_sort#R
