Blackboard really messed up the nice formatting of my HTML code that Emacs exported from my .org file. I've made a GitHub Gist with the exact same content as below, but with the proper syntax highlighting and generally nice format that Blackboard ruined. Link.
The R programming language is a free and open-source, declarative, and interpreted functional programming language focused on statistical data analysis.1 Created by senior statisticians Ross Ihaka and Robert Gentleman, both of the University of Auckland, R was derived from and inspired by both the S and Scheme programming languages and was first officially released in 1995.2 R has since become one of the most popular programming languages for users working on projects, platforms, or in businesses that involve major statistical analysis and modeling.3, 4
The R programming language is a direct outgrowth of the S programming language, which was created by John Chambers while at the statistics research department within Bell Laboratories.5 It was the mid-1970s, and computing was still done in batch mode; furthermore, outside statistical software at the time was extremely unfriendly to modification. These two conditions led to the creation of S, as Chambers, and the statistics department at Bell generally, needed a language that allowed flexibility in approach and problem topic. To quote Richard Becker, a co-worker of Chambers and a developer of S, on Bell Labs' creation of S: "It was the realization that routine data analysis should not require writing Fortran programs…."5 Graphics was another key aspect of S that R would inherit, and it was at Bell Labs that a device-independent library, GR-Z, was created to allow S to output high-quality statistical plots.
This dual nature of statistical computation and visualization is nodded to in the title of Ihaka and Gentleman's initial paper on R, R: A Language for Data Analysis and Graphics.2 Ihaka and Gentleman developed R in response to the problems that had plagued S's uptake in both academic and commercial usage. Until Unix itself was actively licensed out to academic institutions, there were not many machines that S could be installed on, and, even then, S itself was never licensed out prior to the opening of Unix.5 R, from its outset, has been a member program of the GNU Project, which allows a global and open collaborative environment in terms of its source code and extensions (distributed as R "packages").1 Further, while R inherited the syntax of S, its evaluation philosophy was taken directly from Guy Steele and Gerald Sussman's Scheme, particularly its lexical scoping.
Like other free software, R was largely a "spontaneous" creation of the programming community. Ihaka and Gentleman worked privately on the language and made a small announcement on a mailing list for the S language upon its completion, but the interest in R from the greater programming and statistics community made it clear that they were on to something.6 In email correspondence, Martin Mächler, a user of the initial R binaries released by Ihaka and Gentleman, argued that they should release their code under the GNU General Public License. As Ihaka writes, "We had some initial doubts about doing this, but Martin’s arguments were persuasive, and we agreed to make the source code available by ftp under the terms of the Free Software Foundation’s GNU general license. This happened in June of 1995."6 The year 1995 thus marks the first "official" release of R, unleashing with it a slow cascade of community development. By 1998, a "core" team of R developers had already formed beyond Ihaka and Gentleman to focus on developing the mechanics of the language and eventual additions such as byte-compilation for the interpreter.6 The core team itself would lead to the creation of the R Foundation in 2003, which now oversees the development of R-core with a board made up of stakeholders and developers (such as John Chambers, of S fame).7
Since its initial release, R has seen massive uptake across industry. That said, there are a few particularly noteworthy uses of R, specifically in the hard and social sciences of biology, physics, and economics. While the origins of uptake vary, they all tend to include R's status as a free and open technology, its large and committed developer base, and the ease of documenting research methodology made possible by R.8, 9
- Biology
  By far one of the largest communities of R users is found within the life sciences. It is no coincidence that the rise in statistical computing can almost be mapped onto the rise of DNA sequencing as a profession.10 While R joined the scene after languages like Perl had long established themselves, it has since been widely adopted, with users listing its "flexibility, a substantial collection of good statistical algorithms and high-quality numerical routines, the ability to easily model and handle data, numerous documentation, cross-platform compatibility, a well designed extension system and excellent visualisation capabilities" all as reasons for its uptake in biology and bioinformatics.11 A major development was the founding of Bioconductor, an open-source bioinformatics software project meant to promote open-source software, common standards and methods, and high-quality documentation of research.12
- Physics
  Adjacent to biology, the physics community has become a large segment of the R user base and community. The vast majority of legacy code within the physics community is written in Fortran and C++ because of the heavy computational needs of physicists, such as the particle physics done at CERN. ROOT is a data analysis framework and an object-based database system that has become critical to physicists.13 While its code remains based in C++ and Fortran, it now has an extremely popular bindings library to R, allowing researchers to have all the benefits of the fast, compiled computation that ROOT is built on along with the niceties provided by R's syntax and development environment.14
- Economics
  A recent development in the R community has been the economics profession picking up R. For decades, economics has been dominated by proprietary software such as Stata and MATLAB. However, recent analysis of the job market shows a large jump in demand for economists with experience in open-source tools, with R taking the lead by a wide margin.15
In R, the traditional development environment revolves around a read-eval-print loop (REPL), allowing data to be processed and computed dynamically by a user.16 It also allows for faster feedback on the development of statistical models and their visualizations, in addition to the immediate error feedback typical of an interpreted language.
> 3 + 2
[1] 5
> x <- "wutang"
> x
[1] "wutang"
> y <- "forever"
> y
[1] "forever"
> paste(x, y, sep=" ")
[1] "wutang forever"
However, R, much like Python, is also capable of running standard scripts.16 The code in an R script is identical to that used in a REPL session.
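As a minimal sketch of the script workflow, the snippet below writes the same commands used in the REPL session into a file and runs them with `source()`. The file is a throwaway created with `tempfile()` purely for illustration, and the variable `greeting` is a name introduced here, not part of the original session. From a shell, the same file could be run non-interactively with `Rscript`.

```r
# Write a few lines of R code to a temporary script file (illustrative only)
script_path <- tempfile(fileext = ".R")
writeLines(c('x <- "wutang"',
             'y <- "forever"',
             'greeting <- paste(x, y, sep = " ")'),
           script_path)

# source() evaluates the file as if its lines had been typed at the prompt
source(script_path)
print(greeting)   # [1] "wutang forever"
```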
It should be noted, first and foremost, that R provides no means of directly accessing data stored in memory. Instead, R provides "specialized data structures" known as "objects" by the R community. Such objects are referred to via symbols and variables, a characteristic shared with Scheme.16 The most common "objects" in R are its primitive data types, discussed below.
R's data types are very different from those of most other languages, with the simplest data type being the atomic vector.17 An atomic vector is a vector as you would conceive of one mathematically, is not unlike the sequences found in other languages like Python, and can be declared and accessed like so:
> x <- c(1, 2, 3)
> x[1]
[1] 1
Here `c()` is the constructor for a vector (the `c` stands for "combine") and `[i]` is the index operator, which returns the value at the given index `i`; note that indexing starts at 1. As you can see above, one type of atomic vector is `double`. However, that is not the only type. All atomic types in R are as follows:17
type | example | assignment |
---|---|---|
doubles | 1, 3.0, 4.1 | `x <- c(1, 3.0, 4.1)` |
integers | -1, 0, 1 | `x <- c(-1L, 0L, 1L)` |
characters | "w", "ODB", "WuTang" | `x <- c("w", "ODB", "WuTang")` |
logicals | TRUE, FALSE | `x <- c(TRUE, FALSE)` |
complex | 0+3i, 2+3i, 1+4i | `x <- c(3i, 2+3i, 1+4i)` |
raw | `57 75 54 61 6e 67` | `x <- charToRaw("WuTang")` |
It should be noted that in R the default numeric type is the double-precision float, `double`. Given the nature of statistics, `integer` data is not often dealt with; however, when the need arises, one can create `integer` data by declaring it with an `L` suffix, as shown above.
Continuing with R's quirks, there is no difference between the `string` and `character` data types. Indeed, there is no separate `string` type, as the term is synonymous with `character` in R, going against the norm for most programming languages.17
Finally, while the above examples all make assignments with atomic vectors with a length greater than one, it is worth noting that an assignment such as `x <- 1.0` is still a vector. It is merely the trivial case: a vector of size one.
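The atomic types in the table, and the point that a single value is just a vector of length one, can be checked directly with `typeof()`, `is.vector()`, and `length()`. The sketch below is only an illustration; the variable `mixed` is a name introduced here to show that an atomic vector holds a single type.

```r
# typeof() reports the underlying atomic type of a vector
typeof(c(1, 3.0, 4.1))        # "double"
typeof(c(-1L, 0L, 1L))        # "integer"
typeof("WuTang")              # "character"
typeof(c(TRUE, FALSE))        # "logical"
typeof(3i)                    # "complex"
typeof(charToRaw("WuTang"))   # "raw"

# A lone number is still a vector, just one of length one
x <- 1.0
is.vector(x)   # TRUE
length(x)      # 1

# An atomic vector holds one type only, so mixed input is coerced
mixed <- c(1, "ODB", TRUE)
typeof(mixed)  # "character"
```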
Due to their commonality in mathematics, sequences can be readily declared in R as a vector with the `:` operator as follows:
> mySeq <- 1:5
> mySeq
[1] 1 2 3 4 5
This creates a vector of type `integer` running from 1 to 5; the `:` operator yields integers when its starting value is a whole number.
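Beyond the `:` operator, base R's `seq()` covers sequences with a step size or a fixed number of points. The following is a small sketch; the variable names are illustrative only.

```r
# The colon operator also counts downwards
countdown <- 5:1                      # 5 4 3 2 1

# seq() generalises ':' with a step size or a target length
evens   <- seq(2, 10, by = 2)         # 2 4 6 8 10
quarter <- seq(0, 1, length.out = 5)  # 0.00 0.25 0.50 0.75 1.00

# Sequences are ordinary vectors, so arithmetic applies element by element
squares <- (1:5)^2                    # 1 4 9 16 25
```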
Lists are another important data type in R. They are very close in nature to vectors as described above, but allow heterogeneous data. That is to say, we can have a list holding a vector of `doubles`, a `character` string, and a vector of `logicals`, as shown below:
> myDouble <- c(1, 2, 3)
> myChar <- "WuTang, SUUU"
> myLogical <- c(TRUE, TRUE, FALSE)
> myList <- list(myDouble, myChar, myLogical)
> myList
[[1]]
[1] 1 2 3
[[2]]
[1] "WuTang, SUUU"
[[3]]
[1] TRUE TRUE FALSE
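Once a list exists, its elements are usually pulled back out with `[[ ]]` or, when the elements are named, with `$`. The sketch below rebuilds the list above with hypothetical element names (`numbers`, `motto`, `flags`) chosen only for illustration.

```r
# Build the same kind of list, this time naming each element
myList <- list(numbers = c(1, 2, 3),
               motto   = "WuTang, SUUU",
               flags   = c(TRUE, TRUE, FALSE))

myList[[1]]         # double brackets extract the element itself: 1 2 3
myList$motto        # named elements can also be reached with $
myList[1]           # single brackets return a sub-list of length one

class(myList[[1]])  # "numeric"
class(myList[1])    # "list"

# str() prints a compact overview of the heterogeneous contents
str(myList)
```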
As alluded to earlier, everything in R is an object. Functions are no exception to this rule, but they are due some special attention because of it. While R partakes in imperative and object-oriented patterns, it is ultimately a functional programming language because R treats functions as "first class" citizens.18 They are treated no differently from other objects and can be passed to other functions as arguments or handed back as return values. This is largely inspired by the Lisp family of languages, with Scheme specifically influencing R's authors.2 This is a natural choice, as statistics revolves around continuously manipulating and transforming data in various ways without changing the data itself, a characteristic functional languages traditionally embody.
To declare a function in R, you use the `function` keyword. An example declaration is given below. The function takes two values as arguments and sums them.
> myFunc <- function(x, y){
x + y
}
> myFunc(5,5)
[1] 10
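To make the "first class" claim concrete, here is a minimal sketch in which a function is passed as an argument, returned from another function, and used without a name. The helpers `applyTwice` and `makeAdder` are hypothetical names invented for this example.

```r
# A function is an ordinary object: it can be stored in a variable ...
myFunc <- function(x, y) {
  x + y
}

# ... passed to another function as an argument ...
applyTwice <- function(f, x, y) {
  f(f(x, y), y)
}
applyTwice(myFunc, 5, 5)   # (5 + 5) + 5 = 15

# ... or built and returned by another function
makeAdder <- function(n) {
  function(x) x + n
}
addTen <- makeAdder(10)
addTen(5)                  # 15

# Anonymous functions need no name at all
(function(x) x * 2)(21)    # 42
```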
An easy demonstration of R's functional nature is its built-in function `lapply()`, which operates very similarly to a `map` function in Lisp: provide it a list of values and a function, and `lapply()` will apply the function to each value and return a new list of the results. An example follows:
> addOne <- function(x){
x + 1
}
> addOne(1)
[1] 2
> myList <- 1:5 # an integer vector from 1 to 5; lapply() treats each element as one list item
> lapply(myList, addOne)
[[1]]
[1] 2
[[2]]
[1] 3
[[3]]
[1] 4
[[4]]
[1] 5
[[5]]
[1] 6
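When a plain vector is more convenient than a list, base R offers `sapply()` and `vapply()` as close relatives of `lapply()`. The sketch below reuses `addOne` from above; the result variable names are illustrative.

```r
addOne <- function(x) {
  x + 1
}

# lapply() always returns a list ...
asList <- lapply(1:5, addOne)

# ... sapply() simplifies the result to a vector where it can ...
asVector <- sapply(1:5, addOne)             # 2 3 4 5 6

# ... and vapply() does the same while checking each result's type and length
checked <- vapply(1:5, addOne, numeric(1))  # 2 3 4 5 6

# Because arithmetic is vectorised, this particular case needs no mapping at all
direct <- 1:5 + 1                           # 2 3 4 5 6
```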
While R is a functional language, it is a dirty one. That is to say, R is far from a pure functional language. As previously mentioned, R implements various OOP features, such as classes, and its code is often imperative in nature. Structures such as loops are also common, even though they are not as succinct and pretty as R's functional equivalents.
An example loop is given below:
> for(i in 1:10)
{
print("WuTang Clan Represent!")
}
[1] "WuTang Clan Represent!"
[1] "WuTang Clan Represent!"
[1] "WuTang Clan Represent!"
[1] "WuTang Clan Represent!"
[1] "WuTang Clan Represent!"
[1] "WuTang Clan Represent!"
[1] "WuTang Clan Represent!"
[1] "WuTang Clan Represent!"
[1] "WuTang Clan Represent!"
[1] "WuTang Clan Represent!"
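For comparison, here is a rough sketch of one small task, squaring the numbers 1 to 10, written as a loop, as a functional mapping, and as a vectorised expression. The task and variable names are chosen only for illustration.

```r
# Imperative version: pre-allocate a result vector and fill it in a loop
squaresLoop <- numeric(10)
for (i in 1:10) {
  squaresLoop[i] <- i^2
}

# Functional version: map an anonymous function over the same sequence
squaresApply <- sapply(1:10, function(i) i^2)

# Vectorised version: operate on the whole vector at once
squaresVec <- (1:10)^2

identical(squaresLoop, squaresApply)  # TRUE
identical(squaresLoop, squaresVec)    # TRUE
```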
As mentioned above, the core computations within R revolve around vectors. Matrices are another major data structure used in statistics; consequently, they have first-class status as objects in R.
A sample construction is given below. We create a matrix with the `matrix()` function, and must describe its data, the number of rows, the number of columns, and its fill order (by row or by column). Below we have declared a matrix with 5 rows and 5 columns, filled by row, so that each row holds the vector `x`.
> x <- c(1, 2, 3, 4, 5)
> x
[1] 1 2 3 4 5
> M <- matrix(x,
nrow=5,
ncol=5,
byrow=TRUE)
> M
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    2    3    4    5
[2,]    1    2    3    4    5
[3,]    1    2    3    4    5
[4,]    1    2    3    4    5
[5,]    1    2    3    4    5
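With the matrix in hand, indexing and the usual linear-algebra operations follow the mathematical notation closely. A short sketch using the same `M` (the names `elementwise` and `product` are illustrative):

```r
x <- c(1, 2, 3, 4, 5)
M <- matrix(x, nrow = 5, ncol = 5, byrow = TRUE)

dim(M)        # 5 5: rows, then columns
M[2, 3]       # element in row 2, column 3: 3
M[, 1]        # first column: 1 1 1 1 1
t(M)[, 1]     # first column of the transpose, i.e. the first row of M: 1 2 3 4 5

# '*' multiplies element by element; '%*%' is the matrix product
elementwise <- M * M
product     <- M %*% M
elementwise[1, ]   # 1 4 9 16 25
product[1, ]       # 15 30 45 60 75
```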
Recall that R takes its scoping from Scheme rather than from its predecessor, S.2, 19 Scoping is set statically (lexically): a reference to a name searches "up" from the frame in which the function was defined, frame by frame, for a binding to that name. Take the following snippet as an example20:
> a <- 1
> b <- 2
> f <- function(x)
{
a*x + b
}
> g <- function(x)
{
a <- 2
b <- 1
f(x)
}
> g(2)
[1] 4
The above example shows that `g(2)` returns 4 because `f()` was declared in the global frame, and thus picks up the global values of 1 and 2 for `a` and `b`, respectively. Under dynamic scoping, `f()` would instead see the values of `a` and `b` set inside its caller, `g()`, and the return value would be 5.
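Lexical scoping is also what makes closures possible: a function created inside another function keeps access to the frame it was defined in. The counter below is a minimal sketch of that idea; `makeCounter`, `counterA`, and `counterB` are hypothetical names used only here.

```r
# makeCounter() returns a function that remembers the frame it was created in
makeCounter <- function() {
  count <- 0
  function() {
    count <<- count + 1   # '<<-' assigns to 'count' in the enclosing frame
    count
  }
}

counterA <- makeCounter()
counterB <- makeCounter()

counterA()   # 1
counterA()   # 2
counterB()   # 1: each closure keeps its own 'count'

# A global variable with the same name is never touched
count <- 100
counterA()   # 3
```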
The example below demonstrates the very intuitive and powerful built-in capabilities of R for statistical analysis21:
> x <- c(1, 2, 3, 4, 5, 6) # Create ordered collection (vector)
> y <- x^2 # Square the elements of x
> print(y) # print (vector) y
[1] 1 4 9 16 25 36
> mean(y) # Calculate average (arithmetic mean) of (vector) y; result is scalar
[1] 15.16667
> var(y) # Calculate sample variance
[1] 178.9667
> lm_1 <- lm(y ~ x) # Fit a linear regression model "y = B0 + (B1 * x)"
# store the results as lm_1
> print(lm_1) # Print the model from the (linear model object) lm_1
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept) x
-9.333 7.000
> summary(lm_1) # Compute and print statistics for the fit
# of the (linear model object) lm_1
Call:
lm(formula = y ~ x)
Residuals:
1 2 3 4 5 6
3.3333 -0.6667 -2.6667 -2.6667 -0.6667 3.3333
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -9.3333 2.8441 -3.282 0.030453 *
x 7.0000 0.7303 9.585 0.000662 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.055 on 4 degrees of freedom
Multiple R-squared: 0.9583, Adjusted R-squared: 0.9478
F-statistic: 91.88 on 1 and 4 DF, p-value: 0.000662
> par(mfrow = c(2, 2)) # Request 2x2 plot layout
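The fitted model is itself an object, so its pieces can be pulled out programmatically rather than read off the printed summary. The sketch below refits the same model and uses base R's `coef()`, `residuals()`, and `predict()`; the new input value 7 is an arbitrary choice for illustration.

```r
x <- c(1, 2, 3, 4, 5, 6)
y <- x^2
lm_1 <- lm(y ~ x)

# coef() pulls the fitted intercept and slope out of the model object
coef(lm_1)

# residuals() returns the values printed under "Residuals" by summary()
round(residuals(lm_1), 4)

# predict() evaluates the fitted line at a new x value
predict(lm_1, newdata = data.frame(x = 7))   # -9.3333 + 7 * 7, about 39.667

# R-squared is stored on the summary object
summary(lm_1)$r.squared
```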
Rather than use a purely iterative solution, we can utilize R's vector "nature" to solve many iterative problems more efficiently through vectorized operations.22 This includes the insertion sort example below, which replaces the usual inner loop with a vectorized comparison:23
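insertion_sort <- function(x) {
  if (length(x) < 2) return(x)   # nothing to sort
  for (j in 2:length(x)) {
    key <- x[j]
    # 'bp' stands for breakpoint: the first position in x[1:j] holding a
    # value greater than key (which.max() returns 1 when there is none)
    bp <- which.max(x[1:j] > key)
    if (bp == 1) {
      if (key < x[1]) {
        x <- c(key, x[-j])       # key is the new minimum, move it to the front
      }
    } else {
      x <- x[-j]                 # remove key from position j ...
      x <- c(x[1:(bp - 1)], key, x[bp:length(x)])   # ... and reinsert it at bp
    }
  }
  x                              # return only after the loop has finished
}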
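The breakpoint search in the function above relies on how `which.max()` behaves on a logical vector: it returns the position of the first `TRUE`, or 1 when there is none. A small sketch with made-up values:

```r
x <- c(1, 4, 7, 9, 5)
key <- 5

x > key              # FALSE FALSE TRUE TRUE FALSE
which.max(x > key)   # 3: position of the first TRUE
which.max(x > 100)   # 1: no TRUE at all, so the first position is returned
```

This is why the sort has to treat a breakpoint of 1 as a special case: it can mean either "insert at the front" or "nothing is larger than the key".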
Over the course of researching R, I've come to a few conclusions. First of all, it is a fantastic language for those who are doing statistical work. My reasoning for this has largely been documented above, but I would list its open-source license, the breadth and depth of its community, its interpreted nature, and its friendly syntax as the primary reasons.
If you are a statistician, whether academic or applied, or a scientist concerned with modeling some sort of phenomenon, R is an incredibly powerful tool.
Its licensing means that an entire academic department, let alone a university, can adopt a research workflow around R at zero financial cost in terms of licensing and deployment. This clears a huge impediment to much of academic and industrial research from the get-go, which gives R a big advantage over its rivals.
Its community further reinforces the utility of R as a language. In the absence of a corporation supporting and extending it, as proprietary competitors have, a free language needs a dynamic community to keep it alive. R has been taken up across the hard sciences, is spreading into the social sciences, and is creeping into other parts of industry. Consequently, the package ecosystem of R has become absolutely huge and has something for practically anyone.
Finally, the language itself is a smart choice for someone working in numerical analysis. Its interpreted nature allows a much more natural workflow for users. It lets you focus on the work you are doing as well as keep an active tab on its development. Compare this to a compiled language, where you have the constant problem of compilation: the time for code to compile and the time gap in feedback (was my code good? was my analysis good? did my image come out nicely?). Continuing, R's syntax does not get in the way of your work. It lends itself naturally to mathematical notation, and the very formatting of its data and structures similarly follows mathematical notation. Operations revolve around vectors, lists, and matrices, which means you can work at the level of the math rather than focusing on the details of the language. Altogether, this makes R a strong choice for someone who needs a programming language that takes care of the underlying details and gets out of the way of the user.
The only real downside of R comes from one of its main strengths, its interpreted nature: speed. As an interpreted language, R simply cannot compete with compiled languages when it comes to crunching extremely large datasets. This is a near unavoidable trade-off. Nearly. R luckily has bindings into other languages that give its packages extreme speed boosts by letting R call into much quicker compiled languages like C++ and Fortran, much like Python does with its NumPy library. All in all, R is a great language for numerical computation and modeling and is generally a strong language.
1 Chambers, J. Facets of R Special invited paper on "The Future of R". The R Journal. 1.
2 Ihaka, R. and Gentleman, R. 1996. R: A Language for Data Analysis and Graphics. Journal of Computational and Graphical Statistics 5 (3). 299-314
3 Vance, A. January, 2009. Data Analysts Captivated by R’s Power. New York Times.
4 David Robinson. 2017. The Impressive Growth of R. (October 2017). https://stackoverflow.blog/2017/10/10/impressive-growth-r/
5 Becker, R., A Brief History of S, Murray Hill, New Jersey: AT&T Bell Laboratories
6 Ihaka, Ross. R: Past and future history. Computing Science and Statistics. 30. 392–396.
7 https://stat.ethz.ch/pipermail/r-announce/2003/000385.html
9 Tina Amirtha. 2014. How The Rise Of The “R” Computer Language Is Bringing Open Source To Science. (March 2014). https://www.fastcompany.com/3028381/how-the-rise-of-the-r-computer-language-is-bringing-open-source-to-science
10 Robert Gentleman. 2008. R Programming for Bioinformatics (1 ed.). Chapman & Hall/CRC.
11 Gatto, L. and Christoforou, A. Using R and Bioconductor for proteomics data analysis. Biochimica et Biophysica Acta (BBA) - Proteins and Proteomics 1844, 1. 42–51. http://dx.doi.org/10.1016/j.bbapap.2013.04.032
12 Gentleman, R C. et al. Bioconductor: open software development for computational biology and bioinformatics. Genome biology 5, no. 10 (2004): R80.
13 Brun, R., and Rademakers, F. ROOT—an object oriented data analysis framework. Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment, 389(1-2), 81-86.
15 Timo. 2017. Statistical software: its use and popularity in Economics. (August 2017). https://www.edawax.de/2017/08/statistical-software-its-use-and-popularity-in-economics/
16 R Core Team. 2017. R Language Definition. Vienna, Austria: R Foundation for Statistical Computing.
17 Garrett Grolemund. 2015. Hands-on programming with R, Sebastopol, CA: O'Reilly.
18 Chambers, J. Object-Oriented Programming, Functional Programming and R. Statistical Science. 29 (2). 167-180.
19 GNU. Static Scoping. https://www.gnu.org/software/mit-scheme/documentation/mit-scheme-ref/Static-Scoping.html
20 Darren Wilkinson. 2011. Lexical scope and function closures in R. (November 2011). https://darrenjw.wordpress.com/2011/11/23/lexical-scope-and-function-closures-in-r/
21 https://en.wikipedia.org/wiki/R_(programming_language)#Basicsyntax
22 Colin Gillespie and Robin Lovelace. 2016. Efficient R programming: a practical guide to smarter programming, Sebastopol, CA: O'Reilly.
23 https://www.rosettacode.org/wiki/Sorting_algorithms/Insertion_sort#R