R is a good tool to have in your toolbox for manipulating and visualizing local data. Even if you can only make bar charts and line charts, that is already far more useful than displaying the data as text only. It also has a nice IDE (RStudio).
When you look for answers about R on Stack Overflow, you might have many wtf moments. There are many absurdly hacky answers that people gladly offer. This is often frustrating. I think this paragraph describes the R situation quite well:
Just for reference, 80% of awful Perl "code" in my $work falls under this - it was written by financial analysts who are smart enough to pick up a Perl book and some earlier scripts, clone off a script that does what business need is, and don't have CS/programming background to worry about how readable/maintainable their code was. - from stackoverflow
This short guide shows a minimal way to get something done in R.
At the time of writing, the versions are R 3.6.1 and RStudio 1.2.
RStudio also has built-in vim keybinding support. It is not perfect, but it works well enough. The setting is under Tools > Global Options > Code > Keybindings > Vim.
Change the RStudio working directory: RStudio > Preferences > General > Default working directory.
To use Spark:
- Install Java. Download JDK 8 from Corretto. I use Java 8 because, the last time I tried, Spark only worked with Java 8.
- Install Spark from RStudio:
install.packages("sparklyr")
library(sparklyr)
spark_available_versions() # see available version
spark_install(version = "2.4")
Create a new file, then the top left pane is where you write the script. The console (R REPL) is on the bottom left pane. Plots and documentation are on the bottom right pane.
On Mac, Ctrl is Cmd, and Alt is Option. These shortcuts help you reach for the mouse/trackpad less often.
- Ctrl+1 -> move cursor to the top-left pane (script area).
- Ctrl+2 -> move cursor to the console (REPL).
- Ctrl+Enter on the script area -> run the statement/expression where the cursor currently is.
x <-
42 # when your cursor is here and you press ctrl+enter,
# the entire `x <- 42` statement is run, not 42 only.
- Alt+Enter -> same as Ctrl+Enter, but your cursor will stay on the same line (not moved down).
- Block text then Ctrl+Enter -> run selected code.
To run a function declaration, you can just put your cursor at the end of its closing bracket.
Block the lines you want to comment/uncomment, then press Ctrl+Shift+C (Cmd+Shift+C on Mac). Anything after # is a comment.
Run ?<function name> to bring up the documentation for that particular function.
?as.factor
Use install.packages('packageName') to install a package/library. Run it once in the Console and it will be installed permanently (until explicitly removed).
install.packages('tidyverse') # contains dplyr & ggplot
install.packages('funModeling') # used only for `freq`
Use <- for assigning variables.
x <- 42
Variable name quirks:
- A dot (.) is a valid character in a variable name; it is not a method call or field access like in OOP. So as.factor is just one function name; there is no as object.
- Google's R style guide suggests camelCase for variables and PascalCase for functions, yet Advanced R suggests snake_case for both variables and functions.
Strings can use single or double quotes.
'foo' == "foo"
Concatenate multiple strings using paste.
paste('foo', 'bar') == 'foo bar' # default separator is space
paste('foo', 'baz', sep='') == 'foobaz'
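A few more paste variations, as a minimal sketch in base R. paste0 is a shorthand for sep='', and the collapse argument joins the elements of a single vector. The variable names here are made up for illustration.

```r
# paste0() is a shorthand for paste(..., sep = '')
joined <- paste0('foo', 'baz')

# paste() is vectorized: 'item' is combined with each element
labels <- paste('item', c(1, 2, 3), sep = '_')

# collapse merges the elements of one vector into a single string
csv_line <- paste(c('a', 'b', 'c'), collapse = ',')

print(joined)
print(labels)
print(csv_line)
```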
If you want to print to the console, the Python-like way is to use cat.
x <- 42
name <- 'foo'
cat(name, 'says the answer is', x, '\n') # doesn't include newline automatically
# to remove spaces in between:
cat('this', 'has', 'no', 'space', 'in', 'between', sep='')
- The values are TRUE and FALSE.
- AND operator is &&; OR operator is ||.
- Equal operator is ==; not equal is !=.
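The operators above can be tried out like this. One thing worth knowing, which is my addition rather than something covered above: && and || are meant for single values, while the single-character & and | work element by element on vectors. The variable names are only for illustration.

```r
a <- TRUE
b <- FALSE

both <- a && b     # AND on single values
either <- a || b   # OR on single values
differ <- a != b   # not equal

# & and | compare two vectors element by element
flags <- c(TRUE, FALSE, TRUE) & c(TRUE, TRUE, FALSE)

print(both)
print(either)
print(differ)
print(flags)
```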
If a statement is too long, you can break it into multiple lines. Note that you need to leave an operator hanging at the end of the line to let R know the statement continues below.
40 +
2 # again, if you ctrl+enter on this line,
# the entire 40+2 statement will be executed.
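To see why the hanging operator matters, here is a small sketch comparing both placements. The variable names total and partial are just for illustration.

```r
# Operator left hanging at the end: R keeps reading the next line
total <- 40 +
  2

# Operator at the start of the next line: the first line is already
# a complete statement, so `+ 2` is evaluated on its own
partial <- 40
+ 2

print(total)
print(partial)
```

The second form does not raise an error, so it is an easy mistake to miss.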
Declaring function example:
identity <- function(x) {
x # last line in function is implicitly returned
} # if you ctrl+enter here, the entire function declaration will be run
identity(3) # return 3
# or, with dplyr syntax
library(dplyr)
3 %>% identity # return 3
For an explicit return inside a function, use the return() function, e.g. return(x).
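An explicit return is mostly useful for leaving a function early. A minimal sketch, where sign_label is a hypothetical helper made up for this example:

```r
# hypothetical helper: describes the sign of a single number
sign_label <- function(x) {
  if (x < 0) {
    return('negative')  # leave the function early
  }
  if (x == 0) {
    return('zero')
  }
  'positive'  # the last expression is returned implicitly
}

print(sign_label(-5))
print(sign_label(0))
print(sign_label(7))
```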
A vector looks like this: c(1, 2, 3). It is like an array in C/Java or a list in Python.
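A short sketch of common vector operations in base R. Note that indexing starts at 1, and that arithmetic and comparisons apply to every element. The variable names are made up for illustration.

```r
v <- c(10, 20, 30)

first <- v[1]      # indexing starts at 1, not 0
doubled <- v * 2   # arithmetic applies to every element
big <- v[v > 15]   # keep only the elements matching a condition
n <- length(v)

print(first)
print(doubled)
print(big)
print(n)
```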
A list looks like this: list(a=c(1, 2), b='foo'). It is like a struct/record/dictionary. It is useful for passing parameters dynamically (like kwargs in Python).
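A sketch of both uses: accessing list elements by name, and passing a list as arguments with do.call from base R, which is one way to get the kwargs-like behavior. The names person and paste_args are made up for illustration.

```r
person <- list(name = 'foo', scores = c(1, 2))

person_name <- person$name            # access an element with $
person_scores <- person[['scores']]   # or with [[ ]]

# do.call() passes the list elements as arguments to the function,
# so this is the same as paste('foo', 'bar', sep = '-')
paste_args <- list('foo', 'bar', sep = '-')
result <- do.call(paste, paste_args)

print(person_name)
print(person_scores)
print(result)
```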
Data frame is like a table in SQL. It has columns and rows. When you read a csv file, it will be read as a data frame.
# create a new data frame
dummy_df <- data.frame(num=c(1, 2, 3), bool=c(TRUE, TRUE, FALSE))
dummy_df %>% colnames
# output: "num" "bool"
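A few more things you can do with the same data frame using only base R, as a rough sketch. The variable names other than dummy_df are made up for illustration.

```r
dummy_df <- data.frame(num = c(1, 2, 3), bool = c(TRUE, TRUE, FALSE))

n_rows <- nrow(dummy_df)
n_cols <- ncol(dummy_df)

# a single column is just a vector
nums <- dummy_df$num

# keep only the rows where the bool column is TRUE
true_rows <- dummy_df[dummy_df$bool, ]

print(n_rows)
print(n_cols)
print(nums)
print(true_rows)
```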
A factor is a finite set of elements or categories (like colors: red, blue, green). The distinct elements are called levels. By default the levels are sorted alphabetically.
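A minimal sketch of the default level ordering and how to override it with the levels argument of factor. The variable names are made up for illustration.

```r
color_values <- c('red', 'blue', 'green', 'red')

# default: levels are sorted alphabetically
color_factor <- factor(color_values)
default_levels <- levels(color_factor)

# explicit order through the levels argument
custom_factor <- factor(color_values, levels = c('red', 'green', 'blue'))
custom_levels <- levels(custom_factor)

# count how many times each level occurs
counts <- table(color_factor)

print(default_levels)
print(custom_levels)
print(counts)
```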
With dplyr, you can write code with far fewer parentheses () and your code will look functional programming-y. It is like SQL (dplyr also has joins) but with a syntax that can be composed (chainable).
You can use %>% to pass the previous value as the first argument of the next function.
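Here is a small comparison of nested calls and the piped version, assuming dplyr is installed (it comes with tidyverse). Both compute the same thing; the variable names nested and piped are made up for illustration.

```r
library(dplyr)

dummy_df <- data.frame(num = c(1, 2, 3), bool = c(TRUE, TRUE, FALSE))

# nested calls are read inside-out
nested <- nrow(filter(dummy_df, bool))

# the same steps with %>% are read left to right
piped <- dummy_df %>%
  filter(bool) %>%
  nrow()

print(nested)
print(piped)
```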