---
title: "Reproducible R coding"
author: "Martin Jung"
date: "12.02.2015"
output:
  ioslides_presentation:
    highlight: zenburn
    highlighter: highlight.js
    hitheme: zenburn
  slidy_presentation:
    incremental: no
subtitle: CMEC R-Group
---
## Goals of reproducible programming? {.vcenter .build .larger .incremental}
- Make your code readable by you and others
- Group your code and functionalize
- Embrace collaboration, version control and automation
## First step - readability {.vcenter}
### 1. Writing cleaner code
![Coding Mess](codingMess.jpg)
## Writing cleaner R code | Names { .emphasized .build}
- Keep new filenames descriptive and meaningful
```{r,results='hide'}
"helper-functions.R"
# or for sequences of processing work
"01_Download.R"
"02_Preprocessing.R"
#...
```
- Use one consistent naming scheme for variables, such as snake_case, CamelCase or dot.separated
```{r,results='hide'}
"spatial_data"
"ModelFit"
"regression.results"
```
### Avoid names of existing functions like `c` or `plot`
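A minimal sketch of why reusing such names causes trouble. The careless helper below is hypothetical and is defined inside `local()` so that it does not leak into your workspace:

```r
# Inside this scope a user-defined `c` hides the base function
masked.result <- local({
  c <- function(...) "masked"   # hypothetical careless helper
  c(1, 2, 3)                    # calls the helper, not base::c
})
masked.result

# The original is still reachable through its namespace
base::c(1, 2, 3)
```

The code keeps running, but it silently does something different from what a reader expects.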
## Writing cleaner R code | Spacing {.emphasized .build}
Use spacing just as in the English language
```{r,results='hide'}
# Good
model.fit <- lm(age ~ circumference, data = Orange)
# Bad
f1=lm(Orange$age~Orange$circumference)
```
Don't be afraid of using new lines
```{r,results='hide'}
model.results <- data.frame(Type = sample(letters, 10),
                            Data = NA,
                            SampleSize = 10)
# Same goes for loops
# And don't forget good documentation
```
## More on writing clean code {.flexbox .vcenter}
- [Google R Style Guide](https://google-styleguide.googlecode.com/svn/trunk/Rguide.xml)
- [Hadley Wickham's Style Guide](http://adv-r.had.co.nz/Style.html)
- [rOpenSci Guide](http://ropensci.github.io/reproducibility-guide/sections/writingCode/)
<br>
<div align="left">
There is even an R package to clean up your code:
[formatR](http://yihui.name/formatR/)
</div>
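A small sketch of what `formatR` can do, assuming the package is installed. The input string reuses the "bad" line from the spacing slide, and the call is skipped if `formatR` is not available:

```r
bad.code <- "f1=lm(Orange$age~Orange$circumference)"

if (requireNamespace("formatR", quietly = TRUE)) {
  # arrow = TRUE asks formatR to rewrite `=` assignments as `<-`;
  # the tidied code is printed to the console
  formatR::tidy_source(text = bad.code, arrow = TRUE)
}
```

An automatic formatter fixes spacing and assignment style, but it cannot choose meaningful names for you.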
## Further ways to improve reproducibility
- Ideally attach your code + data to publications
- Open-access hosts ([DataDryad](http://datadryad.org/), [Figshare](http://figshare.com/), [Zenodo](http://www.zenodo.org/))
- Restructuring of workflow with RMarkdown / LaTeX / HTML
<div align="center">
![Coding Mess](Knitr-document-structure.png)
</div>
## Functionalize! {.flexbox .vcenter .build}
- Many `R` users are tempted to write highly specialized, non-reusable code
- Number 1 rule for clear coding:
### ***DRY*** - `Don't repeat yourself!`
<br>
***Simple example:***<br>
We want to fit a linear model to test whether, in an
orange orchard, tree circumference (mm) increases with tree age (days).
If so, we want to quantify and display the root-mean-square error (`RMSE`) of this fit for each
individual orange tree in the dataset (`N = 5`).
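For reference, the RMSE is conventionally defined from the observed values $y_i$ and the fitted values $\hat{y}_i$ (the symbols are introduced here for illustration), so it is the square root of the mean squared residual:

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}
$$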
***
Normal way:
```{r,results='hide'}
# Linear model
model.fit <- lm(age ~ circumference, data = Orange)
model.resid <- residuals( model.fit )
model.fitted <- fitted( model.fit )
# Overall RMSE: residuals already are observed minus fitted values
rmse <- sqrt( mean( model.resid^2 ))
# RMSE per tree
tapply(model.resid, Orange$Tree,
       function(x) sqrt( mean( x^2 )))
```
***
```{r,echo=FALSE}
barplot( tapply(model.resid, Orange$Tree, function(x) sqrt( mean( x^2 ))) )
```
## Defining your functions {.build}
Essentially, most R packages are just collections of useful functions that users have written.
```{r,results='hide'}
# We want to get the RMSE of a linear model
rmse <- function(fit, groups = NULL, ...) {
  # Residuals are the observed minus the fitted values
  f.resid <- residuals(fit)
  if (!is.null(groups)) {
    # One RMSE per group, e.g. per tree
    tapply(f.resid, groups, function(x) sqrt(mean(x^2, ...)))
  } else {
    # RMSE over all observations
    sqrt(mean(f.resid^2, ...))
  }
}
```
---
```{r}
model.fit <- lm(age ~ circumference, data = Orange)
# This function is more flexible, can be further customized and
# applied in other situations
rmse(model.fit)
rmse(model.fit, Orange$Tree)
```
## (Very) short intro to pipes
Pipes (`|`) are a common tool in the Linux / programming world that chain the output of one
command into the input of the next.
<br>
In `R`, the `magrittr` package provides the `%>%` operator, which `dplyr` also makes available, to enable piping between functions.
Goal:
```
Solve complex problems by combining simple pieces
(Hadley Wickham)
```
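As a minimal sketch, assuming the `magrittr` package is installed, the same calculation can be written as nested calls or as a pipe. The variable names `nested` and `piped` are made up for this illustration:

```r
library(magrittr)  # provides %>% (dplyr makes the same operator available)

# Nested calls have to be read from the inside out
nested <- round(sqrt(mean(Orange$circumference)), 1)

# The piped version reads left to right: each result becomes
# the first argument of the next function
piped <- Orange$circumference %>%
  mean() %>%
  sqrt() %>%
  round(1)

identical(nested, piped)
```

Both forms compute the same value; the pipe only changes the order in which the steps are written.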
***
```{r,tidy=FALSE,message=FALSE,results='hide',fig.show='hide'}
library(dplyr)
model.rmse <- Orange %>%
  lm(age ~ circumference, data = .) %>%
  rmse(., Orange$Tree) %>%
  barplot
```
Or like this (correlation within the iris dataset):
```{r,tidy=FALSE}
iris %>%
  group_by(Species) %>%
  summarize(count = n(), pear_r = cor(Sepal.Length, Petal.Length)) %>%
  arrange(desc(pear_r))
```
## Outsource your functions {.flexbox .vcenter}
```{r,results='hide'}
# Put your functions into a separate file
# At the beginning of your main processing script
# you simply load them via source
source("outsourced.rmse.R")
```
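A self-contained sketch of this workflow. The helper file and its `square()` function are hypothetical and are written to a temporary directory so that the example runs anywhere:

```r
# Hypothetical helper file, created on the fly for this demo
helper.file <- file.path(tempdir(), "helper-functions.R")
writeLines("square <- function(x) x^2", helper.file)

# Load the helpers into their own environment to keep the workspace tidy
helpers <- new.env()
source(helper.file, local = helpers)

helpers$square(4)
```

Calling `source()` without the `local` argument, as on the slide, defines the functions directly in the global workspace, which is usually fine for a processing script.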
## Easy package writing {.flexbox .vcenter}
- Open RStudio
- Install the `devtools` and `roxygen2` packages
- Create a new package project and use the existing function as basis
- Create the documentation for it
- Update the package metadata and build your package
```{r,results='hide',eval=FALSE}
library(roxygen2)
library(devtools)
# Build your package with two simple commands
# Both have to be run from within your package project
document() # Update the documentation and the NAMESPACE file
install() # Install the package
```
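The documentation step relies on `roxygen2` comments placed directly above a function. A hedged sketch of what this could look like for an RMSE helper: the wording of the tags is illustrative, and the function body is a simplified stand-in that assumes the RMSE is computed from the model residuals alone:

```r
#' Root-mean-square error of a fitted model
#'
#' @param fit A fitted model object, e.g. from lm().
#' @param groups Optional grouping vector; if given, the RMSE is
#'   returned per group.
#' @param ... Further arguments passed on to mean().
#' @return A single number, or one value per group.
#' @export
rmse <- function(fit, groups = NULL, ...) {
  f.resid <- residuals(fit)
  if (is.null(groups)) {
    sqrt(mean(f.resid^2, ...))
  } else {
    tapply(f.resid, groups, function(x) sqrt(mean(x^2, ...)))
  }
}
```

Running `document()` turns these comments into help files and adds the exported function to the NAMESPACE file.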
## {.flexbox .vcenter}
- However, package development has many facets and options.
- More detailed info on [Package development with RStudio](https://support.rstudio.com/hc/en-us/sections/200130627-Package-Development).
<br>
- Higher acceptance for method papers and analysis code. [Make it citable with a DOI](https://guides.github.com/activities/citable-code/)
## Software management and collaboration with GitHub {.flexbox }
- Git is one of the most commonly used revision control systems
- Originally developed for the Linux kernel by Linus Torvalds
***
![How to Git](Git_operations.png)
***
> GitHub is a web-based software repository hosting service offering distributed revision control
> Californian startup, now the largest code host in the world
> Offers public repositories for free, private ones for money and a nice snippet exchange service called gists
<div align="right">
![Github](github-logo.jpg)
</div>
## How to Git with RStudio (do it later)
1. Set up an account with a Git repository host like [GitHub](https://github.com/)
2. Install RStudio and Git for your platform (http://www.rstudio.com/ide/docs/version_control/overview)
3. Link to the Git executable within the RStudio options
4. Create a new repository on GitHub and a new project in RStudio -> Version Control -> Git
5. Clone your empty repository (`clone`), record new files/changes in it (`commit`) and upload them (`push`)
## {.flexbox .vcenter}
<div align="center">
![Github](createdRepo.png)
</div>
<b> Idea for CMEC R Users: </b>
- Create a GitHub organization (like a repository basecamp)
## Further developments
There are now packages to push gists and regular Git updates directly from within `R`.
To use them you need a GitHub API key (instructions on the websites below).
[rgithub](https://github.com/cscheid/rgithub)
Too detailed to show here, but have a look at the `gistr` package:
[gistr](https://github.com/ropensci/gistr)