sebkopf/analysis.Rmd

## readme.md

      
# R markdown and Data Frame Viewer tutorial

This tutorial provides an introduction to R markdown and basic data processing (import, data structuring & plotting) in R. You can download the whole folder by clicking the Download ZIP button above on the right.

Prerequisites:

- install R and RStudio

Included:

- introduction to markdown format
    - open the markdown tutorial file (rmd_tutorial.Rmd) in RStudio
- introduction to R markdown with an example analysis
    - open the analysis file in RStudio (analysis.Rmd, make sure you have example.xlsx in the same folder)
- introduction to Data Frame Viewer (included in rmd_tutorial.Rmd)


## analysis.Rmd
---
title: "Analysis test"
output: html_document
---

```{r, echo=FALSE, warning=FALSE}
# This code chunk makes sure that all the libraries used here are installed. It will not be shown in the report (notice echo = FALSE).
packages <- c("readxl", "knitr", "tidyr", "dplyr", "ggplot2", "plotly")
missing_pkgs <- setdiff(packages, rownames(installed.packages()))
if (length(missing_pkgs) > 0) {
  message("Installing missing package(s): ", paste(missing_pkgs, collapse = ", "))
  # an explicit CRAN mirror is given because none may be set when the document is knit non-interactively
  install.packages(missing_pkgs, repos = "https://cloud.r-project.org")
}
```
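
As a possible alternative to comparing against `installed.packages()`, the same check can be sketched with base R's `requireNamespace()`, which reports whether a package can be loaded without attaching it. The package names below are placeholders chosen for illustration (the last one is deliberately made up so that something shows up as missing), and nothing is installed by this sketch.

```r
# Placeholder package list; "notARealPackage123" is a hypothetical name
packages <- c("stats", "utils", "notARealPackage123")

# TRUE for every package that can be loaded, FALSE otherwise
is_installed <- vapply(packages, requireNamespace, logical(1), quietly = TRUE)

# Names of the packages that would still need to be installed
missing_pkgs <- packages[!is_installed]
missing_pkgs
```

This avoids building the full table of installed packages, which can be slow on large libraries.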


This is a simple example analysis of data including import from Excel, data structuring and plotting. The data in this case happens to be optical density data over time (replicate growth curves for a microorganism) but the nature of the data matters little to the basics introduced.

## Import OD data

```{r}
library(readxl) # fast excel reader
#library(googlesheets) # fast google spreadsheet reader (not used here but could be useful)
data.raw <- read_excel("example.xlsx", skip = 1)
```

#### Show the raw data

```{r}
library(knitr) # the package that renders R markdown and has some good additional functionality
kable(data.raw)
```

### Restructuring the data

Turn the wide-format Excel data into *long* format. Note: here we use the pipe operator `%>%`, which simplifies chaining operations by passing the result on its left as the first argument of the function on its right.

```{r}
library(tidyr) # for restructuring data very easily
data.long <- data.raw %>% gather(sample, OD600, -Time)
# data.long <- gather(data.raw, sample, OD600, -Time) # this would be identical without using %>%
```
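
To see what the reshaping above does without needing the Excel file, here is a minimal sketch on a small made-up table. The data frame `wide` and its values are assumptions for illustration only. It also shows `pivot_longer()`, the newer tidyr interface for the same operation, which should give the same values in a different row order.

```r
library(tidyr)

# Hypothetical wide-format table: one row per time point, one column per sample
wide <- data.frame(
  Time = c(0, 1, 2),
  A1 = c(0.01, 0.05, 0.12),
  A2 = c(0.02, 0.06, 0.15)
)

# gather() as used in the analysis: every column except Time becomes a
# (sample, OD600) pair
long_gather <- gather(wide, sample, OD600, -Time)

# The same reshaping with pivot_longer()
long_pivot <- pivot_longer(wide, cols = -Time,
                           names_to = "sample", values_to = "OD600")

long_gather
```

Both results have one row per time point and sample (3 time points x 2 samples = 6 rows).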

Introduce a time column in hours (elapsed time since the first measurement).

```{r}
library(dplyr, warn.conflicts = FALSE) # powerful for doing calculations on data (by group, etc.)
data.long <- data.long %>% mutate(time.hrs = as.numeric(Time - Time[1], units = "hours"))
```
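
The conversion above relies on how R handles differences between date-times. A minimal base R sketch, using made-up time stamps in place of the `Time` column:

```r
# Hypothetical time stamps standing in for the Time column of the spreadsheet
Time <- as.POSIXct(
  c("2015-07-01 08:00:00", "2015-07-01 09:30:00", "2015-07-02 08:00:00"),
  tz = "UTC"
)

# Subtracting date-times gives a difftime object; its units are picked automatically
elapsed <- Time - Time[1]

# Asking for hours explicitly makes the result independent of the automatic units
time.hrs <- as.numeric(elapsed, units = "hours")
time.hrs
```

Passing `units = "hours"` matters here: without it, `as.numeric()` would return the values in whatever unit the `difftime` happened to choose.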

First plot of all the data

```{r}
library(ggplot2) # powerful plotting package for aesthetics driven plotting

p1 <-
  ggplot(data.long) + # initiate plot
  aes(x = time.hrs, y = OD600, color = sample) + # setup aesthetic mappings
  geom_point(size = 5) # add points to plot
print(p1) # output plot
```


### Combining data by adding sample meta information from the spreadsheet's second tab

```{r}
data.info <- read_excel("example.xlsx", sheet = "info")
```

Show all information (these are the experimental conditions for each sample)

```{r}
kable(data.info)
```

Combine OD data with sample information.

```{r}
data.all <- merge(data.long, data.info, by = "sample")
```
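
One detail of `merge()` worth knowing is that by default it keeps only rows whose key appears in both tables. The following sketch uses small made-up data frames (`od` and `info` are hypothetical stand-ins for `data.long` and `data.info`) to show the difference between the default and `all.x = TRUE`.

```r
# Hypothetical measurements; sample B1 has no entry in the info table
od <- data.frame(
  sample = c("A1", "A1", "A2", "B1"),
  OD600 = c(0.01, 0.05, 0.02, 0.03)
)

# Hypothetical sample information
info <- data.frame(
  sample = c("A1", "A2"),
  substrate = c("glucose", "acetate")
)

# Default: only samples present in both tables are kept (B1 is dropped)
combined <- merge(od, info, by = "sample")

# all.x = TRUE keeps every measurement and fills missing information with NA
combined_all <- merge(od, info, by = "sample", all.x = TRUE)

nrow(combined)
nrow(combined_all)
```

If rows seem to disappear after combining, a sample name that is spelled differently in the two tabs is a likely cause.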

### Show us the data

Reuse the same plot, using `%+%` to substitute the original data set with the new one and to map the color to the new information we added (everything else about the plot stays the same).

```{r}
p1 %+% data.all %+% aes(color = substrate)
```

### Summarize data

To make the figure a little easier to navigate, we're going to summarize the data for each condition (combine the replicates) and replot it with an error band showing the whole range of data points for each condition. We could reuse the plot `p1` again, but for clarity we construct the plot from scratch instead.

```{r}
data.sum <- data.all %>%
  group_by(time.hrs, substrate) %>%
  summarize(
    OD600.avg = mean(OD600),
    OD600.min = min(OD600),
    OD600.max = max(OD600))
data.sum %>% head() %>% kable() # show the first six rows

p2 <- ggplot(data.sum) + # initiate plot
  aes(x = time.hrs, y = OD600.avg, ymin = OD600.min, ymax = OD600.max,
      fill = substrate) + # setup global aesthetic mappings
  geom_ribbon(alpha = 0.3) + # value range (uses ymin and ymax, and fill for color)
  geom_line() + # connect averages (uses y)
  geom_point(shape = 21, size = 5) + # add points for averages (uses y and fill for color)
  theme_bw() + # style plot
  labs(title = "My plot", x = "Time [h]", y = "OD600", fill = "Condition") # add labels (the legend belongs to the fill aesthetic)

print(p2)
```

*Note that we could also have had ggplot do the whole statistical summarising for us using `stat_summary`, but it's often helpful to have these values separately for other calculations and purposes.*
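
For comparison, the `stat_summary` route mentioned in the note might look roughly like this. It is a sketch on made-up replicate data (the data frame `reps` is an assumption), and it uses the `fun`, `fun.min` and `fun.max` arguments, which need a reasonably recent ggplot2 version.

```r
library(ggplot2)

# Hypothetical replicate data: two measurements per time point, one condition
reps <- data.frame(
  time.hrs = rep(c(0, 1, 2), times = 2),
  OD600 = c(0.10, 0.20, 0.40, 0.20, 0.40, 0.60),
  substrate = "glucose"
)

p <- ggplot(reps) +
  aes(x = time.hrs, y = OD600, fill = substrate) +
  # ribbon from the minimum to the maximum at each time point
  stat_summary(fun = mean, fun.min = min, fun.max = max,
               geom = "ribbon", alpha = 0.3) +
  # line through the means
  stat_summary(fun = mean, geom = "line")

# The values ggplot computed for the ribbon layer
summary_layer <- layer_data(p, 1)
summary_layer[, c("x", "y", "ymin", "ymax")]
```

The summary values exist only inside the plot here, which is the trade-off the note describes: convenient for plotting, less convenient if the averages are needed elsewhere.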

Now we could, e.g., focus on a subset of the data but reuse the same plot, using `%+%` to substitute the original data set with a new one (everything else about the plot stays the same).

```{r}
p2 %+% filter(data.sum, !grepl("background", substrate), time.hrs < 25)
```
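
The subsetting step on its own can be tried out on a small made-up table. In this sketch, `toy` and its values are assumptions for illustration; the two conditions passed to `filter()` are combined with a logical AND.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical summary table
toy <- data.frame(
  substrate = c("glucose", "background (no cells)", "acetate", "glucose"),
  time.hrs = c(10, 10, 30, 20),
  OD600.avg = c(0.20, 0.01, 0.50, 0.35)
)

# Keep rows whose substrate does not contain "background" AND that are
# earlier than 25 hours
toy_subset <- filter(toy, !grepl("background", substrate), time.hrs < 25)
toy_subset
```

Only the two glucose rows remain: the background row fails the first condition and the acetate row fails the second.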

Save this plot automatically as a PDF by setting plot options in the R code chunk header (the file name is derived from the chunk label and `fig.path`):

```{r this-is-my-plot, dev="pdf", fig.width=7, fig.height=5, fig.path="./"}
print(p2)
```
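
Outside of a knitted document, a plot can also be written to a file explicitly with ggplot2's `ggsave()`. The sketch below uses a stand-in plot built from the built-in `cars` data and a temporary file path; both are assumptions for illustration, and in the analysis above the plot would be `p2`.

```r
library(ggplot2)

# Hypothetical stand-in plot
p <- ggplot(cars) + aes(x = speed, y = dist) + geom_point()

# Temporary output location; the file extension selects the PDF device
out_file <- file.path(tempdir(), "this-is-my-plot.pdf")

# Width and height are in inches by default
ggsave(out_file, plot = p, width = 7, height = 5)

file.exists(out_file)
```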

#### Interactive plot

Last, you can make simple interactive (JavaScript) plots out of your original ggplots (plotly does not yet work well for all ggplot features, but it's a start for easy visualization). You can of course also construct plotly plots without ggplot for more customization, but that's for another time.

```{r}
library(plotly, warn.conflicts = FALSE)
ggplotly(p1)
```


## example.xlsx

      
        
    
## rmd_tutorial.Rmd
---
output: html_document
---

# R markdown and Data Frame Viewer tutorial

This tutorial provides an introduction to [R markdown](http://rmarkdown.rstudio.com/)
and the [Data Frame Viewer](https://github.com/sebkopf/dfv#dfv).

## Markdown

**Markdown** is a very basic and easy-to-use syntax for styling written documents.
It's very easy to make some words **bold** and other words *italic* with Markdown.
You can even [link to NCBI](http://www.ncbi.nlm.nih.gov/)!

### Headers

Sometimes it's useful to have different levels of headings to structure your documents.
Start lines with a `#` to create headings. Multiple `##` in a row denote smaller heading sizes.

You can use anywhere from one `#` up to six (`######`) for different heading sizes.

If you'd like to include a quote, use the > character before the line:

> My Software never has bugs. It just develops random features.

### Lists

Sometimes you need numbered lists (here, links to some useful resources for markdown):

1. [Markdown Basics from RStudio](http://rmarkdown.rstudio.com/authoring_basics.html)
1. [Mastering Markdown from GitHub](https://guides.github.com/features/mastering-markdown/) (this is where most of the examples above come from)

And sometimes you want bullet points (the kind of things you can do with R markdown
if you want to go beyond the basics):

- [Lots of options for embedded R code](http://rmarkdown.rstudio.com/authoring_rcodechunks.html) (more details below)
- [Bibliographies and References](http://rmarkdown.rstudio.com/authoring_bibliographies_and_citations.html)
- [Interactive Documents](http://rmarkdown.rstudio.com/authoring_shiny.html)
- And if you have sub points, indent them under the parent item (four spaces here):
    - Like this
    - And this

### Equations

Equation support can be very handy if you need to provide some formulas in your
text. Just use $\LaTeX$ [math](https://en.wikibooks.org/wiki/LaTeX/Mathematics): $x=\sum\beta\frac{\pi^2}{\gamma_i}$

Or more complicated large ones:

$$
f(n) =
  \begin{cases}
    n/2       & \quad \text{if } n \text{ is even}\\
    -(n+1)/2  & \quad \text{if } n \text{ is odd}\\
  \end{cases}
$$

### Images

If you want to embed images, this is how you do it:

![Pluto loves you](http://i.space.com/images/i/000/048/999/i02/pluto-new-horizons-july-2015.jpg?1437582878)

And now time for a horizontal break and off to R!

------

## R markdown

**R markdown** is a version of Markdown that is extended to support running R code
in between your text. The blocks of R code are called `chunks`, and you can treat
them as individual little segments of code: you can jump back and forth between them,
run just individual ones, or run all of them when you click the **Knit** button. Knitting
generates a document that includes both the content and the output of any
embedded R code chunks. This is an R code chunk:

```{r my-first-chunk}
data <- cars # get the cars data set as an example
summary(data) # show a summary of the data set
```

You can also print out your data in table format if you want to include it in
your document:

```{r, results="asis"}
library(knitr)
kable(head(data))
```

Or you can print out the value of a variable in your text, say the value of $\pi$
with 4 significant digits: `r signif(pi, 4)` or the number of data points in
your data set: `r nrow(data)`.
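
To check what the two inline expressions above will print before knitting, the same calls can be run in the console. This small sketch repeats the `data <- cars` assignment so that it runs on its own.

```r
data <- cars     # same example data set as in the chunk above

signif(pi, 4)    # pi rounded to 4 significant digits: 3.142
nrow(data)       # number of rows in the cars data set: 50
```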

And of course you can embed plots, for example:

```{r my-plot, echo=FALSE, fig.width=10}
plot(data)
```

For additional information on R and R markdown, there are lots of great resources
on the internet and the R user community is very active and extremely helpful. Often,
googling what you'd like to achieve will provide a good starting point but I can
also recommend the following resources specifically:

 - [R reference card](http://cran.r-project.org/doc/contrib/Short-refcard.pdf) (a great overview of many useful R commands)
 - [Regression analysis functions](http://cran.r-project.org/doc/contrib/Ricci-refcard-regression.pdf) (statistical analysis is one of the great strengths of R, this is a handy overview of useful functionality)
 - [Stack Overflow](http://stackoverflow.com/) (a Q&A site for programming, searching for answers here often provides very helpful information)

With that, time to jump to the last item.

------

## Data Frame Viewer

A note up front: the approach taken in this user interface is no longer the easiest available (more recently developed R packages make things even easier). If you're already familiar with some basics of coding and generating plots, I recommend jumping straight to the accompanying *analysis.Rmd* file and working through it instead. However, if you'd like to start just by exploring some plotting features without any R or coding background, this is still a great way to get started.

The [Data Frame Viewer](https://github.com/sebkopf/dfv#dfv) is a custom R package that provides a simple user interface to facilitate getting started with using R for data processing. The GUI illustrates how to import data from Excel, melt data frames into plottable format, add additional information to the data and plot it using ggplot. It provides an easy system to keep track of multiple plots and save them in PDF format, and it always shows the actual code that is executed to process or plot the data, so users can experiment with changing the code directly and copy it to make their own data processing pipeline independent of this GUI.

The user interface is generated using [GTK+](http://www.gtk.org/), a cross-platform toolkit for graphical user interfaces. If GTK is not installed yet, please follow this [link](https://gist.github.com/sebkopf/9405675) for information on installing R with GTK+.

### Install dfv package

The **devtools** package provides a super convenient way of installing the **dfv** package directly from GitHub. To install **devtools**, run the following from the R command line:

```
install.packages('devtools', dependencies = TRUE) # development tools
```

Then install the latest version of the Data Frame Viewer directly from GitHub by running the following code. The first time you install the **dfv** package, all missing dependencies (**ggplot2, plyr, psych, scales, grid, gWidgets, RGtk2**, and **xlsx**, as well as their respective dependencies) are installed automatically, which might take a few minutes:


```
library(devtools)
install_github("sebkopf/dfv")
```

For additional information and troubleshooting help, see the [online help](https://github.com/sebkopf/dfv#dfv).

### Run dfv

Once installed, you can now run the Data Frame Viewer simply by typing:

```
library(dfv)
dfv.start()
```