halpo/000-instructions.md

## 000-instructions.md

      
    Raw
  

              000-instructions.md
            
          
    Building a beamer presentation with knitr.

Introduction

The documents included are the input for knitr.  In addition you need to have the tool pandoc installed.   I also use a custom beamer template to add the University of Utah \institute command to the template.  It also changes the indentation some.
Steps


knit document with

library(knitr)
knitr("harvestr.Rmd")
This step will require harvestr and it's dependencies to be installed in R.

use pandoc to create beamer slides.

pandoc -t beamer --template=utah.beamer.tex -Vurl:1 -V theme:CambridgeUS -V colortheme:beaver  -V shorttitle:harvestr  harvestr.md -o harvestr.pdf
The final result

  
## harvestr.Rmd
% Reproducible Parallel Simulations with Harvestr
% \href{mailto:Andrew.Redd@hsc.utah.edu}{Andrew Redd}
% UseR! 2012

`ro dev='pdf', cache=T, warning=T, error=T, out.width="\\textwidth"
    , fig.width=4, fig.height=3 or`
```{r ex1_0, include=F}
library(harvestr)
library(plyr)
library(ggplot2)
library(lme4)
library(dostats)
library(doParallel)
options(harvestr.time=F)
```
```{r ex3_parallel_setup, include=F, dependson='ex3', cache=F}
library(parallel)
library(doParallel)
cl <- makeCluster(2)
clusterEvalQ(cl, {
    library(lme4)
    library(harvestr)
})
clusterExport(cl, ls())
registerDoParallel(cl)
```

\includegraphics[width=\textwidth]{./Bill Gates lazy person.jpg)


# Reproducible Simulation

## Reproducibility

* Same script
* Same results
* Anywhere
    + Single thread
    + Multi-core
    + Cloud Scale

## Everything starts with a seed.
Simulation is based off Pseudo-random number generation (PRNG).

* PRNG is sequential, next number depends on the last state.
* Seeds are used to store the state of a random number generator
* by 'Setting a seed' one can place a PRNG into any exact state.

## Parallel Random Number Generation
Simulation is complicated in new parallel environments.

* PRNG is sequential,
* parallel execution is not,
* and order of execution is not guaranteed.

This is where parallel pseudo-random number generators help out.

## Parallel PRNG
Parallel pseudo-random number generators start with a singe state that
can spawn additional streams as well as streams of random numbers.

1. SPRNG
2. L'Ecuyer combined multiple-recursive generator

# Introducing `harvestr`

## R package `harvestr`
<https://github.com/halpo/harvestr>

What `harvestr` does:

* Reproducibility
* Caching
* Under parallelized environments.


## How `harvestr` works

* Analytical elements are separated into work-flows of dependent elements.
    + Set up environment/seed
    + Generate Data
    + Perform analysis
        - Stochastic
        - Non-Stochastic
    + Summarize
* Results from one step carry to another by carrying the seed with the results.


## **Primary work-flow** for `harvestr`

* `gather(n)` - generate `n` random number streams.
* `farm(seeds, expr)` - evaluate `expr` with each seed in `seeds`.
* `harvest(x, fun)` - for each data in `x` call the function `fun`
  (based off `plyr`s `llply`).

\includegraphics[width=\textwidth]{./c0bfeebb.pdf}

## Example - Simple simulation

**Generate Data**

```{r ex1_1}
seeds <- gather(10, seed=20120614)
data  <- farm(seeds, {
    x1  <- rnorm(400)
    x2  <- rnorm(400)
    g   <- rep(rnorm(4), each=100)
    trt <- rep(1:4, each=100)
    y <- rnorm(n=400, mean=3*x1 + x2 + g)
    data.frame(y, x1, x2, trt)
})
```

-------

**Perform Analysis**

```{r  ex1_2, dependson='ex1_1'}
analyses <- harvest(data, lmer, formula = y~x1+(1|trt))
analyses[[1]]
```

-------

**Recombine**

```{r ex1_3, dependson='ex1_2'}
  results <- ldply(analyses, fixef)
  qplot(data=results[,1:2], x=x1, y=`(Intercept)`)
```

## Stochastic Analysis in `harvestr`

* `gather` then `farm` as before.
* `graft` to generate seeds

\includegraphics[width=\textwidth]{./b837b118.pdf}

## Example 2 - Stochastic Analysis
**graft to obtain independent RNG sub-streams**

```{r ex2_1, dependson='ex1_2'}
branch <- graft(analyses[[1]], 5)
chains <- harvest(branch, mcmcsamp, 100)
mcmcfx <- harvest(chains, slot, 'fixef')
names(mcmcfx) <- 1:5
df2 <- ldply(llply(mcmcfx, t), as.data.frame)
```

------

```{r ex2_2, dependson='ex2_1'}
qplot(data=df2, x1, col=.id, geom=c('density'))
```

## Example 3 Chained.

```{r ex3, tidy=F, dependson='ex1_2'}
branches <- harvest(analyses, graft, 5)
chains <- harvest(branches
            , harvest, mcmcsamp, n=100)
```

I'm really impatient and would like to do this in parallel.

## parallel

Just like `plyr` argument `.parallel`.

* uses [`plyr`](http://cran.r-project.org/package=plyr) and
 [`foreach`](http://cran.r-project.org/package=foreach) parallel structures.

```{r ex3_parallel, warning=F, dependson='ex3.parallel.setup', tidy=F}
#! Parallel environment already setup.
system.time(chains <-
  harvest(branches, harvest, mcmcsamp, n=100, .parallel=TRUE)
)
```

## Caching

Results can be made fault tolerant or interruptible by including caching.

Caching in `harvestr` indexes on

* data
* function
* seed

using the `digest` function.

-----
```{r rm_cache, include=F}
unlink("harvestr-cache", T, T)
```

```{r caching, dependson='rm_cache'}
seeds <- gather(100)
system.time(farm(seeds, tail(rnorm(5e5)), cache=T))
system.time(farm(seeds, tail(rnorm(5e5)), cache=T))
head(dir('harvestr-cache'))
```

## So is it really reproducible?

```{r ex5, warning=F}
set.seed(123)
seeds <- gather(4)
a <- farm(seeds, runif(5))
b <- farm(seeds, runif(5))
c <- farm(seeds, runif(5), .parallel=T)
identical(a, b)
identical(a, c)
```


# Miscellaneous Extras

## Building blocks
Some building blocks that might *might* be helpful.

* `plant`- for setting up copies of an object with given seeds.
* `sprout` - for obtaining the sub-streams used with graft.
* `reap` - single object version of `harvest`


## In case you are wondering

* Yes it works with `Rcpp` code,
    + provided the compiled code uses the RNGScope for RNG in C++.
* **But** take care to not carry C++ reference objects across parallel calls.


------

\titlepage

```{r stop_cl, include=F}
stopCluster(cl)
```

## utah.beamer.tex
\documentclass[$if(fontsize)$$fontsize$,$endif$$if(handout)$handout,$endif$$if(beamer)$ignorenonframetext,$endif$]{$documentclass$}
$if(theme)$
\usetheme{$theme$}
$endif$
$if(colortheme)$
\usecolortheme{$colortheme$}
$endif$
\usepackage{amssymb,amsmath}
% \usepackage{ifxetex,ifluatex}
% \ifxetex
  % \usepackage{fontspec,xltxtra,xunicode}
  % \defaultfontfeatures{Mapping=tex-text,Scale=MatchLowercase}
% \else
  % \ifluatex
    % \usepackage{fontspec}
    % \defaultfontfeatures{Mapping=tex-text,Scale=MatchLowercase}
  % \else
    % \usepackage[utf8]{inputenc}
  % \fi
% \fi
$if(natbib)$
\usepackage{natbib}
\bibliographystyle{plainnat}
$endif$
$if(biblatex)$
\usepackage{biblatex}
$if(biblio-files)$
\bibliography{$biblio-files$}
$endif$
$endif$
$if(listings)$
\usepackage{listings}
$endif$
$if(lhs)$
\lstnewenvironment{code}{\lstset{language=Haskell,basicstyle=\small\ttfamily}}{}
$endif$
$if(highlighting-macros)$
$highlighting-macros$
$endif$
$if(verbatim-in-note)$
\usepackage{fancyvrb}
$endif$
$if(fancy-enums)$
% Redefine labelwidth for lists; otherwise, the enumerate package will cause
% markers to extend beyond the left margin.
% \makeatletter\AtBeginDocument{%
  % \renewcommand{\@listi}
    % {\setlength{\labelwidth}{4em}}
% }\makeatother
\usepackage{enumerate}
$endif$
$if(tables)$
\usepackage{ctable}
\usepackage{float} % provides the H option for float placement
$endif$
$if(url)$
\usepackage{url}
$endif$
$if(graphics)$
\usepackage{graphicx}
$endif$
% Comment these out if you don't want a slide with just the
% part/section/subsection/subsubsection title:
\AtBeginPart{\frame{\partpage}}
\AtBeginSection{\frame{\sectionpage}}
\AtBeginSubsection{\frame{\subsectionpage}}
\AtBeginSubsubsection{\frame{\subsubsectionpage}}
$if(strikeout)$
\usepackage[normalem]{ulem}
% avoid problems with \sout in headers with hyperref:
\pdfstringdefDisableCommands{\renewcommand{\sout}{}}
$endif$
$if(subscript)$
\newcommand{\textsubscr}[1]{\ensuremath{_{\scriptsize\textrm{#1}}}}
$endif$
% \setlength{\parindent}{0pt}
% \setlength{\parskip}{6pt plus 2pt minus 1pt}
% \setlength{\emergencystretch}{3em}  % prevent overfull lines
$if(numbersections)$
$else$
\setcounter{secnumdepth}{0}
$endif$
$if(verbatim-in-note)$
\VerbatimFootnotes % allows verbatim text in footnotes
$endif$
$if(lang)$
\usepackage[$lang$]{babel}
$endif$
$for(header-includes)$
$header-includes$
$endfor$

$if(title)$
$if(shorttitle)$
\title[$shorttitle$]{$title$}
$else$
\title{$title$}
$endif$
$endif$
$if(author)$
\author{$for(author)$$author$$sep$ \and $endfor$}
$endif$
$if(date)$
\date{$date$}
$endif$
\institute[Utah]{
University of Utah\\
School of Medicine
}
\setbeamertemplate{navigation symbols}{}
\begin{document}
$if(title)$
\frame{\titlepage}
$endif$

$for(include-before)$
$include-before$

$endfor$
$if(toc)$
\begin{frame}
\tableofcontents[hideallsubsections]
\end{frame}

$endif$
$body$

$if(natbib)$
$if(biblio-files)$
$if(biblio-title)$
$if(book-class)$
\renewcommand\bibname{$biblio-title$}
$else$
\renewcommand\refname{$biblio-title$}
$endif$
$endif$
\bibliography{$biblio-files$}

$endif$
$endif$
$if(biblatex)$
\printbibliography$if(biblio-title)$[title=$biblio-title$]$endif$

$endif$
$for(include-after)$
$include-after$

$endfor$
\end{document}
	% Reproducible Parallel Simulations with Harvestr
	% \href{mailto:Andrew.Redd@hsc.utah.edu}{Andrew Redd}
	% UseR! 2012

	`ro dev='pdf', cache=T, warning=T, error=T, out.width="\\textwidth"
	, fig.width=4, fig.height=3 or`
	```{r ex1_0, include=F}
	library(harvestr)
	library(plyr)
	library(ggplot2)
	library(lme4)
	library(dostats)
	library(doParallel)
	options(harvestr.time=F)
	```
	```{r ex3_parallel_setup, include=F, dependson='ex3', cache=F}
	library(parallel)
	library(doParallel)
	cl <- makeCluster(2)
	clusterEvalQ(cl, {
	library(lme4)
	library(harvestr)
	})
	clusterExport(cl, ls())
	registerDoParallel(cl)
	```

	\includegraphics[width=\textwidth]{./Bill Gates lazy person.jpg)



	# Reproducible Simulation

	## Reproducibility

	* Same script
	* Same results
	* Anywhere
	+ Single thread
	+ Multi-core
	+ Cloud Scale

	## Everything starts with a seed.
	Simulation is based off Pseudo-random number generation (PRNG).

	* PRNG is sequential, next number depends on the last state.
	* Seeds are used to store the state of a random number generator
	* by 'Setting a seed' one can place a PRNG into any exact state.

	## Parallel Random Number Generation
	Simulation is complicated in new parallel environments.

	* PRNG is sequential,
	* parallel execution is not,
	* and order of execution is not guaranteed.

	This is where parallel pseudo-random number generators help out.

	## Parallel PRNG
	Parallel pseudo-random number generators start with a singe state that
	can spawn additional streams as well as streams of random numbers.

	1. SPRNG
	2. L'Ecuyer combined multiple-recursive generator

	# Introducing `harvestr`

	## R package `harvestr`
	<https://github.com/halpo/harvestr>

	What `harvestr` does:

	* Reproducibility
	* Caching
	* Under parallelized environments.



	## How `harvestr` works

	* Analytical elements are separated into work-flows of dependent elements.
	+ Set up environment/seed
	+ Generate Data
	+ Perform analysis
	- Stochastic
	- Non-Stochastic
	+ Summarize
	* Results from one step carry to another by carrying the seed with the results.


	## Primary work-flow for `harvestr`

	* `gather(n)` - generate `n` random number streams.
	* `farm(seeds, expr)` - evaluate `expr` with each seed in `seeds`.
	* `harvest(x, fun)` - for each data in `x` call the function `fun`
	(based off `plyr`s `llply`).

	\includegraphics[width=\textwidth]{./c0bfeebb.pdf}

	## Example - Simple simulation

	Generate Data

	```{r ex1_1}
	seeds <- gather(10, seed=20120614)
	data <- farm(seeds, {
	x1 <- rnorm(400)
	x2 <- rnorm(400)
	g <- rep(rnorm(4), each=100)
	trt <- rep(1:4, each=100)
	y <- rnorm(n=400, mean=3*x1 + x2 + g)
	data.frame(y, x1, x2, trt)
	})
	```

	-------

	Perform Analysis

	```{r ex1_2, dependson='ex1_1'}
	analyses <- harvest(data, lmer, formula = y~x1+(1\|trt))
	analyses[[1]]
	```

	-------

	Recombine

	```{r ex1_3, dependson='ex1_2'}
	results <- ldply(analyses, fixef)
	qplot(data=results[,1:2], x=x1, y=`(Intercept)`)
	```

	## Stochastic Analysis in `harvestr`

	* `gather` then `farm` as before.
	* `graft` to generate seeds

	\includegraphics[width=\textwidth]{./b837b118.pdf}

	## Example 2 - Stochastic Analysis
	graft to obtain independent RNG sub-streams

	```{r ex2_1, dependson='ex1_2'}
	branch <- graft(analyses[[1]], 5)
	chains <- harvest(branch, mcmcsamp, 100)
	mcmcfx <- harvest(chains, slot, 'fixef')
	names(mcmcfx) <- 1:5
	df2 <- ldply(llply(mcmcfx, t), as.data.frame)
	```

	------

	```{r ex2_2, dependson='ex2_1'}
	qplot(data=df2, x1, col=.id, geom=c('density'))
	```

	## Example 3 Chained.

	```{r ex3, tidy=F, dependson='ex1_2'}
	branches <- harvest(analyses, graft, 5)
	chains <- harvest(branches
	, harvest, mcmcsamp, n=100)
	```

	I'm really impatient and would like to do this in parallel.

	## parallel

	Just like `plyr` argument `.parallel`.

	* uses [`plyr`](http://cran.r-project.org/package=plyr) and
	[`foreach`](http://cran.r-project.org/package=foreach) parallel structures.

	```{r ex3_parallel, warning=F, dependson='ex3.parallel.setup', tidy=F}
	#! Parallel environment already setup.
	system.time(chains <-
	harvest(branches, harvest, mcmcsamp, n=100, .parallel=TRUE)
	)
	```

	## Caching

	Results can be made fault tolerant or interruptible by including caching.

	Caching in `harvestr` indexes on

	* data
	* function
	* seed

	using the `digest` function.

	-----
	```{r rm_cache, include=F}
	unlink("harvestr-cache", T, T)
	```

	```{r caching, dependson='rm_cache'}
	seeds <- gather(100)
	system.time(farm(seeds, tail(rnorm(5e5)), cache=T))
	system.time(farm(seeds, tail(rnorm(5e5)), cache=T))
	head(dir('harvestr-cache'))
	```

	## So is it really reproducible?

	```{r ex5, warning=F}
	set.seed(123)
	seeds <- gather(4)
	a <- farm(seeds, runif(5))
	b <- farm(seeds, runif(5))
	c <- farm(seeds, runif(5), .parallel=T)
	identical(a, b)
	identical(a, c)
	```


	# Miscellaneous Extras

	## Building blocks
	Some building blocks that might might be helpful.

	* `plant`- for setting up copies of an object with given seeds.
	* `sprout` - for obtaining the sub-streams used with graft.
	* `reap` - single object version of `harvest`


	## In case you are wondering

	* Yes it works with `Rcpp` code,
	+ provided the compiled code uses the RNGScope for RNG in C++.
	* But take care to not carry C++ reference objects across parallel calls.


	------

	\titlepage

	```{r stop_cl, include=F}
	stopCluster(cl)
	```
	\documentclass[$if(fontsize)$$fontsize$,$endif$$if(handout)$handout,$endif$$if(beamer)$ignorenonframetext,$endif$]{$documentclass$}
	$if(theme)$
	\usetheme{$theme$}
	$endif$
	$if(colortheme)$
	\usecolortheme{$colortheme$}
	$endif$
	\usepackage{amssymb,amsmath}
	% \usepackage{ifxetex,ifluatex}
	% \ifxetex
	% \usepackage{fontspec,xltxtra,xunicode}
	% \defaultfontfeatures{Mapping=tex-text,Scale=MatchLowercase}
	% \else
	% \ifluatex
	% \usepackage{fontspec}
	% \defaultfontfeatures{Mapping=tex-text,Scale=MatchLowercase}
	% \else
	% \usepackage[utf8]{inputenc}
	% \fi
	% \fi
	$if(natbib)$
	\usepackage{natbib}
	\bibliographystyle{plainnat}
	$endif$
	$if(biblatex)$
	\usepackage{biblatex}
	$if(biblio-files)$
	\bibliography{$biblio-files$}
	$endif$
	$endif$
	$if(listings)$
	\usepackage{listings}
	$endif$
	$if(lhs)$
	\lstnewenvironment{code}{\lstset{language=Haskell,basicstyle=\small\ttfamily}}{}
	$endif$
	$if(highlighting-macros)$
	$highlighting-macros$
	$endif$
	$if(verbatim-in-note)$
	\usepackage{fancyvrb}
	$endif$
	$if(fancy-enums)$
	% Redefine labelwidth for lists; otherwise, the enumerate package will cause
	% markers to extend beyond the left margin.
	% \makeatletter\AtBeginDocument{%
	% \renewcommand{\@listi}
	% {\setlength{\labelwidth}{4em}}
	% }\makeatother
	\usepackage{enumerate}
	$endif$
	$if(tables)$
	\usepackage{ctable}
	\usepackage{float} % provides the H option for float placement
	$endif$
	$if(url)$
	\usepackage{url}
	$endif$
	$if(graphics)$
	\usepackage{graphicx}
	$endif$
	% Comment these out if you don't want a slide with just the
	% part/section/subsection/subsubsection title:
	\AtBeginPart{\frame{\partpage}}
	\AtBeginSection{\frame{\sectionpage}}
	\AtBeginSubsection{\frame{\subsectionpage}}
	\AtBeginSubsubsection{\frame{\subsubsectionpage}}
	$if(strikeout)$
	\usepackage[normalem]{ulem}
	% avoid problems with \sout in headers with hyperref:
	\pdfstringdefDisableCommands{\renewcommand{\sout}{}}
	$endif$
	$if(subscript)$
	\newcommand{\textsubscr}[1]{\ensuremath{_{\scriptsize\textrm{#1}}}}
	$endif$
	% \setlength{\parindent}{0pt}
	% \setlength{\parskip}{6pt plus 2pt minus 1pt}
	% \setlength{\emergencystretch}{3em} % prevent overfull lines
	$if(numbersections)$
	$else$
	\setcounter{secnumdepth}{0}
	$endif$
	$if(verbatim-in-note)$
	\VerbatimFootnotes % allows verbatim text in footnotes
	$endif$
	$if(lang)$
	\usepackage[$lang$]{babel}
	$endif$
	$for(header-includes)$
	$header-includes$
	$endfor$

	$if(title)$
	$if(shorttitle)$
	\title[$shorttitle$]{$title$}
	$else$
	\title{$title$}
	$endif$
	$endif$
	$if(author)$
	\author{$for(author)$$author$$sep$ \and $endfor$}
	$endif$
	$if(date)$
	\date{$date$}
	$endif$
	\institute[Utah]{
	University of Utah\\
	School of Medicine
	}
	\setbeamertemplate{navigation symbols}{}
	\begin{document}
	$if(title)$
	\frame{\titlepage}
	$endif$

	$for(include-before)$
	$include-before$

	$endfor$
	$if(toc)$
	\begin{frame}
	\tableofcontents[hideallsubsections]
	\end{frame}

	$endif$
	$body$

	$if(natbib)$
	$if(biblio-files)$
	$if(biblio-title)$
	$if(book-class)$
	\renewcommand\bibname{$biblio-title$}
	$else$
	\renewcommand\refname{$biblio-title$}
	$endif$
	$endif$
	\bibliography{$biblio-files$}

	$endif$
	$endif$
	$if(biblatex)$
	\printbibliography$if(biblio-title)$[title=$biblio-title$]$endif$

	$endif$
	$for(include-after)$
	$include-after$

	$endfor$
	\end{document}