Skip to content

@halpo /000-instructions.md
Created

Embed URL

HTTPS clone URL

Subversion checkout URL

You can clone with
or
.
Download ZIP
harvestr R users conference presentation.

Building a beamer presentation with knitr.

Introduction

The documents included are the input for knitr. In addition you need to have the tool pandoc installed. I also use a custom beamer template to add the University of Utah \institute command to the template. It also changes the indentation some.

Steps

  1. knit document with
library(knitr)
knitr("harvestr.Rmd")

This step will require harvestr and it's dependencies to be installed in R.

  1. use pandoc to create beamer slides.
pandoc -t beamer --template=utah.beamer.tex -Vurl:1 -V theme:CambridgeUS -V colortheme:beaver  -V shorttitle:harvestr  harvestr.md -o harvestr.pdf

The final result

% Reproducible Parallel Simulations with Harvestr % \href{mailto:Andrew.Redd@hsc.utah.edu}{Andrew Redd} % UseR! 2012

ro dev='pdf', cache=T, warning=T, error=T, out.width="\\textwidth" , fig.width=4, fig.height=3 or

library(harvestr)
library(plyr)
library(ggplot2)
library(lme4)
library(dostats)
library(doParallel)
options(harvestr.time=F)
library(parallel)
library(doParallel)
cl <- makeCluster(2)
clusterEvalQ(cl, {
    library(lme4)
    library(harvestr)
})
clusterExport(cl, ls())
registerDoParallel(cl)

\includegraphics[width=\textwidth]{./Bill Gates lazy person.jpg)

Reproducible Simulation

Reproducibility

  • Same script
  • Same results
  • Anywhere
    • Single thread
    • Multi-core
    • Cloud Scale

Everything starts with a seed.

Simulation is based off Pseudo-random number generation (PRNG).

  • PRNG is sequential, next number depends on the last state.
  • Seeds are used to store the state of a random number generator
  • by 'Setting a seed' one can place a PRNG into any exact state.

Parallel Random Number Generation

Simulation is complicated in new parallel environments.

  • PRNG is sequential,
  • parallel execution is not,
  • and order of execution is not guaranteed.

This is where parallel pseudo-random number generators help out.

Parallel PRNG

Parallel pseudo-random number generators start with a singe state that can spawn additional streams as well as streams of random numbers.

  1. SPRNG
  2. L'Ecuyer combined multiple-recursive generator

Introducing harvestr

R package harvestr

https://github.com/halpo/harvestr

What harvestr does:

  • Reproducibility
  • Caching
  • Under parallelized environments.

How harvestr works

  • Analytical elements are separated into work-flows of dependent elements.
    • Set up environment/seed
    • Generate Data
    • Perform analysis
      • Stochastic
      • Non-Stochastic
    • Summarize
  • Results from one step carry to another by carrying the seed with the results.

Primary work-flow for harvestr

  • gather(n) - generate n random number streams.
  • farm(seeds, expr) - evaluate expr with each seed in seeds.
  • harvest(x, fun) - for each data in x call the function fun (based off plyrs llply).

\includegraphics[width=\textwidth]{./c0bfeebb.pdf}

Example - Simple simulation

Generate Data

seeds <- gather(10, seed=20120614)
data  <- farm(seeds, {
    x1  <- rnorm(400)
    x2  <- rnorm(400)
    g   <- rep(rnorm(4), each=100)
    trt <- rep(1:4, each=100)
    y <- rnorm(n=400, mean=3*x1 + x2 + g)
    data.frame(y, x1, x2, trt)
}) 

Perform Analysis

analyses <- harvest(data, lmer, formula = y~x1+(1|trt))
analyses[[1]]

Recombine

  results <- ldply(analyses, fixef)
  qplot(data=results[,1:2], x=x1, y=`(Intercept)`)

Stochastic Analysis in harvestr

  • gather then farm as before.
  • graft to generate seeds

\includegraphics[width=\textwidth]{./b837b118.pdf}

Example 2 - Stochastic Analysis

graft to obtain independent RNG sub-streams

branch <- graft(analyses[[1]], 5)
chains <- harvest(branch, mcmcsamp, 100)
mcmcfx <- harvest(chains, slot, 'fixef')
names(mcmcfx) <- 1:5
df2 <- ldply(llply(mcmcfx, t), as.data.frame)

qplot(data=df2, x1, col=.id, geom=c('density'))

Example 3 Chained.

branches <- harvest(analyses, graft, 5)
chains <- harvest(branches
            , harvest, mcmcsamp, n=100)

I'm really impatient and would like to do this in parallel.

parallel

Just like plyr argument .parallel.

#! Parallel environment already setup.
system.time(chains <- 
  harvest(branches, harvest, mcmcsamp, n=100, .parallel=TRUE)
)

Caching

Results can be made fault tolerant or interruptible by including caching.

Caching in harvestr indexes on

  • data
  • function
  • seed

using the digest function.


unlink("harvestr-cache", T, T)
seeds <- gather(100)
system.time(farm(seeds, tail(rnorm(5e5)), cache=T))
system.time(farm(seeds, tail(rnorm(5e5)), cache=T))
head(dir('harvestr-cache'))

So is it really reproducible?

set.seed(123)
seeds <- gather(4)
a <- farm(seeds, runif(5))
b <- farm(seeds, runif(5))
c <- farm(seeds, runif(5), .parallel=T)
identical(a, b)
identical(a, c)

Miscellaneous Extras

Building blocks

Some building blocks that might might be helpful.

  • plant- for setting up copies of an object with given seeds.
  • sprout - for obtaining the sub-streams used with graft.
  • reap - single object version of harvest

In case you are wondering

  • Yes it works with Rcpp code,
    • provided the compiled code uses the RNGScope for RNG in C++.
  • But take care to not carry C++ reference objects across parallel calls.

\titlepage

stopCluster(cl)
\documentclass[$if(fontsize)$$fontsize$,$endif$$if(handout)$handout,$endif$$if(beamer)$ignorenonframetext,$endif$]{$documentclass$}
$if(theme)$
\usetheme{$theme$}
$endif$
$if(colortheme)$
\usecolortheme{$colortheme$}
$endif$
\usepackage{amssymb,amsmath}
% \usepackage{ifxetex,ifluatex}
% \ifxetex
% \usepackage{fontspec,xltxtra,xunicode}
% \defaultfontfeatures{Mapping=tex-text,Scale=MatchLowercase}
% \else
% \ifluatex
% \usepackage{fontspec}
% \defaultfontfeatures{Mapping=tex-text,Scale=MatchLowercase}
% \else
% \usepackage[utf8]{inputenc}
% \fi
% \fi
$if(natbib)$
\usepackage{natbib}
\bibliographystyle{plainnat}
$endif$
$if(biblatex)$
\usepackage{biblatex}
$if(biblio-files)$
\bibliography{$biblio-files$}
$endif$
$endif$
$if(listings)$
\usepackage{listings}
$endif$
$if(lhs)$
\lstnewenvironment{code}{\lstset{language=Haskell,basicstyle=\small\ttfamily}}{}
$endif$
$if(highlighting-macros)$
$highlighting-macros$
$endif$
$if(verbatim-in-note)$
\usepackage{fancyvrb}
$endif$
$if(fancy-enums)$
% Redefine labelwidth for lists; otherwise, the enumerate package will cause
% markers to extend beyond the left margin.
% \makeatletter\AtBeginDocument{%
% \renewcommand{\@listi}
% {\setlength{\labelwidth}{4em}}
% }\makeatother
\usepackage{enumerate}
$endif$
$if(tables)$
\usepackage{ctable}
\usepackage{float} % provides the H option for float placement
$endif$
$if(url)$
\usepackage{url}
$endif$
$if(graphics)$
\usepackage{graphicx}
$endif$
% Comment these out if you don't want a slide with just the
% part/section/subsection/subsubsection title:
\AtBeginPart{\frame{\partpage}}
\AtBeginSection{\frame{\sectionpage}}
\AtBeginSubsection{\frame{\subsectionpage}}
\AtBeginSubsubsection{\frame{\subsubsectionpage}}
$if(strikeout)$
\usepackage[normalem]{ulem}
% avoid problems with \sout in headers with hyperref:
\pdfstringdefDisableCommands{\renewcommand{\sout}{}}
$endif$
$if(subscript)$
\newcommand{\textsubscr}[1]{\ensuremath{_{\scriptsize\textrm{#1}}}}
$endif$
% \setlength{\parindent}{0pt}
% \setlength{\parskip}{6pt plus 2pt minus 1pt}
% \setlength{\emergencystretch}{3em} % prevent overfull lines
$if(numbersections)$
$else$
\setcounter{secnumdepth}{0}
$endif$
$if(verbatim-in-note)$
\VerbatimFootnotes % allows verbatim text in footnotes
$endif$
$if(lang)$
\usepackage[$lang$]{babel}
$endif$
$for(header-includes)$
$header-includes$
$endfor$
$if(title)$
$if(shorttitle)$
\title[$shorttitle$]{$title$}
$else$
\title{$title$}
$endif$
$endif$
$if(author)$
\author{$for(author)$$author$$sep$ \and $endfor$}
$endif$
$if(date)$
\date{$date$}
$endif$
\institute[Utah]{
University of Utah\\
School of Medicine
}
\setbeamertemplate{navigation symbols}{}
\begin{document}
$if(title)$
\frame{\titlepage}
$endif$
$for(include-before)$
$include-before$
$endfor$
$if(toc)$
\begin{frame}
\tableofcontents[hideallsubsections]
\end{frame}
$endif$
$body$
$if(natbib)$
$if(biblio-files)$
$if(biblio-title)$
$if(book-class)$
\renewcommand\bibname{$biblio-title$}
$else$
\renewcommand\refname{$biblio-title$}
$endif$
$endif$
\bibliography{$biblio-files$}
$endif$
$endif$
$if(biblatex)$
\printbibliography$if(biblio-title)$[title=$biblio-title$]$endif$
$endif$
$for(include-after)$
$include-after$
$endfor$
\end{document}
@ptoche

This is a great template. Am having problems with:
Error in try(fun(x, ...), getOption("harvestr.try.silent", FALSE)) :
object 'mcmcsamp' not found
both harvestr and lme4 are installed...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Something went wrong with that request. Please try again.