@sinarueeger
Last active July 17, 2019 14:09
Notes + examples from useR!2019

Notes from useR!2019

ToC

Slides + Material

Keynote videos

Monday

Julia Stewart Lowndes: Open science (keynote)

Hadley Wickham: tidydata

Irene Steves: puzzles for teaching

teaching concepts around DS:

  1. how to name files
  2. use small test cases
  3. work with self-contained code
  4. use projects & version control

Material

library(tidiesofmarch)
start_puzzle()
#'
#+

Colin Rundel: ghclass

  • facilitate ... making teaching assessment easy
  • automate, because there are many classes and semesters
  • distribute assignments to a repo (mirroring), add a team or an individual to a repository
  • usethis::ui_*
  • for each class an organisation, invites all students, one repo with templates
  • wercker for automatic assessment
  • feedback for style
  • Slides: https://github.com/rundel/Presentations/blob/master/UseR2019/UseR2019.pdf
  • rundel/ghclass
  • similar to GitHub Classroom

Material

Mine Cetinkaya-Rundel: DS in a box

  • rstd.io/dsbox-slides
  • three things: content, pedagogy, infrastructure = DS in a box
  • back: rstudio-education/datascience-box, front: datasciencebox.org
  • clientele: experienced teachers, new teachers
  • five design principles
    1. cherish day one = use RStudio Cloud
    2. start with cake = show the end result, then start manipulating
    3. skip baby steps = do drill exercises at home
    4. hide the veggies = broccoli is the analogy for regular expressions wrapped in web scraping
    5. leverage the ecosystem = indulge in the R ecosystem (ghclass, blogdown, xaringan), learn useful stuff.

Material

Julie Josse: Missing data (Keynote)

  • different types of missing values: NA (forgotten to fill the form), imp (impossible to take measure), ...
  • systematic values
  • presenting alternatives to na.action = na.omit
  • study NAs with VIM, naniar, FactoMineR
    1. handling missing values
    2. doing supervised learning with missing values

1) handling missing values

  • modify estimation process to deal with missing values
  • Imputation to get a complete dataset (e.g. with the mean)
  • misaem package
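
As a minimal base R sketch of the two simplest strategies mentioned above, the snippet below contrasts dropping incomplete rows (what na.action = na.omit does) with mean imputation. The toy data frame and the helper name impute_mean are assumptions made up for illustration; the approach of the misaem package is not shown here.

```r
# Toy data with missing values (invented for illustration)
df <- data.frame(x = c(1, NA, 3, 4), y = c(2, 4, NA, 8))

# Option A: listwise deletion, the behaviour of na.action = na.omit
complete_cases <- na.omit(df)

# Option B: replace each NA with the column mean (hypothetical helper)
impute_mean <- function(v) {
  v[is.na(v)] <- mean(v, na.rm = TRUE)
  v
}
imputed <- as.data.frame(lapply(df, impute_mean))

nrow(complete_cases)  # rows left after deletion
nrow(imputed)         # all rows are kept
```

Deletion throws away every row with at least one NA, while imputation keeps the sample size at the cost of distorting the variance of the imputed columns.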

2) Supervised learning

  • Consistency of supervised learning with missing values
  • Theory: which procedures are consistent
  • Then test which algorithms work, e.g. with EM
  • publish imputation algorithm
  • use a lot of data and any constant (e.g. the mean or a value out of range)

Material


Thursday

Lightning talks: biostatistics & epidemiology

Summary in a tweet

Lightning talks: text mining

Summary in a tweet

Scott Chamberlain: crul, webmockr, vcr

crul, webmockr, vcr

  • ropensci has lots of packages that do http requests
  • crul: replaces httr and curl (friendlier)
  • mocking and caching
  • webmockr and vcr are ports of the Ruby gems webmock and vcr
  • webmockr: like unit testing with expectations: set what to match against, only allowing HTTP requests that match a certain pattern
  • vcr: speeds up your tests (caching)

Material

Jenny Bryan: usethis

  • convenience functions for workflows
  • 3 main functions, plus the use_*() family:
    1. create_package()
    2. create_project()
    3. create_from_github()
    4. use_* >> add or modify something to a project or package
  • devtools and usethis uncoupling
  • devtools (meta package), broken up in small packages, e.g. usethis
  • interactive use: attach usethis in .Rprofile, or attach it via devtools
  • programmatic use: call functions with the explicit namespace, e.g. usethis::use_git()

1) create package

  • use_git()
  • use_mit_license() (or another use_*_license())
  • devtools::check()
  • use_github()
  • devtools::install()
  • use_readme_rmd()

2) make PR

  • create_from_github()

3) look at PR

  • pr_fetch({ISSUENUMBER})
  • pr_push()

4) merge all stuff

  • pr_finish()

5) teaching

  • use_course("ZIP FILE") or github repo
  • use_zip, less pedantic
  • use_course("USER/REPO")

Material

Lionel Henry: {{

  • data masking: base::subset(), base::transform() and lm() formulas all use data masking; so does data.table
  • tidy eval is implemented in the rlang package
  • {{ arg }}: curly-curly
  • inspired by glue
  • {{ var }} is a shortcut for !!enquo(var)
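
To make the idea of data masking concrete, here is a small base R sketch. It first uses subset(), where column names are treated like variables, and then wraps that behaviour in a function by capturing the caller's expression with substitute() and evaluating it with the data as a mask. The helper filter_rows is hypothetical and only a rough base R analogue of what {{ }} automates in tidy eval; it is not how rlang is implemented.

```r
# Data masking: columns of `iris` are used as if they were variables
setosa <- subset(iris, Species == "setosa")

# Wrapping a data-masking function (hypothetical helper):
# capture the unevaluated argument, then evaluate it inside the data,
# falling back to the caller's environment for everything else
filter_rows <- function(data, condition) {
  expr <- substitute(condition)
  keep <- eval(expr, envir = data, enclos = parent.frame())
  data[keep & !is.na(keep), , drop = FALSE]
}

setosa_2 <- filter_rows(iris, Species == "setosa")
nrow(setosa_2)

# Variables from the calling environment are still found
min_len <- 7
long_sepals <- filter_rows(iris, Sepal.Length > min_len)
```

With dplyr verbs the same wrapping is written by embracing the argument, e.g. filter({{ condition }}), instead of calling substitute() and eval() by hand.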

Material

Davis Vaughan: rray

  • rray: a stricter array class
  • matrices are a special case of arrays

1. subsetting

  • base R drops dimensions of length 1 when subsetting
  • bag[,1, drop = FALSE]

2. broadcasting

  • increasing dimensionality
  • recycling dimensions
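
The two points can be sketched in base R without rray. The matrices below are the ones used in the script at the end of these notes; using sweep() to emulate broadcasting is an assumption for illustration, not what rray does internally.

```r
mat_1 <- matrix(c(15, 10, 8, 6, 12, 9), nrow = 2)  # 2 x 3
mat_2 <- matrix(c(5, 2, 3), nrow = 1)              # 1 x 3

# Base R refuses to add a 2 x 3 and a 1 x 3 matrix; keep the error message
base_result <- tryCatch(mat_1 + mat_2, error = function(e) conditionMessage(e))

# Manual broadcasting: recycle the single row of mat_2 along the rows of mat_1
broadcast_sum <- sweep(mat_1, MARGIN = 2, STATS = as.vector(mat_2), FUN = "+")

# Subsetting: a single column collapses to a vector unless drop = FALSE
dim(mat_1[, 1])                # NULL
dim(mat_1[, 1, drop = FALSE])  # 2 1
```

rray makes both behaviours the default: dimensions of length 1 are stretched to match, and subsetting never silently drops a dimension.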

Material

Zhian Kamvar: RECONEPI + field epidemiologists

  • Field epidemiologists are essential for getting context!
  • epidemiologists spend lots of time in data entry, data cleaning: frustrating, wasted time, repeated task.
  • collect data + prepare report + operational decisions
  • should spend the least amount of time during report preparation

Partners:

Done: take template reports + automate.

Material


Friday

Kieran Martin: R in pharma

Kevin Kuo

KASA.AI

Alexander Kowarik

Roxane Legaie

Gábor Csárdi: pak

install.packages("pak")

Two main function families: pak::pkg_*() and pak:::proj_*()

FAST

  • lazy installation: only update if needed
  • caching

SAFE

  • install all dependencies in the same library
  • report conflicts up front

CONVENIENT

  • CRAN, Bioconductor
  • GitHub

installing packages

pak::pkg_install()
pak::pkg_remove()
pak::pkg_status()

pak::pkg_install("r-lib/usethis")

project based workflows

To achieve this:

R/
DESCRIPTION
.Rprofile
r-packages/
.Rbuildignore  

do this:

pak:::proj_create()
pak:::proj_install()
pak:::proj_install_dev()

Material

Arun Srinivasan: new + old data.table stuff

  • Summary of developments in R's data.table package
  • 69 contributors to r-datatable.com
  • 15th most depended-upon package

short example

DT[i, j, by]

  • on which rows
  • what to do
  • grouped by what?
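
A minimal sketch of the three slots, assuming the data.table package is installed. The toy table and its column names (g, x) are invented for illustration.

```r
library(data.table)

# Toy table (invented for illustration)
DT <- data.table(g = c("a", "a", "b", "b"), x = 1:4)

# i:  on which rows   -> only rows with x > 1
# j:  what to do      -> sum x
# by: grouped by what -> per group g
res <- DT[x > 1, .(total = sum(x)), by = g]
res
```

Rows are filtered first, then j is evaluated once per group, so the result has one row per value of g that survives the filter.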

Goal

  • speeding up i
  • auto indexing

Material

Jim Hester: vroom

Using altrep functionality (R 3.5+) for data read in and read out.

10 sec benchmark

  • 0.1 sec: instant
  • 1 sec: some delay, but flow is kept
  • 10 sec: flow stops

vroom

  • memory mapped
  • multi-threaded
  • some C function
  • altrep : alternative representation (since R 3.5+) for on demand parsing

global string pool

  • adv: less memory
  • disadv: hash lookup
  • disadv: single threaded

vroom is fast; what is the price for this?

  • the cost shifts to operations after reading
  • print, head, tail, sample, filter, aggregate: still faster overall

example: all doubles

  • ex: 1e6 x 25 (468 MB)
  • data.table fastest

example: all characters

  • vroom altrep full fastest

features

  • select columns: vroom(PATH, col_select = list(medallion, ...))
  • remove columns: vroom(PATH, col_select = -hack_license)
  • on-the-fly renaming
  • multiple datasets into one: vroom(c(PATH-1, PATH-2), id = "path") (vroom altrep both faster)
  • vroom_fwf() for fixed-width files
  • vroom_write() (incl. gzip via the file extension)

Material

David Smith: Model deployment

  • Model deployment similar to Travis CI
  • DevOps: union of people, process and products, all together delivering value (bit.ly/WhatIs-DevOps)
  • Uses YAML file like Travis CI
  • try here: azure.com/pipelines (for free)

Material

Frie: sealr

  • authentication: verifying an identity/credentials
  • authorisation: verifying access rights / permissions
  • sealr inspired by passport.js

Material

Julien Cornebise

datakind.org

Material


TODO

## 1. tidy eval
## 2. usethis
## 3. pak
## 4. tidyr
## 5. vroom
## 6. data.table
## 7. rray
## ===================================================================
## tidy eval
## ===================================================================
## by Lionel Henry
## example with dplyr ------------------------------------------------
library(dplyr)
iris %>%
  group_by(Species) %>%
  summarise(avg = mean(Petal.Length, na.rm = TRUE))
## with tidyeval
group_mean <- function(data, by, var) {
  data %>%
    group_by({{ by }}) %>%
    summarise(avg = mean({{ var }}, na.rm = TRUE))
}
group_mean(data = iris, by = Species, var = Petal.Length)
## msleep ships with ggplot2, which is only attached further down
group_mean(data = ggplot2::msleep, by = vore, var = sleep_total)
## an example with ggplot2 -------------------------------------------
library(ggplot2)
theme_set(theme_bw())
ggplot(data = iris, aes(x = Sepal.Length, y = Petal.Length, group = Species, color = Species)) +
  geom_point() +
  geom_smooth(method = "lm")
## tidyeval
plot_point_smooth <- function(data, x, y, gr = NULL, method = "lm") {
  ggplot(data = data, aes({{ x }}, {{ y }}, group = {{ gr }}, color = {{ gr }})) +
    geom_point() +
    geom_smooth(method = method)
}
plot_point_smooth(iris, x = Sepal.Length, y = Petal.Length, gr = Species)
plot_point_smooth(msleep, x = sleep_total, y = sleep_rem, gr = NULL)
## ===================================================================
## usethis
## ===================================================================
## by Jenny Bryan
library(usethis)
## 1.
create_package("~/tmp/mypackage")
## 2.
use_git()
## 3.
use_mit_license()
use_rprofile()
## 4.
devtools::check()
## 5. commit
## 6.
use_github()
## updates DESCRIPTION file
## 7.
devtools::install()
## 8.
use_readme_rmd()
## knit + commit + push
## clean up
fs::dir_delete("~/tmp/mypackage")
## ===================================================================
## pak
## ===================================================================
## by Gábor Csárdi
#install.packages("pak")
## basic pkg installation --------------------------------------------
pak::pkg_install("usethis")
pak::pkg_remove("usethis")
pak::pkg_install("r-lib/usethis")
pak::pkg_status("usethis")
## go to a project (needs a .Rproj file) -----------------------------
usethis::create_project("~/tmp/test")
dir()
pak:::proj_create()
dir()
readLines("DESCRIPTION")
pak:::proj_install("usethis")
## this installs dependencies into a private project library
readLines("DESCRIPTION")
## cleaning up -------------------------------------------------------
fs::dir_delete("~/tmp/test")
## ===================================================================
## pivot_wider and pivot_longer
## ===================================================================
## by Hadley Wickham
# pak::pkg_install("tidyverse/tidyr")
library(tidyr)
# pak::pkg_install("chrk623/dataAnim")
# Master's Thesis project by Charco Hui
library(dataAnim)
datoy_wide
## this needs to be longer
datoy_long
## this needs to be wider
## lets make it longer
datoy_wide %>%
  pivot_longer(-Name, names_to = "Subject", values_to = "Grade")
## lets make it wider
datoy_long %>%
  dplyr::mutate(Time = seq_len(nrow(datoy_long))) %>%
  pivot_wider(names_from = "Subject", values_from = c("Score", "Time"))
## ===================================================================
## vroom
## ===================================================================
## by Jim Hester
## get a large data (> 10 MB) ----------------------------------------
## from here: https://portals.broadinstitute.org/collaboration/giant/index.php/GIANT_consortium_data_files
## Height
download.file("https://portals.broadinstitute.org/collaboration/giant/images/8/80/Height_AA_add_SV.txt.gz", "Height_AA_add_SV.txt.gz")
## BMI
download.file("https://portals.broadinstitute.org/collaboration/giant/images/3/33/BMI_African_American.fmt.gzip", "BMI_African_American.fmt.gzip")
path_to_file_1 <- "Height_AA_add_SV.txt.gz"
path_to_file_2 <- "BMI_African_American.fmt.gzip"
## file size
## pak::pkg_install("fs")
## library(fs)
fs::file_size(path_to_file_1)
fs::file_size(path_to_file_2)
## vroom -------------------------------------------------------------
## pak::pkg_install("vroom")
library(vroom)
library(dplyr)
system.time(
giant_vroom <- vroom::vroom(path_to_file_1)
)
giant_vroom
## 0.154 sec
system.time(
tmp <- giant_vroom %>% select(CHR, POS) %>% filter(CHR == 1)
)
## 0.001 sec
## with DT
system.time(
giant_DT <- data.table::fread(path_to_file_1)
)
## 0.107 sec
system.time(
giant_DT %>% select(CHR, POS) %>% filter(CHR == 1)
)
## 0.002 sec
## col_select --------------------------------------------------------
giant_vroom_select <- vroom::vroom(path_to_file_1,
  col_select = list(SNPNAME, ends_with("_MAF")))
head(giant_vroom_select)
giant_vroom_remove <- vroom::vroom(path_to_file_1,
  col_select = -ExAC_AFR_MAF)
head(giant_vroom_remove)
giant_vroom_rename <- vroom::vroom(path_to_file_1,
  col_select = list(p = Pvalue, everything()))
head(giant_vroom_rename)
## multiple dataset --------------------------------------------------
data_combined <- vroom::vroom(
c(path_to_file_1, path_to_file_2),
id = "path")
table(data_combined$path)
## compare to purrr + fread
## ===================================================================
## data.table
## ===================================================================
## by Arun Srinivasan
# pak::pkg_install("data.table")
library(data.table)
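## note: 2e8 rows need several GB of RAM; reduce p on smaller machines
p <- 2e8
dat <- data.table(x = sample(1e5, p, TRUE), y = runif(p))
## first call: data.table builds an index on x (auto indexing)
system.time(
  tmp <- dat[x %in% 2000:3000, ]
)
## second call: reuses the index, so it should be faster
system.time(
  tmp <- dat[x %in% 2000:3000, ]
)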
## ===================================================================
## rray
## ===================================================================
## by Davis Vaughan
## devtools::install_github("r-lib/rray")
## default -----------------------------------------------------------
mat_1 <- matrix(c(15, 10, 8, 6, 12, 9), byrow = FALSE, nrow = 2)
mat_2 <- matrix(c(5, 2, 3), nrow = 1)
## lesson 1
mat_1 + mat_2
## lesson 2
dim(mat_1[,2:3])
dim(mat_1[,1]) ## why not 2x1?
length(mat_1[,1]) ## turned into a vector!
dim(mat_1[,1, drop = FALSE]) ## but with drop = FALSE we can keep it a matrix
## rray ---------------------------------------------------------------
library(rray)
(mat_1_rray <- rray(c(15, 10, 8, 6, 12, 9), dim = c(2, 3)))
(mat_2_rray <- rray(c(5, 2, 3), dim = c(1, 3)))
## lesson 1
mat_1_rray + mat_2_rray
## lesson 2
dim(mat_1_rray[,2:3])
dim(mat_1_rray[,1])
## smart functions
mat_1_rray / rray_sum(mat_1_rray, axes = 1)
rray_bind(mat_1_rray, mat_2_rray, .axis = 1)
rray_bind(mat_1_rray, mat_2_rray, .axis = 2)
## ===================================================================
## rstatsmeme
## ===================================================================
## discovered via Frie
# pak::pkg_install("favstats/rstatsmemes")
library(rstatsmemes)
show_me_an_R_meme()