@javierluraschi
Last active March 7, 2020 20:36
Amazon EMR with tidymodels and tune

Create EMR cluster with support for R.

Reinstall Development Tools to fix the gower package installation:

sudo yum remove gcc72-c++.x86_64 libgcc72.x86_64
sudo yum groupinstall 'Development Tools'

Also create the file ~/.R/Makevars:

CC = /usr/bin/gcc64
CXX = /usr/bin/g++
SHLIB_OPENMP_CFLAGS = -fopenmp

Also install the R development headers:

sudo yum install R-devel

Then install and load the required packages:

install.packages("tidymodels")
install.packages("tune")
install.packages("mlbench")
install.packages("magrittr")
install.packages("dplyr")
install.packages("parsnip")
install.packages("kernlab")
library(dplyr)
library(magrittr)
library(parsnip)
library(recipes)
library(rsample)
library(yardstick)
library(tune)

Follow the tidymodels Grid Search Tutorial.
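The later tune_grid() calls rely on objects defined in that tutorial (svm_mod, iono_rs, roc_vals, ctrl). A minimal sketch of that setup, assuming the Ionosphere data from mlbench and an RBF SVM as in the tutorial (the exact columns dropped, seed, and resample counts here are assumptions):

```r
library(magrittr)
library(mlbench)
library(parsnip)
library(rsample)
library(yardstick)
library(tune)

# Ionosphere data; V2 is constant and V1 carries little information, so drop them
data(Ionosphere)
Ionosphere <- dplyr::select(Ionosphere, -V1, -V2)

# Radial basis function SVM with both parameters marked for tuning
svm_mod <- svm_rbf(cost = tune(), rbf_sigma = tune()) %>%
  set_mode("classification") %>%
  set_engine("kernlab")

# Bootstrap resamples, ROC AUC metric, and default grid control
set.seed(4943)
iono_rs <- bootstraps(Ionosphere, times = 30)
roc_vals <- metric_set(roc_auc)
ctrl <- control_grid(verbose = FALSE)
```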

Now let's try with sparklyr, in this case using a 3-node cluster:

install.packages("remotes")
remotes::install_github("sparklyr/sparklyr")
library(sparklyr)

# Connect to Spark using 3 nodes with 8 CPUs each
sc <- spark_connect(
  master = "yarn",
  spark_home = "/usr/lib/spark/",
  config = list(
    "spark.executor.instances" = 24
  )
)

# Validate spark_apply() is working properly; repartition across 3 nodes with 8 CPUs each
sdf_len(sc, 3 * 8, repartition = 3 * 8) %>% spark_apply(~ 42)

First, let's capture the execution time without using Spark:

system.time({
    tune_grid(
        Class ~ .,
        model = svm_mod,
        resamples = iono_rs,
        metrics = roc_vals,
        control = ctrl
    )
})
   user  system elapsed 
133.386   0.503 133.883 

You can then register Spark as a foreach backend; note that this is a new feature to be released in sparklyr 1.2:

# Register Spark as the foreach backend
registerDoSpark(sc)

# Check number of parallel workers
foreach::getDoParWorkers()
[1] 24

Then rerun the grid search, this time using Spark:

system.time({
    tune_grid(
        Class ~ .,
        model = svm_mod,
        resamples = iono_rs,
        metrics = roc_vals,
        control = ctrl
    )
})
   user  system elapsed 
  3.735   0.310  85.088 