@javierluraschi
Last active March 7, 2020 20:36
Amazon EMR with tidymodels and tune

Create EMR cluster with support for R.

Reinstall Development Tools to fix the gower package installation:

sudo yum remove gcc72-c++.x86_64 libgcc72.x86_64
sudo yum groupinstall 'Development Tools'

Also create the file ~/.R/Makevars:

CC = /usr/bin/gcc64
CXX = /usr/bin/g++
SHLIB_OPENMP_CFLAGS = -fopenmp

Also install the R development headers:

sudo yum install R-devel

Then install and load the required packages:

install.packages("tidymodels")
install.packages("tune")
install.packages("mlbench")
install.packages("magrittr")
install.packages("dplyr")
install.packages("parsnip")
install.packages("kernlab")
library(dplyr)
library(magrittr)
library(parsnip)
library(recipes)
library(rsample)
library(yardstick)
library(tune)

Follow the tidymodels Grid Search Tutorial.
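The later tune_grid() calls rely on objects defined in that tutorial (svm_mod, iono_rs, roc_vals, ctrl). A minimal sketch of that setup, assuming the Ionosphere data from mlbench and an RBF SVM as in the tutorial (the exact columns dropped, seed, and resample counts here are assumptions):

```r
library(magrittr)
library(mlbench)
library(parsnip)
library(rsample)
library(yardstick)
library(tune)

# Ionosphere data; V2 is constant and V1 carries little information, so drop them
data(Ionosphere)
Ionosphere <- dplyr::select(Ionosphere, -V1, -V2)

# Radial basis function SVM with both parameters marked for tuning
svm_mod <- svm_rbf(cost = tune(), rbf_sigma = tune()) %>%
  set_mode("classification") %>%
  set_engine("kernlab")

# Bootstrap resamples, ROC AUC metric, and default grid control
set.seed(4943)
iono_rs <- bootstraps(Ionosphere, times = 30)
roc_vals <- metric_set(roc_auc)
ctrl <- control_grid(verbose = FALSE)
```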

Now let's try with sparklyr, in this case using a 3-node cluster:

install.packages("remotes")
remotes::install_github("sparklyr/sparklyr")
library(sparklyr)

# Connect to Spark using 3 nodes with 8 CPUs each
sc <- spark_connect(
  master = "yarn",
  spark_home = "/usr/lib/spark/",
  config = list(
    "spark.executor.instances" = 24
  )
)

# Validate spark_apply() is working properly; repartition across 3 nodes with 8 CPUs each
sdf_len(sc, 3 * 8, repartition = 3 * 8) %>% spark_apply(~ 42)

First, let's capture the execution time without using Spark:

system.time({
    tune_grid(
        Class ~ .,
        model = svm_mod,
        resamples = iono_rs,
        metrics = roc_vals,
        control = ctrl
    )
})
   user  system elapsed 
133.386   0.503 133.883 

You can then register Spark as a foreach backend; note that this is a new feature to be released in sparklyr 1.2:

# Register Spark as the foreach backend
registerDoSpark(sc)

# Check number of parallel workers
foreach::getDoParWorkers()
[1] 24

Then rerun the grid search, this time using Spark:

system.time({
    tune_grid(
        Class ~ .,
        model = svm_mod,
        resamples = iono_rs,
        metrics = roc_vals,
        control = ctrl
    )
})
   user  system elapsed 
  3.735   0.310  85.088 