A script to demonstrate using TensorFlow in Spark with Amazon EMR and sparklyr.
- Create an EMR cluster for sparklyr, connect to EMR and install required tools:
install.packages("tensorflow")
devtools::install_github("rstudio/tfdeploy")
- Connect to Spark using sparklyr, copy some data, and ship the mtcars TensorFlow model (a quick sanity check follows the snippet):
library(sparklyr)

sc <- spark_connect(
  master = "yarn-client",
  config = list(
    # virtualenv location the workers will use when running spark_apply()
    sparklyr.apply.env.WORKON_HOME = "/tmp/.virtualenvs",
    # ship the exported mtcars TensorFlow model to every worker
    sparklyr.shell.files = "tfestimators-mtcars.tar"
  )
)

mtcars_tbl <- sdf_copy_to(sc, mtcars)
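A quick sanity check, not part of the original steps, to confirm the data reached Spark before running anything against the model:

# optional: the copied mtcars table should hold 32 rows
sdf_nrow(mtcars_tbl)
mtcars_tbl %>% head()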
- Install TensorFlow on each worker node (one node in this example); alternatively, TensorFlow can be installed while the cluster is being created. An optional check follows the snippet below.
sdf_len(sc, 1, repartition = 1) %>% spark_apply(function(e) {
  tensorflow::install_tensorflow(extra_packages = c("protobuf==3.0.0b2"))
})
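An optional check, an assumption rather than part of the original walkthrough: confirm that a worker can load the freshly installed TensorFlow before moving on.

# assumed helper step: report the TensorFlow version visible to a worker
sdf_len(sc, 1, repartition = 1) %>% spark_apply(function(e) {
  data.frame(tf_version = as.character(tensorflow::tf_version()))
})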
- Perform a prediction in TensorFlow across the Spark cluster (a note on collecting the results locally follows the output):
mtcars_tbl %>% spark_apply(function(df) {
  # build one instance per row with the features the model expects
  instances <- unname(apply(df, 1, function(e)
    list(cyl = e[2], disp = e[3])
  ))
  # score the instances against the SavedModel shipped to the workers
  results <- tfdeploy::predict_savedmodel(
    instances,
    "tfestimators-mtcars.tar",
    signature_name = "predict"
  )
  unname(unlist(results))
})
# Source: table<sparklyr_tmp_7a8b27d1c8d5> [?? x 1]
# Database: spark_connection
     mpg
   <dbl>
 1  7.90
 2  7.90
 3  5.41
 4 12.1
 5 16.7
 6 10.7
 7 16.7
 8  7.05
 9  6.80
10  8.22
# ... with more rows
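To use these values outside Spark, the result of the spark_apply() call above can be collected into a local data frame. A minimal sketch, assuming the pipeline has been assigned to predictions_tbl (a hypothetical name):

library(dplyr)

predicted <- predictions_tbl %>% collect()  # local tibble of predicted mpg
spark_disconnect(sc)                        # close the connection when done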
Installing TensorFlow inside the spark_apply() session seems too heavy and slows down the prediction step; copying all environment dependencies to each worker ahead of time may be a better option, as in the sketch below.
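A hedged sketch of that idea, reusing the config keys already shown above: ship a pre-built Python environment to the workers at connection time so spark_apply() only has to point at it. The archive name tf-virtualenv.tar is hypothetical.

library(sparklyr)

sc <- spark_connect(
  master = "yarn-client",
  config = list(
    # reuse an environment prepared ahead of time instead of installing
    # TensorFlow inside spark_apply()
    sparklyr.apply.env.WORKON_HOME = "/tmp/.virtualenvs",
    sparklyr.shell.files = "tfestimators-mtcars.tar,tf-virtualenv.tar"
  )
)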