cosmincatalin/install-rstudio-server.sh

## readme.md

      
            
          
    AWS EMR bootstrap to install RStudio Server along with sparklyr

How to use the bootstrap

Update 2019-10-08:
Unfortunately, this script can no longer run successfully as a bootstrap action. On the bright side, you can run it as a step, so if you execute it before all other steps, you can still treat it as a "bootstrap". The instructions below have been updated to reflect this.


You will first have to download the gist to a file and then upload it to S3 in a bucket of your choice.
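As a rough sketch of the upload, the snippet below composes the S3 location and prints the matching `aws s3 cp` command instead of running it. The bucket `my-bucket` and prefix `emr/bootstrap` are placeholders, and `s3_script_uri` is a hypothetical helper, not part of the gist.

```shell
#!/bin/bash
# Compose the S3 URI under which the install script will be stored.
# Bucket, prefix and the helper name are illustrative assumptions.
s3_script_uri() {
    local bucket="$1"
    local prefix="$2"
    local file="${3:-install-rstudio-server.sh}"
    echo "s3://${bucket}/${prefix}/${file}"
}

uri="$(s3_script_uri my-bucket emr/bootstrap)"
# Print the upload command so it can be reviewed before it is run by hand.
echo "aws s3 cp install-rstudio-server.sh ${uri}"
```

The printed URI is the value you later pass as the first step argument.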


Using the AWS EMR Console create a cluster and choose advanced options.


In Step 1, make sure you check the Spark x.x.x checkbox if you want to use the sparklyr library in RStudio. You can customize the Spark version by choosing a different EMR release version.


Add a step by selecting Custom JAR and clicking Configure.

For the Name, you can enter something like Install RStudio Server.
For the JAR location, enter something like s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar. If you are not running in us-east-1, change the region accordingly.
As Arguments add the following:


Something like s3://my-bucket/emr/bootstrap/install-rstudio-server.sh. This argument is mandatory: it is the location of the script on S3. The EMR cluster must have permission to read from that location.


--sd-version - optional, defaults to 1.1.463. The script downloads the artefact from the daily builds bucket. You can use a CLI command like aws s3 ls s3://rstudio-dailybuilds/rstudio- to check which versions are available.


--sd-user - optional, defaults to drwho. RStudio Server needs a real system user. The script creates one as part of the bootstrap process.


--sd-user-password - optional, defaults to tardis. The password for the user specified above. If you are going to use the default credentials, make sure the EMR cluster is not accessible from the Internet, as this could be a serious security vulnerability.


--spark-version - optional, defaults to 2.4.3. sparklyr, which is installed as part of the bootstrap process, needs a locally downloaded version of Spark. You should make sure that this version matches the Spark version installed on the cluster. This is only relevant if you are actually going to use the sparklyr capabilities.


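| EMR release | `--spark-version` |
| --- | --- |
| 4.0.0 | 1.4.1 |
| 4.1.0 | 1.5.0 |
| 4.2.0 | 1.5.2 |
| 4.3.0 | 1.6.0 |
| ..... | ..... |
| 4.5.0 | 1.6.1 |
| ..... | ..... |
| 4.7.2 | 1.6.2 |
| ..... | ..... |
| 5.0.0 | 2.0.0 |
| 5.0.3 | 2.0.1 |
| ..... | ..... |
| 5.2.0 | 2.0.2 |
| ..... | ..... |
| 5.3.0 | 2.1.0 |
| ..... | ..... |
| 5.6.0 | 2.1.1 |
| ..... | ..... |
| 5.8.0 | 2.2.0 |
| ..... | ..... |
| 6.0.0 | 2.4.3 (default) |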
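To tie the arguments and the table together, here is a minimal sketch that looks up the Spark version for an EMR release and assembles the step definition in the shorthand form accepted by the AWS CLI `--steps` option. The function names, the step name `InstallRStudioServer` and the bucket path are assumptions made for illustration. The lookup covers only the releases listed in the table, so verify the result against the EMR release notes for your cluster.

```shell
#!/bin/bash
# Map an EMR release to the Spark version listed in the table.
# Releases that are not listed explicitly return a non-zero status.
spark_version_for_emr() {
    case "$1" in
        4.0.0) echo "1.4.1" ;;
        4.1.0) echo "1.5.0" ;;
        4.2.0) echo "1.5.2" ;;
        4.3.0) echo "1.6.0" ;;
        4.5.0) echo "1.6.1" ;;
        4.7.2) echo "1.6.2" ;;
        5.0.0) echo "2.0.0" ;;
        5.0.3) echo "2.0.1" ;;
        5.2.0) echo "2.0.2" ;;
        5.3.0) echo "2.1.0" ;;
        5.6.0) echo "2.1.1" ;;
        5.8.0) echo "2.2.0" ;;
        6.0.0) echo "2.4.3" ;;
        *) echo "No table entry for EMR release $1" >&2; return 1 ;;
    esac
}

# Build the shorthand step definition: script-runner.jar receives the script
# location first, followed by the script's own options.
build_step() {
    local region="$1"
    local script_uri="$2"
    shift 2
    local args="${script_uri}"
    local arg
    for arg in "$@"; do
        args="${args},${arg}"
    done
    echo "Type=CUSTOM_JAR,Name=InstallRStudioServer,ActionOnFailure=CONTINUE,Jar=s3://${region}.elasticmapreduce/libs/script-runner/script-runner.jar,Args=[${args}]"
}

spark="$(spark_version_for_emr 5.8.0)" || exit 1
build_step us-east-1 s3://my-bucket/emr/bootstrap/install-rstudio-server.sh \
    --spark-version "${spark}"
```

The printed string can be passed to `aws emr create-cluster` or `aws emr add-steps` through `--steps`. Values that contain commas or spaces would need extra quoting in this shorthand form, which the sketch does not handle.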


After the cluster has started, open your cluster's master address on port 8787. RStudio Server is only available on the master instance. Depending on where your cluster is launched, you might need to establish a tunnel/proxy connection.
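If the master node is not directly reachable, one common option is SSH local port forwarding. The sketch below only prints the command so you can review it first. The key path and host name are placeholders, `tunnel_command` is a hypothetical helper, and `hadoop` is assumed to be the login user on the master node.

```shell
#!/bin/bash
# Print an SSH command that forwards a local port to RStudio Server (port 8787)
# on the master node. Key file and host name are placeholders.
tunnel_command() {
    local key_file="$1"
    local master_dns="$2"
    local local_port="${3:-8787}"
    echo "ssh -i ${key_file} -N -L ${local_port}:localhost:8787 hadoop@${master_dns}"
}

tunnel_command /path/to/my-key.pem master.example.com
```

With the tunnel open, RStudio Server should be reachable in the browser at http://localhost:8787, or at the alternative local port if you passed one.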


After logging in using the default/custom credentials provided, you can connect to the Spark cluster with the following script:


library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "yarn-client")

Other interesting material

Take a look at my other related gists:

AWS EMR bootstrap to install R packages from CRAN
Shiny community server with OAuth on Amazon EC2


## install-rstudio-server.sh
#!/bin/bash

# These variables can be overwritten using the arguments below
VERSION="1.1.463"
# drwho is listed as user in YARN's Resource Manager UI.
USER="drwho"
# Depending on where the EMR cluster lives, you might have to change this to avoid security issues.
# To change the default password (and user), use the arguments below.
# If the cluster is not visible on the Internet, you can just leave the defaults for convenience.
PASS="tardis"
# A Spark version to install. sparklyr needs to have a "local" installed version of Spark to function.
# It should match the EMR cluster Spark version. Automatic detection at bootstrap time is
# unfortunately very difficult.
SPARK="2.4.3"

# To connect to Spark via YARN, after logging in the RStudio Server Web UI execute the following code:
#
# library(sparklyr)
# library(dplyr)
# sc <- spark_connect(master = "yarn-client")
#

grep -Fq "\"isMaster\": true" /mnt/var/lib/info/instance.json
if [ $? -eq 0 ];
then
    # Use a numeric comparison; inside [[ ]] the > operator compares strings
    while [[ $# -gt 1 ]]; do
        key="$1"

        case $key in
            # RStudio Server version to install. Executing `aws s3 ls s3://rstudio-dailybuilds/rstudio-server-rhel-` will give you valid versions
            --sd-version)
                VERSION="$2"
                shift
                ;;
            # A user to create. It is going to be this user under which all RStudio Server actions will be executed
            --sd-user)
                USER="$2"
                shift
                ;;
            # The password for the above specified user
            --sd-user-password)
                PASS="$2"
                shift
                ;;
            # The version of Spark to install locally
            --spark-version)
                SPARK="$2"
                shift
                ;;
            *)
                echo "Unknown option: ${key}"
                exit 1
                ;;
        esac
        shift
    done
    # Set a UTF-8 locale for the rest of the installation, whether or not arguments were passed
    export LC_ALL="en_US.utf8"
    export LANG="en_US.utf8"
    export LANGUAGE="en_US.utf8"
    export LC_CTYPE="en_US.utf8"
    echo "*****************************************"
    echo "  1. Download RStudio Server ${VERSION}   "
    echo "*****************************************"
    wget https://s3.amazonaws.com/rstudio-dailybuilds/rstudio-server-rhel-${VERSION}-x86_64.rpm
    echo "         2. Install dependencies         "
    echo "*****************************************"
    # This is needed for installing devtools
    sudo yum -y install libcurl libcurl-devel 1>&2
    echo "        3. Install RStudio Server        "
    echo "*****************************************"
    sudo yum -y install --nogpgcheck rstudio-server-rhel-${VERSION}-x86_64.rpm 1>&2
    echo "      4. Create R Studio Server user     "
    echo "*****************************************"
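    # Quote the values so that passwords or user names with special characters survive word splitting
    epass=$(perl -e 'print crypt($ARGV[0], "password")' "${PASS}")
    sudo useradd -m -p "${epass}" "${USER}"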
    # This is to allow access to HDFS
    sudo usermod -a -G hadoop ${USER}
    echo "  5. Create environment variables file   "
    echo "*****************************************"
    # This file contains env variables that are loaded into RStudio. Using RStudio with Spark
    # is the main use case for installing it in EMR in the first place, so it only makes sense that
    # SPARK_HOME is added to the environment. The location is based on version ^5.0.0 of EMR.
    sudo runuser -l ${USER} -c "touch /home/${USER}/.Renviron"
    sudo runuser -l ${USER} -c "echo 'SPARK_HOME=/usr/lib/spark' >> /home/${USER}/.Renviron"
    echo "     6. Install devtools and sparklyr    "
    echo "*****************************************"
    # Create global install script and execute it
    touch /home/hadoop/install-global.R
    echo 'install.packages("devtools", "/usr/share/R/library/", repos="http://cran.rstudio.com/")' >> /home/hadoop/install-global.R
    echo 'devtools::install_github("rstudio/sparklyr")' >> /home/hadoop/install-global.R
    sudo R CMD BATCH /home/hadoop/install-global.R
    # Create user install script and execute it
    sudo runuser -l ${USER} -c 'touch /home/'${USER}'/install-user.R'
    sudo runuser -l ${USER} -c "echo 'library(sparklyr)' >> /home/${USER}/install-user.R"
    sudo runuser -l ${USER} -c "echo 'spark_install(version = \"${SPARK}\")' >> /home/${USER}/install-user.R"
    sudo runuser -l ${USER} -c 'R CMD BATCH /home/'${USER}'/install-user.R'
    sudo rstudio-server start 1>&2
    echo "                  Done                   "
    echo "*****************************************"

else
    echo "RStudio Server is only installed on the master node. This is a slave."
    exit 0;
fi