Hadoop + Spark installation (OSX)
Source: http://datahugger.org/datascience/setting-up-hadoop-v2-with-spark-v1-on-osx-using-homebrew/
This post builds on the previous Hadoop (v1) setup guide, explaining how to set up a single-node Hadoop (v2) cluster with Spark (v1) on OSX (10.9.5).
Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thereby delivering a highly-available service on top of a cluster of computers, each of which may be prone to failure. The Apache Hadoop framework is composed of the following core modules:
HDFS (Distributed File System): a distributed file-system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.
YARN (Yet Another Resource Negotiator): a resource-management platform responsible for managing compute resources in clusters and using them for scheduling of users’ applications.
MapReduce: a programming model for large scale data processing. A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Although the Hadoop framework is implemented in Java, any programming language can be used with Hadoop Streaming to implement the “map” and “reduce” functions. Apache Pig and Spark expose higher level user interfaces like Pig Latin and a SQL variant respectively.
Apache Spark is a fast and general processing engine compatible with Hadoop. It can run in Hadoop clusters through YARN or in standalone mode, and it can process data in HDFS, HBase, Hive, and any Hadoop InputFormat. Engineered from the bottom up for performance, Spark can be 100x faster than Hadoop MapReduce for large-scale data processing by exploiting in-memory computing and other optimizations. Spark is also fast when data is stored on disk, and currently holds the world record for large-scale on-disk sorting. Spark has easy-to-use APIs (e.g. Scala or Python) for operating on large datasets in batch, interactive or streaming modes. Spark provides a unified engine, packaged with higher-level libraries that support SQL queries, streaming data, machine learning and graph processing. These standard libraries increase developer productivity and can be seamlessly combined to create complex workflows.
Instructions
STEP 1 – PREPARE ENVIRONMENT
First, uninstall any old versions of Hadoop
brew cleanup hadoop
Next, update the Homebrew formulae
brew update
brew upgrade
brew cleanup
Check versions in Homebrew formulae (as of 10/10/14)
brew info hadoop = 2.5.1
brew info apache-spark = 1.1.0
brew info scala = 2.11.2
brew info sbt = 0.13.6
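Hadoop (and Spark) need a working Java install, and JAVA_HOME is used in the configuration below. A quick sanity check looks something like:
# confirm a JDK is installed and locate its home directory
java -version
/usr/libexec/java_home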
STEP 2 – INSTALL ENVIRONMENT
Install Hadoop
brew install hadoop
Install Spark (and dependencies)
brew install apache-spark scala sbt
STEP 3 – CONFIGURE ENVIRONMENT
Optionally, set the environment variables in your shell profile; by default they are set in the Hadoop or YARN environment shell scripts. Edit your bash profile (‘nano ~/.bash_profile’), add the lines below, save, and then force the terminal to refresh (‘source ~/.bash_profile’).
# set environment variables
export JAVA_HOME=$(/usr/libexec/java_home)
export HADOOP_HOME=/usr/local/Cellar/hadoop/2.5.1
export HADOOP_CONF_DIR=$HADOOP_HOME/libexec/etc/hadoop
export SCALA_HOME=/usr/local/Cellar/apache-spark/1.1.0
# set path variables
export PATH=$PATH:$HADOOP_HOME/bin:$SCALA_HOME/bin
# set alias start & stop scripts
alias hstart="$HADOOP_HOME/sbin/start-dfs.sh;$HADOOP_HOME/sbin/start-yarn.sh"
alias hstop="$HADOOP_HOME/sbin/stop-dfs.sh;$HADOOP_HOME/sbin/stop-yarn.sh"
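After refreshing the profile, a quick way to confirm that the variables and path are picked up is something like:
# reload the profile and verify the environment
source ~/.bash_profile
echo $HADOOP_HOME
hadoop version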
Configure passphraseless SSH to localhost and check that Remote Login is enabled (System Preferences >> Sharing)
1. ssh-keygen -t rsa
Press Enter at each prompt to accept the defaults (no passphrase)
2. cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
3. chmod og-wx ~/.ssh/authorized_keys
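To confirm passphraseless SSH works, try logging in to localhost; it should connect without asking for a password (type ‘exit’ to close the test session):
# should connect without a password prompt
ssh localhost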
STEP 4 – CONFIGURE HADOOP (FOR PSEUDO-DISTRIBUTED MODE)
The following instructions are to configure Hadoop as a single-node in a pseudo-distributed mode with MapReduce job execution on YARN. Alternative configurations are: pseudo-distributed mode with local MapReduce job execution, or local / standalone mode, or fully-distributed mode.
Move to the Hadoop libexec directory and edit the configuration files (e.g. ‘nano {filename}’)
cd /usr/local/Cellar/hadoop/2.5.1/libexec/
Edit ‘etc/hadoop/hadoop-env.sh’:
# this fixes the "scdynamicstore" warning
export HADOOP_OPTS="$HADOOP_OPTS -Djava.security.krb5.realm= -Djava.security.krb5.kdc="
Edit ‘etc/hadoop/core-site.xml’:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
Edit ‘etc/hadoop/hdfs-site.xml’:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
Edit ‘etc/hadoop/mapred-site.xml’:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
Edit ‘etc/hadoop/yarn-site.xml’:
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
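To check that Hadoop is reading the edited configuration, you can query a key directly from the same libexec directory, for example:
# should print hdfs://localhost:9000
./bin/hdfs getconf -confKey fs.defaultFS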
STEP 5 – START USING HADOOP (EXECUTE MAPREDUCE JOB)
Move to the Hadoop root directory
cd /usr/local/Cellar/hadoop/2.5.1
Format the Hadoop HDFS filesystem
./bin/hdfs namenode -format
Start the NameNode daemon & DataNode daemon
./sbin/start-dfs.sh
Browse the web interface for the NameNode
http://localhost:50070/
Start ResourceManager daemon and NodeManager daemon:
./sbin/start-yarn.sh
Check the daemons are all running:
jps
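If both start scripts ran cleanly, jps should list something like the following daemons (process IDs will differ per run):
# HDFS daemons: NameNode, DataNode, SecondaryNameNode
# YARN daemons: ResourceManager, NodeManager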
Browse the web interface for the ResourceManager
http://localhost:8088/
Create the HDFS directories required to execute MapReduce jobs:
./bin/hdfs dfs -mkdir -p /user/{username}
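If you want some data in HDFS to experiment with, you can copy a local file into your HDFS home directory and list it (the local path below is just a placeholder):
# copy a local file into HDFS and confirm it is there
./bin/hdfs dfs -put /path/to/some-local-file.txt /user/{username}/
./bin/hdfs dfs -ls /user/{username}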
Run an example MapReduce job (calculate pi)
# calculate pi
./bin/hadoop jar libexec/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.1.jar pi 10 100
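The same examples jar contains other jobs; for instance, a word count over an HDFS directory (the input and output paths below are placeholders, and the output directory must not already exist):
# count words in files under an HDFS input directory
./bin/hadoop jar libexec/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.1.jar wordcount /user/{username}/input /user/{username}/output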
Try more examples / experiments, or stop the daemons
./sbin/stop-dfs.sh
./sbin/stop-yarn.sh
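Alternatively, if you added the aliases from Step 3, the daemons can be started and stopped with:
# shortcuts defined in ~/.bash_profile
hstart
hstop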
STEP 6 – START USING SPARK
Spark has already earned a huge fan base and community of users and contributors because it’s faster than MapReduce (in memory and on disk) and easier to program – hence I want to learn how to use it.
Move to the Spark directory
cd /usr/local/Cellar/apache-spark/1.1.0
Run an example Spark application (calculate pi)
./bin/run-example SparkPi
Let’s try working with the Spark (Scala) shell, which provides a simple way to learn the API and a powerful framework to analyse data interactively. A special interpreter-aware SparkContext is automatically created and assigned to the variable ‘sc’.
Start the Spark shell (local or yarn mode)
# use the spark shell (local with 1 thread)
./bin/spark-shell
# or ... (local with 4 threads)
./bin/spark-shell --master local[4]
# or ... (yarn)
./bin/spark-shell --master yarn
# or ... (use the help flag for more options)
./bin/spark-shell --help
Every Spark context launches a web interface for monitoring
http://localhost:4040/
Try some basic Scala programming in the shell (use the ‘exit’ command to end the session)
println("Hello, World!")
val a = 5
a + 3
sc.parallelize(1 to 1000).count()
exit
Let’s try to execute an example Spark application on the Hadoop cluster using YARN. There are two deploy modes that can be used to launch Spark applications on YARN. In yarn-cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. In yarn-client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN.
# pattern to launch an application in yarn-cluster mode
./bin/spark-submit --class <path.to.class> --master yarn-cluster [options] <app.jar> [options]
# run example application (calculate pi)
./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster libexec/lib/spark-examples-*.jar
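For comparison, the same example can be launched in yarn-client mode, keeping the driver in the local client process so its output appears directly in your terminal; something like:
# run example application (calculate pi) with the driver in the client process
./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-client libexec/lib/spark-examples-*.jar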
THE END!