Hadoop + Spark installation (OSX)
Source: http://datahugger.org/datascience/setting-up-hadoop-v2-with-spark-v1-on-osx-using-homebrew/
This post builds on the previous Hadoop (v1) setup guide, explaining how to set up a single-node Hadoop (v2) cluster with Spark (v1) on OSX (10.9.5).
Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thereby delivering a highly-available service on top of a cluster of computers, each of which may be prone to failure. The Apache Hadoop framework is composed of the following core modules:
HDFS (Distributed File System): a distributed file-system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.
YARN (Yet Another Resource Negotiator): a resource-management platform responsible for managing compute resources in clusters and using them for scheduling of users’ applications.
MapReduce: a programming model for large scale data processing. A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Although the Hadoop framework is implemented in Java, any programming language can be used with Hadoop Streaming to implement the “map” and “reduce” functions. Apache Pig and Spark expose higher level user interfaces like Pig Latin and a SQL variant respectively.
Apache Spark is a fast and general processing engine compatible with Hadoop. It can run in Hadoop clusters through YARN or in standalone mode, and it can process data in HDFS, HBase, Hive, and any Hadoop InputFormat. Engineered from the bottom up for performance, Spark can be 100x faster than Hadoop MapReduce for large-scale data processing by exploiting in-memory computing and other optimizations. Spark is also fast when data is stored on disk, and currently holds the world record for large-scale on-disk sorting. Spark has easy-to-use APIs (e.g. Scala or Python) for operating on large datasets in batch, interactive or streaming modes. Spark provides a unified engine, packaged with higher-level libraries that support SQL queries, streaming data, machine learning and graph processing. These standard libraries increase developer productivity and can be seamlessly combined to create complex workflows.
Instructions
STEP 1 – PREPARE ENVIRONMENT
First, uninstall any old versions of Hadoop
brew cleanup hadoop
Next, update the Homebrew formulae
brew update
brew upgrade
brew cleanup
Check versions in Homebrew formulae (as of 10/10/14)
brew info hadoop = 2.5.1
brew info apache-spark = 1.1.0
brew info scala = 2.11.2
brew info sbt = 0.13.6
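Hadoop (and Spark) need a working Java install, and JAVA_HOME is used in the configuration below. A quick sanity check looks something like:
# confirm a JDK is installed and locate its home directory
java -version
/usr/libexec/java_home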
STEP 2 – INSTALL ENVIRONMENT
Install Hadoop
brew install hadoop
Install Spark (and dependencies)
brew install apache-spark scala sbt
STEP 3 – CONFIGURE ENVIRONMENT
Optionally, set the environment variables in your shell profile; by default they are set in the Hadoop or YARN environment shell scripts. Edit your bash profile (‘nano ~/.bash_profile’), add the lines below, save, and then force the terminal to refresh (‘source ~/.bash_profile’).
# set environment variables
export JAVA_HOME=$(/usr/libexec/java_home)
export HADOOP_HOME=/usr/local/Cellar/hadoop/2.5.1
export HADOOP_CONF_DIR=$HADOOP_HOME/libexec/etc/hadoop
export SCALA_HOME=/usr/local/Cellar/apache-spark/1.1.0
# set path variables
export PATH=$PATH:$HADOOP_HOME/bin:$SCALA_HOME/bin
# set alias start & stop scripts
alias hstart="$HADOOP_HOME/sbin/start-dfs.sh;$HADOOP_HOME/sbin/start-yarn.sh"
alias hstop="$HADOOP_HOME/sbin/stop-dfs.sh;$HADOOP_HOME/sbin/stop-yarn.sh"
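After refreshing the profile, a quick way to confirm that the variables and path are picked up is something like:
# reload the profile and verify the environment
source ~/.bash_profile
echo $HADOOP_HOME
hadoop version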
Configure passphraseless SSH to localhost and check that Remote Login is enabled (System Preferences >> Sharing)
1. ssh-keygen -t rsa
Press Enter at each prompt to accept the defaults (no passphrase)
2. cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
3. chmod og-wx ~/.ssh/authorized_keys
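To confirm passphraseless SSH works, try logging in to localhost; it should connect without asking for a password (type ‘exit’ to close the test session):
# should connect without a password prompt
ssh localhost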
STEP 4 – CONFIGURE HADOOP (FOR PSEUDO-DISTRIBUTED MODE)
The following instructions are to configure Hadoop as a single-node in a pseudo-distributed mode with MapReduce job execution on YARN. Alternative configurations are: pseudo-distributed mode with local MapReduce job execution, or local / standalone mode, or fully-distributed mode.
Move to the Hadoop libexec directory and edit the configuration files (e.g. ‘nano {filename}’)
cd /usr/local/Cellar/hadoop/2.5.1/libexec/
Edit ‘etc/hadoop/hadoop-env.sh’:
# this fixes the "scdynamicstore" warning
export HADOOP_OPTS="$HADOOP_OPTS -Djava.security.krb5.realm= -Djava.security.krb5.kdc="
Edit ‘etc/hadoop/core-site.xml’:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
Edit ‘etc/hadoop/hdfs-site.xml’:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
Edit ‘etc/hadoop/mapred-site.xml’:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
Edit ‘etc/hadoop/yarn-site.xml’:
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
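To check that Hadoop is reading the edited configuration, you can query a key directly from the same libexec directory, for example:
# should print hdfs://localhost:9000
./bin/hdfs getconf -confKey fs.defaultFS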
STEP 5 – START USING HADOOP (EXECUTE MAPREDUCE JOB)
Move to the Hadoop root directory
cd /usr/local/Cellar/hadoop/2.5.1
Format the Hadoop HDFS filesystem
./bin/hdfs namenode -format
Start the NameNode daemon & DataNode daemon
./sbin/start-dfs.sh
Browse the web interface for the NameNode
http://localhost:50070/
Start ResourceManager daemon and NodeManager daemon:
./sbin/start-yarn.sh
Check the daemons are all running:
jps
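If both start scripts ran cleanly, jps should list something like the following daemons (process IDs will differ per run):
# HDFS daemons: NameNode, DataNode, SecondaryNameNode
# YARN daemons: ResourceManager, NodeManager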
Browse the web interface for the ResourceManager
http://localhost:8088/
Create the HDFS directories required to execute MapReduce jobs:
./bin/hdfs dfs -mkdir -p /user/{username}
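If you want some data in HDFS to experiment with, you can copy a local file into your HDFS home directory and list it (the local path below is just a placeholder):
# copy a local file into HDFS and confirm it is there
./bin/hdfs dfs -put /path/to/some-local-file.txt /user/{username}/
./bin/hdfs dfs -ls /user/{username}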
Run an example MapReduce job (calculate pi)
# calculate pi
./bin/hadoop jar libexec/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.1.jar pi 10 100
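The same examples jar contains other jobs; for instance, a word count over an HDFS directory (the input and output paths below are placeholders, and the output directory must not already exist):
# count words in files under an HDFS input directory
./bin/hadoop jar libexec/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.1.jar wordcount /user/{username}/input /user/{username}/output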
Try more examples / experiments, or stop the daemons
./sbin/stop-dfs.sh
./sbin/stop-yarn.sh
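Alternatively, if you added the aliases from Step 3, the daemons can be started and stopped with:
# shortcuts defined in ~/.bash_profile
hstart
hstop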
STEP 6 – START USING SPARK
Spark has already earned a huge fan base and community of users and contributors because it’s faster than MapReduce (in memory and on disk) and easier to program – hence I want to learn how to use it.
Move to the Spark directory
cd /usr/local/Cellar/apache-spark/1.1.0
Run an example Spark application (calculate pi)
./bin/run-example SparkPi
Let’s try working with the Spark (Scala) shell, which provides a simple way to learn the API and a powerful framework to analyse data interactively. A special interpreter-aware SparkContext is automatically created and assigned to the variable ‘sc’.
Start the Spark shell (local or yarn mode)
# use the spark shell (local with 1 thread)
./bin/spark-shell
# or ... (local with 4 threads)
./bin/spark-shell --master local[4]
# or ... (yarn)
./bin/spark-shell --master yarn
# or ... (use the help flag for more options)
./bin/spark-shell --help
Every Spark context launches a web interface for monitoring
http://localhost:4040/
Try some basic Scala programming in the shell (use the ‘exit’ command to end the session)
println("Hello, World!")
val a = 5
a + 3
sc.parallelize(1 to 1000).count()
exit
Let’s try to execute an example Spark application on the Hadoop cluster using YARN. There are two deploy modes that can be used to launch Spark applications on YARN. In yarn-cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. In yarn-client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN.
# pattern to launch an application in yarn-cluster mode
./bin/spark-submit --class <path.to.class> --master yarn-cluster [options] <app.jar> [options]
# run example application (calculate pi)
./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster libexec/lib/spark-examples-*.jar
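For comparison, the same example can be launched in yarn-client mode, keeping the driver in the local client process so its output appears directly in your terminal; something like:
# run example application (calculate pi) with the driver in the client process
./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-client libexec/lib/spark-examples-*.jar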
THE END!