A document describing the configuration of a local, Apache-based Hadoop distribution running in pseudo-distributed mode. While various Hadoop vendors provide useful VMs, running natively offers better performance and more control over the environment for testing purposes (such as running multiple versions). For developers interested in the underlying details of the Hadoop stack, a native install built from the Apache projects is much clearer than trying to make sense of a vendor's internal versions (e.g. Cloudera's).
This document was composed using (at the time) the following versions:
- Hadoop 2.7.x
- HBase 1.1.x or 1.2.x
- Hive 1.2.x
- Spark 1.6.x or 2.x
- Kafka 0.10.2.0 or later
#### Prerequisites
- Oracle JDK 1.8
Note that this must be a JDK, not just the JRE. The Oracle release is preferred; most vendors recommend it for its support of strong encryption algorithms.
- IPv6
There have been known issues with Ubuntu, Hadoop, and IPv6 in the past, and vendors currently recommend disabling the IPv6 stack in the kernel.
sysctl.conf:
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
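To apply the settings and confirm that IPv6 is disabled without rebooting (a minimal check, assuming a standard Linux /proc layout):
# sysctl -p
# cat /proc/sys/net/ipv6/conf/all/disable_ipv6
1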
Optional: While not always possible across hosts depending on your environment, it is nice to choose a specific UID and GID for the Hadoop user and group to ensure consistency across hosts. If you wish to run the ecosystem as the 'hadoop' user, a home directory is needed to hold .ssh/authorized_keys for most start scripts.
# UID=xxx; GID=yyy
# groupadd -g $GID hadoop
# useradd -g hadoop -m -u $UID hadoop
Alternatively, if simply running a pseudo-distributed instance, running as yourself works great.
While it is possible to run the services on localhost only (loopback), some hacks are involved for services like Spark that traditionally have not supported loopback. Running on a laptop is convenient for development work, but it can suffer from not having a fixed, available network interface and IP. The easiest solution in such cases is to use a virtual interface.
Additionally, do NOT have the loopback entry in /etc/hosts set to the hostname. Among the provided scripts, the 'hadoop-init.sh' script validates the configuration prior to starting HDFS; running either 'status' or 'start' will verify the detected hostname configuration.
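A quick manual version of the same check (the hostname and IP shown are illustrative):
$ hostname -f
host.example.com
$ getent hosts $(hostname -f)
10.10.10.60     host.example.com
The lookup should return a routable interface address, not 127.0.0.1 (or Ubuntu's 127.0.1.1).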
SSH keys are required for starting services (such as the secondary namenode).
# su - hadoop
$ ssh-keygen
$ cat .ssh/id_rsa.pub >> .ssh/authorized_keys
$ chmod 600 !$
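Verify that passwordless SSH now works; after accepting the host key on first connect, both commands should return without prompting for a password:
$ ssh localhost true
$ ssh $(hostname -f) true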
#### Installation
Choose a base path for the Hadoop ecosystem, e.g. /opt/tdh. From here, install the various ecosystem components, complete with versions.
# mkdir -p /opt/tdh && cd /opt/tdh
# wget http://url/to/hadoop-2.7.1-bin.tar.gz
# tar -zxvf hadoop-2.7.1-bin.tar.gz
# mv hadoop-2.7.1-bin hadoop-2.7.1
# chown -R hadoop:hadoop hadoop-2.7.1
# ln -s hadoop-2.7.1 hadoop
Use this pattern for other ecosystem components as well:
$ ls -l /opt/tdh/
lrwxrwxrwx 1 hadoop hadoop 12 Dec 29 12:44 hadoop -> hadoop-2.7.1
drwxrwxr-x 10 hadoop hadoop 4096 Dec 29 13:07 hadoop-2.7.1
lrwxrwxrwx 1 hadoop hadoop 11 Nov 7 20:38 hbase -> hbase-1.1.5
drwxr-xr-x 8 hadoop hadoop 4096 Nov 6 11:38 hbase-1.1.5
drwxr-xr-x 4 hadoop hadoop 4096 Nov 4 07:43 hdfs
lrwxrwxrwx 1 hadoop hadoop 10 Feb 24 15:25 hive -> hive-1.2.1
drwxr-xr-x 9 hadoop hadoop 4096 May 4 16:58 hive-1.2.1
lrwxrwxrwx 1 hadoop hadoop 9 May 4 10:48 hue -> hue-4.1.0
drwxr-xr-x 12 hadoop hadoop 4096 May 4 23:12 hue-4.1.0
lrwxrwxrwx 1 hadoop hadoop 18 Nov 16 19:59 kafka -> kafka_2.11-0.10.2.0
drwxr-xr-x 6 hadoop hadoop 4096 Nov 17 10:51 kafka_2.11-0.10.2.0
lrwxrwxrwx 1 hadoop hadoop 11 Nov 19 11:05 spark -> spark-2.3.1
drwxr-xr-x 12 hadoop hadoop 4096 Dec 2 10:23 spark-2.3.1
lrwxrwxrwx 1 hadoop hadoop 12 Dec 2 10:23 sqoop -> sqoop-1.99.6
drwxr-xr-x 12 hadoop hadoop 4096 Dec 2 10:23 sqoop-1.99.6
lrwxrwxrwx 1 hadoop hadoop 14 Dec 2 10:23 zeppelin -> zeppelin-0.8.0
drwxr-xr-x 12 hadoop hadoop 4096 Dec 2 10:23 zeppelin-0.8.0
Update the configs in '/opt/tdh/hadoop/etc/hadoop'. Set JAVA_HOME in the hadoop-env.sh file. This should be set to the Oracle JDK previously installed.
core-site.xml:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://hostname:8020</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/var/tmp/hadoop/data</value>
</property>
</configuration>
hdfs-site.xml:
Choose a path for the NameNode and DataNode directories. Note that the replication parameter must be set properly: even though we are running in pseudo-distributed mode, this is still a single node, so we do not want any replication.
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///opt/tdh/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///opt/tdh/hdfs/datanode</value>
</property>
</configuration>
yarn-site.xml:
<configuration>
<property>
<name>yarn.resourcemanager.address</name>
<value>hostname:8050</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>hostname:8025</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>hostname:8030</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
If intending to use Spark 2.x and dynamic allocation, the external Spark shuffle service should be configured:
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle,spark_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
<value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>
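Note that the YarnShuffleService class must also be on the NodeManager classpath. With the Apache Spark binary distribution, this typically means copying the yarn shuffle jar into Hadoop's YARN lib directory and restarting the NodeManager (the jar lives under $SPARK_HOME/yarn/ in Spark 2.x and $SPARK_HOME/lib/ in 1.6; the exact name varies by version):
# cp $SPARK_HOME/yarn/spark-*-yarn-shuffle.jar $HADOOP_HOME/share/hadoop/yarn/lib/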
The environment below serves as an example and can be added to a user's .bashrc file, though I prefer to keep these settings in a separate env file such as hadoop-env-user.sh, which can then be sourced from .bashrc. This also makes it easier to share these settings with other users/accounts.
.bashrc:
if [ -f ~/hadoop-env-user.sh ]; then
. ~/hadoop-env-user.sh
fi
hadoop-env-user.sh:
# User oriented environment variables (for use with bash)
# The java implementation to use.
export JAVA_HOME=/usr/lib/jvm/oracle-jdk-bin-1.8
export HADOOP_ROOT="/opt/tdh"
export HADOOP_HOME="$HADOOP_ROOT/hadoop"
export HADOOP_CONF_DIR="$HADOOP_HOME/etc/hadoop"
export HADOOP_COMMON_HOME="$HADOOP_HOME"
export HADOOP_HDFS_HOME="$HADOOP_COMMON_HOME"
export HADOOP_MAPRED_HOME="$HADOOP_COMMON_HOME"
export YARN_HOME="$HADOOP_COMMON_HOME"
export HBASE_HOME="$HADOOP_ROOT/hbase"
export HIVE_HOME="$HADOOP_ROOT/hive"
export KAFKA_HOME="$HADOOP_ROOT/kafka"
export SPARK_HOME="$HADOOP_ROOT/spark"
export OOZIE_HOME="$HADOOP_ROOT/oozie"
export HADOOP_PATH="\
$HADOOP_HOME/bin:\
$HBASE_HOME/bin:\
$HIVE_HOME/bin:\
$KAFKA_HOME/bin:\
$SPARK_HOME/bin"
# shouldn't need to add anything below this line
# this part is intended to avoid stomping on these vars
if [ "$LD_LIBRARY_PATH" ]; then
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:$HADOOP_HOME/lib/native"
else
export LD_LIBRARY_PATH="$HADOOP_HOME/lib/native"
fi
if [ "$HADOOP_CLASSPATH" ]; then
if [ "$CLASSPATH" ]; then
export CLASSPATH="$CLASSPATH:$HADOOP_CLASSPATH"
else
export CLASSPATH="$HADOOP_CLASSPATH"
fi
fi
if [ "$HADOOP_PATH" ]; then
if [ "$PATH" ]; then
export PATH="$PATH:$HADOOP_PATH"
else
export PATH="$HADOOP_PATH"
fi
fi
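After sourcing the env file, a quick sanity check (output shown for this layout):
$ . ~/hadoop-env-user.sh
$ which hadoop
/opt/tdh/hadoop/bin/hadoop
$ hadoop version
Hadoop 2.7.1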
With the environment set up, the following creates the NameNode and DataNode directories specified in hdfs-site.xml and formats the NameNode.
# mkdir -p /opt/tdh/hdfs/namenode
# mkdir -p /opt/tdh/hdfs/datanode
# chown -R hadoop:hadoop /opt/tdh/hdfs
# sudo -u hadoop hdfs namenode -format
Ensure the various start scripts are run as the correct user if applicable (e.g. sudo -i -u hadoop).
$ $HADOOP_HOME/sbin/start-dfs.sh
$ $HADOOP_HOME/sbin/start-yarn.sh
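If the daemons started cleanly, jps should show all five (process IDs will differ):
$ jps
4371 NameNode
4486 DataNode
4675 SecondaryNameNode
4842 ResourceManager
4951 NodeManager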
Perform a quick test to verify that HDFS is working.
$ hdfs dfs -mkdir /user
$ hdfs dfs -ls /
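A slightly fuller round-trip test, using whatever account will run jobs:
$ hdfs dfs -mkdir -p /user/$(whoami)
$ hdfs dfs -put /etc/hosts /user/$(whoami)/
$ hdfs dfs -cat /user/$(whoami)/hosts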
#### HBase
This installation is fairly straightforward and follows the same pattern as earlier.
$ cd /opt/tdh
$ wget http://url/to/download/hbase-1.1.5-bin.tar.gz
$ tar -zxvf hbase-1.1.5-bin.tar.gz
$ mv hbase-1.1.5-bin hbase-1.1.5
$ chown -R hadoop:hadoop hbase-1.1.5
$ ln -s !$ hbase
Set the JAVA_HOME variable in $HBASE_HOME/conf/hbase-env.sh. Then update the hbase-site.xml file with the configuration below. Note that some of these values are defaults and as such are not necessary, but are included for reference.
hbase-site.xml:
<configuration>
<property>
<name>hbase.master.port</name>
<value>16000</value>
</property>
<property>
<name>hbase.master.info.bindAddress</name>
<value>10.10.10.60</value>
</property>
<property>
<name>hbase.master.info.port</name>
<value>16010</value>
</property>
<property>
<name>hbase.regionserver.port</name>
<value>16020</value>
</property>
<property>
<name>hbase.regionserver.info.bindAddress</name>
<value>10.10.10.60</value>
</property>
<property>
<name>hbase.regionserver.info.port</name>
<value>16030</value>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
<property>
<name>hbase.rootdir</name>
<value>hdfs://host:8020/hbase</value>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>host</value>
</property>
</configuration>
Use the scripts $HBASE_HOME/bin/start-hbase.sh and $HBASE_HOME/bin/stop-hbase.sh to start and stop HBase, respectively. Note that HBase needs a running ZooKeeper; by default, HBase starts and manages its own ZooKeeper instance automatically. Since many other ecosystem components, such as Spark and Kafka, make use of ZooKeeper, it is important that HBase is started after YARN and before the other components. The hadoop-eco.sh script handles this ordering properly. Alternatively, ZooKeeper can be installed and configured separately from HBase.
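Once up, a quick smoke test via the HBase shell (the table and column family names are arbitrary):
$ echo "status" | $HBASE_HOME/bin/hbase shell
$ echo "create 't1', 'cf1'" | $HBASE_HOME/bin/hbase shell
$ echo "list" | $HBASE_HOME/bin/hbase shell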
#### Spark
$ cd /opt/tdh
$ wget http://url/to/spark-1.6.2-bin-hadoop2.6.tgz
$ tar -zxf spark-1.6.2-bin-hadoop2.6.tgz
$ rm !$
$ mv spark-1.6.2-bin-hadoop2.6 spark-1.6.2
$ chown -R hadoop:hadoop !$
$ ln -s !$ spark
Configuring Spark depends a bit on the requirements. The following configuration options are simply an example and are not complete; you will likely want to tweak the default number of worker cores, instances, and memory. Note that the settings below apply primarily to a Spark standalone configuration; for Spark on YARN, there is no need to start the Spark master. Additionally, there may be a need to set the JAVA_HOME and SPARK_DIST_CLASSPATH variables in spark-env.sh.
$ cd /opt/tdh/spark/conf
$ cp spark-env.sh.template spark-env.sh
$ cp spark-defaults.conf.template spark-defaults.conf
$ cp slaves.template slaves
$ cp log4j.properties.template log4j.properties
spark-env.sh:
export SPARK_DAEMON_JAVA_OPTS="-Dlog4j.configuration=file:///opt/tdh/spark/conf/log4j.properties"
export SPARK_DIST_CLASSPATH=$(/opt/tdh/hadoop/bin/hadoop classpath)
# for standalone mode
export SPARK_MASTER_IP="localhost"
export SPARK_MASTER_PORT=7077
export SPARK_MASTER_WEBUI_PORT=8080
export SPARK_EXECUTOR_INSTANCES=1
export SPARK_EXECUTOR_CORES=2
export SPARK_EXECUTOR_MEMORY="1g"
export SPARK_WORKER_DIR=/var/tmp/spark/work
export SPARK_LOCAL_DIRS=/var/tmp/spark
export SPARK_DAEMON_MEMORY=1g
spark-defaults.conf:
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.streaming.receiver.writeAheadLog.enable true
spark.streaming.backpressure.enabled true
spark.eventLog.enabled true
spark.executor.memory 1g
#spark.streaming.receiver.maxRate 0
#spark.streaming.kafka.maxRatePerPartition 0
To test running a spark job on YARN, try the following spark example:
$SPARK_HOME/bin/spark-submit --class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode cluster \
--num-executors 1 \
--executor-cores 2 \
lib/spark-examples*.jar \
100
Check the YARN UI at http://host:8088/.
Jobs can be submitted directly to the Spark master as well, and viewed via the Spark UI at http://host:8080/.
$SPARK_HOME/bin/spark-submit --class org.apache.spark.examples.SparkPi \
--master spark://$host:7077 \
--num-executors 1 \
--executor-cores 2 \
lib/spark-examples*.jar \
100
For running spark-shell or pyspark, use the '--master' switch with either 'yarn' as the target or the Spark master URL.
pyspark --master spark://$host:7077
The following is a sample configuration for spark-defaults.conf and spark-env.sh intended for Spark2.
spark-defaults.conf
spark.master=yarn
spark.submit.deployMode=client
spark.ui.killEnabled=true
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.eventLog.enabled=true
spark.eventLog.dir=hdfs://thebe:8020/tmp/spark
spark.yarn.historyServer.address=http://thebe:18088
spark.history.fs.logDirectory=hdfs://thebe:8020/tmp/spark
#spark.sql.hive.metastore.jars=${HIVE_HOME}/lib/*
#spark.sql.hive.metastore.version=1.1.0
#spark.sql.catalogImplementation=hive
#spark.yarn.jars=local:/opt/hadoop/spark/jars/*
spark.driver.extraLibraryPath=${HADOOP_COMMON_HOME}/lib/native
spark.executor.extraLibraryPath=${HADOOP_COMMON_HOME}/lib/native
spark.yarn.am.extraLibraryPath=${HADOOP_COMMON_HOME}/lib/native
spark.hadoop.mapreduce.application.classpath=
spark.hadoop.yarn.application.classpath=
spark.driver.memory=1g
spark.executor.memory=1g
spark-env.sh:
export STANDALONE_SPARK_MASTER_HOST=`hostname`
export SPARK_MASTER_IP=$STANDALONE_SPARK_MASTER_HOST
if [ -z "$SPARK_HOME" ]; then
SELF="$(cd $(dirname $BASH_SOURCE) && pwd)"
if [ -z "$SPARK_CONF_DIR" ]; then
export SPARK_CONF_DIR="$SELF"
fi
export SPARK_HOME="/opt/tdh/spark"
fi
SPARK_PYTHON_PATH=""
if [ -n "$SPARK_PYTHON_PATH" ]; then
export PYTHONPATH="$PYTHONPATH:$SPARK_PYTHON_PATH"
fi
if [ -z "$HADOOP_HOME" ]; then
export HADOOP_HOME="/opt/tdh/hadoop"
fi
if [ -n "$HADOOP_HOME" ]; then
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:${HADOOP_HOME}/lib/native
fi
export LD_LIBRARY_PATH
PYLIB="$SPARK_HOME/python/lib"
if [ -f "$PYLIB/pyspark.zip" ]; then
PYSPARK_ARCHIVES_PATH=
for lib in "$PYLIB"/*.zip; do
if [ -n "$PYSPARK_ARCHIVES_PATH" ]; then
PYSPARK_ARCHIVES_PATH="$PYSPARK_ARCHIVES_PATH,local:$lib"
else
PYSPARK_ARCHIVES_PATH="local:$lib"
fi
done
export PYSPARK_ARCHIVES_PATH
fi
export SPARK_LIBRARY_PATH=${SPARK_HOME}/jars
export SPARK_MASTER_WEBUI_PORT=18080
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_PORT=7078
export SPARK_WORKER_WEBUI_PORT=18081
export SPARK_WORKER_DIR=/var/run/spark/work
export SPARK_LOG_DIR=/var/log/spark
export SPARK_PID_DIR='/var/run/spark/'
export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-/etc/hadoop/conf}
if [[ -d $SPARK_HOME/python ]]
then
    # add the pyspark support zips (pyspark, py4j) to the classpath
    for i in "$SPARK_HOME"/python/lib/*.zip
    do
        SPARK_DIST_CLASSPATH=${SPARK_DIST_CLASSPATH}:$i
    done
fi
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:$SPARK_LIBRARY_PATH/*"
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:$(/opt/hadoop/hadoop/bin/hadoop classpath)"
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:$HBASE_CONF_DIR:$HBASE_HOME/lib/*"
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:$HIVE_HOME/conf/hive-site.xml:$HIVE_HOME/lib/*"
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:$KAFKA_HOME/libs/*"
echo "SPARK_DIST_CLASSPATH=\"$SPARK_DIST_CLASSPATH\""
#### Spark Dynamic Allocation
Dynamic allocation is a nice feature, especially with constrained resources and notebook users. To enable dynamic allocation, the external Spark shuffle service must be added to YARN.
yarn-site.xml:
<property>
<name>yarn.nodemanager.aux-services</name>
<value>spark_shuffle,mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
<value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>
spark-defaults.conf
spark.authenticate=false
spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.executorIdleTimeout=60
spark.dynamicAllocation.minExecutors=0
spark.dynamicAllocation.schedulerBacklogTimeout=1
spark.shuffle.service.enabled=true
spark.shuffle.service.port=7337
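With the above in place, a job submitted without an explicit --num-executors should have its executors scale up and then idle away. For example, reusing the SparkPi job from earlier (the examples jar lives under lib/ in Spark 1.x and examples/jars/ in 2.x):
$SPARK_HOME/bin/spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn \
    $SPARK_HOME/examples/jars/spark-examples*.jar \
    1000
Watch the executor count change in the YARN UI or the application's Spark UI.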
#### Kafka
$ cd /opt/tdh
$ wget https://www.apache.org/dyn/closer.cgi?path=/kafka/0.10.2.0/kafka_2.11-0.10.2.0.tgz
$ tar -zxf kafka_2.11-0.10.2.0.tgz
$ rm !$
$ ln -s kafka_2.11-0.10.2.0 kafka
$ chown -R hadoop:hadoop kafka_2.11-0.10.2.0
For the configuration, Kafka comes mostly ready for a single-node setup out of the box. Nonetheless, it is worth perusing the various configuration files in the 'config' directory. For one, set the zookeeper.connect string in consumer.properties:
zookeeper.connect=10.10.10.11:2181
Additionally, verify the settings in server.properties are sane for your system. Once complete, the Kafka service can be started by running the following command.
sudo -u hadoop $KAFKA_HOME/bin/kafka-server-start.sh $KAFKA_HOME/config/server.properties &
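A quick round-trip test with the bundled CLI tools (the topic name is arbitrary; the --zookeeper and --broker-list flags match this Kafka generation):
$ $KAFKA_HOME/bin/kafka-topics.sh --create --zookeeper localhost:2181 \
    --replication-factor 1 --partitions 1 --topic test
$ $KAFKA_HOME/bin/kafka-topics.sh --list --zookeeper localhost:2181
$ echo "hello kafka" | $KAFKA_HOME/bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
$ $KAFKA_HOME/bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning --max-messages 1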
#### Hive
- Install MySQL
- Create the metastore database (a sketch follows the hive-site.xml block below)
- Locate the schema file $HIVE_HOME/scripts/metastore/upgrade/mysql/hive-schema-x.x.x.mysql.sql; edit it, and update the reference to the txn-schema file with its fully-qualified path
- Source the schema
- Configure hive-site.xml:
<configuration>
<property>
<name>mapred.reduce.tasks</name>
<value>-1</value>
<description>The default number of reduce tasks per job.</description>
</property>
<property>
<name>hive.exec.scratchdir</name>
<value>/tmp/hive</value>
<description>Scratch space for Hive jobs</description>
</property>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive/warehouse</value>
<description>location of default database for the warehouse</description>
</property>
<property>
<name>hive.enforce.bucketing</name>
<value>true</value>
<description>Whether bucketing is enforced. If true, while inserting into the table, bucketing is enforced.</description>
</property>
<property>
<name>hive.hwi.war.file</name>
<value>lib/hive-hwi-1.2.1.jar</value>
<description>This sets the path to the HWI file, relative to ${HIVE_HOME}</description>
</property>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://dbhost/metastore?createDatabaseIfNotExist=true</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>Driver class name for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>hivesql</value>
</property>
<property>
<name>hive.metastore.uris</name>
<value>thrift://hostname:9083</value>
<description>IP address (or fqdn) and port of the metastore host</description>
</property>
<property>
<name>datanucleus.fixedDatastore</name>
<value>true</value>
</property>
<property>
<name>datanucleus.autoCreateSchema</name>
<value>false</value>
</property>
</configuration>
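A minimal sketch of creating and loading the metastore database (steps two through four above), assuming a local MySQL and the credentials used in hive-site.xml; match the schema file to your Hive version (hive-schema-1.2.0.mysql.sql for Hive 1.2.x), and run it from its own directory so its relative includes resolve:
$ mysql -u root -p -e "CREATE DATABASE metastore; \
    CREATE USER 'hive'@'localhost' IDENTIFIED BY 'hivesql'; \
    GRANT ALL PRIVILEGES ON metastore.* TO 'hive'@'localhost'; \
    FLUSH PRIVILEGES;"
$ cd $HIVE_HOME/scripts/metastore/upgrade/mysql
$ mysql -u hive -phivesql metastore < hive-schema-1.2.0.mysql.sql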
Test with:
$HIVE_HOME/bin/hive -hiveconf hive.root.logger=DEBUG,console
- Start the metastore via:
$HIVE_HOME/bin/hive --service metastore
Note that this does not daemonize properly, so a better approach is to use nohup:
nohup $HIVE_HOME/bin/hive --service metastore > /var/tmp/hivemetastore.log 2>&1 &