@tcarland
Last active September 11, 2020 15:04
Information for building hadoop and related components from source.

Building Hadoop and Various Ecosystem Components

A guide for building Hadoop and other ecosystem components from source.

Building Hadoop (v2.7.4)

Prerequisites:

  • Oracle JDK 1.8
  • Maven 3.x
  • protobuf 2.5.0
  • cmake
  • openssl
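
Before starting a long build, it can help to confirm the tools listed above are on the PATH. The following is a minimal sketch, not part of the original build steps; the function name missing_tools is hypothetical, and the command names (java, mvn, protoc, cmake, openssl) are assumed to be the usual binaries for these packages.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the name of each given command that is not on the PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Assumed binary names for the prerequisites listed above.
missing="$(missing_tools java mvn protoc cmake openssl)"
if [ -n "$missing" ]; then
    # Unquoted on purpose so the names print on one line.
    echo "missing build tools:" $missing >&2
fi
```

This only checks that each command exists, not that its version is suitable.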

Hadoop 2.7.x requires protobuf 2.5.0 specifically, as do Hadoop 3.x releases before 3.3. Build and install it from the unpacked protobuf 2.5.0 source tree:

$ ./configure --prefix=/usr/local
$ make
$ make install
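
Since the Hadoop build depends on the exact protobuf release, a quick check of the installed compiler can save a failed Maven run. This is a sketch under the assumption that `protoc --version` prints a line of the form "libprotoc 2.5.0"; the function name protoc_version_ok is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds when the reported protoc version matches the required one.
#   $1: output of `protoc --version`, assumed to look like "libprotoc 2.5.0"
#   $2: required version (defaults to 2.5.0)
protoc_version_ok() {
    local reported="${1##* }"   # keep only the text after the last space
    [ "$reported" = "${2:-2.5.0}" ]
}

if command -v protoc >/dev/null 2>&1; then
    if protoc_version_ok "$(protoc --version)"; then
        echo "protoc 2.5.0 found"
    else
        echo "unexpected protoc version: $(protoc --version)" >&2
    fi
else
    echo "protoc not found on PATH" >&2
fi
```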

Hadoop 2.7.4:

export MAVEN_OPTS="-Xms256m -Xmx512m"
mvn clean package -Pdist,native,docs -DskipTests -Dtar

Building HBase (v1.1)

Prerequisites:

  • snappy
  • zlib

mvn compile -Dsnappy
or
MAVEN_OPTS="-Xmx1g -XX:MaxPermSize=512m" mvn clean site install assembly:assembly -Dsnappy -DskipTests -Prelease

Building Spark (v1.6.x)

Prerequisites:

  • Spark 1.4.x requires Maven 3.0.x
  • Spark 1.5.x requires Maven 3.3.x

When building Spark for YARN, or for any host where the Hadoop dependencies are already available, the -Phadoop-provided flag keeps the Hadoop-dependent jars out of the resulting distribution. For Spark standalone on hosts that do not have a Hadoop distribution installed, omit the flag. The --name parameter labels the specific build.

export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"
./make-distribution.sh --name custom-spark --tgz --skip-java-test -Phadoop-2.6 \
-Dhadoop.version=2.7.1 -Pyarn -Phive -Phive-thriftserver -Phadoop-provided
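
The choice between a hadoop-provided build and a self-contained one can be captured in a small wrapper. The sketch below is illustrative only: the function name spark_build_flags and its "yes" argument are hypothetical, and the profile list simply mirrors the command above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the Maven profile flags for a Spark build.
#   $1: "yes" when the target hosts already supply the Hadoop jars.
spark_build_flags() {
    local flags="-Pyarn -Phive -Phive-thriftserver"
    if [ "$1" = "yes" ]; then
        # Keep Hadoop-dependent jars out of the distribution.
        flags="$flags -Phadoop-provided"
    fi
    echo "$flags"
}

# Example use (left as a comment so that sourcing this file does not start a build):
#   ./make-distribution.sh --name custom-spark --tgz -Phadoop-2.6 \
#       -Dhadoop.version=2.7.1 $(spark_build_flags yes)
```

Calling the function with any argument other than "yes" leaves the flag out, which matches the standalone case described above.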

Spark 2.x.x

export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"
./dev/make-distribution.sh --name custom-spark --tgz -Phadoop-2.7 \
-Dhadoop.version=2.7.4 -Pyarn -Phive -Phive-thriftserver -Phadoop-provided

Spark 3.x.x

For hadoop-provided distribution:

./dev/make-distribution.sh --name callisto --pip --tgz -Psparkr -Phive \
-Phive-thriftserver -Pyarn -Pkubernetes -Dhadoop.version=2.8.5 -DskipTests \
-Phadoop-provided

Hive 1.2.1

mvn clean package -Phadoop-2,dist
