@arnobroekhof
Created June 27, 2019 08:47
Copy files from HDFS to S3 using DistCp behind a proxy
#!/bin/sh
# Exit on the first failing command; set here so it also applies when run as "sh script.sh".
set -e
PROXY_HOST="<XXX.XXX.XXX.XXX>"
PROXY_PORT=3128
HDFS_URI="hdfs://<NAMENODE>"
HDFS_PATH="/my/data/path"
S3_ACCESS_KEY="<S3_ACCESS_KEY>"
S3_SECRET_KEY="<S3_SECRET_KEY>"
S3_BUCKET="<S3_BUCKET>"
# Must start with "/": it is appended directly to the bucket name in the distcp command.
S3_PATH="<PATH_ON_S3>"
S3_ENDPOINT="s3.eu-west-1.amazonaws.com"
# Hosts that bypass the proxy, separated by "|".
export NO_PROXY_HOSTS="127.0.0.1|example.com"
export DISTCP_PROXY_OPTS="-Dhttp.nonProxyHosts=${NO_PROXY_HOSTS} -Dhttps.nonProxyHosts=${NO_PROXY_HOSTS} -Dhttps.proxyHost=${PROXY_HOST} -Dhttps.proxyPort=${PROXY_PORT} -Dhttp.proxyHost=${PROXY_HOST} -Dhttp.proxyPort=${PROXY_PORT}"
# S3A connector settings. Credentials passed as -D options are visible in the process list.
export DISTCP_S3_OPTS="-Dfs.s3a.endpoint=${S3_ENDPOINT} -Dfs.s3a.fast.upload=true -Dfs.s3a.access.key=${S3_ACCESS_KEY} -Dfs.s3a.secret.key=${S3_SECRET_KEY}"
export DISTCP_S3_OPTS="${DISTCP_S3_OPTS} -Dfs.s3a.proxy.host=${PROXY_HOST} -Dfs.s3a.proxy.port=${PROXY_PORT}"
# HADOOP_OPTS is read by the hadoop launcher, so the client JVM gets the same settings.
export JAVA_OPTS="${DISTCP_PROXY_OPTS} ${DISTCP_S3_OPTS}"
export HADOOP_OPTS="${JAVA_OPTS}"
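# The *_OPTS variables are deliberately unquoted so each -D option becomes its own argument.
# -update copies only files that are missing or differ at the destination; -skipcrccheck
# skips the checksum comparison between source and target.
hadoop distcp ${DISTCP_PROXY_OPTS} ${DISTCP_S3_OPTS} \
-D mapreduce.map.java.opts="${DISTCP_S3_OPTS}" \
-D mapreduce.reduce.java.opts="${DISTCP_S3_OPTS}" \
-update -skipcrccheck -numListstatusThreads 40 \
"${HDFS_URI}${HDFS_PATH}" "s3a://${S3_BUCKET}${S3_PATH}"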
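
The destination URI in the script is built by writing S3_BUCKET and S3_PATH back to back, so the result is only valid when S3_PATH begins with a slash. A minimal sketch of a helper that accepts both forms could look like the following. The function name `s3a_uri` is hypothetical and not part of the original script.

```shell
#!/bin/sh
# Hypothetical helper: join a bucket name and a path with exactly one "/".
# $1 = bucket name, $2 = path with or without a leading slash.
s3a_uri() {
    # ${2#/} strips a single leading "/" if present (POSIX parameter expansion).
    printf 's3a://%s/%s\n' "$1" "${2#/}"
}
```

With this in place, the last argument of the distcp command could be written as `"$(s3a_uri "${S3_BUCKET}" "${S3_PATH}")"`, assuming the function is defined earlier in the same script.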