
@netvl
Last active March 5, 2019 07:59
Hadoop environment in Docker

This environment is based on a nice set of Docker images from https://bitbucket.org/uhopper/hadoop-docker. They provide an older version of Hadoop (2.7.2), but they can be updated fairly easily if needed.

In order to run this compose file, you first need to create a network called hadoop:

% docker network create hadoop

This compose file also includes a DNS proxy container which can be used to resolve the nodes in the Docker network by their names. The respective Dockerfile is attached below; run docker build -t netvl/dnsproxy:latest . in the directory containing it. To use the DNS proxy, add its bridge network address to the resolv.conf file, e.g. with resolvconf (on Linux systems):

% docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' <dnsproxy container>  # find out which address the container got
% resolvconf -a dnsproxy <<<'nameserver 172.19.0.2'  # or whatever address the container gets

If you do not need the DNS proxy, just remove its section from the compose file, although it may be difficult to access the web UIs without it. If you keep the DNS proxy and your host system is configured as described above, you will be able to access the Hadoop cluster nodes by the hostnames declared in the compose file, e.g. http://resourcemanager.hadoop:8088.

You may want to run this compose file on a proper Linux machine rather than under Docker for Mac/Windows, because the Hadoop processes together take a large amount of memory, and because networking in non-native Docker environments works rather poorly.

When you run the compose file with docker-compose up, make sure that you have the hadoop.env file in the same directory. This file contains various important configuration options which get written into Hadoop configuration files.
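The variable names in hadoop.env encode both the target configuration file and the property name. Based on how the uhopper images appear to translate these variables (this exact mapping is my assumption, not spelled out here), the prefix selects the file (CORE_CONF → core-site.xml, YARN_CONF → yarn-site.xml, and so on), a triple underscore stands for a dash, and a single underscore stands for a dot. A quick sketch of that assumed mapping:

```shell
# Assumed translation applied by the image entrypoint scripts:
# strip the FILE_CONF_ prefix, then "___" -> "-" and "_" -> ".".
translate() {
    echo "$1" | sed -e 's/^[A-Z]*_CONF_//' -e 's/___/-/g' -e 's/_/./g'
}

translate "CORE_CONF_fs_defaultFS"                     # fs.defaultFS
translate "YARN_CONF_yarn_log___aggregation___enable"  # yarn.log-aggregation.enable
```

So, for example, CORE_CONF_fs_defaultFS from the environment file below should end up as the fs.defaultFS property in core-site.xml.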

You can add more datanodes to the cluster by copy-pasting the respective section in the compose file.
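For example, a third datanode would be a copy of the datanode-1/datanode-2 sections with a new name, hostname, and volume (the hypothetical datanode-3-vol volume must also be declared in the top-level volumes section):

```yaml
  datanode-3:
    image: uhopper/hadoop-datanode
    depends_on:
      - namenode
    hostname: datanode-3.hadoop
    volumes:
      - datanode-3-vol:/hadoop/dfs/data
    env_file:
      - ./hadoop.env
```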

To access the cluster, run the uhopper/hadoop image in the same network and with the same environment file:

% docker run --rm -it --network hadoop --env-file hadoop.env uhopper/hadoop bash
# docker-compose.yml
version: "3"
services:
  namenode:
    image: uhopper/hadoop-namenode
    hostname: namenode.hadoop
    volumes:
      - namenode-vol:/hadoop/dfs/name
    environment:
      CLUSTER_NAME: hadoop
    env_file:
      - ./hadoop.env
  historyserver:
    image: uhopper/hadoop-historyserver
    depends_on:
      - namenode
    hostname: historyserver.hadoop
    volumes:
      - historyserver-vol:/hadoop/yarn/timeline
    env_file:
      - ./hadoop.env
  mrjobhistoryserver:
    image: netvl/hadoop-mrjobhistoryserver
    depends_on:
      - namenode
    hostname: mrjobhistoryserver.hadoop
    env_file:
      - ./hadoop.env
  datanode-1:
    image: uhopper/hadoop-datanode
    depends_on:
      - namenode
    hostname: datanode-1.hadoop
    volumes:
      - datanode-1-vol:/hadoop/dfs/data
    env_file:
      - ./hadoop.env
  datanode-2:
    image: uhopper/hadoop-datanode
    depends_on:
      - namenode
    hostname: datanode-2.hadoop
    volumes:
      - datanode-2-vol:/hadoop/dfs/data
    env_file:
      - ./hadoop.env
  resourcemanager:
    image: uhopper/hadoop-resourcemanager
    depends_on:
      - namenode
    hostname: resourcemanager.hadoop
    env_file:
      - ./hadoop.env
  nodemanager:
    image: uhopper/hadoop-nodemanager
    hostname: nodemanager.hadoop
    depends_on:
      - namenode
      - resourcemanager
    env_file:
      - ./hadoop.env
  dnsproxy:
    image: netvl/dnsproxy
    hostname: dnsproxy.hadoop
networks:
  default:
    external:
      name: hadoop
volumes:
  namenode-vol:
  datanode-1-vol:
  datanode-2-vol:
  historyserver-vol:
# Dockerfile for a DNS proxy
FROM alpine:latest
RUN apk add --no-cache dnsmasq
EXPOSE 53/udp
CMD ip a && dnsmasq -u root --no-daemon
# hadoop.env
CORE_CONF_fs_defaultFS=hdfs://namenode.hadoop:8020
CORE_CONF_hadoop_http_staticuser_user=root
YARN_CONF_yarn_log___aggregation___enable=true
YARN_CONF_yarn_resourcemanager_recovery_enabled=true
YARN_CONF_yarn_resourcemanager_store_class=org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore
YARN_CONF_yarn_resourcemanager_fs_state___store_uri=/rmstate
YARN_CONF_yarn_nodemanager_remote___app___log___dir=/app-logs
YARN_CONF_yarn_nodemanager_aux___services=mapreduce_shuffle
YARN_CONF_yarn_log_server_url=http://historyserver.hadoop:8188/applicationhistory/logs/
YARN_CONF_yarn_timeline___service_enabled=true
YARN_CONF_yarn_timeline___service_generic___application___history_enabled=true
YARN_CONF_yarn_resourcemanager_system___metrics___publisher_enabled=true
YARN_CONF_mapreduce_jobhistory_address=mrjobhistoryserver:10020
YARN_CONF_mapreduce_jobhistory_webapp_address=mrjobhistoryserver:19888
YARN_CONF_mapreduce_framework_name=yarn
YARN_CONF_yarn_resourcemanager_hostname=resourcemanager.hadoop
YARN_CONF_yarn_timeline___service_hostname=historyserver.hadoop