@ozbillwang
Last active August 29, 2015 14:13
How-to: Use Vagrant to Set Up a Virtual Hadoop Cluster (For CDH 4)
http://blog.cloudera.com/blog/2013/04/how-to-use-vagrant-to-set-up-a-virtual-hadoop-cluster/
http://blog.cloudera.com/blog/2014/06/how-to-install-a-virtual-apache-hadoop-cluster-with-vagrant-and-cloudera-manager/
by Justin Kestelyn (@kestelyn), April 09, 2013
This guest post comes to us from David Greco, CTO of Eligotech. For a how-to on this subject for CDH 5, see this post.
Vagrant is a very nice tool for programmatically managing many virtual machines (VMs) on a single physical machine. It natively supports VirtualBox and also provides plugins for VMware Fusion and Amazon EC2, supporting the management of VMs in those environments as well.
Vagrant provides a very easy-to-use, Ruby-based internal DSL that allows the user to define one or more virtual machines together with their configuration parameters. Furthermore, it offers different mechanisms for automatic provisioning: You can use Puppet, Chef, or shell scripts for automating software installation and configuration on the machines defined in the Vagrant configuration file.
So, using Vagrant, it’s possible to define complex virtual infrastructures based on multiple VMs running on your system. Pretty cool, no?
A typical use case for Vagrant is to build working/development environments in a simple and consistent way. At my company, Eligotech, we are developing a product aimed at simplifying the use of Apache Hadoop, and CDH, Cloudera’s open source distribution, is our reference Hadoop distribution. We often need to set up a Hadoop environment on our machines for testing, and we have found Vagrant to be a very handy tool for that.
I put together an example of a Vagrant configuration file that you can test for yourself. You’ll need to download and install Vagrant (instructions) and VirtualBox. Once everything has been installed, just copy-and-paste the text below to a file named Vagrantfile and put it in a directory named, for example, VagrantHadoop. This configuration file assumes you have at least 32GB of memory on your box; if that’s not the case, you can edit the file to suit your environment (to run fewer slaves, for example, by commenting out some of the slave configurations).
# -*- mode: ruby -*-
# vi: set ft=ruby :
$master_script = <<SCRIPT
#!/bin/bash
cat > /etc/hosts <<EOF
127.0.0.1 localhost
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
10.211.55.100 vm-cluster-node1
10.211.55.101 vm-cluster-node2
10.211.55.102 vm-cluster-node3
10.211.55.103 vm-cluster-node4
10.211.55.104 vm-cluster-node5
10.211.55.105 vm-cluster-client
EOF
apt-get install curl -y
REPOCM=${REPOCM:-cm4}
CM_REPO_HOST=${CM_REPO_HOST:-archive.cloudera.com}
CM_MAJOR_VERSION=$(echo $REPOCM | sed -e 's/cm\\([0-9]\\).*/\\1/')
CM_VERSION=$(echo $REPOCM | sed -e 's/cm\\([0-9][0-9]*\\)/\\1/')
OS_CODENAME=$(lsb_release -sc)
OS_DISTID=$(lsb_release -si | tr '[A-Z]' '[a-z]')
if [ $CM_MAJOR_VERSION -ge 4 ]; then
cat > /etc/apt/sources.list.d/cloudera-$REPOCM.list <<EOF
deb [arch=amd64] http://$CM_REPO_HOST/cm$CM_MAJOR_VERSION/$OS_DISTID/$OS_CODENAME/amd64/cm $OS_CODENAME-$REPOCM contrib
deb-src http://$CM_REPO_HOST/cm$CM_MAJOR_VERSION/$OS_DISTID/$OS_CODENAME/amd64/cm $OS_CODENAME-$REPOCM contrib
EOF
curl -s http://$CM_REPO_HOST/cm$CM_MAJOR_VERSION/$OS_DISTID/$OS_CODENAME/amd64/cm/archive.key > key
apt-key add key
rm key
fi
apt-get update
export DEBIAN_FRONTEND=noninteractive
apt-get -q -y --force-yes install oracle-j2sdk1.6 cloudera-manager-server-db cloudera-manager-server cloudera-manager-daemons
service cloudera-scm-server-db initdb
service cloudera-scm-server-db start
service cloudera-scm-server start
SCRIPT
$slave_script = <<SCRIPT
cat > /etc/hosts <<EOF
127.0.0.1 localhost
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
10.211.55.100 vm-cluster-node1
10.211.55.101 vm-cluster-node2
10.211.55.102 vm-cluster-node3
10.211.55.103 vm-cluster-node4
10.211.55.104 vm-cluster-node5
10.211.55.105 vm-cluster-client
EOF
SCRIPT
$client_script = <<SCRIPT
cat > /etc/hosts <<EOF
127.0.0.1 localhost
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
10.211.55.100 vm-cluster-node1
10.211.55.101 vm-cluster-node2
10.211.55.102 vm-cluster-node3
10.211.55.103 vm-cluster-node4
10.211.55.104 vm-cluster-node5
10.211.55.105 vm-cluster-client
EOF
SCRIPT
Vagrant.configure("2") do |config|
config.vm.define :master do |master|
master.vm.box = "precise64"
master.vm.provider "vmware_fusion" do |v|
v.vmx["memsize"] = "4096"
end
master.vm.provider :virtualbox do |v|
v.name = "vm-cluster-node1"
v.customize ["modifyvm", :id, "--memory", "4096"]
end
master.vm.network :private_network, ip: "10.211.55.100"
master.vm.hostname = "vm-cluster-node1"
master.vm.provision :shell, :inline => $master_script
end
config.vm.define :slave1 do |slave1|
slave1.vm.box = "precise64"
slave1.vm.provider "vmware_fusion" do |v|
v.vmx["memsize"] = "5120"
end
slave1.vm.provider :virtualbox do |v|
v.name = "vm-cluster-node2"
v.customize ["modifyvm", :id, "--memory", "5120"]
end
slave1.vm.network :private_network, ip: "10.211.55.101"
slave1.vm.hostname = "vm-cluster-node2"
slave1.vm.provision :shell, :inline => $slave_script
end
config.vm.define :slave2 do |slave2|
slave2.vm.box = "precise64"
slave2.vm.provider "vmware_fusion" do |v|
v.vmx["memsize"] = "5120"
end
slave2.vm.provider :virtualbox do |v|
v.name = "vm-cluster-node3"
v.customize ["modifyvm", :id, "--memory", "5120"]
end
slave2.vm.network :private_network, ip: "10.211.55.102"
slave2.vm.hostname = "vm-cluster-node3"
slave2.vm.provision :shell, :inline => $slave_script
end
config.vm.define :slave3 do |slave3|
slave3.vm.box = "precise64"
slave3.vm.provider "vmware_fusion" do |v|
v.vmx["memsize"] = "5120"
end
slave3.vm.provider :virtualbox do |v|
v.name = "vm-cluster-node4"
v.customize ["modifyvm", :id, "--memory", "5120"]
end
slave3.vm.network :private_network, ip: "10.211.55.103"
slave3.vm.hostname = "vm-cluster-node4"
slave3.vm.provision :shell, :inline => $slave_script
end
config.vm.define :slave4 do |slave4|
slave4.vm.box = "precise64"
slave4.vm.provider "vmware_fusion" do |v|
v.vmx["memsize"] = "5120"
end
slave4.vm.provider :virtualbox do |v|
v.name = "vm-cluster-node5"
v.customize ["modifyvm", :id, "--memory", "5120"]
end
slave4.vm.network :private_network, ip: "10.211.55.104"
slave4.vm.hostname = "vm-cluster-node5"
slave4.vm.provision :shell, :inline => $slave_script
end
config.vm.define :client do |client|
client.vm.box = "precise64"
client.vm.provider "vmware_fusion" do |v|
v.vmx["memsize"] = "4096"
end
client.vm.provider :virtualbox do |v|
v.name = "vm-cluster-client"
v.customize ["modifyvm", :id, "--memory", "4096"]
end
client.vm.network :private_network, ip: "10.211.55.105"
client.vm.hostname = "vm-cluster-client"
client.vm.provision :shell, :inline => $client_script
end
end
This file defines six machines to be assigned the following CDH 4 roles:
vm-cluster-node1: This is the master; besides running the CM master, it should run the namenode, secondary namenode, and jobtracker.
vm-cluster-node2: This is a slave; it should run a datanode and a tasktracker.
vm-cluster-node3: This is a slave; it should run a datanode and a tasktracker.
vm-cluster-node4: This is a slave; it should run a datanode and a tasktracker.
vm-cluster-node5: This is a slave; it should run a datanode and a tasktracker.
vm-cluster-client: This machine plays the role of gateway for the cluster.
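The six machine definitions differ only in name, IP address, and memory size, so they could also be generated from a small table. The following is a minimal sketch of that idea, not the author's configuration: the helper names `cluster_nodes` and `hosts_entries` and the `slave_count` variable are hypothetical, while the hostnames, IPs, and memory values are copied from the file above. Provisioning is left out; the scripts from the original file would still need to be attached to each machine.

```ruby
# Hypothetical helper: builds the node table used by the Vagrantfile above.
# The client keeps its fixed address (.105), so slave_count must stay at 4 or below.
def cluster_nodes(slave_count)
  master = { name: :master, hostname: "vm-cluster-node1",
             ip: "10.211.55.100", memory: 4096 }
  slaves = (1..slave_count).map do |i|
    { name: :"slave#{i}", hostname: "vm-cluster-node#{i + 1}",
      ip: "10.211.55.#{100 + i}", memory: 5120 }
  end
  client = { name: :client, hostname: "vm-cluster-client",
             ip: "10.211.55.105", memory: 4096 }
  [master, *slaves, client]
end

# Hypothetical helper: renders the cluster lines of /etc/hosts from the table,
# instead of repeating them in every provisioning script.
def hosts_entries(nodes)
  nodes.map { |n| "#{n[:ip]} #{n[:hostname]}" }.join("\n")
end

# Only runs when the file is loaded by Vagrant; plain Ruby skips this part.
if defined?(Vagrant)
  slave_count = 4 # lower this value to run fewer slaves
  Vagrant.configure("2") do |config|
    cluster_nodes(slave_count).each do |node|
      config.vm.define node[:name] do |machine|
        machine.vm.box = "precise64"
        machine.vm.provider "vmware_fusion" do |v|
          v.vmx["memsize"] = node[:memory].to_s
        end
        machine.vm.provider :virtualbox do |v|
          v.name = node[:hostname]
          v.customize ["modifyvm", :id, "--memory", node[:memory].to_s]
        end
        machine.vm.network :private_network, ip: node[:ip]
        machine.vm.hostname = node[:hostname]
        # Provisioning (the master, slave, and client scripts) is omitted here.
      end
    end
  end
end
```

With this layout, running fewer slaves means changing one number rather than commenting out whole blocks.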
The Vagrant documentation explains the meaning of the different items in the configuration file. In particular, you can see that the memory size is set differently depending on the provider: VirtualBox uses v.customize with modifyvm, while VMware Fusion uses v.vmx["memsize"]. Observe how simple it is to switch between providers for customizing environment-specific things!
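To see how the per-machine memory settings relate to the 32GB assumption stated earlier, the values can simply be added up. This is a small arithmetic sketch; the constant `MEMORY_MB` and the function `total_memory_mb` are hypothetical names, and the sizes are the ones set in the provider blocks of the configuration file.

```ruby
# Memory sizes in MB, as set in the provider blocks of the Vagrantfile.
MEMORY_MB = { master: 4096, slave: 5120, client: 4096 }.freeze

# Total guest memory for one master, one client, and the given number of slaves.
def total_memory_mb(slave_count)
  MEMORY_MB[:master] + slave_count * MEMORY_MB[:slave] + MEMORY_MB[:client]
end

# Print the budget for every cluster size from zero to four slaves.
(0..4).each do |n|
  total = total_memory_mb(n)
  puts format("%d slave(s): %5d MB (%.1f GB)", n, total, total / 1024.0)
end
```

The full six-machine cluster asks for 28672 MB (28 GB) of guest memory, which leaves a few gigabytes for the host on a 32GB box; with two slaves the total drops to 18432 MB (18 GB).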
This Vagrant file does another very important thing: It installs Cloudera Manager automatically on the master node, vm-cluster-node1.
To create the virtual cluster, open a shell and just go to the directory holding the Vagrant file, i.e. VagrantHadoop. Under that directory, run:
> vagrant up --provider=virtualbox
After a while, depending on how fast your machine is, Vagrant will return control — meaning that all the VMs are up and running.
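At this point you are ready to configure your cluster through CM’s web UI via http://vm-cluster-node1:7180. If your host machine cannot resolve that name, use the master's private address instead, http://10.211.55.100:7180, or add the name to your host's hosts file.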
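The Cloudera Manager server may need some time to start after provisioning finishes, so the web UI might not answer right away. A quick way to check from the host is to test whether the port accepts connections. This is a minimal sketch using Ruby's standard socket library; the helper name `port_open?` and the two-second timeout are assumptions made for illustration.

```ruby
require "socket"

# Hypothetical helper: returns true if a TCP connection to host:port succeeds
# within the timeout, and false if it is refused, times out, or cannot resolve.
def port_open?(host, port, timeout: 2)
  Socket.tcp(host, port, connect_timeout: timeout) { true }
rescue SystemCallError, SocketError, IOError
  false
end

# Example usage once `vagrant up` has returned (address and port from the text above):
#   puts port_open?("10.211.55.100", 7180)
```

An open port only shows that something is listening; the UI itself may still be initializing for a short while after that.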
Have fun!