By now, you should have the following prerequisites installed and configured on your machine:
- Ubuntu 12.04
- Java (Oracle JDK 6; ensure $JAVA_HOME is set)
- Git
- Apache Maven
After successful installation, you should be set up to run:
- Hadoop (local mode + pseudo-distributed)
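Before proceeding, a quick sanity check can confirm that each tool is on your PATH (a minimal sketch; it only reports presence, not specific versions):

```shell
# Report whether each prerequisite command is available on the PATH
for cmd in java git mvn; do
  if command -v "$cmd" >/dev/null 2>&1; then
    echo "$cmd: found"
  else
    echo "$cmd: MISSING"
  fi
done
# $JAVA_HOME should also point at your JDK install
[ -n "$JAVA_HOME" ] && echo "JAVA_HOME=$JAVA_HOME" || echo "JAVA_HOME is not set"
```

If any command reports MISSING, revisit the corresponding prerequisite before continuing.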
Copy the shell commands below into a file named ~/hadoop-install.sh
(note that for this project, we will be using Cloudera's CDH3 distribution of Hadoop for the sake of consistency):
# Download the Cloudera CDH3 Debian package descriptor
wget http://archive.cloudera.com/one-click-install/maverick/cdh3-repository_1.0_all.deb
dpkg -i ./cdh3-repository_1.0_all.deb
# Add the Squeeze main source to the apt source listing
touch /etc/apt/sources.list
echo 'deb http://ftp.us.debian.org/debian squeeze main' >> /etc/apt/sources.list
# Add the CDH3 sources to the apt source listing
touch /etc/apt/sources.list.d/cloudera.list
echo 'deb http://archive.cloudera.com/debian maverick-cdh3 contrib' >> /etc/apt/sources.list.d/cloudera.list
echo 'deb-src http://archive.cloudera.com/debian maverick-cdh3 contrib' >> /etc/apt/sources.list.d/cloudera.list
# Update the apt caches
apt-get -y --force-yes update
# Install libzip1
apt-get -y --force-yes install libzip1
# Install Hadoop Core/Native
apt-get -y --force-yes install hadoop-0.20 hadoop-0.20-native
# Install Hadoop daemons
apt-get -y --force-yes install hadoop-0.20-namenode hadoop-0.20-datanode hadoop-0.20-jobtracker hadoop-0.20-tasktracker
Once copied, run the following command to make the shell script executable (no sudo is needed here, since the file is in your own home directory):
chmod 775 ~/hadoop-install.sh
Next, execute the shell script as root to install Hadoop (sudo will prompt you for your password):
sudo ~/hadoop-install.sh
You will then see a flurry of activity as packages are downloaded and installed and dependencies are resolved. This is normal :)
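Once the script finishes, a quick way to check that the packages installed cleanly is to look for the hadoop client on your PATH (a minimal check; if the command is missing, re-read the apt output for errors):

```shell
# Post-install check: the hadoop client command should now be on the PATH
if command -v hadoop >/dev/null 2>&1; then
  hadoop version | head -n 1
else
  echo "hadoop: not found -- re-check the install script output"
fi
```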
Once you have run the above commands to install Hadoop, you can verify that the installation was successful by running one of the example MapReduce jobs bundled with the Hadoop installation.
For testing your installation, we will create sample input files, and then run a MapReduce job that performs a distributed form of the Unix grep
command. Copy the following commands into your terminal:
# Testing Hadoop installation
mkdir input
printf 'hello\nworld\n' > input/file.txt
printf 'hi\nworld\n' > input/file2.txt
hadoop jar /usr/lib/hadoop-0.20/hadoop-*-examples.jar grep input output '^h'
If the installation is correct and the job succeeds, you should find a file named output/part-00000
in the current directory with the contents:
2 h
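To see why the expected output is 2 and h, note that the MapReduce grep example counts occurrences of the regex match itself (the literal "h" at the start of a line), not whole matching lines. The same tally can be reproduced with plain Unix tools:

```shell
# Mirror the MapReduce grep job with plain Unix grep:
# -o prints each match of '^h' on its own line, -h suppresses filenames,
# then sort | uniq -c tallies the distinct matches
mkdir -p input
printf 'hello\nworld\n' > input/file.txt
printf 'hi\nworld\n' > input/file2.txt
grep -oh '^h' input/*.txt | sort | uniq -c
# prints a count of 2 for the match "h"
```

(The MapReduce job separates the count and the match with a tab rather than spaces, but the tally is the same.)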