By now, you should have the following prerequisites installed and configured on your machine:
- Ubuntu 12.04
- Java (Oracle JDK 6; ensure $JAVA_HOME is set)
- Git
- Apache Maven
After successful installation, you should be set up to run:
- Hadoop (local mode + pseudo-distributed)
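Before proceeding, a quick sanity check can confirm that each tool is on your PATH (a minimal sketch; it only reports presence, not specific versions):

```shell
# Report whether each prerequisite command is available on the PATH
for cmd in java git mvn; do
  if command -v "$cmd" >/dev/null 2>&1; then
    echo "$cmd: found"
  else
    echo "$cmd: MISSING"
  fi
done
# $JAVA_HOME should also point at your JDK install
[ -n "$JAVA_HOME" ] && echo "JAVA_HOME=$JAVA_HOME" || echo "JAVA_HOME is not set"
```

If any command reports MISSING, revisit the corresponding prerequisite before continuing.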
Copy the shell commands below into a file named ~/hadoop-install.sh
(note that for this project, we will be using Cloudera's CDH3 distribution of Hadoop for the sake of consistency):
# Download the Cloudera CDH3 Debian package descriptor
wget http://archive.cloudera.com/one-click-install/maverick/cdh3-repository_1.0_all.deb
dpkg -i ./cdh3-repository_1.0_all.deb
# Add the Squeeze main source to the apt source listing
touch /etc/apt/sources.list
echo 'deb http://ftp.us.debian.org/debian squeeze main' >> /etc/apt/sources.list
# Add the CDH3 sources to the apt source listing
touch /etc/apt/sources.list.d/cloudera.list
echo 'deb http://archive.cloudera.com/debian maverick-cdh3 contrib' >> /etc/apt/sources.list.d/cloudera.list
echo 'deb-src http://archive.cloudera.com/debian maverick-cdh3 contrib' >> /etc/apt/sources.list.d/cloudera.list
# Update the apt caches
apt-get -y --force-yes update
# Install libzip1
apt-get -y --force-yes install libzip1
# Install Hadoop Core/Native
apt-get -y --force-yes install hadoop-0.20 hadoop-0.20-native
# Install Hadoop daemons
apt-get -y --force-yes install hadoop-0.20-namenode hadoop-0.20-datanode hadoop-0.20-jobtracker hadoop-0.20-tasktracker
Once copied, run the following command to make the shell script executable (no sudo is needed here, since the file is in your own home directory):
chmod 775 ~/hadoop-install.sh
Next, execute the shell script as root to install Hadoop (sudo will prompt you for your password):
sudo ~/hadoop-install.sh
You will then see a flurry of activity as packages are downloaded and installed and dependencies are resolved. This is normal :)
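Once the script finishes, a quick way to check that the packages installed cleanly is to look for the hadoop client on your PATH (a minimal check; if the command is missing, re-read the apt output for errors):

```shell
# Post-install check: the hadoop client command should now be on the PATH
if command -v hadoop >/dev/null 2>&1; then
  hadoop version | head -n 1
else
  echo "hadoop: not found -- re-check the install script output"
fi
```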
Once you have run the above commands to install Hadoop, you can verify that the installation was successful by running one of the example MapReduce jobs bundled with the Hadoop installation.
For testing your installation, we will create sample input files, and then run a MapReduce job that performs a distributed form of the Unix grep
command. Copy the following commands into your terminal:
# Testing Hadoop installation
mkdir input
printf 'hello\nworld\n' > input/file.txt
printf 'hi\nworld\n' > input/file2.txt
hadoop jar /usr/lib/hadoop-0.20/hadoop-*-examples.jar grep input output '^h'
If the installation is correct and the job succeeds, you should find a file named output/part-00000
in the current directory with the contents:
2 h
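To see why the expected output is 2 and h, note that the MapReduce grep example counts occurrences of the regex match itself (the literal "h" at the start of a line), not whole matching lines. The same tally can be reproduced with plain Unix tools:

```shell
# Mirror the MapReduce grep job with plain Unix grep:
# -o prints each match of '^h' on its own line, -h suppresses filenames,
# then sort | uniq -c tallies the distinct matches
mkdir -p input
printf 'hello\nworld\n' > input/file.txt
printf 'hi\nworld\n' > input/file2.txt
grep -oh '^h' input/*.txt | sort | uniq -c
# prints a count of 2 for the match "h"
```

(The MapReduce job separates the count and the match with a tab rather than spaces, but the tally is the same.)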