gree2/setup

## setup
##########
# For verification, you can display the OS release.
##########
$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=11.10
DISTRIB_CODENAME=oneiric
DISTRIB_DESCRIPTION="Ubuntu 11.10"

##########
# Download all of the packages you'll need. Hopefully,
# you have a fast download connection.
##########

$ sudo apt-get update
$ sudo apt-get upgrade
$ sudo apt-get install curl
$ sudo apt-get install git
$ sudo apt-get install maven2
$ sudo apt-get install openssh-server openssh-client
$ sudo apt-get install openjdk-7-jdk

##########
# Switch to the new Java. On my system, it was
# the third option (marked '2' naturally)
##########

$ sudo update-alternatives --config java

##########
# Set the JAVA_HOME variable. I took the
# time to update my .bashrc script.
##########

$ export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386

##########
# Now we can download Cloudera's version of Hadoop. The
# first step is adding the repository. Note that oneiric
# is not explicitly supported as of 2011-Dec-20. So I am
# using the 'maverick' repository.
##########

# Create a repository list file. Add the two indented lines
# to the new file.

$ sudo vi /etc/apt/sources.list.d/cloudera.list
  deb http://archive.cloudera.com/debian maverick-cdh3 contrib
  deb-src http://archive.cloudera.com/debian maverick-cdh3 contrib
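
# If you would rather not open an editor, the same two lines can
# be written non-interactively. This is only a sketch: the function
# name cloudera_list_lines is made up for illustration, and the
# commented usage line assumes you want the same file path as above.

```shell
# Print the two repository lines used in this walkthrough.
cloudera_list_lines() {
  cat <<'EOF'
deb http://archive.cloudera.com/debian maverick-cdh3 contrib
deb-src http://archive.cloudera.com/debian maverick-cdh3 contrib
EOF
}

# Usage (needs root, so it is left commented out here):
# cloudera_list_lines | sudo tee /etc/apt/sources.list.d/cloudera.list > /dev/null
```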

# Add public key

$ curl -s http://archive.cloudera.com/debian/archive.key | sudo apt-key add -
$ sudo apt-get update

# Install all of the Hadoop components.

$ sudo apt-get install hadoop-0.20
$ sudo apt-get install hadoop-0.20-namenode
$ sudo apt-get install hadoop-0.20-datanode
$ sudo apt-get install hadoop-0.20-secondarynamenode
$ sudo apt-get install hadoop-0.20-jobtracker
$ sudo apt-get install hadoop-0.20-tasktracker

# Set some environment variables. I added these to my
# .bashrc file.

$ export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386
$ export HADOOP_HOME=/usr/lib/hadoop-0.20
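
# One way to get the exports into .bashrc without adding them twice
# on a second run. A minimal sketch: append_once is a hypothetical
# helper name, not something shipped with Ubuntu or Hadoop.

```shell
# append_once FILE LINE: add LINE to FILE unless an identical line is already there.
append_once() {
  file="$1"
  line="$2"
  touch "$file"
  grep -qxF -e "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Usage:
# append_once ~/.bashrc 'export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386'
# append_once ~/.bashrc 'export HADOOP_HOME=/usr/lib/hadoop-0.20'
```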

$ cd $HADOOP_HOME/conf

# Create the hadoop temp directory. It should not
# be under /tmp because the contents of /tmp are
# cleared on each system restart, and restarts
# happen a lot with virtual machines.
$ sudo mkdir /hadoop_tmp_dir
$ sudo chmod 777 /hadoop_tmp_dir

# Replace the existing file with the indented lines.
$ sudo vi core-site.xml
    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
      <property>
        <name>hadoop.tmp.dir</name>
        <value>/hadoop_tmp_dir</value>
      </property>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9000</value>
      </property>
    </configuration>

##########
# Notice that the dfs secondary http address is not
# the default in the XML below. I don't know what
# process was using the default, but I needed to
# change it to avoid the 'port already in use' message.
##########

# Replace the existing file with the indented lines.
$ sudo vi hdfs-site.xml
    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
        <property>
          <name>dfs.secondary.http.address</name>
          <value>0.0.0.0:50090</value>
        </property>
        <property>
          <name>dfs.replication</name>
          <value>1</value>
        </property>
        <property>
          <name>dfs.datanode.max.xcievers</name>
          <value>4096</value>
        </property>
    </configuration>

# Replace the existing file with the indented lines.
$ sudo vi mapred-site.xml
    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
      <property>
        <name>mapred.job.tracker</name>
        <value>localhost:9001</value>
      </property>
    </configuration>

# format the hadoop filesystem
$ hadoop namenode -format


##########
# Time to set up password-less ssh to localhost
##########
$ cd ~
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

# If you want to test that the ssh works, do this. Then exit.
$ ssh localhost
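
# If ssh still asks for a password, file permissions are a common
# cause: with the default StrictModes setting, sshd ignores keys
# when ~/.ssh or authorized_keys is writable by others. A small
# sketch follows; fix_ssh_perms is a name invented for this note.

```shell
# Tighten permissions on an ssh directory (defaults to ~/.ssh).
fix_ssh_perms() {
  ssh_dir="${1:-$HOME/.ssh}"
  chmod 700 "$ssh_dir" || return 1
  chmod 600 "$ssh_dir/authorized_keys"
}

# Usage, followed by a non-interactive login check that fails
# instead of prompting for a password:
# fix_ssh_perms
# ssh -o BatchMode=yes localhost true && echo "password-less ssh OK"
```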

#######
#######
#######
#######
# REPEAT FOR EACH RESTART
#
# Since we are working inside a virtual machine, I found that
# some settings did not survive a shutdown or reboot. From this
# point on, repeat these commands for each instance startup.
#######

# hadoop was installed as root. Therefore we need to
# change the ownership so that your username can
# write. IF YOU ARE NOT USING 'ubuntu', CHANGE THE
# COMMAND ACCORDINGLY.

$ sudo chown -R ubuntu:ubuntu /usr/lib/hadoop-0.20
$ sudo chown -R ubuntu:ubuntu /var/run/hadoop-0.20
$ sudo chown -R ubuntu:ubuntu /var/log/hadoop-0.20

# Start hadoop. I remove the logs so that I can find errors
# faster when I iterate through configuration settings.

$ cd $HADOOP_HOME
$ rm -rf logs/*
$ bin/start-all.sh
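
# Since these steps are repeated on every restart, they can be
# collected into one function. This is a sketch under assumptions:
# restart_hadoop, run, DRY_RUN and HADOOP_USER are names made up
# here, and start-all.sh is called by its full path rather than
# after a cd. Set DRY_RUN=1 to print the commands instead of
# running them.

```shell
HADOOP_USER="${HADOOP_USER:-ubuntu}"
HADOOP_HOME="${HADOOP_HOME:-/usr/lib/hadoop-0.20}"

# Run a command, or just print it when DRY_RUN is set.
run() {
  if [ -n "$DRY_RUN" ]; then
    echo "$*"
  else
    "$@"
  fi
}

restart_hadoop() {
  # Hand the install, run and log directories back to the login user.
  for dir in /usr/lib/hadoop-0.20 /var/run/hadoop-0.20 /var/log/hadoop-0.20; do
    run sudo chown -R "$HADOOP_USER:$HADOOP_USER" "$dir"
  done
  # Clear old logs so new errors are easy to spot, then start all daemons.
  run rm -rf "$HADOOP_HOME"/logs/*
  run "$HADOOP_HOME/bin/start-all.sh"
}

# Usage:
# restart_hadoop
```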

======================================
HBASE
======================================

For the sake of sanity:
$ sudo service hadoop-zookeeper-server stop
STOP ALL PROCESSES (Hadoop)
Then...

$ sudo apt-get install hadoop-hbase

$ echo "hdfs  -       nofile  32768" | sudo tee -a /etc/security/limits.conf
$ echo "hbase  -       nofile  32768" | sudo tee -a /etc/security/limits.conf
$ echo "session required  pam_limits.so" | sudo tee -a /etc/pam.d/common-session

Double check that $HADOOP_HOME/conf/hdfs-site.xml has the dfs.datanode.max.xcievers property set to 4096.
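
A quick way to read a property back out of one of the *-site.xml files is sketched below. The helper name get_prop is hypothetical, and the awk logic assumes the <name> and <value> elements sit on separate lines, as they do in the files written earlier in these notes. It is not a general XML parser.

```shell
# get_prop FILE NAME: print the <value> that follows <name>NAME</name>.
get_prop() {
  file="$1"
  name="$2"
  awk -v n="$name" '
    index($0, "<name>" n "</name>") { found = 1; next }
    found && /<value>/ {
      sub(/.*<value>/, "")
      sub(/<\/value>.*/, "")
      print
      exit
    }
  ' "$file"
}

# Usage:
# get_prop "$HADOOP_HOME/conf/hdfs-site.xml" dfs.datanode.max.xcievers
```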

$ sudo apt-get install hadoop-hbase-master

- Edit $HBASE_HOME/conf/hbase-env.sh so that JAVA_HOME is set. For some reason, the HBase Master only looks at that file for JAVA_HOME when it is installed. Setting JAVA_HOME in /etc/environment or .bashrc is not sufficient.
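
The hbase-env.sh edit can be scripted along these lines. This is a sketch only: ensure_java_home is a made-up helper name, and it simply appends an export line when no uncommented one exists. It needs write access to the file, so run it from a root shell if the file is owned by root.

```shell
# ensure_java_home FILE PATH: append an export line unless one is already active.
ensure_java_home() {
  env_file="$1"
  java_home="$2"
  if grep -q '^[[:space:]]*export JAVA_HOME=' "$env_file"; then
    echo "JAVA_HOME already set in $env_file"
  else
    printf 'export JAVA_HOME=%s\n' "$java_home" >> "$env_file"
  fi
}

# Usage:
# ensure_java_home "$HBASE_HOME/conf/hbase-env.sh" /usr/lib/jvm/java-7-openjdk-i386
```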

- Edit /etc/hosts so that the machine's hostname ('ubuntu' in my case) points to 127.0.0.1. Otherwise HMaster will be unable to connect properly. This may only be a problem on the VM and not when Ubuntu is the native OS.

$ sudo /etc/init.d/hadoop-hbase-master start

$ hbase shell

Test out the hbase shell and make sure you can create 'test', 'cf'
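
A possible smoke test is sketched below. The function name smoke_commands is invented for this note; the commands it prints are standard HBase shell commands, and the row, column and value names are placeholders. The table is dropped again at the end.

```shell
# Print a short HBase shell session: create a table, write one cell,
# read it back, then remove the table.
smoke_commands() {
  cat <<'EOF'
create 'test', 'cf'
put 'test', 'row1', 'cf:a', 'value1'
scan 'test'
disable 'test'
drop 'test'
EOF
}

# Usage (requires a running HBase, so it is left commented out):
# smoke_commands | hbase shell
```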

STOP ALL PROCESSES NOW... going into pseudo-distributed mode and tying in with Hadoop

Edit /etc/hbase/conf/hbase-site.xml and add:

<property>
  <name>hbase.cluster.distributed</name>
  <value>true</value>
</property>
<property>
  <name>hbase.rootdir</name>
  <value>hdfs://localhost:9000/hbase</value>
</property>

Now start up HDFS again and create the /hbase directory (with no security constraints [aka fs instead of dfs])
$ hadoop fs -mkdir /hbase
$ hadoop fs -chown hbase /hbase

=========================================
ZOOKEEPER
=========================================
$ sudo apt-get install hadoop-zookeeper-server
$ sudo /etc/init.d/hadoop-zookeeper-server start

Back to HBase...

Ensure HDFS and ZooKeeper are running. Then...

$ sudo /etc/init.d/hadoop-hbase-master start

$ sudo apt-get install hadoop-hbase-regionserver

$ sudo /etc/init.d/hadoop-hbase-regionserver start

Navigate to http://localhost:60010 to ensure the region server is working with the master.

You should now be able to manage HBase tables in the HBase shell and see the results in HDFS.

================================================
SQOOP
================================================
Install Sqoop through Cloudera:
$ sudo apt-get install sqoop

For configuration and installation (JDBC Driver / SQL Server Connector):
After installing and configuring Sqoop, verify the following environment variables are set on the machine with Sqoop installation, as described in the following table. These must be set for SQL Server-Hadoop Connector to work correctly.

Environment Variable    Value to Assign
SQOOP_HOME              Absolute path to the Sqoop installation directory
SQOOP_CONF_DIR          $SQOOP_HOME/conf
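
As a sketch, the two variables can be set like this. The default of /usr/lib/sqoop is an assumption about where the Cloudera package puts Sqoop; check your own install and adjust it.

```shell
# Keep an existing SQOOP_HOME if one is set; otherwise fall back to the
# assumed package location.
export SQOOP_HOME="${SQOOP_HOME:-/usr/lib/sqoop}"
export SQOOP_CONF_DIR="$SQOOP_HOME/conf"

# Show what ended up in the environment.
echo "SQOOP_HOME=$SQOOP_HOME"
echo "SQOOP_CONF_DIR=$SQOOP_CONF_DIR"
```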

Download and Install the Microsoft JDBC Driver

Sqoop and the SQL Server-Hadoop Connector use JDBC to establish connections to remote RDBMS servers and therefore need the JDBC driver for SQL Server. To install this driver on the Linux node where Sqoop is already installed:

1. Visit http://www.microsoft.com/download/en/details.aspx?displaylang=en&id=21599 and download “sqljdbc_<version>_enu.tar.gz”.
2. Copy it to the machine where Sqoop is installed.
3. Unpack the tar file with the command: tar -zxvf sqljdbc_<version>_enu.tar.gz. This will create a directory “sqljdbc_3.0” in the current directory.
4. Copy the driver jar (sqljdbc_3.0/enu/sqljdbc4.jar) to the $SQOOP_HOME/lib directory on the machine where Sqoop is installed.
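
The unpack and copy steps can be sketched as a small function. install_jdbc_driver is a hypothetical name, and unpacking into a throwaway temp directory rather than the current directory is a choice made here, not part of the original instructions. The jar path inside the archive is the one given above.

```shell
# install_jdbc_driver ARCHIVE LIB_DIR: unpack the driver tarball and copy
# sqljdbc4.jar into LIB_DIR.
install_jdbc_driver() {
  archive="$1"
  lib_dir="$2"
  work_dir="$(mktemp -d)" || return 1
  tar -zxf "$archive" -C "$work_dir" || return 1
  cp "$work_dir/sqljdbc_3.0/enu/sqljdbc4.jar" "$lib_dir/" || return 1
  rm -rf "$work_dir"
}

# Usage (sudo may be needed if $SQOOP_HOME/lib is owned by root):
# install_jdbc_driver sqljdbc_<version>_enu.tar.gz "$SQOOP_HOME/lib"
```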


Download and Install SQL Server-Hadoop Connector

After all of the previous steps have completed, you are ready to download, install and configure the SQL Server-Hadoop Connector on the machine with Sqoop installation. The SQL Server–Hadoop connector is distributed as a compressed tar archive named sqoop-sqlserver-1.0.tar.gz. Download the tar archive from http://download.microsoft.com, and save the archive on the same machine where Sqoop is installed.

This archive is composed of the following files and directories:

File/Directory = Description

install.sh = Is a shell script that installs the SQL Server - Hadoop Connector files into the Sqoop directory structure

Microsoft SQL Server - Hadoop Connector User Guide.pdf = Contains instructions to deploy and execute SQL Server – Hadoop Connector.

lib/ = Contains the sqoop-sqlserver-1.0.jar file

conf/ = Contains the configuration files for SQL Server – Hadoop Connector.

THIRDPARTYNOTICES FOR HADOOP-BASED CONNECTORS.txt = Contains the third party notices.

SQL Server Connector for Apache Hadoop MSLT.pdf = EULA for the SQL Server Connector for Apache Hadoop.

To install SQL Server – Hadoop Connector:

1. Login to the machine where Sqoop is installed as a user who has permission to install files

2. Extract the archive with the command: “tar -zxvf sqoop-sqlserver-1.0.tar.gz”. This will create a “sqoop-sqlserver-1.0” directory in the current directory

3. Change directory (cd) to “sqoop-sqlserver-1.0”

4. Ensure that the MSSQL_CONNECTOR_HOME environment variable is set to the absolute path of the sqoop-sqlserver-1.0 directory.

5. Run the shell script install.sh with no additional arguments.

6. Installer will copy the connector jar and configuration file under existing Sqoop installation
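
Steps 2 through 5 can be sketched as one function. install_connector is a made-up name, and extracting next to the archive (instead of into the current directory) is a choice made for this sketch. SQOOP_HOME should already be exported before the installer runs, since the variables above must be set for the connector to work.

```shell
# install_connector ARCHIVE: extract the connector, export
# MSSQL_CONNECTOR_HOME, and run its install.sh with no arguments.
install_connector() {
  archive="$1"
  work_dir="$(dirname "$archive")"
  tar -zxf "$archive" -C "$work_dir" || return 1
  # Resolve the extracted directory to an absolute path.
  MSSQL_CONNECTOR_HOME="$(cd "$work_dir/sqoop-sqlserver-1.0" && pwd)" || return 1
  export MSSQL_CONNECTOR_HOME
  (cd "$MSSQL_CONNECTOR_HOME" && sh ./install.sh)
}

# Usage:
# install_connector /path/to/sqoop-sqlserver-1.0.tar.gz
```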

Example SQL Server Sqoop import statement:
$ bin/sqoop import \
    --connect 'jdbc:sqlserver://<ip-address>;instanceName=<instance-name>;username=<user-name>;password=<password>;database=<database-name>' \
    --query 'SELECT * FROM [Database].[prefix].[table-name] WHERE $CONDITIONS' \
    --split-by <column-to-split-by> \
    --target-dir <hdfs-target-directory>

For importing into HBase:

$ bin/sqoop import \
    --connect 'jdbc:sqlserver://<ip-address>;instanceName=SQLExpress;username=<username>;password=<password>;database=<database>' \
    --query 'SELECT * FROM [database].[prefix].[table] WHERE $CONDITIONS' \
    --split-by <primary-key> \
    --hbase-table <hbase-table> \
    --column-family <column-family>

* Note that the table must be created with a column family in HBase before executing the above command.


For configuration with importing from Oracle:

- Download ojdbc6.jar and place in $SQOOP_HOME/lib
- Connection string format: sqoop --connect jdbc:oracle:thin:@//<address>:<port>/<instance-name>

  (all other options such as --query apply)
  - Another example: $ sqoop import --connect jdbc:oracle:thin:@//dbserver:1521/masterdb --username myuser --table DBFUNC1.R1_EVOLUTION --where 'rownum=1' --verbose -P

======================================================
JSON INTEGRATION
======================================================

JSON can be sent through the REST interface (Stargate). However, row keys, column names, and cell values in the JSON are base64-encoded, so encode them before a PUT and decode them after a GET.

See Gist: https://gist.github.com/2284007 for examples

* Remember to start the REST server first: $ sudo hbase rest start
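
The base64 handling can be sketched with the standard base64 tool. The helper names b64enc, b64dec and cell_json are invented for this note. The JSON layout (Row, key, Cell, column, "$") is my understanding of what Stargate expects for a single cell, so treat it as an assumption and compare it against the gist before relying on it.

```shell
# Encode and decode short strings. printf avoids the trailing newline
# that echo would add before encoding.
b64enc() { printf '%s' "$1" | base64; }
b64dec() { printf '%s' "$1" | base64 --decode; }

# cell_json ROW COLUMN VALUE: build a one-cell JSON body with every
# field base64-encoded (assumed Stargate layout).
cell_json() {
  printf '{"Row":[{"key":"%s","Cell":[{"column":"%s","$":"%s"}]}]}\n' \
    "$(b64enc "$1")" "$(b64enc "$2")" "$(b64enc "$3")"
}

# Usage:
# cell_json row1 cf:a value1
# b64dec dmFsdWUx
```

Long values may be wrapped across lines by some base64 implementations, so this sketch is only suitable for short keys and values.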

======================================================
LILY
======================================================
- Install SOLR
- Install Lily cluster