Hadoop + Hive + Sqoop notes
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- core-site.xml -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://ubuntu:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hduser/yarn/hadoop-2.0.0-cdh4.6.0/tmp</value>
  </property>
</configuration>
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
</configuration>
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- mapred-site.xml -->
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

Setting up a Hadoop cluster

Operating system environment

We have four machines, three of which run SUSE 10:

$ lsb_release -a
LSB Version:    core-2.0-noarch:core-3.0-noarch:core-2.0-x86_64:...
Distributor ID: SUSE LINUX
Description:    SUSE Linux Enterprise Server 10 (x86_64)
Release:        10
Codename:       n/a

The remaining machine runs Ubuntu:

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 12.04.3 LTS
Release:        12.04
Codename:       precise

First, we need to download a few packages: Hadoop, Hive, Sqoop, and a JDK that they all depend on (the environment variables below use jdk1.7.0_55). After downloading, keep everything together in one directory.

Creating a user

On Ubuntu:

$ sudo addgroup hadoop
$ sudo adduser -ingroup hadoop hduser

On SUSE:

$ groupadd hadoop
$ useradd -g hadoop -m -s /bin/bash hduser
$ passwd hduser

Since Cloudera's patched distribution is currently the more stable option, we download all of our packages from there: hadoop-2.0.0-cdh4.6.0.tar.gz, hive-0.10.0-cdh4.6.0.tar.gz, sqoop-1.4.3-cdh4.6.0.tar.gz.
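
The tarballs can be fetched with wget; the URLs below are an assumption about the layout of Cloudera's CDH4 archive and may need adjusting:

$ wget http://archive.cloudera.com/cdh4/cdh/4/hadoop-2.0.0-cdh4.6.0.tar.gz
$ wget http://archive.cloudera.com/cdh4/cdh/4/hive-0.10.0-cdh4.6.0.tar.gz
$ wget http://archive.cloudera.com/cdh4/cdh/4/sqoop-1.4.3-cdh4.6.0.tar.gz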

Log in as the hduser user created above, then create a directory:

$ mkdir -p yarn
$ mv hadoop-2.0.0-cdh4.6.0.tar.gz yarn/
$ mv hive-0.10.0-cdh4.6.0.tar.gz yarn/

$ cd yarn

$ tar zxvf hadoop-2.0.0-cdh4.6.0.tar.gz
$ tar zxvf hive-0.10.0-cdh4.6.0.tar.gz

$ sudo chown -R hduser:hadoop hadoop-2.0.0-cdh4.6.0
$ sudo chown -R hduser:hadoop hive-0.10.0-cdh4.6.0

This gives us a clean environment. Next we set a few environment variables:

Environment variable settings
HADOOP_VERSION=hadoop-2.0.0-cdh4.6.0
export HADOOP_HOME=$HOME/yarn/$HADOOP_VERSION
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH

export HADOOP_MAPRED_HOME=$HOME/yarn/$HADOOP_VERSION
export HADOOP_COMMON_HOME=$HOME/yarn/$HADOOP_VERSION
export HADOOP_HDFS_HOME=$HOME/yarn/$HADOOP_VERSION
export HADOOP_YARN_HOME=$HOME/yarn/$HADOOP_VERSION
export HADOOP_CONF_DIR=$HOME/yarn/$HADOOP_VERSION/etc/hadoop

export JAVA_HOME=$HOME/jdk1.7.0_55
export PATH=$JAVA_HOME/bin:$PATH

export HIVE_HOME=$HOME/yarn/hive-0.10.0-cdh4.6.0
export PATH=$HIVE_HOME/bin:$PATH

export SQOOP_HOME=$HOME/yarn/sqoop-1.4.3-cdh4.6.0
export PATH=$SQOOP_HOME/bin:$PATH

Add these settings to ~/.bashrc and run source ~/.bashrc so they take effect, then run the following on the command line:

$ hadoop

You should see output like the following:

Usage: hadoop [--config confdir] COMMAND
       where COMMAND is one of:
  fs                   run a generic filesystem user client
  version              print the version
  ...

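A quick extra check is to print the version, which confirms that PATH and JAVA_HOME are picked up correctly (the output will name your particular build):

$ hadoop version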

Configuring the machines in the cluster

First, we need to decide on a master and its slaves; typically one master manages several slaves. Our environment has four nodes, but to keep things uniform we only use the three SUSE 10 machines here.

Start by making sure every machine in the cluster has its own unique hostname:

$ cat /etc/hosts
127.0.0.1       localhost

10.144.245.202  ubuntu
10.144.245.203  nassvr
10.144.245.204  suse
10.144.245.205  taurus

We chose nassvr as the master, with suse and taurus as the slaves.

A machine's hostname can be changed with:

$ hostname nassvr

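The hostname command only changes the name until the next reboot; to make it persistent you usually also update the distribution's hostname file (the paths below are assumptions for SUSE 10 and Ubuntu 12.04, so verify them on your systems):

$ sudo sh -c 'echo nassvr > /etc/HOSTNAME'   # SUSE
$ sudo sh -c 'echo ubuntu > /etc/hostname'   # Ubuntu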

Passwordless login

Because the master and the slaves talk to each other over the network through ssh, the machines need to be able to log in to one another without a password. On Linux this is easy to set up:

$ mkdir -p ~/.ssh
$ cd ~/.ssh
$ ssh-keygen -t rsa

If you already have a key pair from before, back it up first. On a fresh server there usually isn't one, so simply accept the defaults. The command generates two files: the private key id_rsa and the matching public key id_rsa.pub.

$ ssh-copy-id -i ~/.ssh/id_rsa.pub hduser@suse
$ ssh-copy-id -i ~/.ssh/id_rsa.pub hduser@taurus

The commands above copy the public key to the remote machines, after which you can log in to them without a password. Repeat the same steps on the remote servers so that the slaves can also log in to the master without a password.
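
A quick way to verify the setup is to run a command on each remote host; it should complete without prompting for a password:

$ ssh hduser@suse hostname
$ ssh hduser@taurus hostname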

Configuration files

Edit these files under $HADOOP_CONF_DIR (core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml); their contents are listed in the other files of this gist.

Note that dfs.replication in hdfs-site.xml should match the actual number of slaves; here it is 2 because suse and taurus are our slaves.

The value of hadoop.tmp.dir in core-site.xml can be anything you like, as long as the directory it points to actually exists.
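
The slave hosts themselves go in the slaves file under $HADOOP_CONF_DIR; the *-daemons.sh scripts used in the next step start daemons on exactly these hosts. For our setup it would contain:

$ cat $HADOOP_CONF_DIR/slaves
suse
taurus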

Starting the daemons on each node

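If this is a brand-new cluster, the NameNode typically needs to be formatted once before the first start (a one-time step; note that it wipes any existing HDFS metadata):

$ hdfs namenode -format

Then start the daemons:
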
$ hadoop-daemon.sh start namenode
$ hadoop-daemons.sh start datanode
$ yarn-daemon.sh start resourcemanager
$ yarn-daemons.sh start nodemanager
$ mr-jobhistory-daemon.sh start historyserver

Note that when starting the datanode and the nodemanager we use *-daemons.sh rather than *-daemon.sh; the plural scripts start those daemons on the hosts listed in the slaves file, so these two processes do not run on the master.

Once everything is up, use jps to check which processes are running:
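
On the master you should see the NameNode, ResourceManager and JobHistoryServer; on each slave, the DataNode and NodeManager (run jps on every machine to check):

$ jps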

Checking the status of all nodes
hdfs dfsadmin -report

You should see something like:

hduser@nassvr:~> hdfs dfsadmin -report
14/05/15 01:52:35 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Configured Capacity: 171787935744 (159.99 GB)
Present Capacity: 140130291712 (130.51 GB)
DFS Remaining: 140031705088 (130.41 GB)
DFS Used: 98586624 (94.02 MB)
DFS Used%: 0.07%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0

-------------------------------------------------
Datanodes available: 2 (2 total, 0 dead)

Live datanodes:
Name: 10.144.245.205:50010 (taurus)
Hostname: taurus
Decommission Status : Normal
Configured Capacity: 85893967872 (79.99 GB)
DFS Used: 49293312 (47.01 MB)
Non DFS Used: 14761175040 (13.75 GB)
DFS Remaining: 71083499520 (66.20 GB)
DFS Used%: 0.06%
DFS Remaining%: 82.76%
Last contact: Thu May 15 01:52:36 CST 2014


Name: 10.144.245.204:50010 (suse)
Hostname: suse
Decommission Status : Normal
Configured Capacity: 85893967872 (79.99 GB)
DFS Used: 49293312 (47.01 MB)
Non DFS Used: 16896468992 (15.74 GB)
DFS Remaining: 68948205568 (64.21 GB)
DFS Used%: 0.06%
DFS Remaining%: 80.27%
Last contact: Thu May 15 01:52:35 CST 2014

Running computations with Hadoop

Use sqoop to import the data:

sqoop import --hive-import --connect jdbc:oracle:thin:@10.144.167.xx:1521:orcl --username SDE --password password --verbose --table F_100WS -m 1

Note that the database username and the table name must be in upper case!
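
Before running heavier queries it is worth confirming that the import actually created the Hive table; a minimal check (the table name assumes the F_100WS import above, which Hive stores in lower case):

$ hive -e 'show tables'
$ hive -e 'select count(*) from f_100ws'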

Once the import has finished, you can run commands like the following in Hive to do the computation:

add jar
  ${env:HOME}/gis/gis-tools-for-hadoop/samples/lib/esri-geometry-api.jar
  ${env:HOME}/gis/gis-tools-for-hadoop/samples/lib/spatial-sdk-hadoop.jar;
create temporary function ST_Point as 'com.esri.hadoop.hive.ST_Point';
create temporary function ST_Contains as 'com.esri.hadoop.hive.ST_Contains';
create temporary function ST_Intersects as 'com.esri.hadoop.hive.ST_Intersects';
CREATE EXTERNAL TABLE IF NOT EXISTS roads (Name string, Layer string, Shape binary)
ROW FORMAT SERDE 'com.esri.hadoop.hive.serde.JsonSerde'
STORED AS INPUTFORMAT 'com.esri.json.hadoop.EnclosedJsonInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '${env:HOME}/sampledata'; 

Upload roadtest.json (Esri JSON format) to /home/hduser/roadtest.json in HDFS:

hdfs dfs -copyFromLocal roadtest.json /home/hduser/roadtest.json
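
To confirm that the file landed in HDFS (same path as above):

hdfs dfs -ls /home/hduser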

Then count the records whose points intersect the roads:

select count(*) from f_100w
join roads
where st_intersects(st_point(f_100w.longitude, f_100w.latitude), roads.shape)=true;

<?xml version="1.0"?>
<!-- yarn-site.xml -->
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce.shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>ubuntu:8025</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>ubuntu:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>ubuntu:8040</value>
  </property>
</configuration>