Building a Fully Distributed Hadoop Cluster with Cloudera's Hadoop (CDH3)

Cluster layout

Built on VirtualBox

1 namenode
3 datanodes
  • Host-only network (see the interface sketch after this list)

    • 192.168.10.x
  • Internal network

    • 192.168.20.x
  • IP (last octet): hostname

    • 10: master
    • 11: slave1
    • 12: slave2
    • 13: slave3
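
A rough sketch of the matching Debian network configuration, shown for the master (last octet 10). The interface names (eth0 = host-only, eth1 = internal) are assumptions; adjust them to your VirtualBox adapter order:

$ cat /etc/network/interfaces

auto lo
iface lo inet loopback

# host-only network (reach the VM from the host)
auto eth0
iface eth0 inet static
    address 192.168.10.10
    netmask 255.255.255.0

# internal network (cluster traffic)
auto eth1
iface eth1 inet static
    address 192.168.20.10
    netmask 255.255.255.0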

Public keys

Generate a key pair with ssh-keygen -t rsa (no passphrase) and copy it to each slave, so that all nodes can SSH to one another without a password.
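
A minimal sketch of generating and distributing the key; run it on every node so the connections work in both directions (assumes the same login user exists on every machine and that ssh-copy-id is installed):

$ ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
$ for h in master slave1 slave2 slave3; do ssh-copy-id $h; done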

Setup steps

Common (all nodes)

$ vim /etc/apt/sources.list
    deb http://ftp.riken.jp/Linux/debian/debian/ lenny main non-free
    deb-src http://ftp.riken.jp/Linux/debian/debian/ lenny main non-free
$ aptitude update
$ aptitude install sun-java6-jdk
$ update-alternatives --set java /usr/lib/jvm/java-6-sun/jre/bin/java
$ java -version
    java version "1.6.0_22"
    Java(TM) SE Runtime Environment (build 1.6.0_22-b04)
    Java HotSpot(TM) Client VM (build 17.1-b03, mixed mode, sharing)
$ vim /etc/apt/sources.list.d/cloudera.list
    deb http://archive.cloudera.com/debian lenny-cdh3 contrib
    deb-src http://archive.cloudera.com/debian lenny-cdh3 contrib
$ aptitude install curl rsync sudo
$ curl -s http://archive.cloudera.com/debian/archive.key | sudo apt-key add -
$ aptitude update

Master

$ aptitude install hadoop hadoop-0.20-namenode hadoop-0.20-secondarynamenode hadoop-0.20-jobtracker

Slaves

$ aptitude install hadoop hadoop-0.20-datanode hadoop-0.20-tasktracker

Configuration

Template

$ cp -r /etc/hadoop-0.20/conf.empty /etc/hadoop-0.20/conf.cluster
$ update-alternatives --install /etc/hadoop-0.20/conf hadoop-0.20-conf /etc/hadoop-0.20/conf.cluster 50
$ update-alternatives --display hadoop-0.20-conf
    hadoop-0.20-conf - status is auto.
     link currently points to /etc/hadoop-0.20/conf.cluster
    /etc/hadoop-0.20/conf.empty - priority 10
    /etc/hadoop-0.20/conf.cluster - priority 50
    Current `best' version is /etc/hadoop-0.20/conf.cluster.

Configuration files

$ cat /etc/hadoop/conf.cluster/core-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:54310</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/var/lib/hadoop-0.20/cache/${user.name}</value>
  </property> 
</configuration>

$ cat /etc/hadoop/conf.cluster/hdfs-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
</configuration>

$ cat /etc/hadoop/conf.cluster/mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>master:54311</value>
  </property>
  <property>
    <name>mapred.hosts</name>
    <value>${hadoop.tmp.dir}/hosts.include</value>
  </property>
  <property>
    <name>mapred.hosts.exclude</name>
    <value>${hadoop.tmp.dir}/hosts.exclude</value>
  </property> 
</configuration>
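
Note that mapred.hosts and mapred.hosts.exclude point at files the steps above never create, and that hadoop.tmp.dir contains ${user.name}, so the path resolves against the daemon user. A sketch of creating them, assuming the jobtracker runs as user mapred so that ${hadoop.tmp.dir} expands to /var/lib/hadoop-0.20/cache/mapred (adjust to your actual daemon user):

$ mkdir -p /var/lib/hadoop-0.20/cache/mapred
$ printf 'slave1\nslave2\nslave3\n' > /var/lib/hadoop-0.20/cache/mapred/hosts.include
$ touch /var/lib/hadoop-0.20/cache/mapred/hosts.exclude
$ chown -R mapred /var/lib/hadoop-0.20/cache/mapred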

$ cat /etc/hadoop/conf.cluster/masters

master

$ cat /etc/hadoop/conf.cluster/slaves

slave1
slave2
slave3

$ cat /etc/hosts

192.168.20.10   master master.localdomain
192.168.20.11   slave1 slave1.localdomain
192.168.20.12   slave2 slave2.localdomain
192.168.20.13   slave3 slave3.localdomain
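
A quick sanity check, worth running on every node, that each name resolves to the intended internal address:

$ for h in master slave1 slave2 slave3; do getent hosts $h; done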

Format HDFS on the namenode

$ sudo -u hdfs hadoop namenode -format

Startup

Master

$ /etc/init.d/hadoop-0.20-namenode start
$ /etc/init.d/hadoop-0.20-secondarynamenode start
$ /etc/init.d/hadoop-0.20-jobtracker start

Slaves

$ /etc/init.d/hadoop-0.20-datanode start
$ /etc/init.d/hadoop-0.20-tasktracker start
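
Once the daemons are up, a few quick checks (jps ships with the Sun JDK installed earlier; 50070 and 50030 are the default namenode and jobtracker web UI ports in Hadoop 0.20):

$ /usr/lib/jvm/java-6-sun/bin/jps
$ sudo -u hdfs hadoop dfsadmin -report

jps should list NameNode, SecondaryNameNode, and JobTracker on the master and DataNode and TaskTracker on the slaves; dfsadmin -report should show three live datanodes, and the web UIs at http://master:50070/ and http://master:50030/ should show the same.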

Scripts

The files under /etc/hadoop/conf.cluster are identical on the master and the slaves, so they can be copied straight from the master with rsync. A couple of small helper scripts make pushing them to every server easy:

$ cat ./node.sh

#!/bin/sh
NAMENODE="192.168.20.10"
DATANODE="192.168.20.11 192.168.20.12 192.168.20.13"
NODE="$NAMENODE $DATANODE"

$ cat ./send_all.sh

#!/bin/sh
. ~/node.sh

for n in $NODE;
do
    CMD="ssh $n $*"
    echo "== $n =="
    $CMD;
done

$ cat ./sync.sh

#!/bin/sh
. ~/node.sh

for n in $NODE;
do
    # skip the local node: plain `hostname` returns "master" and would never
    # match the IPs in $NODE; with the /etc/hosts above, `hostname -i`
    # resolves to this machine's 192.168.20.x address
    if [ "`hostname -i`" != "$n" ]; then
        CMD="sudo rsync --progress -av /etc/hadoop/conf.cluster/ $n:/etc/hadoop/conf.cluster"
        echo "== $n =="
        echo $CMD
        $CMD
    fi
done
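
Hypothetical usage, run on the master:

$ ./send_all.sh hostname   # run a command on every node in turn
$ ./sync.sh                # push /etc/hadoop/conf.cluster to the other nodes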