
Trying MapReduce with Hadoop

Running the sample

Calculating pi

$ sudo -u mapred hadoop jar /usr/lib/hadoop/hadoop-examples.jar pi 1 1000

Number of Maps  = 1
Samples per Map = 1000
Wrote input for Map #0
Starting Job
11/04/28 00:13:19 INFO mapred.FileInputFormat: Total input paths to process : 1
11/04/28 00:13:20 INFO mapred.JobClient: Running job: job_201104280006_0001
11/04/28 00:13:21 INFO mapred.JobClient:  map 0% reduce 0%
11/04/28 00:13:57 INFO mapred.JobClient:  map 100% reduce 0%
11/04/28 00:14:18 INFO mapred.JobClient:  map 100% reduce 100%
11/04/28 00:14:21 INFO mapred.JobClient: Job complete: job_201104280006_0001
11/04/28 00:14:21 INFO mapred.JobClient: Counters: 23
11/04/28 00:14:21 INFO mapred.JobClient:   Job Counters 
11/04/28 00:14:21 INFO mapred.JobClient:     Launched reduce tasks=1
11/04/28 00:14:21 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=32016
11/04/28 00:14:21 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
11/04/28 00:14:21 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
11/04/28 00:14:21 INFO mapred.JobClient:     Launched map tasks=1
11/04/28 00:14:21 INFO mapred.JobClient:     Data-local map tasks=1
11/04/28 00:14:21 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=18713
11/04/28 00:14:21 INFO mapred.JobClient:   FileSystemCounters
11/04/28 00:14:21 INFO mapred.JobClient:     FILE_BYTES_READ=28
11/04/28 00:14:21 INFO mapred.JobClient:     HDFS_BYTES_READ=237
11/04/28 00:14:21 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=88
11/04/28 00:14:21 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=215
11/04/28 00:14:21 INFO mapred.JobClient:   Map-Reduce Framework
11/04/28 00:14:21 INFO mapred.JobClient:     Reduce input groups=2
11/04/28 00:14:21 INFO mapred.JobClient:     Combine output records=0
11/04/28 00:14:21 INFO mapred.JobClient:     Map input records=1
11/04/28 00:14:21 INFO mapred.JobClient:     Reduce shuffle bytes=28
11/04/28 00:14:21 INFO mapred.JobClient:     Reduce output records=0
11/04/28 00:14:21 INFO mapred.JobClient:     Spilled Records=4
11/04/28 00:14:21 INFO mapred.JobClient:     Map output bytes=18
11/04/28 00:14:21 INFO mapred.JobClient:     Map input bytes=24
11/04/28 00:14:21 INFO mapred.JobClient:     Combine input records=0
11/04/28 00:14:21 INFO mapred.JobClient:     Map output records=2
11/04/28 00:14:21 INFO mapred.JobClient:     SPLIT_RAW_BYTES=119
11/04/28 00:14:21 INFO mapred.JobClient:     Reduce input records=2
Job Finished in 62.488 seconds
Estimated value of Pi is 3.14800000000000000000
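
For reference, the example estimates pi by a Monte Carlo method: sample points in the unit square and count the fraction that lands inside the quarter circle. The sketch below shows the idea in plain Perl with pseudo-random sampling (the Hadoop example itself uses a quasi-random Halton sequence, so its numbers will differ):

#!/usr/bin/perl
use strict;
use warnings;

# Monte Carlo estimate of pi: points uniform in the unit square,
# pi ~= 4 * (points inside the quarter circle) / (total points)
my $samples = 1000;
my $inside  = 0;
for (1 .. $samples) {
    my ($x, $y) = (rand(), rand());
    $inside++ if $x * $x + $y * $y <= 1;
}
printf "Estimated value of Pi is %.10f\n", 4 * $inside / $samples;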

Counting status codes in an access log with Hadoop and Perl

Code

$ cat map.pl

#!/usr/bin/perl

use strict;
use warnings;

while (<>) {
    chomp;
    # Split the log line on whitespace; field 8 is the HTTP status code
    my @segments = split /\s+/;
    # Emit "status<TAB>1" for each request
    printf "%s\t%s\n", $segments[8], 1;
}

$ cat reduce.pl

#!/usr/bin/perl

use strict;
use warnings;

my %count;
while (<>) {
    chomp;
    my ($key, $value) = split /\t/;
    # The mapper emits "status<TAB>1", so counting input lines per key
    # gives the total number of requests for that status code
    $count{$key}++;
}

while (my ($key, $value) = each %count) {
    printf "%s\t%s\n", $key, $value;
}

Testing locally

$ perl map.pl access.log | perl reduce.pl

403 189
304 39908
206 329
400 19
401 231
200 569383
302 7538
416 5
500 1201
404 610
301 1
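
Note: Hadoop sorts the map output by key before handing it to the reducer. Since reduce.pl aggregates everything in a hash, the unsorted pipe above gives the same result, but a closer local simulation of the streaming pipeline would insert a sort:

$ perl map.pl access.log | sort | perl reduce.pl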

Running on Hadoop

$ sudo -u mapred hadoop fs -put access.log /user/mapred/

$ sudo -u mapred hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-0.20.2+737.jar -file map.pl -file reduce.pl -input /user/mapred/access.log -mapper map.pl -reducer reduce.pl -output output

packageJobJar: [map.pl, reduce.pl, /var/lib/hadoop-0.20/cache/mapred/hadoop-unjar5307710405815318669/] [] /tmp/streamjob1503992582060034246.jar tmpDir=null
11/04/28 15:20:03 INFO mapred.FileInputFormat: Total input paths to process : 1
11/04/28 15:20:05 INFO streaming.StreamJob: getLocalDirs(): [/var/lib/hadoop-0.20/cache/mapred/mapred/local]
11/04/28 15:20:05 INFO streaming.StreamJob: Running job: job_201104281518_0002
11/04/28 15:20:05 INFO streaming.StreamJob: To kill this job, run:
11/04/28 15:20:05 INFO streaming.StreamJob: /usr/lib/hadoop-0.20/bin/hadoop job  -Dmapred.job.tracker=master:54311 -kill job_201104281518_0002
11/04/28 15:20:05 INFO streaming.StreamJob: Tracking URL: http://master:50030/jobdetails.jsp?jobid=job_201104281518_0002
11/04/28 15:20:06 INFO streaming.StreamJob:  map 0%  reduce 0%
11/04/28 15:22:48 INFO streaming.StreamJob:  map 6%  reduce 0%
11/04/28 15:22:57 INFO streaming.StreamJob:  map 19%  reduce 0%
11/04/28 15:23:00 INFO streaming.StreamJob:  map 22%  reduce 0%
11/04/28 15:23:03 INFO streaming.StreamJob:  map 24%  reduce 0%
11/04/28 15:23:07 INFO streaming.StreamJob:  map 26%  reduce 0%
11/04/28 15:23:09 INFO streaming.StreamJob:  map 27%  reduce 0%
11/04/28 15:23:12 INFO streaming.StreamJob:  map 29%  reduce 0%
11/04/28 15:23:16 INFO streaming.StreamJob:  map 31%  reduce 0%
11/04/28 15:23:22 INFO streaming.StreamJob:  map 34%  reduce 0%
11/04/28 15:23:24 INFO streaming.StreamJob:  map 36%  reduce 0%
11/04/28 15:23:28 INFO streaming.StreamJob:  map 37%  reduce 0%
11/04/28 15:23:31 INFO streaming.StreamJob:  map 48%  reduce 0%
11/04/28 15:23:35 INFO streaming.StreamJob:  map 51%  reduce 0%
11/04/28 15:23:37 INFO streaming.StreamJob:  map 56%  reduce 0%
11/04/28 15:23:40 INFO streaming.StreamJob:  map 59%  reduce 0%
11/04/28 15:23:44 INFO streaming.StreamJob:  map 61%  reduce 0%
11/04/28 15:23:47 INFO streaming.StreamJob:  map 64%  reduce 0%
11/04/28 15:23:49 INFO streaming.StreamJob:  map 67%  reduce 0%
11/04/28 15:23:50 INFO streaming.StreamJob:  map 81%  reduce 0%
11/04/28 15:23:57 INFO streaming.StreamJob:  map 100%  reduce 0%
11/04/28 15:24:35 INFO streaming.StreamJob:  map 100%  reduce 67%
11/04/28 15:24:41 INFO streaming.StreamJob:  map 100%  reduce 100%
11/04/28 15:24:45 INFO streaming.StreamJob: Job complete: job_201104281518_0002
11/04/28 15:24:45 INFO streaming.StreamJob: Output: output

$ sudo -u mapred hadoop fs -ls /user/mapred/output

Found 3 items
-rw-r--r--   3 mapred supergroup          0 2011-04-28 15:25 /user/mapred/output/_SUCCESS
drwxr-xr-x   - mapred supergroup          0 2011-04-28 15:21 /user/mapred/output/_logs
-rw-r--r--   3 mapred supergroup         90 2011-04-28 15:25 /user/mapred/output/part-00000

$ sudo -u mapred hadoop fs -cat /user/mapred/output/part-00000

403 189
304 39908
206 329
400 19
401 231
200 569383
302 7538
416 5
500 1201
404 610

Notes

Nothing runs unless the file access permissions are set up properly. That was a hassle, so in the end I just disabled permission checking (the dfs.permissions property in hdfs-site.xml):

<property>
  <name>dfs.permissions</name>
  <value>false</value>
</property>

During the reduce phase, slave3 had a mistake in its /etc/hosts and the job stalled because it could not reach the map nodes.

⇒ Make sure /etc/hosts is set up correctly on every node.
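
For example (hostnames taken from this setup, IP addresses made up, and assuming one master plus three slaves), every node should be able to resolve every other node by name:

192.168.0.10    master
192.168.0.11    slave1
192.168.0.12    slave2
192.168.0.13    slave3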

Do HDFS file operations as the hdfs user, and run MapReduce jobs as the mapred user.

⇒ sudo -u [user] hadoop ...
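
For example (the paths here are just illustrative):

# HDFS file operations as the hdfs user
$ sudo -u hdfs hadoop fs -mkdir /user/mapred
$ sudo -u hdfs hadoop fs -chown mapred /user/mapred

# MapReduce jobs as the mapred user
$ sudo -u mapred hadoop jar /usr/lib/hadoop/hadoop-examples.jar pi 1 1000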
