Created
May 14, 2012 07:48
-
-
Save sunkencity/2692529 to your computer and use it in GitHub Desktop.
A simple mapreduce example
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
A little tutorial on mapreduce. | |
This is a short tutorial on what mapreduce is. It'll do a process first sequentially and then with multiple mapper jobs As a silly example we will try to get a list of prime numbers in a big corpus of random numbers. Let's first start out with creating some test data that has good behaviour. We'll do this in the a terminal shell using ruby. | |
$ruby -e "(1..100).each { |x| puts x }" > data_1..100.txt | |
Look at the file with the "cat" utility: | |
$ cat data_1..100.txt | |
1 | |
2 | |
.. | |
99 | |
100 | |
Looks good, let's make the mapper: | |
#!/usr/bin/env ruby | |
# try to find evidence of not a prime and return false otherwise return true | |
def is_prime? n | |
return false if n < 2 | |
(2..(n -1)).each do |d| | |
return false if (n / d.to_f) % 1 == 0 | |
end | |
true | |
end | |
# read each line and spit out the number and "true" or "false" whether the number is | |
# a prime or not, separate the two columns with a comma | |
ARGF.each_line do |l| | |
number = l.to_i | |
puts [number,is_prime?(number)].join(',') | |
end | |
make it executable: | |
$ chmod +x mapper.rb | |
Let's try it out: | |
$ cat data_1..100.txt | ./mapper.rb | |
1,false | |
2,true | |
3,true | |
4,false | |
5,true | |
6,false | |
7,true | |
8,false | |
9,false | |
10,false | |
11,true | |
12,false | |
13,true | |
.. | |
Time to write something to compile the result, the reducer, it'll grab all the rows with primes and print them out: | |
#!/usr/bin/env ruby | |
ARGF.each_line do |l| | |
arr = l.chomp.split(",") | |
puts arr.first if arr.last == "true" | |
end | |
Let's try out the whole chain: | |
$ cat data_1..100.txt | ./mapper.rb | ./reducer.rb | |
2 | |
3 | |
5 | |
7 | |
11 | |
13 | |
17 | |
Seems to work fine! Let's generate some more source data and make a speed test: | |
$ mkdir src | |
$ ruby -e "(10000..20000).each { |x| puts x }" > src/10000-20000.txt | |
$ ruby -e "(20001..30000).each { |x| puts x }" > src/20001-30000.txt | |
$ time cat src/* | ./mapper.rb | ./reducer.rb | |
real 0m20.126s | |
user 0m19.978s | |
sys 0m0.100s | |
So let's see if we can speed this up a little by running it in parallel, first we'll need to make a simple bash script to be able to spawn concurrent processes, here's the simplest possible script. I have 2 cores on this machine so I'll run two concurrent processes: | |
#!/bin/bash | |
PARALLEL_JOBS=2 | |
count=0 | |
for item in src/*; do | |
cat $item | bin/mapper.rb & | |
let count+=1 | |
[[ $((count%PARALLEL_JOBS)) -eq 0 ]] && wait | |
done | |
Let's try it out: | |
$ time bin/process_parallel.sh | bin/reducer.rb | |
real 0m12.895s | |
user 0m19.877s | |
sys 0m0.155s | |
If you want to have the values sorted, add a second reduce step with "| sort" at the end. |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment