Skip to content

Instantly share code, notes, and snippets.

@sunkencity
Created May 14, 2012 07:48
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save sunkencity/2692529 to your computer and use it in GitHub Desktop.
Save sunkencity/2692529 to your computer and use it in GitHub Desktop.
A simple mapreduce example
A little tutorial on mapreduce.
This is a short tutorial on what mapreduce is. It'll do a process first sequentially and then with multiple mapper jobs As a silly example we will try to get a list of prime numbers in a big corpus of random numbers. Let's first start out with creating some test data that has good behaviour. We'll do this in the a terminal shell using ruby.
$ruby -e "(1..100).each { |x| puts x }" > data_1..100.txt
Look at the file with the "cat" utility:
$ cat data_1..100.txt
1
2
..
99
100
Looks good, let's make the mapper:
#!/usr/bin/env ruby
# try to find evidence of not a prime and return false otherwise return true
def is_prime? n
return false if n < 2
(2..(n -1)).each do |d|
return false if (n / d.to_f) % 1 == 0
end
true
end
# read each line and spit out the number and "true" or "false" whether the number is
# a prime or not, separate the two columns with a comma
ARGF.each_line do |l|
number = l.to_i
puts [number,is_prime?(number)].join(',')
end
make it executable:
$ chmod +x mapper.rb
Let's try it out:
$ cat data_1..100.txt | ./mapper.rb
1,false
2,true
3,true
4,false
5,true
6,false
7,true
8,false
9,false
10,false
11,true
12,false
13,true
..
Time to write something to compile the result, the reducer, it'll grab all the rows with primes and print them out:
#!/usr/bin/env ruby
ARGF.each_line do |l|
arr = l.chomp.split(",")
puts arr.first if arr.last == "true"
end
Let's try out the whole chain:
$ cat data_1..100.txt | ./mapper.rb | ./reducer.rb
2
3
5
7
11
13
17
Seems to work fine! Let's generate some more source data and make a speed test:
$ mkdir src
$ ruby -e "(10000..20000).each { |x| puts x }" > src/10000-20000.txt
$ ruby -e "(20001..30000).each { |x| puts x }" > src/20001-30000.txt
$ time cat src/* | ./mapper.rb | ./reducer.rb
real 0m20.126s
user 0m19.978s
sys 0m0.100s
So let's see if we can speed this up a little by running it in parallel, first we'll need to make a simple bash script to be able to spawn concurrent processes, here's the simplest possible script. I have 2 cores on this machine so I'll run two concurrent processes:
#!/bin/bash
PARALLEL_JOBS=2
count=0
for item in src/*; do
cat $item | bin/mapper.rb &
let count+=1
[[ $((count%PARALLEL_JOBS)) -eq 0 ]] && wait
done
Let's try it out:
$ time bin/process_parallel.sh | bin/reducer.rb
real 0m12.895s
user 0m19.877s
sys 0m0.155s
If you want to have the values sorted, add a second reduce step with "| sort" at the end.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment