sunkencity/mapreduce

## mapreduce
A little tutorial on mapreduce.

This is a short tutorial on what mapreduce is. It'll do a process first sequentially and then with multiple mapper jobs As a silly example we will try to get a list of prime numbers in a big corpus of random numbers. Let's first start out with creating some test data that has good behaviour. We'll do this in the a terminal shell using ruby.

  $ruby -e "(1..100).each { |x| puts x }" > data_1..100.txt

Look at the file with the "cat" utility:

  $ cat data_1..100.txt
  1
  2
  ..
  99
  100

Looks good, let's make the mapper:

#!/usr/bin/env ruby

  # try to find evidence of not a prime and return false otherwise return true
  def is_prime? n
    return false if n < 2
    (2..(n -1)).each do |d|
      return false if (n / d.to_f) % 1 == 0
    end
    true
  end

  # read each line and spit out the number and "true" or "false" whether the number is
  # a prime or not, separate the two columns with a comma
  ARGF.each_line do |l|
    number = l.to_i
    puts [number,is_prime?(number)].join(',')
  end

make it executable:

  $ chmod +x mapper.rb

Let's try it out:

  $ cat data_1..100.txt | ./mapper.rb
  1,false
  2,true
  3,true
  4,false
  5,true
  6,false
  7,true
  8,false
  9,false
  10,false
  11,true
  12,false
  13,true
  ..

Time to write something to compile the result, the reducer, it'll grab all the rows with primes and print them out:

  #!/usr/bin/env ruby

  ARGF.each_line do |l|
    arr = l.chomp.split(",")
    puts arr.first if arr.last == "true"
  end

Let's try out the whole chain:

  $ cat data_1..100.txt | ./mapper.rb | ./reducer.rb
  2
  3
  5
  7
  11
  13
  17

Seems to work fine! Let's generate some more source data and make a speed test:

  $ mkdir src
  $ ruby -e "(10000..20000).each { |x| puts x }" > src/10000-20000.txt
  $ ruby -e "(20001..30000).each { |x| puts x }" > src/20001-30000.txt

  $ time cat src/* | ./mapper.rb | ./reducer.rb

  real	0m20.126s
  user	0m19.978s
  sys	0m0.100s

So let's see if we can speed this up a little by running it in parallel, first we'll need to make a simple bash script to be able to spawn concurrent processes, here's the simplest possible script. I have 2 cores on this machine so I'll run two concurrent processes:

  #!/bin/bash

  PARALLEL_JOBS=2

  count=0
  for item in src/*; do
    cat $item | bin/mapper.rb  &
    let count+=1
    [[ $((count%PARALLEL_JOBS)) -eq 0 ]] && wait
  done

Let's try it out:

  $ time bin/process_parallel.sh | bin/reducer.rb

  real	0m12.895s
  user	0m19.877s
  sys	0m0.155s

If you want to have the values sorted, add a second reduce step with "| sort" at the end.
	A little tutorial on mapreduce.

	This is a short tutorial on what mapreduce is. It'll do a process first sequentially and then with multiple mapper jobs As a silly example we will try to get a list of prime numbers in a big corpus of random numbers. Let's first start out with creating some test data that has good behaviour. We'll do this in the a terminal shell using ruby.

	$ruby -e "(1..100).each { \|x\| puts x }" > data_1..100.txt

	Look at the file with the "cat" utility:

	$ cat data_1..100.txt
	1
	2
	..
	99
	100

	Looks good, let's make the mapper:

	#!/usr/bin/env ruby

	# try to find evidence of not a prime and return false otherwise return true
	def is_prime? n
	return false if n < 2
	(2..(n -1)).each do \|d\|
	return false if (n / d.to_f) % 1 == 0
	end
	true
	end

	# read each line and spit out the number and "true" or "false" whether the number is
	# a prime or not, separate the two columns with a comma
	ARGF.each_line do \|l\|
	number = l.to_i
	puts [number,is_prime?(number)].join(',')
	end

	make it executable:

	$ chmod +x mapper.rb

	Let's try it out:

	$ cat data_1..100.txt \| ./mapper.rb
	1,false
	2,true
	3,true
	4,false
	5,true
	6,false
	7,true
	8,false
	9,false
	10,false
	11,true
	12,false
	13,true
	..

	Time to write something to compile the result, the reducer, it'll grab all the rows with primes and print them out:

	#!/usr/bin/env ruby

	ARGF.each_line do \|l\|
	arr = l.chomp.split(",")
	puts arr.first if arr.last == "true"
	end

	Let's try out the whole chain:

	$ cat data_1..100.txt \| ./mapper.rb \| ./reducer.rb
	2
	3
	5
	7
	11
	13
	17

	Seems to work fine! Let's generate some more source data and make a speed test:

	$ mkdir src
	$ ruby -e "(10000..20000).each { \|x\| puts x }" > src/10000-20000.txt
	$ ruby -e "(20001..30000).each { \|x\| puts x }" > src/20001-30000.txt

	$ time cat src/* \| ./mapper.rb \| ./reducer.rb

	real 0m20.126s
	user 0m19.978s
	sys 0m0.100s

	So let's see if we can speed this up a little by running it in parallel, first we'll need to make a simple bash script to be able to spawn concurrent processes, here's the simplest possible script. I have 2 cores on this machine so I'll run two concurrent processes:

	#!/bin/bash

	PARALLEL_JOBS=2

	count=0
	for item in src/*; do
	cat $item \| bin/mapper.rb &
	let count+=1
	[[ $((count%PARALLEL_JOBS)) -eq 0 ]] && wait
	done

	Let's try it out:

	$ time bin/process_parallel.sh \| bin/reducer.rb

	real 0m12.895s
	user 0m19.877s
	sys 0m0.155s

	If you want to have the values sorted, add a second reduce step with "\| sort" at the end.