abishekk92/monoid.md

## monoid.md

      
    Raw
  

              monoid.md
            
          
    I recently came across how one can parallelize an algorithm in terms of Map Reduce by nailing down the operations on the data as a well defined Algebraic Structure (A monoid). The idea has strong mathematical grounding and explains why Twitter is so much invested in Algebird, Summingbird. I will try to briefly explain what it is  and why I am particularly fascinated by it.
Map Reduce

In today's world of ever growing data, the computation we need to run keeps getting complex and complex, these computations need to be run as efficiently(cost and resource) as possible. Most of these computations are run as Map Reduce jobs on a cluster of commodity grade hardware.
The idea behind Map Reduce is to move the computation to the data than to aggregate data to perform computation.
Each of the computation happen locally on the data(Mapping) on all the nodes at the same time, hence the notion of parallelism, these mapped data can be aggregated together and reduced into a representative value.
For example, I want to count the frequencies of words occurring in a Billion Documents and they spread across different data nodes in the cluster.
By the previous explanation doing it this in Map Reduce fashion.

We would map out word frequencies of documents in each one of the node.
We would aggregate all the different local frequencies and reduce them into a single representative frequency.

In the reduce phase, all the local frequenices are reduced into a single representative frequency through a binary opertaion.
A binary operation is any operation with two operands. For example: f(a,b) -> a + b, + is a binary operation on a and b
You might have noticed that during this whole process there is no control over the order of computation, hence it is important to make sure that the binary operation on a set of data results in the same result irrespective of the order.
For example:
   a -> [1,2,3]
   1+(2+3) == 3+(2+1)
   avg(1,avg(2,3)) != avg(2,(avg(1,3))

Essentially the binary opertion defined needs to be associative on the defined data, so that irrespective of the order of the computation it always results in the same result.
Defining this formally we can convert an algorithm into Map Reduce form if it forms a Semi-Group or a Monoid under the said binary opertaion over the given data structure.
Monoids and Semi Groups

Semi Group
Given a binary opertaion op a is structure said to be a semi-group. When

For a, b in A and a op b in A. (Closure) This can be defined as

  def op(a : A, b: A) : A 

For a, b, c in A a op (b op c) == (a op b) op c (Assocaitive)

Monoid
A monoid is nothing but a semi-group with an indetity element defined for it's binary operation
There exists an identity element e defined on the binary-operation op. i.e
   e op a == a
   a op e == a
It is apparent that making sure that the data structure are monoids under the considered operation provides us free parallelism, irrespective of the paradigm, naturally Twitter's interest in something like Algebird or Summingbird.
Algebird makes it really easy to define your data structure as a Monoid or a Semi-Group or any other advanced Abstract Algebraic Strucutre like a Ring or a field, each one of them are well defined and they come with their own advantage.
Summingbird builds on top of Algebird and it comes with common data structures for heavy numerical analysis defined as Algebraic Types using Algebird. All these types play well with different tool chains like Hadoop, Storm or Spark. If you're interested in learning more about the same I would highly recommend this post by Micheal Noll.