Skip to content

Instantly share code, notes, and snippets.

@abishekk92
Last active July 29, 2019 18:53
Show Gist options
  • Save abishekk92/10593977 to your computer and use it in GitHub Desktop.
Save abishekk92/10593977 to your computer and use it in GitHub Desktop.
Abstract Algebra and Map Reduce

I recently came across how one can parallelize an algorithm in terms of Map Reduce by nailing down the operations on the data as a well defined Algebraic Structure (A monoid). The idea has strong mathematical grounding and explains why Twitter is so much invested in Algebird, Summingbird. I will try to briefly explain what it is and why I am particularly fascinated by it.

Map Reduce

In today's world of ever growing data, the computation we need to run keeps getting complex and complex, these computations need to be run as efficiently(cost and resource) as possible. Most of these computations are run as Map Reduce jobs on a cluster of commodity grade hardware.

The idea behind Map Reduce is to move the computation to the data than to aggregate data to perform computation.

Each of the computation happen locally on the data(Mapping) on all the nodes at the same time, hence the notion of parallelism, these mapped data can be aggregated together and reduced into a representative value.

For example, I want to count the frequencies of words occurring in a Billion Documents and they spread across different data nodes in the cluster.

By the previous explanation doing it this in Map Reduce fashion.

  • We would map out word frequencies of documents in each one of the node.
  • We would aggregate all the different local frequencies and reduce them into a single representative frequency.

In the reduce phase, all the local frequenices are reduced into a single representative frequency through a binary opertaion.

A binary operation is any operation with two operands. For example: f(a,b) -> a + b, + is a binary operation on a and b

You might have noticed that during this whole process there is no control over the order of computation, hence it is important to make sure that the binary operation on a set of data results in the same result irrespective of the order.

For example:

   a -> [1,2,3]
   1+(2+3) == 3+(2+1)
   avg(1,avg(2,3)) != avg(2,(avg(1,3))

Essentially the binary opertion defined needs to be associative on the defined data, so that irrespective of the order of the computation it always results in the same result.

Defining this formally we can convert an algorithm into Map Reduce form if it forms a Semi-Group or a Monoid under the said binary opertaion over the given data structure.

Monoids and Semi Groups

Semi Group

Given a binary opertaion op a is structure said to be a semi-group. When

  • For a, b in A and a op b in A. (Closure) This can be defined as
  def op(a : A, b: A) : A 
  • For a, b, c in A a op (b op c) == (a op b) op c (Assocaitive)

Monoid

A monoid is nothing but a semi-group with an indetity element defined for it's binary operation

There exists an identity element e defined on the binary-operation op. i.e

   e op a == a
   a op e == a

It is apparent that making sure that the data structure are monoids under the considered operation provides us free parallelism, irrespective of the paradigm, naturally Twitter's interest in something like Algebird or Summingbird.

Algebird makes it really easy to define your data structure as a Monoid or a Semi-Group or any other advanced Abstract Algebraic Strucutre like a Ring or a field, each one of them are well defined and they come with their own advantage.

Summingbird builds on top of Algebird and it comes with common data structures for heavy numerical analysis defined as Algebraic Types using Algebird. All these types play well with different tool chains like Hadoop, Storm or Spark. If you're interested in learning more about the same I would highly recommend this post by Micheal Noll.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment