@karanth
karanth / hadoop-sandbox-1.md
Last active August 29, 2015 13:55
Notes on running Hadoop jobs on Hortonworks sandbox.

The last [gist](https://gist.github.com/karanth/8736340) was about installing the Hortonworks sandbox and getting to know the entry points into it. The next step for most people starting to learn Hadoop is either to run the example MapReduce (MR) jobs that ship with the Hadoop distribution or to write a simple MR job such as word count.

The Hue web page at http://localhost:8888 does not allow Hadoop MR jobs written in Java to be executed directly; it works through higher-level abstractions such as Hive (SQL-like) or Pig. Java MR jobs can instead be run by logging into the sandbox (recall Alt + F5 or ssh) and executing them from the sandbox's terminal.

#### Running Hadoop Example Programs

The sandbox keeps the Hadoop MR examples in the directory /usr/lib/hadoop-mapreduce. The file name has the form hadoop-mapreduce-examples-*.jar, where the * (asterisk) is a wildcard standing in for the version details of the jar file.

To run an example, the pi estimation program in this case, the command is,

@karanth
karanth / ln-datamodel-I.md
Last active August 29, 2015 13:56
Lecture notes on data modeling for Big Data - I

The fundamental infrastructure for solving Big Data problems is a networked set of data and compute nodes that form a cluster. The nodes run on commodity hardware, and data is distributed through replication, sharding, or most likely both. Traditionally, relational databases have been the default data stores, storing data in tables and complying with ACID for transactions.

ACID stands for,

Atomicity - Requires every operation/transaction to be all-or-nothing. A transaction either happens in its entirety on successful execution or does not happen at all when a failure occurs. The state of the system is either fully transformed or left untransformed, depending on the success of the operation.

Consistency - Requires that when an operation or transaction is executed, the data remains valid under some preset rules. The states before and after the operation/transaction are always valid. Constraints in a database are one kind of such rule.

Isolation - Requires that concurrent operations or transactions a

@karanth
karanth / ln-datamodel-II.md
Last active August 29, 2015 13:56
Lecture notes on data modeling for Big Data - II

[Part I](https://gist.github.com/karanth/8912629) showed the scalability problems posed by distributed transactions and the need for data stores with weaker ACID guarantees. These data stores are, confusingly, called NoSQL stores. Atomicity and Isolation are guaranteed in these stores within a particular data-domain (for example, at the key level in a key-value store), where the data-domain resides on a single node.

In most NoSQL stores, knowledge of data structures is a prerequisite, as the structure may not be relational. The 2 main principles used in modeling data in NoSQL stores are,

  • In a relational store that complies with ACID, the availability of data is the driver for modeling. The question asked by the person modeling the store is [What answers do I have?](http://highlyscalable.wordpress.com/2012/03/01/nosql-data-modeling-techniques/). The design focus is on run time rather than on data store modeling/design time. This makes relational stores good for ad hoc querying. However, in NoSQL stores, the main ques
@karanth
karanth / ln-datamodel-III.md
Last active August 29, 2015 13:56
Lecture notes on data modeling for Big Data - III

[Part 2](https://gist.github.com/karanth/8931761) illustrated 2 important principles of data modeling in NoSQL: one, that modeling is a design-time exercise and is application-specific, and two, that duplication of data is a must in the absence of joins. Techniques like denormalization and aggregates are used for data modeling.

#### Key-Value Stores

These are simple and extremely flexible stores. Each record is identified by a key. Anything can be stuffed into a value (though some stores provide data structures as values too). Distribution across nodes in a cluster is based on the key, which keeps the distribution model simple. Queries cannot be made on the values. The most common use case for this kind of store is caching in an application for quick lookup. In practice, most of these stores are memory-based with occasional writes to disk. A key-value store can be visualized as a distributed hash table.
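The "distributed hash table" picture can be sketched in a few lines. This is only an illustration under stated assumptions: the nodes are in-memory maps inside one process, the hash is a toy string hash, and the names `hashKey` and `ToyKeyValueStore` are hypothetical rather than taken from any real store.

```javascript
// Toy string hash (illustrative only, not what a production store would use).
function hashKey(key) {
  let h = 0;
  for (let i = 0; i < key.length; i++) {
    h = (h * 31 + key.charCodeAt(i)) >>> 0; // keep the value an unsigned 32-bit integer
  }
  return h;
}

class ToyKeyValueStore {
  constructor(nodeCount) {
    // Each "node" is just an in-memory Map in this sketch.
    this.nodes = Array.from({ length: nodeCount }, () => new Map());
  }

  // The key alone decides which node owns a record.
  nodeFor(key) {
    return hashKey(key) % this.nodes.length;
  }

  put(key, value) {
    this.nodes[this.nodeFor(key)].set(key, value);
  }

  get(key) {
    return this.nodes[this.nodeFor(key)].get(key);
  }
}

const store = new ToyKeyValueStore(3);
store.put("user:42", { name: "Ada" }); // the value is opaque to the store
console.log(store.get("user:42"));
```

A lookup by key touches exactly one node, whereas finding a record by something inside its value would mean visiting every node and every entry, which is why such stores do not offer queries on values.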

Range queries are not possible in these stores without a scan of the entire data set. There are some stor

@karanth
karanth / ln-compute-I.md
Last active August 29, 2015 13:56
Lecture notes on computational paradigms for Big Data - I

Computation is a critical pillar for transformation and analysis of Big Data. At a very high level, there are 2 paradigms for compute, where compute refers to the actual instructions or code that executes on a compute node in a distributed machine cluster.

  • Moving compute to the data
  • Moving data to the compute

At first glance, the latter method seems infeasible for Big Data, given that the network is a bottleneck within a set of interconnected machines. The former is very attractive, as code is minuscule compared to the data. However, moving compute to the data can only happen after all or enough of the data required for analysis is present in a storage system, so a storage system becomes mandatory for this paradigm. Waiting for data to accumulate gives rise to high latency for processing and analyzing the data, but most definitely yields better accuracy. The latter method may be suited to certain kinds of use cases involving low-latency SLAs in Big Data. A combination of the two, a hybrid app

Data to the compute vs Compute to the data

Data Streaming vs MapReduce

MapReduce model

  • "Program once, deploy at scale"
  • Allows programmers without a background in parallel/distributed computing to use distributed systems efficiently
  • Batch processing

Streaming model

  • No random access to data
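To make the MapReduce notes concrete, here is a single-process sketch of word count. It only imitates the map, shuffle, and reduce phases in memory; the function names are hypothetical and this is not Hadoop's API.

```javascript
// Map phase: emit a [word, 1] pair for every word in one input line.
function mapLine(line) {
  return line
    .toLowerCase()
    .split(/\W+/)
    .filter(Boolean)
    .map((word) => [word, 1]);
}

// Shuffle phase: group the emitted values by key.
function shuffle(pairs) {
  const groups = new Map();
  for (const [key, value] of pairs) {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  }
  return groups;
}

// Reduce phase: sum the values collected for one key.
function reduceCounts(values) {
  return values.reduce((sum, v) => sum + v, 0);
}

function wordCount(lines) {
  const pairs = lines.flatMap(mapLine);
  const counts = {};
  for (const [word, values] of shuffle(pairs)) {
    counts[word] = reduceCounts(values);
  }
  return counts;
}

console.log(wordCount(["big data", "Big clusters"]));
// { big: 2, data: 1, clusters: 1 }
```

The programmer writes only `mapLine` and `reduceCounts`; in a real framework the grouping step and the distribution of work across nodes are handled by the runtime, which is the sense of "program once, deploy at scale".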

@karanth
karanth / jsOOP-Part1.md
Last active January 2, 2016 12:29
Notes on Javascript OOP concepts - Part 1 - All about 'this'

Basics

Creating an object in Javascript is as simple as,

var obj = {};

This is also equivalent to using the new operator and calling the Object function.

var obj = new Object();
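
A quick check of that equivalence: both forms produce an empty object whose prototype is `Object.prototype`. The variable names are only for illustration.

```javascript
var a = {};
var b = new Object();

// Both objects share the same prototype and start with no own properties.
console.log(Object.getPrototypeOf(a) === Object.getPrototypeOf(b)); // true
console.log(Object.keys(a).length, Object.keys(b).length);          // 0 0

// They are still two distinct objects.
console.log(a === b); // false
```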
@karanth
karanth / jsOOP-Part2.md
Last active January 2, 2016 14:08
Notes on Javascript OOP - Part 2 - 'new' operator

Internals - 'new' operator

In the last part, we looked at the this keyword in Javascript and saw how it is determined just before function execution. We spent some time on a couple of gotchas caused by this "late" binding and on strategies for mitigating them. At the beginning of the last part, we saw different ways of creating objects in Javascript and noted a preference for using the new operator when creating them.

To recap, in Javascript an object is nothing but a collection of properties. Properties can be private, privileged, or public, and they can hold data or functions. Declaring something with var rather than attaching it to this determines whether it is private or accessible externally.
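
A small sketch of the three kinds of members, using a hypothetical `Counter` constructor that is not from the original notes:

```javascript
function Counter() {
  // Private: declared with var, visible only inside the constructor's closure.
  var count = 0;

  // Public data property: attached to this, accessible from outside.
  this.label = "clicks";

  // Privileged method: public, but closes over the private variable.
  this.increment = function () {
    count += 1;
    return count;
  };
}

var counter = new Counter();
counter.increment();
console.log(counter.increment()); // 2
console.log(counter.label);       // "clicks"
console.log(counter.count);       // undefined, count is not a property of the object
```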

The new operator is used before a function to create an object. The function appearing after new is the constructor function that constructs the object.
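
For instance, with a hypothetical `Car` constructor (the name and its `model` parameter are assumptions for illustration):

```javascript
function Car(model) {
  // In a call made with new, this refers to the freshly created object.
  this.model = model;
}

var car = new Car("hatchback");

console.log(car.model);                                    // "hatchback"
console.log(car instanceof Car);                           // true
console.log(Object.getPrototypeOf(car) === Car.prototype); // true
```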

The new keyword, though an operator by nature, can be thought about as a function that executes a few statements using the constructor func

@karanth
karanth / jsOOP-Part3.md
Last active January 2, 2016 20:39
Notes on Javascript OOP - Part 3 - Inheritance

Internals - Inheritance

Inheritance in Javascript is called prototypal inheritance. It is different from the classical inheritance that we see in OOP languages like Java or C++. Inheritance in OOP is required for 2 reasons,

  • Automatic casting of references for objects of the same class family. If inheritance were not allowed, explicit casting of object references would be required. This is particularly useful in a strongly typed language setting. However, it is not relevant in the setting of Javascript, a loosely typed language where casting is not necessary.
  • Code reusability is another very important reason for inheritance in OOP. Inheritance allows us to group methods at appropriate levels of abstraction and not worry about duplicating the code at a different level of abstraction. For example, if the Car class has a certain function, let us say a brake method to decelerate the car, then a Sedan class, a sub-class of the Car class, need not duplicate the same method.
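
The Car/Sedan example can be sketched with prototypes as follows. This is one common way to wire up the chain, not necessarily the approach these notes go on to describe; the `speed` property and the amounts used are assumptions made for illustration.

```javascript
function Car() {
  this.speed = 60; // assumed starting speed
}

// brake is defined once, on Car.prototype.
Car.prototype.brake = function () {
  this.speed = Math.max(0, this.speed - 10);
  return this.speed;
};

function Sedan() {
  Car.call(this); // run the Car constructor on the new Sedan
}

// Link Sedan's prototype chain to Car's so property lookups fall through to Car.prototype.
Sedan.prototype = Object.create(Car.prototype);
Sedan.prototype.constructor = Sedan;

var sedan = new Sedan();
console.log(sedan.brake()); // 50
console.log(Object.prototype.hasOwnProperty.call(Sedan.prototype, "brake")); // false
```

Sedan never defines brake itself; the method is found by walking the prototype chain up to Car.prototype.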

When compared to classical inheritanc

@karanth
karanth / cors-pub.md
Last active January 3, 2016 20:58
Notes on CORS - More to it than what meets the eye

A previous [gist](https://gist.github.com/karanth/8467944#file-jsonp-cors-pub-md) had notes about JSONP and CORS as methods for making cross-domain requests. On balance, CORS was the more secure and better alternative. However, implementing CORS is not as straightforward as introducing a few headers on the server response.

The CORS W3C spec identifies cases where a pre-flight request is called for in a CORS situation. A pre-flight request is a client-initiated request using the OPTIONS HTTP verb that tries to understand the capabilities of the server. Only on a successful response (200) to the pre-flight request, and in the presence of certain headers in the server response, will the client initiate the actual CORS request.
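
The client-side decision can be sketched roughly as below. This is a deliberately simplified illustration, not the browser's full algorithm: it ignores credentials, wildcard values for methods and headers, caching, and the methods and headers that never need to be listed. The function names and the plain-object shape of `request` and `response` are hypothetical; the `Access-Control-*` header names are the standard ones.

```javascript
// Split a comma-separated header value into trimmed, non-empty items.
function splitList(value) {
  return (value || "")
    .split(",")
    .map((item) => item.trim())
    .filter(Boolean);
}

// Returns true when a pre-flight response permits the intended request.
// Header names in response.headers are assumed to be lower-cased already.
function preflightPermits(request, response) {
  if (response.status !== 200) return false;

  const allowOrigin = response.headers["access-control-allow-origin"];
  if (allowOrigin !== "*" && allowOrigin !== request.origin) return false;

  const allowedMethods = splitList(response.headers["access-control-allow-methods"])
    .map((m) => m.toUpperCase());
  if (!allowedMethods.includes(request.method.toUpperCase())) return false;

  const allowedHeaders = splitList(response.headers["access-control-allow-headers"])
    .map((h) => h.toLowerCase());
  return request.headers.every((h) => allowedHeaders.includes(h.toLowerCase()));
}

const request = {
  origin: "https://app.example",
  method: "PUT",
  headers: ["Content-Type"],
};
const response = {
  status: 200,
  headers: {
    "access-control-allow-origin": "https://app.example",
    "access-control-allow-methods": "GET, PUT",
    "access-control-allow-headers": "Content-Type",
  },
};
console.log(preflightPermits(request, response)); // true
```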

A little bit of nosing around with the Rapportive plugin revealed a CORS situation that forces the client to execute the special pre-flight request. Rapportive is a neat browser plugin that gives details about a contact (by looking into the from field of the email) whe