@karanth
karanth / bits-1.md
Last active April 13, 2021 16:38
Distributed Computing Models, Definitions and Brewer's theorem

####Basics####

  • Memory Hierarchy - Tape, Disk, SSD, Memory, Cache
  • Kryder's law
  • Long Tail vs 80/20 rule
  • Drawbacks of monolithic systems - Supercomputers
  • Distributed Systems - Advantages & Problems
  • CAP theorem - Consistency, Availability and Partition Tolerance
  • PACELC - if(Partition){ Tradeoffs: Consistency vs Availability } else { Tradeoffs: Consistency vs Latency }
  • Concurrency vs Parallelism
  • Parallel vs Distributed computing

  • Data to the compute vs Compute to the data
  • Data Streaming vs MapReduce

MapReduce model

  • "Program once, deploy at scale"
  • Allows programmers without a background in parallel/distributed computing to use distributed systems efficiently
  • Batch processing

Streaming model

  • No random access to data
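The contrast between the two models can be sketched with a word count. This is only an illustration under stated assumptions: the function names (`map_phase`, `reduce_phase`, `batch_word_count`, `streaming_word_count`) are hypothetical, everything runs in one process, and no real MapReduce or streaming framework is involved.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Iterable, Iterator


def map_phase(line: str) -> Iterator[tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_phase(pairs: Iterable[tuple[str, int]]) -> dict[str, int]:
    """Shuffle + reduce: group the pairs by key and sum the counts."""
    totals: dict[str, int] = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def batch_word_count(lines: list[str]) -> dict[str, int]:
    """Batch (MapReduce-style): the whole data set is stored before the job runs."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return reduce_phase(pairs)


def streaming_word_count(lines: Iterable[str]) -> Iterator[dict[str, int]]:
    """Streaming: a single pass with no random access; yields a running result."""
    totals: dict[str, int] = defaultdict(int)
    for line in lines:  # each record is seen exactly once, in arrival order
        for word, count in map_phase(line):
            totals[word] += count
        yield dict(totals)  # snapshot of the counts seen so far
```

The batch version returns one answer after all the input has been read, while the streaming version produces an updated answer after every record and never goes back to earlier data.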

@karanth
karanth / ln-compute-I.md
Last active August 29, 2015 13:56
Lecture notes on computational paradigms for Big Data - I

Computation is a critical pillar for transformation and analysis of Big Data. At a very high level, there are 2 paradigms for compute, where compute refers to the actual instructions or code that executes on a compute node in a distributed machine cluster.

  • Moving compute to the data
  • Moving data to the compute

At first glance, the latter method seems infeasible for Big Data, since the network is a bottleneck within a set of interconnected machines. The former is attractive because code is minuscule compared to the data. However, compute can move to the data only after all, or enough, of the data required for analysis is present in a storage system, so a storage system is mandatory for this paradigm. Waiting for data to accumulate adds latency to processing and analysis, but typically yields better accuracy. The latter method may be suited to certain use cases involving low-latency SLAs in Big Data. A combination of the two, a hybrid app

@karanth
karanth / ln-datamodel-III.md
Last active August 29, 2015 13:56
Lecture notes on data modeling for Big Data - III

[Part 2](https://gist.github.com/karanth/8931761) illustrated 2 important principles of data modeling in NoSQL: one, that modeling is a design-time exercise and is application-specific, and two, that duplication of data is a must in the absence of joins. Techniques like denormalization and aggregates are used for data modeling.

####Key-Value Stores

These are simple and extremely flexible stores. Each record is identified by a key. Anything can be stuffed into a value (though some stores provide data structures as values too). Distribution across nodes in a cluster is based on the key, making the distribution model simple. Queries cannot be made on the values. The most common use case for these kinds of stores is caching in an application for quick lookup. In practice, most of these stores are memory-based with occasional writes to disk. Key-Value stores can be visualized as a distributed Hashtable.
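The "distributed Hashtable" picture can be sketched in a few lines. This is a toy, single-process illustration: the class `ShardedKeyValueStore` and its method names are hypothetical, each "node" is just a dictionary, and the hash-modulo placement is an assumption chosen for brevity rather than the scheme of any particular store.

```python
import hashlib


class ShardedKeyValueStore:
    """Toy key-value store: each 'node' is a dict, chosen by hashing the key."""

    def __init__(self, node_count: int) -> None:
        self.nodes = [{} for _ in range(node_count)]

    def node_for(self, key: str) -> int:
        # Use a stable hash (Python's built-in hash() is randomized per process)
        # so that the same key always maps to the same node.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.nodes)

    def put(self, key: str, value: object) -> None:
        # The value is opaque to the store; only the key decides placement.
        self.nodes[self.node_for(key)][key] = value

    def get(self, key: str, default: object = None) -> object:
        # A lookup touches exactly one node.
        return self.nodes[self.node_for(key)].get(key, default)
```

Because placement depends only on the key, a lookup by key is cheap, while a query on values would have to visit every node. Production stores often use schemes such as consistent hashing instead of a plain modulo, so that adding a node moves fewer keys.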

Range queries are not possible in these stores without a scan of all the data. There are some stor

@karanth
karanth / ln-datamodel-II.md
Last active August 29, 2015 13:56
Lecture notes on data modeling for Big Data - II

[Part I](https://gist.github.com/karanth/8912629) showed the scalability problems posed by distributed transactions and the need for data stores with weaker ACID guarantees. These data stores are confusingly called NoSQL stores. Atomicity and Isolation are guaranteed in these stores within a particular data-domain (for example, at a key-level in a key value store), where the data-domain resides on a single node.

In most NoSQL stores, knowledge of data structures is a prerequisite as the structure may not be relational. The 2 main principles used in modeling data in NoSQL stores are:

  • In a relational store that complies with ACID, the availability of data is the driver for modeling. The question asked by the person modeling the store is [What answers do I have?](http://highlyscalable.wordpress.com/2012/03/01/nosql-data-modeling-techniques/). The design focus is on run time rather than on data store modeling/design time. This makes relational stores good for ad hoc querying. However, in NoSQL stores, the main ques
@karanth
karanth / ln-datamodel-I.md
Last active August 29, 2015 13:56
Lecture notes on data modeling for Big Data - I

The fundamental infrastructure to solve Big Data problems is a networked set of data and compute nodes that form a cluster. The nodes run on commodity hardware and data is distributed through replication, sharding or, most likely, both. Traditionally, relational databases have been the default data stores, storing data in tables and complying with ACID for transactions.

ACID stands for:

Atomicity - Requires every operation/transaction to be all-or-nothing. A transaction either happens in its entirety on successful execution or does not happen at all when a failure occurs. The state of the system is either fully transformed or left unchanged, depending on the success of the operation.

Consistency - Requires that when an operation or transaction is executed, the data is valid under some preset rules. The states before and after that operation/transaction are always valid. Constraints in a database are one kind of such rule.
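Atomicity and Consistency, as defined above, can be sketched with an in-memory example. This is a minimal illustration and not how a database implements transactions: `run_transaction`, `debit`, `credit` and `no_negative_balances` are hypothetical names, and the "rollback" is simply discarding a working copy.

```python
import copy
from typing import Callable, Dict, List, Tuple

State = Dict[str, int]
Operation = Callable[[State], None]


def run_transaction(
    state: State,
    operations: List[Operation],
    is_valid: Callable[[State], bool],
) -> Tuple[State, bool]:
    """Apply all operations or none; return (resulting state, committed?)."""
    working = copy.deepcopy(state)  # never touch the original until commit
    try:
        for operation in operations:
            operation(working)
    except Exception:
        return state, False  # atomicity: a failure leaves the state unchanged
    if not is_valid(working):
        return state, False  # consistency: reject states that break the rules
    return working, True


def debit(account: str, amount: int) -> Operation:
    def apply(state: State) -> None:
        state[account] -= amount
    return apply


def credit(account: str, amount: int) -> Operation:
    def apply(state: State) -> None:
        state[account] += amount
    return apply


def no_negative_balances(state: State) -> bool:
    """Preset rule (a constraint): no account may go below zero."""
    return all(balance >= 0 for balance in state.values())
```

A transfer is then a list of two operations, such as `[debit("alice", 30), credit("bob", 30)]`. If the debit would overdraw the account, the whole transfer is rejected and neither balance changes.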

Isolation - Requires that concurrent operations or transactions a

@karanth
karanth / hadoop-sandbox-1.md
Last active August 29, 2015 13:55
Notes on running Hadoop jobs on Hortonworks sandbox.

The last [gist](gist.github.com/karanth/8736340) was about installing the Hortonworks sandbox and getting to know the entry points into the sandbox. The next step for most people who are starting to learn Hadoop is either to run the example MapReduce (MR) jobs that come with the Hadoop distribution, or to write a simple MR job like word count.

The HUE web page at http://localhost:8888 does not allow Hadoop MR jobs to be executed from Java programs without higher-level abstractions like Hive (SQL-like) or Pig. Hadoop MR jobs can be run by logging into the sandbox (recall Alt + F5 or ssh) and executing them on the sandbox's terminal.

####Running Hadoop Example Programs

The sandbox has the hadoop MR examples in the directory /usr/lib/hadoop-mapreduce. The file name is of the form, hadoop-mapreduce-examples-*.jar. * (asterisk) is the wildcard for the version details of the jar file.

To run an example, the pi estimation program in this case, the command is,

@karanth
karanth / hadoop-sandbox.md
Last active April 26, 2021 16:47
Notes on installing Hortonworks Hadoop Sandbox - I

Installing a single-node Hadoop cluster is not a straightforward task. It involves a range of steps, from creating users and groups to enabling password-less ssh. Thanks to virtualization technology and Hortonworks' pre-configured OS images with Hadoop and a few of its ecosystem components, the task has been greatly simplified. Though this does not help a first-time Hadoop user learn the system-level complexities of Hadoop, it simplifies administration and deployment. The user can now focus on data management and analysis.

Downloads

The 2.4GB image for the Hortonworks Hadoop sandbox can be downloaded from [here](http://hortonworks.com/products/hortonworks-sandbox/#install). I have chosen Oracle's VirtualBox as the virtualization technology. It can be downloaded from [here](https://www.virtualbox.org/wiki/Downloads).

Configuration

I installed VirtualBox on my Windows 8 PC, which has 4GB of RAM. The documentation clearly states that if Ambari and/or HBase have to b

@karanth
karanth / jetty-clojure.md
Last active December 20, 2023 19:15
Notes on installing SSL certificates in jetty - clojure + ring

SSL is an important security and privacy feature for all websites. Its details are outlined in this Wikipedia [article](http://en.wikipedia.org/wiki/Secure_Sockets_Layer). At Scibler, we use SSL certificates, encrypting all traffic to and from our servers. SSL uses public-key (asymmetric) encryption to exchange symmetric keys; the symmetric keys are then used for payload encryption. On our servers, we use embedded Jetty (the ring Jetty adapter) with the Clojure [ring](https://github.com/ring-clojure) library to handle the HTTP-specific functionality.

This is a tutorial about installing SSL certificates on Jetty web servers. SSL certificates are X.509 certificates that can be self-signed (authorized by Scibler) or signed by trusted third parties. Certificates signed by trusted third parties are the ones that Internet users and browsers trust the most. Trusted third-party certification authorities issue certificates per domain and charge a nominal yearly fee.
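The difference between self-signed and third-party-signed certificates shows up on the client side, when the certificate is verified. The sketch below uses Python's standard `ssl` module purely as an illustration; it is not part of the Jetty or ring setup described here, and the function names and the `ca_file` parameter are hypothetical.

```python
import ssl


def strict_client_context() -> ssl.SSLContext:
    """Client context that trusts only certificates chaining to the system CA store."""
    # The default context requires a valid certificate and checks the host name.
    return ssl.create_default_context()


def trusting_client_context(ca_file: str) -> ssl.SSLContext:
    """Client context that additionally trusts the certificate(s) in ca_file.

    ca_file is a hypothetical path to a PEM file, for example an exported
    self-signed certificate.
    """
    context = ssl.create_default_context()
    context.load_verify_locations(cafile=ca_file)
    return context
```

A client using the strict context would reject a server that presents a self-signed certificate, which is why such certificates are mostly suited to internal or test use unless clients are explicitly configured to trust them.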

####Pre-Requisites

  • The Java JDK has to be
@karanth
karanth / wsdl-pub.md
Created January 24, 2014 02:17
Notes on WSDL - Web Services Description Language

WSDL is an XML document that describes the contract a Web Service provides to its clients. The best way to explain a WSDL is to look at its XSD, the rules that govern the WSDL XML document. An example WSDL from the web is broken down below and each element type is visited. The version of the WSDL below is 1.1. There is a newer 2.0 version of the WSDL XSD that has streamlined a few elements. However, Exchange Web Services still seems to use 1.1. This particular Web Service gives stock quotes based on symbols.

<wsdl:definitions xmlns:soap="http://schemas.xmlsoap.org/wsdl/soap/"       
                  xmlns:tm="http://microsoft.com/wsdl/mime/textMatching/"
                  xmlns:soapenc="http://schemas.xmlsoap.org/soap/encoding/" 
                  xmlns:mime="http://schemas.xmlsoap.org/wsdl/mime/" 
                  xmlns:tns="http://www.webserviceX.NET/" 
                  xmlns:s="http://www.w3.org/2001/XMLSchema" 

xmlns:soap12="http://schemas.xmlsoap.org/wsdl/soa