Skip to content

Instantly share code, notes, and snippets.

@karanth
Last active August 29, 2015 13:56
Show Gist options
  • Save karanth/8912629 to your computer and use it in GitHub Desktop.
Save karanth/8912629 to your computer and use it in GitHub Desktop.
Lecture notes on data modeling for Big Data - I

The fundamental infrastructure to solve Big Data problems is a networked set of data and compute nodes that form a cluster. The nodes run on commodity hardware and data is distributed either through replication or sharding or most likely both. Traditionally, Relational Databases are the default data stores storing data in tables and complying with ACID for transactions.

ACID stands for,

Atomicity - Requires every operation/transaction to be boolean. A transaction either happens in its entirety on successful execution or does not happen at all when failures happen. The state of the system either is fully transformed or not transformed depending on the success of the operation.

Consistency - Requires that when an operation or transaction is executed, the data is valid under some preset rules. The states before and after that operation/transaction are always valid. Constraints in a database are a kind of these rules.

Isolation - Requires that concurrent operations or transactions are seen as operations or transactions that were executed serially. In other words, intermediate states of one transaction are not visible for another transaction that is being executed concurrently.

Durability - Requires that once a transaction is completed, the effects of the transaction are persistent against failures. A transaction's effects cannot be stored in volatile storage.

A data store complying with ACID for transactions in a distributed setting requires,

  • A consensus on when to commit (distributed commit protocol) a transaction, so that the atomicity guarantee is maintained. 2-Phase Commit (2PC) is a popular protocol to maintain atomicity in a distributed setting. These consensus protocols are blocking protocols, that is, all participating nodes block till a central master issues them a green signal to move on even though they would have finished their duties.
  • Locking to ensure isolation of concurrent transactions during the duration of the distributed commit protocol. Locking can be of different granularities like row-level, table-level, index-level etc. But they too provide isolation by blocking.

As the volume of transactions increase in an ACID-compliant system, blocking to ensure isolation and atomicity, coupled with network round-trips to achieve them, leads to higher number of lock contentions and higher latency respectively, bringing down the transaction throughput. Clearly, this is a non-starter for scaling and latency.

With replication, the problem aggravates as all replicas have to provide strong consistency guarantees. This can be achieved by,

  • Actively performing parallel transactions on replicas, committing the transaction after all replicas perform the transaction successfully.
  • Introducing latency and propagating transaction steps from a primary/master to the replicas or in a completely asynchronous mode through log-shipping.

The former method fails to scale for exactly the same problems that arise with ACID compliance in a distributed setting. The latter method involves latency that may provide stale reads on the replicas till they come upto speed or leave the system in an inconsistent state, if there is a message loss between the primary and its replicas, weakening the strong consistency guarantees.

In summary, the scaling problems are not because systems have a relational model, rather because of the ACID guarantees they provide. NoSQL stores relax these ACID guarantees and provide better scaling and availability. These stores can be termed as [NoACID] (http://dbmsmusings.blogspot.in/2010/08/problems-with-acid-and-how-to-fix-them.html) stores to avoid confusion.

NoSQL stores loosen up ACID guarantees by,

  • Limiting transactions and ACID guarantees to a single logical unit of data, all of which is present within a single node. For example, transactions are allowed to operate on a single key-value pair in a key-value store and the system provides atomicity and isolation guarantees for the transaction. Any arbitrary transaction is prohibited, and complex transactions involving multiple nodes will have to be handled as separate transactions by the application.
  • Allowing for latency in bringing replicas to a consistent (same data across all replicas) state, often termed as Eventual Consistency. High availability on network failures is guaranteed but with a weakening of consistency. On a sunny day, consistency is guaranteed but with latency.

In the literature, these loosening up of ACID guarantees to describe NoSQL systems are termed as BASE.

BASE stands for,

Basically Available - The capabilities of the system are available whether there is a failure in the system or not.

Soft State - Since BASE is itself contrived, Soft State has a couple of interpretations. One interpretation says that the state can change without user intervention as all the replicas converge because of eventual consistency. The other interpretation is that, unlike hard state, soft state is state that is time-bound and needs to be refreshed. In the NoSQL context, it could mean that maintaining states when replicas are not-convergent is the responsibility of the application developer.

Eventual Consistency - The state of all replicas will reach a consistent state with some latency.

Given the existence of these NoSQL systems and acknowledging the fact that sharded MySQL relational database is such a system too, there are broadly 4 other kinds of NoSQL stores based on the data structures they support. They are,

  • Key-Value Stores
  • Column-Family Stores
  • Document Stores
  • Graph Stores

Column-Family stores should not be confused with Column-Oriented stores. Column-Orientedness for a data store is an attribute that comes from how data is stored in the database. Each column value is contigously stored followed by the next column etc. This is in contrast with row-oriented stores where rows are contigously stored. Column-Oriented stores are fast at aggregating and summarizing columns as columns are grouped together. Row-Oriented stores are performant at getting all columns for a set of records. They are a different concept than Column-Family stores. A loose analogy for Row-Oriented vs Column-Oriented stores are row-major and column-major matrices respectively.

To be continued...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment