Lecture notes on data modeling for Big Data - II

[Part I](https://gist.github.com/karanth/8912629) showed the scalability problems posed by distributed transactions and the need for data stores with weaker ACID guarantees. These data stores are, somewhat confusingly, called NoSQL stores. Atomicity and isolation are guaranteed in these stores only within a particular data-domain (for example, at the key level in a key-value store), where the data-domain resides on a single node.

In most NoSQL stores, knowledge of the data's structure is a prerequisite, as the structure may not be relational. The two main principles used in modeling data for NoSQL stores are:

  • In a relational store that complies with ACID, the availability of data drives the modeling. The question asked by the person modeling the store is [What answers do I have?](http://highlyscalable.wordpress.com/2012/03/01/nosql-data-modeling-techniques/). The design focus is on query time rather than on data store modeling/design time, which makes relational stores good for ad hoc querying. In NoSQL stores, however, the main question is, What questions do I need answers for?, something that can be answered only at modeling time. This makes modeling NoSQL stores application-specific: the design of a NoSQL store is tightly coupled to the questions the application will ask at runtime.
  • Relational stores keep duplication of data to a minimum by advocating normalization of the relations in the store. NoSQL stores, by contrast, advocate denormalization, in effect advocating data duplication. This follows from the isolation introduced by data-domains to ensure atomicity: either the application takes on the onus of assembling data from different data-domains by implementing its own distributed commit protocol, or it duplicates data so that as much of the transaction as possible completes within a single data-domain. Applications prefer the latter, keeping themselves highly available and performant.

Consider a data store containing a simple data set: the license plate number of a car and the corresponding color of the car. A relational store for this data set is a two-column table.

    CREATE TABLE lic_color(
        lic VARCHAR(7) NOT NULL,
        color VARCHAR(10) NOT NULL
    )

Not only can the color of the car with a particular license plate number be looked up, but other kinds of queries are also possible: the number of white cars, the license plates of all red cars, and so on. The answers are not application-specific; they depend purely on the availability of the data.

    SELECT color FROM lic_color WHERE lic = '123ABC';
    SELECT COUNT(*) FROM lic_color WHERE color = 'white';
    SELECT lic FROM lic_color WHERE color = 'red';

Now, if we model the same data in a key-value store like Redis, we can have either the license plate numbers as keys, the colors as keys, or a composite key combining the two. Key-value stores cannot be queried on values, so the choice of keys depends on the application. Will the application only ask questions like, give me the color of the car with license plate number '123ABC'? Or will it ask for all 'white' cars? Or both? The data store model depends entirely on these design-time questions.
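A minimal sketch of keying the same data set both ways (in Python with the redis-py client; the key-name scheme and a locally running Redis server are assumptions for illustration):

```python
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Keyed by license plate: answers "what color is car 123ABC?"
r.set("lic:123ABC", "white")

# Keyed by color: answers "which cars are white?"
# A set is used because many plates can share one color.
r.sadd("color:white", "123ABC")

print(r.get("lic:123ABC"))        # -> "white"
print(r.smembers("color:white"))  # -> {"123ABC"}
```

Note that answering both questions requires writing the same fact under two keys; this is exactly the data duplication that the first technique below formalizes.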

The two important modeling techniques are:

#####Denormalization

Denormalization is the introduction of redundant data, even though, strictly speaking, the duplication is not necessary. When duplicate data is brought in, two trade-offs are made:

  • IO per request vs. total data volume - Denormalization, if done well, brings more locality to the data, and the amount of IO per request comes down. A normalized relational store has to query multiple tables to fulfill a request, leading to non-localized fetches and therefore more IO. On the downside, denormalization increases the total data volume in the store.
  • Processing complexity vs. total data volume - Querying normalized data is complex because the application has to 'fetch and assemble' data, performing joins to fulfill queries. Processing denormalized data is simpler, but at the cost of increased total data volume in the store.

Normalization requires joins between relations to produce results. Most NoSQL databases prohibit joins to aid scaling and encourage denormalization instead. The technique for denormalization is to perform design-time joins: duplicate, upfront, the data that a runtime join would otherwise have to fetch. This again emphasizes the need to know the runtime questions at design time, as the sketch below illustrates.
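For illustration (a hypothetical customer/order example, not from the original notes), the sketch below contrasts a normalized layout, where the application must assemble orders and customers itself, with a denormalized layout where the customer's name is duplicated into each order at design/write time:

```python
# Normalized: orders reference customers by id; the application
# must do a runtime "fetch and assemble" (an application-level join).
customers = {"c1": {"name": "Alice"}}
orders_normalized = [{"order_id": "o1", "customer_id": "c1", "total": 40}]

def order_with_name(order):
    # Two lookups, possibly against two different nodes.
    return {**order, "customer_name": customers[order["customer_id"]]["name"]}

# Denormalized: the join is done once, upfront, by duplicating the
# customer name into every order record.
orders_denormalized = [
    {"order_id": "o1", "customer_id": "c1", "customer_name": "Alice", "total": 40}
]

print(order_with_name(orders_normalized[0]))
print(orders_denormalized[0])  # one local read, no assembly needed
```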

#####Aggregations leveraging Soft Schemas

Most NoSQL stores provide a soft schema, i.e., constraints on the schema are enforced at the application layer, unlike in an ACID-compliant RDBMS. This allows arbitrary structure and nesting in data store records. It also allows heterogeneous record types, i.e., two records may have different value structures. For example, a key-value store does not dictate what is stored in the value; in Redis, a user may store a data structure as a value, such as a list, a map, or a set.

This soft-schema feature can be used to reduce joins by nesting entities. It can also eliminate one-to-many relationships by folding all the related entities into the record value itself (image courtesy http://highlyscalable.files.wordpress.com/2012/02/soft-schema2.png).

![Soft schema: a product modeled as multiple relational tables vs. a single nested document](http://highlyscalable.files.wordpress.com/2012/02/soft-schema2.png)

The picture shows how to model products in both a NoSQL document store and a relational store. A normalized relational store requires many relations to specify every attribute of an entity, whereas the soft schema of a NoSQL store allows all of those relations to be aggregated, or folded, into a single entity type.
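A minimal sketch of such an aggregation (hypothetical product records, with Python dicts standing in for document-store records): two products carry different, nested value structures, yet live side by side in the same store.

```python
import json

# Heterogeneous, nested records: each product folds its one-to-many
# relations (authors, tracks) into a single document. No two documents
# need to share the same structure.
products = [
    {
        "id": "p1",
        "type": "book",
        "title": "Designing Data Stores",
        "authors": ["A. Writer", "B. Scribe"],  # one-to-many, nested in place
    },
    {
        "id": "p2",
        "type": "album",
        "title": "Greatest Hits",
        "tracks": [{"no": 1, "name": "Intro", "seconds": 95}],
    },
]

# A document store would persist each record as a single value,
# e.g. serialized as JSON under its product id.
for p in products:
    print(p["id"], "->", json.dumps(p))
```

Fetching a product is then a single-key read; no joins are needed to reassemble its attributes.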
