neerajgoel82/books--nosql-distilled

## books--nosql-distilled
Relational databases have been a successful technology for twenty years providing
- persistence
- concurrency control (Multiple apps and multiple users access the DB at the same time)
- integration mechanism (This is what prevented object oriented DBs to flourish)

Drawbacks of Relational DBs
- Impedance mismatch (In-memory(object) model of an application is different from (relational) model on disk).
  That's why there are ORM frameworks which lead to loss of performance
- They are not designed to run efficiently on clusters

Why they are picking up now
- There is a movement away from using databases as integration points towards encapsulating databases within applications
and integrating through (web) services.
- The vital factor for a change in data storage was the need to support large volumes of data by running on clusters.
- Relational DBs can be very costly to support large volume of data

The common characteristics of NoSQL databases are
• Not using the relational model
• Running well on clusters (except for Graph DBs)
• Open-source
• Built for the 21st century web estates
• Schemaless

The most important result of the rise of NoSQL is Polyglot Persistence.
=====================================================================================================================
How to choose the right DB
=====================================================================================================================
When you have to decide on a NoSQL database for your project, following are the things you have to choose
- Data Model
- Distribution Model

Once you've decided on the above two you need to make the choice of the vendor which involves minor nitty-gritty and things like performance
But the above two are the broad choices in terms of design that need to be made.

The factors which determine your choices are:
- Domain Objects Design which determine data model (Eg. lots of relationships or no relations etc)
- Consistency/Availability/Latency requirements (Read CAP, PACELC)/ Durability
- Access Pattern (Read heavy, Write Heavy etc, single machine/multiple machines etc)

=====================================================================================================================
Choice 1 - Data Models
=====================================================================================================================
Different kind of data models in NoSQL are :
- Key-Value
- Document
- Column Family
- Graph

What they all share is the notion of an aggregate indexed by a key that you can use for lookup. An aggregate is a collection of data that
we interact with as a unit. This aggregate is central to running on a cluster, as the database will ensure that all the data for an
aggregate is stored together on one node. The aggregate also acts as the atomic unit for updates, providing a useful, if limited,
amount of transactional control. Essentially Aggregates form the boundaries for ACID operations with the database.

Within that notion of aggregate, we have some differences. The key-value data model treats the aggregate as an opaque whole, which
means you can only do key lookup for the whole aggregate—you cannot run a query nor retrieve a part of the aggregate.

The document model makes the aggregate transparent to the database allowing you to do queries and partial retrievals. However, since
the document has no schema, the database cannot act much on the structure of the document to optimize the storage and retrieval of
parts of the aggregate.

Column-family models divide the aggregate into column families, allowing the database to treat them as units
of data within the row aggregate. This imposes some structure on the aggregate but allows the database to take advantage of that
structure to improve its accessibility.

Aggregate-oriented databases work best when most data interaction is done with the same aggregate; aggregate-ignorant databases (like
Relational DBs) are better when interactions use data organized in many different formations.

=====================================================================================================================
Choice 2 - Distribution Models
=====================================================================================================================
There are two styles of distributing data:
• Sharding distributes different data across multiple servers, so each server acts as the single
source for a subset of data.
• Replication copies data across multiple servers, so each bit of data can be found in multiple
places.

A system may use either or both techniques.

Replication comes in two forms:
• Master-slave replication makes one node the authoritative copy that handles writes while
slaves synchronize with the master and may handle reads.
• Peer-to-peer replication allows writes to any node; the nodes coordinate to synchronize their
copies of the data.

Master-slave replication reduces the chance of update conflicts but peer-to-peer replication
avoids loading all writes onto a single point of failure.

=====================================================================================================================
Factor 1 - Domain Objects Design
=====================================================================================================================
This primarily depends on the domain for which you are creating the software. The different data models directly comes from the
design of the objects in your software, what kind of relationships are there, how these objects are accessed etc.

=====================================================================================================================
Factor 2 - Consistency Requirements
=====================================================================================================================
There are two kinds of consistency issues:
- Logical inconsistency
  Write-write conflicts occur when two clients try to write the same data at the same time. Eg. A, B read the same data and update
  that together. Then you will lose update of either A or B depending on which one reaches the server later. This can be solved by having
  conditional write constructs (equivalent of CAS)

  Read-write conflicts occur when one client reads inconsistent data in the middle of another client’s write. Eg. A updates a part of
  an aggregate and B reads the aggregate after that and A updates the remaining part of the aggregate. This can happen a lot in RDBMS as
  the aggregates are not stored as a whole and hence need locking so that the read does not happen between a write. In NoSQL DBs this
  happens less because the aggregate is stored as a whole and is updated together. So, this actually reduces the need for having 'C'
  of ACID properties in NoSQL DBs. Although the same issue can happen when updating multiple aggregates in a single transaction. Then you
  would need transactional semantics of RDBMSs. So, you need to think if that is the case with you. The time window for which there is
  inconsistency in the system (i.e. one aggregate was updated but the dependent aggregate was not) is known as Inconsistency Window. If
  the DB is installed on a single machine this window would be smaller. On multiple machines, this window can be bigger.

  Please note that these inconsistencies will happen even if we are running NoSQL DB on a single machine as well. Replication can make it
  worse but they will be there. So, these inconsistencies are not because of replication.

  The two broad approaches to solve these are
  - Pessimistic approaches lock data records to prevent conflicts.
  - Optimistic approaches detect conflicts and fix them.

- Replication inconsistency - These are inconsistencies which happen because of replication.
  •  Distributed systems see read-write conflicts due to some nodes having received updates while other nodes have not.
  Eventual consistency means that at some point the system will become consistent once all the writes have propagated to all the nodes.
  • Clients usually want read-your-writes consistency, which means a client can write and then immediately read the new value.
  This can be difficult if the read and the write happen on different nodes. One way to achieve this is introducing sticky sessions.
  • To get good consistency, you need to involve many nodes in data operations, but this increases latency. So you often have to trade
  off consistency versus latency.
  • The CAP theorem states that if you get a network partition, you have to trade off availability of data versus consistency.
  • Durability - We can lose data if that is not replicated and a crash happens. We can trade off durability as well against latency,
  particularly if you want to survive failures with replicated data.
  • You do not need to contact all replicants to preserve strong consistency with replication; you just need a large enough quorum.

=====================================================================================================================
Factor 3 - Access Pattern
=====================================================================================================================
Basically what this means is that you need to see what kind of access requirements are there in the system. Is it read-heavy, write-heavy
or both? You can tweak things like quorum parameters if it is read heavy vs write heavy. Also, you may also decide whether you need
master-slave replication vs p2p replication based on this.


=====================================================================================================================
Version Stamps
=====================================================================================================================
Version stamps help you detect concurrency conflicts. When you read data, then update it, youcan check the version stamp to
ensure nobody updated the data between your read and write.
• Version stamps can be implemented using counters, GUIDs, content hashes, timestamps, or a
combination of these.
• With distributed systems, a vector of version stamps allows you to detect when different nodes
have conflicting updates.


=====================================================================================================================
Features to compare databases
=====================================================================================================================
- consistency
- transactions
- query features
- structure of the data
- scaling


=====================================================================================================================
Based on discussions with MCDP team members  - HBase vs Cassandra
=====================================================================================================================
- See if you need consistency or not
- Check whether its a multi region deployment or not and they check from where the reads are happening ... in HBase read might be slow
because of multiregion deployment
- Check what is the nature of the workload --- read heavy or write heavy .. based on that you can choose which option to make slow if
you need consistency
- Check if you need range queries or not .. cassandra is not a good option if you need range queries
=====================================================================================================================
	Relational databases have been a successful technology for twenty years providing
	- persistence
	- concurrency control (Multiple apps and multiple users access the DB at the same time)
	- integration mechanism (This is what prevented object oriented DBs to flourish)

	Drawbacks of Relational DBs
	- Impedance mismatch (In-memory(object) model of an application is different from (relational) model on disk).
	That's why there are ORM frameworks which lead to loss of performance
	- They are not designed to run efficiently on clusters

	Why they are picking up now
	- There is a movement away from using databases as integration points towards encapsulating databases within applications
	and integrating through (web) services.
	- The vital factor for a change in data storage was the need to support large volumes of data by running on clusters.
	- Relational DBs can be very costly to support large volume of data

	The common characteristics of NoSQL databases are
	• Not using the relational model
	• Running well on clusters (except for Graph DBs)
	• Open-source
	• Built for the 21st century web estates
	• Schemaless

	The most important result of the rise of NoSQL is Polyglot Persistence.
	=====================================================================================================================
	How to choose the right DB
	=====================================================================================================================
	When you have to decide on a NoSQL database for your project, following are the things you have to choose
	- Data Model
	- Distribution Model

	Once you've decided on the above two you need to make the choice of the vendor which involves minor nitty-gritty and things like performance
	But the above two are the broad choices in terms of design that need to be made.

	The factors which determine your choices are:
	- Domain Objects Design which determine data model (Eg. lots of relationships or no relations etc)
	- Consistency/Availability/Latency requirements (Read CAP, PACELC)/ Durability
	- Access Pattern (Read heavy, Write Heavy etc, single machine/multiple machines etc)

	=====================================================================================================================
	Choice 1 - Data Models
	=====================================================================================================================
	Different kind of data models in NoSQL are :
	- Key-Value
	- Document
	- Column Family
	- Graph

	What they all share is the notion of an aggregate indexed by a key that you can use for lookup. An aggregate is a collection of data that
	we interact with as a unit. This aggregate is central to running on a cluster, as the database will ensure that all the data for an
	aggregate is stored together on one node. The aggregate also acts as the atomic unit for updates, providing a useful, if limited,
	amount of transactional control. Essentially Aggregates form the boundaries for ACID operations with the database.

	Within that notion of aggregate, we have some differences. The key-value data model treats the aggregate as an opaque whole, which
	means you can only do key lookup for the whole aggregate—you cannot run a query nor retrieve a part of the aggregate.

	The document model makes the aggregate transparent to the database allowing you to do queries and partial retrievals. However, since
	the document has no schema, the database cannot act much on the structure of the document to optimize the storage and retrieval of
	parts of the aggregate.

	Column-family models divide the aggregate into column families, allowing the database to treat them as units
	of data within the row aggregate. This imposes some structure on the aggregate but allows the database to take advantage of that
	structure to improve its accessibility.

	Aggregate-oriented databases work best when most data interaction is done with the same aggregate; aggregate-ignorant databases (like
	Relational DBs) are better when interactions use data organized in many different formations.

	=====================================================================================================================
	Choice 2 - Distribution Models
	=====================================================================================================================
	There are two styles of distributing data:
	• Sharding distributes different data across multiple servers, so each server acts as the single
	source for a subset of data.
	• Replication copies data across multiple servers, so each bit of data can be found in multiple
	places.

	A system may use either or both techniques.

	Replication comes in two forms:
	• Master-slave replication makes one node the authoritative copy that handles writes while
	slaves synchronize with the master and may handle reads.
	• Peer-to-peer replication allows writes to any node; the nodes coordinate to synchronize their
	copies of the data.

	Master-slave replication reduces the chance of update conflicts but peer-to-peer replication
	avoids loading all writes onto a single point of failure.

	=====================================================================================================================
	Factor 1 - Domain Objects Design
	=====================================================================================================================
	This primarily depends on the domain for which you are creating the software. The different data models directly comes from the
	design of the objects in your software, what kind of relationships are there, how these objects are accessed etc.

	=====================================================================================================================
	Factor 2 - Consistency Requirements
	=====================================================================================================================
	There are two kinds of consistency issues:
	- Logical inconsistency
	Write-write conflicts occur when two clients try to write the same data at the same time. Eg. A, B read the same data and update
	that together. Then you will lose update of either A or B depending on which one reaches the server later. This can be solved by having
	conditional write constructs (equivalent of CAS)

	Read-write conflicts occur when one client reads inconsistent data in the middle of another client’s write. Eg. A updates a part of
	an aggregate and B reads the aggregate after that and A updates the remaining part of the aggregate. This can happen a lot in RDBMS as
	the aggregates are not stored as a whole and hence need locking so that the read does not happen between a write. In NoSQL DBs this
	happens less because the aggregate is stored as a whole and is updated together. So, this actually reduces the need for having 'C'
	of ACID properties in NoSQL DBs. Although the same issue can happen when updating multiple aggregates in a single transaction. Then you
	would need transactional semantics of RDBMSs. So, you need to think if that is the case with you. The time window for which there is
	inconsistency in the system (i.e. one aggregate was updated but the dependent aggregate was not) is known as Inconsistency Window. If
	the DB is installed on a single machine this window would be smaller. On multiple machines, this window can be bigger.

	Please note that these inconsistencies will happen even if we are running NoSQL DB on a single machine as well. Replication can make it
	worse but they will be there. So, these inconsistencies are not because of replication.

	The two broad approaches to solve these are
	- Pessimistic approaches lock data records to prevent conflicts.
	- Optimistic approaches detect conflicts and fix them.

	- Replication inconsistency - These are inconsistencies which happen because of replication.
	• Distributed systems see read-write conflicts due to some nodes having received updates while other nodes have not.
	Eventual consistency means that at some point the system will become consistent once all the writes have propagated to all the nodes.
	• Clients usually want read-your-writes consistency, which means a client can write and then immediately read the new value.
	This can be difficult if the read and the write happen on different nodes. One way to achieve this is introducing sticky sessions.
	• To get good consistency, you need to involve many nodes in data operations, but this increases latency. So you often have to trade
	off consistency versus latency.
	• The CAP theorem states that if you get a network partition, you have to trade off availability of data versus consistency.
	• Durability - We can lose data if that is not replicated and a crash happens. We can trade off durability as well against latency,
	particularly if you want to survive failures with replicated data.
	• You do not need to contact all replicants to preserve strong consistency with replication; you just need a large enough quorum.

	=====================================================================================================================
	Factor 3 - Access Pattern
	=====================================================================================================================
	Basically what this means is that you need to see what kind of access requirements are there in the system. Is it read-heavy, write-heavy
	or both? You can tweak things like quorum parameters if it is read heavy vs write heavy. Also, you may also decide whether you need
	master-slave replication vs p2p replication based on this.


	=====================================================================================================================
	Version Stamps
	=====================================================================================================================
	Version stamps help you detect concurrency conflicts. When you read data, then update it, youcan check the version stamp to
	ensure nobody updated the data between your read and write.
	• Version stamps can be implemented using counters, GUIDs, content hashes, timestamps, or a
	combination of these.
	• With distributed systems, a vector of version stamps allows you to detect when different nodes
	have conflicting updates.


	=====================================================================================================================
	Features to compare databases
	=====================================================================================================================
	- consistency
	- transactions
	- query features
	- structure of the data
	- scaling


	=====================================================================================================================
	Based on discussions with MCDP team members - HBase vs Cassandra
	=====================================================================================================================
	- See if you need consistency or not
	- Check whether its a multi region deployment or not and they check from where the reads are happening ... in HBase read might be slow
	because of multiregion deployment
	- Check what is the nature of the workload --- read heavy or write heavy .. based on that you can choose which option to make slow if
	you need consistency
	- Check if you need range queries or not .. cassandra is not a good option if you need range queries
	=====================================================================================================================