@neerajgoel82
Last active August 25, 2020 05:00
Relational databases have been a successful technology for twenty years providing
- persistence
- concurrency control (Multiple apps and multiple users access the DB at the same time)
- integration mechanism (This is what prevented object-oriented DBs from flourishing)
Drawbacks of Relational DBs
- Impedance mismatch (The in-memory (object) model of an application is different from the (relational) model on disk).
That's why there are ORM frameworks, which can cost performance
- They are not designed to run efficiently on clusters
Why NoSQL databases are picking up now
- There is a movement away from using databases as integration points towards encapsulating databases within applications
and integrating through (web) services.
- The vital factor for a change in data storage was the need to support large volumes of data by running on clusters.
- Relational DBs can be very costly when they have to support large volumes of data
The common characteristics of NoSQL databases are
• Not using the relational model
• Running well on clusters (except for Graph DBs)
• Open-source
• Built for the 21st century web estates
• Schemaless
The most important result of the rise of NoSQL is Polyglot Persistence.
=====================================================================================================================
How to choose the right DB
=====================================================================================================================
When you have to decide on a NoSQL database for your project, these are the things you have to choose:
- Data Model
- Distribution Model
Once you've decided on the above two, you need to choose a vendor, which involves the nitty-gritty details and things like performance.
But the above two are the broad design choices that need to be made.
The factors which determine your choices are:
- Domain Objects Design, which determines the data model (e.g. lots of relationships or no relations etc)
- Consistency/Availability/Latency/Durability requirements (read up on CAP, PACELC)
- Access Pattern (read-heavy, write-heavy etc; single machine/multiple machines etc)
=====================================================================================================================
Choice 1 - Data Models
=====================================================================================================================
The different kinds of data models in NoSQL are:
- Key-Value
- Document
- Column Family
- Graph
What they all share is the notion of an aggregate indexed by a key that you can use for lookup. An aggregate is a collection of data that
we interact with as a unit. This aggregate is central to running on a cluster, as the database will ensure that all the data for an
aggregate is stored together on one node. The aggregate also acts as the atomic unit for updates, providing a useful, if limited,
amount of transactional control. Essentially Aggregates form the boundaries for ACID operations with the database.
Within that notion of aggregate, we have some differences. The key-value data model treats the aggregate as an opaque whole, which
means you can only do key lookup for the whole aggregate—you cannot run a query nor retrieve a part of the aggregate.
The document model makes the aggregate transparent to the database allowing you to do queries and partial retrievals. However, since
the document has no schema, the database cannot act much on the structure of the document to optimize the storage and retrieval of
parts of the aggregate.
Column-family models divide the aggregate into column families, allowing the database to treat them as units
of data within the row aggregate. This imposes some structure on the aggregate but allows the database to take advantage of that
structure to improve its accessibility.
Aggregate-oriented databases work best when most data interaction is done with the same aggregate; aggregate-ignorant databases (like
Relational DBs) are better when interactions use data organized in many different formations.
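To make the difference between an opaque and a transparent aggregate concrete, here is a minimal in-memory sketch. The order aggregate and the KeyValueStore / DocumentStore classes are hypothetical names made up for illustration; they do not correspond to any real database API.

```python
import json

# Hypothetical order aggregate: everything the application reads or writes as one unit.
order = {
    "id": "order-99",
    "customer_id": "cust-1",
    "items": [
        {"sku": "book-123", "qty": 2, "price": 30.0},
        {"sku": "pen-7", "qty": 1, "price": 2.5},
    ],
    "shipping_address": {"city": "Chicago"},
}


class KeyValueStore:
    """Key-value model: the value is an opaque blob, so only whole-aggregate lookup by key."""

    def __init__(self):
        self._data = {}

    def put(self, key, aggregate):
        # The store only ever sees bytes; it knows nothing about the structure inside.
        self._data[key] = json.dumps(aggregate).encode("utf-8")

    def get(self, key):
        return json.loads(self._data[key].decode("utf-8"))


class DocumentStore:
    """Document model: the store can see inside the aggregate."""

    def __init__(self):
        self._docs = {}

    def put(self, key, doc):
        self._docs[key] = doc

    def get_field(self, key, path):
        # Partial retrieval: walk a dotted path such as "shipping_address.city".
        value = self._docs[key]
        for part in path.split("."):
            value = value[part]
        return value

    def find(self, predicate):
        # Query on document contents instead of on the key.
        return [key for key, doc in self._docs.items() if predicate(doc)]


kv = KeyValueStore()
kv.put(order["id"], order)
whole_order = kv.get("order-99")  # the only possible access: the full aggregate by key

docs = DocumentStore()
docs.put(order["id"], order)
print(docs.get_field("order-99", "shipping_address.city"))  # Chicago
print(docs.find(lambda d: any(i["sku"] == "pen-7" for i in d["items"])))  # ['order-99']
```

In both cases the aggregate is still stored and fetched under a single key; the document store simply adds the ability to look inside it.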
=====================================================================================================================
Choice 2 - Distribution Models
=====================================================================================================================
There are two styles of distributing data:
• Sharding distributes different data across multiple servers, so each server acts as the single
source for a subset of data.
• Replication copies data across multiple servers, so each bit of data can be found in multiple
places.
A system may use either or both techniques.
Replication comes in two forms:
• Master-slave replication makes one node the authoritative copy that handles writes while
slaves synchronize with the master and may handle reads.
• Peer-to-peer replication allows writes to any node; the nodes coordinate to synchronize their
copies of the data.
Master-slave replication reduces the chance of update conflicts but peer-to-peer replication
avoids loading all writes onto a single point of failure.
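A rough sketch of how sharding and replication can be combined: a hash of the key picks the node that owns it (sharding), and the next few nodes in the list also hold copies (replication). The node names, the replication factor, and the simple modulo placement are assumptions made for illustration; real systems typically use schemes such as consistent hashing so that adding a node does not move most keys.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical cluster


def shard_for(key, nodes=NODES):
    """Sharding: a stable hash of the key picks the index of the single owning node."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % len(nodes)


def replicas_for(key, replication_factor=3, nodes=NODES):
    """Sharding plus replication: the owner and the following nodes each hold a copy.

    Assumes replication_factor <= len(nodes), otherwise nodes would repeat.
    """
    start = shard_for(key, nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replication_factor)]


for key in ["user:1", "user:2", "order:99"]:
    print(key, "-> owner", NODES[shard_for(key)], "copies on", replicas_for(key))
```

With master-slave replication, the first node in the list would take the writes and the others would serve reads; with peer-to-peer replication, any of them could accept a write.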
=====================================================================================================================
Factor 1 - Domain Objects Design
=====================================================================================================================
This primarily depends on the domain for which you are creating the software. The choice of data model comes directly from the
design of the objects in your software: what kinds of relationships exist, how these objects are accessed etc.
=====================================================================================================================
Factor 2 - Consistency Requirements
=====================================================================================================================
There are two kinds of consistency issues:
- Logical inconsistency
Write-write conflicts occur when two clients try to write the same data at the same time. E.g. A and B read the same data and both
update it. Then you will lose the update of either A or B: whichever reaches the server first is overwritten by the later one. This can
be solved by having conditional write constructs (the equivalent of CAS).
Read-write conflicts occur when one client reads inconsistent data in the middle of another client’s write. E.g. A updates a part of
an aggregate, B reads the aggregate, and only then does A update the remaining part of the aggregate. This can happen a lot in an RDBMS
because the aggregates are not stored as a whole, so locking is needed to keep a read from landing in the middle of a write. In NoSQL DBs
this happens less because the aggregate is stored as a whole and is updated together. So, this actually reduces the need for having the 'C'
of ACID properties in NoSQL DBs. The same issue can still happen when updating multiple aggregates in a single logical operation, though.
Then you would need the transactional semantics of an RDBMS. So, you need to think about whether that is the case for you. The time window
during which the system is inconsistent (i.e. one aggregate was updated but the dependent aggregate was not) is known as the
inconsistency window. If the DB is installed on a single machine this window would be smaller. On multiple machines, this window can be bigger.
Please note that these inconsistencies can happen even if we are running a NoSQL DB on a single machine. Replication can make them
worse, but it is not what causes them.
The two broad approaches to solve these are
- Pessimistic approaches lock data records to prevent conflicts.
- Optimistic approaches detect conflicts and fix them.
- Replication inconsistency - These are inconsistencies which happen because of replication.
• Distributed systems see read-write conflicts due to some nodes having received updates while other nodes have not.
Eventual consistency means that at some point the system will become consistent once all the writes have propagated to all the nodes.
• Clients usually want read-your-writes consistency, which means a client can write and then immediately read the new value.
This can be difficult if the read and the write happen on different nodes. One way to achieve this is introducing sticky sessions.
• To get good consistency, you need to involve many nodes in data operations, but this increases latency. So you often have to trade
off consistency versus latency.
• The CAP theorem states that if you get a network partition, you have to trade off availability of data versus consistency.
• Durability - We can lose data if it is not replicated and a crash happens. Durability can also be traded off against latency,
particularly if you want to survive failures with replicated data.
• You do not need to contact all replicas to preserve strong consistency with replication; you just need a large enough quorum.
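The quorum point can be made concrete with the usual rule of thumb: with N replicas, W nodes confirming each write and R nodes contacted on each read, reads are strongly consistent when R + W > N, and two concurrent writes cannot both succeed when W > N/2. The function names below are made up for this sketch; real databases expose these settings under their own names.

```python
def read_is_strongly_consistent(n, w, r):
    """True when every read quorum overlaps every write quorum (R + W > N)."""
    return r + w > n


def writes_cannot_conflict(n, w):
    """True when two concurrent writes cannot both reach a write quorum (W > N / 2)."""
    return 2 * w > n


N = 3  # assumed replication factor
for w, r in [(2, 2), (3, 1), (1, 3), (1, 1)]:
    print(
        f"N={N} W={w} R={r}:",
        "consistent reads" if read_is_strongly_consistent(N, w, r) else "stale reads possible",
        "/",
        "no write-write conflicts" if writes_cannot_conflict(N, w) else "write-write conflicts possible",
    )
```

W=3, R=1 keeps reads cheap, which suits a read-heavy workload. W=1, R=3 makes writes cheap and still gives overlapping read and write quorums, but two concurrent writes can then land on different nodes. W=1, R=1 is the fastest and gives up consistency on both counts.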
=====================================================================================================================
Factor 3 - Access Pattern
=====================================================================================================================
Basically what this means is that you need to see what kind of access requirements there are in the system. Is it read-heavy, write-heavy
or both? You can tweak things like quorum parameters depending on whether it is read-heavy or write-heavy. You may also decide whether
you need master-slave replication or p2p replication based on this.
=====================================================================================================================
Version Stamps
=====================================================================================================================
Version stamps help you detect concurrency conflicts. When you read data and then update it, you can check the version stamp to
ensure nobody updated the data between your read and write.
• Version stamps can be implemented using counters, GUIDs, content hashes, timestamps, or a
combination of these.
• With distributed systems, a vector of version stamps allows you to detect when different nodes
have conflicting updates.
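A minimal sketch of both ideas, using a simple counter as the version stamp and a dict of per-node counters as the vector stamp. VersionedStore, VersionConflict and compare_vector_stamps are hypothetical names for this illustration, and the store is single-threaded; a real database would perform the check and the write as one atomic step on the server.

```python
class VersionConflict(Exception):
    """Raised when the data changed between the caller's read and write."""


class VersionedStore:
    def __init__(self):
        self._data = {}  # key -> (version, value)

    def read(self, key):
        # Returns (version, value); a missing key is reported as version 0.
        return self._data.get(key, (0, None))

    def write(self, key, value, expected_version):
        # Conditional write (CAS style): succeed only if nobody wrote since our read.
        current_version, _ = self._data.get(key, (0, None))
        if current_version != expected_version:
            raise VersionConflict(
                f"{key}: expected version {expected_version}, found {current_version}"
            )
        self._data[key] = (current_version + 1, value)
        return current_version + 1


def compare_vector_stamps(a, b):
    """Compare two vector stamps given as dicts of node name -> counter."""
    nodes = set(a) | set(b)
    a_ahead = any(a.get(n, 0) > b.get(n, 0) for n in nodes)
    b_ahead = any(b.get(n, 0) > a.get(n, 0) for n in nodes)
    if a_ahead and b_ahead:
        return "conflict"  # each side has seen an update the other has not
    if a_ahead:
        return "a is newer"
    if b_ahead:
        return "b is newer"
    return "equal"


store = VersionedStore()
store.write("room-101", "free", expected_version=0)

version_a, _ = store.read("room-101")  # client A reads
version_b, _ = store.read("room-101")  # client B reads the same version
store.write("room-101", "booked by A", expected_version=version_a)
try:
    store.write("room-101", "booked by B", expected_version=version_b)
except VersionConflict as err:
    print("B must re-read and retry:", err)

print(compare_vector_stamps({"blue": 2, "green": 1}, {"blue": 1, "green": 2}))  # conflict
```

This is the optimistic approach: nothing is locked, and the second writer finds out about the conflict only when its write is rejected.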
=====================================================================================================================
Features to compare databases
=====================================================================================================================
- consistency
- transactions
- query features
- structure of the data
- scaling
=====================================================================================================================
Based on discussions with MCDP team members - HBase vs Cassandra
=====================================================================================================================
- See if you need consistency or not
- Check whether it's a multi-region deployment or not, and then check from where the reads are happening ... in HBase reads might be slow
because of a multi-region deployment
- Check what is the nature of the workload --- read-heavy or write-heavy ... based on that you can choose which side (reads or writes)
to make slow if you need consistency
- Check if you need range queries or not ... Cassandra is not a good option if you need range queries
=====================================================================================================================