shoaibi/DynamoDb.md

## DynamoDb.md

      
    Raw
  

              DynamoDb.md
            
          
    General


Data Pressure: Ability to process data within reasonable time and cost
Whenever one of the dimensions of Data Pressure is broken, we'd invent new technology.
RDBMs optimize for storage which makes sense as storage was expensive in 70s.  But this happens at the cost of CPU, where CPU is going insane pulling data from all over the disk and stitch it together. Nowadays storage is cheap and CPU isn't. So why use a system that depends on an expensive resource.
RDBMs/SQL gives structure with power to create ad hoc queries while noSQL gives instantiated views for given queries. In this way noSQL forces more product driven development where one has to discover access patterns.
noSQL != non-relation. All data in universe is relational.
Do not compare RDBMS with noSQL and specially MySQL and DynamoDb.

No, Table is not the same, and no Document is not same as records


For writes, DynamoDb uses 3 replica and returns when 2 confirm the write
Eventually consistent vs strictly consistent

EC could be achieved with using streams to update nested records
SC always queries the primary replica and hence isn't really scalable
EC is single digit millisecond behind Primary


noSQL us not flexible. It is efficient. The data is tightly coupled with access patterns.
Not good with answering count, avg, sum, min, max, ...

Forget everything you've learn for RDBMS

Concepts


Table
Item
Partition Key

Used to do unordered hash index e.g. routing


Sort Key

Sort order of data on the same partition node


Attributes
Secondary Indices

Local
Global


Streams

Secondary Indices

Local Secondary Index


Stored locally on same partition on a node
Contribute to partition size and follow same limit e.g. 10GB per partition key
Partition key remains the same, we change the sort key
Strongly consistent

Global Secondary Index


Used for supporting access patterns that can't be supported by Primary Key
Create a replica of table with the Secondary Index as Partition Key
Updates are replicated from main table
Storage costs could be a concern depending on how many GSI we'd have
Capacity planning could be critical

Base Table could have a capacity of 4 while the SI table could have 2
Extensive writes could cause a throttle on Base Table


Eventually consistent

Primary Key


Simple: Partition Key
Composite: Partition Key + Sort Key
DO NOT CONFUSE PK

Streams


Streams + Lambda could be use like a stored procedure, disconnected from table space.
Stream is like a changelog where everything gets written
Could invoke a lambda (Separate IAM roles for better security: Invocation[access to the table], Execution)
Commonly used for aggregations and written back as metadata item
For high velocity one could use ECS/EC2 to process the streams instead

API Actions


Item based

Writing, Updating, Deleting
Must provide entire primary key


Query

Must provide Partition key, Sort key is optional


Scan

Expensive for performance and costs


Transactions API


TransactWriteItems

Synchronize update, put, delete and check

Atomic
Automatic Rollback


Up to 10 items within a transaction
Supports multiple tables. DO NOT DO IT
Complex conditional checks


Good:

Commit changes across items
Condition batch inserts/updates


Bad

Maintaining normalized data


Design Process


Nature of the application

OLTP/OLAP/DSS
Data Life Cycle: TTL, Backup/Archival


Start with ERD
Define access patterns

Identify data sources
Document workflows
Read/Write workloads
Query dimensions and aggregations


Define Core Entity
Define primary keys and secondary indexes

Avoid using multiple tables, use one with different SK
1 application service = 1 table
How will data be read and written? Eventual consistency or strict?


Review -> Repeat -> Review

Design Principles

General


One table in DynamoDb can, and usually does, have data for more than one entity.
Always prefix the PK and SK with identifier or acronym.
Use SK to indicate 1-M relationships
Use indices instead of scan. Avoid FilterExpression as it does a scan
Use Reverse index for M-1
Items that aren't accessed separately should be part of a main record

Partition Key


High cardinality
Uniformly requested and randomly distributed
Examples

Bad: Status, Gender,
Good: CustomerId, DeviceId


Sort Key


Model 1:M and M:M relationships
Efficient/selective patterns, ability to query multiple entities with same Partition key but different sort key

No more: Get the Customer and then get all the records for that customer, or do some join.


Use Range Queries

Examples

eCommerce [1]

ERD


Users <-1:M-> Address
Users <-1:M-> Orders
Orders <-1:M-> OrderItems

Access Patterns


Get user profile
Get orders for user
Get single order and relevant order items
Get orders for user by status
Get Open Orders

Design

Users

PK, SK, Attributes
USER#sho, #PROFILE#sho, {"username": "sho", "email": "sho@domain.com"}

Profile belongs to user and would only be accessed through user, makes sense to make the SK to couple with PK.
Address


Addresses would only be accesses when a profile is accessed
A limit can be added for addresses, say 10.
Denormalize. Make address a json attribute of User

Orders


Keep User as PK but add Order as SK.

PK, SK, Attributes
USER#sho, ORDER#id, {"total": 22.90, "something: "something"}

Order Items


Belong to Orders, which belongs to user, won't exists without Order items
Can't do same as Address belong this is property of an object which is property of another object
We already have Order id as SK, keep that and change PK to be used for ITEM

PK, SK, Attributes
ITEM#shirt, ORDER#id, {"price": 12.90, "size: "L"}


Add first Secondary Index

Inverted index. SK becomes primary key, PK becomes sort key
Fetch by Order id

Returns relevant user and all items, different kind of data but that is expected and actually helpful, plus PK tells the kind of data.


Filtering access patterns


Can't use FilterExpression as one can't query across partition
Use Scan? 1M limit, expensive and will not return all the results, could even return empty when we have data

Get orders for user


PK = USER#sho AND BEGINS_WITH(SK, "ORDER#")

Get orders by status for user


Can not query the status attribute directly
Composite Sort key

Combine Status with status date as a new attribute
SHIPPED#2019-06-01
Create SI on the combined attribute
PK = USER#sho AND BEGINS_WITH(OrderStatusDate, "SHIPPED#")


Can filter also on dates with this

Get open orders


This looks like a scan for open orders
Sparse index pattern

Create a new attribute that is only set if order is open, value does not matter
Create a new SI
Do a scan on that new SI


Hierarchical Data [2]


Country > State > City > Office
Use composite sort to define hierarchy

State#City#OfficeId


Reduce query complexity

Primary Key, Attributes
USA, NY#NYC#JFK11, JFK....
USA, NY#NYC#JFK14, JFK...
USA, WA#SEA#BLACKFOOT, BLACKFOOT...
USA, WA#SEA#MAYDAY, MAYDAY...
CA, ....

Quotes

Mostly from [2] below.

History repeats itself because nobody was listening the first time. - Anonymous


We are stuck with technology when what we really want is just stuff that works. - Douglas Adams


A ship in port is safe, but that's not what ships were built for. -  Grace Hopper


Computers are useless. They can only give you answers. - Pablo Picasso


Hierarchies are celestial. In hell all are equal. - Nicolas Gomez Davila


The great myth of our times is that technology is communication. - Libby Larsen


Reality is that which, when you stop believing in it, does not go away, - Phillip K Dick


Fairly cheap home computing was what changes my life. - Linus Torvals


Its not just fail fast, its fail cheap. - Rick Houlihan


When you hear people say, "noSQL is missing joins", you say, "you are missing the point". - Rick Houlihan

Resources


# AWS re:Invent 2019: Data modeling with Amazon DynamoDB (CMY304)
# AWS re:Invent 2018: Amazon DynamoDB Deep Dive: Advanced Design Patterns for DynamoDB (DAT401)
DynamoDb Best Practices

Talks I didn't get to


Build With DynamoDb Series
AWS Builders' Day | DynamoDB and Schema Design
AWS re:Invent 2018: Amazon DynamoDB Under the Hood: How We Built a Hyper-Scale Database (DAT321)
AWS re:Invent 2019: [REPEAT 1] Amazon DynamoDB deep dive: Advanced design patterns (DAT403-R1)
Model a DynamoDB Database for a Gaming Application - AWS Virtual Workshop