- Data Pressure: Ability to process data within reasonable time and cost
- Whenever one of the dimensions of Data Pressure is broken, we'd invent new technology.
- RDBMs optimize for storage which makes sense as storage was expensive in 70s. But this happens at the cost of CPU, where CPU is going insane pulling data from all over the disk and stitch it together. Nowadays storage is cheap and CPU isn't. So why use a system that depends on an expensive resource.
- RDBMs/SQL gives structure with power to create ad hoc queries while noSQL gives instantiated views for given queries. In this way noSQL forces more product driven development where one has to discover access patterns.
- noSQL != non-relation. All data in universe is relational.
- Do not compare RDBMS with noSQL and specially MySQL and DynamoDb.
- No, Table is not the same, and no Document is not same as records
- For writes, DynamoDb uses 3 replica and returns when 2 confirm the write
- Eventually consistent vs strictly consistent
- EC could be achieved with using streams to update nested records
- SC always queries the primary replica and hence isn't really scalable
- EC is single digit millisecond behind Primary
- noSQL us not flexible. It is efficient. The data is tightly coupled with access patterns.
- Not good with answering count, avg, sum, min, max, ...
- Table
- Item
- Partition Key
- Used to do unordered hash index e.g. routing
- Sort Key
- Sort order of data on the same partition node
- Attributes
- Secondary Indices
- Local
- Global
- Streams
- Stored locally on same partition on a node
- Contribute to partition size and follow same limit e.g. 10GB per partition key
- Partition key remains the same, we change the sort key
- Strongly consistent
- Used for supporting access patterns that can't be supported by Primary Key
- Create a replica of table with the Secondary Index as Partition Key
- Updates are replicated from main table
- Storage costs could be a concern depending on how many GSI we'd have
- Capacity planning could be critical
- Base Table could have a capacity of 4 while the SI table could have 2
- Extensive writes could cause a throttle on Base Table
- Eventually consistent
- Simple: Partition Key
- Composite: Partition Key + Sort Key
- DO NOT CONFUSE PK
- Streams + Lambda could be use like a stored procedure, disconnected from table space.
- Stream is like a changelog where everything gets written
- Could invoke a lambda (Separate IAM roles for better security: Invocation[access to the table], Execution)
- Commonly used for aggregations and written back as metadata item
- For high velocity one could use ECS/EC2 to process the streams instead
- Item based
- Writing, Updating, Deleting
- Must provide entire primary key
- Query
- Must provide Partition key, Sort key is optional
- Scan
- Expensive for performance and costs
- TransactWriteItems
- Synchronize update, put, delete and check
- Atomic
- Automatic Rollback
- Up to 10 items within a transaction
- Supports multiple tables. DO NOT DO IT
- Complex conditional checks
- Synchronize update, put, delete and check
- Good:
- Commit changes across items
- Condition batch inserts/updates
- Bad
- Maintaining normalized data
- Nature of the application
- OLTP/OLAP/DSS
- Data Life Cycle: TTL, Backup/Archival
- Start with ERD
- Define access patterns
- Identify data sources
- Document workflows
- Read/Write workloads
- Query dimensions and aggregations
- Define Core Entity
- Define primary keys and secondary indexes
- Avoid using multiple tables, use one with different SK
- 1 application service = 1 table
- How will data be read and written? Eventual consistency or strict?
- Review -> Repeat -> Review
- One table in DynamoDb can, and usually does, have data for more than one entity.
- Always prefix the PK and SK with identifier or acronym.
- Use SK to indicate 1-M relationships
- Use indices instead of scan. Avoid FilterExpression as it does a scan
- Use Reverse index for M-1
- Items that aren't accessed separately should be part of a main record
- High cardinality
- Uniformly requested and randomly distributed
- Examples
- Bad: Status, Gender,
- Good: CustomerId, DeviceId
- Model 1:M and M:M relationships
- Efficient/selective patterns, ability to query multiple entities with same Partition key but different sort key
- No more: Get the Customer and then get all the records for that customer, or do some join.
- Use Range Queries
- Users <-1:M-> Address
- Users <-1:M-> Orders
- Orders <-1:M-> OrderItems
- Get user profile
- Get orders for user
- Get single order and relevant order items
- Get orders for user by status
- Get Open Orders
PK, SK, Attributes
USER#sho, #PROFILE#sho, {"username": "sho", "email": "sho@domain.com"}
Profile belongs to user and would only be accessed through user, makes sense to make the SK to couple with PK.
- Addresses would only be accesses when a profile is accessed
- A limit can be added for addresses, say 10.
- Denormalize. Make address a json attribute of User
- Keep User as PK but add Order as SK.
PK, SK, Attributes
USER#sho, ORDER#id, {"total": 22.90, "something: "something"}
- Belong to Orders, which belongs to user, won't exists without Order items
- Can't do same as Address belong this is property of an object which is property of another object
- We already have Order id as SK, keep that and change PK to be used for ITEM
PK, SK, Attributes
ITEM#shirt, ORDER#id, {"price": 12.90, "size: "L"}
- Add first Secondary Index
- Inverted index. SK becomes primary key, PK becomes sort key
- Fetch by Order id
- Returns relevant user and all items, different kind of data but that is expected and actually helpful, plus PK tells the kind of data.
- Can't use FilterExpression as one can't query across partition
- Use Scan? 1M limit, expensive and will not return all the results, could even return empty when we have data
PK = USER#sho AND BEGINS_WITH(SK, "ORDER#")
- Can not query the status attribute directly
- Composite Sort key
- Combine Status with status date as a new attribute
SHIPPED#2019-06-01
- Create SI on the combined attribute
PK = USER#sho AND BEGINS_WITH(OrderStatusDate, "SHIPPED#")
- Can filter also on dates with this
- This looks like a scan for open orders
- Sparse index pattern
- Create a new attribute that is only set if order is open, value does not matter
- Create a new SI
- Do a scan on that new SI
- Country > State > City > Office
- Use composite sort to define hierarchy
- State#City#OfficeId
- Reduce query complexity
Primary Key, Attributes
USA, NY#NYC#JFK11, JFK....
USA, NY#NYC#JFK14, JFK...
USA, WA#SEA#BLACKFOOT, BLACKFOOT...
USA, WA#SEA#MAYDAY, MAYDAY...
CA, ....
Mostly from [2] below.
History repeats itself because nobody was listening the first time. - Anonymous
We are stuck with technology when what we really want is just stuff that works. - Douglas Adams
A ship in port is safe, but that's not what ships were built for. - Grace Hopper
Computers are useless. They can only give you answers. - Pablo Picasso
Hierarchies are celestial. In hell all are equal. - Nicolas Gomez Davila
The great myth of our times is that technology is communication. - Libby Larsen
Reality is that which, when you stop believing in it, does not go away, - Phillip K Dick
Fairly cheap home computing was what changes my life. - Linus Torvals
Its not just fail fast, its fail cheap. - Rick Houlihan
When you hear people say, "noSQL is missing joins", you say, "you are missing the point". - Rick Houlihan
- # AWS re:Invent 2019: Data modeling with Amazon DynamoDB (CMY304)
- # AWS re:Invent 2018: Amazon DynamoDB Deep Dive: Advanced Design Patterns for DynamoDB (DAT401)
- DynamoDb Best Practices
- Build With DynamoDb Series
- AWS Builders' Day | DynamoDB and Schema Design
- AWS re:Invent 2018: Amazon DynamoDB Under the Hood: How We Built a Hyper-Scale Database (DAT321)
- AWS re:Invent 2019: [REPEAT 1] Amazon DynamoDB deep dive: Advanced design patterns (DAT403-R1)
- Model a DynamoDB Database for a Gaming Application - AWS Virtual Workshop