Skip to content

Instantly share code, notes, and snippets.

@shoaibi
Last active April 1, 2020 08:07
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save shoaibi/80f771a15fb63c39ad39aa293345ecab to your computer and use it in GitHub Desktop.
Save shoaibi/80f771a15fb63c39ad39aa293345ecab to your computer and use it in GitHub Desktop.
DynamoDb Notes

General

  • Data Pressure: Ability to process data within reasonable time and cost
  • Whenever one of the dimensions of Data Pressure is broken, we'd invent new technology.
  • RDBMs optimize for storage which makes sense as storage was expensive in 70s. But this happens at the cost of CPU, where CPU is going insane pulling data from all over the disk and stitch it together. Nowadays storage is cheap and CPU isn't. So why use a system that depends on an expensive resource.
  • RDBMs/SQL gives structure with power to create ad hoc queries while noSQL gives instantiated views for given queries. In this way noSQL forces more product driven development where one has to discover access patterns.
  • noSQL != non-relation. All data in universe is relational.
  • Do not compare RDBMS with noSQL and specially MySQL and DynamoDb.
    • No, Table is not the same, and no Document is not same as records
  • For writes, DynamoDb uses 3 replica and returns when 2 confirm the write
  • Eventually consistent vs strictly consistent
    • EC could be achieved with using streams to update nested records
    • SC always queries the primary replica and hence isn't really scalable
    • EC is single digit millisecond behind Primary
  • noSQL us not flexible. It is efficient. The data is tightly coupled with access patterns.
  • Not good with answering count, avg, sum, min, max, ...

Forget everything you've learn for RDBMS

Concepts

  • Table
  • Item
  • Partition Key
    • Used to do unordered hash index e.g. routing
  • Sort Key
    • Sort order of data on the same partition node
  • Attributes
  • Secondary Indices
    • Local
    • Global
  • Streams

Secondary Indices

Local Secondary Index

  • Stored locally on same partition on a node
  • Contribute to partition size and follow same limit e.g. 10GB per partition key
  • Partition key remains the same, we change the sort key
  • Strongly consistent

Global Secondary Index

  • Used for supporting access patterns that can't be supported by Primary Key
  • Create a replica of table with the Secondary Index as Partition Key
  • Updates are replicated from main table
  • Storage costs could be a concern depending on how many GSI we'd have
  • Capacity planning could be critical
    • Base Table could have a capacity of 4 while the SI table could have 2
    • Extensive writes could cause a throttle on Base Table
  • Eventually consistent

Primary Key

  • Simple: Partition Key
  • Composite: Partition Key + Sort Key
  • DO NOT CONFUSE PK

Streams

  • Streams + Lambda could be use like a stored procedure, disconnected from table space.
  • Stream is like a changelog where everything gets written
  • Could invoke a lambda (Separate IAM roles for better security: Invocation[access to the table], Execution)
  • Commonly used for aggregations and written back as metadata item
  • For high velocity one could use ECS/EC2 to process the streams instead

API Actions

  • Item based
    • Writing, Updating, Deleting
    • Must provide entire primary key
  • Query
    • Must provide Partition key, Sort key is optional
  • Scan
    • Expensive for performance and costs

Transactions API

  • TransactWriteItems
    • Synchronize update, put, delete and check
      • Atomic
      • Automatic Rollback
    • Up to 10 items within a transaction
    • Supports multiple tables. DO NOT DO IT
    • Complex conditional checks
  • Good:
    • Commit changes across items
    • Condition batch inserts/updates
  • Bad
    • Maintaining normalized data

Design Process

  • Nature of the application
    • OLTP/OLAP/DSS
    • Data Life Cycle: TTL, Backup/Archival
  • Start with ERD
  • Define access patterns
    • Identify data sources
    • Document workflows
    • Read/Write workloads
    • Query dimensions and aggregations
  • Define Core Entity
  • Define primary keys and secondary indexes
    • Avoid using multiple tables, use one with different SK
    • 1 application service = 1 table
    • How will data be read and written? Eventual consistency or strict?
  • Review -> Repeat -> Review

Design Principles

General

  • One table in DynamoDb can, and usually does, have data for more than one entity.
  • Always prefix the PK and SK with identifier or acronym.
  • Use SK to indicate 1-M relationships
  • Use indices instead of scan. Avoid FilterExpression as it does a scan
  • Use Reverse index for M-1
  • Items that aren't accessed separately should be part of a main record

Partition Key

  • High cardinality
  • Uniformly requested and randomly distributed
  • Examples
    • Bad: Status, Gender,
    • Good: CustomerId, DeviceId

Sort Key

  • Model 1:M and M:M relationships
  • Efficient/selective patterns, ability to query multiple entities with same Partition key but different sort key
    • No more: Get the Customer and then get all the records for that customer, or do some join.
  • Use Range Queries

Examples

eCommerce [1]

ERD

  • Users <-1:M-> Address
  • Users <-1:M-> Orders
  • Orders <-1:M-> OrderItems

Access Patterns

  • Get user profile
  • Get orders for user
  • Get single order and relevant order items
  • Get orders for user by status
  • Get Open Orders

Design

Users

PK, SK, Attributes
USER#sho, #PROFILE#sho, {"username": "sho", "email": "sho@domain.com"}

Profile belongs to user and would only be accessed through user, makes sense to make the SK to couple with PK.

Address

  • Addresses would only be accesses when a profile is accessed
  • A limit can be added for addresses, say 10.
  • Denormalize. Make address a json attribute of User

Orders

  • Keep User as PK but add Order as SK.
PK, SK, Attributes
USER#sho, ORDER#id, {"total": 22.90, "something: "something"}

Order Items

  • Belong to Orders, which belongs to user, won't exists without Order items
  • Can't do same as Address belong this is property of an object which is property of another object
  • We already have Order id as SK, keep that and change PK to be used for ITEM
PK, SK, Attributes
ITEM#shirt, ORDER#id, {"price": 12.90, "size: "L"}
  • Add first Secondary Index
    • Inverted index. SK becomes primary key, PK becomes sort key
    • Fetch by Order id
      • Returns relevant user and all items, different kind of data but that is expected and actually helpful, plus PK tells the kind of data.

Filtering access patterns

  • Can't use FilterExpression as one can't query across partition
  • Use Scan? 1M limit, expensive and will not return all the results, could even return empty when we have data
Get orders for user
  • PK = USER#sho AND BEGINS_WITH(SK, "ORDER#")
Get orders by status for user
  • Can not query the status attribute directly
  • Composite Sort key
    • Combine Status with status date as a new attribute
    • SHIPPED#2019-06-01
    • Create SI on the combined attribute PK = USER#sho AND BEGINS_WITH(OrderStatusDate, "SHIPPED#")
  • Can filter also on dates with this
Get open orders
  • This looks like a scan for open orders
  • Sparse index pattern
    • Create a new attribute that is only set if order is open, value does not matter
    • Create a new SI
    • Do a scan on that new SI

Hierarchical Data [2]

  • Country > State > City > Office
  • Use composite sort to define hierarchy
    • State#City#OfficeId
  • Reduce query complexity
Primary Key, Attributes
USA, NY#NYC#JFK11, JFK....
USA, NY#NYC#JFK14, JFK...
USA, WA#SEA#BLACKFOOT, BLACKFOOT...
USA, WA#SEA#MAYDAY, MAYDAY...
CA, ....

Quotes

Mostly from [2] below.

History repeats itself because nobody was listening the first time. - Anonymous

We are stuck with technology when what we really want is just stuff that works. - Douglas Adams

A ship in port is safe, but that's not what ships were built for. - Grace Hopper

Computers are useless. They can only give you answers. - Pablo Picasso

Hierarchies are celestial. In hell all are equal. - Nicolas Gomez Davila

The great myth of our times is that technology is communication. - Libby Larsen

Reality is that which, when you stop believing in it, does not go away, - Phillip K Dick

Fairly cheap home computing was what changes my life. - Linus Torvals

Its not just fail fast, its fail cheap. - Rick Houlihan

When you hear people say, "noSQL is missing joins", you say, "you are missing the point". - Rick Houlihan

Resources

  1. # AWS re:Invent 2019: Data modeling with Amazon DynamoDB (CMY304)
  2. # AWS re:Invent 2018: Amazon DynamoDB Deep Dive: Advanced Design Patterns for DynamoDB (DAT401)
  3. DynamoDb Best Practices

Talks I didn't get to

  1. Build With DynamoDb Series
  2. AWS Builders' Day | DynamoDB and Schema Design
  3. AWS re:Invent 2018: Amazon DynamoDB Under the Hood: How We Built a Hyper-Scale Database (DAT321)
  4. AWS re:Invent 2019: [REPEAT 1] Amazon DynamoDB deep dive: Advanced design patterns (DAT403-R1)
  5. Model a DynamoDB Database for a Gaming Application - AWS Virtual Workshop
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment