@christopher-caldwell
Last active October 17, 2020 18:48

AWS Services Breakdown

This guide is intended for the absolute beginner to AWS. It assumes you know little to nothing about AWS.

Lambda

A "serverless" function ran without provisioning a certain server for a given amount of time. The concept of serverless is a huge topic, but the basic idea is that the starting and stopping of the server is abstracted away. When you ask it to run, it runs. After the code is done, the server stops running. You are only charged for the server when it's running.

They are completely stateless, and are essentially a fresh boot every time they run. This means that session-level state maintained by a Lambda, such as an auth session, is not possible.

Where the Code is Stored

Generally, the code is stored in an S3 bucket. When the Lambda is invoked, AWS pulls the code down from S3, mounts it in the container, and then runs it. It is possible to skip S3, but the package size is then limited to around 10mb.

Language Support

Lambda is limited to certain versions of certain languages. The popular ones include Node, Python, Java, C#, and Ruby. This is because Lambda functions are mounted into Docker containers managed by AWS, so only a limited set of languages and versions is available.

Version Support

You can however use your own Docker container, which can support any language / version you want. This is helpful because of EOL ( end of life ) for supported languages. Let's say you build everything with Node 6. AWS deprecated the official Node 6 runtime a couple years ago, which means you cannot create or update functions with Node 6, so you are REQUIRED to update to a supported Node version. This will continue to happen as new versions are released.

Use Cases

  • Pretty common as the main compute power behind small to medium APIs.
  • Great for bridging AWS services.
  • ETL data jobs ( transform some kind of data and insert it somewhere )
  • Automated scheduled tasks, like a weekly newsletter

Limitations

There's always a tradeoff.

  • The filesystem is read-only, except for /tmp, which is limited to 512mb
  • Maximum duration of 15 minutes
  • Maximum memory is 3gb
  • Soft limit of 1000 concurrent invocations, meaning that no more than 1000 Lambdas can be running at once

Cold Starts

A Lambda can be thought of as a Docker container. Docker containers take time to start up. Once they have been started, AWS keeps them running for approx 5 minutes. About 1.5-2x longer in a VPC.

When a Lambda is invoked for the first time in its life cycle, it suffers a "cold start". It takes longer than normal to complete the invocation, as the requestor has to wait for the container to start, along with the other setup processes AWS runs behind the scenes.

Once the container has gone through that once, it is considered "hot", and will run the handler without suffering the startup time.

Keep in mind that if the same Lambda is invoked concurrently, 2 instances will run. This means that both instances will have a cold start. For example, say you have Lambda A that runs for 1 minute on average. The first time Lambda A is invoked it suffers a cold start. 20 seconds into A, you invoke A again from a different source. Since one instance of A is already running, another instance of A needs to be started and run. This second A will also suffer a cold start. If A is invoked again within a couple minutes of both A's finishing, it will not suffer a cold start.

Handler

When a Lambda runs for the 1st time ( cold start ), it runs 100% of the code. On subsequent hot starts, Lambda only runs the code inside of handler. In Node, you export a function called handler, or whatever you designated the handler to be named in your template.

```js
export const handler = async (event, context) => {
  // ...
  return {
    yay: 'YAY'
  }
}

// or

export const handler = (event, context, callback) => {
  // ...
  callback(null, { yay: 'YAY' })
}
```

The handler can be defined in 2 ways: an async function that returns something, or a function with a callback which ends the Lambda invocation. The 1st option is heavily preferred.

This is an important concept in Lambda, because you cannot write all the code inside of handler and maintain performance in an API. In ETL jobs that are not time sensitive, it's not very important. Serverless APIs can have terrible performance if all the code is inside handler, because all of that code runs on every single invocation.

If you have setup code that does not need to run inside handler ( which is pretty much everything that doesn't rely on something from the event object ), it should be declared outside the handler. It's a big topic.
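A minimal sketch of that idea: module scope persists across hot invocations, so expensive setup done there only runs during the cold start. The names here ( createDbConnection, invocationCount ) are hypothetical illustrations, not AWS APIs, and a real Lambda would export the handler.

```js
// Module-scope state survives between hot invocations of the same container.
let invocationCount = 0

// Stand-in for any expensive setup, e.g. opening a database connection.
const createDbConnection = () => ({ connected: true })

// Runs once, during the cold start only. Hot invocations reuse the result.
const connection = createDbConnection()

const handler = async (event, context) => {
  // Runs on every invocation, hot or cold.
  invocationCount += 1
  return { connected: connection.connected, invocationCount }
}
```

Invoking handler twice in the same container reuses the connection and counts up; a fresh container ( cold start ) would reset both.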

DynamoDB

Dynamo is really confusing to learn at first. It's very complex, and there are a lot of little things to keep in mind when using it.

Dynamo is a NoSQL key-value database. It can scale to infinity, and can handle an unfathomable amount of requests. It was created to handle the shopping cart on amazon.com during Black Friday / Cyber Monday. It is designed to handle extremely high volume reads, with a very specific way of accessing data. It does not force you to conform to any schema other than the keys. One record can have 1000 properties, the next can have 3.

Dynamo instances can have multiple tables, but it is considered best practice to only have one table per application.

Partitioning

The way Dynamo stores data is pretty unfamiliar to SQL users. Dynamo spreads data across multiple partitions of a cluster of SSDs. Your data scales horizontally by adding more SSDs to the cluster, rather than increasing the size and compute of a single DB server.

Dynamo uses partition keys to separate data. You want these values to be as unique as possible, as they are the only way to influence how data is spread out. Auto incrementing numbers like 1, 2, 3, 4 all look very similar to Dynamo, and will all be saved on the same partition, causing read speeds to slow with scale and the size of that single partition to swell.

There is a hard limit of 10gb of data per partition, so it's possible to reach the limit if you partition the data incorrectly. Think of the partition as the grouping key. It groups data together. A great example of a partition key is an email address.

Range Key

If the partition is the group, the range key is the unique id within that group. This key is not technically necessary, but it is almost always used. The combination of the partition and range key must be unique across the entire table; this is known as the Primary Key.

The strategy here is to have identifying information in reverse pyramid fashion. So least specific to start, going down to the most specific.

If the email address is the partition key, the range key might be something like user-profile. This would give the primary key of me@me.com -> user-profile. If that user also had some orders on an ecommerce site, the range key might be order_123, with 123 being the order ID. Order is less specific, but the order ID is very specific. The reason both are included in this cascading manner will be explained in access patterns.
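As a sketch, those composite keys could be built with small helpers ( the function names and attribute names here are hypothetical, just to show the least-to-most-specific structure ):

```js
// Hypothetical helpers for the primary keys described above.
const buildProfileKey = (email) => ({
  partitionKey: email,
  rangeKey: 'user-profile'
})

// Range key cascades from least specific (order) to most specific (the ID).
const buildOrderKey = (email, orderId) => ({
  partitionKey: email,
  rangeKey: `order_${orderId}`
})

buildOrderKey('me@me.com', '123')
// → { partitionKey: 'me@me.com', rangeKey: 'order_123' }
```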

Dynamo Operations

All interaction with DynamoDB is done through the AWS-SDK of your chosen language. For Node, it provides at least 2 ways of interfacing with DynamoDB: the DynamoDB class, and the DocumentClient class.

I don't use the DynamoDB class, because the Dynamo native syntax is a little funky to use. For example, Dynamo resembles JSON, but a string key would be declared as:

```json
{
  "someKey": {
    "S": "The value of someKey"
  }
}
```

The DocumentClient abstracts those away, and unmarshalls the Dynamo structure into familiar JSON:

```json
{
  "someKey": "The value of someKey"
}
```
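To illustrate what that abstraction is doing, here is a toy unmarshaller for just the "S" and "N" type wrappers. This is illustration only; the real DocumentClient handles many more types ( lists, maps, sets, booleans, etc ).

```js
// Toy version of what DocumentClient does on reads: strip the type wrappers.
const toPlainObject = (dynamoItem) => {
  const plain = {}
  for (const [key, typed] of Object.entries(dynamoItem)) {
    if ('S' in typed) plain[key] = typed.S
    else if ('N' in typed) plain[key] = Number(typed.N) // Dynamo sends numbers as strings
  }
  return plain
}

toPlainObject({ someKey: { S: 'The value of someKey' }, count: { N: '3' } })
// → { someKey: 'The value of someKey', count: 3 }
```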

In order to get any item from Dynamo, you must know the entire partition key. It has to be a strict equals on the partition. This means that a randomly generated ID that you can't possibly rely on anyone to remember may not be a good partition key, since you will need it every time you ask Dynamo for data.

Get Item

To get an item, you must know both the partition and the range key entirely. It returns the one record matching the primary key.

More can be found here

Query Item

To query, you only need to know the partition. The query will return every record that matches the partition. This is helpful if you want all records associated with a specific group, or partition. Think "everything for the email address me@me.com".

When you add a range key, you can filter down the results. You can do a strict text match on the range key, or you can use operators like begins_with to take advantage of that reverse pyramid of specificity.
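A sketch of what the query params look like with the DocumentClient, assuming a table named app-table with keys named groupId ( partition ) and individualId ( range ) — all three names are assumptions for illustration:

```js
// Query everything for me@me.com whose range key starts with 'order_'.
const params = {
  TableName: 'app-table',
  KeyConditionExpression: 'groupId = :group AND begins_with(individualId, :prefix)',
  ExpressionAttributeValues: {
    ':group': 'me@me.com',
    ':prefix': 'order_'
  }
}

// Executed with something like: documentClient.query(params).promise()
```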

More can be found here

Access Patterns

I've had the pleasure of talking directly with the speaker at this keynote. If you truly want to understand Dynamo access patterns, you need to watch this video several times.

The most important aspect of Dynamo is knowing your access patterns. You MUST know how you will access the data in the db before writing it there. This is commonly called a key strategy, but the idea is: what info will I need in order to get x, y, z data out of Dynamo in a single operation?

A common thing is for businesses to change how they want data. Let's say you are getting some product orders by order ID. No problem for SQL or Dynamo. You go on 6 months creating orders that are designed to be accessed by order ID. Then, suddenly, someone decides you need to access them by both the order ID and the order date. No problem for SQL, major problem for Dynamo. You cannot suddenly decide you want to access data by the order date. You have to re-index 6 months worth of orders to add the ability to use either the order date or the order ID to find an order. Not an easy task, and there's always the possibility of losing some data in the process.

When thinking about how to access data, the range key specificity comes into play. Taking the example from the range key section, of order -> order ID, you can access all of a customer's orders without needing the ID. You can access everything pertaining to a customer by using just their email address. If you only need one specific order, you can use the entire range key to get just that order.

This is not an easy task, and should not rest on someone that would benefit from reading this guide. That is to say, if you have learned something new from reading the Dynamo section, you should not be in charge of a key strategy for anything outside personal projects.

Example

Taking the example of the order date and order ID being an access pattern, we can see how that would be an issue.

If you have the range key of order_orderId, you can get an order by its ID. If you have order_orderDate, you can get orders by date, but how do you get both? If you had order_orderDate_orderId, you could get all the orders for a given date, then filter down by ID, which is great.

If you had order_orderId_orderDate, it wouldn't be much help, since the ID already tells you exactly which order you have. That pretty much makes the date irrelevant.
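The effect of that cascading specificity can be simulated with plain string prefixes. This is illustrative only, not how Dynamo is implemented internally:

```js
// Range keys shaped order_orderDate_orderId.
const rangeKeys = [
  'order_2012-12-12_123',
  'order_2012-12-12_456',
  'order_2012-12-13_789'
]

// Simulates a begins_with condition on the range key.
const beginsWith = (keys, prefix) => keys.filter((key) => key.startsWith(prefix))

beginsWith(rangeKeys, 'order_')                // → 3 matches, every order
beginsWith(rangeKeys, 'order_2012-12-12')      // → 2 matches, the orders from the 12th
beginsWith(rangeKeys, 'order_2012-12-12_123')  // → 1 match, one exact order
```

With order_orderId_orderDate instead, the date prefix would be unreachable without already knowing the ID, which is the point being made above.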

The way this is accomplished is with secondary indexes.

Indexes ( Indices? )

Indexes are a way to access the same data using different key strategies. Something to know up front:

This duplicates your data. If you write a 1mb record, you have now written two 1mb records, totaling 2mb. It is not a reference to the same data, it's pure duplication.

Local Secondary Index

Official Docs

Highlights
  • Abbreviated as "LSI"
  • Limit is currently 5 per table
  • Must be created at the creation time of the table, meaning it cannot be added after the table exists
  • Same partition key, different range key
  • Subject to the 10gb limit of the shared partition.
Example

Keeping with the above example, we can use 2 different range keys to access the order by the ID, and the date.

| Partition | Range | LSI |
| --- | --- | --- |
| me@me.com | order_orderId | order_orderDate_orderId |

With this key strategy we can:

  • Get all orders for an email address
  • Get an order by a specific ID
  • Get all orders for an email that were placed on a given date
  • Get the specific order for an email that was placed on a given date ( although I'm not sure this is necessary, as we can already use the range key, just an example of hierarchy in key structures)
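Querying the LSI looks like a normal query with an IndexName added. The table name, index name, and attribute names here are assumptions for illustration:

```js
// All orders for me@me.com placed on 2012-12-12, via the LSI.
const params = {
  TableName: 'app-table',
  IndexName: 'orderDateLookup',
  // Same partition key as the table, but the condition targets the LSI's range attribute.
  KeyConditionExpression: 'groupId = :group AND begins_with(orderDateLookup, :date)',
  ExpressionAttributeValues: {
    ':group': 'me@me.com',
    ':date': 'order_2012-12-12'
  }
}
```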

Global Secondary Index

We think everything is great until the business says "I want to see all orders placed on a given date". This is really common, and is a reason why a lot of people dislike DynamoDB. It's not meant for future flexibility. Do not use this db engine unless you understand that your established access patterns must not change. Tell the business side NO ( politely, because they have already agreed not to change the access patterns ).

Dynamo is tuned for performance, not convenience or pivoting.

Official Docs

Highlights
  • Abbreviated "GSI"
  • Current limit is 25 per table
  • Completely different partition and range key
  • Can be created at any time, and will automatically work for data created after the GSI
    • You can re-write the old data to automatically backfill missing pieces in the GSI
  • Not subject to the 10gb partition limit
  • Great for reverse lookups
  • Most recommend that 1, or 2 at most, is all you need with a well architected table

Okay, so the business wants to display all the orders for a given date. Here is where a GSI used as a reverse lookup is going to be helpful.

Naming Keys

It's important to name keys as generically as possible, as they will be overloaded with many different types of attributes. For example, the partition key of our table is only storing the email, so it may be tempting to name it "email", but in fact it's actually a groupId that is responsible for grouping records together. It just so happens that our examples work well with email addresses. What if we started storing employees of our fictitious company in our table? We are a broke startup, so we can't afford emails yet, so we give everyone an employee number. If we started storing the employeeNumber inside of email, that would be confusing at best.

So instead it's named groupId because I am grouping all of the employee's info under their employeeNumber.

The same can be said for indexes. I am using this to do a reverse lookup for orders, but this one index can be used to get all of the users, all the employees, and all of the orders.

Example

So I have a record with a schema like

```js
{
  // partition
  "groupId": "me@me.com",
  // range
  "individualId": "order_orderId",
  // LSI 1
  "orderDateLookup": "order_orderDate",
  // ...
}
```

At any time, I can add reverseLookupGroupId to start indexing my new records with a different partition key.

Let's say on the orders, I add {"reverseLookupGroupId": "order" }. I can now use the query operation to find every order ever placed in the db. I can do the same for users, employees, etc.

For the range key of the GSI, it still needs to be unique throughout the table, and it's even more important that cascading specificity is used. For example, you might not want to put emailAddress as the first entry in the range key. Then you'd only have 2 access patterns: all of the orders, and all the orders for one email, which is already satisfied by the original key strategy. It truly depends on your access patterns. If the business needs order date, then use order date.

Your query would be

| GSI 1 | GSI Range |
| --- | --- |
| order | 2012-12-12 |

This would give you every order placed on the 12th.
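A sketch of that reverse lookup with the DocumentClient, assuming the GSI is named reverseLookup and its range attribute is named gsiRange ( the table, index, and attribute names are all hypothetical ):

```js
// Every order placed on 2012-12-12, regardless of which customer placed it.
const params = {
  TableName: 'app-table',
  IndexName: 'reverseLookup',
  KeyConditionExpression: 'reverseLookupGroupId = :group AND begins_with(gsiRange, :date)',
  ExpressionAttributeValues: {
    ':group': 'order',
    ':date': '2012-12-12'
  }
}
```

Note the partition value is just "order", the generic group, which is what makes the reverse lookup work across all customers.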

RDS

IAM

SNS

SQS

API Gateway

Glossary

Soft Limit

A default limit on services imposed by AWS. Can be extended if requested and paid for.

Hard Limit

A default limit on services imposed by AWS. This cannot be extended, even when saying please.

Re Index

The process of scraping 100% of your data from the DB, and writing it back to the same or a different db with a different key structure.

Reverse Lookup

Getting all of the records for a specific group, while normally storing them as individual entries. This is the equivalent of select * from x, without any where clauses. I can add where clauses with the range key, but the point is to re-group all of the items by their root characteristic.
