External Resources
- linuxacademy.com (Training)
- Acloud.guru (training)
- Read the top-level Documentation and FAQs for all the major AWS resources (EC2, S3, RDS, Auto Scaling, etc). The answers to the "nit-picky" questions can be found here. It's also helpful to go deeper on VPCs and networking-related concepts.
- Everyone says read the white papers. The ones I read were:
- Security Best Practices https://d0.awsstatic.com/whitepapers/Security/AWS_Security_Best_Practices.pdf
- Cloud Best Practices https://d0.awsstatic.com/whitepapers/AWS_Cloud_Best_Practices.pdf
Core
- Elasticity
- Ability to scale up/down on demand
- Reduce cost
- Scaling
- Proactive - fixed interval
- Proactive Event-based
- Auto-Scaling based on demand
- Out: more instances, Up: Bigger instances
- Fault tolerant
- Operationally efficient
- More resources when needed, fewer when not, to control cost
- Scalability
- grow over time
- economies of scale
- vertically: more powerful resources
- horizontally: increased number of resources
- Reserved
- reduced price
- guaranteed capacity (useful for disaster recovery)
- within AZ
- Can submit modification request to
- change AZ
- scope up/down AZ to Region
- instance size w/in same type
- "Tightly Coupled" vs "Loosely Coupled"
- Tightly
- one thing fails, all fail
- Loosely
- individual component failure won't break everything
- Can scale out individual components
- Shared Security Model
- AWS: host OS/Virtualization, physical security
- DDoS
- Could use CloudFront to absorb requests
- Do INGRESS filtering on incoming traffic
- You: guest OS/VPC on up, security group, software updates, data in transit/rest
- Also use software-level security: IPTABLES, Firewalls, etc
- DDoS
- Block CIDR at Network ACL (Subnet level)
- Must have permission to do Port Scanning w/in cloud
- Encryption (AES-256):
- EBS
- done on instance itself, not good for small instances
- snapshots automatically encrypted
- S3 - at rest
- Glacier
- Redshift
- SQL RDS
- MySQL//Aurora, Oracle, Postgres, MSSQL
- Snapshots, backups, read replicas all encrypted
- SSL connection encryption
- Disaster Recovery
- Recovery Time Objective (RTO)
- restore to level of service, measured in time
- Recovery Point Objective (RPO)
- acceptable amount of data loss, measured in time
- Methodology
- Pilot light
- Minimal version of production environment in AWS
- Scale out and DNS switch if disaster
- Make sure it's up to date
- Requires extra time to spin up
- Warm standby
- Larger footprint than pilot light
- Running business critical applications
- Multi-site
- clone production environment
- active-active
- also use as load balance
- less downtime, more costly
- DR Services
- ELB and Auto Scaling
- Route53 failover DNS, or latency based
- Storage Gateway
- lots of AWS tools to get data, AMIs out of on-premises
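As a rough illustration of the RTO and RPO definitions above, the sketch below checks a hypothetical backup plan against both objectives. The function name and the sample numbers are assumptions made for this example; nothing here is an AWS API.

```python
from datetime import timedelta


def meets_objectives(backup_interval, restore_time, rpo, rto):
    """Return (rpo_ok, rto_ok) for a simple backup-and-restore plan.

    Worst-case data loss equals the time between backups (RPO side);
    downtime equals the time needed to restore service (RTO side).
    """
    return backup_interval <= rpo, restore_time <= rto


# Hypothetical plan: hourly backups, 3 hours to rebuild from them.
rpo_ok, rto_ok = meets_objectives(
    backup_interval=timedelta(hours=1),
    restore_time=timedelta(hours=3),
    rpo=timedelta(hours=4),
    rto=timedelta(hours=2),
)
print(rpo_ok, rto_ok)  # RPO is met, RTO is not
```

In this made-up case the plan would need a faster recovery method (for example warm standby instead of restoring from backups) to satisfy the RTO.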
EC2
- AMI
- Unique to a Region (need to manually copy to another Region)
- Permissions: can make public, or available to another AWS account
- PV Paravirtual
- cannot take advantage of hardware extensions
- historically was faster, but basically no diff now
- HVM Hardware Virtual - preferred method, on current instance types
- When copying an image, can encrypt an unencrypted one; default keeps the source's encryption state
- Security Group
- EC2-Classic: Can't add/remove SGs to running instance
- VPC: Live in that VPC
- Stateful: response traffic is allowed
- 500 Security Groups / VPC, 50 rules/SG, 5 SGs/network interface
- Can't delete the default security group
- Instance needs at least one security group
- Dedicated instance - hardware w/ just our stuff
- Elastic Load Balancer
- Can distribute traffic across AZs
- Can live in a public subnet and serve traffic to private subnets (eg Auto Scaling Groups)
- Can apply an SSL certificate directly to it
- Needs to split between two subnets?
- Configure
- What to listen on (80)
- Apply a Security Group to the ELB
- Health Check on target instances (eg port 80)
- target instances
- Instances Security Groups still need to allow traffic from ELB
- Instances don't need a Public IP, just a public subnet
- Requests look like they're from ELB, might want to log traffic at ELB level
- ELB for Apex domain requires Route53 Alias record
- Internal ONLY accepts traffic from w/in VPC (use w/ private subnet for multi-tier apps)
- Connection draining: wait for connections to complete
- Cross Zone load balancing required across AZs
- AWS will increase ELB as needed
- Contact AWS for "Prewarming" to handle rapid, drastic spikes in traffic
- Lots of built in Metrics on the ELB (connections, responses, etc)
- Can create Alerts off of it
- Classic: register instances, Application: target group
- Troubleshooting
- Make sure target is available for health check (correct "index.html" target)
- Make sure ELB and Instances have port 80 open
- Enable Access Logs to Amazon S3 (otherwise ELB logs on Instance)
- Add specific subnets to ELB
- Auto Scaling Group (ASG)
- Launch additional instances as needed
- Load
- Proactive
- Event
- Consists of Launch Configuration + Scaling Plan
- Spans AZs, not Regions
- Tell it to receive traffic from an ELB
- static size = "self healing", or
- scaling policy to adjust capacity w/ CloudWatch Alarm for Increase and Decrease group size
- Can use with internal multi-tier apps
- Deleting ASG terminates running instances
- Can't pass EC2 instance cap
- Launch Configuration (LC)
- Info on the instance that an Auto Scaling Group uses (AMI, instance type, etc)
- LC can be used w/ multiple ASGs
- ASG requires an LC
- Can't modify LC after launch
- Create new LC, apply to ASG
- Only affects new instances! Existing are kept.
- Scaling Plan
- Tells Auto Scaling when and how to scale
- Manual
- Scheduled
- Demand (In and Out)
- Alarm - What to look for (via CloudWatch)
- Could be SQS Queue Size
- Policy - how to respond to Alarm
- Can't go outside min/max group size
- Adjustment Types
- ChangeInCapacity (+/-)
- ExactCapacity
- PercentChangeInCapacity
- Scaling Policy Types
- Simple
- has cooldown
- Step
- based on size of alarm breach
- continuously evaluated
- Troubleshooting
- "thrashing" up and down
- Change thresholds
- Decrease checking frequency
- Increase cooldown (not for step or scheduled)
- doesn't happen
- max too low
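The three adjustment types and the min/max limit listed above can be sketched as a small function. This is a simplified model, not the Auto Scaling API: the function name is hypothetical, and the percentage case uses plain rounding, which may differ from the service's own rounding rules.

```python
def apply_adjustment(current, adjustment_type, value, min_size, max_size):
    """Return the new desired capacity, clamped to the group's min/max."""
    if adjustment_type == "ChangeInCapacity":
        desired = current + value
    elif adjustment_type == "ExactCapacity":
        desired = value
    elif adjustment_type == "PercentChangeInCapacity":
        # Assumption: plain rounding of the percentage change.
        desired = current + round(current * value / 100)
    else:
        raise ValueError(f"unknown adjustment type: {adjustment_type}")
    # A policy can never push the group outside its min/max size.
    return max(min_size, min(max_size, desired))


print(apply_adjustment(4, "ChangeInCapacity", 2, 1, 10))
print(apply_adjustment(4, "PercentChangeInCapacity", 50, 1, 10))
print(apply_adjustment(4, "ChangeInCapacity", 20, 1, 10))
```

The last call asks for 24 instances but is capped at the maximum, which is the "doesn't happen / max too low" troubleshooting case.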
- EBS Volumes
- cannot cross AZs (auto replicated WITHIN AZ)
- create snapshot in other AZ to access there
- IOP is max 256KB
- SSD General Purpose
- 1GiB to 16TiB
- Burstable IOP credits
- baseline 3 IOPS/GiB
- Provisioned IOPS
- 4GiB to 16TiB
- up to 20,000 IOPS
- Critical apps requiring sustained IOPS, large DB workloads
- Magnetic
- 1GiB to 1024GiB
- AWS EBS encryption uses AWS Key Management Service (don't use on smaller than m3)
- Customer Master Key (CMK)
- Snapshots/Volumes by default inherit status of their source
- Can't change CMK of volume/snapshot but
- encrypt > encrypt w/ new CMK w/ copy snapshot
- unencrypted > encrypt: w/ copy snapshot
- encrypt > unencrypt: mount both, copy over
- Can share an encrypted snapshot, but only by using a non-default CMK and sharing both the snapshot and the CMK w/ the other account
- Snapshots
- Stored on S3 under the hood (can't see them)
- incremental in nature (behind the scenes first snapshot might still exist)
- When creating snapshots of EBS volumes that are configured in a RAID array, it is critical that there is no data I/O to or from the volumes when the snapshots are created
- freeze filesystem, unmount, or stop instance
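A quick worked example of the EBS numbers above, assuming the 3 IOPS/GiB baseline and the 256KB maximum I/O size from these notes. The helper names are made up, KB is treated as KiB for simplicity, and any floor or ceiling on General Purpose IOPS is deliberately not modeled.

```python
import math

MAX_IO_SIZE_KIB = 256      # from the notes: one I/O is at most 256KB
GP2_BASELINE_PER_GIB = 3   # from the notes: baseline 3 IOPS/GiB


def gp2_baseline_iops(size_gib):
    """Baseline IOPS for a General Purpose SSD volume (no floor/ceiling modeled)."""
    return GP2_BASELINE_PER_GIB * size_gib


def io_operations(request_kib):
    """Number of I/O operations a single request of this size is counted as."""
    return math.ceil(request_kib / MAX_IO_SIZE_KIB)


print(gp2_baseline_iops(500))  # 500 GiB volume
print(io_operations(1024))     # one 1 MiB request
```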
- Elastic File System (EFS)
- NFS for EC2
- EC2 Classic
- internal IPs are unstable after reboot
- Placement Groups
- instances w/in same AZ, low latency (close together), min 10 Gbps network
- stop / start as a group
- failure is "insufficient capacity error"
- CAN span peered VPCs, apparently
- Spot Instances
- If AWS terminates, no charge for last hour (you kill, you pay)
- The price per instance-hour for a Spot instance is set at the beginning of each instance-hour for the entire hour. Any changes to the Spot price will not be reflected until the next instance-hour begins.
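The Spot billing rules above (price fixed per instance-hour, last partial hour free only when AWS terminates) can be sketched as follows. This models only the hourly scheme described in these notes; the function and its inputs are hypothetical.

```python
def spot_charge(hourly_prices, hours_run, terminated_by_aws):
    """Charge for a Spot instance under the per-hour model in the notes.

    hourly_prices[i] is the Spot price fixed at the start of instance-hour i.
    hours_run is wall-clock runtime and may be fractional.
    """
    full_hours = int(hours_run)
    partial = hours_run > full_hours
    charge = sum(hourly_prices[:full_hours])
    # A partial final hour is billed in full only if you ended the instance.
    if partial and not terminated_by_aws:
        charge += hourly_prices[full_hours]
    return charge


prices = [0.10, 0.12, 0.15]  # made-up prices for three instance-hours
print(spot_charge(prices, 2.5, terminated_by_aws=True))
print(spot_charge(prices, 2.5, terminated_by_aws=False))
```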
S3
- URL
- to object bucketname.s3.amazonaws.com/path
- to static site linuxacademy-big-bucket.s3-website-ap-southeast-2.amazonaws.com
- unlimited storage
- no limit to number of objects in bucket
- 11 nines durability, 99.99% availability
- Created in a Region, stay there
- synchronized across all AZ w/in Region automatically
- new PUT read-after-write consistent
- overwrite PUT, DELETE eventual consistent
- bucketnames unique across ALL regions
- 100 buckets / account, can't change owner
- Account (not user) owns bucket
- min object size: 0 bytes
- max: 5GB per single PUT; 5TB per object via multipart upload (recommended for objects > 100MB)
- if rapid increase > 100 PUT/LIST/DELETE or > 300 GET reqs/sec, contact AWS
- CORS configuration
- JavaScript thing
- Resource Based Policies
- ACL to share bucket across account
- Bucket Policy
- Restrict off IP address, HTTP referrer,
- Will overwrite "Public Permissions"
- Can be edited by CloudFront Distributions
- Use IAM to give a User access to a specific bucket
- Resource format arn:aws:s3:::bucketname/folder
- User-based Policies
- IAM
- RRS - Reduced Redundancy Storage
- 99.99% durability and availability
- cheaper, for reproducible objects
- Versioning
- Off by default
- Once on, can only be suspended, not disabled (old versions continue to exist)
- Lifecycle policies
- Archive (Glacier) and Delete options
- Can be applied to Versions
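To make the "restrict off IP address" bucket policy idea above concrete, here is a minimal sketch that builds such a policy document in Python. The bucket name and CIDR range are placeholders chosen for illustration, and the single-statement layout is just one way to express the restriction.

```python
import json

BUCKET = "example-bucket"        # hypothetical bucket name
ALLOWED_CIDR = "203.0.113.0/24"  # documentation-only address range

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyOutsideAllowedRange",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:GetObject",
            # Objects are addressed as arn:aws:s3:::bucketname/key
            "Resource": f"arn:aws:s3:::{BUCKET}/*",
            "Condition": {"NotIpAddress": {"aws:SourceIp": ALLOWED_CIDR}},
        }
    ],
}

print(json.dumps(policy, indent=2))
```

The printed JSON is what would be pasted into the bucket's policy editor; the deny applies to any request whose source IP falls outside the allowed range.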
Glacier
VPC
- Spans all AZs in the Region
- Internet Gateway (IGs) attach to VPC
- /28 is smallest possible subnet, /16 largest
- can't resize (requires redo)
- Subnet
- Only belongs to one AZ
- /28 is smallest possible subnet, up to VPC range
- First 4, last 1 reserved by AWS
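The subnet sizing rules above (/28 smallest, /16 largest, 5 addresses reserved) work out as in this short sketch using Python's standard `ipaddress` module. The function name is an assumption for this example.

```python
import ipaddress

AWS_RESERVED = 5  # first 4 addresses plus the last one, per the notes


def usable_addresses(cidr):
    """Addresses left for instances in a subnet after the reserved ones."""
    network = ipaddress.ip_network(cidr)
    if not 16 <= network.prefixlen <= 28:
        raise ValueError("subnet must be between /16 and /28")
    return network.num_addresses - AWS_RESERVED


print(usable_addresses("10.0.0.0/28"))  # smallest subnet
print(usable_addresses("10.0.1.0/24"))
```

So the smallest subnet holds 16 addresses, of which 11 are usable.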
- Route Tables
- VPC comes w/ main route table
- Subnet must have route table (only one at a time, implicitly main).
- Route table can have multiple Subnets
- Subnet automatically associated with the main route table for the VPC (modifiable)
- private subnet - no internet gateway
- public requires IG + Route Table
- Use when serving traffic to Internet
- default subnets in default vpc are public (3 of them)
- Route Table
- Says all traffic to IG
- Attach RT to subnet
- Can set all instances to receive public IP
- Even in Public, Instances needs Elastic/Public IP to comm w/ internet
- Use a NAT for a Private Subnet to get updates
- By default, instances w/in VPC can all communicate w/ e/o regardless of pub/private via "local route"
- Can assign public IPs by default (routed to private IP (NAT))
- NAT
- NAT Gateway (AWS Provided)
- Add to Public subnet,
- Edit the Private Subnet's route table (the main one if none is explicitly associated) to send all traffic to the NAT
- Instance
- allow an instance in your VPC to initiate outbound connections to the Internet but prevent unsolicited inbound connections from the Internet
- AWS provides instance types
- Needs Public Subnet, Public IP
- SG setting: Allow 80/443 from subnet CIDR
- Disable source/destination check on EC2 instance
- Add Route Table to Private Subnet pointing to NAT instance id
- In VPC, reboot keeps internal + elastic IPs (not so for non-ElasticIP public ip classic)
- Network ACLs (firewall at subnet level)
- Live w/in a VPC. Can associate w/ multiple Subnets (Subnet has at most 1 Network ACL)
- Block all traffic from an IP/range at Network ACL level
- "stateless" - return traffic must be allowed through outbound rule
- Security Groups - stateful, always allow return traffic
- deny at low number trumps allow at high number (reverse NOT true)
- increment rules by 10
- Kind of like a Security Group, but for Subnets
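The rule ordering noted above (a deny at a low number beats an allow at a higher number, but not the other way round) follows from first-match evaluation. The sketch below is a toy evaluator for source IPs only, with no ports or protocols; the rule tuples and function name are assumptions, and the fall-through deny mirrors the default rule every Network ACL ends with.

```python
import ipaddress


def evaluate_nacl(rules, source_ip):
    """Return "allow" or "deny" for source_ip.

    rules: list of (rule_number, cidr, action). Rules are checked in
    ascending rule-number order and the first match decides.
    """
    address = ipaddress.ip_address(source_ip)
    for _number, cidr, action in sorted(rules):
        if address in ipaddress.ip_network(cidr):
            return action
    return "deny"  # nothing matched: fall through to the implicit deny


rules = [
    (100, "0.0.0.0/0", "allow"),
    (90, "198.51.100.0/24", "deny"),  # lower number, so checked first
]
print(evaluate_nacl(rules, "198.51.100.7"))
print(evaluate_nacl(rules, "192.0.2.1"))
```

Swapping the two rule numbers would let the blocked range through, because the broad allow would then match first.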
- VPC Peering Connections
- Can't do across regions (only between AZs in Region)
- Can do across accounts w/in same Region
- Can't have overlapping CIDR
- Can do 1:Many where children can't see each other (not transitive)
- Can configure Routes to act at VPC or Subnet level, or instance level
- Edit Route Tables of BOTH VPC/Subnets to point to the PC
- Only works w/in AWS (can't peer Internal network, use a VPG or DC instead)
- Can't "cross over" into S3 endpoints
- Virtual Private Gateway
- VPN connection from Customer Gateway to VPG in VPC
- Add on-site route tables to AWS Subnet, apply to VPG
- Could also run OpenVPN (not site-to-site)
- Works w/ mobile devices that have OpenVPN client
- Use w/ ElasticIP + backup OpenVPN to support High Availability
- Bastion Host
- Log in from the web, it has access to private subnet resources
- Limits
- 5 VPCs / region
- 1 IG / VPC limit
- ergo 5 IGs/region
- 50 customer gateways / region
- 50 VPN connections /region
- 200 subnets/VPC
- 200 Network ACLs/VPC, 20 rules/ACL
- 200 route tables/VPC, 50 entries/table
- 5 elastic IP/region
- 500 Security Groups / VPC, 50 rules/SG, 5 SGs/network interface
RDS
- Automatic point-in-time backups, updates
- Multi-AZ Deployment
- primary DB instance is synchronously replicated across Availability Zones to the standby replica
- InnoDB only for MySQL
- Read replicas for heavy read only traffic
- avoids need for user-initiated point-in-time restore
- DB Subnet Groups
- Needs multiple Subnets across (at least two) AZs
- generally private subnets
- Use DNS to connect to DB instance; IP can change on failover
- Publicly Accessible needs to be True (plus SG, etc) to be accessible
- Can move VPCs for non-Aurora dbs
- Supported DBs
- MySQL // Aurora
- MariaDB
- Postgres
- Oracle
- MSSQL
- 5 GB to 6 TB of storage
- Can't do MySQL Clusters, would need to run on EC2
- RDS does NOT support a cluster of instances w/ load balancing traffic
- Can use ElastiCache clusters for caching db session info
- Shares same Security Groups as EC2
- Can encrypt data at rest
- Can use an SSL certificate for connections
- CNAME can be used w/ Route53 to give it a different DNS name
- Automated Backup
- automated point in time recovery (serious)
- default 1 day (free), can be set up to 35
- Snapshots
- user initiated
- Encryption
DynamoDB
- Fully managed NoSQL (comparable to MongoDB): HA, scaling
- Sync'd across AZ within Region
- Dev specifies table throughput
- Document and Key/Value
- Can use SSL
- Use case: User session data
AWS Database Migration Service
- Done while running
- homogeneous, heterogeneous migrations
- Can stream to Redshift
Redshift
- petabyte-scale data warehouse for BI
- Redshift columns = 1024kb
- Supports SQL tools w/ ODBC/JDBC connections (Postgres-based)
- Columnar data store
- Monitors and backs up data, can enable encryption
ElastiCache - in-memory data store
- Redis
- Memcached
Amazon Storage Gateway
- Local Storage that backs up to S3
- Gateway-Cached Volumes:
- iSCSI mounted on-premise. Writes to S3, caches locally. Store all data in S3, cache most-frequently accessed locally.
- less-limited, only cache data locally, everything else in S3 (cheaper?)
- Gateway-Stored Volumes:
- local storage, periodic incremental snapshots
- limited to the amount of space you allocate to the VM (eg potentially more costly)
- Recommend use w/ Direct Connect
Import/Export
- Mail hard drives
- Example: Baseline data into S3 (use incremental updates over internet)
- Snowball
- Secure appliance (encrypted)
Direct Connect - dedicated private connection from ISP to AWS
- Not over internet
- Private Virtual Interface
- Only internal IP addresses inside of EC2
- Public Virtual Interface
- connect to public AWS endpoints
- Cross Connect
- Alternative to VPN
- 1 to 10 Gbps
IAM
- Global users, all AWS regions
- New users start with no permissions
- Federated - can integrate w/ existing LDAP or Kerberos
- Security Token Service (STS)
- Temporary permissions for users/role
- SAML - integrate w/ active directory
- Groups -
- Collection of IAM users
- Deny overrides allow
- Roles - other AWS resources (users, EC2 instances can assume)
- Identity Provider Access - grant on-premises networks Role access
- Don't have API credentials (password or access keys)
- EC2 Instance can only assume ONE Role when it is first created (cannot change or add)
- Roles are always preferred to API keys
- Default PowerUser role has access to everything except IAM
- Delegation requires
- Trust Policy
- Permissions Policy
- Temporary Credentials
- Policies
- Last 5 versions are tracked
- ARN - Amazon Resource Name (includes amazon account id)
- Security Token Service - assume a temporary role, do something (write to database), expires
- Cloudtrail - logs API calls
- Resource-level permissions
- EC2
- EBS: attach, delete, detach
- Could require MFA for actions
- Cannot apply permissions to Root
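Two of the IAM rules above, "new users start with no permissions" and "deny overrides allow", can be sketched as a tiny evaluator. This is a heavily simplified model (exact action match only, no wildcards, resources, or conditions), and the function name is hypothetical.

```python
def is_allowed(statements, action):
    """Evaluate statements like {"Effect": "Allow", "Action": "s3:GetObject"}.

    Simplified: exact action match only; no wildcards, resources, or conditions.
    """
    matching = [s["Effect"] for s in statements if s["Action"] == action]
    if "Deny" in matching:
        return False            # an explicit deny always wins
    return "Allow" in matching  # no matching allow means implicit deny


group_policy = [{"Effect": "Allow", "Action": "s3:GetObject"}]
user_policy = [{"Effect": "Deny", "Action": "s3:GetObject"}]

print(is_allowed([], "s3:GetObject"))                          # new user
print(is_allowed(group_policy, "s3:GetObject"))                # allowed via group
print(is_allowed(group_policy + user_policy, "s3:GetObject"))  # deny wins
```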
AWS WAF
- Web application firewall, blocks common attacks (SQL injection, cross-site scripting)
Route53
- Apex of domain is "bare", w/o subdomain "example.com"
- Route53 can Alias to
- ELB
- CloudFront
- Elastic Beanstalk
- S3 Bucket configured Static
- other R53 record in zone
- Routing Policy
- Simple
- Latency based routing / multiregion failover
- "Active-Active"
- Routes to whichever has lowest latency
- Requires duplicate architecture (use CloudFormation)
- Weighted
- "Active-Active"
- Probabilistic off weighting
- Useful for A/B tests
- Geo-based Routing
- Compliance w/ laws
- Failover
- Active-Passive
- Example was ELB-Instance failing over to CloudFront-S3
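The "probabilistic off weighting" behaviour of the Weighted policy above can be imitated with a weighted random choice. This only simulates the idea locally; the endpoint names and the 90/10 split are made up, and it does not call Route53.

```python
import random
from collections import Counter


def pick_endpoint(weights, rng=random):
    """Pick one endpoint with probability proportional to its weight."""
    endpoints = list(weights)
    return rng.choices(endpoints, weights=[weights[e] for e in endpoints], k=1)[0]


rng = random.Random(42)  # seeded so the run is repeatable
weights = {"version-a": 90, "version-b": 10}  # hypothetical A/B split
counts = Counter(pick_endpoint(weights, rng) for _ in range(10_000))
print(counts)
```

Over many lookups roughly 90% of answers point at `version-a`, which is why weighted records suit A/B tests.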
SNS
- Notifications when events occur in AWS
- Topic: what a message is sent to
- Subscription/Subscriber: who/what gets the message
- SMS, HTTPS, JSON, SQS
- Exists at Region level (i.e., don't pick an AZ)
SQS - Simple Queue Service
- Exists at Region level (i.e., don't pick an AZ)
- Distributed and Decoupled applications (Fault Tolerant)
- Messages up to 256KB
- Default message retention in queue is 4 days
- Can set to 60s to 14 days
- Standard Queue
- at-least-once message delivery
- FIFO (First In First Out)
- Maximum inflight messages
- 120,000 standard
- 20,000 FIFO
- Default Visibility Timeout is 30s
- Guaranteed delivery
- Order best effort
- Can create an Auto Scaling group for component based off queue size
- Long Polling
- 1-20s waits for all messages in queue
- Reduces cost when you get lots of empty responses
- Queries all servers
- Short Polling (default)
- increases API requests (increases cost)
- returns some (not all) messages in queue (queries subset of servers)
- dead-letter queue
- messages that cannot be processed successfully (poison-pill management)
- Must be in the same Region as the feeder queue
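The visibility timeout and dead-letter queue notes above can be illustrated with a toy in-memory queue. This is not the SQS API: the class, its method names, and the `max_receive_count` parameter (loosely modeled on the redrive setting) are assumptions for this sketch, and time is passed in explicitly to keep it deterministic.

```python
class ToyQueue:
    def __init__(self, visibility_timeout=30, max_receive_count=3):
        self.visibility_timeout = visibility_timeout
        self.max_receive_count = max_receive_count
        self.messages = {}      # id -> [body, visible_at, receive_count]
        self.dead_letters = []  # bodies moved aside after too many receives
        self._next_id = 0

    def send(self, body):
        self.messages[self._next_id] = [body, 0, 0]
        self._next_id += 1

    def receive(self, now):
        """Return (id, body) for one visible message, or None."""
        for msg_id, entry in list(self.messages.items()):
            body, visible_at, count = entry
            if visible_at > now:
                continue  # still in flight with another consumer
            if count >= self.max_receive_count:
                self.dead_letters.append(body)  # poison pill: set it aside
                del self.messages[msg_id]
                continue
            entry[1] = now + self.visibility_timeout
            entry[2] = count + 1
            return msg_id, body
        return None

    def delete(self, msg_id):
        """Consumers delete a message once it is processed successfully."""
        self.messages.pop(msg_id, None)


queue = ToyQueue(visibility_timeout=30, max_receive_count=2)
queue.send("job")
print(queue.receive(now=0))   # delivered, then hidden for 30s
print(queue.receive(now=10))  # nothing visible yet
print(queue.receive(now=30))  # never deleted, so delivered again
print(queue.receive(now=60))  # receive limit hit: moved to dead letters
print(queue.dead_letters)
```

A consumer that never deletes the message sees it again after each timeout, which is the at-least-once behaviour; after the receive limit it lands in the dead-letter list.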
SWF - Simple Workflow Service
- Distributed and Decoupled applications (Fault Tolerant)
- coordinates asynchronously across multiple devices
- guaranteed order, no duplicates
- Execution can last up to 1 year
- Activity Task
- Decision Task
- API is task-oriented (SQS is message-oriented)
Amazon API Gateway
- Service to build RESTful APIs that expose Lambda, HTTP, and other backends
EMR - Elastic MapReduce
- Hadoop Master/Slave
- Master, Core, Task Nodes
- Can launch apps on it (Hive, Pig compatible)
- Launches a pre-built Hadoop cluster
- You can login (unlike RDS)
- Input data from S3, DynamoDB, RedShift
- S3 mounted by default
- Chunk out data into 128MB sizes (default, can be changed)
- split files loaded into memory
- Preconfigured Hadoop AMI w/ Mappers/Reducers per instance size
- Can use CloudWatch to configure number of workers
- customers may encrypt the input data before they upload it to Amazon S3 (using any common data encryption tool); they then need to add a decryption step to the beginning of their cluster when Amazon EMR fetches the data from Amazon S3.
Kinesis
- Collect data from multiple Producers, maintains order
- "Stream"
- Preserves for 24hrs default, 7 days max
- Data blob can be 1 MB
- Run SQL queries on streaming data
- Emit from Kinesis Streams to S3, Redshift, EMR, Lambda (Consumers)
- Scale across multiple shards
- Aggregation refers to the storage of multiple records in a Streams record
Elastic Beanstalk
- Uses CloudFormation templates
- Deploy less-complex applications, single tier, core services
- EC2
- Auto Scaling
- ELB
- RDS
- SQS
- CloudFront
- Supported Platforms:
- Docker
- Java
- Windows .NET
- Node.js
- PHP
- Python
- Ruby
- Don't use if:
- Need software updates on boot?
- Integrates with version control
- Web server Environment vs Worker Environment
- Deploys an ELB, integrates w/ Route53
CloudFormation - create and provision resources w/ templates
- JSON Templates "infrastructure as code"
- Stack - template that deploys infrastructure
- Can create from existing application architecture
- 25 at a time
- Resources
- AWS things to be launched
- Can configure Parameters that prompt for user input
- "Automatic rollback on error" enabled by default
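For a feel of what a JSON template with Parameters and Resources looks like, the sketch below builds a minimal one as a Python dict and prints it. The logical names (`InstanceTypeParam`, `WebServer`), the default instance type, and the AMI placeholder are all made up for illustration.

```python
import json

template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Description": "Illustrative single-instance stack",
    "Parameters": {
        # Prompts the user for a value when the stack is created.
        "InstanceTypeParam": {
            "Type": "String",
            "Default": "t2.micro",
            "Description": "EC2 instance type to launch",
        }
    },
    "Resources": {
        # The AWS things to be launched.
        "WebServer": {
            "Type": "AWS::EC2::Instance",
            "Properties": {
                "InstanceType": {"Ref": "InstanceTypeParam"},
                "ImageId": "ami-EXAMPLE",  # placeholder, not a real AMI ID
            },
        }
    },
}

print(json.dumps(template, indent=2))
```

`Ref` wires the parameter into the resource, so the same template can launch different instance sizes without being edited.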
CloudWatch
- Can be used to shut down inactive instances
- integrates w/ CloudTrail for AWS environment change monitoring
- Only accessible via SSL endpoint
- EC2 Instances "built in" hypervisor metrics
- CPU
- Network I/O
- Disk Read/Write
- EC2 Instance custom metrics (non-RDS)
- disk usage / free
- swap
- memory use /free
- Status Check
- "self healing applications"
- Basic: 5 min, Detailed: 1 min
CloudTrail
- Log any action taken against AWS API, user details of who did it
- account, IP, time, parameters, response
CloudFront - CDN, uses edge locations to serve data
- Edge Location - AWS datacenter w/o services
- origin (S3 bucket, ELB CNAME) to edge location
- distribution can have multiple origins (by device type)
- Web or RTMP (Flash media)
- Restrict Bucket Access: Only accessible via CloudFront, ignore Bucket policy
- CloudFront adds ACL to Bucket Policy (OAI?)
- S3 Bucket permissions still apply otherwise
- Private content
- can create one time use "signed URLs" (or cookies) for private content
- urls for RTMP, individual files
- cookies for multiple files, or to not change URLs
- specify ending datetime (optional: starttime, IP address range)
- Limit S3 access to just CloudFront
- Create Origin Access identity (OAI)
- Give it S3 read permissions, lock out everyone else
- Price Class: varies by geographic spread
- Route53
- can create alternate cnames for long CloudFront names (eg "cdn.mywebsite.com")
- Can create two Alias records, one as primary one as failover (eg, ELB to Instance, CloudFront to S3)
- "Active-Passive"
- caches files until
- cache expires
- or overwrite w/ new name
- or create an "invalidation" (costs, might be cheaper to recreate a new CloudFront distribution & update CNAME)
AWS Config
- Point in time snapshot of AWS
- JSON
- Stores config changes
Trusted Advisor
- Notify you of security holes, High Availability issues, cost, performance
- According to AWS best practices
CloudHSM
- Hardware Security Module
- You can implement in multiple AZs and enable replication
- dedicated hardware to store keys, encryption, etc
- isolated physically, tamper resistant
- Compliance reasons, encrypt data/files
- AWS Engineers don't have access
Key Management Service
AWS CloudSearch
- Upload data, they index and provide a search interface
AWS ElasticSearch Service
ECS EC2 Container Service
- Start/Stop Docker containers
AWS Lambda
- Run code in response to "events"
- AWS handles all scaling