alcheng10/AWS_Best_Practices.md

## AWS_Best_Practices.md

      
    AWS Best Practices Checklist - Serverless

This checklist uses markdown formatting and therefore can be easily incorporated into a git repository's README.md.
General


 Tagging of stack and all resources - e.g. cost tags, microservices, projects
 AWS resources are created via Infrastructure-as-Code (Terraform, Serverless.com, CDK, Pulumi, SAM, etc.)
 Environment variables are passed during deployment (including Stage of stack, such as DEV, PROD, etc.)
 RBAC and Least-Privilege Principle applied - IAM roles limited to only what is needed

SQS

Access


 For request-response or remote procedure call (RPC) systems, don't create reply queues per message. Instead, create reply queues on startup, per producer, and use a correlation ID message attribute to map replies to requests.
 For request-response or remote procedure call (RPC) systems, don't let your producers share reply queues. This can cause a producer to receive response messages intended for another producer.
 Enable long polling and use it in preference to short polling whenever possible (enabled for Lambda by default)
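
The reply-queue guidance above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the attribute names `CorrelationId` and `ReplyTo`, the `build_request` helper and the `PendingRequests` class are all assumptions made for the example. Only the dictionary shapes follow the SQS message attribute format.

```python
import uuid


def build_request(body: str, reply_queue_url: str) -> dict:
    """Build keyword arguments for an SQS send_message call.

    The attribute names "CorrelationId" and "ReplyTo" are hypothetical;
    any names agreed between producer and consumer would work.
    """
    correlation_id = str(uuid.uuid4())
    return {
        "MessageBody": body,
        "MessageAttributes": {
            "CorrelationId": {"DataType": "String", "StringValue": correlation_id},
            "ReplyTo": {"DataType": "String", "StringValue": reply_queue_url},
        },
    }


class PendingRequests:
    """Tracks the in-flight requests of ONE producer, keyed by correlation ID."""

    def __init__(self):
        self._pending = {}

    def register(self, request: dict, context) -> str:
        """Remember the caller's context until the matching reply arrives."""
        cid = request["MessageAttributes"]["CorrelationId"]["StringValue"]
        self._pending[cid] = context
        return cid

    def resolve(self, reply: dict):
        """Return the stored context for a reply, or None if it is unknown."""
        attrs = reply.get("MessageAttributes", {})
        cid = attrs.get("CorrelationId", {}).get("StringValue")
        return self._pending.pop(cid, None)
```

A producer would create its reply queue once at startup, then send with something like `sqs.send_message(QueueUrl=request_queue_url, **build_request(...))` and pass each received reply to `resolve`. A reply that resolves to `None` was not requested by this producer (or was already handled), which is the symptom to expect if reply queues are shared.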

Security


 Server-Side Encryption (KMS) is enabled

Message Throughput and Error Handling


 Batch message actions
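 Check batch limits (10 messages) and message size limit (256 KB)
 Check throughput limits (Standard queues support a nearly unlimited number of API calls per second; FIFO queues support 300 API calls per second per action, or up to 3,000 messages/second with batching, unless high-throughput mode is enabled) and in-flight message limits (Standard - 120,000 and FIFO - 20,000); verify these figures against the current SQS quotas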
 Handle SQS over-scaling with backpressure control and Lambda reserved concurrency
 Handle SQS over-pulling by ensuring the Visibility Timeout is at least 6x the Lambda Timeout and the redrive policy's Max Receive Count is >= 5
 Set the message visibility timeout appropriately
 Handle retries and duplicate messages (consumers should be idempotent), and rely on the SDK's exponential backoff for retried API calls where needed
 Make sure that the Maximum Retention Period is set correctly.
 Configure a dead-letter queue - set its retention period to a greater value than that of the original queue (time spent in the original queue counts towards the retention period)
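
As a rough sketch of the timeout, redrive and batching items above, the helpers below build attribute dictionaries in the shape accepted by the SQS `create_queue` / `set_queue_attributes` calls. The function names, the default values and the choice to give the dead-letter queue the maximum retention are assumptions for illustration only.

```python
import json

MAX_VISIBILITY_TIMEOUT = 43_200    # seconds (12 hours), SQS maximum
MAX_RETENTION_SECONDS = 1_209_600  # seconds (14 days), SQS maximum
MAX_BATCH_ENTRIES = 10             # entries per batch API call


def source_queue_attributes(lambda_timeout_s, dlq_arn,
                            retention_s=345_600, max_receive_count=5):
    """Attributes for a queue consumed by Lambda (values are strings, as SQS expects)."""
    # Visibility timeout of six times the function timeout, capped at the SQS maximum.
    visibility = min(6 * lambda_timeout_s, MAX_VISIBILITY_TIMEOUT)
    return {
        "VisibilityTimeout": str(visibility),
        "MessageRetentionPeriod": str(retention_s),
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": dlq_arn,
            "maxReceiveCount": max_receive_count,
        }),
    }


def dlq_attributes(source_retention_s):
    """Give the dead-letter queue a longer retention than its source queue."""
    if source_retention_s >= MAX_RETENTION_SECONDS:
        raise ValueError("source retention leaves no headroom for the DLQ")
    # Assumption: simply use the maximum so failed messages stay inspectable.
    return {"MessageRetentionPeriod": str(MAX_RETENTION_SECONDS)}


def batches(messages, size=MAX_BATCH_ENTRIES):
    """Yield slices of at most `size` messages, one per batch API call."""
    for start in range(0, len(messages), size):
        yield messages[start:start + size]
```

The dictionaries would be passed as `Attributes=` when creating or updating each queue, and each slice from `batches` would become one `send_message_batch` or `delete_message_batch` call. The size check per batch is not covered here.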

Logging


 Configure CloudWatch metrics and alarms to be notified of errors and backpressure in the queue.
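
One way to detect backpressure is an alarm on the age of the oldest message in the queue. The sketch below only builds the keyword arguments for the CloudWatch `put_metric_alarm` call; the helper name, the alarm naming scheme, the one-minute period and the five evaluation periods are assumptions to adjust per workload.

```python
def queue_age_alarm(queue_name, threshold_s, sns_topic_arn):
    """Keyword arguments for an alarm that fires when messages wait too long."""
    return {
        "AlarmName": f"{queue_name}-oldest-message-age",  # hypothetical naming scheme
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateAgeOfOldestMessage",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",
        "Period": 60,                # seconds per datapoint
        "EvaluationPeriods": 5,      # consecutive breaching datapoints required
        "Threshold": float(threshold_s),
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # an idle queue is not an error
        "AlarmActions": [sns_topic_arn],
    }
```

With boto3 this would be applied as `boto3.client("cloudwatch").put_metric_alarm(**queue_age_alarm(...))`. A similar alarm on the dead-letter queue's `ApproximateNumberOfMessagesVisible` metric would cover the error case.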

Lambda

Code


 Separate the Lambda handler (entry point) from your core logic
 Take advantage of Execution Context reuse to improve the performance of your function (e.g. to speed up warm starts, use global variables/connections and lazy initialisation)
 Use AWS Lambda Environment Variables to pass operational parameters to your function
 Minimise and control dependencies to run-time necessities - a .zip deployment package, including dependencies and layers, must be <= 250 MB unzipped
 Avoid using recursive code
 Shared code/libraries are sourced from Lambda Layers where appropriate
 Delete test and unnecessary functions which are no longer required
 Ensure Lambda handler returns API Gateway-compliant response (if integrated with API Gateway)
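
Several of the code items above (thin handler, execution context reuse, environment variables, API Gateway-compliant response) can be illustrated in one small module. This is a sketch under assumptions: the `STAGE` variable name, the `greet` function and the placeholder client are invented for the example, and the response shape is the one used by API Gateway's Lambda proxy integration.

```python
import json
import os

# Read once per execution environment, not on every invocation.
STAGE = os.environ.get("STAGE", "DEV")  # hypothetical variable name

_client = None  # reused across warm invocations


def get_client():
    """Lazily create an expensive client and keep it for later invocations."""
    global _client
    if _client is None:
        _client = object()  # stand-in for e.g. boto3.client("dynamodb")
    return _client


def greet(name: str, stage: str) -> dict:
    """Core logic: no Lambda or API Gateway types, so it is easy to unit test."""
    return {"message": f"Hello, {name}", "stage": stage}


def _response(status: int, payload: dict) -> dict:
    """Shape a result the way the Lambda proxy integration expects."""
    return {
        "statusCode": status,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(payload),  # body must be a string
    }


def lambda_handler(event, context):
    """Entry point: parse the event, delegate, format the response."""
    params = event.get("queryStringParameters") or {}
    name = params.get("name")
    if not name:
        return _response(400, {"error": "missing 'name' query parameter"})
    get_client()  # would be passed to the core logic if it needed AWS access
    return _response(200, greet(name, STAGE))
```

Keeping `greet` free of event parsing means it can be tested or reused outside Lambda, while the handler stays a few lines long.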

Resource Allocation


 Load testing and optimising CPU and memory allocation
 Consider Lambda Timeout Period in light of event trigger type - synchronous (e.g. API Gateway), async (e.g. SNS, S3), stream and poll-based (e.g. SQS, DynamoDB).
 Ensure Visibility Timeout period is >= Lambda Timeout Period (if function invoked by SQS); at least 6x the Lambda Timeout is the recommended value
 Establish dead-letter queues for asynchronously invoked Lambdas to allow for reprocessing if required (if not using SQS as event trigger)
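 Manage reserved concurrency (if set to zero, the function is completely throttled and will not run) - with the default regional quota of 1,000 concurrent executions, at most 900 can be reserved because 100 must remain unreserved (the quota can be increased via a support ticket)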

Logging


 Create CloudWatch alarms for errors, concurrent executions or excessive invocation duration
 Enable AWS X-Ray and, for Python Lambdas, patch boto3 with the X-Ray SDK so that AWS SDK calls are traced
 Ensure Lambda Execution IAM role has necessary CloudWatch and X-Ray policies
 Ensure Lambda Execution IAM role has necessary event read policies (e.g. S3 Read, SQS Read)
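
To make the two IAM items above concrete, here is a sketch of a least-privilege policy document for a function that writes logs, sends X-Ray traces and consumes one SQS queue. The helper name and the assumption that the log group is already created by the Infrastructure-as-Code stack are illustrative, and the ARNs in any usage are placeholders.

```python
import json


def execution_policy(log_group_arn: str, queue_arn: str) -> dict:
    """Build an IAM policy document scoped to one log group and one queue."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "WriteLogs",
                "Effect": "Allow",
                "Action": ["logs:CreateLogStream", "logs:PutLogEvents"],
                # Log streams live underneath the log group ARN.
                "Resource": f"{log_group_arn}:*",
            },
            {
                "Sid": "SendTraces",
                "Effect": "Allow",
                "Action": ["xray:PutTraceSegments", "xray:PutTelemetryRecords"],
                # These X-Ray actions do not support resource-level scoping.
                "Resource": "*",
            },
            {
                "Sid": "ConsumeQueue",
                "Effect": "Allow",
                "Action": [
                    "sqs:ReceiveMessage",
                    "sqs:DeleteMessage",
                    "sqs:GetQueueAttributes",
                ],
                "Resource": queue_arn,
            },
        ],
    }


def to_json(policy: dict) -> str:
    """Serialise the document for an IaC template or an IAM API call."""
    return json.dumps(policy, indent=2)
```

For other event sources the `ConsumeQueue` statement would be swapped for the matching read actions (for example `s3:GetObject` on a bucket prefix).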

Networking


 Determine if Lambda needs to be deployed in a VPC. General rule is Lambdas should not be in VPCs, unless the resources they access exist in a VPC (e.g. EC2, RDS)
 If Lambda is deployed into a VPC, consider whether it needs to access any public internet resources. If so, ensure a NAT is configured for the VPC. Further ensure that the Lambda Execution Role allows creation of ENIs.

S3


 Compliant S3 Bucket names chosen - name should not contain periods '.' to be DNS friendly
 Consider enabling CloudTrail and access logging to detect S3 configuration changes
 Consider enabling S3 buckets versioning if required in order to recover overwritten or deleted data.
 Enable S3 Bucket Default Encryption to automatically encrypt all objects stored in S3
 Enable blocking public access by default, unless otherwise needed
 Server-Side Encryption (SSE). Ensure that S3 buckets are protected by encryption at rest.
 A suitable storage class has been chosen and a Lifecycle policy created
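
The S3 items above map onto a handful of bucket-level settings. The sketch below checks a bucket name against this checklist's stricter rule (no periods) and builds the configuration dictionaries in the shape accepted by the S3 `put_public_access_block`, `put_bucket_encryption` and `put_bucket_lifecycle_configuration` calls. Helper names, the rule ID and the day counts are assumptions.

```python
import re

# 3-63 characters, lowercase letters, digits and hyphens only, starting and
# ending with a letter or digit. Periods are deliberately rejected.
_BUCKET_NAME = re.compile(r"[a-z0-9][a-z0-9-]{1,61}[a-z0-9]")


def is_checklist_compliant_name(name: str) -> bool:
    """True if the name is DNS friendly under this checklist's rule."""
    return _BUCKET_NAME.fullmatch(name) is not None


# Block every form of public access unless a bucket explicitly needs it.
PUBLIC_ACCESS_BLOCK = {
    "BlockPublicAcls": True,
    "IgnorePublicAcls": True,
    "BlockPublicPolicy": True,
    "RestrictPublicBuckets": True,
}


def default_encryption(kms_key_id=None) -> dict:
    """Default encryption: SSE-KMS when a key is given, otherwise SSE-S3."""
    if kms_key_id:
        rule = {"SSEAlgorithm": "aws:kms", "KMSMasterKeyID": kms_key_id}
    else:
        rule = {"SSEAlgorithm": "AES256"}
    return {"Rules": [{"ApplyServerSideEncryptionByDefault": rule}]}


def lifecycle(transition_days=30, expire_days=365) -> dict:
    """Move objects to infrequent access, then expire them (assumed day counts)."""
    return {
        "Rules": [{
            "ID": "tier-then-expire",   # hypothetical rule name
            "Status": "Enabled",
            "Filter": {"Prefix": ""},   # applies to every object
            "Transitions": [{"Days": transition_days, "StorageClass": "STANDARD_IA"}],
            "Expiration": {"Days": expire_days},
        }]
    }
```

Each dictionary is passed alongside `Bucket=` to its corresponding call. The same values translate directly into Terraform, CDK or SAM properties if the bucket is created through Infrastructure-as-Code.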

API Gateway

Access


 Consider use-case for API Gateway - should only be for user interaction front-end or microservice API. Backend should use Queues, Triggers, AWS SDK etc and IAM roles to control access.
 Use versioning between DEV and PROD versions of API
 Consider how long cached responses will be retained (cache TTL)

Security


 If non-public facing API, use a resource policy to restrict invocation to only whitelisted IPs
 Mandate authentication via API Key
 API is secured using HTTPS
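 Set up CORS headers in the Lambda handler response if the API is called cross-origin from a browser (with Lambda proxy integration the handler must return these headers itself)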
 Consider quotas/throttling of API requests (default is 10,000 requests/second with a burst of 5,000 concurrent requests)
 Consider creating SSL certificates with AWS Certificate Manager for public-facing APIs
 Consider enabling AWS Config and CloudTrail to monitor configurations on API (to prevent misconfiguration security leaks)
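
For the IP restriction item above, a resource policy can allow invocation in general and then deny any caller outside the listed ranges. The sketch below builds such a policy document; the helper name is an assumption, the CIDR ranges in any usage are placeholders, and the allow-plus-deny layout is one common pattern rather than the only valid one.

```python
import ipaddress
import json


def ip_allowlist_policy(cidrs) -> dict:
    """Resource policy that denies execute-api:Invoke outside the given ranges."""
    # ip_network raises ValueError for malformed input or set host bits.
    validated = [str(ipaddress.ip_network(cidr)) for cidr in cidrs]
    if not validated:
        raise ValueError("at least one CIDR range is required")
    resource = "execute-api:/*/*/*"  # every stage, method and path of this API
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": "*",
                "Action": "execute-api:Invoke",
                "Resource": resource,
            },
            {
                "Effect": "Deny",
                "Principal": "*",
                "Action": "execute-api:Invoke",
                "Resource": resource,
                "Condition": {"NotIpAddress": {"aws:SourceIp": validated}},
            },
        ],
    }


def to_json(policy: dict) -> str:
    """Serialise the policy for attachment to the REST API."""
    return json.dumps(policy)
```

The API has to be redeployed after its resource policy changes for the new policy to take effect.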

Logging


 Establish CloudWatch alarms on metrics - integration latency, latency, cache hit and miss counts, throttling and 4XX HTTP responses
 Logging with CloudWatch or Kinesis Data Firehose