@alcheng10
Last active November 29, 2022 06:22
AWS Serverless (Lambda, S3, SQS, API Gateway) Best Practices Checklist

AWS Best Practices Checklist - Serverless

This checklist uses markdown formatting and therefore can be easily incorporated into a git repository's README.md.

General

  • Tagging of stack and all resources - e.g. cost tags, microservices, projects
  • AWS resources are created via Infrastructure-as-Code (Terraform, Serverless.com, CDK, Pulumi, SAM, etc.)
  • Environment variables are passed during deployment (including Stage of stack, such as DEV, PROD, etc.)
  • RBAC and Least-Privilege Principle applied - IAM roles limited to only what is needed

SQS

Access

  • For request-response or remote procedure call (RPC) systems, don't create reply queues per message. Instead, create reply queues on startup, per producer, and use a correlation ID message attribute to map replies to requests.
  • For request-response or remote procedure call (RPC) systems, don't let your producers share reply queues. This can cause a producer to receive response messages intended for another producer.
  • Enable long polling and use it in preference to short polling whenever possible (enabled for Lambda by default)

Security

  • Server-Side Encryption (KMS) is enabled

Message Throughput and Error Handling

  • Batch message actions
  • Check batch limits (10 messages per batch) and the message size limit (256 KB)
  • Check throughput limits (Standard - nearly unlimited API calls/second; FIFO - 300 API calls/second per action, or up to 3,000 messages/second with batches of 10) and in-flight message limits (Standard - approximately 120,000 and FIFO - 20,000); confirm against current service quotas
  • Handle SQS over-scaling with backpressure control and Lambda reserved concurrency
  • Handle SQS over-pulling by setting the Visibility Timeout to at least 6x the Lambda Timeout and the redrive policy's Max Receive Count to at least 5
  • Set the message visibility timeout appropriately
  • Handle retries and duplicate messages, using the SDK's exponential backoff for retries if needed
  • Make sure that the Maximum Retention Period is set correctly.
  • Configure a dead-letter queue - set its retention period to a greater value than the original queue's (time spent in the original queue counts towards the retention period)
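
As one way to apply the batching items above, the sketch below groups message bodies into batches that respect the 10-entry and 256 KB limits quoted in this checklist. The function names and constants are hypothetical, and the size check is a simplification: it counts only body bytes, whereas message attributes also count towards the real limit.

```python
MAX_BATCH_ENTRIES = 10        # batch entry limit quoted in the checklist
MAX_BATCH_BYTES = 256 * 1024  # size limit quoted in the checklist; verify the current quota


def chunk_messages(bodies):
    """Group message bodies into batches that respect both limits."""
    batches, current, current_bytes = [], [], 0
    for body in bodies:
        size = len(body.encode("utf-8"))
        if size > MAX_BATCH_BYTES:
            raise ValueError("single message exceeds the size limit")
        # Start a new batch when adding this body would break either limit.
        if len(current) == MAX_BATCH_ENTRIES or current_bytes + size > MAX_BATCH_BYTES:
            batches.append(current)
            current, current_bytes = [], 0
        current.append(body)
        current_bytes += size
    if current:
        batches.append(current)
    return batches


def to_entries(batch):
    """Build SendMessageBatch-style entries with an Id unique within the batch."""
    return [{"Id": str(i), "MessageBody": body} for i, body in enumerate(batch)]
```

Each list returned by `to_entries` could then be passed as the `Entries` argument of a `send_message_batch` call, with per-entry failures in the response retried separately.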

Logging

  • Configure CloudWatch metrics and alarms to be notified of errors and backpressure in the queue.
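
A backpressure alarm of this kind might look like the sketch below, which only builds the keyword arguments for CloudWatch's `PutMetricAlarm` operation. The function name, alarm name, period and evaluation settings are assumptions for illustration; `ApproximateAgeOfOldestMessage` is a standard metric in the `AWS/SQS` namespace.

```python
def backlog_alarm(queue_name, max_age_seconds, sns_topic_arn):
    """Build PutMetricAlarm arguments that fire when messages wait too long."""
    return {
        "AlarmName": f"{queue_name}-oldest-message-age",  # hypothetical naming scheme
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateAgeOfOldestMessage",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",
        "Period": 60,              # seconds per datapoint (assumed)
        "EvaluationPeriods": 5,    # consecutive breaching datapoints (assumed)
        "Threshold": float(max_age_seconds),
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }
```

With boto3 available, the result could be applied as `boto3.client("cloudwatch").put_metric_alarm(**backlog_alarm(...))`. A similar alarm on the dead-letter queue's `ApproximateNumberOfMessagesVisible` metric is a common way to be notified of processing errors.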

Lambda

Code

  • Separate the Lambda handler (entry point) from your core logic
  • Take advantage of Execution Context reuse to improve the performance of your function (e.g. to speed up warm starts, use global variables/connections and lazy initialisation)
  • Use AWS Lambda Environment Variables to pass operational parameters to your function
  • Minimise and control the dependencies to run-time necessities - the function's deployment package, including dependencies and layers, needs to be <= 250 MB unzipped
  • Avoid using recursive code
  • Shared code/libraries are sourced from Lambda Layers where appropriate
  • Delete test and unnecessary functions which are no longer required
  • Ensure Lambda handler returns API Gateway-compliant response (if integrated with API Gateway)
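
Several of the items above (handler separated from core logic, execution context reuse, environment variables, and an API Gateway-compliant response) can be illustrated together. The sketch below assumes a Lambda proxy integration; the `greet` function, the `TABLE_NAME` variable and the dictionary standing in for a real SDK client are all hypothetical.

```python
import json
import os

_client = None  # module-level state survives across warm invocations


def get_client():
    """Lazily create an expensive resource once per execution environment."""
    global _client
    if _client is None:
        # Stand-in for a real SDK client; TABLE_NAME is a hypothetical variable.
        _client = {"table": os.environ.get("TABLE_NAME", "example-table")}
    return _client


def greet(name):
    """Core logic: knows nothing about Lambda events or HTTP."""
    return {"message": f"Hello, {name}!"}


def _response(status_code, payload):
    """Shape a result the way a Lambda proxy integration expects it."""
    return {
        "statusCode": status_code,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(payload),  # body must be a string
    }


def handler(event, context):
    """Entry point: translate the event, call core logic, format the reply."""
    params = event.get("queryStringParameters") or {}
    name = params.get("name")
    if not name:
        return _response(400, {"error": "name is required"})
    get_client()  # reused on warm starts instead of being rebuilt
    return _response(200, greet(name))
```

Keeping `greet` free of event parsing means it can be unit-tested without constructing API Gateway events.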

Resource Allocation

  • Load testing and optimising CPU and memory allocation
  • Consider Lambda Timeout Period in light of event trigger type - synchronous (e.g. API Gateway), async (e.g. SNS, S3), stream and poll-based (e.g. SQS, DynamoDB).
  • Ensure Visibility Timeout period is >= Lambda Timeout Period (if function invoked by SQS)
  • Establish dead-letter queues for asynchronously invoked Lambdas to allow for reprocessing if required (if not using SQS as event trigger)
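  • Manage reserved concurrency (if set to zero, the function is completely throttled and will not run) - with the default account limit of 1,000 concurrent executions per region, at most 900 can be reserved because 100 must stay unreserved (unless the limit is increased via support ticket)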
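
The timeout relationships for SQS-triggered functions can be checked before deployment. The sketch below is a hypothetical helper, not part of any AWS tooling; it treats a Visibility Timeout below the Lambda Timeout as an error and flags values below the 6x and Max Receive Count of 5 guidance given elsewhere in this checklist.

```python
def check_sqs_lambda_settings(lambda_timeout_s, visibility_timeout_s, max_receive_count):
    """Return a list of human-readable problems; an empty list means no findings."""
    problems = []
    if visibility_timeout_s < lambda_timeout_s:
        # Messages could become visible again while still being processed.
        problems.append("visibility timeout is shorter than the Lambda timeout")
    elif visibility_timeout_s < 6 * lambda_timeout_s:
        problems.append("visibility timeout is below 6x the Lambda timeout")
    if max_receive_count < 5:
        problems.append("maxReceiveCount is below 5")
    return problems
```

For example, `check_sqs_lambda_settings(30, 180, 5)` reports nothing, while a 60-second Visibility Timeout with the same function would be flagged.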

Logging

  • Create CloudWatch alarms for errors, concurrent executions or excessive invocation duration
  • Enable AWS X-Ray and, for Python Lambdas, patch boto3 with the X-Ray SDK so that AWS SDK calls are traced
  • Ensure Lambda Execution IAM role has necessary CloudWatch and X-Ray policies
  • Ensure Lambda Execution IAM role has necessary event read policies (e.g. S3 Read, SQS Read)
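
As an illustration of the two IAM items above, the sketch below builds a least-privilege policy document for a function triggered by SQS. The function name and `Sid` values are hypothetical, and the ARNs are supplied by the caller; the listed actions are the ones typically needed to poll a queue, write logs and send X-Ray traces, so adjust them to your own event source.

```python
def execution_policy(queue_arn, log_group_arn):
    """Build an IAM policy document for an SQS-triggered Lambda function."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadFromQueue",
                "Effect": "Allow",
                "Action": [
                    "sqs:ReceiveMessage",
                    "sqs:DeleteMessage",
                    "sqs:GetQueueAttributes",
                ],
                "Resource": queue_arn,
            },
            {
                "Sid": "WriteLogs",
                "Effect": "Allow",
                "Action": ["logs:CreateLogStream", "logs:PutLogEvents"],
                # Log streams live beneath the function's log group.
                "Resource": f"{log_group_arn}:*",
            },
            {
                "Sid": "WriteTraces",
                "Effect": "Allow",
                "Action": ["xray:PutTraceSegments", "xray:PutTelemetryRecords"],
                "Resource": "*",
            },
        ],
    }
```

Serialising the result with `json.dumps` gives a document that can be attached through whichever Infrastructure-as-Code tool manages the role.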

Networking

  • Determine if Lambda needs to be deployed in a VPC. General rule is Lambdas should not be in VPCs, unless the resources exist in a VPC (e.g. EC2, RDS)
  • If the Lambda is deployed into a VPC, consider whether it needs to access any public internet resources. If so, ensure a NAT is configured for the VPC. Also ensure that the Lambda Execution Role allows creation of ENIs.

S3

  • Compliant S3 bucket names chosen - names should not contain periods ('.') to stay DNS-friendly
  • Consider enabling CloudTrail and access logging to detect S3 configuration changes
  • Consider enabling S3 buckets versioning if required in order to recover overwritten or deleted data.
  • Enable S3 Bucket Default Encryption to automatically encrypt all objects stored in S3
  • Enable blocking public access by default, unless otherwise needed
  • Server-Side Encryption (SSE). Ensure that S3 buckets are protected by encryption at rest.
  • A suitable storage class has been chosen and a Lifecycle policy created
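
The bucket naming item can be enforced with a small check. The sketch below is a hypothetical helper that applies the checklist's stricter rule (no periods) on top of the basic S3 naming rules: 3 to 63 characters, lowercase letters, digits and hyphens, starting and ending with a letter or digit. It does not cover every S3 restriction, such as reserved prefixes and suffixes.

```python
import re

# First and last characters must be alphanumeric; total length is 3 to 63.
_BUCKET_RE = re.compile(r"[a-z0-9][a-z0-9-]{1,61}[a-z0-9]")


def is_checklist_compliant_bucket_name(name):
    """Return True if the name is DNS-friendly and contains no periods."""
    return _BUCKET_RE.fullmatch(name) is not None
```

A check like this could run in CI against the bucket names declared in Infrastructure-as-Code templates.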

API Gateway

Access

  • Consider use-case for API Gateway - should only be for user interaction front-end or microservice API. Backend should use Queues, Triggers, AWS SDK etc and IAM roles to control access.
  • Use versioning between DEV and PROD versions of API
  • Consider how long cached responses are retained (cache TTL)
  • If the API is not public-facing, use a resource policy to restrict invocation to whitelisted IPs only
  • Mandate authentication via API Key
  • API is secured using HTTPS
  • Set up CORS headers in the Lambda handler response if using AWS resources
  • Consider quotas/throttling of API requests (default is 10,000 requests/second with a burst of 5,000 requests)
  • Consider creating SSL/TLS certificates with AWS Certificate Manager for public-facing APIs
  • Consider enabling AWS Config and CloudTrail to monitor configurations on API (to prevent misconfiguration security leaks)
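
For the IP restriction item above, a resource policy might be generated as in the sketch below. The function name is hypothetical and the CIDR ranges are supplied by the caller. The policy uses the common pattern of allowing invocation in general and then denying any request whose source IP falls outside the allowed ranges; private APIs reached through a VPC endpoint would need a different condition key.

```python
def ip_allowlist_policy(allowed_cidrs):
    """Build an API Gateway resource policy limiting invocation to given CIDRs."""
    resource = "execute-api:/*/*/*"  # every stage, method and path of this API
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": "*",
                "Action": "execute-api:Invoke",
                "Resource": resource,
            },
            {
                # An explicit deny overrides the allow for all other addresses.
                "Effect": "Deny",
                "Principal": "*",
                "Action": "execute-api:Invoke",
                "Resource": resource,
                "Condition": {"NotIpAddress": {"aws:SourceIp": list(allowed_cidrs)}},
            },
        ],
    }
```

Resource policy changes only take effect after the API is redeployed, so it is worth generating the policy in the same pipeline that deploys the stage.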

Logging

  • Establish CloudWatch alarms on metrics - IntegrationLatency, Latency, cache hits and misses, throttling and 4XX HTTP responses
  • Logging with CloudWatch or Kinesis Data Firehose