@neduma
Last active September 27, 2019 20:55
Case Study - SRE Practices

Moovweb - New Platform

  • Allow devs to deploy universal React apps (apps that support server-side rendering).
  • Cloud deploy, CLI driven
    • Make sure the right IAM policies (AuthN, AuthZ, RBAC) are in place to control who can deploy, change, and update configs.
    • Code review is a must; deploys are gated by approvals.
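
The access-control and gated-deploy points above can be sketched as a single check. This is a minimal illustration, not the platform's actual logic: the role names, the `ROLE_PERMISSIONS` table, and the `DeployRequest` type are all hypothetical, and a real setup would derive permissions from IAM policies rather than a hard-coded dict.

```python
from dataclasses import dataclass, field

# Hypothetical role-to-action mapping (assumption, not from the original notes).
ROLE_PERMISSIONS = {
    "viewer": set(),
    "developer": {"deploy"},
    "admin": {"deploy", "update_config"},
}


@dataclass
class DeployRequest:
    author: str
    role: str
    action: str
    approvers: set = field(default_factory=set)


def is_allowed(request, required_approvals=1):
    """Allow the action only if the role permits it and enough reviewers
    other than the author have approved."""
    if request.action not in ROLE_PERMISSIONS.get(request.role, set()):
        return False
    # Self-approval does not count toward the gate.
    independent = request.approvers - {request.author}
    return len(independent) >= required_approvals
```

Excluding the author from the approver set is a design choice: it keeps "code review is a must" from being satisfied by a self-approval.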

Clarification

  • Control Plane / Deployment stack: deploys content to the Delivery stack through a RoR API that wraps the AWS API, runs in Kubernetes, and is accessed via the CLI.
  • Delivery Plane: API Gateways and CDN networks.

Architecture

  • Internet => CloudFront => API Gateway => AWS Lambda
  • Deploy Platform: RoR on Kubernetes, RDS (MySQL, PostgreSQL) in AWS.

General Best Practices

  • Always have a plan in the form of runbooks/checklists for service onboarding, perf tests, and failures such as "site is down".
  • Have a rubric for incident management and an on-call cadence.

SRE Requirements

  • Maintain Uptime.

    • Make sure redundant systems and configuration are in place to support global, distributed, and redundant delivery.
    • Multi-AZ, multi-VPC, and multi-region architectures to address HA and DR needs.
    • Monitor and audit all running systems through metrics. Observe metrics over time to predict failures.
    • CloudFront settings
      • Min/max quota settings to handle peak and viral traffic in terms of bandwidth, edge/PoP configs, and RPS
      • Tune caching, bandwidth, and TTL settings to address HA
    • API Gateway settings
      • Add caching to the API to increase performance and reduce load on AWS Lambda or any backend systems
    • AWS Lambda settings
  • PCI Compliant (Others: HIPAA, GDPR)

    • Backups and Recovery
    • Security at rest
      • Is RDS data encryption in place?
    • Security in transit
      • PKI and CA-issued TLS certificates are in place with the right configs and rotation (integrated with cert manager). Make sure traffic is TLS protected end to end, with no gaps.
      • Make sure CloudFront, Lambda, and API Gateway config settings are up to date with PCI compliance requirements and recommendations (app, network, user security).
      • Make sure the right IAM policies are in place for security and access control.
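
For the certificate rotation point above, a rotation check can be as simple as comparing a certificate's expiry time against a threshold. The sketch below assumes the expiry timestamp has already been read from the certificate; the function names and the 30-day threshold are illustrative assumptions, not values from these notes.

```python
from datetime import datetime, timezone


def days_until_expiry(not_after, now=None):
    """Whole days remaining before the certificate expires (negative if expired)."""
    now = now or datetime.now(timezone.utc)
    return (not_after - now).days


def needs_rotation(not_after, now=None, threshold_days=30):
    """Flag certificates that expire within the threshold (30 days is an assumed default)."""
    return days_until_expiry(not_after, now) < threshold_days
```

A check like this could feed an alert so that rotation gaps are caught before they break end-to-end TLS.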

Possible Problems

  • Deployments Keep Failing
    • Find out why: check the Kubernetes, Docker, RoR, and RDS logs.
    • Monitor and measure deployments with smoke tests. Elaborate with integration, white-box, and black-box testing.
    • Monitor CloudWatch logs
    • Keep a "deployment failing" runbook ready and collect incidents
  • Site seems slow
    • Test the speed between hops: client to CDN, CDN to API Gateway, Gateway to Lambda
    • Verify that caching and routing logic are correct in CloudFront
  • Site is down
    • Verify from the client side and from every geo/region. Is it partial or total?
    • Communication is very important; update the status page
    • Runbook and incident management
      • Find exactly where it is failing and fix it, e.g. a bad config in the CDN, Lambda, or Gateway
      • Find out why redundant services are not kicking in (chaos engineering)
      • Bring the service back even if it is slow, then progressively restore full speed.
  • Conversion rate dropped 50% last week?
    • Often tied to slowness. What has impacted user experience lately?
    • Metrics and data
      • Monitoring data: traffic, saturation, latency, errors
      • Did the UI flow change?
  • Customers complain that they can't check out
    • Always verify first. Are the errors due to bad network settings on the client side? Deep dive from there.
    • Any errors in React rendering on the server side?
    • Session issues such as timeouts
    • Checkout means integration with payment gateway providers. Are the auth creds OK? Did we change the auth config? How do we manage change in terms of testing and communication?
  • Are we under attack?
    • API Gateway throttling to protect against DDoS attacks
    • AWS Shield protections in place for detection and automated mitigation
    • Watch CloudWatch logs
  • Search ranking dropped
    • SEO and access policies: can crawlers access keywords and pages?
    • Monitoring: what kind of SEO tracking alerts do we have?

Sample Troubleshooting - Runbook

  • Check that VPC, subnets, security groups, and network ACLs are configured properly.
  • Check that IAM policies are correct and up to date
  • Check headers, cache, and compression settings.
  • Check CloudFront bandwidth quotas and RPS upper limits.
  • Check CDN, Gateway, and Lambda access logs: cache hit/miss/error
  • Check CloudWatch logs
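
The access-log step above (cache hit/miss/error) boils down to counting result types. The sketch below assumes the result-type values have already been extracted from the CDN access logs into a list of strings, using labels such as `Hit`, `RefreshHit`, `Miss`, and `Error` as found in CloudFront's `x-edge-result-type` field; the function name and return shape are illustrative.

```python
from collections import Counter


def summarize_results(result_types):
    """Compute cache hit and error ratios from a list of result-type strings."""
    counts = Counter(result_types)
    total = sum(counts.values())
    if total == 0:
        return {"hit_ratio": 0.0, "error_ratio": 0.0, "total": 0}
    # Treat both fresh and revalidated hits as cache hits.
    hits = counts["Hit"] + counts["RefreshHit"]
    return {
        "hit_ratio": hits / total,
        "error_ratio": counts["Error"] / total,
        "total": total,
    }
```

A sudden drop in `hit_ratio` or a rise in `error_ratio` is a cue to look at the cache and header settings listed earlier in the runbook.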

Sample Incident Management Report

  • Elaborate on what constitutes an incident management report
  • Postmortem and aftereffects: use the Google SRE book template
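
One way to make the report structure concrete is a small record type with a completeness check. The section names below are loosely modeled on the postmortem example in the Google SRE book and should be treated as an assumption; the `IncidentReport` type and `missing_sections` helper are hypothetical.

```python
from dataclasses import dataclass, field, fields


@dataclass
class IncidentReport:
    title: str
    summary: str = ""
    impact: str = ""
    root_causes: str = ""
    trigger: str = ""
    detection: str = ""
    resolution: str = ""
    action_items: list = field(default_factory=list)
    lessons_learned: str = ""
    timeline: list = field(default_factory=list)


def missing_sections(report):
    """Names of sections that are still empty, in declaration order."""
    return [f.name for f in fields(report) if not getattr(report, f.name)]
```

A report would be considered ready for review only when `missing_sections` returns an empty list.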

References

  • AWS Well-Architected Framework; cloud solution architectures from GCP and Azure
  • Google SRE Books