neduma/Case-study-sre-best-practices.md

## Case-study-sre-best-practices.md

      
    Moovweb - New Platform


Allow developers to deploy universal React apps (apps that support server-side rendering).
Cloud deployment, CLI-driven.

Make sure developers have the right IAM policies (AuthN, AuthZ, RBAC) in place to control who can deploy, change, and update configs.
Code review is a must; deploys are gated with approvals.


Clarification


Control Plane / Deployment stack: deploys content to the Delivery stack through a RoR (Ruby on Rails) API that wraps the AWS API, runs in Kubernetes, and is accessed from the CLI.
Delivery Plane / API Gateways and CDN networks.

Architecture


Internet => CloudFront => API Gateway => AWS Lambda
Deploy platform: RoR on Kubernetes, with RDS (MySQL, PostgreSQL) in AWS.

General Best Practices


Always have a plan in the form of runbooks/checklists for service onboarding, performance tests, and failures such as the site being down.
Have a rubric for incident management and on-call cadence.

SRE Requirements


Maintain Uptime.

Make sure redundant systems are in place, configured to support global/distributed and redundant delivery.
Multi-AZ, multi-VPC, and multi-region architectures to address HA and DR needs.
Monitor and audit all running systems through metrics. Observe metrics over time to predict failures.
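
One simple way to use metrics over time to anticipate a failure is to fit a straight line to recent samples and estimate when it reaches a limit. The sketch below is a minimal illustration, assuming evenly meaningful (time, value) samples such as disk or connection-pool usage; the function name and the threshold are hypothetical, and a real setup would more likely rely on the alerting features of the monitoring system.

```python
def predict_crossing(samples, threshold):
    """Estimate the time at which a metric reaches `threshold`.

    samples: list of (time, value) pairs, e.g. (hour, percent_used).
    Returns the estimated time, or None if the metric is not trending up
    or there is not enough data to fit a line.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None  # all samples share one timestamp
    # Ordinary least-squares slope and intercept.
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    if slope <= 0:
        return None  # flat or decreasing: no crossing predicted
    intercept = mean_v - slope * mean_t
    return (threshold - intercept) / slope


# Hypothetical disk usage (percent) sampled once per hour.
usage = [(0, 50.0), (1, 60.0), (2, 70.0)]
eta = predict_crossing(usage, threshold=90.0)
```

A linear fit is the crudest possible forecast; it is shown only because it makes the idea of "observe, then predict" concrete.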
CloudFront Settings

Min/max quota settings to handle peak and viral traffic in terms of bandwidth, edge/PoP configs, and RPS.
Tweak cache, bandwidth, and TTL settings to address HA.


API Gateway Settings

Add caching to the API to increase performance and reduce load on AWS Lambda or any other backend system.
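
To illustrate why a cache in front of the backend reduces load, here is a minimal in-process TTL cache sketch. It is not how API Gateway implements caching; the class name, the `get_or_compute` method, and the 60-second TTL are assumptions made for the example.

```python
import time


class TTLCache:
    """Tiny time-to-live cache: serve a stored value until it expires."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock          # injectable so the cache can be tested
        self._store = {}            # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]         # cache hit: backend is not called
        value = compute()           # cache miss: call the backend once
        self._store[key] = (now + self.ttl, value)
        return value


# Usage: the backend function runs only on a miss or after expiry.
backend_calls = []

def slow_backend():
    backend_calls.append(1)
    return "rendered page"

cache = TTLCache(ttl_seconds=60)
first = cache.get_or_compute("/home", slow_backend)
second = cache.get_or_compute("/home", slow_backend)
```

The TTL is the trade-off knob: a longer TTL shields the backend more but serves staler responses.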


AWS Lambda Settings


PCI Compliant (Others: HIPAA, GDPR)

Backups and Recovery
Security At Rest.

Is RDS data encryption in place?


Security In Transit

PKI and CA TLS certificates are in place with the right configs and rotations (integrated with cert manager). Make sure traffic is TLS-protected end to end, with no gaps.
Make sure CloudFront, Lambda, and API Gateway config settings are up to date with PCI compliance requirements and recommendations (app, network, user security).
Make sure the right IAM policies are in place for security and access controls.
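
Certificate rotation is easy to verify from the outside. The sketch below uses only the Python standard library to read a server certificate's `notAfter` date and decide whether it is due for rotation. The 30-day window and the function names are assumptions for illustration; a cert manager would normally handle renewal itself, and this kind of check is just a safety net.

```python
import socket
import ssl
import time


def fetch_not_after(host, port=443, timeout=5.0):
    """Return the 'notAfter' string of the certificate served by host:port."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after, now_epoch=None):
    """Days between now and the certificate's notAfter timestamp."""
    if now_epoch is None:
        now_epoch = time.time()
    expires = ssl.cert_time_to_seconds(not_after)
    return (expires - now_epoch) / 86400


def needs_rotation(not_after, now_epoch=None, min_days=30):
    """True if the certificate expires within `min_days` (assumed policy)."""
    return days_until_expiry(not_after, now_epoch) < min_days
```

For example, `needs_rotation(fetch_not_after("example.com"))` would check a live endpoint; the hostname here is a placeholder.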


Possible Problems


Deployments Keep Failing

Find out why: check the Kubernetes, Docker, RoR, and RDS logs.
Monitor/measure deployments with smoke tests. Elaborate on integration, white-box, and black-box testing.
Monitor CloudWatch logs.
Keep a deployment-failure runbook ready and collect incidents.


Site seems slow

Test the speed between hops: client to CDN, CDN to API Gateway, Gateway to Lambda.
Check that the caching and routing logic in CloudFront is correct.
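
Once per-hop timings have been collected, finding the slow segment is a small aggregation. This is a minimal sketch that assumes the latency samples were measured elsewhere (for example with synthetic probes); the hop names and the numbers are made up for illustration.

```python
from statistics import median


def hop_medians(timings_ms):
    """Map each hop name to the median of its latency samples (ms)."""
    return {hop: median(samples) for hop, samples in timings_ms.items()}


def slowest_hop(timings_ms):
    """Return (hop_name, median_ms) for the hop with the highest median."""
    medians = hop_medians(timings_ms)
    hop = max(medians, key=medians.get)
    return hop, medians[hop]


# Hypothetical measurements for the three hops named above.
samples = {
    "client->cdn": [20, 22, 21],
    "cdn->gateway": [15, 14, 16],
    "gateway->lambda": [120, 300, 110],
}
worst = slowest_hop(samples)
```

The median is used rather than the mean so that a single outlier (such as one cold start) does not dominate the comparison.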


Site is down

Verify from the client side and from every geo/region. Is the outage partial or total?
Communication is very important; update the status page.
Runbook and Incident Management

Find exactly where it is failing and fix it, e.g. a bad config in the CDN, Lambda, or Gateway.
Why are redundant services not kicking in? (Chaos Engineering)
Bring the service back even if it is slow, then progressively ramp up to full speed.
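
The "bring it back slowly, then ramp up" idea can be written down as a traffic schedule. The sketch below is one possible shape, assuming traffic can be shifted by percentage (for example with weighted routing); the 5% starting point and the doubling factor are arbitrary illustrative choices.

```python
def ramp_schedule(start_pct=5, factor=2):
    """Return increasing traffic percentages ending at exactly 100."""
    if not 0 < start_pct <= 100:
        raise ValueError("start_pct must be in (0, 100]")
    if factor <= 1:
        raise ValueError("factor must be greater than 1")
    steps = []
    pct = start_pct
    while pct < 100:
        steps.append(pct)
        pct *= factor
    steps.append(100)  # always finish at full traffic
    return steps


schedule = ramp_schedule()  # [5, 10, 20, 40, 80, 100]
```

Each step would be held until error rate and latency look healthy before moving to the next one.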


Conversion rate dropped 50% last week?

Possibly tied to slowness. Which parts of the user experience were impacted lately?
Metrics and data

Monitoring data: traffic, saturation, latency, and error data.
Did the UI flow change?
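
The four kinds of monitoring data listed above can be computed from a window of request records. This is a minimal sketch under stated assumptions: each record is a hypothetical (latency_ms, http_status) pair, errors are taken to be 5xx responses, and saturation is approximated as in-flight requests divided by a capacity limit supplied by the caller.

```python
def golden_signals(requests, window_seconds, in_flight, capacity, pct=95):
    """Summarize traffic, errors, latency, and saturation for one window.

    requests: list of (latency_ms, http_status) tuples.
    pct: integer percentile for latency (nearest-rank method).
    """
    if not requests or window_seconds <= 0 or capacity <= 0:
        raise ValueError("need requests, a positive window, and capacity")
    latencies = sorted(latency for latency, _ in requests)
    n = len(latencies)
    # Nearest-rank percentile using integer ceiling division.
    rank = max(1, -(-(pct * n) // 100))
    errors = sum(1 for _, status in requests if status >= 500)
    return {
        "traffic_rps": n / window_seconds,
        "error_rate": errors / n,
        "latency_ms": latencies[rank - 1],
        "saturation": in_flight / capacity,
    }


# Hypothetical 5-second window: eight OK responses and two server errors.
window = [(10 * i, 200) for i in range(1, 9)] + [(90, 500), (100, 503)]
signals = golden_signals(window, window_seconds=5, in_flight=50, capacity=100)
```

Comparing these numbers for the week of the drop against the week before is a quick way to see whether the conversion change lines up with a latency or error change.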


Customers complain that they can't check out

Always verify first: are the errors due to bad network settings on the client side? Deep dive from there.
Any errors in React rendering on the server side?
Session issues such as timeouts.
Checkout means integration with payment gateway providers. Are the auth credentials OK? Did we make changes to the auth config? How do we manage change in terms of testing and communication?


Are we under attack?

API Gateway throttling to protect against DDoS attacks.
AWS Shield protections in place for detection and automated mitigation.
Watch CloudWatch logs.
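
API Gateway throttling is described by AWS in terms of a token bucket with a steady rate and a burst limit. The sketch below shows that general algorithm in a few lines so the two settings are easier to reason about. It is an illustration of the technique only, not AWS's implementation, and the class and parameter names are assumptions.

```python
import time


class TokenBucket:
    """Allow up to `burst` requests at once, refilled at `rate` per second."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = rate
        self.burst = burst
        self.clock = clock              # injectable for testing
        self.tokens = float(burst)      # start with a full bucket
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Refill based on elapsed time, capped at the burst size.
        elapsed = now - self.last
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True                 # request passes
        return False                    # request would be throttled


# Usage with a fake clock: 1 request/second steady rate, burst of 2.
fake_now = [0.0]
bucket = TokenBucket(rate=1, burst=2, clock=lambda: fake_now[0])
results = [bucket.allow(), bucket.allow(), bucket.allow()]
fake_now[0] = 1.0
after_refill = bucket.allow()
```

Throttling of this kind limits the damage of a request flood at the API layer; it complements, rather than replaces, network-level DDoS protection.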


Search ranking dropped

SEO and access policies: can crawlers access keywords and pages?
Monitoring: what kind of SEO tracking alerts do we have?


Sample Troubleshooting - Runbook


Check that VPCs, subnets, security groups, and network ACLs are configured properly.
Check that IAM policies are correct and up to date.
Check headers, cache, and compression settings.
Check CloudFront bandwidth quotas and RPS upper limits.
Check CDN, Gateway, and Lambda access logs for cache hit/miss/error.
Check CloudWatch logs.
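
The cache hit/miss/error check can be turned into a quick tally. The sketch below assumes the result-type values have already been extracted from CloudFront access logs (the `x-edge-result-type` field); log parsing is left out, and the way values are grouped into hit, miss, and error buckets is an assumption made for this example.

```python
HIT_TYPES = {"Hit", "RefreshHit"}
ERROR_TYPES = {"Error", "LimitExceeded", "CapacityExceeded"}


def summarize_cache_results(result_types):
    """Count hits, misses, errors, and other results; return (counts, hit_ratio)."""
    counts = {"hit": 0, "miss": 0, "error": 0, "other": 0}
    for result in result_types:
        if result in HIT_TYPES:
            counts["hit"] += 1
        elif result == "Miss":
            counts["miss"] += 1
        elif result in ERROR_TYPES:
            counts["error"] += 1
        else:
            counts["other"] += 1    # e.g. Redirect
    total = sum(counts.values())
    hit_ratio = counts["hit"] / total if total else 0.0
    return counts, hit_ratio


# Hypothetical sample of result types pulled from a log file.
sample = ["Hit", "Miss", "Hit", "Error", "RefreshHit", "Redirect"]
counts, hit_ratio = summarize_cache_results(sample)
```

A sudden drop in hit ratio, or a rise in the error bucket, narrows the search to the CDN layer before moving on to the Gateway and Lambda logs.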

Sample Incident Management Report


Elaborate on what constitutes an incident management report.
Postmortem and after-effects.
Use the Google SRE Book template.

References


AWS Well-Architected Framework; cloud solution architectures from GCP and Azure
Google SRE Books