- Allow devs to deploy universal React apps (apps that support server-side rendering).
- Cloud deploy, CLI-driven
- Make sure devs have the right IAM policies (AuthZ, AuthN, RBAC) in place to control who can deploy, change, and update configs.
- Code review is a must; gate deploys with approvals.
- Control Plane / Deployment stack, which deploys content to the Delivery stack through a RoR API that wraps the AWS API, runs in Kubernetes, and is accessed via the CLI.
- Delivery Plane / API Gateways and CDN networks.
- Internet => CloudFront => API Gateway => AWS Lambda
- Deploy the platform as RoR on Kubernetes, with RDS (MySQL, PostgreSQL) in AWS.
- Always have a plan in the form of runbooks/checklists for service onboarding, perf tests, and failures such as the site being down.
- Have a rubric for incident management and on-call cadence.
- Maintain Uptime
- Make sure redundant systems and configuration are in place to support global, distributed, and redundant delivery.
- Multi-AZ, multi-VPC, and multi-Region architectures to address HA and DR needs.
- Monitor and audit all running systems through metrics. Observe metrics over time to predict failures.
- CloudFront Settings
- Min/max quota settings to handle peak and viral traffic in terms of bandwidth, edge/PoP configs, and RPS
- Tweak cache, bandwidth, and TTL settings to address HA
- API Gateway Settings
- Add caching to the API to increase performance and reduce load on AWS Lambda or any backend systems
- AWS Lambda Settings
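As a rough illustration of how a cache TTL reduces backend load, here is a minimal in-memory sketch. It is not how CloudFront or API Gateway implement caching; `TTLCache`, `load`, and the injected `clock` are hypothetical names used only for this example.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Tiny in-memory cache; an entry expires ttl_seconds after it was stored."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock          # injectable so expiry can be tested without sleeping
        self.backend_calls = 0      # how many times the backend was actually hit
        self._store: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str, load: Callable[[str], Any]) -> Any:
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self.ttl_seconds:
            return entry[1]         # cache hit: the backend is not touched
        value = load(key)           # cache miss or expired entry: call the backend
        self.backend_calls += 1
        self._store[key] = (now, value)
        return value
```

A longer TTL means fewer backend calls but staler responses; the right value depends on how quickly the content changes.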
- PCI Compliant (Others: HIPAA, GDPR)
- Backups and Recovery
- Security at rest.
- Is RDS data encryption in place?
- Security in transit
- PKI and CA TLS certificates are in place with the right configs and rotations (integrated with cert manager). Make sure traffic is TLS-protected end to end, with no gaps.
- Make sure CloudFront, Lambda, and API Gateway config settings are up to date with PCI compliance requirements and recommendations (app, network, user security)
- Make sure you have the right IAM policies in place for security and access control.
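One way to keep an eye on certificate rotation is to alert when a certificate is close to expiry. The sketch below assumes the expiry timestamp is already available in the `notAfter` string format that Python's `ssl.SSLSocket.getpeercert()` returns; the function names and the 30-day threshold are illustrative choices, not a PCI requirement.

```python
import ssl

SECONDS_PER_DAY = 86400


def days_until_expiry(not_after: str, now_epoch: float) -> float:
    """Days left before a certificate expires.

    not_after uses the 'Jan  5 09:34:43 2018 GMT' format found in the
    'notAfter' field of ssl.SSLSocket.getpeercert().
    """
    expires_epoch = ssl.cert_time_to_seconds(not_after)
    return (expires_epoch - now_epoch) / SECONDS_PER_DAY


def needs_rotation(not_after: str, now_epoch: float,
                   threshold_days: float = 30) -> bool:
    """True when the certificate expires within threshold_days (assumed policy)."""
    return days_until_expiry(not_after, now_epoch) <= threshold_days
```

In practice `now_epoch` would be `time.time()`, and the check would run on a schedule for every endpoint in the request path.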
- Deployments Keep Failing
- Why? Check the logs in Kubernetes, Docker, RoR, and RDS.
- Monitor/measure deployments with smoke tests. Elaborate on integration, white-box, and black-box testing
- Monitor CloudWatch logs
- Keep a deployment-failure runbook ready and collect incidents
- Site seems slow
- Test the speed between hops: client to CDN, CDN to API Gateway, Gateway to Lambda
- Check that the caching and routing logic in CloudFront is correct
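The hop-by-hop check can be sketched as timing one probe per hop and reporting the slowest. The probes are assumed to be callables supplied by the caller (for example, an HTTP request issued from the right vantage point); `time_hops`, `slowest_hop`, and the hop labels are hypothetical.

```python
import time
from typing import Callable, Dict


def time_hops(probes: Dict[str, Callable[[], object]]) -> Dict[str, float]:
    """Run each probe once and return its wall-clock duration in seconds."""
    timings: Dict[str, float] = {}
    for hop, probe in probes.items():
        start = time.perf_counter()
        probe()
        timings[hop] = time.perf_counter() - start
    return timings


def slowest_hop(timings: Dict[str, float]) -> str:
    """Name of the hop with the largest measured duration."""
    return max(timings, key=timings.get)
```

A single sample per hop is noisy; repeating the probes and comparing medians gives a more reliable picture.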
- Site is down
- Verify from the client side and from every geo/region. Is it partial or total?
- Communication is very important; update the status page
- Runbook and Incident Management
- Find exactly where it is failing and fix it: bad config in the CDN, Lambda, or Gateway
- Why are the redundant services not kicking in? (Chaos Engineering)
- Bring the service back even if it is slow, then progressively restore full speed.
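The "partial or total?" question can be answered mechanically once per-region health checks exist. A minimal sketch, assuming the caller already has a pass/fail result per region (the region names and `classify_outage` are illustrative):

```python
from typing import Dict


def classify_outage(region_ok: Dict[str, bool]) -> str:
    """Return 'none', 'partial', or 'total' from per-region health results."""
    if not region_ok:
        raise ValueError("at least one region result is required")
    failed = [region for region, ok in region_ok.items() if not ok]
    if not failed:
        return "none"
    if len(failed) == len(region_ok):
        return "total"
    return "partial"
```

For example, `classify_outage({"us-east-1": True, "eu-west-1": False})` returns `"partial"`, which points toward a regional or edge-level fault rather than a global one.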
- Conversion rate dropped 50% last week?
- Tied to slowness. What user experience has been impacted lately?
- Metrics and data
- Monitoring data: traffic, saturation, latency, errors
- Has the UI flow changed?
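Three of the four signals listed above (traffic, latency, errors) can be derived from request records. The sketch below assumes each record is a `(latency_ms, status_code)` pair collected over a known window, treats 5xx responses as errors, and uses a nearest-rank p95; saturation needs resource metrics (CPU, concurrency, queue depth) and is not covered here.

```python
from typing import Dict, List, Tuple


def summarize_requests(requests: List[Tuple[float, int]],
                       window_seconds: float) -> Dict[str, float]:
    """Traffic (RPS), 5xx error rate, and p95 latency for one time window."""
    if not requests or window_seconds <= 0:
        raise ValueError("need at least one request and a positive window")
    n = len(requests)
    latencies = sorted(latency for latency, _ in requests)
    errors = sum(1 for _, status in requests if status >= 500)
    # Nearest-rank percentile: rank = ceil(0.95 * n), done in integer math.
    rank = max((95 * n + 99) // 100, 1)
    return {
        "traffic_rps": n / window_seconds,
        "error_rate": errors / n,
        "latency_p95_ms": latencies[rank - 1],
    }
```

Comparing these numbers for the week before and the week of the conversion drop is a quick way to see whether the drop lines up with a latency or error regression.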
- Customers complain that they can't check out
- Always verify whether the errors are due to bad network settings on the client side. Deep dive from there.
- Any errors in React rendering on the server side?
- Session issues such as timeouts
- Checkout means integration with payment gateway providers. Are the auth creds OK? Did we make changes to the auth config? How do we manage change in terms of testing and communication?
- Are we under attack?
- API Gateway throttling to protect against DDoS attacks
- AWS Shield protections in place for detection and automated mitigation.
- Watch CloudWatch logs
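Throttling with a steady rate and a burst allowance is commonly modeled as a token bucket. The following is a generic sketch of that technique, not API Gateway's internal implementation; `TokenBucket`, `rate`, `burst`, and the injected `clock` are names chosen for this example.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow up to `burst` requests at once, refilled at `rate` tokens per second."""

    def __init__(self, rate: float, burst: int,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate
        self.burst = burst
        self.clock = clock              # injectable so tests do not need to sleep
        self.tokens = float(burst)      # start with a full bucket
        self.last = clock()

    def allow(self) -> bool:
        """Consume one token if available; otherwise the request is throttled."""
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A throttled request would typically receive an HTTP 429 response so that well-behaved clients back off.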
- Search ranking dropped
- SEO and access policies: can crawlers access the keywords and pages?
- Monitoring: what kind of SEO tracking alerts do we have?
- Check that VPCs, subnets, security groups, and network ACLs are configured properly.
- Check that IAM policies are correct and up to date
- Check headers, cache, and compression settings.
- CloudFront bandwidth quotas, RPS upper limits.
- Check CDN, Gateway, Lambda Access Logs - cache hit/miss/error
- Check CloudWatch logs
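Cache hit/miss/error ratios from access logs can be tallied with a few lines. This sketch assumes the logs have already been parsed into a list of result-type strings such as CloudFront's `x-edge-result-type` values (`Hit`, `RefreshHit`, `Miss`, `Error`); the parsing step and the function name are left as assumptions.

```python
from collections import Counter
from typing import Dict, Iterable


def cache_ratios(result_types: Iterable[str]) -> Dict[str, float]:
    """Share of hits, misses, and errors among the logged requests."""
    counts = Counter(result_types)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no log entries to summarize")
    hits = counts["Hit"] + counts["RefreshHit"]   # both are served from the edge cache
    return {
        "hit_ratio": hits / total,
        "miss_ratio": counts["Miss"] / total,
        "error_ratio": counts["Error"] / total,
    }
```

A falling hit ratio after a deploy often points to changed cache keys, headers, or TTLs.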
- Elaborate on what constitutes an incident management report
- Postmortem and after-effects: use the Google SRE Book template
- AWS Well-Architected Framework; cloud solution architectures from GCP and Azure
- Google SRE Books