Internet Scale Services Checklist
A checklist for designing and developing internet scale services, inspired by James Hamilton's 2007 paper "On Desgining and Deploying Internet-Scale Services."
An update by Paul Johnston (email@example.com), for a Serverless Architecture scenario. This assumes something akin to AWS Lambda + API Gateway + DynamoDB (c. 2016) Function as a Service (FaaS) solution as the basis for deployment rather than a cloud-based virtual server approach which the original paper was based upon. The FaaS solution implies each function is separately scalable and the database is inherently partitioned (assuming designed/built well).
If you agree/disagree, please fork and share with me on twitter @pauldjohnston.
- Does the design expect failures to happen regularly and handle them gracefully?
- Have we kept things as simple as possible?
- Have we automated everything?
Overall Application Design & Development
- Can the service survive failure without human administrative interaction?
- Are failure paths frequently tested?
- Have we documented all conceivable component failure modes and combinations thereof?
- Does our design tolerate these failure modes? And if not, have we undertaken a risk assessment to determine the risk is acceptable?
Are we targeting commodity hardware? (That is, our design does not require special h/w)targetting of FaaS solution is different to hardware
Are we hosting all users on a single version of the software?Each function is essentially versioned separately, so this needs updating. All users will be on latest version of deployed solution. Not sure we could have multiple versions running.
- Can we support multi-tenancy without physical isolation?
- Have we implemented (and automated) a quick service health check?
Do our developers work in the full environment? (Requires single server deployment)Difficult to achieve in FaaS as functions are standalone, and full environments are difficult to replicate.
- Can we continue to operate in reduced capacity if services (components) you depend on fail?
- Does our design eliminate code redundancy across services/components?
Can our pods/clusters of services continue to operate independently of each other?essentially every function is a single service, so they should operate separately from each other anyway. Different?
- For rare emergency human intervention, have we worked with operations to come up with recovery plans,and documented, scripted, and tested them?
- Does each of our complexity adding optimizations (if any), give at least an order of magnitude improvement?
- Have we enforced admission control at all levels?
Can we partition the service, and is that partitioning infinitely adjustable and fine-grained?partitioning happens at a data level. Assume that DynamoDB or similar manages this. May need to identify something slightly different here around data partitioning design...
Have we understood the network design and reviewed it with networking specialists?FaaS has no network design issues (?)
Have we analysed throughput and latency and determined the most important metric for capacity planning?Assumption is that services used are high quality and therefore highly scalable and capacity is provided
- Are all of our operations utilities following the same code review, source code control, testing etc. as the rest of the code base?
Have we understood the load this service will put on any backend store / services? Have we measured and validated this load?Intent in using FaaS is that provider handles load for us. Hitting provider limits is possibly an issue.
- Is everything versioned? The goal is to run single-version software, but multiple versions will always exist during rollout and testing etc. Versions n and n+1 of all components need to peacefully co-exist.
Have we avoided single points of failure?Each function is a failure point. The issue for FaaS is that the functions need to be uncoupled from each other (if one fails, it doesn't kill the whole service). Same or different?
Automatic Management and Provisioning
- Are all of our operations restartable?
- Is all persistent state stored redundantly?
- Can we support geo-distribution / multiple data center deployments
- Have we automated
provisioning andinstallation? Provisioning is managed for us by provider in FaaS
- Are configuration and code delivered by development in a single unit?
- Is the unit created by development used all the way through the lifecycle (test and prod. deployment)?
- Is there an audit log mechanism to capture all changes made in production?
- Have we designed for roles rather than servers, with the ability to deploy each 'role' on as many or few servers as needed?
- Are we handling failures and correcting errors at the service level?
Have we eliminated any dependency on local storage for non-recoverable information?lambda has a local FS but in reality it would be unlikely to ever be used
- Is our deployment model as simple as it can possibly be? (Hard to beat file copy!)
Are we using a chaos monkey in production?not sure how we could do this without access to infrastructure
(How to handle dependencies on other services / components).
- Can we tolerate highly variable latency in service calls? Do we have timeout mechanisms in place and can we retry interactions after a timeout (idempotency)?
- Are all retries reported, and have we bounded the number of retries?
- Do we have circuit breakers in place to prevent cascading failures? Do they 'fail fast'?
- Are we depending upon shipping and proven components wherever possible?
- Have we implemented inter-service monitoring and alerting?
- Do the services we depend on have the same (or compatible) design points (e.g SLAs)?
- Can we continue operation (perhaps in a degraded mode) if a component or service we depend on fails?
Release Cycle and Testing
- Are we shipping often enough?
- Have we defined specific criteria around the intended user experience? Are we continuously monitoring it?
- Are we collecting the actual numbers rather than just summary reports? Raw data will always be needed for diagnosis.
- Have we minimized false-positives in the alerting system?
- Are we analyzing trends on key metrics?
Is the system health highly visible at all times?in a FaaS system, there isn't really a "whole system", so this would have to be aggregated across all functions
- Is the system continuously monitored?
Can we support version roll-back? Is this tested and proven?Can certainly do this for a function at a time.
- Do we support both forward and backward compatibility on every change?
Can we deploy on a single server to support dev and test?Complicated to do this when almost completely reliant on services.
Have we run stress tests?Not sure you can "stress test" something like a FaaS solution in the same way you could with a single server
- Do we have a process in place to catch performance and capacity degradations in new releases?
- Are we running tests using real data?
- Do we have (and run) system-level acceptance tests?
Do we have an environment that lets us test at scale, with the same data collection and mining techniques using in production?Complex to do without duplicating environment which is hard to do without a full environment deployment solution
Hardware Selection and Standardization
(I deviate from the Hamilton paper here, on the assumption that you'll use at least an IaaS layer).
Do we depend only on standard IaaS compute, storage, and network facilities?
Have we avoided dependencies on specific hardware features?
- Have we abstracted the network and naming? (For service discovery)
Operations and Capacity Planning
- Is there a devops team that takes shared responsibility for both developing and operating the service?
- Do we always do soft deletes so that we can recover accidentally deleted data?
- Are we tracking resource allocation for every service to understand the correlation between service metrics and underlying infrastructure requirements?
- Do we have a discipline of only making one change at a time?
- Is everything that might need to be configured or tuned in production able to be changed without a code change?
Auditing, Monitoring, and Alerting
- Are we tracking the alerts:trouble-ticket ratio (goal is near 1:1)?
- Are we tracking the number of systems health issues that don't have corresponding alerts? (goal is near zero)
- Have we instrumented every customer interaction that flows through the system? Are we reporting anomalies?
- Do we have sufficient data to understand the normal operating behaviour?
- Do we have automated testing that takes a customer view of the service?
- Do we have sufficient instrumentation to detect latency issues?
- Do we have performance counters for all operations? (at least latency and number ops/sec data)
- Is every operation audited?
- Do we have individual accounts for everyone who interacts with the system?
- Are we tracking all fault-tolerant mechanisms to expose failures they may be hiding?
- Do we have per-entity / entity-specific audit logs?
- Do we have sufficient assertions in the code base?
- Are we keeping historical performance and log data?
- Is logging configurable without needing to redeploy?
- Are we exposing suitable health information for monitoring?
- Is every error that we report actionable?
- Do our problem reports contain enough information to diagnose the problem?
Can we snapshot system state for debugging outside of production?Not sure how you would capture the full system state. As each Function is uncoupled, this could be rewritten to identify function state (which is relatively simple for a stateless function)
- Are we recording all significant system actions? Both commands sent by users, and what the system internally does.
Graceful Degradation and Admission Control
This is essentially built into the usage of a FaaS solution. Does it still have relevance?
- Do we have a 'big red switch' mechanism to keep vital processing online while shedding or delaying non-critical workload?
- Have we implemented admission control?
- Can we meter admission to slowly bring a system back up after a failure?
Customer and Press Communication Plan
- Do we have a communications plan in place for issues such as wide-scale system unavailability, data loss or corruption, security breaches, privacy violations etc..?
Customer Self-Provisioning and Self-Help
- Can customers self-provision and self-help?