IwoHerka/well_architected.md

## well_architected.md

      
Operational excellence


Perform operations as code: In the cloud, you can apply the same
engineering discipline that you use for application code to your entire
environment. You can define your entire workload (applications,
infrastructure, etc.) as code and update it with code. You can script your
operations procedures and automate their process by launching them in
response to events. By performing operations as code, you limit human error
and create consistent responses to events.
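
As a minimal sketch of scripting a procedure and launching it in response to an event, the Python below maps event types to runbook functions. The event names, the `RUNBOOKS` registry, and the handlers are all hypothetical and only illustrate the pattern; a real setup would typically wire such handlers to an event bus or scheduler.

```python
from typing import Callable, Dict

# Hypothetical registry mapping an event type to a scripted procedure.
RUNBOOKS: Dict[str, Callable[[dict], str]] = {}


def runbook(event_type: str):
    """Register a function as the scripted response to an event type."""
    def register(func: Callable[[dict], str]) -> Callable[[dict], str]:
        RUNBOOKS[event_type] = func
        return func
    return register


@runbook("disk_almost_full")
def rotate_logs(event: dict) -> str:
    # Placeholder action; a real procedure would call your own tooling here.
    return f"rotated logs on {event['host']}"


@runbook("instance_unhealthy")
def replace_instance(event: dict) -> str:
    # Placeholder action, same caveat as above.
    return f"replaced {event['host']}"


def handle(event: dict) -> str:
    """Run the same procedure every time the same event type arrives."""
    procedure = RUNBOOKS.get(event["type"])
    if procedure is None:
        return f"no runbook for {event['type']}; escalate to on-call"
    return procedure(event)
```

Because the response lives in version-controlled code rather than in someone's memory, every occurrence of an event gets the same treatment, and unknown events fall through to an explicit escalation path.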


Make frequent, small, reversible changes: Design workloads that are
scalable and loosely coupled to permit components to be updated regularly.
Automated deployment techniques together with smaller, incremental changes
reduce the blast radius and allow for faster reversal when failures occur.
This increases confidence to deliver beneficial changes to your workload
while maintaining quality and adapting quickly to changes in market
conditions.


Refine operations procedures frequently: As you evolve your workloads,
evolve your operations appropriately. As you use operations procedures, look
for opportunities to improve them. Hold regular reviews and validate that all
procedures are effective and that teams are familiar with them. Where gaps
are identified, update procedures accordingly. Communicate procedural updates
to all stakeholders and teams. Gamify your operations to share best practices
and educate teams.


Anticipate failure: Perform “pre-mortem” exercises to identify potential
sources of failure so that they can be removed or mitigated. Test your
failure scenarios and validate your understanding of their impact. Test your
response procedures to ensure they are effective and that teams are familiar
with their process. Set up regular game days to test workload and team
responses to simulated events.


Learn from all operational failures: Drive improvement through lessons
learned from all operational events and failures. Share what is learned
across teams and through the entire organization.


Use managed services: Reduce operational burden by using AWS managed
services where possible. Build operational procedures around interactions
with those services.


Implement observability for actionable insights: Gain a comprehensive
understanding of workload behavior, performance, reliability, cost, and
health. Establish key performance indicators (KPIs) and leverage
observability telemetry to make informed decisions and take prompt action
when business outcomes are at risk. Proactively improve performance,
reliability, and cost based on actionable observability data.


Security


Implement a strong identity foundation: Implement the principle of least
privilege and enforce separation of duties with appropriate authorization for
each interaction with your AWS resources. Centralize identity management, and
aim to eliminate reliance on long-term static credentials.


Maintain traceability: Monitor, alert, and audit actions and changes to
your environment in real time. Integrate log and metric collection with
systems to automatically investigate and take action.


Apply security at all layers: Apply a defense-in-depth approach with
multiple security controls. Apply them to all layers (for example, edge of
network, VPC, load balancing, every instance and compute service, operating
system, application, and code).


Automate security best practices: Automated software-based security
mechanisms improve your ability to securely scale more rapidly and
cost-effectively. Create secure architectures, including the implementation
of controls that are defined and managed as code in version-controlled
templates.


Protect data in transit and at rest: Classify your data into sensitivity
levels and use mechanisms such as encryption, tokenization, and access
control where appropriate.


Keep people away from data: Use mechanisms and tools to reduce or
eliminate the need for direct access or manual processing of data. This
reduces the risk of mishandling or modification and human error when handling
sensitive data.


Prepare for security events: Prepare for an incident by having incident
management and investigation policy and processes that align to your
organizational requirements. Run incident response simulations and use tools
with automation to increase your speed for detection, investigation, and
recovery.


Reliability


Automatically recover from failure: By monitoring a workload for key
performance indicators (KPIs), you can run automation when a threshold is
breached. These KPIs should be a measure of business value, not of the
technical aspects of the operation of the service. This allows for automatic
notification and tracking of failures, and for automated recovery processes
that work around or repair the failure. With more sophisticated automation,
it’s possible to anticipate and remediate failures before they occur.
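
The threshold-then-automate idea can be sketched as follows. This is an illustration under assumptions, not a specific AWS mechanism: the KPI is taken to be a business-level rate such as completed orders per minute, and `check_kpi`, `breaches_needed`, and the `remediate` callback are hypothetical names.

```python
from typing import Callable, List


def check_kpi(samples: List[float], threshold: float,
              remediate: Callable[[], None],
              breaches_needed: int = 3) -> bool:
    """Trigger `remediate` when the KPI stays below `threshold`.

    The last `breaches_needed` samples must all be below the threshold,
    which avoids reacting to a single noisy data point.
    Returns True when remediation was triggered.
    """
    recent = samples[-breaches_needed:]
    breached = (len(recent) == breaches_needed
                and all(sample < threshold for sample in recent))
    if breached:
        remediate()
    return breached


# Example: orders per minute drop below 50 for three samples in a row.
actions: List[str] = []
triggered = check_kpi([120, 40, 35, 30], threshold=50,
                      remediate=lambda: actions.append("restart workers"))
```

Requiring several consecutive breaches is one possible design choice; a production alarm would also need notification, tracking, and a guard against repeated remediation.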


Test recovery procedures: In an on-premises environment, testing is often
conducted to prove that the workload works in a particular scenario. Testing is
not typically used to validate recovery strategies. In the cloud, you can test
how your workload fails, and you can validate your recovery procedures. You can
use automation to simulate different failures or to recreate scenarios that led
to failures before. This approach exposes failure pathways that you can test
and fix before a real failure scenario occurs, thus reducing risk.


Scale horizontally to increase aggregate workload availability: Replace one
large resource with multiple small resources to reduce the impact of a single
failure on the overall workload. Distribute requests across multiple, smaller
resources to ensure that they don’t share a common point of failure.
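
A small calculation shows why many small resources limit the impact of one failure. It assumes equally sized resources with evenly distributed requests, and the function name is hypothetical.

```python
def capacity_after_one_failure(resource_count: int) -> float:
    """Fraction of total capacity left when exactly one resource fails."""
    if resource_count < 1:
        raise ValueError("resource_count must be at least 1")
    return (resource_count - 1) / resource_count


# One large resource: a single failure removes all capacity.
single = capacity_after_one_failure(1)   # 0.0
# Ten small resources: a single failure removes only a tenth.
fleet = capacity_after_one_failure(10)   # 0.9
```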


Stop guessing capacity: A common cause of failure in on-premises workloads
is resource saturation, when the demands placed on a workload exceed the
capacity of that workload (this is often the objective of denial of service
attacks). In the cloud, you can monitor demand and workload utilization, and
automate the addition or removal of resources to maintain the optimal level to
satisfy demand without over- or under-provisioning. There are still limits, but
some quotas can be controlled and others can be managed (see Manage Service
Quotas and Constraints).
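
One simple way to automate adding or removing resources is a proportional rule that scales the fleet so utilization moves toward a target, clamped to configured limits. The sketch below is a simplified illustration, not the algorithm any particular AWS service uses; `desired_capacity` and its parameters are hypothetical, and the `maximum` bound stands in for a quota.

```python
import math


def desired_capacity(current: int, utilization_pct: float, target_pct: float,
                     minimum: int, maximum: int) -> int:
    """Resource count that would bring utilization close to the target.

    Scales proportionally to the ratio of observed to target utilization,
    rounds up so demand is met, then clamps to the [minimum, maximum] range.
    """
    if target_pct <= 0:
        raise ValueError("target_pct must be positive")
    wanted = math.ceil(current * utilization_pct / target_pct)
    return max(minimum, min(maximum, wanted))


scale_out = desired_capacity(4, 90, 60, minimum=2, maximum=10)  # 6
scale_in = desired_capacity(4, 30, 60, minimum=2, maximum=10)   # 2
capped = desired_capacity(8, 95, 50, minimum=2, maximum=10)     # 10
```

The last call shows the limit in action: demand would call for 16 resources, but the configured maximum caps the result at 10.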


Manage change through automation: Changes to your infrastructure should be
made using automation. The changes that need to be managed include changes to
the automation, which then can be tracked and reviewed.


Performance efficiency


Democratize advanced technologies: Make advanced technology
implementation easier for your team by delegating complex tasks to your cloud
vendor. Rather than asking your IT team to learn about hosting and running a
new technology, consider consuming the technology as a service. For example,
NoSQL databases, media transcoding, and machine learning are all technologies
that require specialized expertise. In the cloud, these technologies become
services that your team can consume, allowing your team to focus on product
development rather than resource provisioning and management.


Go global in minutes: Deploying your workload in multiple AWS Regions
around the world allows you to provide lower latency and a better experience
for your customers at minimal cost.


Use serverless architectures: Serverless architectures remove the need
for you to run and maintain physical servers for traditional compute
activities. For example, serverless storage services can act as static
websites (removing the need for web servers) and event services can host
code. This removes the operational burden of managing physical servers, and
can lower transactional costs because managed services operate at cloud
scale.


Experiment more often: With virtual and automatable resources, you can
quickly carry out comparative testing using different types of instances,
storage, or configurations.


Consider mechanical sympathy: Use the technology approach that aligns
best with your goals. For example, consider data access patterns when you
select database or storage for your workload.


Cost optimization


Implement cloud financial management: To achieve financial success and
accelerate business value realization in the cloud, you must invest in Cloud
Financial Management. Your organization must dedicate the necessary time and
resources for building capability in this new domain of technology and usage
management. Similar to your Security or Operations capability, you need to
build capability through knowledge building, programs, resources, and
processes to help you become a cost-efficient organization.


Adopt a consumption model: Pay only for the computing resources you
consume, and increase or decrease usage depending on business requirements.
For example, development and test environments are typically only used for
eight hours a day during the work week. You can stop these resources when
they’re not in use for a potential cost savings of about 75% (40 hours versus 168
hours).
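
The arithmetic behind that figure can be checked directly. The sketch assumes cost is strictly proportional to running hours, which ignores charges that continue while resources are stopped (storage, for example); the function name is hypothetical.

```python
HOURS_PER_WEEK = 7 * 24  # 168 hours


def savings_fraction(hours_used_per_week: float) -> float:
    """Fraction of an always-on bill saved by running only when needed."""
    return 1 - hours_used_per_week / HOURS_PER_WEEK


working_hours = 8 * 5  # eight hours a day, five days a week
print(f"{savings_fraction(working_hours):.1%}")  # prints 76.2%
```

Running 40 of 168 hours means paying for about 24% of the week, so the saving is roughly three quarters of the always-on cost.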


Measure overall efficiency: Measure the business output of the workload
and the costs associated with delivery. Use this data to understand the gains
you make from increasing output, increasing functionality, and reducing cost.


Stop spending money on undifferentiated heavy lifting: AWS does the heavy
lifting of data center operations like racking, stacking, and powering
servers. It also removes the operational burden of managing operating systems
and applications with managed services. This allows you to focus on your
customers and business projects rather than on IT infrastructure.


Analyze and attribute expenditure: The cloud makes it easier to
accurately identify the cost and usage of workloads, which then allows
transparent attribution of IT costs to revenue streams and individual
workload owners. This helps measure return on investment (ROI) and gives
workload owners an opportunity to optimize their resources and reduce costs.


Sustainability


Understand your impact: Measure the impact of your cloud workload and
model the future impact of your workload. Include all sources of impact,
including impacts resulting from customer use of your products, and impacts
resulting from their eventual decommissioning and retirement. Compare the
productive output with the total impact of your cloud workloads by reviewing
the resources and emissions required per unit of work. Use this data to
establish key performance indicators (KPIs), evaluate ways to improve
productivity while reducing impact, and estimate the impact of proposed
changes over time.


Establish sustainability goals: For each cloud workload, establish
long-term sustainability goals such as reducing the compute and storage
resources required per transaction. Model the return on investment of
sustainability improvements for existing workloads, and give owners the
resources they need to invest in sustainability goals. Plan for growth, and
architect your workloads so that growth results in reduced impact intensity
measured against an appropriate unit, such as per user or per transaction.
Goals help you support the wider sustainability goals of your business or
organization, identify regressions, and prioritize areas of potential
improvement.


Maximize utilization: Right-size workloads and implement efficient design
to ensure high utilization and maximize the energy efficiency of the
underlying hardware. Two hosts running at 30% utilization are less efficient
than one host running at 60% due to baseline power consumption per host. At
the same time, eliminate or minimize idle resources, processing, and storage
to reduce the total energy required to power your workload.
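
The two-hosts-versus-one comparison follows from a simple linear power model in which each host draws a fixed baseline plus a load-proportional amount. The wattage figures below are invented for illustration and are not measurements; `host_power_watts` is a hypothetical helper.

```python
def host_power_watts(utilization_pct: float, idle_watts: float = 100.0,
                     peak_watts: float = 200.0) -> float:
    """Linear model: baseline idle draw plus a part proportional to load."""
    if not 0 <= utilization_pct <= 100:
        raise ValueError("utilization_pct must be between 0 and 100")
    return idle_watts + (peak_watts - idle_watts) * utilization_pct / 100


two_hosts_at_30 = 2 * host_power_watts(30)  # 2 * 130.0 = 260.0 W
one_host_at_60 = host_power_watts(60)       # 160.0 W
```

Both setups do the same amount of work, but the two-host setup pays the idle baseline twice, which is the whole difference of 100 W in this model.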


Anticipate and adopt new, more efficient hardware and software offerings:
Support the upstream improvements your partners and suppliers make to help
you reduce the impact of your cloud workloads. Continually monitor and
evaluate new, more efficient hardware and software offerings. Design for
flexibility to allow for the rapid adoption of new efficient technologies.


Use managed services: Sharing services across a broad customer base helps
maximize resource utilization, which reduces the amount of infrastructure
needed to support cloud workloads. For example, customers can share the
impact of common data center components like power and networking by
migrating workloads to the AWS Cloud and adopting managed services, such as
AWS Fargate for serverless containers, where AWS operates at scale and is
responsible for their efficient operation. Use managed services that can help
minimize your impact, such as automatically moving infrequently accessed data
to cold storage with Amazon S3 Lifecycle configurations or Amazon EC2 Auto
Scaling to adjust capacity to meet demand.


Reduce the downstream impact of your cloud workloads: Reduce the amount
of energy or resources required to use your services. Reduce or eliminate the
need for customers to upgrade their devices to use your services. Test using
device farms to understand expected impact and test with customers to
understand the actual impact from using your services.