johnazariah/Resilient architectures for the cloud.md

## Resilient architectures for the cloud.md

      
            
          
    Resilient Architectures for the cloud

Slide - Introduction

Ask these questions:

How many of you have deployed your applications for the cloud?
Have you ever had any downtime caused by something you didn't do?

Let's take a trip back in time. In fact, just 3 weeks.
Date is 19th November. Microsoft rolls out a performance update for Azure Storage. Something goes wrong! The Blob storage front end goes into a loop and stops taking new connections. It is not just Azure Storage that goes down, but everything that relies on it, like Websites and Virtual Machines. Microsoft had to roll back their changes and do some restarts to restore service. The outage took a few hours!
Let's go back a couple of years. A certificate that Microsoft uses for its services expires, and that causes a major outage!
You think the problem only exists in Microsoft's Azure cloud? No... Every cloud provider faces the same issue. Christmas Eve, and something goes wrong with AWS, causing a 20 hour downtime. These things can happen to you whether you are hosting your app in Azure, AWS, Google, Rackspace or even your own data center - and the reason could be anything: software bug, certificate expiration, human error, natural disaster, whatever...
So, what you really need to do is make sure your architecture is resilient and can live through these outages and disasters. That is the focus of this talk.
"How do you build resilient architectures for the cloud". My name is Mahesh Krishnan, and this is my co-presenter John Azariah. Both of us are Azure MVPs from Australia, and as consultants we focus on helping our customers build large apps on the cloud that are highly available.
So, before we get going, let's talk some basics:
Slide - Basics


First - What is the cloud made up of? Cheap commodity hardware...helps to keep the cost low.
Something else keeps the cost low - Multi-tenancy
All this could lead to failure, which you can overcome with redundancy - some comes out of the box, and some you have to opt into
SLAs
Four important things you need to consider:

Disaster recovery - What happens when things have gone terribly wrong? What is my strategy to get things up and running again?
High Availability - How do I make my application and data available all the time (or most of the time)
Responsiveness - How quickly does my application respond to requests
Performance - How quickly can it do something. You may respond very quickly, but that doesn't mean you have completed the task quickly as well


All these have a big impact on your cost and your budget

Slide - What do we get out of the box?


We can run multiple instances of a server and have them load balanced
We can do autoscaling - to cope with loads automatically
We get redundancy with storage - data is replicated 3 times
We can choose to have geo-redundancy
We can have replication with SQL Db (across DCs)
We have Traffic Manager that can route traffic across multiple DCs
Also have a whole slew of automated tools for backup, restore, etc. and the ability to script all of them

But, in spite of this, you need to get your architecture right - you need to architect your application for high resilience and focus on DR, High Availability, Responsiveness and Performance. Let's look at performance first, because that is what we care about least (at least in this talk)
Slide - Architecting for high performance


Split workloads
Run them in parallel
Get bigger boxes
Use caches
Efficient algorithms
...

(For a lot of applications, though, this is not as important as responsiveness and availability, so this talk isn't going to focus on it. We are going to focus on Responsiveness, High Availability, and what you do when things have completely failed and you need some kind of Disaster Recovery.)
Slide - Architectural considerations for DR, Availability and Responsiveness


Decompose applications based on workload
Establish a life cycle model
Establish availability model and plan
Identify Failure points, failure modes and failure effects
Apply resiliency patterns across your application

Slide - Decompose applications by workload

Why do we need to decompose?

Different workloads will have

different requirements
different loads
different implications with respect to performance, availability, etc

Examples


E Commerce website

Search and catalog
Checkout
User Profile


Corporate website (need better example)

Browse and search
Insert and edit records
Reporting capabilities
Admin capabilities


Sports website - like CricInfo

Historic stats and results
Live scores and commentary


Slide - Life Cycle Model


Defines the expected behaviour of the application when operational
Specifies when traffic is likely to be high and when it is likely to be low (Talk about examples - Cricket World Cup, Olympics, Black Friday sales, University websites, Share trading, etc.)

Slide - Establish Availability model and plan


Next step is to establish your availability model - basically what your SLAs are
You can now specify SLAs based on workload and life cycle period.
Getting to 100% availability could be cost prohibitive and highly complex - and none of the cloud providers will give you that.
Achieving multiple-nines availability for specific workloads and life cycle periods may be achievable
Need to understand SLAs for service dependencies

Number of calls
Frequency of calls
What is their SLA
How can you monitor them
etc...


Architect for Autonomy

Independence and Reduced dependency between the parts that make up the whole service
Resilient and easily fault recoverable
Easy to scale
Does not need manual intervention
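
The SLA arithmetic behind the availability model above can be sketched in a few lines. This is only an illustration: the SLA figures are made-up placeholders, the function names are hypothetical, and the redundancy formula assumes the copies fail independently, which real deployments only approximate.

```python
from math import prod

MINUTES_PER_YEAR = 365 * 24 * 60


def serial_availability(slas):
    """Availability when every dependency must be up for a request to succeed."""
    return prod(slas)


def redundant_availability(sla, copies):
    """Availability of independent redundant copies where any one copy suffices."""
    return 1 - (1 - sla) ** copies


def downtime_minutes_per_year(availability):
    """Expected minutes of downtime per year at the given availability."""
    return (1 - availability) * MINUTES_PER_YEAR


if __name__ == "__main__":
    # Placeholder SLAs for a web tier, a database and a storage account.
    composite = serial_availability([0.9995, 0.999, 0.999])
    print(f"composite availability: {composite:.4%}")
    print(f"expected downtime: {downtime_minutes_per_year(composite):.0f} min/year")
    print(f"same stack in two regions: {redundant_availability(composite, 2):.6%}")
```

The point to take away is that dependencies in series multiply, so the composite SLA is always lower than the weakest dependency, while redundancy pulls it back up.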


Slide - Identify Failure Points, Failure Modes, and Failure Effects


Identify where failure can occur. Examples - DB connections, Config files, Storage, etc, etc
Identify the failure modes - i.e. the nature of the failure. Examples - Significant traffic exceeding resource capacity, Missing config file, DB reaching maximum capacity, etc.
Identify the effects these failures can have

Identifying them will help in figuring out the failures and what action you can take to offset them.
Slide - Resiliency Patterns


Asynchrony

Use Queue Centric Workflows
Provides for autonomy
Allows scaling of tiers
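
A minimal sketch of a queue-centric workflow, assuming an in-process `queue.Queue` as a stand-in for a durable cloud queue. The names `run_pipeline`, `accept` and `worker` are hypothetical. The front end only enqueues and returns, and the number of workers can be changed without touching the front end.

```python
import queue
import threading


def run_pipeline(order_ids, worker_count=2):
    """Front end enqueues work; independent workers drain the queue."""
    orders = queue.Queue()      # stands in for a durable cloud queue
    processed = []
    lock = threading.Lock()

    def accept(order_id):
        # Web tier: record the request and return straight away.
        orders.put(order_id)
        return "accepted"

    def worker():
        # Worker tier: processes messages at its own pace.
        while True:
            order_id = orders.get()
            if order_id is None:        # sentinel tells this worker to stop
                break
            with lock:
                processed.append(order_id)

    workers = [threading.Thread(target=worker) for _ in range(worker_count)]
    for w in workers:
        w.start()
    for order_id in order_ids:
        accept(order_id)
    for _ in workers:
        orders.put(None)                # one sentinel per worker
    for w in workers:
        w.join()
    return sorted(processed)


if __name__ == "__main__":
    print(run_pipeline([3, 1, 2], worker_count=2))
```

With a real queue service the two tiers would be separate processes, so either side can fail or scale without taking the other down.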


Time-outs

Look for acceptable timeframe to connect to a service or perform an activity
If they cannot be achieved in the time frame (in spite of retries), take appropriate actions
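
One way to put a time-out around a call, sketched with the standard library. `call_with_timeout` and `slow_service` are hypothetical names, and the fallback value stands in for whatever "appropriate action" fits the workload.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def call_with_timeout(func, timeout_s, fallback):
    """Run func; if it has not finished within timeout_s, return fallback."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(func)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        return fallback
    finally:
        # Do not block on the slow call; its thread finishes in the background.
        pool.shutdown(wait=False)


def slow_service():
    time.sleep(0.5)  # simulates a dependency that is too slow
    return "fresh data"


if __name__ == "__main__":
    print(call_with_timeout(slow_service, timeout_s=0.05, fallback="cached data"))
```

Note that the slow call is abandoned, not cancelled. For network calls it is usually better to also pass a time-out to the client library itself so the connection is released.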


Handle transient faults

Embrace failure
For fleeting faults, retries will work
For more permanent faults, consider graceful degradation

Return cached data, approximated values
Turn off some services that are not available
Rather than throw errors, do something else - for ex. show default image, if usual image is not available
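
The default-image idea can be sketched as a small fallback chain: live value, then cached value, then a default. The function name, the cache shape and the placeholder path are all assumptions made for illustration.

```python
DEFAULT_IMAGE = "/images/placeholder.png"  # hypothetical default asset


def product_image(product_id, fetch, cache):
    """Return the live image URL, else the last cached one, else a default."""
    try:
        url = fetch(product_id)
        cache[product_id] = url       # remember the last good answer
        return url
    except ConnectionError:
        # Degrade instead of surfacing the error to the user.
        return cache.get(product_id, DEFAULT_IMAGE)


if __name__ == "__main__":
    def image_service_down(product_id):
        raise ConnectionError("image service unavailable")

    print(product_image(42, image_service_down, cache={}))
```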


Considerations for transient error handling

Retry logic
Exponential back-off
Considering Idempotency
Compensating behaviour
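
A minimal sketch of retry logic with exponential back-off. `TransientError` and `retry` are hypothetical names, and the random jitter is an addition that is not in the notes above (it is a common way to stop many clients retrying in lockstep). The operation has to be idempotent, because it may run more than once.

```python
import random
import time


class TransientError(Exception):
    """Stands in for a fault that is expected to clear up on its own."""


def retry(operation, attempts=4, base_delay=0.1, max_delay=2.0, sleep=time.sleep):
    """Call operation, retrying transient failures with exponential back-off."""
    for attempt in range(attempts):
        try:
            return operation()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of retries: let the caller degrade or compensate
            # Delay ceiling doubles each attempt: 0.1s, 0.2s, 0.4s, ... up to max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))  # jitter (assumption, see lead-in)


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky():
        state["calls"] += 1
        if state["calls"] < 3:
            raise TransientError("try again")
        return "done"

    print(retry(flaky), "after", state["calls"], "calls")
```

When the final attempt still fails, the exception propagates so the caller can apply compensating behaviour or fall back to a degraded response.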


Circuit Breaker pattern

A switch that trips and interrupts the flow of current if it exceeds a preset limit
It is a safety precaution and can be turned back on when the problem is gone
Has 3 states:

Closed: Normal state, flow of control is as per usual
Open: Something has gone wrong; flow of control through mitigation path(s)

Once tripped, a timer is started; when it finishes, the switch moves to the Half open state
The mitigation path is either a fail-fast route or a fallback route


Half open

Limited number of requests routed to normal route to see if it is working
If usual route is back to normal, switch moves back to Closed state, else switches back to Open state
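
The three states described above can be sketched as a small class. This is a simplified illustration, not a production implementation: the class and parameter names are hypothetical, it is not thread-safe, and the Half open state lets through a single trial request where the notes say a limited number.

```python
import time


class CircuitBreaker:
    CLOSED, OPEN, HALF_OPEN = "closed", "open", "half-open"

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.state = self.CLOSED
        self.failures = 0
        self.opened_at = None

    def call(self, operation, fallback):
        if self.state == self.OPEN:
            if self.clock() - self.opened_at < self.reset_timeout:
                return fallback()          # fail fast / mitigation path
            self.state = self.HALF_OPEN    # timer finished: allow a trial request
        try:
            result = operation()
        except Exception:
            self._record_failure()
            return fallback()
        self.state = self.CLOSED           # success closes the circuit
        self.failures = 0
        return result

    def _record_failure(self):
        self.failures += 1
        # A failed trial request, or too many failures, trips the breaker.
        if self.state == self.HALF_OPEN or self.failures >= self.failure_threshold:
            self.state = self.OPEN
            self.opened_at = self.clock()


if __name__ == "__main__":
    def broken():
        raise RuntimeError("service down")

    breaker = CircuitBreaker(failure_threshold=2, reset_timeout=5.0)
    for _ in range(3):
        print(breaker.call(broken, fallback=lambda: "cached"), breaker.state)
```

While the breaker is Open the failing service is not called at all, which gives it room to recover and keeps callers from waiting on time-outs.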


Automate all things

People make mistakes - removing human intervention as much as possible is the way to go
Dev Ops and scripting FTW!

Automate deployment
Automate data archiving and purging
Automate your test harness


Identify Redundancy Strategies

Redundancy of deployed component - storage, compute instances, etc
Redundancy across multiple DCs
Redundancy across providers
On-premise redundancy
Redundancy configurations:

Active/Active
Active/Passive
N+1
N+M
N to 1
N to N


Traffic Management

Traffic could be geo-distributed or routed to certain DCs for business continuity scenarios
WARNING: Traffic Manager introduces a single point of failure
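
Routing for business continuity can be pictured as priority-based failover: send traffic to the highest-priority endpoint that passes a health probe. The sketch below is only a model of that decision, not how Traffic Manager is implemented. The endpoint URLs are made up and `pick_endpoint` is a hypothetical name.

```python
def pick_endpoint(endpoints, is_healthy):
    """Return the highest-priority healthy endpoint URL, or None if all are down."""
    for endpoint in sorted(endpoints, key=lambda e: e["priority"]):
        if is_healthy(endpoint["url"]):
            return endpoint["url"]
    return None


if __name__ == "__main__":
    # Hypothetical deployments in two data centres; lower number = preferred.
    endpoints = [
        {"url": "https://app-secondary.example.com", "priority": 2},
        {"url": "https://app-primary.example.com", "priority": 1},
    ]
    down = {"https://app-primary.example.com"}
    print(pick_endpoint(endpoints, is_healthy=lambda url: url not in down))
```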


Data Partitioning Strategy
*
Networking
*
*
Caching

Distributed caching
Device caching
CDNs


Backup and Restore
*
Design for Ops

Establish a health model
Telemetry


'Nuff talk. Let's look at some demos
DEMO - Failover to different DC


Deploy website to 2 different regions
Replicate DB with primary and secondary
One website has red background, other has green
Use traffic manager to route traffic to just one site
Bring down the website
Show graceful degradation - i.e. only read-only data is visible.
Change secondary DB to be the primary
View the site to show that write access is also now available

(Show a diagram of this architecture, and talk about what we have implemented here)
DEMO 2 - Failover to different provider

Would be a killer demo if we can pull this off!
(Maybe use this set of slides at the end while we summarise things?)
Slide - Architecting for high availability


Infrastructure point of view:

Run multiple instances
Eliminate all single points of failure
Automate your deployments
Autoscaling
Multi DC deployment


Software point of view:

Run things asynchronously, and don't block
Use patterns like Queue Centric Workflows, Command Query Responsibility Segregation
Use retry logic
Design your apps to be stateless
Plan for graceful degradation
Consider throttling
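
Throttling can be done in many ways; a token bucket is one common choice and is used here purely as an example (the notes do not name a specific algorithm). The class name and parameters are hypothetical, and the sketch is not thread-safe.

```python
import time


class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self):
        now = self.clock()
        # Top up for the time elapsed since the last call, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False   # caller should reject or delay the request


if __name__ == "__main__":
    bucket = TokenBucket(rate=1, capacity=2)
    print([bucket.allow() for _ in range(3)])
```

Rejected requests would typically get a "try again later" response so that an overloaded tier sheds load instead of falling over.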


Slide - Architecting for Disaster Recovery


Backup and Restore strategy
Offsite backup (Think alternate subscriptions, backing up to other regions, providers, etc)
Multi DC deployments

Slide - Architecting for responsiveness


Asynchrony
Patterns like QCW, CQRS, etc
Some changes required to Business processes