Resilient Architectures for the cloud

Slide - Introduction

Ask these questions:

  • How many of you have deployed your applications for the cloud?
  • Have you ever had any downtime caused by something you didn't do?

Let's take a trip back in time. In fact, just 3 weeks. The date is 19th November. Microsoft rolls out a performance update for Azure Storage. Something goes wrong! The Blob storage front end goes into a loop and stops taking new connections. It is not just Azure Storage that goes down, but everything that relies on it, like Websites and Virtual Machines. Microsoft had to roll back the change and do some restarts to restore service. The outage took a few hours!

Let's go back a couple of years. A certificate that Microsoft uses for its services expires, and that causes a major outage!

You think the problem only exists in Microsoft's Azure cloud? No... Every cloud provider faces the same issue. Christmas Eve, and something goes wrong with AWS, causing a 20 hour downtime. These things can happen to you whether you are hosting your app in Azure, AWS, Google, Rackspace or even your own data center - and the reason could be anything: software bug, certificate expiration, human error, natural disaster, whatever...

So, what you really need to do is make sure your architecture is resilient and can live through these outages and disasters. That is the focus of this talk: "How do you build resilient architectures for the cloud?" My name is Mahesh Krishnan, and this is my co-presenter John Azariah. Both of us are Azure MVPs from Australia, and as consultants we focus on helping our customers build large, highly available apps on the cloud.

So, before we get going, let's talk some basics:

Slide - Basics

  1. First - What is the cloud made up of? Cheap commodity hardware...helps to keep the cost low.
  2. Something else keeps the cost low - Multi-tenancy
  3. All this could lead to failure and you can overcome this by redundancy - Some comes out of the box, and some you have to opt into
  4. SLAs
  5. Four important things you need to consider:
    • Disaster recovery - What happens when things have gone terribly wrong? What is my strategy to get things up and running again?
    • High Availability - How do I make my application and data available all the time (or most of the time)?
    • Responsiveness - How quickly does my application respond to requests?
    • Performance - How quickly can it do something? You may respond very quickly, but that doesn't mean you have completed the task quickly as well
  6. All these have a big impact on your cost and your budget

Slide - What do we get out of the box?

  • We can run multiple instances of a server and have them load balanced
  • We can do autoscaling - to cope with loads automatically
  • We get redundancy with storage. 3 times redundant
  • We can choose to have geo-redundancy
  • We can have replication with SQL Db (across DCs)
  • We have Traffic Manager that can route traffic across multiple DCs
  • Also have a whole slew of automated tools for backup, restore, etc. and the ability to script all of them

But, in spite of this, you need to get your architecture right - you need to architect your application for high resilience, and focus on DR, High Availability, Responsiveness and Performance. Let's look at performance first, because that is the one we care about least (at least in this talk)

Slide - Architecting for high performance

  • Split workloads
  • Run them in parallel
  • Get bigger boxes
  • Use caches
  • Efficient algorithms
  • ...

(But for a lot of applications, this is not as important as responsiveness and availability, and this talk isn't going to focus on it. We are going to focus on Responsiveness, High Availability and what you do when things have completely failed and you need to do some kind of Disaster Recovery.)

Slide - Architectural considerations for DR, Availability and Responsiveness

  • Decompose applications based on workload
  • Establish a life cycle model
  • Establish availability model and plan
  • Identify Failure points, failure modes and failure effects
  • Apply resiliency patterns across your application

Slide - Decompose applications by workload

Why do we need to decompose?

Different workloads will have

  • different requirements
  • different loads
  • different implications with respect to performance, availability, etc.

Examples

  • E Commerce website
    • Search and catalog
    • Checkout
    • User Profile
  • Corporate website (need better example)
    • Browse and search
    • Insert and edit records
    • Reporting capabilities
    • Admin capabilities
  • Sports website - like CricInfo
    • Historic stats and results
    • Live scores and commentary

Slide - Life Cycle Model

  • Defines the expected behaviour of the application when operational
  • Specifies when traffic is likely to be high and when it is likely to be low (Talk about examples - Cricket World Cup, Olympics, Black Friday sales, University websites, Share trading, etc.)

Slide - Establish Availability model and plan

  • Next step is to establish your availability model - basically what your SLAs are
  • You can now specify SLAs based on workload and life cycle period.
  • Getting to 100% availability could be cost prohibitive and highly complex - and none of the cloud providers will give you that.
  • Multiple-nines availability for specific workloads and life cycle periods may be achievable
  • Need to understand SLAs for service dependencies
    • Number of calls
    • Frequency of calls
    • What is their SLA
    • How can you monitor them
    • etc...
  • Architect for Autonomy
    • Independence and Reduced dependency between the parts that make up the whole service
    • Resilient and easily fault recoverable
    • Easy to scale
    • Does not need manual intervention

Slide - Identify Failure Points, Failure Modes, and Failure Effects

  • Identify where failure can occur. Examples - DB connections, Config files, Storage, etc, etc
  • Identify the failure modes - i.e. the nature of the failure. Examples - Significant traffic exceeding resource capacity, Missing config file, DB reaching maximum capacity, etc.
  • Identify the effects these failures can have

Identifying these helps you anticipate failures and decide what action you can take to offset them.

Slide - Resiliency Patterns

  • Asynchrony
    • Use Queue Centric Workflows
    • Provides for autonomy
    • Allows scaling of tiers
  • Time-outs
    • Look for acceptable timeframe to connect to a service or perform an activity
    • If they cannot be achieved in the time frame (in spite of retries), take appropriate actions
  • Handle transient faults
    • Embrace failure
    • For fleeting faults, retries will work
    • For more permanent faults, consider graceful degradation
      • Return cached data, approximated values
      • Turn off some services that are not available
      • Rather than throw errors, do something else - for example, show a default image if the usual image is not available
    • Considerations for transient error handling
      • Retry logic
      • Exponential back-off
      • Considering Idempotency
      • Compensating behaviour
  • Circuit Breaker pattern
    • A switch that trips and interrupts the flow of current if it exceeds a preset limit
    • It is a safety precaution and can be turned back on when the problem is gone
    • Has 3 states:
      • Closed: Normal state, flow of control is as per usual
      • Open: Something has gone wrong; flow of control through mitigation path(s)
        • Once tripped, a timer will be started, that will move the switch to a Half open state after the timer finishes
        • The mitigation path will either be a fail-fast route or a fallback route
      • Half open
        • Limited number of requests routed to normal route to see if it is working
        • If usual route is back to normal, switch moves back to Closed state, else switches back to Open state
  • Automate all things
    • People make mistakes - removing human intervention as much as possible is the way to go
    • Dev Ops and scripting FTW!
      • Automate deployment
      • Automate data archiving and purging
      • Automate your test harness
  • Identify Redundancy Strategies
    • Redundancy of deployed component - storage, compute instances, etc
    • Redundancy across multiple DCs
    • Redundancy across providers
    • On-premise redundancy
    • Redundancy configurations:
      • Active/Active
      • Active/Passive
      • N+1
      • N+M
      • N to 1
      • N to N
  • Traffic Management
    • Traffic could be geo-distributed or routed to certain DCs for business continuity scenarios
    • WARNING: Traffic Manager introduces a single point of failure
  • Data Partitioning Strategy *
  • Networking * *
  • Caching
    • Distributed caching
    • Device caching
    • CDNs
  • Backup and Restore *
  • Design for Ops
    • Establish a health model
    • Telemetry
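The three states of the Circuit Breaker pattern listed above can be sketched as a small Python class. This is a simplified illustration and not a production implementation: the class and parameter names (`CircuitBreaker`, `failure_threshold`, `reset_timeout`) are hypothetical, the half-open state lets a single trial request through rather than a configurable number, the open state only takes the fail-fast route, and there is no thread safety.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call fails fast."""


class CircuitBreaker:
    CLOSED, OPEN, HALF_OPEN = "closed", "open", "half-open"

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # failures before tripping
        self.reset_timeout = reset_timeout          # seconds to stay open
        self.clock = clock                          # injectable for testing
        self.state = self.CLOSED
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.state == self.OPEN:
            if self.clock() - self.opened_at >= self.reset_timeout:
                # Timer finished: let one trial request through.
                self.state = self.HALF_OPEN
            else:
                raise CircuitOpenError("circuit is open; failing fast")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self.failures += 1
        # A failed trial request, or too many failures, (re)opens the circuit.
        if self.state == self.HALF_OPEN or self.failures >= self.failure_threshold:
            self.state = self.OPEN
            self.opened_at = self.clock()

    def _on_success(self):
        # The usual route works again: back to normal.
        self.failures = 0
        self.state = self.CLOSED
```

A caller would wrap each call to a dependency, for example `breaker.call(fetch_profile, user_id)` where `fetch_profile` is whatever function talks to the remote service, and catch `CircuitOpenError` to serve cached or default data instead.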

'Nuff talk. Let's look at some demos

DEMO - Failover to different DC

  • Deploy website to 2 different regions
  • Replicate DB with primary and secondary
  • One website has red background, other has green
  • Use traffic manager to route traffic to just one site
  • Bring down the website
  • Show graceful degradation - i.e. only read-only data is visible.
  • Change secondary DB to be the primary
  • View the site to show that write access is also now available

(Show a diagram of this architecture, and talk about what we have implemented here)
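The behaviour the demo is meant to show (read-only mode while the primary is down, then write access again after the secondary is promoted) can be modelled with a toy in-memory store. This is only a sketch of the idea: `ReplicatedStore` and its methods are hypothetical names, replication is assumed to be instantaneous, and no real database or Traffic Manager API is involved.

```python
class ReadOnlyError(Exception):
    """Raised when a write is attempted while only the secondary is reachable."""


class ReplicatedStore:
    def __init__(self):
        self.primary = {}
        self.secondary = {}
        self.primary_up = True

    def write(self, key, value):
        if not self.primary_up:
            # Graceful degradation: refuse writes instead of failing the whole site.
            raise ReadOnlyError("primary is down; site is read-only")
        self.primary[key] = value
        self.secondary[key] = value  # assumption: replication is instantaneous

    def read(self, key):
        # Reads keep working from the secondary while the primary is down.
        source = self.primary if self.primary_up else self.secondary
        return source.get(key)

    def promote_secondary(self):
        # The secondary becomes the new primary; a fresh copy acts as the new secondary.
        self.primary = self.secondary
        self.secondary = dict(self.primary)
        self.primary_up = True
```

In the demo itself, the equivalent of `promote_secondary()` is the step where the secondary DB is changed to be the primary.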

DEMO 2 - Failover to different provider

Would be a killer demo if we can pull this off!

(Maybe use these set of slides at the end while we summarise things?)

Slide - Architecting for high availability

  • Infrastructure point of view:

    • Run multiple instances
    • Eliminate all single points of failure
    • Automate your deployments
    • Autoscaling
    • Multi DC deployment
  • Software point of view:

    • Run things asynchronously, and don't block
    • Use patterns like Queue Centric Workflows, Command Query Responsibility Segregation
    • Use retry logic
    • Design your apps to be stateless
    • Plan for graceful degradation
    • Consider throttling
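As an illustration of the "Use retry logic" point, here is a minimal retry helper with exponential back-off in Python. The names (`TransientError`, `retry_with_backoff`) and the default delays are assumptions made for the example. The random jitter is an extra that the notes do not mention; it is there so that many clients do not retry at the same instant. Only operations that are idempotent should be retried this way.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for faults worth retrying (e.g. a timeout)."""


def retry_with_backoff(operation, max_attempts=5, base_delay=0.5, max_delay=30.0,
                       sleep=time.sleep, rng=random.random):
    """Call `operation` until it succeeds or `max_attempts` is reached."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                # Out of retries: let the caller degrade gracefully.
                raise
            # Delay doubles on each attempt: base, 2*base, 4*base, ... capped.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Jitter: wait between 50% and 100% of the computed delay.
            sleep(delay * (0.5 + rng() / 2))
```

The `sleep` and `rng` parameters exist only so the helper can be exercised without real waiting; in normal use they are left at their defaults.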

Slide - Architecting for Disaster Recovery

  • Backup and Restore strategy
  • Offsite backup (Think alternate subscriptions, backing up to other regions, providers, etc)
  • Multi DC deployments

Slide - Architecting for responsiveness

  • Asynchrony
  • Patterns like QCW, CQRS, etc
  • Some changes required to Business processes
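A queue-centric workflow can be sketched with Python's standard library. In this illustration an in-process `queue.Queue` stands in for a durable cloud queue, and `submit_order` / `start_worker` are hypothetical names. The front end only enqueues the work and answers straight away, while a separate worker does the slow part.

```python
import queue
import threading


def start_worker(jobs, results):
    """Back-end tier: drains the queue independently of the front end."""
    def run():
        while True:
            order = jobs.get()
            if order is None:  # sentinel value: shut the worker down
                jobs.task_done()
                break
            results.append(f"processed {order}")  # stand-in for slow work
            jobs.task_done()

    worker = threading.Thread(target=run, daemon=True)
    worker.start()
    return worker


def submit_order(jobs, order_id):
    """Front-end tier: enqueue the request and respond immediately."""
    jobs.put(order_id)
    return {"status": "accepted", "order": order_id}


if __name__ == "__main__":
    jobs, results = queue.Queue(), []
    worker = start_worker(jobs, results)
    print(submit_order(jobs, 42))  # returns before the order is processed
    jobs.put(None)
    jobs.join()
    print(results)
```

Note that the caller gets "accepted" rather than "completed", which is one example of the kind of change to business processes that this style of design requires.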