Maintaining a resilient front door at massive scale, from Netflix

  • Netflix is responsible for about 1/3 of downstream internet traffic in North America

  • The team responsible for this at Netflix is called "edge engineering"

    • Apart from resiliency/scaling, it also cares about high-velocity product innovation and real-time health insights
  • Basic architecture:

    • end-user devices make requests to ELBs, which delegate to Zuul, which routes requests to the origin servers serving the APIs
  • Zuul

    • Multi-region resiliency
      • Cross-region failover: if us-east has a failure it routes to us-west
      • This also comes with a DNS change, so after propagation consumers are simply redirected to the other region
    • Dynamic routing
      • Route some users to a debug version of the API
      • Route a % of the traffic to a newer version (a filter sketch follows this section)
      • Managed via a web interface
    • Security and authentication!
    • Squeeze test
      • Used for performance testing, finding breaking points, etc
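
A minimal sketch of how a Zuul 1.x pre-filter could implement the percentage-based routing described above. The 10% threshold and the canary host are assumptions for illustration, not Netflix's actual configuration.

```java
import com.netflix.zuul.ZuulFilter;
import com.netflix.zuul.context.RequestContext;

import java.net.URL;
import java.util.concurrent.ThreadLocalRandom;

// Routes a configurable percentage of requests to a newer version of the API.
// The 10% threshold and the target host are illustrative values only.
public class CanaryRoutingFilter extends ZuulFilter {

    private static final double CANARY_PERCENT = 0.10; // assumption: divert 10% of traffic

    @Override
    public String filterType() {
        return "pre"; // runs before the routing decision is finalized
    }

    @Override
    public int filterOrder() {
        return 10;
    }

    @Override
    public boolean shouldFilter() {
        // Only divert the chosen fraction of requests.
        return ThreadLocalRandom.current().nextDouble() < CANARY_PERCENT;
    }

    @Override
    public Object run() {
        RequestContext ctx = RequestContext.getCurrentContext();
        try {
            // Hypothetical host for the newer API version.
            ctx.setRouteHost(new URL("http://api-canary.example.internal"));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return null;
    }
}
```
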
  • Numbers:

    • Request ratio is 7:1 (each incoming request results in 7 internal service calls)
    • 5 billion requests/day
    • 30 dependent services
    • None of them has a 100% SLA
      • If each has 99.99% uptime, the system as a whole only gets ~99.7% (0.9999^30 ≈ 0.997; worked out below)
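
A quick check of that compounding, using the 30-dependency figure from the notes above: if a request needs all dependencies to succeed, the availabilities multiply.

```java
public class AvailabilityMath {
    public static void main(String[] args) {
        double perService = 0.9999;   // 99.99% uptime per dependency
        int dependencies = 30;        // number of dependent services
        // Availabilities multiply when every dependency must succeed.
        double combined = Math.pow(perService, dependencies);
        System.out.printf("Combined availability: %.4f (~99.7%%)%n", combined);
    }
}
```
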
  • Hystrix

    • Toggle circuit breakers
      • They use percentage-based breakers (trip when the error rate crosses a threshold)
      • Sample fallback: when the personalized rating service is down they just show the average ratings (sketched after this section)
    • Health of all dependencies
    • Really nice condensed view of the health of the system:
      • Error rate
      • Queue status
      • Response times
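
A minimal sketch of that fallback pattern using HystrixCommand. The RatingsClient interface and the title/average-rating method names are assumptions for illustration, not Netflix's actual service interfaces.

```java
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

// Wraps the call to a (hypothetical) personalized rating service.
// If the circuit is open or the call fails or times out, Hystrix invokes getFallback(),
// which returns the average rating for the title instead of the user's own rating.
public class GetTitleRatingCommand extends HystrixCommand<Double> {

    private final String titleId;
    private final RatingsClient client; // hypothetical client interface

    public GetTitleRatingCommand(RatingsClient client, String titleId) {
        super(HystrixCommandGroupKey.Factory.asKey("RatingService"));
        this.client = client;
        this.titleId = titleId;
    }

    @Override
    protected Double run() {
        // Primary path: personalized rating for the current user.
        return client.personalizedRating(titleId);
    }

    @Override
    protected Double getFallback() {
        // Degraded path: the precomputed average rating for the title.
        return client.averageRating(titleId);
    }

    // Minimal stand-in for whatever client Netflix actually uses.
    public interface RatingsClient {
        Double personalizedRating(String titleId);
        Double averageRating(String titleId);
    }
}
```

A caller would run it with `new GetTitleRatingCommand(client, titleId).execute()`, and Hystrix takes care of timeouts, circuit state, and falling back.
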
  • Grepzilla

    • Realtime, distributed tail+grep
    • Seems to cover some of Splunk's use cases, from the command line
  • Spinnaker

    • New tool Netflix is developing to manage AWS resources
  • "The possibilities are numerous once we decide to act and not react"

  • Reactive auto scaling

    • React to real time conditions
    • Respond to spikes/dips in metrics
      • eg: load averages, req/sec, etc
    • Excellent for many scenarios
    • But comes with challenges:
      • Policies can be inefficient when traffic patterns vary
      • Performance degradation during instance startup
      • Outages can trigger scale-down events (request rate drops even though demand hasn't; see the sketch after this list)
      • Excess capacity
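
A hypothetical sketch of the reactive policy described above (not Netflix's implementation or the AWS API). It scales purely on the current metric value, which also shows the outage pitfall: a drop in requests looks the same as genuinely low demand.

```java
// Hypothetical reactive scaler: scales purely on the current metric value.
// Everything here (thresholds, step sizes, the MetricSource/FleetControl interfaces)
// is illustrative, not a real AWS or Netflix API.
public class ReactiveScaler {

    interface MetricSource { double requestsPerSecondPerInstance(); }
    interface FleetControl { int currentSize(); void resize(int desired); }

    private static final double SCALE_UP_RPS = 800;   // assumed per-instance threshold
    private static final double SCALE_DOWN_RPS = 300; // assumed per-instance threshold

    public void evaluate(MetricSource metrics, FleetControl fleet, int minInstances) {
        double rps = metrics.requestsPerSecondPerInstance();
        int size = fleet.currentSize();

        if (rps > SCALE_UP_RPS) {
            // Spike: add capacity, but new instances take time to warm up,
            // so performance can still degrade during startup.
            fleet.resize(size + 2);
        } else if (rps < SCALE_DOWN_RPS && size > minInstances) {
            // Low traffic: remove capacity. During an outage this is exactly the
            // wrong move, because request rate drops even though demand hasn't.
            fleet.resize(size - 1);
        }
    }
}
```
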
  • Scryer: predictive auto scaling system

    • Evaluates needs based on historical data
      • week over week, month over month
    • Adjusts minimums: at any time, only set the minimum capacity you're expecting and let the reactive auto scaler handle the maximums (sketched below)
    • Good results in production
      • During outages, the drop in requests is normally followed by a spike of pent-up requests once service is restored. By setting the minimum, Scryer made sure the system was ready to handle that load when it came back.
      • Saved money
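
A hypothetical sketch of the predictive-minimum idea (not the actual Scryer algorithm): use the load observed at the same hour in previous weeks to set a capacity floor, and let the reactive scaler handle anything above it.

```java
import java.util.List;

// Hypothetical predictive minimum: look at the request rate observed at this hour
// in previous weeks and derive a capacity floor from it. Scryer's real model is
// more sophisticated; this only illustrates the "set minimums from history" idea.
public class PredictiveMinimums {

    private static final double RPS_PER_INSTANCE = 500; // assumed instance capacity

    /**
     * @param sameHourLastWeeks observed requests/sec at this hour of day, one sample per past week
     * @return minimum instance count to pre-provision for this hour
     */
    public int minimumInstancesFor(List<Double> sameHourLastWeeks) {
        double expectedRps = sameHourLastWeeks.stream()
                .mapToDouble(Double::doubleValue)
                .max()               // be conservative: plan for the busiest recent week
                .orElse(0);
        // Keep ~30% headroom so the floor already covers the expected load.
        return (int) Math.ceil(expectedRps * 1.3 / RPS_PER_INSTANCE);
    }
}
```
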
  • Other takeaways:

    • Timeout and retry configuration requires lots of attention (a minimal sketch follows this list)
    • Fallbacks when circuit breakers are open are important (e.g. how they show average title ratings instead of the user's own)
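
A minimal sketch of the kind of timeout-and-retry discipline the notes call out: bounded attempts plus an overall deadline so retries cannot pile up during an outage. The specific budget values are assumptions.

```java
import java.time.Duration;
import java.util.concurrent.Callable;

// Illustrative retry helper: bounded attempts, simple backoff, and an overall
// deadline. All numbers are assumptions; the point is that these budgets need
// deliberate tuning rather than defaults.
public class RetryWithBudget {

    public static <T> T call(Callable<T> attempt, int maxAttempts, Duration overallDeadline)
            throws Exception {
        long deadlineNanos = System.nanoTime() + overallDeadline.toNanos();
        Exception last = null;
        for (int i = 0; i < maxAttempts && System.nanoTime() < deadlineNanos; i++) {
            try {
                return attempt.call();
            } catch (Exception e) {
                last = e;                      // remember the failure and retry if budget remains
                Thread.sleep(100L * (i + 1));  // simple linear backoff between attempts
            }
        }
        throw last != null ? last : new IllegalStateException("no attempts made");
    }
}
```
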