Maintaining a resilient front door at massive scale, from Netflix

  • Netflix is responsible for about one third of downstream internet traffic in North America

  • Responsible team in the company is called "edge engineering"

    • Apart from resiliency/scaling, also cares about high velocity product innovation and real time health insights
  • Basic architecture:

    • end-user devices make requests to ELBs, which delegate to Zuul, which routes to origin servers serving the APIs
  • Zuul

    • Multi-region resiliency
      • Cross-region failover: if us-east has a failure it routes to us-west
      • This also comes with a DNS change, so once it propagates, consumers are redirected to the other region
    • Dynamic routing
      • Route some users to a debug version of the API
      • Route a % of the traffic to a newer version
      • Managed via a web interface
    • Security and authentication!
    • Squeeze test
      • To allow performance testing, find breaking points, etc
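The dynamic-routing ideas above (debug routing for specific users, a percentage of traffic to a newer version) can be sketched roughly like this. This is a hypothetical illustration, not Zuul's actual filter API; `DEBUG_USERS`, `CANARY_PERCENT`, and the origin names are all invented. Hashing the user ID keeps each user's routing decision sticky across requests:

```python
import hashlib

DEBUG_USERS = {"user-42"}   # users explicitly routed to a debug build of the API
CANARY_PERCENT = 5          # % of traffic sent to the newer version

def pick_origin(user_id: str) -> str:
    """Route a request to an origin cluster based on who is asking."""
    if user_id in DEBUG_USERS:
        return "api-debug"
    # Hash into one of 100 buckets so the split is deterministic per user.
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
    return "api-v2" if bucket < CANARY_PERCENT else "api-v1"

print(pick_origin("user-42"))  # api-debug
```

In a real edge router these rules would be loaded dynamically (the talk mentions a web interface for managing them) rather than hard-coded.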
  • Numbers:

    • Request ratio is 7:1 (each incoming request results in 7 internal service calls)
    • 5 billion requests/day
    • 30 dependent services
    • 0 of them have a 100% SLA
      • If they each have 99.99% uptime, the system as a whole would only have ~99.7% (0.9999^30)
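The compound-availability figure checks out: uptimes of serial dependencies multiply, since a request succeeds only if every dependency is up. A quick sanity check:

```python
def compound_availability(per_service_uptime: float, n_services: int) -> float:
    """Availability of a system that needs all n dependencies to be up."""
    return per_service_uptime ** n_services

# 30 dependencies at 99.99% ("four nines") each:
overall = compound_availability(0.9999, 30)
print(f"{overall:.4%}")  # roughly 99.70%
```

That ~0.3% gap is about 26 hours of whole-system unavailability per year, which is why fallbacks matter more than chasing 100% SLAs.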
  • Hystrix

    • Toggle circuit breakers
      • They do percent-based breakers
      • Sample fallback: when the custom rating service is down, they just show the average ratings
    • Health of all dependencies
      • Really nice condensed view of the health of the system:
      • Error rate
      • Queue status
      • Response times
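A percent-based breaker with a fallback, as described above, can be sketched in a few lines. This is a toy model of the Hystrix pattern, not its real API; the threshold, window size, and the ratings demo values are all invented:

```python
import time

class CircuitBreaker:
    """Toy percent-based breaker: opens when the error rate over a
    rolling window crosses a threshold, then serves the fallback."""

    def __init__(self, error_threshold_pct=50, window=20, reset_timeout=5.0):
        self.error_threshold_pct = error_threshold_pct
        self.window = window            # how many recent calls to look at
        self.reset_timeout = reset_timeout
        self.results = []               # rolling record of success/failure
        self.opened_at = None

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                return fallback()       # short-circuit: dependency presumed down
            self.opened_at = None       # half-open: let a request probe again
            self.results.clear()
        try:
            result = fn()
            self._record(ok=True)
            return result
        except Exception:
            self._record(ok=False)
            return fallback()

    def _record(self, ok):
        self.results = (self.results + [ok])[-self.window:]
        failures = self.results.count(False)
        if 100 * failures / len(self.results) >= self.error_threshold_pct:
            self.opened_at = time.monotonic()

# Demo mirroring the talk's example: rating service down, fall back to averages.
AVERAGE_RATING = 3.8
def failing_rating_service():
    raise ConnectionError("rating service down")

breaker = CircuitBreaker(error_threshold_pct=50, window=4)
for _ in range(6):
    rating = breaker.call(failing_rating_service, fallback=lambda: AVERAGE_RATING)
print(rating)  # 3.8 — the average rating, once the breaker has tripped
```

Production breakers also track volume thresholds and publish the metrics (error rate, queue status, response times) that feed the dashboard mentioned above.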
  • Grepzilla

    • Realtime, distributed tail+grep
    • Seems to cover some of Splunk's use cases, from the command line
  • Spinnaker

    • New tool Netflix is developing to manage AWS resources
  • "The possibilities are numerous once we decide to act and not react"

  • Reactive auto scaling

    • React to real time conditions
    • Respond to spikes/dips in metrics
      • eg: load averages, req/sec, etc
    • Excellent for many scenarios
    • But comes with challenges:
      • Policies can be inefficient when traffic patterns vary
      • Performance degradation during instance startup
      • Outages can trigger scale down events
      • Excess capacity
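The listed pitfalls fall out of how a reactive policy works. A minimal sketch, with invented numbers (target req/sec per instance, floor, ceiling), shows why an outage can trigger a scale-down: incoming traffic collapses, so the metric the policy reacts to does too.

```python
import math

def desired_instances(req_per_sec, target_per_instance=1000,
                      min_instances=2, max_instances=100):
    """Size the fleet from current load, clamped to [min, max]."""
    needed = math.ceil(req_per_sec / target_per_instance)
    # During an outage req_per_sec drops toward 0, so without a sensible
    # minimum the fleet shrinks right before pent-up traffic returns.
    return max(min_instances, min(max_instances, needed))

print(desired_instances(req_per_sec=0))      # 2 — outage collapses the fleet to the floor
print(desired_instances(req_per_sec=25000))  # 25
```

This is exactly the gap the predictive approach below fills: set the minimum from history instead of from the (possibly misleading) live signal.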
  • Scryer: predictive auto scaling system

    • Evaluate needs based on historical data
      • week over week, month over month
    • Adjust minimums (at any time, only set the minimum you're expecting, let the reactive auto scaler set the maximums)
    • Good results in production
      • During outages, the drop in requests is normally followed by a spike of pent-up requests once service returns. By setting minimums, Scryer made sure the system was ready to handle that load.
      • Saved money
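The "adjust minimums" idea can be sketched as: predict load for a given hour from the same hour in previous weeks, derive a minimum fleet size from it, and let the reactive scaler handle everything above that. This is a hypothetical simplification of Scryer; the headroom factor and capacity numbers are invented.

```python
from statistics import mean

def predicted_min_instances(same_hour_history_rps, target_per_instance=1000,
                            headroom=1.2):
    """Minimum fleet size for an hour, from week-over-week history.

    same_hour_history_rps: req/sec observed at this hour in past weeks.
    """
    predicted_rps = mean(same_hour_history_rps) * headroom
    return max(1, round(predicted_rps / target_per_instance))

# Load seen at this hour over the last four weeks:
history = [18000, 21000, 19500, 20500]
print(predicted_min_instances(history))  # 24
```

Because the minimum comes from history rather than live metrics, an outage's traffic drop cannot shrink the fleet below what the post-outage spike will need.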
  • Other takeaways:

    • Timeout and retry configuration require lots of attention
    • Fallbacks when circuit breakers are open are important (e.g. how they show average title ratings instead of the user's own)
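The two takeaways combine naturally: bound each attempt with a timeout, retry a small number of times with backoff, and let the final failure propagate so the caller can fall back. A minimal sketch, with invented retry counts and delays:

```python
import random
import time

def call_with_retry(fn, attempts=3, base_delay=0.05):
    """Bounded retries with jittered exponential backoff.

    The last failure propagates, so callers can catch it and serve a
    fallback (e.g. an average rating instead of a personalized one).
    """
    for attempt in range(attempts):
        try:
            return fn()
        except TimeoutError:
            if attempt == attempts - 1:
                raise
            # Jitter avoids retry storms where all clients retry in sync.
            time.sleep(base_delay * (2 ** attempt) * random.random())

# Demo: a dependency that times out twice, then recovers.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("dependency slow")
    return "ok"

print(call_with_retry(flaky))  # ok
```

Getting `attempts`, per-attempt timeouts, and backoff right per dependency is the "lots of attention" part: retries multiply load on an already-struggling service.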