Maintaining a resilient front door at massive scale, from Netflix

  • Netflix is responsible for about 1/3 of downstream internet traffic in North America

  • The team responsible for this at Netflix is called "edge engineering"

    • Apart from resiliency/scaling, it also cares about high-velocity product innovation and real-time health insights
  • Basic architecture:

    • end-user devices make requests to ELBs, which delegate to Zuul, which routes requests to the origin servers serving the APIs
  • Zuul

    • Multi-region resiliency
      • Cross-region failover: if us-east has a failure it routes to us-west
      • This also comes with a DNS change, so after propagation consumers are simply redirected to the other region
    • Dynamic routing
      • Route some users to a debug version of the API
      • Route a % of the traffic to a newer version (a filter sketch follows this section)
      • Managed via a web interface
    • Security and authentication!
    • Squeeze test
      • Used for performance testing, finding breaking points, etc
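
A minimal sketch of how a Zuul 1.x pre-filter could implement the percentage-based routing described above. The 10% threshold and the canary host are assumptions for illustration, not Netflix's actual configuration.

```java
import com.netflix.zuul.ZuulFilter;
import com.netflix.zuul.context.RequestContext;

import java.net.URL;
import java.util.concurrent.ThreadLocalRandom;

// Routes a configurable percentage of requests to a newer version of the API.
// The 10% threshold and the target host are illustrative values only.
public class CanaryRoutingFilter extends ZuulFilter {

    private static final double CANARY_PERCENT = 0.10; // assumption: divert 10% of traffic

    @Override
    public String filterType() {
        return "pre"; // runs before the routing decision is finalized
    }

    @Override
    public int filterOrder() {
        return 10;
    }

    @Override
    public boolean shouldFilter() {
        // Only divert the chosen fraction of requests.
        return ThreadLocalRandom.current().nextDouble() < CANARY_PERCENT;
    }

    @Override
    public Object run() {
        RequestContext ctx = RequestContext.getCurrentContext();
        try {
            // Hypothetical host for the newer API version.
            ctx.setRouteHost(new URL("http://api-canary.example.internal"));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return null;
    }
}
```
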
  • Numbers:

    • Request ratio is 7:1 (each incoming request results in 7 internal service calls)
    • 5 billion requests/day
    • 30 dependent services
    • None of them has a 100% SLA
      • If each has 99.99% uptime, the system as a whole only gets ~99.7% (0.9999^30 ≈ 0.997; worked out below)
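
A quick check of that compounding, using the 30-dependency figure from the notes above: if a request needs all dependencies to succeed, the availabilities multiply.

```java
public class AvailabilityMath {
    public static void main(String[] args) {
        double perService = 0.9999;   // 99.99% uptime per dependency
        int dependencies = 30;        // number of dependent services
        // Availabilities multiply when every dependency must succeed.
        double combined = Math.pow(perService, dependencies);
        System.out.printf("Combined availability: %.4f (~99.7%%)%n", combined);
    }
}
```
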
  • Hystrix

    • Toggle circuit breakers
      • They use percentage-based breakers (trip when the error rate crosses a threshold)
      • Sample fallback: when the personalized rating service is down they just show the average ratings (sketched after this section)
    • Health of all dependencies
    • Really nice condensed view of the health of the system:
      • Error rate
      • Queue status
      • Response times
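
A minimal sketch of that fallback pattern using HystrixCommand. The RatingsClient interface and the title/average-rating method names are assumptions for illustration, not Netflix's actual service interfaces.

```java
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

// Wraps the call to a (hypothetical) personalized rating service.
// If the circuit is open or the call fails or times out, Hystrix invokes getFallback(),
// which returns the average rating for the title instead of the user's own rating.
public class GetTitleRatingCommand extends HystrixCommand<Double> {

    private final String titleId;
    private final RatingsClient client; // hypothetical client interface

    public GetTitleRatingCommand(RatingsClient client, String titleId) {
        super(HystrixCommandGroupKey.Factory.asKey("RatingService"));
        this.client = client;
        this.titleId = titleId;
    }

    @Override
    protected Double run() {
        // Primary path: personalized rating for the current user.
        return client.personalizedRating(titleId);
    }

    @Override
    protected Double getFallback() {
        // Degraded path: the precomputed average rating for the title.
        return client.averageRating(titleId);
    }

    // Minimal stand-in for whatever client Netflix actually uses.
    public interface RatingsClient {
        Double personalizedRating(String titleId);
        Double averageRating(String titleId);
    }
}
```

A caller would run it with `new GetTitleRatingCommand(client, titleId).execute()`, and Hystrix takes care of timeouts, circuit state, and falling back.
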
  • Grepzilla

    • Realtime, distributed tail+grep
    • Seems to cover some of Splunk's use cases, from the command line
  • Spinnaker

    • New tool Netflix is developing to manage AWS resources
  • "The possibilities are numerous once we decide to act and not react"

  • Reactive auto scaling

    • React to real time conditions
    • Respond to spikes/dips in metrics
      • eg: load averages, req/sec, etc
    • Excellent for many scenarios
    • But comes with challenges:
      • Policies can be inefficient when traffic patterns vary
      • Performance degradation during instance startup
      • Outages can trigger scale-down events (request rate drops even though demand hasn't; see the sketch after this list)
      • Excess capacity
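
A hypothetical sketch of the reactive policy described above (not Netflix's implementation or the AWS API). It scales purely on the current metric value, which also shows the outage pitfall: a drop in requests looks the same as genuinely low demand.

```java
// Hypothetical reactive scaler: scales purely on the current metric value.
// Everything here (thresholds, step sizes, the MetricSource/FleetControl interfaces)
// is illustrative, not a real AWS or Netflix API.
public class ReactiveScaler {

    interface MetricSource { double requestsPerSecondPerInstance(); }
    interface FleetControl { int currentSize(); void resize(int desired); }

    private static final double SCALE_UP_RPS = 800;   // assumed per-instance threshold
    private static final double SCALE_DOWN_RPS = 300; // assumed per-instance threshold

    public void evaluate(MetricSource metrics, FleetControl fleet, int minInstances) {
        double rps = metrics.requestsPerSecondPerInstance();
        int size = fleet.currentSize();

        if (rps > SCALE_UP_RPS) {
            // Spike: add capacity, but new instances take time to warm up,
            // so performance can still degrade during startup.
            fleet.resize(size + 2);
        } else if (rps < SCALE_DOWN_RPS && size > minInstances) {
            // Low traffic: remove capacity. During an outage this is exactly the
            // wrong move, because request rate drops even though demand hasn't.
            fleet.resize(size - 1);
        }
    }
}
```
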
  • Scryer: predictive auto scaling system

    • Evaluates needs based on historical data
      • week over week, month over month
    • Adjusts minimums: at any time, only set the minimum capacity you're expecting and let the reactive auto scaler handle the maximums (sketched below)
    • Good results in production
      • During outages, the drop in requests is normally followed by a spike of pent-up requests once service is restored. By setting the minimum, Scryer made sure the system was ready to handle that load when it came back.
      • Saved money
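
A hypothetical sketch of the predictive-minimum idea (not the actual Scryer algorithm): use the load observed at the same hour in previous weeks to set a capacity floor, and let the reactive scaler handle anything above it.

```java
import java.util.List;

// Hypothetical predictive minimum: look at the request rate observed at this hour
// in previous weeks and derive a capacity floor from it. Scryer's real model is
// more sophisticated; this only illustrates the "set minimums from history" idea.
public class PredictiveMinimums {

    private static final double RPS_PER_INSTANCE = 500; // assumed instance capacity

    /**
     * @param sameHourLastWeeks observed requests/sec at this hour of day, one sample per past week
     * @return minimum instance count to pre-provision for this hour
     */
    public int minimumInstancesFor(List<Double> sameHourLastWeeks) {
        double expectedRps = sameHourLastWeeks.stream()
                .mapToDouble(Double::doubleValue)
                .max()               // be conservative: plan for the busiest recent week
                .orElse(0);
        // Keep ~30% headroom so the floor already covers the expected load.
        return (int) Math.ceil(expectedRps * 1.3 / RPS_PER_INSTANCE);
    }
}
```
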
  • Other takeaways:

    • Timeout and retry configuration requires lots of attention (a minimal sketch follows this list)
    • Fallbacks when circuit breakers are open are important (e.g. how they show average title ratings instead of the user's own)
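
A minimal sketch of the kind of timeout-and-retry discipline the notes call out: bounded attempts plus an overall deadline so retries cannot pile up during an outage. The specific budget values are assumptions.

```java
import java.time.Duration;
import java.util.concurrent.Callable;

// Illustrative retry helper: bounded attempts, simple backoff, and an overall
// deadline. All numbers are assumptions; the point is that these budgets need
// deliberate tuning rather than defaults.
public class RetryWithBudget {

    public static <T> T call(Callable<T> attempt, int maxAttempts, Duration overallDeadline)
            throws Exception {
        long deadlineNanos = System.nanoTime() + overallDeadline.toNanos();
        Exception last = null;
        for (int i = 0; i < maxAttempts && System.nanoTime() < deadlineNanos; i++) {
            try {
                return attempt.call();
            } catch (Exception e) {
                last = e;                      // remember the failure and retry if budget remains
                Thread.sleep(100L * (i + 1));  // simple linear backoff between attempts
            }
        }
        throw last != null ? last : new IllegalStateException("no attempts made");
    }
}
```
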