@peterwwillis
Last active March 15, 2021 23:05
How to make Hacker News resistant to outages

This is an explanation of how Hacker News could be made resilient against network and infrastructure failures.

Step 1. DNS redundancy

Make sure you use a DNS nameserver provider that has multiple nameservers using multiple cloud hosting providers in multiple regions and zones. For added redundancy, use multiple nameserver providers, replicate your records between them, and make sure each uses different providers/regions.

Point your origin DNS record (origin.mydomain.com) at each of your origins, using CNAMEs or A records. Keep the TTL as low as you can, usually 60 seconds. Since only your CDN should be hitting this host, this shouldn't stress your nameserver. During an outage, one origin can be removed from DNS (if necessary).
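
As a sketch, the records described above might look like this in BIND zone-file syntax (the domain, nameserver hosts, and origin IPs are all placeholders):

```
; Nameservers spread across two different providers (placeholder hosts)
mydomain.com.          86400  IN  NS  ns1.provider-a.example.
mydomain.com.          86400  IN  NS  ns1.provider-b.example.

; Origin record with a 60-second TTL, one A record per origin
origin.mydomain.com.      60  IN  A   203.0.113.10
origin.mydomain.com.      60  IN  A   198.51.100.20
```

During an outage, the A record for the failed origin is simply removed; with a 60-second TTL, caching resolvers pick up the change quickly.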

Total cost: Free.

Step 2. Origin redundancy

For true redundancy, your content origin should be served in multiple regions by multiple providers. If you use multiple providers, just one unique region per provider will do. If you don't use multiple providers, use multiple regions in one provider. If you don't use multiple regions, use multiple availability zones.

You will need to duplicate all the production infrastructure that serves your content: S3 buckets, SQL databases, EC2 instances, EBS volumes, load balancers, VPCs, etc. If your site is static and served with the S3 Static Website feature, just duplicate the S3 bucket and enable S3 replication to keep the copies in sync.

Load Balancer

If you use a CDN (CloudFlare is free so why not?) you might not need a load balancer to distribute your traffic to your origin. But if you have multiple web servers, a load balancer is recommended.

Technically you don't need a load balancer, since you can list each server in your origin's DNS record. But changed records can take a long time to propagate through caching nameservers (regardless of the TTL, whose typical minimum is 60 seconds), and some clients may keep attempting connections to a downed origin, causing sporadic degraded performance for users.

You should run multiple web servers behind a load balancer so you can do zero-downtime rolling deployments, and for redundancy when one web server goes down. Each web server should sit in a different availability zone on a different physical server. (Note that "web server" here means a single HTTP daemon process, not an entire physical/virtual host.)
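
A minimal sketch of this in nginx, assuming two hypothetical backend web servers in different availability zones (addresses and ports are placeholders):

```nginx
# Load balancer spreading traffic across two web servers in
# different availability zones; a backend is taken out of
# rotation after 3 failures within 10 seconds.
upstream webapp {
    server 10.0.1.10:8080 max_fails=3 fail_timeout=10s;  # AZ a
    server 10.0.2.10:8080 max_fails=3 fail_timeout=10s;  # AZ b
}

server {
    listen 80;
    location / {
        proxy_pass http://webapp;
    }
}
```

With this in place, a rolling deployment is just draining and upgrading one backend at a time.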

With one load balancer per region across two providers, you need a minimum of two load balancers.

Content Replication

The content needs to be available for each web server, so it needs to be replicated to each database/region/etc.

If the replication method is asynchronous, either your secondary origin's webserver should use only the write-master database (SQL content), or only your primary origin should serve traffic (flat-file content).

SQL content

  • Configure database for replication. One instance is the production database, and the other is a read replica.
  • If you want to serve traffic from both origins:
    • Configure both webservers to connect to the one writeable database for writes.
    • You can optionally configure each webserver's web app to use its local db for reads if the replication is synchronous.
  • If you want to serve traffic from a single origin:
    • Configure each webserver to connect to its own local database.
  • In the event of a provider outage:
    • If the provider with your write master database is down:
      • Reconfigure the read replica into the new write master.
      • If the remaining webserver was pointing at the database that is down, reconfigure it to point at the new write master.
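
The routing rules above can be sketched as a small function. This is a minimal illustration, not a real database driver; the hostnames are hypothetical:

```python
def pick_db(operation, local_db, write_master, replication_synchronous):
    """Return the database host a webserver should use.

    Writes always go to the single write master. Reads may use the
    local database only when replication is synchronous; otherwise
    they must also go to the write master to avoid stale reads.
    """
    if operation == "write":
        return write_master
    if replication_synchronous:
        return local_db
    return write_master


# Origin B's webserver, with asynchronous replication, sends both
# reads and writes to the write master at origin A:
pick_db("read", "db-b.internal", "db-a.internal", False)   # -> "db-a.internal"
pick_db("write", "db-b.internal", "db-a.internal", False)  # -> "db-a.internal"
```

During a failover, the value of `write_master` is simply changed to the promoted replica's host.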

Flat-file content

  • Configure a network storage solution that does replication, such as DRBD, MARS, or Cinder. (You could also build a custom replication system for a custom web application, but this is a lot of work with little payoff.)
  • Configure all traffic to go to a single origin.
  • Configure each web server to access its local files.
  • In the event of a provider outage:
    • Point CDN/origin DNS at the second origin.
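
For the DRBD option, a resource definition might look roughly like this (hostnames, devices, and addresses are placeholders; exact syntax varies by DRBD version):

```
# Hypothetical DRBD resource replicating the web content volume
# between the two origins.
resource webcontent {
    protocol A;                 # asynchronous replication across providers
    device    /dev/drbd0;
    disk      /dev/vdb1;
    meta-disk internal;
    on origin-a { address 203.0.113.10:7789; }
    on origin-b { address 198.51.100.20:7789; }
}
```

Protocol A (asynchronous) is the realistic choice across providers/regions, which is exactly why only one origin should serve traffic in this setup.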

S3 content

  • Enable S3 bucket replication.
  • AWS S3 is now strongly consistent, but some other S3-compatible object stores may not be. If not strongly consistent, this is considered asynchronous replication; see flat-files section for how to handle this.
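
A replication configuration for AWS S3 can be applied with `aws s3api put-bucket-replication`. This is a sketch with placeholder bucket names and a placeholder IAM role ARN:

```json
{
  "Role": "arn:aws:iam::123456789012:role/s3-replication-role",
  "Rules": [
    {
      "Status": "Enabled",
      "Priority": 1,
      "Filter": {},
      "DeleteMarkerReplication": { "Status": "Disabled" },
      "Destination": { "Bucket": "arn:aws:s3:::mysite-origin-b" }
    }
  ]
}
```

Note that cross-region replication between buckets is itself asynchronous, regardless of read-after-write consistency within a single bucket.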

Example

  • One VPS at each of two VPS providers, each in a unique region, with the origin DNS name pointing at both (either CNAMEs or A records).
  • Each VPS runs an identical web app and SQL database.
  • Configure replication of data.
  • Configure CDN to serve traffic to one or both origins, depending on the above notes.
  • Configure webservers to query data from one database, if using a database and replication is not synchronous.

Lowest possible total cost: $10/month (2 VPSes @ $5/ea).
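
The failover decision described above can be sketched as a small function that decides which origins the CDN should serve from, given health-check results. The origin names are hypothetical:

```python
def serving_origins(origins, healthy, replication_synchronous):
    """Return the origins the CDN should send traffic to.

    With synchronous replication, every healthy origin can serve.
    With asynchronous replication, only the first healthy origin
    (the primary) should serve, per the notes above.
    """
    up = [o for o in origins if healthy.get(o)]
    if replication_synchronous:
        return up
    return up[:1]


# Async replication, primary down: traffic fails over to the secondary.
serving_origins(["vps-a", "vps-b"], {"vps-a": False, "vps-b": True}, False)
# -> ["vps-b"]
```

A cron job or monitoring hook can run this logic and update the CDN/origin DNS records accordingly.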

Step 3. CDN

Most CDNs should be fairly stable, assuming you don't use anything but the most basic features. For extra paranoia you can use multiple CDNs that point to the same origin(s).

If your CDN supports multiple origins for a single host/domain, use that rather than listing all your origins in one DNS record.

Total cost: Free (CloudFlare).
