Advanced Network Resilience in VPC with Consul
At my company, we've been using AWS with VPC for about three years. On the day we started building out infrastructure in it, we emailed our Amazon contact asking for 'a NAT equivalent of an Internet Gateway' - an AWS-managed piece of infrastructure that would do NAT for us. We're still waiting.
In the meantime, we've been through a couple of approaches to providing network resilience for NAT devices. As we now use Consul for service discovery everywhere, it made sense to use its feature set when we revisited how to provide resilience at the network layer.
Autoscaling NAT/bastion instances
For our application to function, it needs outbound Internet connectivity at all times. Originally, we provided this by running one NAT instance per AZ, and having health checks fail in an AZ if its NAT instance was unavailable. This meant that a failed NAT instance took down a whole AZ - something the infrastructure had been designed to cope with, but not ideal, as it meant losing half or a third of capacity until the machine was manually re-provisioned.
The approach I set out below allows us to have NAT provided by instances in an autoscaling group, with minimal downtime in the event of instance failure. This means we no longer need to worry about machines 'scheduled for retirement', and can terminate them at will.
In this example, we set up a three-node Consul cluster. One node will be elected as the NAT instance and will take over NAT duties. A simplistic health check is provided to ensure this instance has Internet access; it sends a ping to google.com and checks for a response. If the node fails in any way, another will quickly step in and take over routing.
In practice, if you already have a Consul cluster, you would only need two NAT instances running to retain fast failover.
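A health check of the kind described above could look something like this minimal sketch. It is not the check shipped with the template: the function name `has_internet`, its parameters, and the Linux-style `ping` flags (`-c` for count, `-W` for the reply timeout in seconds) are assumptions made for illustration.

```python
import subprocess


def has_internet(host="google.com", count=1, timeout_s=2, run=subprocess.run):
    """Return True if `host` answers a ping, False otherwise.

    `run` is injectable so the check can be exercised without a network;
    by default it is subprocess.run.
    """
    # Assumption: Linux iputils ping, where -W is the reply timeout in seconds.
    cmd = ["ping", "-c", str(count), "-W", str(timeout_s), host]
    try:
        result = run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s * count + 1,  # hard upper bound on the whole call
        )
    except (subprocess.TimeoutExpired, OSError):
        # ping hung, is missing, or could not be executed: treat as unhealthy
        return False
    # ping exits 0 only when at least one reply was received
    return result.returncode == 0
```

Consul script checks treat exit code 0 as passing, 1 as warning and anything else as critical, so a wrapper script registered as a check could end with `sys.exit(0 if has_internet() else 2)`.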
You can try out this setup by using the CloudFormation template at https://s3-eu-west-1.amazonaws.com/awsadvent2014/advent.json
The template only has AMIs defined for us-west-2 and eu-west-1, so you'll need to launch in one of those regions.
This setup relies on a Python script ( https://s3-eu-west-1.amazonaws.com/awsadvent2014/instance.py ) as a wrapper around Consul. It discovers the other nodes for Consul to connect to via the AWS API, and uses Consul session locking to get the cluster to agree on which machine should be the NAT device.
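To give a feel for the session locking step, here is a rough sketch of how a node might try to win the NAT role through Consul's HTTP API. It is not taken from instance.py: the agent address, the key name `service/nat/leader`, the session name and the helper names `consul_put` and `try_become_nat` are all hypothetical. It relies only on the `/v1/session/create` endpoint and the `?acquire=` parameter of the KV endpoint.

```python
import json
import urllib.request

CONSUL = "http://127.0.0.1:8500"  # assumption: local agent on the default HTTP port
LOCK_KEY = "service/nat/leader"   # hypothetical key the nodes compete for


def consul_put(path, data=b""):
    """PUT raw bytes to the local Consul agent and return the decoded JSON reply."""
    req = urllib.request.Request(CONSUL + path, data=data, method="PUT")
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.loads(resp.read().decode())


def try_become_nat(node_name, put=consul_put):
    """Create a session and try to acquire the lock key with it.

    Returns (session_id, won). Consul lets only one session hold a key at a
    time, so at most one node gets won == True.
    """
    # Session creation returns a JSON object containing the new session's ID.
    body = json.dumps({"Name": "nat-election"}).encode()
    session_id = put("/v1/session/create", body)["ID"]
    # The KV write replies with a bare true/false: true if the lock was acquired.
    won = put("/v1/kv/%s?acquire=%s" % (LOCK_KEY, session_id), node_name.encode())
    return session_id, bool(won)
```

A session created this way is tied to the health of the node that created it, so if that node dies the session is invalidated and the lock is released, at which point another node calling `try_become_nat` can win it and take over routing. The losing nodes would simply retry periodically.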
Hopefully this example gives you enough building blocks to go and implement something similar for your environment.