@ryan-lane
Created June 1, 2015 21:55
SSH known_hosts and key management for autoscale groups using KMS

Every instance you launch should be in an autoscale group (ASG). Every single one. Even singletons. There's almost no downside to using an ASG for an instance. Almost. One downside is that when an instance is destroyed and replaced in the ASG, its SSH host key changes. This leads people to disable known_hosts validation for SSH entirely. That's even more likely when you have an ASG that scales up and down frequently, since your nodes will basically always be new and none of the host keys in your known_hosts will match.

Let me step back a bit first. One of the first things you need to tackle when using ASGs is how to SSH into the nodes at all. None of the nodes have known DNS names, and you really don't want to assign known DNS names to them either, since you'd need to manage that yourself and it's kind of a pain. The easiest way to handle this is to have SSH map fake hostnames to instances by looking up the ASG information. For instance, 'ssh servicea-development-iad-8ab1efa.example.com' should cause SSH to find the 'servicea-development-iad' ASG, find instance i-8ab1efa in it, then map that to an IP address it can reach. You can do some fun things with this, like having servicea-development-iad-1.example.com map to the first node found, or servicea-development-iad.example.com return a random node.

I won't give a full solution for this in this post, since it's not really the topic, but the gist of it is that you can use the ProxyCommand option in SSH's config to call a script with the hostname. That script looks up the instance and netcats into the node on port 22, which proxies the SSH connection. Make sure to do some basic caching here or you'll quickly hit EC2's rate limits. Here's a simple example:

ProxyCommand asg-lookup %h
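
Here's a rough sketch of what such a lookup script could look like. It assumes boto3 and nc are available and that hostnames follow the servicea-development-iad-8ab1efa.example.com pattern; the caching and the -1/random-node variants described above are left out, so treat it as an illustration rather than a drop-in script.

#!/usr/bin/env python
# Hypothetical asg-lookup: resolve an ASG-style hostname to an instance IP
# and hand the connection off to netcat. Caching is omitted for brevity.
import os
import sys

import boto3  # assumption: boto3 is installed and credentials are available


def main():
    hostname = sys.argv[1]
    # servicea-development-iad-8ab1efa.example.com ->
    #   ASG 'servicea-development-iad', instance suffix '8ab1efa'
    name = hostname.split('.', 1)[0]
    asg_name, suffix = name.rsplit('-', 1)

    autoscaling = boto3.client('autoscaling')
    ec2 = boto3.client('ec2')

    groups = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[asg_name])
    instance_ids = [i['InstanceId']
                    for g in groups['AutoScalingGroups']
                    for i in g['Instances']]
    matches = [i for i in instance_ids if i.endswith(suffix)]

    reservations = ec2.describe_instances(InstanceIds=matches)
    ip = reservations['Reservations'][0]['Instances'][0]['PrivateIpAddress']

    # ssh talks to this process over stdin/stdout, so just exec netcat
    os.execvp('nc', ['nc', ip, '22'])


if __name__ == '__main__':
    main()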

Of course, this means you'll have a bunch of bad entries in your known_hosts file. Each time you SSH into servicea-development-iad.example.com, it'll give you a random node, which will have a different SSH key. Or, servicea-development-iad-1.example.com will often be a different node, due to autoscaling.

To fix our issue, we can pre-generate an SSH host key for an ASG, encrypt it with KMS and stick it into S3. When any node in the ASG starts, it fetches the key from S3, decrypts it, replaces the host key in /etc/ssh and restarts SSH. This can be orchestrated during the creation of the ASG: when the ASG is created, its SSH host key is generated, encrypted and stored in S3. Of course, you could also stick the encrypted key into cloud-init's user-data, but if you update the launch config often, you'll have to make sure the same key ends up in the new user-data every time. S3 is easier.
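
As a sketch of the orchestration side, generating a host key and pushing the encrypted copy to S3 could look something like the following. The bucket name, KMS key alias and object layout are made up for illustration, and boto3 is assumed.

#!/usr/bin/env python
# Hypothetical orchestration step: generate an SSH host key for an ASG,
# encrypt the private key with KMS and store the ciphertext in S3.
import os
import subprocess
import tempfile

import boto3

BUCKET = 'example-asg-ssh-keys'         # assumed bucket
KMS_KEY_ID = 'alias/asg-ssh-host-keys'  # assumed KMS key alias


def generate_and_store_host_key(asg_name):
    kms = boto3.client('kms')
    s3 = boto3.client('s3')

    tmpdir = tempfile.mkdtemp()
    key_path = os.path.join(tmpdir, 'ssh_host_rsa_key')
    # Generate a host keypair with no passphrase
    subprocess.check_call(
        ['ssh-keygen', '-t', 'rsa', '-b', '2048', '-N', '', '-f', key_path])

    with open(key_path, 'rb') as f:
        private_key = f.read()

    # The encryption context ties the ciphertext to this ASG; decrypting
    # with a different context (or none) will fail
    encrypted = kms.encrypt(
        KeyId=KMS_KEY_ID,
        Plaintext=private_key,
        EncryptionContext={'asg': asg_name})['CiphertextBlob']

    s3.put_object(Bucket=BUCKET,
                  Key='{0}/ssh_host_rsa_key'.format(asg_name),
                  Body=encrypted)

    # Return the public key so it can go into the pre-generated known_hosts
    with open(key_path + '.pub') as f:
        return f.read()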

Thanks to KMS's encryption context and IAM policy, we can limit decryption of the key to the autoscale group's IAM role, and limit encryption of the key to a role that's only used by orchestration. If we wanted to get really fancy, we could also generate a data key without plaintext from orchestration, store it in S3 and use it as the SSH private key, but I'm too lazy to implement that for this post, so I'll leave it to you.
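
On the node side, the boot-time script only needs to fetch the object and supply the matching encryption context; if the instance's IAM role isn't allowed to decrypt with that context, the call fails. A sketch, using the same assumed bucket and layout as above (the paths and restart command will vary by distro):

#!/usr/bin/env python
# Hypothetical boot script run on each node in the ASG: fetch the encrypted
# host key from S3, decrypt it with KMS, install it and restart sshd.
import os
import subprocess

import boto3

BUCKET = 'example-asg-ssh-keys'  # assumed bucket, matching the sketch above


def install_host_key(asg_name):
    s3 = boto3.client('s3')
    kms = boto3.client('kms')

    obj = s3.get_object(Bucket=BUCKET,
                        Key='{0}/ssh_host_rsa_key'.format(asg_name))
    ciphertext = obj['Body'].read()

    # Decryption only succeeds with the exact encryption context the key
    # was encrypted under, which IAM policy can also require
    private_key = kms.decrypt(
        CiphertextBlob=ciphertext,
        EncryptionContext={'asg': asg_name})['Plaintext']

    with open('/etc/ssh/ssh_host_rsa_key', 'wb') as f:
        f.write(private_key)
    os.chmod('/etc/ssh/ssh_host_rsa_key', 0o600)

    # Regenerate the matching public key file from the private key
    pub = subprocess.check_output(
        ['ssh-keygen', '-y', '-f', '/etc/ssh/ssh_host_rsa_key'])
    with open('/etc/ssh/ssh_host_rsa_key.pub', 'wb') as f:
        f.write(pub)

    # 'ssh' vs 'sshd' depends on the distro
    subprocess.check_call(['service', 'ssh', 'restart'])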

Pre-generating the keys does mean that every node in the autoscale group will have the exact same SSH host keypair, but the whole point of an autoscale group is that all of its nodes are exactly the same, right?

Since we're pre-generating the keys when we orchestrate the autoscale group, we can also keep a pre-generated SSH known_hosts file in S3 with all of the host keys already trusted. Everyone who needs to SSH into nodes can use this pre-generated known_hosts. We can add the following to the system SSH configuration:

# Globally trust the pre-generated asg known hosts
GlobalKnownHostsFile /etc/ssh/asg_known_hosts
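
That shared file can be assembled during the same orchestration run from the public keys returned when each ASG's host key is generated. A sketch, assuming entries are keyed by the per-service alias names introduced below and the same made-up bucket:

# Hypothetical helper: build the shared asg_known_hosts file from the host
# public keys collected during orchestration and publish it to S3, from
# where it can be distributed to /etc/ssh/asg_known_hosts.
import boto3

BUCKET = 'example-asg-ssh-keys'


def publish_asg_known_hosts(public_keys):
    # public_keys maps an ASG name like 'servicea-development-iad' to its
    # host public key ('ssh-rsa AAAA...')
    lines = []
    for asg_name, public_key in sorted(public_keys.items()):
        service = asg_name.split('-', 1)[0]
        # Key each entry by the alias every hostname for the service maps to
        alias = '{0}-ssh.example.com'.format(service)
        lines.append('{0} {1}'.format(alias, public_key.strip()))
    contents = '\n'.join(lines) + '\n'
    boto3.client('s3').put_object(Bucket=BUCKET,
                                  Key='asg_known_hosts',
                                  Body=contents.encode('utf-8'))
    return contents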

This gets us part of the way to our solution, but we can tighten it up further. People may be SSHing into a large number of hostnames, since we're doing magical host mapping. To avoid having a very large number of entries in the known_hosts file, we can also do some host key aliasing in the SSH config:

# Map all the ASG names down to manageable names:
Host servicea*.example.com
HostKeyAlias servicea-ssh.example.com

Now all hostnames for servicea will map to the same entry in the known_hosts file. For this to work properly, you'd need to have every ASG listed in the SSH config. If you're managing the global known_hosts, why not manage the global ssh config too?
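
Those per-service stanzas can be generated from the same list of services used for the known_hosts file; a small sketch (the naming scheme is the same assumption as above):

# Hypothetical helper: emit one Host/HostKeyAlias stanza per service, to be
# appended to the managed global ssh config.
def asg_ssh_config(services):
    # services is e.g. ['servicea', 'serviceb']
    stanzas = []
    for service in sorted(services):
        stanzas.append('Host {0}*.example.com\n'
                       '    HostKeyAlias {0}-ssh.example.com\n'.format(service))
    return '\n'.join(stanzas)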

With this approach it's possible to never get a warning when SSHing into any node in your cluster. Even better, it's possible to never even be asked to trust a key since they're all already trusted.


ikonst commented Mar 21, 2016

Another option to consider is to use SSH CA signing for host keys:
http://www.lorier.net/docs/ssh-ca

This might do away with:

  • the need to store lots of host keys in S3
  • cluttering of known_hosts (a single @cert-authority * ... entry is appended instead)
