AWSadvent 2014 article

CloudFormation woes: Keep calm and use Ansible

No more outdated information, a single source of truth, describing almost everything as code: isn’t this one of the DevOps dreams? Recent developments have brought this dream even closer. In the era of APIs, tools like Terraform and Ansible have emerged that can codify the creation and maintenance of entire “organizational ecosystems”.

This blog post is a brief description of the steps we have taken to come closer to this goal at my employer Jimdo. Before we begin looking at particular implementations, let’s take the helicopter view and have a look at the current state and the problems with it.

Current state

We began moving to AWS in 2011 and have been using CloudFormation from the beginning. While we currently describe almost everything in CloudFormation, there are some legacy pieces which were just “clicked” through the AWS console. In order to have at least some primitive auditing and documentation for those, we usually document all “clicked” settings with a Jenkins job, which runs Cucumber scenarios that do a live inspection of the settings (by querying the AWS APIs with a read-only user).

Here is an example of how we document our VPC subnets (which are currently not part of a CloudFormation stack) with Cucumber:

[Screenshot: AWS VPC subnets documented with Cucumber]

While this setup might not look that bad and provides a basic level of codification, it has several drawbacks, especially with CloudFormation itself, which we will look at now.

Problems with the current state

Existing AWS resources cannot be managed by CloudFormation

Maybe you have experienced the same issue: you start off with some new technology or provider and initially use the UI to play around. And suddenly, those clicked spikes are in production. At least that’s the story of how we came to AWS at Jimdo ;-)

So you might say: “OK, then let’s rebuild the clicked resources into a CloudFormation stack.” Well, the problem is that we didn’t describe basic components like the VPC and subnets as CloudFormation stacks in the first place, and since other production setups rely on those resources, we cannot change this easily anymore.

Not all AWS features are immediately available in CloudFormation

Here is another issue: The usual AWS feature release process is that a component team releases a new feature (e.g. ElastiCache replica groups), but the CloudFormation part is missing (the CloudFormation team at AWS is a separate team with its own roadmap). And since CloudFormation isn’t open source, we cannot add the missing functionality by ourselves.

So, in order to use those “Non-CloudFormation” features, we used to click the setup as a workaround, and then again document the settings with Cucumber. Here is an example of documented replica groups:

[Screenshot: ElastiCache cluster replica groups documented with Cucumber]

But the click-and-document-with-Cucumber approach has some drawbacks:

  • It’s not an enforced policy to document, so colleagues might miss the documentation step or see no value in it
  • It might be incomplete as not all clicked settings are documented
  • It encourages a “clicking culture”, which is the exact opposite of what we want to achieve

So we need something that lets us extend a CloudFormation stack with resources we can’t (yet) express in CloudFormation, and we need them grouped together semantically, as code.

Post processors for CloudFormation stacks

Some resources require post-processing in order to be fully ready. Imagine the creation of an RDS MySQL instance with CloudFormation: the physical database instance is created by CloudFormation, but what about the databases, users, and passwords inside it? This cannot be done with CloudFormation, so we need to work around it as well.

Our current approaches vary from manual steps documented in a wiki to a combination of Puppet and hiera-aws: Puppet, running on some admin node, retrieves RDS instance endpoints by tags, iterates over them, and executes shell scripts. This is a form of post-processing that is entirely decoupled from the CloudFormation stack, both in terms of time (hourly Puppet runs) and in terms of “location” (it lives in another repository). A very complicated way just for the sake of automation.

Inconvenient toolset

Currently we use the AWS CLI tools in a plain way. Some coworkers use the old tools, some use the new ones. And I guess there are even folks with their own wrappers / bash aliases.

A “good” example is the inability to change the tags of a CloudFormation stack after creation: if you forget to set them in the first place, you have to recreate the entire stack! The CLI tools do not automatically add tags to stacks, so this is easily forgotten and should be automated. As a result, we need a wrapper around CloudFormation which handles those situations for us.

Hardcoded / copy and pasted data

The idea of “single source information” or “single source of truth” is to never have a representation of data saved in more than one location. In the database world, it’s called “database normalization”. This is a very common pattern which should be followed unless you have an excellent excuse.

But if you don’t know better, are under time pressure, or your tooling is still immature, it’s hard to keep data single-sourced. This usually leads to copying, pasting, and hardcoding data.

Typical examples in the AWS world are resource IDs like Subnet IDs, Security Groups, or - in our case - our main VPC ID.

While this may not be an issue at first, it will come back to you in the future, e.g. when you want to roll out your stacks in another AWS region, perform disaster recovery, or have to grep several codebases for hardcoded data when refactoring.

So we needed something to access information from other CloudFormation stacks and/or otherwise created resources (the so-called “clicked infrastructure”) without ever referencing IDs, Security Groups, etc. directly.

Possible solutions

Now that we have a good picture of our current problems, we can actually look for solutions!

My research resulted in three possible tools: Ansible, Terraform, and Salt.

As of writing this, Ansible seems to be the only available tool that can deal with existing CloudFormation stacks out of the box, and it also appears to meet the other criteria at first glance, so I decided to move on with it.

Spiking the solution with Ansible

Describing an existing CloudFormation stack as Ansible Playbook

One of the problems mentioned above is the inconvenient CloudFormation CLI tooling: to create or update a stack, you have to synthesize at least the stack name, the template file name, and the parameters, which is no fun and error-prone. For example:

$ cfn-[create|update]-stack webpool-saturn-dev-eu-west-1 --capabilities CAPABILITY_IAM --parameters "VpcID=vpc-123456" --template-file webpool-saturn-dev-eu-west-1.json --tags "jimdo:role=webpool,jimdo:owner=independence-team,jimdo:environment=dev"

With Ansible, we can describe a new or existing CloudFormation stack in a few lines as an Ansible playbook. Here is one example:

---
- hosts: localhost
  connection: local
  gather_facts: no
  vars:
    jimdo_environment: dev
    aws_region: eu-west-1
    stack_name: "webpool-saturn-{{ jimdo_environment }}-{{ aws_region }}"

  tasks:
    - name: create CloudFormation stack
      cloudformation:
        stack_name: "{{ stack_name }}"
        state: "present"
        region: "{{ aws_region }}"
        template: "{{ stack_name }}.json"
        tags:
          "jimdo:role": "webpool"
          "jimdo:owner": "independence-team"
          "jimdo:environment": "{{ jimdo_environment }}"

Creating and updating (converging) the CloudFormation stack becomes as straightforward as:

$ ansible-playbook webpool-saturn-dev-eu-west-1.yml

Awesome! We finally have great tooling! The YAML syntax is machine- and human-readable and is our single source of truth from now on.

Extending an existing CloudFormation stack with Ansible

“As for added power, it should be easier to implement AWS functionality that's currently missing from CloudFormation as an Ansible module than a CloudFormation external resource [...] and performing other out of band tasks, letting your ticketing system know about a new stack for example, is a lot easier to integrate into Ansible than trying to wrap the cli tools manually.”

-- Dean Wilson

The example stack above uses the AWS ElastiCache feature of Redis replica groups, which unfortunately isn’t currently supported by CloudFormation; we could only describe the main ElastiCache cluster in CloudFormation. As a workaround, we used to click this missing piece and document it with Cucumber, as explained above.

A short look at the Ansible documentation reveals that there is currently no support for ElastiCache replica groups in Ansible either. But a bit of research shows that we can extend Ansible with custom modules.

So I started spiking my own Ansible module to handle ElastiCache replica groups, inspired by the existing “elasticache” module. This involved the following steps:

  1. Put the module under “library/”, e.g. elasticache_replication_group.py (I published the unfinished skeleton as a Gist for reference)
  2. Add an output to the existing CloudFormation stack that creates the ElastiCache cluster so it returns the ID(s) of the cache cluster(s); we need them to create the read replica group(s). Register the output of the cloudformation Ansible task:
---
  tasks:
    - name: webpool saturn
      cloudformation:
        ...
      register: webpool_cfn
  3. Extend the playbook to create the ElastiCache replica group by reusing the output of the cloudformation task:
    - name: ElastiCache replica groups
      elasticache_replication_group:
        state: "present"
        name: "saturn-dev-01n1"
        primary_cluster_id: "{{ webpool_cfn['stack_outputs']['WebcacheNode1Name'] }}"

Pretty awesome: Ansible works as a glue language while staying very readable. You can actually read through the playbook and get an idea of what’s going on.

Another great thing is that we can extend even core functionality of Ansible without any friction (such as waiting for upstream to accept a commit, building and deploying new packages, etc.), which should increase acceptance of the tool among coworkers even more.

This touches on another use case: the possibility to “chain” CloudFormation stacks with Ansible by reusing the outputs of one stack as parameters for other stacks. This is especially useful for splitting big monolithic stacks into smaller ones, which can then be managed and reused independently (separation of concerns).
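
To illustrate the chaining idea, here is a minimal sketch; the stack names, template file names, and the VpcID output/parameter key are made up for illustration. The second cloudformation task simply consumes a registered output of the first via template_parameters:

    - name: network stack, exposing e.g. the VPC ID as a stack output
      cloudformation:
        stack_name: "network-dev-eu-west-1"
        state: "present"
        region: "{{ aws_region }}"
        template: "network-dev-eu-west-1.json"
      register: network_cfn

    - name: application stack, reusing the network stack's output as a parameter
      cloudformation:
        stack_name: "app-dev-eu-west-1"
        state: "present"
        region: "{{ aws_region }}"
        template: "app-dev-eu-west-1.json"
        template_parameters:
          VpcID: "{{ network_cfn['stack_outputs']['VpcID'] }}"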

Last but not least, it is now easy to extend the Ansible playbook with post-processing tasks (remember the RDS/database example above).
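
As a minimal sketch of such post-processing (assuming the stack was registered as webpool_cfn like above and exposes its RDS endpoint via a hypothetical RdsEndpoint output, that the passwords are available as variables, and that the MySQLdb Python library is installed on the machine running the playbook), creating a database and user could be appended as ordinary tasks:

    - name: create the application database on the freshly created RDS instance
      mysql_db:
        name: "myapp"    # hypothetical database name
        state: present
        login_host: "{{ webpool_cfn['stack_outputs']['RdsEndpoint'] }}"
        login_user: "root"
        login_password: "{{ rds_master_password }}"

    - name: create the application database user
      mysql_user:
        name: "myapp"
        password: "{{ myapp_db_password }}"
        priv: "myapp.*:ALL"
        host: "%"
        state: present
        login_host: "{{ webpool_cfn['stack_outputs']['RdsEndpoint'] }}"
        login_user: "root"
        login_password: "{{ rds_master_password }}"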

Describing existing AWS resources as a “Stack”

As mentioned above, one issue with CloudFormation is that there is no way to import existing infrastructure into a stack. Luckily, Ansible supports most of the AWS functionality, so we can create a playbook to express existing infrastructure as code.

To discover the possibilities, I converted a fraction of our current production VPC/subnet setup into an Ansible playbook:

---
- hosts: localhost
  connection: local
  gather_facts: no
  vars:
    aws_region: eu-west-1
  tasks:
    - name: Main shared Jimdo VPC
      ec2_vpc:
        state: present
        cidr_block: 10.5.0.0/16
        resource_tags: {"jimdo:environment": "prod", "jimdo:role": "shared_network", "jimdo:owner": "unassigned"}
        region: "{{ aws_region }}"
        dns_hostnames: no
        dns_support: yes
        instance_tenancy: default
        internet_gateway: yes
        subnets:
          - cidr: 10.5.151.96/27
            az:  "{{ aws_region }}a"
            resource_tags: {"Name": "template-team private"}
          - cidr: 10.5.151.128/27
            az:  "{{ aws_region }}b"
            resource_tags: {"Name": "template-team private"}
          - cidr: 10.5.151.160/27
            az:  "{{ aws_region }}c"
            resource_tags: {"Name": "template-team private"}

As you can see, there is not even a hardcoded VPC ID! Ansible identifies the VPC by a Tag-CIDR tuple, which meets our initial requirement of “no hardcoded data”.

To stress this, I changed the aws_region variable to another AWS region, and it was possible to create the basic VPC setup there as well, which is another sign of a successful single source of truth.

Single source information

Now we want to reuse the information of the VPC which we just brought “under control” in the last example. Why should we do this? Well, in order to be fully automated (which is our goal), we cannot afford any hardcoded information.

Let’s start with the VPC ID, which should be one of the most requested IDs. Getting it is relatively easy because we can just extract it from the ec2_vpc module output and assign it as a variable with the set_fact Ansible module:

    - name: Assign main VPC ID
      set_fact:
        main_vpc_id: "{{ main_vpc['vpc_id'] }}"
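
This fact can now replace hardcoded values such as the "VpcID=vpc-123456" parameter from the CLI example at the beginning. Here is a minimal sketch, reusing the stack and the VpcID parameter name from the first playbook and assuming the jimdo_environment and aws_region variables are defined as there:

    - name: converge the webpool stack with the single-sourced VPC ID
      cloudformation:
        stack_name: "webpool-saturn-{{ jimdo_environment }}-{{ aws_region }}"
        state: "present"
        region: "{{ aws_region }}"
        template: "webpool-saturn-{{ jimdo_environment }}-{{ aws_region }}.json"
        template_parameters:
          VpcID: "{{ main_vpc_id }}"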

OK, but we also need to reuse the subnet information, and to avoid hardcoding we need to address the subnets without using their IDs. As we tagged the subnets above, we can use the tuple (Name tag, availability zone) to identify and group them.

With the awesome help of the folks in the #ansible IRC channel, I managed to extract a single subnet ID by tag and availability zone from the output:

    - name: Find the Template team private network subnet id in AZ 1a
      local_action:
        module: set_fact
        template_team_private_subnet_a: "{{ item.id }}"
      when: item['resource_tags']['Name'] == 'template-team private' and item['az'] == 'eu-west-1a'
      with_items: main_vpc['subnets']

While this satisfies the single source requirement, it doesn’t seem to scale very well with a bunch of subnets. Imagine you’d have to do this for each subnet (we already have more than 50 at Jimdo).

After some research I found out that it’s possible to add custom filters to Ansible, which allow data to be manipulated with Python code:

# Custom Ansible filter plugin: group subnets by their Name tag and availability zone
def subnets(raw_subnets):
    subnets = {}
    for raw_subnet in raw_subnets:
        subnet_identifier = raw_subnet['resource_tags']['Name']
        subnets.setdefault(subnet_identifier, {})
        # e.g. subnets['template-team private']['eu-west-1a'] = '10.5.151.96/27'
        subnets[subnet_identifier][raw_subnet['az']] = raw_subnet['cidr']
    return subnets


class FilterModule(object):
    def filters(self):
        return {"subnets": subnets}

We can now assign the subnets for later usage like this in Ansible:

    - name: Assign subnets for later usage
      set_fact:
        main_vpc_subnets: "{{ main_vpc['subnets']|subnets()}}"

This is a great way to prepare the subnets for later usage, e.g. in iterations, to create RDS or ElastiCache subnet groups. Actually, almost everything in a VPC needs subnet information.
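
To make the data shape concrete, this is what main_vpc_subnets looks like for the example VPC above (values taken from its subnet definitions), together with a debug task to inspect it:

    # Resulting structure:
    #   template-team private:
    #     eu-west-1a: 10.5.151.96/27
    #     eu-west-1b: 10.5.151.128/27
    #     eu-west-1c: 10.5.151.160/27
    - name: show the grouped subnets
      debug:
        var: main_vpc_subnets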

Those examples should be enough for now to give us confidence that Ansible is a great tool which fits our needs.

Takeaways

As of writing this, Ansible and CloudFormation seem to be a perfect fit for me. The combination turns out to be a solid solution to the following problems:

  • Single source of information / no hardcoded data
  • Combining documentation and “Infrastructure as Code”
  • Powerful wrapper around basic AWS CLI tooling
  • Inception point for other orchestration software (e.g. CloudFormation)
  • Works with existing AWS resources
  • Easy to extend (modules, filters, etc.: DSL weaknesses can be worked around by hooking in Python code)

Next steps / Vision

After spiking the solution, I could imagine the following next steps for us:

  • Write playbooks for all existing stacks and generalize by extracting common concepts (e.g. common tags)
  • Transform all the tests in Cucumber to Ansible playbooks in order to have a single source
  • Remove hardcoded IDs from existing CloudFormation stacks by parameterizing them via Ansible.
  • Remove AWS Console (write) access to our Production AWS account in order to enforce the “Infrastructure as Code” paradigm
  • Bring more clicked infrastructure / ecosystem under IaC-control by writing more Ansible modules (e.g. GitHub Teams and Users, Fastly services, Heroku Apps, Pingdom checks)
  • Spin up the VPC, including some services, in another region in order to prove we are fully single-sourced (e.g. no hardcoded IDs) and automated
  • Try out Ansible Tower for:
      • regular convergence runs in order to avoid configuration drift and maybe even revert clicked settings (similar to the “Simian Army” approach)
      • a “single source of infrastructure updates”
  • Establish practices like Game Days to actually test disaster recovery scenarios

I hope this blog post has brought some new thoughts and inspirations to the readers. Happy holidays!
