smoy/2023-05-04-thoughts-on-iambic-and-terraform.md

## 2023-05-04-thoughts-on-iambic-and-terraform.md

      
Preface: I am a developer on the IAMbic repo, so my perspective is admittedly biased.
Developers have individual preferences. My response discusses the design
trade-offs behind our decision to invest in IAMbic.

“possible with Terraform (albeit an extra step or utilizing something like terragrunt).”

During the design phase, we considered writing a transpiler from YAML to HCL2 to
stay Terraform-compatible. However, Terraform’s constraints and our own Terraform experience at
previous companies steered us away from that option. Here are a few reasons why:


The Terraform AWS provider is oriented around a single AWS account. Native multi-account
support is a high design priority for security practitioners in larger AWS environments.


A single public cloud account for the entire company is not an acceptable risk:
developers want high velocity, and the security team needs to ensure developers can move fast in
dev without putting customer or company data at risk. The ratio of security staff to other
developers is roughly 1 to 50. At a previous startup (Series A), it was 1 to 35. At a public
company I worked at, it was 1 to 53. We established multi-account support as a pillar of our
design so that AWS accounts can serve as a security boundary.


Terragrunt is an improvement over Terraform. At a public company I worked at, that was
how we handled the multi-AWS-account situation. However, there are other challenges.
Terragrunt excels at keeping Terraform DRY (don’t repeat yourself), but the project is
limited by the Terraform ecosystem, meaning HCL and other Terraform constraints.
How to set up Terragrunt is not widely known across the infrastructure team.
In an infrastructure org with 50+ developers, we had fewer than 5 developers with great
fluency in HCL and in the execution flow of a terraform plan/apply.
Most of our other 600+ microservice developers who want to customize the IAM privileges of
their service roles or team roles do not want to learn HCL. The bottleneck this puts on the
security team, which has to use and expand the Terragrunt structure across our existing AWS
multi-account layout, is very real.


We studied the security team’s workflow and what it needs:

- A complete IaC workflow for defining and creating IAM definitions.
- Creating a new account should not require someone to extend the terragrunt.hcl definition to enroll the newly created account.
- The ability to deal with drift (between cloud and IaC).
- Ease of importing and detecting resources that are not currently defined in IaC.
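
To make the drift and detection requirements concrete, here is a minimal sketch of comparing what is declared in IaC with what exists in the cloud. It is not IAMbic’s implementation: the function name `classify_drift`, the role names, and the dictionary shapes are all hypothetical, and a real tool would fetch the live state from the AWS API instead of using hard-coded data.

```python
def classify_drift(declared, live):
    """Compare role definitions declared in IaC with roles found in the cloud.

    Both arguments map a role name to its definition (hypothetical shape).
    """
    declared_names = set(declared)
    live_names = set(live)
    return {
        # Exists in the cloud but nobody declared it in IaC.
        "unmanaged": sorted(live_names - declared_names),
        # Declared in IaC but not present in the cloud.
        "missing": sorted(declared_names - live_names),
        # Present on both sides, but the definitions differ.
        "drifted": sorted(
            name
            for name in declared_names & live_names
            if declared[name] != live[name]
        ),
    }


# Hypothetical example data.
declared = {
    "search_reader": {"managed_policies": ["ReadOnlyAccess"]},
    "search_writer": {"managed_policies": ["AmazonS3FullAccess"]},
}
live = {
    "search_reader": {"managed_policies": ["ReadOnlyAccess", "AdministratorAccess"]},
    "adhoc_debug_role": {"managed_policies": ["AdministratorAccess"]},
}

report = classify_drift(declared, live)
print(report)
```

The "unmanaged" bucket corresponds to the import/detection requirement, and the "drifted" bucket to the drift requirement.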

Our analysis:


A Terraform/Terragrunt setup is a by-product of the existing infrastructure.
It is difficult to replace but also difficult to extend. Let’s play out the scenario.
We build a Terraform module that understands the AWS Organizations API, and then
use Terragrunt to handle multiple AWS accounts. First question: how do we define
an IAM role that should only exist in certain accounts? We would have to do something like:

    resource "aws_iam_role" "test_role" {
      count = var.test_role_should_exist ? 1 : 0
      name  = "${var.account_name}_test_role"
      # ... aws_iam_role needs more configuration (e.g. assume_role_policy)
    }

The count logic is fine for making Terraform do what we want, but it places a high cognitive
burden on the microservice developer who just wants to say that test_role should be created in
account_search_team_dev, account_search_team_stage, and account_search_team_prod. The microservice
developer needs to know how to extend terragrunt.hcl to declare the new variable. In practice,
they will usually just wait for the security team to implement the HCL logic for them.


Terraform imports a single resource at a time, and it is a manual process. Using projects like
terraformer is difficult in practice: try getting your newly detected resources to play nicely
with your existing Terraform configuration.


Terraform configuration varies widely from one organization to the next. Every organization has to
customize how it uses Terraform. Any new tooling that relies on Terraform has a high
setup cost because of that customizability.


Learning HCL is not particularly fun. That’s also a reason we have decided, at this point, not
to invent yet another domain-specific language for IAM, because that would be another cost of
adopting the system. Nested in the IAMbic YAML structure is just vanilla AWS IAM permissions.


Whatever the solution, it should play nicely with an existing Terraform installation, because existing Terraform installations have inertia (they are hard to modify or replace).


“at scale these templates will be more cumbersome to use”

That depends on what kind of action you want to achieve. These are the scenarios
we have thought of so far:

Large numbers of AWS accounts in the organization. My colleagues have experience
with 1000+ AWS accounts at another large public company. We know that a template
structure cannot require folks to individually list every account that the resources
in the template (like IAM roles) should apply to. It needs some sort of regex-like matching,
and a future benefit would be matching on tags or other artifacts. We support patterns like dev_*
to indicate that the template should be applied to accounts matching the dev_ prefix. We also added
support for exclusion, so you can define pii_* to exclude
any accounts designated with the pii_ prefix. Scaling can go in many directions. I am curious to learn more about what scaling challenges you worry a solution like IAMbic would have.
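
As an illustration of the include/exclude matching described above, here is a minimal sketch using shell-style wildcards from Python’s standard `fnmatch` module. This is an assumption about how such matching could work, not IAMbic’s actual code; the function name `accounts_for_template` and the account names are hypothetical.

```python
from fnmatch import fnmatchcase


def accounts_for_template(accounts, included, excluded=()):
    """Return the accounts a template applies to.

    An account is selected when it matches at least one include pattern
    and no exclude pattern. Patterns use shell-style wildcards (e.g. dev_*).
    """
    selected = []
    for name in accounts:
        is_included = any(fnmatchcase(name, pattern) for pattern in included)
        is_excluded = any(fnmatchcase(name, pattern) for pattern in excluded)
        if is_included and not is_excluded:
            selected.append(name)
    return selected


# Hypothetical account names.
accounts = ["dev_search", "dev_billing", "stage_search", "pii_customer_data"]

# Apply a template only to accounts with the dev_ prefix.
dev_only = accounts_for_template(accounts, included=["dev_*"])

# Apply a template everywhere except accounts with the pii_ prefix.
all_but_pii = accounts_for_template(accounts, included=["*"], excluded=["pii_*"])

print(dev_only)
print(all_but_pii)
```

With this shape, adding the thousand-and-first account requires no template change as long as its name follows the existing prefix convention.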


“written in YAML which is another negative mark for me”

I have worked at a company with YAML configurations that ran 1000+ lines.
I can empathize with the sour taste that working in such large YAML files leaves
(especially in a world with k8s and its large YAML). The large-YAML problem would exist
if there is a lot of customization in a single template. Just like we don’t want to see 1000+
lines in a single Python file, eventually we will have to figure out how to include fragments
from a different file. We are not there yet. We have decided against yet another
domain-specific language for the moment because we have not identified a situation where
YAML is terrible at expressing the declarative state of an IAM definition.
If you have a DSL in mind, I’d like to learn more.

“Also, when using more granular permission sets, it seems far easier and more manageable to have it with your IaC code so you can directly reference resource ids in the policy.”

Non-predictable resource IDs are a big problem when the security team reviews IaC changes
to IAM policies. For example, I have crafted many data lake policies that govern
which S3 buckets are accessible by data analytics microservices. Microservice
developers will not inform the security team of the various name patterns under which they
write their data, meaning the security team needs to investigate every
resource declaration when reviewing an IAM policy pull request. The typical approach
is a namespace (or a prefix in the ARN). It’s very frustrating for security teams when microservice
developers keep their IAM IaC code in their microservice repo, because the security team has to hunt
down what suddenly created a new IAM role outside of the company-wide Terraform repo.
In fact, Terraform does not care about resources that are not declared in its state file,
which is one of the reasons we decided against using Terraform: the security team wants
to know about rogue IAM changes made out of band.
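
The namespace (ARN prefix) approach can be sketched as a simple review check that flags any resource in a policy document falling outside an agreed prefix. This is a hypothetical helper for illustration, not part of IAMbic or Terraform; the bucket names and the function name `resources_outside_namespace` are made up, and the check deliberately ignores details such as `NotResource` and wildcard expansion.

```python
def resources_outside_namespace(policy, allowed_prefixes):
    """Return resource ARNs in an IAM policy that match none of the allowed prefixes."""
    statements = policy.get("Statement", [])
    # IAM allows a single statement object instead of a list.
    if isinstance(statements, dict):
        statements = [statements]

    offending = []
    for statement in statements:
        resources = statement.get("Resource", [])
        # IAM allows a single resource string instead of a list.
        if isinstance(resources, str):
            resources = [resources]
        for arn in resources:
            if not any(arn.startswith(prefix) for prefix in allowed_prefixes):
                offending.append(arn)
    return offending


# Hypothetical policy submitted in a pull request.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": [
                "arn:aws:s3:::datalake-analytics-events/*",
                "arn:aws:s3:::tmp-scratch-bucket/*",
            ],
        }
    ],
}

flagged = resources_outside_namespace(policy, ["arn:aws:s3:::datalake-analytics-"])
print(flagged)
```

A reviewer then only needs to look closely at the flagged ARNs instead of investigating every resource declaration in the pull request.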