andaryjo/azure-tales-vmss-lb-rules.md

## azure-tales-vmss-lb-rules.md

      
Azure Tales: The Scale Set that cares too much about Load Balancer rules

I've seen the error CannotRemoveRuleUsedByProbeUsedByVMSS quite a few times now in my Terraform logs, but I never got around to caring enough to actually look into it. Instead, I acknowledged that some dependencies were probably not set up right, shrugged it off, and nuked and rebuilt the whole infrastructure, because that was way faster than diving into yet another Azure Terraform problem. It was a problem for the future. Well, the future is now.
Some background on what we're dealing with here. We operate a Virtual Machine Scale Set that runs a mission-critical ingress for our large-scale IT platform (we had used Azure Application Gateway before, but its limitations made us decide to build something of our own instead). In front of that Scale Set is an Azure Load Balancer, with Load Balancing rules configured for every protocol and port we have a backend service for. The list of backend services in our ingress changes quite often, which is why we also need to update the Load Balancing rules quite often (but their code gets rendered automatically, so that's not a problem).

The problem

The error in question now occurs when we are trying to delete a Load Balancing rule:
Failure sending request: StatusCode=400 -- Original Error: Code="CannotRemoveRuleUsedByProbeUsedByVMSS" Message="Load balancer rule [redacted]/rule-1 cannot be removed because the rule references the load balancer probe [redacted]/probe-health that is used as health probe by VM scale set [redacted]/vmss-ingress. To remove this rule, please update VM scale set to remove the reference to the probe.

The error message states that the rule cannot be deleted because the health probe it references is used by the Scale Set. That sounds weird, so let's break it down.
We have a health probe within the load balancer that looks like this:
resource "azurerm_lb_probe" "health" {
  loadbalancer_id = azurerm_lb.ingress.id
  name            = "probe-health"
  protocol        = "Http"
  port            = 1337
  request_path    = "/healthz"
}
We are referencing this health probe within the Load Balancing rules, so that the Load Balancer can determine whether some or all backends are unhealthy and distribute / stop traffic accordingly:
resource "azurerm_lb_rule" "rule_1" {
  loadbalancer_id                = azurerm_lb.ingress.id
  name                           = "rule-1"
  protocol                       = "Tcp"
  frontend_port                  = 80
  backend_port                   = 80
  frontend_ip_configuration_name = "frontend"
  backend_address_pool_ids       = [azurerm_lb_backend_address_pool.ingress.id]
  probe_id                       = azurerm_lb_probe.health.id
}
But we are also using this probe for our Scale Set. This way it can use the health status of its instances for rolling upgrade policies, or simply to display the health status in the Azure Portal. This is also important if you want to create Application health metrics / alerts with Azure Monitor. This alone is a weird concept to me (why does the Scale Set reuse probes from another resource?), but hold on, it will get much weirder.
resource "azurerm_linux_virtual_machine_scale_set" "ingress" {
  [...]
  health_probe_id = azurerm_lb_probe.health.id
  [...]
}
This makes the Terraform dependency tree quite clear:

azurerm_lb_rule depends on azurerm_lb_probe and
azurerm_linux_virtual_machine_scale_set depends on azurerm_lb_probe

But why in the world would we not be able to delete a rule which simply references a health probe and has no other obvious dependencies?

What's going on here?

Doing what the error message says and removing the health probe reference from the Scale Set does allow us to delete the Load Balancing rule. But why is that? What could be going on with these health probes that makes the Scale Set prohibit deleting Load Balancing rules that reference the same probe?
Looking into the JSON view of the Load Balancer resource shows that loadBalancingRules actually is a property of the resource type Microsoft.Network/loadBalancers/probes:
{
    "name": "probe-health",
    "id": "[redacted]",
    "etag": "[redacted]",
    "properties": {
        "provisioningState": "Succeeded",
        "protocol": "Http",
        "port": 1337,
        "requestPath": "/healthz",
        "intervalInSeconds": 15,
        "numberOfProbes": 2,
        "loadBalancingRules": [
            {
                "id": "[redacted]/rule-1"
            },
            {
                "id": "[redacted]/rule-2"
            },
            {
                "id": "[redacted]/rule-3"
            }
        ]
    },
    "type": "Microsoft.Network/loadBalancers/probes"
}
So any removal of a Load Balancing rule would be reflected here as well. Maybe changes to the probe resource are not allowed as long as it gets used by the Scale Set? A quick test confirms this:
Failure sending request: StatusCode=400 -- Original Error: Code="CannotModifyProbeUsedByVMSS" Message="Load balancer probe [redacted]/probe-health cannot be modified because it is used as health probe by VM scale set [redacted]/vmss-ingress. To make changes to this probe, please update VM scale set to remove the reference to the probe."

Weirdly enough, creating new rules referencing the probe is fine, even though they get added to the loadBalancingRules array of the probe. It's apparently only the removal from that array that's not allowed.

What to do now?

Note that this is not a limitation of the Azure Terraform Provider. You'll see the same behavior in the Azure Portal or the Azure CLI.
There are a few GitHub issues about this (1, 2, 3), but the only solution proposed by HashiCorp / Microsoft engineers is to use the depends_on meta-argument in your Terraform Scale Set resource to make it depend on all the Load Balancing rules. This would work, but it would always destroy and recreate the whole Scale Set for a rule removal, resulting in downtime. This is obviously not a feasible solution for our critical infrastructure.
The other solution would be to do what we just did: remove the health probe reference from the Scale Set and then add it back after the rule has been deleted. But this always requires manual intervention (or a really hacky local-exec Azure CLI provisioner Terraform clusterfuck) and is not feasible either.
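For reference, the depends_on approach suggested in those issues would look roughly like the following. This is a sketch rather than the exact code from the issues: only rule_1 appears in the config above, and any further entries would be whatever names your rule rendering produces.

```hcl
resource "azurerm_linux_virtual_machine_scale_set" "ingress" {
  # ... other Scale Set arguments omitted ...

  health_probe_id = azurerm_lb_probe.health.id

  # Explicit dependency on every Load Balancing rule that references the probe.
  # Only rule_1 is taken from the config above; add one entry per rendered rule.
  depends_on = [
    azurerm_lb_rule.rule_1,
  ]
}
```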
But could we not simply give the Scale Set its own health probe so that it does not care about the Load Balancing rules anymore? Unfortunately not:
Failure sending request: StatusCode=400 -- Original Error: Code="CannotUseInactiveHealthProbe" Message="VM scale set [redacted]/vmss-ingress cannot use probe [redacted]/probe-dummy as a HealthProbe because load balancing rules ([redacted]) that send traffic to the scale set IPs in backend address pools ([redacted]) do not use this probe." Details=[]

That is so weird. Why does the Scale Set care at all about these rules? I can't come up with any reasonable explanation for this. If you do have any, please let me know.
But the attentive reader will already know what comes next. The new probe needs a Load Balancing rule, so let's give it one:
resource "azurerm_lb_rule" "dummy" {
  loadbalancer_id                = azurerm_lb.ingress.id
  name                           = "rule-dummy"
  protocol                       = "Tcp"
  frontend_port                  = 1337
  backend_port                   = 1337
  frontend_ip_configuration_name = "frontend"
  backend_address_pool_ids       = [azurerm_lb_backend_address_pool.ingress.id]
  probe_id                       = azurerm_lb_probe.dummy.id
}
And this finally works. The Scale Set now uses a dedicated health probe for application health monitoring, and we can safely remove our Load Balancing rules from the Load Balancer without destroying the Scale Set or intervening manually. Don't forget to use the depends_on meta-argument to create a dependency between the Scale Set and the dummy rule in case you actually want to destroy your resources.
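Put together, the pieces around the dummy rule might look like this. It is only a sketch: the probe settings are an assumption (copied from probe-health above), since all the original config gives away is the probe's name, probe-dummy, and its Terraform address, azurerm_lb_probe.dummy.

```hcl
# Dedicated probe that only the Scale Set and the dummy rule reference.
# Protocol, port and path are assumed to mirror probe-health.
resource "azurerm_lb_probe" "dummy" {
  loadbalancer_id = azurerm_lb.ingress.id
  name            = "probe-dummy"
  protocol        = "Http"
  port            = 1337
  request_path    = "/healthz"
}

resource "azurerm_linux_virtual_machine_scale_set" "ingress" {
  # ... other Scale Set arguments omitted ...

  # Point the Scale Set at the dedicated probe instead of probe-health.
  health_probe_id = azurerm_lb_probe.dummy.id

  # Create the dummy rule before the Scale Set references the probe,
  # and destroy it only after the Scale Set is gone.
  depends_on = [azurerm_lb_rule.dummy]
}
```

The real rules keep pointing at probe-health, so removing one of them no longer touches a probe that the Scale Set references.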
Your security guys might not like this solution, because it opens another frontend port to the Internet on the Load Balancer, but the traffic won't end up anywhere as long as you don't have any corresponding Network Security Group rules in place for your vnet (at least for Public IP addresses with Basic SKU - relevant Azure Shit).
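If you'd rather not rely on the mere absence of an allow rule, an explicit deny rule for the dummy port is one option. A minimal sketch, assuming a Network Security Group and a resource group are managed in the same Terraform config; azurerm_network_security_group.ingress, azurerm_resource_group.ingress and the priority value are hypothetical and not part of the original setup.

```hcl
# Explicitly deny Internet traffic to the dummy rule's backend port.
# The NSG and resource group references are hypothetical names.
resource "azurerm_network_security_rule" "deny_dummy_port" {
  name                        = "deny-internet-dummy-probe-port"
  priority                    = 200 # assumed; must be a smaller number than any allow rule matching this port
  direction                   = "Inbound"
  access                      = "Deny"
  protocol                    = "Tcp"
  source_port_range           = "*"
  destination_port_range      = "1337"
  source_address_prefix       = "Internet"
  destination_address_prefix  = "*"
  resource_group_name         = azurerm_resource_group.ingress.name
  network_security_group_name = azurerm_network_security_group.ingress.name
}
```

As far as I understand it, the Load Balancer's own probe traffic arrives under the AzureLoadBalancer service tag rather than Internet, so a rule like this should not break the health probe, but verify that in your environment before relying on it.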
Conclusion

The mystery of why the Scale Set prohibits most changes to the referenced health probe remains largely unsolved, but at least we now have yet another hacky workaround to add to our Azure Terraform hell.