
@vkz - Last active May 31, 2023 08:58

Curling behind the enemy lines: debugging web-resource connectivity - AWS VPC edition

Tailscaling git.ht: the game of curling

We continue our tailscaling git.ht chronicle with this follow-up hoot. I'll leave the philosophising for later posts to keep this one short and sweet. Let us briefly address Q1 of the homework I left for the curious reader.

To recap: we have a web server deployed on-prem. We sprinkle a tiny bit of tailscale magic to make an AWS Application Load Balancer route requests to that on-prem resource as if it were a normal load-balancing target, which really it is. And when I say "server", what I mean, obviously, is an old laptop on my desk, duh. If you want to try it at home, go back to the previous episode in the series. Towards the end I left you with a couple of questions, so today we'll address the first one:

Are we done yet? Will this setup work as expected? Let me predict that if you were to set this up with an out-of-the-box ALB in London, you'd see requests time out exactly 2 out of 3 times. Now, why do you think that would be? Experiment. See if you can figure it out.

Preamble

Both this hoot and the previous installment are meant to be hands-on, especially if you're not a sysadmin and, like me, have had limited exposure to infrastructure issues, networking in particular. If nothing else this has been an illuminating journey - one that taught me more about AWS and networking than any documentation or textbook ever could. So, yeah, have no fear - go ahead and pointy-click yourself into a corner - that's how we learn best.

The setup

If you need a refresher on load balancing in AWS, the official how elastic load balancing works guide is quite good and I wholeheartedly recommend it. The first time you read it you'll probably overlook that ELB requires at least two nodes, but ideally as many as there are availability zones in your region. Availability zones are a fancy name for datacenters that sit in the same geographical region: if one burns down the others will, hopefully, continue to function. London has 3, so let's go with that. What this means is that our ELB is in fact 3 EC2 instances doing the load balancing. Let me just pause for a second, 'cause you need to take this in. When you hear or read "Application Load Balancer", do you immediately think that this is really multiple EC2 instances, i.e. multiple machines? Me neither. Have I mentioned how AWS is nothing but EC2 instances with lots of wires?
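
If you'd rather verify than take my word for it, the AWS CLI will happily list the zones (and hence the nodes) behind your load balancer. A quick sketch, assuming your ALB is named git-ht-alb (the name is made up):

# List the availability zones the ALB is deployed into - one load-balancer node per zone.
aws elbv2 describe-load-balancers \
  --names git-ht-alb \
  --query 'LoadBalancers[0].AvailabilityZones[].ZoneName'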

Our setup:

  1. ALB in eu-west-2 (London) spanning 3 availability zones, each running an instance of our ALB.
  2. Web server serving our website on-prem, accessible via its Tailnet IP address.
  3. Each availability zone runs a tailscale relay nano-instance, so that the ALB can reach our on-prem server via the tailnet.
  4. Each subnet (each availability zone gets its own subnet by default) has a routing table (defined via the VPC console) that routes the Tailnet range 100.64.0.0/10 to the tailscale relay instance deployed in that subnet.
  5. Load-balancing target group with our tailnet on-prem IP address as the target, allowing traffic from the ALB's security group (a rough CLI sketch follows this list).
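
In case item 5 sounds abstract, here is roughly what it amounts to with the AWS CLI. Treat it as a sketch, not gospel: the target group name, vpc-id, ARN placeholder and the 100.x address are all made up, and the protocol and port should match whatever your web server actually speaks.

# Target group whose targets are IP addresses rather than EC2 instances.
# The CGNAT range 100.64.0.0/10 that tailscale hands out is accepted for IP targets.
aws elbv2 create-target-group \
  --name on-prem-web \
  --protocol HTTP --port 80 \
  --vpc-id vpc-0123456789abcdef0 \
  --target-type ip

# Register the on-prem laptop by its tailnet address.
# AvailabilityZone=all marks a target that lives outside the VPC CIDR.
aws elbv2 register-targets \
  --target-group-arn <target-group-arn> \
  --targets Id=100.101.102.103,AvailabilityZone=all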

The why of the setup

Why do we need 3 and 4, when by default all three subnets have a default route that lets them reach each other? Because AWS VPC routing tables won't let you use an IP address as a route target. Let's be generous and assume it was an honest omission - not deliberate sneakiness on their part to force you into buying more AWS magic. They only let you specify an instance or a network interface, but either of these ends up pointing to a specific network interface, which can only exist in one subnet. This is the only reason we need 3 tailscale relays - one per subnet - and separate tables routing to our tailnet. How annoying.
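
To make items 3 and 4 concrete, per subnet they boil down to something like the following. IDs are placeholders, and the relay instance is assumed to already be on the tailnet and forwarding traffic:

# Route the tailscale CGNAT range to the relay's network interface in this subnet.
aws ec2 create-route \
  --route-table-id rtb-0123456789abcdef0 \
  --destination-cidr-block 100.64.0.0/10 \
  --network-interface-id eni-0123456789abcdef0

# The relay forwards packets that aren't addressed to it, so EC2's
# source/destination check must be disabled on that instance.
aws ec2 modify-instance-attribute \
  --instance-id i-0123456789abcdef0 \
  --no-source-dest-check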

Game on

In the following, replace example.com with a domain you control and can experiment with. Assuming the above setup was performed correctly, it should just work, e.g. in my case curl -IkL example.com succeeds no matter how many times we run it. However, note that because we load-balance across 3 availability zones, what this really means is that we have 3 routes - 3 ways for a request to get to our on-prem resource. We can easily confirm that by checking:

dig example.com

# Which should respond with exactly 3 IP addresses (one per availability zone - which is to say one per datacenter location)

;; ANSWER SECTION:
example.com.		60	IN	A	18.134.50.86
example.com.		60	IN	A	3.11.23.187
example.com.		60	IN	A	35.176.164.170

If there is a problem with a single route, we'd expect the curl -IkL example.com above to succeed exactly twice and fail exactly once when we run it 3 times in a row. Precisely because ALB defaults to round-robin routing: it simply cycles through all 3 routes (or IP addresses, or availability zones). See how there's no magic behind it. Sweet.
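
If you want to watch the round robin in action, a throwaway shell loop is enough - swap in your own domain; --max-time keeps a dead route from hanging the loop:

# Fire six requests and print the HTTP status of each; a broken availability
# zone shows up as 000 (timeout) roughly every third request.
for i in $(seq 6); do
  curl -IkL -s -o /dev/null --max-time 5 -w "%{http_code}\n" https://example.com
done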

We can now answer our homework question. In my first hoot about tailscaling git.ht I had you deploy only one tailscale relay in a single availability zone. With round-robin load balancing and no way to reach our tailnet from the remaining two availability zones, two out of every three ALB requests would time out.

Game of curling

None of the above was obvious when I did it. However, once you see the above output from dig, a reasonable idea comes to mind: you probably want to investigate each route separately. But there is only so much control you can exert over an ALB. Obviously, our load balancer needs rules for ports 80 and 443 that match our host example.com (and possibly its subdomains) and forward HTTP requests over to our webserver - the on-prem resource in our case.
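
For reference, a host-header rule on the HTTPS listener looks roughly like this with the AWS CLI - the ARNs and the priority are placeholders, and you'd want an equivalent rule (or a redirect) on the port 80 listener:

# Match the Host header and forward to the target group from the setup above.
aws elbv2 create-rule \
  --listener-arn <https-listener-arn> \
  --priority 10 \
  --conditions '[{"Field":"host-header","HostHeaderConfig":{"Values":["example.com","*.example.com"]}}]' \
  --actions Type=forward,TargetGroupArn=<target-group-arn>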

How do we investigate a case of an unavailable connection, which with ALB typically manifests as a 503 or 504 HTTP error? Skimming the official how to troubleshoot 503 AWS post may be worth your time, but if the issue is connectivity, firewalls and permissions, then our main tool is curl --resolve (man curl on your box may be a great start).

Essentially, by supplying --resolve to curl we can force it to resolve our web resource to a specific IP, so e.g. for our 3 routes above we can force the request to take each route in turn:

curl -IkL --resolve '*:443:35.176.164.170' https://example.com
curl -IkL --resolve '*:443:3.11.23.187' https://example.com
curl -IkL --resolve '*:443:18.134.50.86' https://example.com

Notice how we resolved against each load-balancer address serving us, effectively diverting our request to each of our 3 availability zones.

The very same technique can be used to see if routing within your subnet or between subnets actually works as you expect. What you need is to spawn a tiny EC2 instance inside the subnet (or its neighbour in another availability zone), ssh in, then use the --resolve trick above. ping, traceroute, etc. are all there of course, but remember that AWS VPC defaults to closing everything, so even hosts in the same subnet won't be able to talk to one another (even the ICMP traffic required for ping and traceroute won't pass): you'll want to tweak your instances' security groups - most likely their respective INBOUND rules - to allow such traffic from other hosts of interest.
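
As a sketch of what tweaking those security groups means in practice - the group IDs below are placeholders, and you could just as well allow a CIDR instead of a source security group:

# On the host we're poking at: allow inbound ICMP (ping, traceroute) and SSH
# from the debugging instance's security group.
aws ec2 authorize-security-group-ingress \
  --group-id sg-0aaaa1111bbbb2222 \
  --protocol icmp --port -1 \
  --source-group sg-0dddd3333eeee4444

aws ec2 authorize-security-group-ingress \
  --group-id sg-0aaaa1111bbbb2222 \
  --protocol tcp --port 22 \
  --source-group sg-0dddd3333eeee4444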

Coda

Did you know I'm running a (mostly) Clojure SWAT team in London at fullmeta.co.uk? We are always on the lookout for interesting contracts. We have two exceptional engineers available as I type these words. Now that git.ht is up, which you should go ahead and check out right now, I'll be going back to CTOing and contracting, which makes me available for hire, too. Get in touch directly or find us on LinkedIn. We aren't cheap, but we're very good. We can also perform competently in modern Java, ES6/TS, Go, C#, etc.

Any comments?

As always, this hoot is nothing more than a GitHub gist. Please use its comments section if you have something to say. Till next time.


dne commented May 30, 2023

It's definitely not required to have one relay per subnet – the target instance/interface for a route table entry can be in a different subnet.

E.g. I've used a single tailnet relay for a VPC that was shared between 4 separate AWS accounts, routing between multiple subnets/AZs/accounts and the tailnet.


vkz commented May 31, 2023

That was certainly my expectation going in. I spent a good hour banging my head on the table, 'cause no matter what I tried it didn't work. IIRC the problem isn't with routing in general, i.e. when you have a bunch of instances you control, possibly in different subnets and AZs, but with routing traffic for the ALB specifically, over which you have only limited control - certainly not over how it routes. That's why IMO calling it an Application Load Balancer (singular) is misleading - it really is as many instances as you have availability zones and you control none of those machines. Whatever firewall rules they have in place seem to limit their traffic to the subnet they are in, i.e. changing the VPC routing table won't affect that. Could be an oversight or a deliberate architecture decision. The end result is the same: you're forced to have a relay in each subnet your ALB has an instance in.

I am more than happy to be overruled on this. Could you provide a recipe for this use case?

PS this limitation is another reason why I'd strongly consider running my own proxy in place of ALB.


dne commented May 31, 2023

I see, of course it might well be an ALB limitation. Could you just deploy ALB in a single AZ, and not all of them? The relay instance would be a SPOF anyway…

In my case the relay was used for access to EC2/RDS instances, AmazonMQ etc.; never tried it with ALB as in your example.


vkz commented May 31, 2023

Alas, ALB requires a minimum of 2 AZs - that's a hard AWS requirement. I actually attempted to break it and was overruled: you won't be able to deploy without at least 2 subnets. I mean, it kinda makes sense for a higher-availability scenario, i.e. one data center going up in flames, but it certainly got in my way.
