
@vkz - Last active May 31, 2023 08:58

Curling behind the enemy lines: debugging web-resource connectivity - AWS VPC edition

Tailscaling git.ht: the game of curling

We continue our tailscaling git.ht chronicle with this follow-up hoot. I'll leave the philosophising for later posts to keep this one short and sweet. Let us briefly address Q1 of the homework I left for the curious reader.

To recap: we have a web server deployed on-prem. We sprinkle a tiny bit of tailscale magic to make an AWS Application Load Balancer route requests to that on-prem resource as if it were a normal load-balancing target, which really it is. And when I say "server", what I mean, obviously, is an old laptop on my desk, duh. If you want to try it at home, go back to the previous episode in the series. Towards the end I left you with a couple of questions, so today we'll address the first one:

Are we done yet? Will this setup work as expected? Let me predict that if you were to set this up with an out-of-the-box ALB in London, you'd see requests time out exactly 2 out of 3 times. Now, why do you think that would be? Experiment. See if you can figure it out.

Preamble

Both this hoot and the previous installment are meant to be hands-on, especially if you're not a sysadmin and, like me, have had limited exposure to infrastructure issues, networking in particular. If nothing else this has been an illuminating journey - one that taught me more about AWS and networking than any documentation or textbook ever could. So, yeah, have no fear - go ahead and pointy-click yourself into a corner - that's how we learn best.

The setup

If you need a refresher on load balancing in AWS, the official how elastic load balancing works guide is quite good and I wholeheartedly recommend it. The first time you read it you'll probably overlook that ELB requires at least two nodes, but ideally as many as there are availability zones in your region. Availability zones are a fancy name for datacenters that sit in the same geographical region: if one burns down the others will, hopefully, continue to function. London has 3, so let's go with that. What this means is that our ELB is in fact 3 EC2 instances doing the load balancing. Let me just pause for a second, 'cause you need to take this in. When you hear or read "Application Load Balancer", do you immediately think that this is really multiple EC2 instances, i.e. multiple machines? Me neither. Have I mentioned how AWS is nothing but EC2 instances with lots of wires?
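
If you'd rather verify than take my word for it, the AWS CLI will happily list the zones (and hence the nodes) behind your load balancer. A quick sketch, assuming your ALB is named git-ht-alb (the name is made up):

# List the availability zones the ALB is deployed into - one load-balancer node per zone.
aws elbv2 describe-load-balancers \
  --names git-ht-alb \
  --query 'LoadBalancers[0].AvailabilityZones[].ZoneName'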

Our setup:

  1. ALB in eu-west-2 (London) spanning 3 availability zones, each running an instance of our ALB.
  2. Web server serving our website on-prem, accessible via its Tailnet IP address.
  3. Each availability zone runs a tailscale relay nano-instance, so that the ALB can reach our on-prem server via the tailnet.
  4. Each subnet (each availability zone gets its own subnet by default) has a routing table (defined via the VPC console) that routes the Tailnet range 100.64.0.0/10 to the tailscale relay instance deployed in that subnet.
  5. Load-balancing target group with our tailnet on-prem IP address as the target, allowing traffic from the ALB's security group (a rough CLI sketch follows this list).
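
In case item 5 sounds abstract, here is roughly what it amounts to with the AWS CLI. Treat it as a sketch, not gospel: the target group name, vpc-id, ARN placeholder and the 100.x address are all made up, and the protocol and port should match whatever your web server actually speaks.

# Target group whose targets are IP addresses rather than EC2 instances.
# The CGNAT range 100.64.0.0/10 that tailscale hands out is accepted for IP targets.
aws elbv2 create-target-group \
  --name on-prem-web \
  --protocol HTTP --port 80 \
  --vpc-id vpc-0123456789abcdef0 \
  --target-type ip

# Register the on-prem laptop by its tailnet address.
# AvailabilityZone=all marks a target that lives outside the VPC CIDR.
aws elbv2 register-targets \
  --target-group-arn <target-group-arn> \
  --targets Id=100.101.102.103,AvailabilityZone=all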

The why of the setup

Why do we need 3 and 4, when by default all three subnets have a default route that lets them reach each other? Because AWS VPC routing tables won't let you use an IP address as a route target. Let's be generous and assume it was an honest omission - not deliberate sneakiness on their part to force you into buying more AWS magic. They only let you specify an instance or a network interface, but either of these ends up pointing to a specific network interface, which can only exist in one subnet. This is the only reason we need 3 tailscale relays - one per subnet - and separate tables routing to our tailnet. How annoying.
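
To make items 3 and 4 concrete, per subnet they boil down to something like the following. IDs are placeholders, and the relay instance is assumed to already be on the tailnet and forwarding traffic:

# Route the tailscale CGNAT range to the relay's network interface in this subnet.
aws ec2 create-route \
  --route-table-id rtb-0123456789abcdef0 \
  --destination-cidr-block 100.64.0.0/10 \
  --network-interface-id eni-0123456789abcdef0

# The relay forwards packets that aren't addressed to it, so EC2's
# source/destination check must be disabled on that instance.
aws ec2 modify-instance-attribute \
  --instance-id i-0123456789abcdef0 \
  --no-source-dest-check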

Game on

In the following, replace example.com with a domain you control and can experiment with. Assuming the above setup was performed correctly, it should just work, e.g. in my case curl -IkL example.com succeeds no matter how many times we run it. However, note that because we load-balance across 3 availability zones, what this really means is that we have 3 routes - 3 ways for a request to get to our on-prem resource. We can easily confirm that by checking:

dig example.com

# Which should respond with exactly 3 IP addresses (one per availability zone - which is to say one per datacenter location)

;; ANSWER SECTION:
example.com.		60	IN	A	18.134.50.86
example.com.		60	IN	A	3.11.23.187
example.com.		60	IN	A	35.176.164.170

If there is a problem with a single route, we'd expect the curl -IkL example.com above to succeed exactly twice and fail exactly once when we run it 3 times in a row. Precisely because ALB defaults to round-robin routing: it simply cycles through all 3 routes (or IP addresses, or availability zones). See how there's no magic behind it. Sweet.
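
If you want to watch the round robin in action, a throwaway shell loop is enough - swap in your own domain; --max-time keeps a dead route from hanging the loop:

# Fire six requests and print the HTTP status of each; a broken availability
# zone shows up as 000 (timeout) roughly every third request.
for i in $(seq 6); do
  curl -IkL -s -o /dev/null --max-time 5 -w "%{http_code}\n" https://example.com
done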

We can now answer our homework question. In my first hoot about tailscaling git.ht I had you deploy only one tailscale relay in a single availability zone. With round-robin load balancing and no way to reach our tailnet from the remaining two availability zones, two out of every three ALB requests would time out.

Game of curling

None of the above was obvious when I did it. However, once you see the above output from dig, a reasonable idea comes to mind: you probably want to investigate each route separately. But there is only so much control you can exert over an ALB. Obviously, our load balancer needs rules for ports 80 and 443 that match our host example.com (and possibly its subdomains) and forward HTTP requests over to our webserver - the on-prem resource in our case.
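
For reference, a host-header rule on the HTTPS listener looks roughly like this with the AWS CLI - the ARNs and the priority are placeholders, and you'd want an equivalent rule (or a redirect) on the port 80 listener:

# Match the Host header and forward to the target group from the setup above.
aws elbv2 create-rule \
  --listener-arn <https-listener-arn> \
  --priority 10 \
  --conditions '[{"Field":"host-header","HostHeaderConfig":{"Values":["example.com","*.example.com"]}}]' \
  --actions Type=forward,TargetGroupArn=<target-group-arn>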

How do we investigate a case of an unavailable connection, which with ALB typically manifests as a 503 or 504 HTTP error? Skimming the official how to troubleshoot 503 AWS post may be worth your time, but if the issue is connectivity, firewalls and permissions, then our main tool is curl --resolve (man curl on your box may be a great start).

Essentially, by supplying --resolve to curl we can force it to resolve our web resource to a specific IP, so e.g. for our 3 routes above we can force the request to take each route in turn:

curl -IkL --resolve '*:443:35.176.164.170' https://example.com
curl -IkL --resolve '*:443:3.11.23.187' https://example.com
curl -IkL --resolve '*:443:18.134.50.86' https://example.com

Notice how we resolved against each load-balancer address serving us, effectively diverting our request to each of our 3 availability zones.

The very same technique can be used to see if routing within your subnet or between subnets actually works as you expect. What you need is to spawn a tiny EC2 instance inside the subnet (or its neighbour in another availability zone), ssh in, then use the --resolve trick above. ping, traceroute, etc. are all there of course, but remember that AWS VPC defaults to closing everything, so even hosts in the same subnet won't be able to talk to one another (even the ICMP traffic required for ping and traceroute won't pass): you'll want to tweak your instances' security groups - most likely their respective INBOUND rules - to allow such traffic from other hosts of interest.
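
As a sketch of what tweaking those security groups means in practice - the group IDs below are placeholders, and you could just as well allow a CIDR instead of a source security group:

# On the host we're poking at: allow inbound ICMP (ping, traceroute) and SSH
# from the debugging instance's security group.
aws ec2 authorize-security-group-ingress \
  --group-id sg-0aaaa1111bbbb2222 \
  --protocol icmp --port -1 \
  --source-group sg-0dddd3333eeee4444

aws ec2 authorize-security-group-ingress \
  --group-id sg-0aaaa1111bbbb2222 \
  --protocol tcp --port 22 \
  --source-group sg-0dddd3333eeee4444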

Coda

Did you know I'm running a (mostly) Clojure SWAT team in London at fullmeta.co.uk? We are always on the lookout for interesting contracts. We have two exceptional engineers available as I type these words. Now that git.ht is up, which you should go ahead and check out right now, I'll be going back to CTOing and contracting, which makes me available for hire, too. Get in touch directly or find us on LinkedIn. We aren't cheap, but we're very good. We can also perform competently in modern Java, ES6/TS, Go, C#, etc.

Any comments?

As always, this hoot is nothing more than a GitHub gist. Please use its comments section if you have something to say. Till next time.


dne commented May 30, 2023

It's definitely not required to have one relay per subnet – the target instance/interface for a route table entry can be in a different subnet.

E.g. I've used a single tailnet relay for a VPC that was shared between 4 separate AWS accounts, routing between multiple subnets/AZs/accounts and the tailnet.


vkz commented May 31, 2023

That was certainly my expectation going in. I spent a good hour banging my head on the table, 'cause no matter what I tried it didn't work. IIRC the problem isn't with routing in general, i.e. when you have a bunch of instances you control, possibly in different subnets and AZs, but with routing traffic for the ALB specifically, over which you have only limited control - certainly not over how it routes. That's why IMO calling it an Application Load Balancer (singular) is misleading - it really is as many instances as you have availability zones and you control none of those machines. Whatever firewall rules they have in place seem to limit their traffic to the subnet they are in, i.e. changing the VPC routing table won't affect that. Could be an oversight or a deliberate architecture decision. The end result is the same: you're forced to have a relay in each subnet your ALB has an instance in.

I am more than happy to be overruled on this. Could you provide a recipe for this use case?

PS this limitation is another reason why I'd strongly consider running my own proxy in place of ALB.


dne commented May 31, 2023

I see, of course it might well be an ALB limitation. Could you just deploy ALB in a single AZ, and not all of them? The relay instance would be a SPOF anyway…

In my case the relay was used for access to EC2/RDS instances, AmazonMQ etc.; never tried it with ALB as in your example.


vkz commented May 31, 2023

Alas, ALB requires a minimum of 2 AZs - that's a hard AWS requirement. I actually attempted to break it and was overruled: you won't be able to deploy without at least 2 subnets. I mean, it kinda makes sense for a higher-availability scenario, i.e. one data center going up in flames, but it certainly got in my way.
