This hoot you're about to read started its life as a gist of technical minutiae I had to discover to Tailscale cloud and on-prem resources together. It'd be too boring for me to just publish it, so I thought I'd give a little bit of context as a preamble. I get to rant some and you can simply skip to the technical sections below if you'd rather not indulge me.
How much of what AWS gives you do you really, truly need? It's become Kubernetes this, Docker that. I rarely question it when contracting because contractors usually arrive after the fact to pick up and carry the pieces. You get to target infrastructure that may as well have been installed by aliens before humankind ever ascended to the Amazon cloud. All in the cloud, can't touch it, jumpbox this, terraform that, re-create the world every time you sneeze, etc. It's different when you're hired for a greenfield project and get to call the infra shots. You don't get to be too creative with it cause deadlines, and chances are you'd recreate half of Heroku badly. You must decide, so you settle on something of a middle ground, unless overruled by colleagues trying to beef up their resumes with the latest buzzwords. With the exception of, in my case, Yandex, where we really had scale, I am yet to meet a business that could outgrow something like AWS Beanstalk or whatever Heroku clone Google and Microsoft offer these days.
Long story short, the initial prototype of git.ht used AWS Beanstalk. It was ok. Except there'd always be a kind of impedance mismatch between development and deployment. Put differently - the extent to which you can exercise control in the most straightforward fashion. What I have in mind is the kind of live development you'd expect programming Clojure. To that end you kinda want to have as much control over your infrastructure pieces as you can get, and you want it live and you want fewer pieces. Think exposing REPLs and sockets, ssh everywhere without passwords, executing commands remotely right from Emacs, connecting to remote Emacs sessions if needed, resolving everything by name, picking and choosing where your load-balancer directs traffic on the fly without ever touching its rules, programming and testing your caching layer like CloudFront without redeploying anything, etc. I'll leave the setup I arrived at for later posts, cause it'd run long and would have to span a whole range of topics of product development. It ought to be holistic. Let's just say git.ht is minimalistic and had to reinvent a bunch of things its own way simply because I'm the only engineer on the project, so there's only so much grey matter and time I can bring to bear. No one optimizes for the number of engineers required to maintain the thing without losing sleep, as far as I've encountered.

Anyway. Once you start stripping Beanstalk to the essentials you truly require, once you begin to question every bell and whistle AWS insists you can't run your business without, you find there's only a handful of things you actually want AWS to handle for you. For example, the hoot you're reading now describes what came about after realising that an old laptop lying idly on my desk can probably handle more load than any EC2 instance I can get at reasonable cost on AWS. It'll let you do everything an EC2 instance would, but you could also pop open the lid and get hacking. A silly thing, but seriously. I also have a second-hand Dell R730 in the closet that, at 72 threads, 256GB RAM and 16x1.5TB SAS SSDs in the front bay, will most certainly outperform and outcost most anything outside of ML rigs in the cloud. So long as you have a reliable fiber connection and decent up- and downlink, you are all set. If ever you find you're about to outgrow this, chances are you've reached success; then, by golly, hire a devops team and pay up.
I'll spare you my personal take re AWS and its ilk. The gist of it is I view them as nothing more than utility companies (quite expensive at that) whose only job is to furnish me with compute instances, block and object storage, VPC, external IPs and a way to resolve names. Many of the "value added" niceties they try to shove down my throat, in my experience, never quite fit the problem and take a lifetime to master, or at least about the same time you'd spend learning and building the minimal solution from first principles. Examples are aplenty, e.g. Lambda functions, which are all the rage. The one legitimate use case I can think of, well, they're never used for that. Have people never heard of CGI? Do you really find lambdas natural to debug? I wish people would spend more time documenting the process, sharing examples of building things from scratch, rather than attempting to spare me the nitty-gritty by selling me black boxes. Nasty details emerge as crucial more often than you'd expect, so you end up untangling the abstractions and wasting precious cognitive budget in all the wrong places. Same goes for libraries and DSLs. You simply cannot use an OAuth flow without understanding exactly how it works, so at best an off-the-shelf library that's there ostensibly to save you time saves you nothing, at worst it triggers hair loss and premature aging cause it failed to provide the particular quirky nonsense some external service decided to require. It would be less work to teach someone with examples how OAuth works, but teaching others doesn't pay. It should.
Much of the value I see in AWS is the pointy clicky console, not because it gives you a way to pointy-click your infrastructure but because of visibility and feedback. It's an underappreciated way to experiment and learn about networking and other resources. All those other Unix cli utilities come later - you'll appreciate them more and learn faster having gone through the trouble of managing things in the console.
As you may well guess I'm a bit of a renegade, at least on my own time. I, for one, am quite convinced that CI/CD is a scam. Shudder. I'm not about to die on that hill and won't ever press the issue when contracted. Just felt like putting it out there. Something something rsync.
...
Anyway. Let's get practical. The first step in our journey is to bring AWS and on-prem resources together, i.e. level the playing field and by doing so seize quite a bit of control from the grabby hands of our cloud overlords. Essentially you want everything networked without being exposed to the outside world. Pretty much we are talking VPN here, though in our case we want as little setup and maintenance as possible while affording us as much leverage as feasible. Do it right and you not only get all of your resources networked, you can implement security in depth. We'll go with tailscale for no particular reason other than it's straightforward to get started, binaries work almost out of the box on GNU Guix (only matters to me) and EC2 Linux, and it comes with handy batteries like MagicDNS. This'd probably work just as well with zerotier, whose ideas resonate with me and, IMO, may have more potential than tailscale. You could also roll your own based on WireGuard or something.
Finally, here's what you came for - boring technical minutiae. Almost unchanged from the original gist. Enjoy and report back if you can think of an improvement.
First, adding a bunch of EC2 instances to our Tailnet is as simple as installing Tailscale on each node. Tailscale then does all the heavy lifting, routing, etc. The only thing you may want to do in this instance would be to also add AWS DNS to your Tailnet so you may use the exact same names as AWS - that's optional.
A somewhat more challenging task is to expose your Tailnet to AWS "devices" like the AWS ALB (Application Load Balancer) and such, where you have no control over the underlying EC2 instances and are therefore unable to install Tailscale. Why would you want to do that? Continuing with our AWS ALB example, with your Tailnet resources available, you could create a load-balancing target group with targets in your Tailnet, e.g. a beefy server in your closet that would cost $$$s when rented on AWS. Once available, you then add an ALB listener rule that sends your visitors over to a server in your Tailnet - effectively on-prem.
Here's a gist of the necessary steps to make this happen.
- Use an EC2 instance as a "beacon" - tailscale subnet router that will relay traffic between AWS VPC and your Tailnet.
- Extend your VPC route table with your Tailnet route going via the above "beacon" instance.
- Create load-balancing target group with IP target in your Tailnet.
- Add ALB listener rule that sends visitors to that target group.
- Alter the ALB security group: allow outbound traffic anywhere on the port where your target is listening, e.g. port 3000.
- Alter "beacon" security group: allow all inbound traffic from ALB's security group; allow all outbound traffic going anywhere.
- On "beacon" set
Actions - Networking - Cange source/destination check
tofalse
if you want Tailnet source / dest addresses preserved rather than NATed i.e. you start Tailscale with--snat-subnet-routes=false
. - Be extra careful with your target group "health" status checking i.e. ensure whatever health endpoint ALB is set to ping is indeed available and responds with 200, lest you'll experience some puzzling "Server time out" errors
Start an EC2 instance. Micro or even nano should be fine. Assuming it's an Amazon Linux 2 AMI, follow the tailscale install instructions or install a tailscale binary.
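If in doubt, tailscale's own install script plus enabling the daemon usually amounts to the following (the script normally enables tailscaled itself, so the systemctl line is just belt and braces; check their docs if your distro needs something else):

curl -fsSL https://tailscale.com/install.sh | sh
sudo systemctl enable --now tailscaled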
Start tailscale:
sudo tailscale up --advertise-routes=172.31.32.0/20 --accept-dns=false --snat-subnet-routes=false --accept-routes --reset
Change --advertise-routes=YOUR_VPC_SUBNET to the VPC subnet you intend to use, or include every subnet, separating them with commas. Typically an ALB has a subnet in every availability zone; if not, you may want to also add that subnet to your ALB subnets.
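For illustration only - two made-up subnets advertised at once, comma-separated:

sudo tailscale up --advertise-routes=172.31.0.0/20,172.31.16.0/20 --accept-dns=false --snat-subnet-routes=false --accept-routes --reset

Whatever you advertise, remember to approve the advertised routes for the "beacon" node in the Tailscale admin console (or via autoApprovers in your ACLs), otherwise nothing gets relayed.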
The --snat-subnet-routes=false flag is optional and only controls how exactly source/destination addresses flow through our "beacon" relay - whether they are NATed or not. See tailscale site-to-site for more details on that. I don't think that's really necessary for our setup to work. If you go this route, then you may also need to disable the Check source/destination setting on the instance, as mentioned in the steps above.
NB At this point tailscale is likely to complain that IP forwarding is disabled. We must enable it! Follow their own tailscale ipforwarding instructions, which probably amount to:
echo 'net.ipv4.ip_forward = 1' | sudo tee -a /etc/sysctl.d/99-tailscale.conf
echo 'net.ipv6.conf.all.forwarding = 1' | sudo tee -a /etc/sysctl.d/99-tailscale.conf
sudo sysctl -p /etc/sysctl.d/99-tailscale.conf
Extend your default VPC routing table so that it knows to route Tailnet traffic via our "beacon" relay.
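For the impatient, a rough CLI equivalent of that console step (the route table and instance ids below are placeholders; 100.64.0.0/10 is the CGNAT block Tailscale assigns its addresses from, so pointing the whole block at the "beacon" is the simplest option):

# route the whole Tailscale address block via the "beacon" instance
aws ec2 create-route --route-table-id rtb-0123456789abcdef0 --destination-cidr-block 100.64.0.0/10 --instance-id i-0123456789abcdef0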
Create an IP target group for your tailnet resources.
Notice that the target group default port (8080 in the example below) can be overridden by the actual port of a target (3000). Take care to specify a correct health check endpoint that you know works on your target, else ALB will ping and time out, then wait 30 seconds before trying again - it is extremely puzzling to observe your resource respond one second and then show a 503 or whatever the next.
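Sketched with the CLI rather than the console - names, VPC id, health check path and the Tailscale address are all made up; the one real gotcha is AvailabilityZone=all, which AWS requires when registering an IP target that sits outside the VPC:

# default port 8080; individual targets may override it
aws elbv2 create-target-group \
  --name tailnet-closet-server \
  --protocol HTTP --port 8080 \
  --vpc-id vpc-0123456789abcdef0 \
  --target-type ip \
  --health-check-path /health

# register the Tailnet IP of the on-prem box, overriding the port with 3000
aws elbv2 register-targets \
  --target-group-arn <arn-returned-by-create-target-group> \
  --targets Id=100.101.102.103,Port=3000,AvailabilityZone=all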
Allow traffic outgoing from your ALB on the port your server will be listening on.
Allow all inbound traffic from the ALB's security group on our "beacon" tailscale relay.
Also allow outbound traffic going anywhere on the "beacon".
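Roughly the same three rules with the CLI, using obviously fake security group ids (sg-ALB, sg-BEACON) and port 3000 as the example target port:

# ALB security group: allow outbound to the target port
aws ec2 authorize-security-group-egress --group-id sg-ALB \
  --ip-permissions '[{"IpProtocol":"tcp","FromPort":3000,"ToPort":3000,"IpRanges":[{"CidrIp":"0.0.0.0/0"}]}]'

# "beacon" security group: allow everything inbound from the ALB's security group
aws ec2 authorize-security-group-ingress --group-id sg-BEACON \
  --ip-permissions '[{"IpProtocol":"-1","UserIdGroupPairs":[{"GroupId":"sg-ALB"}]}]'

# "beacon" security group: allow all outbound anywhere (usually present by default)
aws ec2 authorize-security-group-egress --group-id sg-BEACON \
  --ip-permissions '[{"IpProtocol":"-1","IpRanges":[{"CidrIp":"0.0.0.0/0"}]}]'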
Add an ALB listener rule forwarding visitors to your new tailnet target group.
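CLI-wise, something like this - the listener ARN, target group ARN and hostname are placeholders; swap the host-header condition for whatever rule actually picks out your traffic:

aws elbv2 create-rule \
  --listener-arn <your-alb-listener-arn> \
  --priority 10 \
  --conditions Field=host-header,Values=app.example.com \
  --actions Type=forward,TargetGroupArn=<tailnet-target-group-arn>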
That one's easy. Create a CNAME or A-alias record pointing to your ALB.
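With Route 53 and a made-up hosted zone id, hostname and ALB DNS name, a plain CNAME would look something like the below (use an A-alias instead if you're pointing the zone apex at the ALB, since a CNAME can't live at the apex):

aws route53 change-resource-record-sets \
  --hosted-zone-id Z0123456789ABCDEFGHIJ \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "app.example.com",
        "Type": "CNAME",
        "TTL": 300,
        "ResourceRecords": [{"Value": "my-alb-1234567890.eu-west-2.elb.amazonaws.com"}]
      }
    }]
  }'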
- Are we done yet? Will this setup work as expected? Let me predict that if you were to set this up with an out-of-the-box ALB in London, you'd see requests time out exactly 2 out of 3 times. Now, why do you think that would be? Experiment. See if you can figure it out. I'll post an answer in an upcoming post and link to it later.
- Assuming you've figured out problem 1 and fixed it, how are we doing for reliability? What would you expect to happen to requests if the "beacon" went down? Yeah, it's the AWS way or no way.
If nothing else, the above strengthens my earlier point about buying into "hosted" services. First, git.ht doesn't really need the kind of load-balancing AWS provides, but even if it did, having encountered some of the decisions and limitations of ALB, I'm starting to think that running your own proxy or relay may not be such a bad idea. We'd really be heading towards AWS being nothing more than a utility company at this point. What's even left? RDS?
tailscale relay discusses subnet routers in general and is helpful, while tailscale site-to-site makes it obvious you need to take care of routing. tailscale AWS RDS emphasises reaching AWS VPC resources from your Tailnet and is silent about the other way around. tailscale AWS VPC would've been perfect were it not for the emphasis on NAT Gateway, which confuses everything for a non-admin like myself. The tailscale forum discussion is about exactly what we need but is vague and assumes you already know how stuff works.
Did you know I'm running a (mostly) Clojure SWAT team in London at fullmeta.co.uk? We are always on the lookout for interesting contracts. We have two exceptional engineers available as I type these words. Now that git.ht is up, which you should go ahead and check out right now, I'll be going back to CTOing and contracting, which makes me available for hire, too. Get in touch directly or find us on LinkedIn. We aren't cheap, but we're very good. We can also perform competently in modern Java, ES6/TS, Go, C#, etc.
As always this hoot is nothing more than a GitHub gist. Please, use its respective comments section if you have something to say. Till next time.