@Snawoot
Last active January 30, 2024 20:12

Poor Man's Global Traffic Manager

Sometimes we need to add redundancy to a service or server which happens to be a public-facing entry point of our infrastructure. For example, imagine we want to add a high-availability pair for a load balancer which sits on the edge of the network and forwards traffic to live backend servers.

                                             ┌─────────────┐
                                             │             │
                                      ┌─────►│  Backend 1  │
                                      │      │             │
                                      │      └─────────────┘
                                      │
                                      │
                                      │      ┌─────────────┐
                                      │      │             │
                    ┌────────────┐    ├─────►│  Backend 2  │
                    │            │    │      │             │
                    │            │    │      └─────────────┘
  Public traffic    │    Load    │    │
───────────────────►│            ├────┤
                    │  Balancer  │    │      ┌─────────────┐
                    │            │    │      │             │
                    │            │    ├─────►│     ...     │
                    └────────────┘    │      │             │
                                      │      └─────────────┘
                                      │
                                      │
                                      │      ┌─────────────┐
                                      │      │             │
                                      └─────►│  Backend N  │
                                             │             │
                                             └─────────────┘

We can't just add another load balancer in front of it, because any kind of switch placed in front of our HA pair would itself become a single point of failure. But we still need to switch traffic between load balancer instances:

                                             ┌─────────────┐
                                             │             │
                                      ┌─────►│  Backend 1  │
                    ┌────────────┐    │      │             │
                    │            │    │      └─────────────┘
                    │            │    │
  Public traffic    │    Load    │    │
───────────────────►│            ├────┤      ┌─────────────┐
         ▲          │ Balancer 1 │    │      │             │
         │          │            │    ├─────►│  Backend 2  │
         │          │            │    │      │             │
                    └────────────┘    │      └─────────────┘
     Switching?                       │
                                      │
         │          ┌────────────┐    │      ┌─────────────┐
         │          │            │    │      │             │
         ▼          │            │    ├─────►│     ...     │
  Public traffic    │    Load    │    │      │             │
─ ─ ─ ─ ─ ─ ─ ─ ─ ─►│            ├────┤      └─────────────┘
                    │ Balancer 2 │    │
                    │            │    │
                    │            │    │      ┌─────────────┐
                    └────────────┘    │      │             │
                                      └─────►│  Backend N  │
                                             │             │
                                             └─────────────┘

Wealthy Men's Solutions

Various solutions to this problem have existed for a long time. Basically, all of them mess with network switching at some level in order to direct incoming public traffic to both load balancers or to only one of them.

VRRP, CARP, Virtual IP, Floating IP, ...

Essentially assigns one or a few IP addresses to one or a few active load balancer instances. On failure, the IP addresses are (re-)attached to operational instances. Such methods ultimately use local network equipment to switch traffic to operational load balancers.

It is worth noting that the network equipment in question does not necessarily have any redundancy itself. For example, a perfectly good VRRP pair of two load balancers may still be connected to a single Ethernet switch, which acts as a single point of failure. Even redundant switches may be prone to simultaneous failures due to similar conditions and a common broadcast domain.

This solution is suitable for local traffic management only, e.g. for load balancers within a single datacenter.

Anycast, BGP, other methods based on dynamic routing

Usually not used alone, but in conjunction with local traffic management within a point of presence (datacenter, availability zone, whatever). The same IP block is announced from multiple locations, effectively making traffic to those IPs be served by machines in multiple locations. These methods ultimately use network neighbors, and neighbors of their neighbors, as the switch in front of the infrastructure, announcing or withdrawing specific blocks from a given location.

This particular method is available only to fairly large network operators, typically ones operating their own autonomous systems.

DNS-based methods

These do traffic switching at the DNS level, directing the client to the correct destination server. The following options are widely known:

  • Round-robin DNS. Actually a non-option, because it just exposes all potentially available instances, hoping the client will either be lucky enough to connect to the right one or persistent enough to keep trying until it finds a working one.
  • Dynamic DNS, tracking the state of origin servers (AWS Route 53, Cloudflare DNS LB, PowerDNS dnsdist, ...). Keeps track of healthy destination servers and responds to address requests with just one address, which belongs to a currently healthy server.

An interesting fact about such cloud DNS load balancing services is that they are billed on a per-request basis, yet we basically have no way to control the incoming flow of requests, nor a way to check whether those DNS requests actually happened.

Wealthy Men's Problems

Some of these methods are hard to implement properly. Even for keepalived it is recommended to run the VRRP protocol on a separate link between servers; otherwise, a single link with maxed-out bandwidth will interfere with master election. Some of them are as easy as plug and play (DNS GTMs), but may become quite costly.

The solutions above imply that a traffic forwarding target (load balancer or other origin server) is either healthy or faulty, which is quite an assumption. It is not always the case, especially for global traffic management solutions. For example, AWS Route 53 makes periodic healthcheck probes from a few locations to ensure the target server is available. But the target may still be unreachable from some remote locations while other origin servers are reachable: connectivity on the Internet is not binary.

Poor Man's Global Traffic Manager doesn't make such assumptions, isn't limited to a single datacenter, doesn't have moving parts and costs you basically nothing. With it you can spin up global-scale fault-tolerant services quickly and dedicate more time to making a living.

Layout

Usually the DNS resolution process works like this:

  ┌────────────┐ A? example.com ┌───────────────┐
  │            │      (1)       │               │
  │            ├───────────────►│ DNS recursive │
  │   Client   │                │               │
  │            │◄───────────────┤   resolver    │
┌─┤            │ A  example.com │               │
│ └────────────┘      (8)       └┬────┬────┬────┘
│                                │ ▲  │ ▲  │ ▲
│  ┌─────────────────────────────┘ │  │ │  │ │
│  │A? example.com (2)             │  │ │  │ │
│  │ ┌─────────────────────────────┘  │ │  │ │
│  ▼ │NS com (3)                      │ │  │ │
│ ┌──┴──────────┐    ┌────────────────┘ │  │ │
│ │             ├┐   │A? example.com (4)│  │ │A? example.com (6)
│ │    ROOT     ││   │NS example.com (5)│  │ │A  example.com (7)
│ │             ││   ▼ ┌────────────────┘  │ │
│ │ nameservers ││  ┌──┴──────────┐        │ │
│ │             ││  │             ├┐       │ │
│ └┬────────────┘│  │    .COM     ││       │ │
│  └─────────────┘  │             ││       ▼ │
│                   │ nameservers ││  ┌──────┴──────┐
│                   │             ││  │             ├┐
│                   └┬────────────┘│  │ example.com ││
│                    └─────────────┘  │             ││
│                                     │ nameservers ││
│                                     │             ││
│                                     └┬────────────┘│
│                                      └─────────────┘
│                  ┌─────────────┐
│                  │             ├┐
│Actual connection │ EXAMPLE.COM ││
└─────────────────►│             ││
         (9)       │   servers   ││
                   │             ││
                   └┬────────────┘│
                    └─────────────┘

The client wants to establish a connection with a host specified by its domain name. The client asks a DNS resolver (usually DNS servers provided by the ISP, residing in the same network). The DNS resolver, if it has no record in its cache, walks the hierarchy of authoritative DNS servers. At each step it either gets redirected to a more specific nameserver, to which the requested domain is delegated, or finally retrieves the requested resource record.

Note that at each step there are usually multiple nameservers available for the DNS recursor to query. DNS has native fault tolerance mechanisms: if some nameserver is not available, the recursor will query another nameserver in that resource record set. For example, right now there are 13 nameservers available to serve the .COM zone:

a.gtld-servers.net.
b.gtld-servers.net.
...
m.gtld-servers.net.
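
This list can be reproduced with a single dig query against any recursive resolver (output abridged here):

$ dig +short NS com.
a.gtld-servers.net.
b.gtld-servers.net.
...
m.gtld-servers.net.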

Each of these nameservers can be asked for the nameservers of the example.com domain, and the DNS recursor will try to contact another one if it receives no response on the first attempt. We can use this property to build DNS-based traffic switching between working servers. The idea is the following: we deploy two or more authoritative nameservers for our domain and make each of them return its own IP address. This way there is a causal relationship between the address of the nameserver which the DNS recursor managed to reach and the IP address used to contact the actual service.

Unlike dynamic DNS GTMs, we do not try to figure out which server is operational and we do not make any active probes. We just let the DNS recursor figure out which NS server is reachable, and it will direct the client to the same machine which successfully provided the DNS response. Effectively this shifts probing and switching to the client's DNS recursor, allowing us to get away with two simple DNS server instances with static configuration. A diagram of the interactions may look like this:

  ┌────────────┐ A? example.com ┌───────────────┐
  │            │      (1)       │               │
  │            ├───────────────►│ DNS recursive │
┌─┤   Client   │                │               │
│ │            │◄───────────────┤   resolver    │
│ │            │ A  example.com │               │
│ └────────────┘   (9) (=LB2)   └┬────┬────┬─┬──┘
│                                │ ▲  │ ▲  │ │ ▲
│  ┌─────────────────────────────┘ │  │ │  │ │ │
│  │A? example.com (2)             │  │ │  │ │ │
│  │ ┌─────────────────────────────┘  │ │  │ │ │
│  ▼ │NS com (3)                      │ │  │ │ │
│ ┌──┴──────────┐    ┌────────────────┘ │  │ │ │
│ │             ├┐   │A? example.com (4)│  │ │ │
│ │    ROOT     ││   │NS example.com (5)│  │ │ │ A example.com (8)
│ │             ││   ▼ ┌────────────────┘  │ │ │
│ │ nameservers ││  ┌──┴──────────┐        │ │ │ (=LB2)
│ │             ││  │             ├┐       │ │ │
│ └┬────────────┘│  │    .COM     ││       │ │ │
│  └─────────────┘  │             ││       │ │ │
│                   │ nameservers ││       │ │ │
│                   │             ││       │ │ │
│                   └┬────────────┘│       │ │ │
│                    └─────────────┘       │ │ │
│    A? example.com (6)                    │ │ │
│    ┌─xxxxxxxxxxxxxxxx────────────────────┘ │ │
│    │                     A? example.com (7)│ │
│    │ ┌──xxxxxxxxxxxxx     ┌────────────────┘ │
│    ▼ │                    ▼                  │
│   ┌──┴──────────────┐    ┌─────────────────┐ │
│   │                 │    │                 ├─┘
│   │   example.com   │    │   example.com   │
│   │                 │    │                 │
│   │  nameserver1 &  │    │  nameserver2 &  │
│   │                 │    │                 │
│   │  loadbalancer1  │    │  loadbalancer2  │
│   │                 │    │                 │
│   │    (FAULTY)     │    │    (HEALTHY)    │
│   │                 │    │                 │
│   └─────────────────┘    └─────────────────┘
│                                   ▲
│ Actual connection (10)            │
└───────────────────────────────────┘

As the diagram indicates, the DNS recursor descends the hierarchy as usual. When it comes to resolving the actual address record of the example.com resource, it tries to contact the first1 nameserver, which is also the first load balancer providing the example.com service. The first server is faulty and doesn't provide a response to the DNS recursor. The DNS recursor tries to contact another nameserver and succeeds. The second (name)server responds with its own IP address, as always. This makes the DNS recursor return the IP of the alive nameserver, which is also the IP address of an alive load balancer of the example.com service.
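
For illustration, this behaviour can be emulated by hand with dig, querying each authoritative server directly (a sketch using the example addresses from the implementation section below; the exact timeout message depends on your dig version):

# The faulty nameserver/load balancer gives no answer:
$ dig +short +time=2 +tries=1 A example.com @198.51.100.10
;; connection timed out; no servers could be reached

# The recursor falls back to the next NS in the set, which answers with
# its own address, so the client ends up on the healthy machine:
$ dig +short A example.com @203.0.113.20
203.0.113.20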

Implementation

Let's consider a slightly more practical example where we need to load-balance a single hostname, but we don't want to delegate the entire zone to our own authoritative nameservers. We will take the example.com domain and ensure high availability for the hostname api.example.com, which points to the load balancers.

Step 1. Prepare servers

Prepare two servers for incoming traffic. They can even reside in different datacenters and forward traffic to their own local groups of workers. We will assume we have two servers with the IP addresses 198.51.100.10 and 203.0.113.20.

Validation: check that you're able to log in to these servers and that they are reachable at the designated addresses.
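
A quick reachability check from your workstation might look like this (a sketch; the SSH user is an assumption, adjust to your environment):

# Both addresses should answer to ping and accept SSH logins.
$ ping -c 3 198.51.100.10
$ ping -c 3 203.0.113.20
$ ssh admin@198.51.100.10 hostname
$ ssh admin@203.0.113.20 hostname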

Step 2. Install payload on servers

Set up the service or load balancers providing the actual service on these IP addresses.

Validation: depends on payload.
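
For an HTTP payload, for instance, a basic check could look like this (a sketch; /healthz is a hypothetical endpoint of your service):

# Each server should serve the payload on its own public address.
$ curl -sS http://198.51.100.10/healthz
$ curl -sS http://203.0.113.20/healthz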

Step 3. Install catch-all authoritative DNS server

At this step we need to install an authoritative DNS server on each machine and make it respond to any request with the IP address of its own machine. Almost any DNS server can do this job, even simple dnsmasq. But a reasonably good option for this is CoreDNS.

Install CoreDNS on each server and apply the following configuration:

First server

/etc/coredns/Corefile:

example.com {
    template IN A {
        answer "{{ .Name }} 30 IN A 198.51.100.10"
    }
    template IN SOA {
        answer  "{{ .Name }} 3600 IN	SOA lb1.example.com. adminemail.example.com. 2022102100 1200 180 1209600 30"
    }
    template IN NS {
        answer  "{{ .Name }} 30 IN NS lb1.example.com."
        answer  "{{ .Name }} 30 IN NS lb2.example.com."
    }
    template ANY ANY {
    }
}

Second server

/etc/coredns/Corefile:

example.com {
    template IN A {
        answer "{{ .Name }} 30 IN A 203.0.113.20"
    }
    template IN SOA {
        answer  "{{ .Name }} 3600 IN	SOA lb1.example.com. adminemail.example.com. 2022102100 1200 180 1209600 30"
    }
    template IN NS {
        answer  "{{ .Name }} 30 IN NS lb1.example.com."
        answer  "{{ .Name }} 30 IN NS lb2.example.com."
    }
    template ANY ANY {
    }
}
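
If CoreDNS isn't packaged for your distribution, one way to get it running could look like this (a sketch; the release version and paths are examples, and you would normally run it under your service manager of choice):

# Download a CoreDNS release binary (example version; see
# https://github.com/coredns/coredns/releases for current builds).
$ wget https://github.com/coredns/coredns/releases/download/v1.11.1/coredns_1.11.1_linux_amd64.tgz
$ tar xzf coredns_1.11.1_linux_amd64.tgz
$ sudo mv coredns /usr/local/bin/

# Run it against the Corefile above; it must bind UDP/TCP port 53 on the
# server's public address.
$ sudo /usr/local/bin/coredns -conf /etc/coredns/Corefile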

Validation: command dig +short api.example.com @198.51.100.10 should return the address 198.51.100.10. Command dig +short api.example.com @203.0.113.20 should return the address 203.0.113.20.
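
The expected session looks roughly like this:

$ dig +short api.example.com @198.51.100.10
198.51.100.10
$ dig +short api.example.com @203.0.113.20
203.0.113.20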

Step 4. Add A-records for servers in DNS

Create the following DNS records in the example.com zone:

lb1.example.com.	300	IN	A	198.51.100.10
lb2.example.com.	300	IN	A	203.0.113.20

The DNS zone editing process depends on where you're hosting it. Sometimes it's a GoDaddy control panel, sometimes Cloudflare. You know your setup best.

Validation: command dig +short lb1.example.com should return 198.51.100.10. Command dig +short lb2.example.com should return 203.0.113.20.

Step 5. Finally, delegate the hostname to the load balancers/nameservers

Remove all existing DNS records for the name api.example.com. Add the following ones:

api.example.com. 300 IN	NS	lb1.example.com.
api.example.com. 300 IN	NS	lb2.example.com.

Done! After a few minutes you will be able to reach the domain api.example.com via the two load balancers we set up.

Validation: command dig +trace api.example.com should produce output indicating that lb1 or lb2 was contacted and resolve the name to one of their addresses.
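
The tail of the trace should look roughly like this (abridged sketch; sizes, timings and which of the two servers answers will vary):

api.example.com.  300  IN  NS  lb1.example.com.
api.example.com.  300  IN  NS  lb2.example.com.
;; Received ... from <one of the example.com zone's nameservers>

api.example.com.  30   IN  A   203.0.113.20
;; Received ... from 203.0.113.20#53(lb2.example.com)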

Maintenance

If you need to do maintenance on one of the servers, or a server is misbehaving, just stop CoreDNS on that server and wait for the TTL to expire (30 seconds in our example).
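
Assuming CoreDNS runs as a systemd service named coredns (an assumption; adjust to however you run it), draining a server could look like this:

# Stop answering DNS on the server about to go down; resolvers will fail
# over to the other nameserver/load balancer.
$ sudo systemctl stop coredns

# Wait out the record TTL so cached answers pointing here expire.
$ sleep 30

# ...perform maintenance, then bring DNS (and traffic) back:
$ sudo systemctl start coredns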

Footnotes

  1. For the sake of clarity. The actual order is not guaranteed.

@joshenders

joshenders commented Aug 21, 2022

This is a great write up but please don’t do this 🙂.

Authoritative nameserver selection by recursive resolvers is extremely unreliable and implementation specific. This method may actually be worse than health checked records and low TTLs and could leave your site completely unconnectable for uncomfortably long periods of time.

See this paper for more details about nameserver selection of common recursive resolvers, and the so-called SRTT: https://irl.cs.ucla.edu/data/files/papers/res_ns_selection.p...

Also this presentation, https://youtu.be/z7Jl1sjr9jM

One of the best and most robust solutions for GSLB/GTM is pointing your main entrypoint IP (e.g. apex record/www/api) at an anycasted proxy (Cloudflare/Fastly/Google/AWS, etc), and using the active health check features of that service to the static unicasted (and firewalled!) load balancer IPs of your origin.

These services are not expensive but if you can’t afford them, the second best method would be to round robin A/AAAA records with low (60s) TTLs of a large pool of ingress load balancer IPs—which is exactly how AWS ALB/ELB/NLB operate!

When loadbalancing via DNS, you’ll still contend with misbehaving recursive resolvers (JVM clients for example) but you’ll strand less traffic for less time than you would when withdrawing authoritative nameservers due to unpredictable and chaotic resolver implementations.

@Snawoot

Snawoot commented Aug 21, 2022

@joshenders Hi!

This is a great write up but please don’t do this

We have been running it in production since December 2021 with no issues and have also live-tested multiple failure scenarios. Your concerns did not come true.

One of the best and most robust solutions for GSLB/GTM is pointing your main entrypoint IP (e.g. apex record/www/api) at an anycasted proxy (Cloudflare/Fastly/Google/AWS, etc),

Rooting for US-based cloud services is understandable, but it is not something that everyone wants.

These services are not expensive but if you can’t afford them, the second best method would be to round robin A/AAAA records with low (60s) TTLs of a large pool of ingress load balancer IPs—which is exactly how AWS ALB/ELB/NLB operate!

Nice, but I'd rather not.

@joshenders

joshenders commented Aug 21, 2022

@joshenders Hi!

This is a great write up but please don’t do this

We have been running it in production since December 2021 with no issues and have also live-tested multiple failure scenarios. Your concerns did not come true.

Unfortunately, due to the diversity of resolvers and user-agents on the Internet, it's just not something that's easily tested until you're in the disaster situation that you're trying to prevent. I've had several experiences administering high-scale sites (1M+ rps) of very diverse user-agents where traffic from failed nameserver IPs didn't follow until after the serial of our SOA record was updated and the failed authoritative server was removed entirely. As the site was under .com., this took about 48 hours until DNS became consistent across respecting resolvers. After 48h, we still had a 1-15% drop in traffic but made the decision not to investigate further.

The moral of the story is that authoritative nameserver selection and timeouts/failover by recursive resolvers and various client libraries is highly unpredictable and as such, untestable, and unreliable. Negative caching and glue record caching are somewhat non-standard as well and can come into play. Relying on this mechanism for failover is very uncommon and as such, is entirely untested, unreliable, and unpredictable. In the links I shared, you'll see that many resolvers will continue to send a non-trivial portion of recursive resolver traffic to unreachable (failed) name servers as a test or even ignore and "hotspot", ruining any semblance of fair queuing. A system reliant upon this kind of chaotic behavior is not something I would trust with my reputation.

What is a lot more common and relied upon by those with a massive sample size of real users (hyperscalers like Cloudflare, Fastly, Akamai, Amazon, Google, Microsoft) is DNS RR-based failover. Second to Anycast, this is a reasonable solution for GSLB.

Best of luck!

@Snawoot

Snawoot commented Aug 21, 2022

Relying on this mechanism for failover is very uncommon and as such, is entirely untested, unreliable, and unpredictable.

Sorry, but that's FUD. It's clearly specified in the standard that a recursive resolver has to retry another NS to obtain the address. Therefore it will eventually reach one of the nameservers/load balancers which is still alive, and that nameserver will respond with its own address, effectively directing the end user there.

@joshenders

joshenders commented Aug 21, 2022

I understand how it could come across as FUD but I'm sharing it from a place of experience operating authoritative DNS infrastructure at scale and providing evidence to support my claims. You don't have to take my advice but you should probably not ignore it either.

What I'm trying to say here is, leaving well-exercised and well-tested code paths leads to unexpected behavior. This isn't FUD, this is bug-hunting 101. RFC1035 does not cover authoritative DNS selection for recursive resolvers, and that is exactly why the behavior is implementation-specific and there are so many different algorithms.

Client libraries handle connection timeouts to A/AAAA responses better than recursive resolvers handle authoritative nameserver failures. But don't take my word for it — here is more evidence supporting the fact that nameserver selection and fault tolerance varies drastically among common implementations: https://www.dns-oarc.net/files/workshop-201203/OARC-workshop-London-2012-NS-selection.pdf see slide 18.

Another issue you may run into with a configuration like this is that it may not scale beyond 8 nameservers, which is yet another frustrating and seemingly arbitrary limitation imposed by authors of common recursors and gTLDs.

Again, just looking out for you and anyone following along. Best of luck!

@babs

babs commented Aug 22, 2022

May I mention gdnsd with multifo?
It does what the big cloud DNS services mentioned above do, can be self-hosted (no cost per request, only traffic) and returns all healthy backends (with maintenance mode etc.). You have a hand on the TTL, and therefore to some extent on the QPS you'll receive. The TTL is handled in a pretty neat way, depending on the healthcheck frequency of the service and the up/down count.
Lots of other plugins are available too.

@Snawoot

Snawoot commented Aug 23, 2022

@babs Nice to know, thanks! Back when I used such solutions there was some complex configuration for PowerDNS to do that. Hopefully things have become simpler now.

@lucidyan

Great article and diagrams! Are you generating them with ascii?

@Snawoot

Snawoot commented Aug 29, 2022

Great article and diagrams! Are you generating them with ascii?

@lucidyan Thanks! Yes, it's asciiflow.com, slightly edited with vim.

@hisusqristos

how did you generate these?

[image]

@Snawoot

Snawoot commented Sep 10, 2023

@hisusqristos

thank you <3
