Kubernetes DNS: Headless services with large number of endpoints

Kubernetes DNS at scale

It seems to be a common practice in HPC and AI/ML environments that use MPI applications to populate a hosts file with all the nodes in the cluster and copy it to all the nodes, ref https://help.ubuntu.com/community/MpichCluster

It is my observation that in Kubernetes, Headless Services are used to implement this service discovery. This is very handy because it allows referencing a Pod by hostname without having to copy over a generated /etc/hosts. The Kubernetes DNS-Based Service Discovery specification requires:

There must also be an A record of the following form for each ready endpoint with hostname of <hostname> and IPv4 address <a>. If there are multiple IPv4 addresses for a given hostname, then there must be one such A record returned for each IP.

Record Format: <hostname>.<service>.<ns>.svc.<zone>. <ttl> IN A <a>
Question Example: my-pet.headless.default.svc.cluster.local. IN A
Answer Example: my-pet.headless.default.svc.cluster.local. 4 IN A 10.3.0.100
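
For example, an application running in the cluster can resolve a peer directly by that generated name instead of distributing a static /etc/hosts. A minimal Go sketch (my addition, reusing the spec's example name; substitute your own hostname, service and namespace):

package main

import (
    "fmt"
    "net"
)

func main() {
    // Resolve the per-endpoint hostname that the headless Service publishes.
    addrs, err := net.LookupHost("my-pet.headless.default.svc.cluster.local")
    if err != nil {
        panic(err)
    }
    fmt.Println(addrs) // e.g. [10.3.0.100]
}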

But these environments may operate at large scales (10k+ nodes), and it seems that the DNS protocol has some limitations, mainly the number of records that fit into a DNS answer.

How many A records per RRSet

My teammate Robert Evans (Google) made a very detailed analysis of this problem; some snippets of his report are quoted here:

- 4,080 is the maximum upper limit for real-world use:
- 4,094 is the maximum encodable in a message with a short name.
- Take one away for each additional 16 octets of name encoding.
- Take 4 away for room for EDNS(0) OPT record and one or more options.
- Take 10 away for a DNSSEC signature with short signer name.
- Take one away for each additional 16 octets of signer name encoding.
- 4,000 is a practical guideline leaving additional space for encoding and expansion.
- 4,368 is the maximum number of encodable records in a nonsensical message.
- 88 is the maximum limit for a 1440 byte UDP response (not allowing any space for OPT or DNSSEC records).

DNS messages are limited to 65,535 octets:
- UDP size limit is 16 bits [RFC 768]
- TCP message header size limit [RFC 1035 s. 4.2.2]

DNS messages have a 12 byte header [RFC 1035 s. 4.1.1]:
                                    1  1  1  1  1  1
      0  1  2  3  4  5  6  7  8  9  0  1  2  3  4  5
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
    |                      ID                       |
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
    |QR|   Opcode  |AA|TC|RD|RA|   Z    |   RCODE   |
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
    |                    QDCOUNT                    |
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
    |                    ANCOUNT                    |
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
    |                    NSCOUNT                    |
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
    |                    ARCOUNT                    |
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+

Each record has a variable length domain name followed by a 10 byte header [RFC 1035 s. 4.1.3] followed by variable length RDATA:
                                    1  1  1  1  1  1
      0  1  2  3  4  5  6  7  8  9  0  1  2  3  4  5
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
    |                                               |
    /                                               /
    /                      NAME                     /
    |                                               |
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
    |                      TYPE                     |
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
    |                     CLASS                     |
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
    |                      TTL                      |
    |                                               |
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
    |                   RDLENGTH                    |
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
    /                     RDATA                     /
    /                                               /
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+

Question records are the same format without TTL, RDLENGTH, and RDATA. They consist of a variable length domain name followed by fixed 4 bytes of type and class. [RFC 1035 s 4.1.2].
Domain names are encoded as 1 byte length labels followed by that many label octets terminated by a NUL octet (which represents a zero-length label). [RFC 1035 s 3.3] The maximum encoded length is 255 octets. [RFC 1035 s. 2.3.4]
Domain names or suffixes can also be compressed by including a 2 octet offset to the domain name suffix in the first 16,383 bytes of the same message. [RFC 1035 s 4.1.4]
The smallest possible reasonable A record representation is 16 bytes:
2 bytes for the domain name (a fully compressed name)
10 bytes record header
4 bytes RDATA
The message with the largest number of A records contains a message header, question, and maximum number of minimal-sized compressed A records in the remaining space:
12 bytes for the header
N+4 bytes for the question where N is the length of the domain name
Followed by (65535 - 12 - (N+4)) / 16 A records to fill up the remaining space
This omits EDNS(0) OPT record and any extensions therein.
For google.com N=7+4+1=12. This yields 4,094 A records for a 65,532 byte message.
Any domain name encoded to 15 octets or fewer can fit 4,094 A records. Above that one fewer A record fits for each 16 octets of name length up to a minimum of 4,079 records for a name that is 255 octets long for a total message size of 65,535 octets.
For a maximum sized UDP response (1,440 octets), any domain name encoded to 16 octets or fewer can fit 88 A records. Above that one fewer A record fits for each 16 octets of name length up to a minimum of 73 for a name that is 255 octets long for a total message size of 1,439 octets.
...
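
To make the arithmetic concrete, here is a small Go sketch (mine, not from the report) that reproduces the numbers above: a 12 byte header, an N+4 byte question, and 16 bytes for each minimal, fully compressed A record in the remaining space.

package main

import "fmt"

// maxARecords applies the arithmetic above: a 12 byte message header, an N+4 byte
// question, and 16 bytes per minimal fully compressed A record
// (2 byte compressed name + 10 byte RR header + 4 byte IPv4 RDATA).
func maxARecords(name string, msgLimit int) int {
    // Wire length of a name without a trailing dot: one length octet per label
    // plus the label bytes plus a terminating zero octet, i.e. len(name)+2.
    n := len(name) + 2
    return (msgLimit - 12 - (n + 4)) / 16
}

func main() {
    fmt.Println(maxARecords("google.com", 65535)) // 4094, as in the report
    fmt.Println(maxARecords("google.com", 1440))  // 88 for a 1440 byte UDP response
}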

The Kubernetes Problem

My teammate Magda Kolybacz (Google) pointed out that this presents an interesting challenge to implementations when there are more records than are allowed in a single response.

When a user queries the Service name we need to truncate the response. Is this specified in a DNS RFC? Should we try to normalize this behavior? How does this work today?

Playing with CoreDNS

I wanted to understand how this works today, so I decided to use KIND and a small Go program to test it out:

  1. Create cluster: kind create cluster
  2. Get the IP of one of the CoreDNS Pods and the node where it is running:
$  kubectl get pods -A -o wide | grep coredns
kube-system          coredns-5fdfb969c5-s4lss                     1/1     Running   0             13d    10.244.1.3     kind-worker          <none>           <none>
kube-system          coredns-5fdfb969c5-szpr4                     1/1     Running   0             13d    10.244.2.5     kind-worker2         <none>           <none>
  3. I'm lazy, so I add a route from the host to the Pod (I'm using Linux; on Mac or Windows, copy the Go binary inside the kind node):
   kubectl get nodes -o wide kind-worker
NAME          STATUS   ROLES    AGE   VERSION                                    INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                         KERNEL-VERSION         CONTAINER-RUNTIME
kind-worker   Ready    <none>   51d   v1.29.0-alpha.2.932+a8b7e1953f010c-dirty   192.168.8.2   <none>        Debian GNU/Linux 11 (bullseye)   6.5.6-1rodete4-amd64   containerd://1.7.1
ip route add 10.244.1.3 via 192.168.8.2
  4. And now I can run my simple Go program:
go run . -dnsserver 10.244.1.3 -endpoints 1000 -name svc-test
Created slice svc-test-wxyz0 with 1000 endpoints
expected 1000 IP for svc got 100
expected 1000 IP for svc got 100
expected 1000 IP for svc got 100
Got 1000 records for svc on 6.047929902s
Stats:
Latencies (us) 50 percentile: 32398
Different records 50 percentile: 0
Total 1000 records
Stats:
Latency (us) 50 percentile: 413
Total 1000 records
go run . -dnsserver 10.244.1.3 -endpoints 4000
Created slice svc-test-wxyz0 with 1000 endpoints
Created slice svc-test-wxyz1 with 1000 endpoints
Created slice svc-test-wxyz2 with 1000 endpoints
Created slice svc-test-wxyz3 with 1000 endpoints
Got 4000 records for svc on 136.84242ms
Stats:
Latencies (us) 50 percentile: 127056
Different records 50 percentile: 0
Total 4000 records
Stats:
Latency (us) 50 percentile: 442
Total 4000 records


CoreDNS sets the truncated bit in the TCP response and fills in as many records as possible; this can be observed using dig:

dig +tcp @10.244.1.3 svc-test.default.svc.cluster.local | grep 172.16 | wc
   4091   20455  234507
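
The same check can be done programmatically. A rough sketch (my addition, assuming the github.com/miekg/dns library, which CoreDNS itself is built on) that queries the service over TCP and prints the TC bit and the answer count:

package main

import (
    "fmt"

    "github.com/miekg/dns"
)

func main() {
    m := new(dns.Msg)
    m.SetQuestion(dns.Fqdn("svc-test.default.svc.cluster.local"), dns.TypeA)

    c := &dns.Client{Net: "tcp"} // same transport as dig +tcp
    r, rtt, err := c.Exchange(m, "10.244.1.3:53")
    if err != nil {
        panic(err)
    }
    // CoreDNS keeps the TC bit set even over TCP when the full answer does not fit.
    fmt.Printf("truncated=%v answers=%d rtt=%v\n", r.Truncated, len(r.Answer), rtt)
}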

When using a userspace tool like ping, the tool is able to resolve the IP addresses, which means the glibc resolver returns an address from the truncated TCP answer:

ping svc-test -c 1
PING svc-test.default.svc.cluster.local (172.16.0.184) 56(84) bytes of data.

We can compile a small program using getaddrinfo, the same call used by the CGO resolver. Compiling the attached program (gcc -o getaddr getaddrinfo.c -static) we can see it returns all the records:

  ./getaddr svc-test.default.svc.cluster.local | head -5
Result 172.16.4.64
Result 172.16.18.157
Result 172.16.37.244
Result 172.16.5.240
Result 172.16.14.241
root@kind-control-plane:/# ./getaddr svc-test.default.svc.cluster.local | wc
   4092    8184   83119

But the Go standard library resolver keeps retrying and ends up answering with no hosts:

go run . -dnsserver 10.244.1.3 -endpoints 10000 -name svc-test
Created slice svc-test-wxyz0 with 1000 endpoints
Created slice svc-test-wxyz1 with 1000 endpoints
Created slice svc-test-wxyz2 with 1000 endpoints
Created slice svc-test-wxyz3 with 1000 endpoints
Created slice svc-test-wxyz4 with 1000 endpoints
Created slice svc-test-wxyz5 with 1000 endpoints
Created slice svc-test-wxyz6 with 1000 endpoints
Created slice svc-test-wxyz7 with 1000 endpoints
Created slice svc-test-wxyz8 with 1000 endpoints
Created slice svc-test-wxyz9 with 1000 endpoints
Could not get IPs for host svc-test.default.svc.cluster.local. on server 10.244.1.3: lookup svc-test.default.svc.cluster.local. on 127.0.0.1:53: no such host
expected 4000 IP for svc got 0
Could not get IPs for host svc-test.default.svc.cluster.local. on server 10.244.1.3: lookup svc-test.default.svc.cluster.local. on 127.0.0.1:53: no such host
expected 4000 IP for svc got 0

However, if we modify the program to use the CGO DNS resolver

    resolver := net.Resolver{
        PreferGo: false,

it resolves the records

GODEBUG=netdns=cgo go run main.go -dnsserver 10.244.0.3 -endpoints 10000 -name svc-test
Created slice svc-test-wxyz0 with 1000 endpoints
Created slice svc-test-wxyz1 with 1000 endpoints
Created slice svc-test-wxyz2 with 1000 endpoints
Created slice svc-test-wxyz3 with 1000 endpoints
Created slice svc-test-wxyz4 with 1000 endpoints
Created slice svc-test-wxyz5 with 1000 endpoints
Created slice svc-test-wxyz6 with 1000 endpoints
Created slice svc-test-wxyz7 with 1000 endpoints
Created slice svc-test-wxyz8 with 1000 endpoints
Created slice svc-test-wxyz9 with 1000 endpoints
Got 4092 records for svc on 294.252898ms

Both resolvers should have the same behavior; fix in https://go-review.googlesource.com/c/go/+/552418.

Notes

There were some Go problems (https://go.dev/issue/64896): the Go resolver does not return any records if the TCP response has the truncated bit set.

dig +tcp @10.244.1.3 svc-test.default.svc.cluster.local | grep 172.16 |wc
  4091   20455  238575

And you can watch the slices

kubectl get endpointslices
NAME                 ADDRESSTYPE   PORTS     ENDPOINTS                                                 AGE
affinity-svc-h958d   IPv4          <unset>   <unset>                                                   2d23h
kubernetes           IPv4          6443      192.168.8.4                                               51d
svc-test-wxyz0       IPv4          8080      172.16.0.1,172.16.0.2,172.16.0.3 + 997 more...            45s
svc-test-wxyz1       IPv4          8080      172.16.3.233,172.16.3.234,172.16.3.235 + 997 more...      45s
svc-test-wxyz2       IPv4          8080      172.16.7.209,172.16.7.210,172.16.7.211 + 997 more...      45s
svc-test-wxyz3       IPv4          8080      172.16.11.185,172.16.11.186,172.16.11.187 + 997 more...   45s
svc-test-wxyz4       IPv4          8080      172.16.15.161,172.16.15.162,172.16.15.163 + 997 more...   45s
svc-test-wxyz5       IPv4          8080      172.16.19.137,172.16.19.138,172.16.19.139 + 997 more...   45s
svc-test-wxyz6       IPv4          8080      172.16.23.113,172.16.23.114,172.16.23.115 + 997 more...   45s
svc-test-wxyz7       IPv4          8080      172.16.27.89,172.16.27.90,172.16.27.91 + 997 more...      45s
svc-test-wxyz8       IPv4          8080      172.16.31.65,172.16.31.66,172.16.31.67 + 997 more...      45s
svc-test-wxyz9       IPv4          8080      172.16.35.41,172.16.35.42,172.16.35.43 + 997 more...      45s

The Go program is very hacky and dirty; if something goes wrong, just delete everything and start over:

kubectl delete service svc-test; for i in $(seq 0 10 ); do kubectl delete endpointslice svc-test-wxyz$i; done

Conclusions

It seems CoreDNS always returns the same list of records. Is this problematic? Is DNS the right protocol for asking for a list of 10K endpoints?

Independently of this, I think it is clear that we MUST keep generating one A record per hostname, as applications depend on resolving the hostnames:

my-pod11001.headless.default.svc.cluster.local. 4 IN A 10.3.0.100
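
The Go program (main.go) used in the tests above: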

package main

import (
    "context"
    "flag"
    "fmt"
    "net"
    "os"
    "path/filepath"
    "sort"
    "time"

    v1 "k8s.io/api/core/v1"
    discovery "k8s.io/api/discovery/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
    "k8s.io/client-go/util/homedir"
    netutils "k8s.io/utils/net"
    "k8s.io/utils/ptr"
)

var (
    kubeconfig *string
    name       *string
    namespace  *string
    endpoints  *int
    dnsserver  *string
)

func main() {
    if home := homedir.HomeDir(); home != "" {
        kubeconfig = flag.String("kubeconfig", filepath.Join(home, ".kube", "config"), "(optional) absolute path to the kubeconfig file")
    } else {
        kubeconfig = flag.String("kubeconfig", "", "absolute path to the kubeconfig file")
    }
    name = flag.String("name", "svc-test", "name of the headless service to be created")
    namespace = flag.String("namespace", "default", "namespace for the headless service to be created")
    endpoints = flag.Int("endpoints", 1000, "number of endpoints for service")
    dnsserver = flag.String("dnsserver", "", "IP address of the DNS server")
    flag.Parse()

    // use the current context in kubeconfig
    config, err := clientcmd.BuildConfigFromFlags("", *kubeconfig)
    if err != nil {
        panic(err.Error())
    }
    // create the clientset
    clientset, err := kubernetes.NewForConfig(config)
    if err != nil {
        panic(err.Error())
    }

    // Create a headless service with the number of endpoint addresses specified
    // generate a base service object
    svc := &v1.Service{
        ObjectMeta: metav1.ObjectMeta{
            Name:      *name,
            Namespace: *namespace,
        },
        Spec: v1.ServiceSpec{
            ClusterIP: v1.ClusterIPNone,
            Type:      v1.ServiceTypeClusterIP,
        },
    }
    _, err = clientset.CoreV1().Services(*namespace).Create(context.TODO(), svc, metav1.CreateOptions{})
    if err != nil {
        panic(err.Error())
    }
    defer clientset.CoreV1().Services(*namespace).Delete(context.TODO(), svc.Name, metav1.DeleteOptions{})

    // generate a base endpoint slice object
    baseEp := netutils.BigForIP(net.ParseIP("172.16.0.1"))
    epPort := 8080
    epBase := &discovery.EndpointSlice{
        ObjectMeta: metav1.ObjectMeta{
            Name:      *name + "-wxyz",
            Namespace: *namespace,
            Labels: map[string]string{
                discovery.LabelServiceName: *name,
            },
        },
        AddressType: discovery.AddressTypeIPv4,
        Endpoints:   []discovery.Endpoint{},
        Ports: []discovery.EndpointPort{{
            Name:     ptr.To(fmt.Sprintf("%d", epPort)),
            Port:     ptr.To(int32(8080)),
            Protocol: ptr.To(v1.ProtocolTCP),
        }},
    }
    chunkSize := 1000
    for i := 0; i < *endpoints; {
        eps := epBase.DeepCopy()
        eps.Name = epBase.Name + fmt.Sprintf("%d", i/chunkSize)
        n := min(chunkSize, *endpoints-i)
        for j := 0; j < n; j++ {
            ipEp := netutils.AddIPOffset(baseEp, i+j)
            eps.Endpoints = append(eps.Endpoints, discovery.Endpoint{
                Addresses: []string{ipEp.String()},
                Hostname:  ptr.To(fmt.Sprintf("pod%d", i+j)),
            })
        }
        i = i + n
        _, err = clientset.DiscoveryV1().EndpointSlices(*namespace).Create(context.TODO(), eps, metav1.CreateOptions{})
        if err != nil {
            panic(err.Error())
        }
        fmt.Printf("Created slice %s with %d endpoints\n", eps.Name, n)
        defer clientset.DiscoveryV1().EndpointSlices(*namespace).Delete(context.TODO(), eps.Name, metav1.DeleteOptions{})
    }

    // 4000 is a safe value, the limit of A records in a single answer is a bit larger
    maxN := min(*endpoints, 4000)
    // wait until DNS server is updated
    now := time.Now()
    nrecords := 0
    for {
        n, _ := lookUp(*dnsserver, "tcp", *name+"."+*namespace+".svc.cluster.local.")
        nrecords = len(n)
        if len(n) >= maxN {
            break
        }
        fmt.Printf("expected %d IP for svc got %d\n", maxN, len(n))
        time.Sleep(5 * time.Second)
    }
    fmt.Printf("Got %d records for svc on %v\n", nrecords, time.Since(now))

    // get full record stats
    iterations := 100
    latencies := make([]int64, iterations)
    diffs := make([]int, iterations)
    oldips := netutils.IPSet{}
    for i := 0; i < iterations; i++ {
        n, t := lookUp(*dnsserver, "tcp", *name+"."+*namespace+".svc.cluster.local.")
        if len(n) != maxN {
            fmt.Printf("[stats] expected %d IP for svc got %d\n", maxN, len(n))
            continue
        }
        latencies[i] = t.Microseconds()
        ips := netutils.IPSet{}
        for _, ip := range n {
            ips.Insert(ip.IP)
        }
        if i != 0 {
            diffs[i] = oldips.Difference(ips).Len()
        }
        oldips = ips
    }
    sort.Slice(latencies, func(i, j int) bool { return latencies[i] < latencies[j] })
    sort.Ints(diffs)
    fmt.Println("Stats:")
    fmt.Printf("Latencies (us) 50 percentile: %d\n", latencies[len(latencies)/2])
    fmt.Printf("Different records 50 percentile: %d\n", diffs[len(diffs)/2])
    fmt.Printf("Total %d records\n", maxN)

    // get per endpoint stats
    latencies = make([]int64, *endpoints)
    for j := 0; j < *endpoints; j++ {
        n, t := lookUp(*dnsserver, "udp", fmt.Sprintf("pod%d", j)+"."+*name+"."+*namespace+".svc.cluster.local.")
        if len(n) != 1 {
            fmt.Printf("expected 1 IP for pod%d got %d\n", j, len(n))
            continue
        }
        latencies[j] = t.Microseconds()
    }
    sort.Slice(latencies, func(i, j int) bool { return latencies[i] < latencies[j] })
    fmt.Println("Stats:")
    fmt.Printf("Latency (us) 50 percentile: %d\n", latencies[len(latencies)/2])
    fmt.Printf("Total %d records\n", *endpoints)
}

func lookUp(dnsserver, proto, host string) (ips []net.IPAddr, t time.Duration) {
    start := time.Now()
    resolver := net.Resolver{
        PreferGo: true,
        Dial: func(ctx context.Context, network, address string) (net.Conn, error) {
            d := net.Dialer{}
            return d.DialContext(ctx, proto, dnsserver+":53")
        },
    }
    n, err := resolver.LookupIPAddr(context.TODO(), host)
    if err != nil {
        fmt.Fprintf(os.Stderr, "Could not get IPs for host %s on server %s: %v\n", host, dnsserver, err)
        return
    }
    ips = n
    t = time.Since(start)
    return
}
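
And the C program (getaddrinfo.c) used to exercise the glibc resolver:
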
#include <arpa/inet.h>
#include <netdb.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

#define BUF_SIZE 500

int
main(int argc, char *argv[])
{
    int sfd, s;
    char buf[BUF_SIZE];
    size_t len;
    ssize_t nread;
    struct addrinfo hints;
    struct addrinfo *result, *rp;

    if (argc != 2) {
        fprintf(stderr, "Usage: %s host\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    /* Obtain address(es) matching host/port. */
    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;     /* Allow IPv4 or IPv6 */
    hints.ai_socktype = SOCK_STREAM; /* Stream socket */
    hints.ai_flags = 0;
    hints.ai_protocol = 0;           /* Any protocol */

    s = getaddrinfo(argv[1], "http", &hints, &result);
    if (s != 0) {
        fprintf(stderr, "getaddrinfo: %s error: %d\n", gai_strerror(s), s);
        exit(EXIT_FAILURE);
    }

    /* getaddrinfo() returns a list of address structures.
       Print every address returned instead of connecting to it. */
    for (rp = result; rp != NULL; rp = rp->ai_next) {
        struct sockaddr_in *addr;
        addr = (struct sockaddr_in *)rp->ai_addr;
        printf("Result %s\n", inet_ntoa((struct in_addr)addr->sin_addr));
    }

    freeaddrinfo(result); /* No longer needed */
    exit(EXIT_SUCCESS);
}